NYC Taxi Data Regression Model Building
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
NYC Taxi Data Regression Model
This is an Azure Machine Learning Pipelines version of the two-part tutorial (Part 1, Part 2) available for Azure Machine Learning.
You can combine the two parts into one by using Azure Machine Learning Pipelines, because Pipelines provide a way to stitch together the various steps involved in a machine learning workflow (data preparation and training, in this case).
In this notebook, you learn how to prepare data for regression modeling by using the open-source library pandas. You run various transformations to filter and combine two different NYC taxi datasets. Once the NYC taxi data is prepared for regression modeling, you will use the AutoMLStep available with Azure Machine Learning Pipelines to define your machine learning goals and constraints and to launch the automated machine learning process. Automated machine learning iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criteria.
After you finish building the model, you can predict the cost of a taxi trip based on data features. These features include the pickup day and time, the number of passengers, and the pickup location.
Prerequisite
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first if you haven't. This sets you up with a working config file that has information on your workspace, subscription id, etc.
Prepare data for regression modeling
First, we will prepare the data for regression modeling. We will leverage the convenience of Azure Open Datasets along with the power of the Azure Machine Learning service to create a regression model that predicts NYC taxi fare prices. Run `pip install azureml-opendatasets` to get the Open Datasets package. The package contains a class representing each data source (`NycTlcGreen` and `NycTlcYellow`) to easily filter date parameters before downloading.
Load data
Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid MemoryError with large datasets. To download a year of taxi data, iteratively fetch one month at a time, and before appending it to green_df_raw, randomly sample 500 records from each month to avoid bloating the dataframe. Then preview the data. To keep this process short, we are sampling data of only 1 month.
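The month-by-month loop described above might look like the following sketch, assuming the `azureml-opendatasets` package is installed. The date range and month count are illustrative; extend `number_of_months` to fetch a full year.

```python
from datetime import datetime

import pandas as pd
from dateutil.relativedelta import relativedelta
from azureml.opendatasets import NycTlcGreen

start = datetime.strptime("1/1/2016", "%m/%d/%Y")
end = datetime.strptime("1/31/2016", "%m/%d/%Y")

green_df_raw = pd.DataFrame([])
number_of_months = 1  # increase to fetch more months
sample_size = 500     # records sampled from each month

for sample_month in range(number_of_months):
    temp_df_green = NycTlcGreen(
        start + relativedelta(months=sample_month),
        end + relativedelta(months=sample_month),
    ).to_pandas_dataframe()
    # Sample before appending to avoid bloating the dataframe.
    green_df_raw = pd.concat([green_df_raw, temp_df_green.sample(sample_size)])

green_df_raw.head()
```

The same pattern applies to `NycTlcYellow` for the yellow taxi data.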
Note: Open Datasets has mirroring classes for working in Spark environments where data size and memory aren't a concern.
See the data
Download data locally and then upload to Azure Blob
This is a one-time process to save the data in the default datastore.
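A minimal sketch of this one-time upload, assuming a workspace config file from the prerequisite step and the `green_df_raw`/`yellow_df_raw` dataframes from the previous cells; the local `data` folder and `nyctaxi` target path are illustrative.

```python
import os

from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Save the sampled dataframes locally once, then upload to the default datastore.
os.makedirs("data", exist_ok=True)
green_df_raw.to_csv("data/green_taxi.csv", index=False)
yellow_df_raw.to_csv("data/yellow_taxi.csv", index=False)

datastore.upload(src_dir="data", target_path="nyctaxi", overwrite=True)
```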
Create and register datasets
By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. You can learn more about what subsetting capabilities are supported by referring to our documentation. The data remains in its existing location, so no extra storage cost is incurred.
Register the taxi datasets with the workspace so that you can reuse them in other experiments or share with your colleagues who have access to your workspace.
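Creation and registration might look like this sketch, assuming the `ws` and `datastore` objects and the upload paths from the previous step; the dataset names are illustrative.

```python
from azureml.core import Dataset

# Reference the uploaded CSVs on the datastore (paths assumed from the upload step).
green_taxi_data = Dataset.Tabular.from_delimited_files(
    path=(datastore, "nyctaxi/green_taxi.csv"))
yellow_taxi_data = Dataset.Tabular.from_delimited_files(
    path=(datastore, "nyctaxi/yellow_taxi.csv"))

# Register so the datasets can be reused in other experiments or shared.
green_taxi_data = green_taxi_data.register(
    workspace=ws, name="green_taxi_data", create_new_version=True)
yellow_taxi_data = yellow_taxi_data.register(
    workspace=ws, name="yellow_taxi_data", create_new_version=True)
```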
Setup Compute
Create new or use an existing compute
Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.
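The get-or-create pattern might look like this sketch; the cluster name, VM size, and node count are illustrative and assume you have permission to create compute.

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster"  # illustrative name

try:
    # Reuse the cluster if it already exists in the workspace.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D2_V2", max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```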
Define RunConfig for the compute
The pipeline steps will also use pandas, scikit-learn, automl, and pyarrow, so we define a runconfig that includes these dependencies.
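A sketch of such a runconfig, assuming the `compute_target` object from the previous step; the package split between conda and pip is a reasonable choice, not a requirement.

```python
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

aml_run_config = RunConfiguration()
aml_run_config.target = compute_target

# Declare the packages each pipeline step needs at runtime.
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=["pandas", "scikit-learn", "pyarrow"],
    pip_packages=["azureml-sdk[automl]"],
)
```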
Prepare data
Now we will prepare for regression modeling by using pandas. We run various transformations to filter and combine two different NYC taxi datasets.
We achieve this by creating a separate step for each transformation as this allows us to reuse the steps and saves us from running all over again in case of any change. We will keep data preparation scripts in one subfolder and training scripts in another.
The best practice is to use separate folders for the scripts and their dependent files for each step, and to specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specified folder is snapshotted). Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping each step's folder minimal helps preserve step reuse when nothing in that step's `source_directory` has changed.
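One prep step might be defined like this sketch, assuming the `datastore`, `compute_target`, and `aml_run_config` objects from earlier cells; the script name, folder, and argument names are illustrative.

```python
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Intermediate output that downstream steps can consume.
cleansed_green_data = PipelineData("cleansed_green_data", datastore=datastore)

cleanse_green_step = PythonScriptStep(
    name="Cleanse Green Taxi Data",
    script_name="cleanse.py",
    source_directory="scripts/prepdata",  # only this folder is snapshotted
    arguments=["--output_cleanse", cleansed_green_data],
    outputs=[cleansed_green_data],
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=True,  # skip re-running when script and inputs are unchanged
)
```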
Define Useful Columns
Here we are defining a set of "useful" columns for both Green and Yellow taxi data.
Cleanse Green taxi data
Cleanse Yellow taxi data
Merge cleansed Green and Yellow datasets
We are creating a single data source by merging the cleansed versions of Green and Yellow taxi data.
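Inside the merge step's script, the merge itself is a simple pandas concatenation. The tiny frames below are illustrative stand-ins for the cleansed Green and Yellow outputs, which share the useful columns defined earlier.

```python
import pandas as pd

# Stand-ins for the cleansed Green and Yellow outputs.
green = pd.DataFrame({"cost": [6.5], "passengers": [1]})
yellow = pd.DataFrame({"cost": [9.0], "passengers": [2]})

# Stack the two sources into a single dataframe.
combined_df = pd.concat([green, yellow], ignore_index=True)
```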
Filter data
This step filters out coordinates for locations that are outside the city border. We use a TypeConverter object to change the latitude and longitude fields to decimal type.
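In plain pandas, the same conversion and filter can be sketched as follows; the bounding-box values are illustrative, not an official city border.

```python
import pandas as pd

# Coordinates arrive as strings; convert to decimal type first.
df = pd.DataFrame({
    "pickup_latitude": ["40.71", "0.00"],
    "pickup_longitude": ["-74.00", "0.00"],
})
df = df.astype({"pickup_latitude": float, "pickup_longitude": float})

# Keep only points inside an approximate NYC bounding box.
in_city = (df["pickup_latitude"].between(40.53, 40.88)
           & df["pickup_longitude"].between(-74.09, -73.72))
filtered_df = df[in_city]
```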
Normalize data
In this step, we split the pickup and dropoff datetime values into the respective date and time columns and then we rename the columns to use meaningful names.
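The split-and-rename can be sketched in pandas like this; the single-row frame and the Green-taxi column name `lpep_pickup_datetime` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"lpep_pickup_datetime": pd.to_datetime(["2016-01-04 08:30:00"])})

# Split the datetime into separate date and time columns.
df["pickup_date"] = df["lpep_pickup_datetime"].dt.date.astype(str)
df["pickup_time"] = df["lpep_pickup_datetime"].dt.time.astype(str)

# Rename to a meaningful, source-agnostic column name.
df = df.rename(columns={"lpep_pickup_datetime": "pickup_datetime"})
```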
Transform data
Transform the normalized taxi data to the final required format. This step does the following:
- Split the pickup and dropoff date further into the day of the week, day of the month, and month values.
- After new features are generated, use the drop_columns() function to delete the original fields as the newly generated features are preferred.
- Rename the rest of the fields to use meaningful descriptions.
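The feature derivation above can be sketched in pandas as follows; the single-row frame and column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"pickup_date": ["2016-01-04"]})
dates = pd.to_datetime(df["pickup_date"])

# Derive day-of-week, day-of-month, and month features.
df["pickup_weekday"] = dates.dt.dayofweek  # Monday == 0
df["pickup_monthday"] = dates.dt.day
df["pickup_month"] = dates.dt.month

# Drop the original field now that the derived features are preferred.
df = df.drop(columns=["pickup_date"])
```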
Split the data into train and test sets
This function splits the data into a dataset for model training and a dataset for testing.
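One common way to implement such a split is scikit-learn's `train_test_split`; the frame and the 80/20 ratio below are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame standing in for the transformed taxi data.
df = pd.DataFrame({"passengers": range(10), "cost": range(10)})

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```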
Use automated machine learning to build regression model
Now we will use automated machine learning to build the regression model. We will use AutoMLStep in AML Pipelines for this part. Run `pip install azureml-sdk[automl]` to get the automated machine learning package. These functions use various features from the dataset and allow an automated model to build relationships between the features and the price of a taxi trip.
Automatically train a model
Create experiment
Define settings for autogeneration and tuning
Here we define the experiment parameters and model settings for autogeneration and tuning. The `automl_settings` can also be passed as `**kwargs`.
Use your defined training settings as a parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case.
Note: When using AmlCompute, we can't pass NumPy arrays directly to the fit method.
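The settings and config might be sketched as follows; the metric, iteration counts, label column name, and the `train_ds` training dataset are illustrative assumptions.

```python
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 30,
    "primary_metric": "spearman_correlation",
    "n_cross_validations": 5,
}

automl_config = AutoMLConfig(
    task="regression",                # the type of model to build
    debug_log="automl_errors.log",
    training_data=train_ds,           # assumed: the registered training split
    label_column_name="cost",         # assumed label column
    compute_target=compute_target,
    **automl_settings,
)
```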
Define AutoMLStep
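The step wraps the config from the previous cell; the step name below is illustrative.

```python
from azureml.pipeline.steps import AutoMLStep

train_step = AutoMLStep(
    name="AutoML_Regression",
    automl_config=automl_config,
    allow_reuse=True,
)
```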
Build and run the pipeline
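Assembling and submitting the pipeline might look like this sketch; the experiment name is illustrative, and the step list would include all the prep steps defined above, not just the two shown.

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(
    workspace=ws,
    steps=[cleanse_green_step, train_step],  # plus the other prep steps
)

experiment = Experiment(ws, "NYCTaxi_Tutorial_Pipelines")
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```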
Explore the results
View cleansed taxi data
View the combined taxi data profile
View the filtered taxi data profile
View normalized taxi data
View transformed taxi data
View training data used by AutoML
View the details of the AutoML run
Retrieve the best model
Uncomment the below cell to retrieve the best model
Test the model
Get test data
Uncomment the below cell to get test data
Test the best fitted model
Uncomment the below cell to test the best fitted model