
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.

NYC Taxi Data Regression Model

This is an Azure Machine Learning Pipelines version of the two-part tutorial (Part 1, Part 2) available for Azure Machine Learning.

AzureML Pipelines let you combine the two parts into a single workflow, because pipelines provide a way to stitch together the various steps involved in a machine learning workflow (data preparation and training, in this case).

In this notebook, you learn how to prepare data for regression modeling by using the open-source pandas library. You run various transformations to filter and combine two different NYC taxi datasets. Once the NYC taxi data is prepared for regression modeling, you use the AutoMLStep available with Azure Machine Learning Pipelines to define your machine learning goals and constraints and to launch the automated machine learning process. The automated machine learning technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criteria.

After you build the model, you can use it to predict the cost of a taxi trip from data features. These features include the pickup day and time, the number of passengers, and the pickup location.

Prerequisite

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, first go through the configuration notebook at https://github.com/Azure/MachineLearningNotebooks if you haven't already. It sets you up with a working config file that has information on your workspace, subscription ID, and so on.

Prepare data for regression modeling

First, we will prepare data for regression modeling. We will leverage the convenience of Azure Open Datasets along with the power of the Azure Machine Learning service to create a regression model to predict NYC taxi fare prices. Run pip install azureml-opendatasets to get the Open Datasets package. The package contains a class representing each data source (NycTlcGreen and NycTlcYellow) so you can easily filter on date parameters before downloading.

Load data

Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes, to avoid a MemoryError with large datasets. To download a year of taxi data, fetch one month at a time, and before appending it to green_df_raw, randomly sample 500 records from each month to avoid bloating the dataframe. Then preview the data. To keep this notebook short, we sample only one month of data.
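The month-by-month sampling loop can be sketched as follows. In the notebook, each month is fetched with the azureml-opendatasets NycTlcGreen class; here a hypothetical `fetch_month` generator stands in for it so the loop is self-contained, and the column names are illustrative.

```python
import pandas as pd
import numpy as np
from datetime import datetime

# Stand-in for NycTlcGreen(start, end).to_pandas_dataframe(); generates
# a synthetic month of taxi records with illustrative columns.
def fetch_month(start, end, n_records=2000):
    rng = np.random.default_rng(0)
    return pd.DataFrame({
        "lpepPickupDatetime": pd.date_range(start, end, periods=n_records),
        "passengerCount": rng.integers(1, 7, n_records),
        "totalAmount": rng.uniform(3.0, 60.0, n_records).round(2),
    })

green_df_raw = pd.DataFrame()
start = datetime(2016, 1, 1)
number_of_months = 1  # the notebook keeps this short; use 12 for a full year
for _ in range(number_of_months):
    end = start + pd.DateOffset(months=1) - pd.DateOffset(seconds=1)
    month_df = fetch_month(start, end)
    # Sample 500 records per month to keep the dataframe small.
    green_df_raw = pd.concat(
        [green_df_raw, month_df.sample(500, random_state=0)],
        ignore_index=True,
    )
    start = start + pd.DateOffset(months=1)

print(green_df_raw.shape)
```

With one month fetched, the resulting dataframe holds exactly the 500 sampled records.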

Note: Open Datasets has mirroring classes for working in Spark environments where data size and memory aren't a concern.

[ ]
[ ]
[ ]

See the data

[ ]

Download data locally and then upload to Azure Blob

This is a one-time process to save the data in the default datastore.

[ ]
[ ]

Create and register datasets

By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. You can learn more about which subsetting capabilities are supported in our documentation. The data remains in its existing location, so no extra storage cost is incurred.

[ ]

Register the taxi datasets with the workspace so that you can reuse them in other experiments or share with your colleagues who have access to your workspace.

[ ]

Set Up Compute

Create new or use an existing compute

Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

[ ]

Define RunConfig for the compute

The pipeline steps also use pandas, scikit-learn, AutoML, and pyarrow, so we define a RunConfiguration that includes those dependencies.

[ ]

Prepare data

Now we will prepare for regression modeling by using pandas. We run various transformations to filter and combine two different NYC taxi datasets.

We achieve this by creating a separate step for each transformation, which allows us to reuse the steps and saves us from rerunning everything when only one step changes. We will keep the data preparation scripts in one subfolder and the training scripts in another.

The best practice is to use a separate folder for each step's script and its dependent files, and to specify that folder as the source_directory for the step. This reduces the size of the snapshot created for the step (only the specific folder is snapshotted). Since any change to a file in the source_directory triggers a re-upload of the snapshot, keeping the folders separate preserves step reuse whenever nothing in a given step's source_directory has changed.

Define Useful Columns

Here we are defining a set of "useful" columns for both Green and Yellow taxi data.

[ ]
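The shared schema can be sketched as a plain list. The column names below follow the Azure tutorial's naming convention, but treat them as illustrative; the notebook's exact list may differ.

```python
# Illustrative set of "useful" columns shared by the Green and Yellow
# datasets after renaming (names assumed, not taken from the notebook cell).
useful_columns = [
    "cost", "distance", "dropoff_datetime", "dropoff_latitude",
    "dropoff_longitude", "passengers", "pickup_datetime",
    "pickup_latitude", "pickup_longitude", "store_forward", "vendor",
]
print(len(useful_columns))
```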

Cleanse Green taxi data

[ ]
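A minimal pandas sketch of the cleansing step, assuming the step maps the raw Green taxi (TLC) column names onto the shared "useful" schema and drops everything else. The raw records and rename map here are illustrative, not the notebook's exact script.

```python
import pandas as pd

# Hypothetical raw Green taxi records with original TLC column names.
green_df = pd.DataFrame({
    "vendorID": [2, 1],
    "lpepPickupDatetime": pd.to_datetime(["2016-01-04 08:30", "2016-01-05 17:10"]),
    "lpepDropoffDatetime": pd.to_datetime(["2016-01-04 08:45", "2016-01-05 17:40"]),
    "storeAndFwdFlag": ["N", "N"],
    "pickupLatitude": [40.71, 40.76],
    "pickupLongitude": [-73.96, -73.98],
    "dropoffLatitude": [40.73, 40.75],
    "dropoffLongitude": [-73.99, -73.97],
    "passengerCount": [1, 2],
    "tripDistance": [2.5, 4.1],
    "totalAmount": [12.5, 19.0],
    "extra": [0.5, 0.0],  # an example column the cleansing step drops
})

# Map raw names onto the shared schema, then keep only those columns.
rename_map = {
    "vendorID": "vendor",
    "lpepPickupDatetime": "pickup_datetime",
    "lpepDropoffDatetime": "dropoff_datetime",
    "storeAndFwdFlag": "store_forward",
    "pickupLatitude": "pickup_latitude",
    "pickupLongitude": "pickup_longitude",
    "dropoffLatitude": "dropoff_latitude",
    "dropoffLongitude": "dropoff_longitude",
    "passengerCount": "passengers",
    "tripDistance": "distance",
    "totalAmount": "cost",
}
green_df_clean = green_df.rename(columns=rename_map)[list(rename_map.values())]
print(list(green_df_clean.columns))
```

The Yellow taxi cleansing step follows the same pattern with the Yellow dataset's raw column names (tpepPickupDatetime, and so on).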

Cleanse Yellow taxi data

[ ]

Merge cleansed Green and Yellow datasets

We are creating a single data source by merging the cleansed versions of Green and Yellow taxi data.

[ ]
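Because the two cleansed frames share the same schema, the merge is a simple row-wise concatenation. A sketch with hypothetical two-column frames:

```python
import pandas as pd

# Hypothetical cleansed frames sharing the same schema.
green_df_clean = pd.DataFrame({"vendor": [1, 2], "cost": [12.5, 19.0]})
yellow_df_clean = pd.DataFrame({"vendor": [2, 1], "cost": [8.0, 22.5]})

# With identical columns, a row-wise concat yields the combined data source.
combined_df = pd.concat([green_df_clean, yellow_df_clean], ignore_index=True)
print(combined_df.shape)
```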

Filter data

This step filters out coordinates for locations that are outside the city border. We use a TypeConverter object to change the latitude and longitude fields to decimal type.

[ ]
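The filter step can be sketched in plain pandas: cast the coordinate strings to floats (the equivalent of the TypeConverter) and keep rows inside a bounding box. The boundary values below are an approximate NYC bounding box for illustration, not necessarily the notebook's exact constants.

```python
import pandas as pd

combined_df = pd.DataFrame({
    "pickup_latitude": ["40.71", "41.50"],    # strings, as read from CSV
    "pickup_longitude": ["-73.96", "-72.00"],
    "cost": [12.5, 9.0],
})

# Equivalent of the TypeConverter: cast coordinate strings to decimal type.
for col in ["pickup_latitude", "pickup_longitude"]:
    combined_df[col] = combined_df[col].astype(float)

# Approximate NYC bounding box (illustrative values).
in_city = (
    combined_df["pickup_latitude"].between(40.53, 40.88)
    & combined_df["pickup_longitude"].between(-74.09, -73.72)
)
filtered_df = combined_df[in_city]
print(len(filtered_df))
```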

Normalize data

In this step, we split the pickup and dropoff datetime values into the respective date and time columns and then we rename the columns to use meaningful names.

[ ]
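A sketch of the normalization, assuming the step splits each datetime into string-valued date and time columns and then drops the originals; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2016-01-04 08:30:00"]),
    "dropoff_datetime": pd.to_datetime(["2016-01-04 08:45:00"]),
})

# Split each datetime into separate date and time columns, then drop the originals.
df["pickup_date"] = df["pickup_datetime"].dt.date.astype(str)
df["pickup_time"] = df["pickup_datetime"].dt.time.astype(str)
df["dropoff_date"] = df["dropoff_datetime"].dt.date.astype(str)
df["dropoff_time"] = df["dropoff_datetime"].dt.time.astype(str)
normalized_df = df.drop(columns=["pickup_datetime", "dropoff_datetime"])
print(list(normalized_df.columns))
```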

Transform data

Transform the normalized taxi data to final required format. This steps does the following:

  • Split the pickup and dropoff date further into the day of the week, day of the month, and month values.
  • After new features are generated, use the drop_columns() function to delete the original fields as the newly generated features are preferred.
  • Rename the rest of the fields to use meaningful descriptions.
[ ]
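The feature-generation bullets above can be sketched with the pandas `.dt` accessor. The feature names here are assumptions for illustration; the notebook's script chooses its own names.

```python
import pandas as pd

normalized_df = pd.DataFrame({
    "pickup_date": ["2016-01-04"],
    "pickup_time": ["08:30:00"],
})

# Derive day-of-week, day-of-month, and month features from the date,
# then drop the original field in favor of the new features.
dates = pd.to_datetime(normalized_df["pickup_date"])
transformed_df = normalized_df.assign(
    pickup_weekday=dates.dt.dayofweek,   # Monday == 0
    pickup_monthday=dates.dt.day,
    pickup_month=dates.dt.month,
).drop(columns=["pickup_date"])
print(transformed_df.iloc[0].to_dict())
```

The dropoff date gets the same treatment in the notebook's transform script.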

Split the data into train and test sets

This function splits the data into a dataset for model training and a dataset for testing.

[ ]
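The split can be sketched with a simple pandas hold-out (the notebook's script may instead use scikit-learn's train_test_split; this pandas-only version is equivalent in spirit, with an assumed 80/20 ratio):

```python
import pandas as pd

# Hypothetical transformed data.
df = pd.DataFrame({"distance": range(10), "cost": range(10)})

# Hold out 20% of rows for testing; the remaining 80% trains the model.
train_df = df.sample(frac=0.8, random_state=223)
test_df = df.drop(train_df.index)
print(len(train_df), len(test_df))
```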

Use automated machine learning to build regression model

Now we will use automated machine learning to build the regression model, via the AutoMLStep in AML Pipelines. Run pip install azureml-sdk[automl] to get the automated machine learning package. These functions use various features from the dataset and allow an automated model to build relationships between the features and the price of a taxi trip.

Automatically train a model

Create experiment

[ ]

Define settings for autogeneration and tuning

Here we define the experiment parameters and model settings for autogeneration and tuning. We can also pass automl_settings as **kwargs.

Use your defined training settings as a parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case.

Note: When using AmlCompute, we can't pass Numpy arrays directly to the fit method.

[ ]
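A sketch of the kind of settings dictionary this cell builds. The keys are valid AutoML settings, but the values here are placeholders, not the notebook's; in the notebook the dictionary is unpacked into an AutoMLConfig (roughly `AutoMLConfig(task="regression", training_data=..., label_column_name=..., **automl_settings)`).

```python
# Illustrative AutoML settings (placeholder values, not the notebook's).
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 2,
    "primary_metric": "spearman_correlation",
    "n_cross_validations": 5,
}
print(sorted(automl_settings))
```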

Define AutoMLStep

[ ]

Build and run the pipeline

[ ]
[ ]

Explore the results

[ ]

View cleansed taxi data

[ ]

View the combined taxi data profile

[ ]

View the filtered taxi data profile

[ ]

View normalized taxi data

[ ]

View transformed taxi data

[ ]

View training data used by AutoML

[ ]

View the details of the AutoML run

[ ]

Retrieve the best model

Uncomment the cell below to retrieve the best model.

[ ]

Test the model

Get test data

Uncomment the cell below to get the test data.

[ ]

Test the best fitted model

Uncomment the cell below to test the best fitted model.

[ ]
[ ]