Notebooks
A
Azure
Tabular Timeseries Dataset Filtering

Tabular Timeseries Dataset Filtering

how-to-use-azuremlazure-mldata-sciencenotebookwork-with-datamachine-learningtimeseries-datasetsazure-machine-learningdeep-learningazuremlazure-ml-notebooksazuredatasets-tutorial

Tabular Time Series Related API Demo with NOAA Weather Data

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.

In this notebook, you will learn how to use the Tabular Time Series related API to filter the data by time windows for sample data uploaded to Azure blob storage.

The detailed APIs to be demoed in this script are:

  • Create Tabular Dataset instance
  • Assign timestamp column and partition timestamp column for Tabular Dataset to activate Time Series related APIs
  • Clear timestamp column and partition timestamp column
  • Filter in data before a specific time
  • Filter in data after a specific time
  • Filter in data in a specific time range
  • Filter in data for recent time range

Besides above APIs, you'll also see:

  • Create and load a Workspace
  • Load weather data into Azure blob storage
  • Create and register weather data as a Tabular dataset
  • Re-load Tabular Dataset from your Workspace

Import Dependencies

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, run the cells below to install the Azure Machine Learning Python SDK and create an Azure ML Workspace that's required for this demo.

Prepare Environment

Print out your version of the Azure ML Python SDK. Version 1.0.60 or above is required for TabularDataset with timeseries attribute.

[ ]

Import Packages

[ ]

Set up Configuraton and Create Azure ML Workspace

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook first if you haven't already to establish your connection to the Azure ML Workspace.

[ ]

Load Data to Blob Storage

This demo uses 2019 weather data under within weather-data folder. You can replace this data with your own.

Upload data to blob storage so it can be used as a Dataset.

[ ]

Create & Register Tabular Dataset with time-series trait from Blob

The API on Tabular datasets with time-series trait is specially designed to handle Tabular time-series data and time related operations more efficiently. By registering your time-series dataset, you are publishing your dataset to your workspace so that it is accessible to anyone with the same subscription id.

Create Tabular Dataset instance from blob storage datapath.

TIP: you can set virtual columns in the partition_format. I.e. if you partition the weather data by state and city, the path can be '/{STATE}/{CITY}/{partition_time:yyy/MM}/data.parquet'. STATE and CITY would then appear as virtual columns in the dataset, allowing for efficient filtering by these timestamps.

[ ]

Assign "datetime" column as timestamp and "partition_time" from folder path as partition_timestamp for Tabular Dataset to activate Time Series related APIs. The column to be assigned should be a Date type, otherwise the assigning will fail.

[ ]

Register the dataset for easy access from anywhere in Azure ML and to keep track of versions, lineage.

[ ]

Reload the Dataset from Workspace

[ ]

Filter Data by Time Windows

Once your data has been loaded into the notebook, you can query by time using the time_before(), time_after(), time_between(), and time_recent() functions.The filter is optimized to only load those data files within the partition_timestamp range when partition_timestamp is specified.

include_boundary is default to be true for all the time series related filters, please pass include_boundary=False to exclude boundary.

Before Time Input

[ ]

After Time Input

[ ]

Before & After Time Inputs

You can chain time functions together.

[ ]

Time Range Input

[ ]

Time Recent Input

This function takes in a datetime.timedelta and returns a dataset containing the data from datetime.now()-timedelta() to datetime.now().

NOTE: This will return an empty dataframe there is no data within the last 2 days.

[ ]

Drop and keep columns

You can also choose to drop or keep certain columns.

Drop Columns

If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset.

[ ]

The exception is expected because dataset loses timeseries capabilities to do time travel.

[ ]

Drop will return dataset with timeseries capabilities if modify column list to exclude timestamp columns.

[ ]
[ ]

Keep Columns

If a timeseries column is not included, the timeseries capabilities will be dropped for the returned dataset.

[ ]

The exception is expected because dataset loses timeseries capabilities to do time travel.

[ ]

Keep will return dataset with timeseries capabilities if modify column list to include timestamp columns.

[ ]
[ ]

Resetting Timestamp Columns

Rules for reseting are:

  • You cannot assign 'None' to timestamp while assign a valid column name to partition_timestamp because partition_timestamp is optional while timestamp is mandatory for Tabular time series data.
  • If you assign 'None' to timestamp, then both timestamp and partition_timestamp will all be cleared.
  • If you assign only 'None' to partition_timestamp, then only partition_timestamp will be cleared.
[ ]

Impressions