Detect Stalled Training Job And Actions

Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule


This notebook shows you how to use the StalledTrainingRule built-in rule. When the rule detects that your training job has been inactive for a certain time period, it can stop the training job or send you an email or SMS notification. This functionality helps you monitor the training job status and reduces wasted resource usage.

How the StalledTrainingRule Built-in Rule Works

Amazon SageMaker Debugger captures tensors that you want to watch from training jobs on AWS Deep Learning Containers or your local machine. If you use one of the Debugger-integrated Deep Learning Containers, you don't need to make any changes to your training script to use the built-in rules. For information about Debugger-supported SageMaker frameworks and versions, see Debugger-supported framework versions for zero script change.

If you want to run a training script that uses a framework Debugger only partially supports, or your own custom container, you need to manually register the Debugger hook in your training script. The smdebug library provides tools to help with hook registration, and the sample script provided in the src folder includes the hook registration code as comment lines. For more information about how to manually register the Debugger hooks in this case, see the training script at ./src/simple_stalled_training.py, and the documentation at smdebug TensorFlow hook, smdebug PyTorch hook, smdebug MXNet hook, and smdebug XGBoost hook.

The Debugger StalledTrainingRule watches tensor updates from your training job. If the rule finds no new tensors written to the default S3 URI for a threshold period of time, it triggers the StopTrainingJob API operation. The following code cells set up a SageMaker TensorFlow estimator with the Debugger StalledTrainingRule to watch the losses pre-built tensor collection.
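The stall check described above amounts to comparing the time since the last tensor write against a threshold. The following is a minimal sketch of that idea only, not the rule's actual implementation. The function name `is_stalled`, its parameters, and the 120-second threshold are hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(last_write, now, threshold_seconds=120):
    """Return True if no tensor was written for longer than the threshold.

    Illustrative only: `last_write` stands in for the timestamp of the most
    recent tensor file seen under the S3 output path.
    """
    return (now - last_write) > timedelta(seconds=threshold_seconds)


# Fixed reference time so the example is deterministic.
now = datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

print(is_stalled(now - timedelta(seconds=30), now))   # a write 30 seconds ago
print(is_stalled(now - timedelta(seconds=300), now))  # no write for 5 minutes
```

In this sketch a job counts as stalled only when the gap strictly exceeds the threshold, so a write exactly 120 seconds ago does not yet trigger an action.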

Install custom packages

These packages were built manually with the changes needed to run rules with actions, because those changes have not been released yet. Remember to restart the kernel after installing these packages.

[ ]

Import SageMaker Python SDK

[ ]

Import SageMaker Debugger classes for rule configuration

[ ]

Create the actions to be used in the rules

The following code cells include:

  • a code line to create the action objects
  • a stalled training job rule configuration object that uses these actions
  • a SageMaker TensorFlow estimator configuration with the Debugger rules parameter to run the built-in rule

Valid action objects are individual actions (StopTraining, Email, SMS) or an ActionList with a combination of these.

Note: Debugger collects loss tensors by default every 500 steps.

[ ]
[ ]
[ ]
[ ]
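The three pieces listed above (actions, rule configuration, estimator) could be assembled roughly as follows. This is a sketch under stated assumptions rather than the notebook's original cells. The helper names, the 120-second threshold, the instance type, and the framework and Python versions are illustrative choices. The `training_job_name_prefix` rule parameter and the `actions` argument reflect the SageMaker Python SDK as I understand it, so check them against the SDK version you have installed.

```python
def stalled_rule_parameters(job_name_prefix, threshold_seconds=120):
    """Build the rule_parameters dict; values are passed as strings."""
    return {
        "threshold": str(threshold_seconds),
        "training_job_name_prefix": job_name_prefix,
    }


def build_estimator(job_name_prefix, role, email, phone):
    """Assemble actions, the rule, and a TensorFlow estimator.

    Requires the SageMaker Python SDK. The imports are deferred so that
    the helper above can be used without the SDK installed.
    """
    from sagemaker.debugger import CollectionConfig, Rule, rule_configs
    from sagemaker.tensorflow import TensorFlow

    # An ActionList combines the individual actions.
    actions = rule_configs.ActionList(
        rule_configs.StopTraining(),
        rule_configs.Email(email),
        rule_configs.SMS(phone),
    )

    rule = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters=stalled_rule_parameters(job_name_prefix),
        collections_to_save=[
            # Save the losses collection every 500 steps.
            CollectionConfig(name="losses", parameters={"save_interval": "500"})
        ],
        actions=actions,
    )

    return TensorFlow(
        role=role,
        base_job_name=job_name_prefix,
        instance_count=1,
        instance_type="ml.m5.4xlarge",  # assumption: any supported type works
        entry_point="simple_stalled_training.py",
        source_dir="src",
        framework_version="2.3.1",  # assumption: pick a Debugger-supported version
        py_version="py37",  # assumption
        rules=[rule],
    )


print(stalled_rule_parameters("smdebug-stalled-demo"))
```

Calling `build_estimator(...)` followed by `estimator.fit(wait=False)` would then start the job in the background. Neither call is made here because both need AWS credentials and an execution role.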

Monitoring Training and Rule Evaluation Status

Once you call estimator.fit(), SageMaker starts a training job in the background, and Debugger starts a StalledTrainingRule evaluation job in parallel. Because the training script ends with a few lines of code that force it to sleep for 10 minutes, the RuleEvaluationStatus for StalledTrainingRule changes to IssuesFound about 2 minutes after the sleep begins, which triggers the StopTrainingJob API.
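The forced sleep at the end of the training script can be pictured with a sketch like the one below. It is not the content of ./src/simple_stalled_training.py. The function name `force_stall` and the injectable `sleep` argument are hypothetical, added so the example can be dry-run without waiting.

```python
import time


def force_stall(minutes=10, sleep=time.sleep):
    """Idle the process so that no new tensors are written."""
    seconds = minutes * 60
    print(f"Sleeping for {seconds} seconds to simulate a stalled job")
    sleep(seconds)
    return seconds


# Dry run with a no-op sleep; a real script would call force_stall()
# with the default time.sleep and block for the full 10 minutes.
stalled_for = force_stall(minutes=10, sleep=lambda seconds: None)
```

With a 2-minute rule threshold, the rule would be expected to fire well before the 10-minute sleep finishes, which is what lets Debugger stop the job early.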

Print the training job name

The following cell prints the name and current status of the training job running in the background.

[ ]
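One way to read the job status is through the `DescribeTrainingJob` API. The sketch below assumes a client that behaves like a boto3 SageMaker client. `describe_job` and the stub class are hypothetical names, and the stub exists only so the example runs without AWS access.

```python
def describe_job(client, job_name):
    """Return (TrainingJobStatus, SecondaryStatus) for a training job."""
    description = client.describe_training_job(TrainingJobName=job_name)
    return description["TrainingJobStatus"], description["SecondaryStatus"]


class _StubClient:
    """Stand-in for a boto3 SageMaker client, used only for this demo."""

    def describe_training_job(self, TrainingJobName):
        return {"TrainingJobStatus": "InProgress", "SecondaryStatus": "Training"}


# With a fitted estimator you would instead use (not run here):
#   job_name = estimator.latest_training_job.name
#   client = estimator.sagemaker_session.sagemaker_client
status = describe_job(_StubClient(), "demo-job")
print("demo-job", status)
```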

Output the current job status and the rule evaluation status

The following cell tracks the status of the training job until the SecondaryStatus changes to Stopped or Completed. While training, Debugger collects output tensors from the training job and monitors the training job with the rules.

[ ]
[ ]
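A polling loop of the kind described above might look like the following sketch. It assumes a boto3-style SageMaker client and reads the `SecondaryStatus` and `DebugRuleEvaluationStatuses` fields of the `DescribeTrainingJob` response. The function name, the polling interval, and the stub client are hypothetical. Treating `Failed` as a terminal state is an addition made here so the loop cannot spin forever.

```python
import time

TERMINAL_STATES = {"Stopped", "Completed", "Failed"}


def wait_for_job(client, job_name, poll_seconds=15, sleep=time.sleep):
    """Poll until the job reaches a terminal SecondaryStatus.

    Returns the final status and a {rule name: evaluation status} dict.
    """
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        secondary = description["SecondaryStatus"]
        rules = {
            r["RuleConfigurationName"]: r["RuleEvaluationStatus"]
            for r in description.get("DebugRuleEvaluationStatuses", [])
        }
        print(f"{job_name}: {secondary}, rules: {rules}")
        if secondary in TERMINAL_STATES:
            return secondary, rules
        sleep(poll_seconds)


class _FakeClient:
    """Stand-in client that replays canned responses for this demo."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def describe_training_job(self, TrainingJobName):
        return next(self._responses)


def _response(secondary, rule_status):
    return {
        "SecondaryStatus": secondary,
        "DebugRuleEvaluationStatuses": [
            {
                "RuleConfigurationName": "StalledTrainingRule",
                "RuleEvaluationStatus": rule_status,
            }
        ],
    }


demo = _FakeClient(
    [_response("Training", "InProgress"), _response("Stopped", "IssuesFound")]
)
result = wait_for_job(demo, "demo-job", sleep=lambda seconds: None)
```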

Get a direct Amazon CloudWatch URL to find the current rule processing job log

The following script returns a CloudWatch URL. Copy the URL and paste it into a browser to go directly to the rule job log page.

[ ]
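A sketch of how such a URL could be built is shown below. Rule evaluation jobs run as SageMaker processing jobs, so their logs appear under the `/aws/sagemaker/ProcessingJobs` log group. The rule job naming convention used here (truncated training job name, truncated rule name, and the last 8 characters of the rule job ARN) is an assumption, as are the helper names and the sample ARN.

```python
def rule_job_name(training_job_name, rule_configuration_name, rule_job_arn):
    """Derive the rule processing job name (assumed naming convention)."""
    return "{}-{}-{}".format(
        training_job_name[:26], rule_configuration_name[:26], rule_job_arn[-8:]
    )


def cloudwatch_url(rule_job, region):
    """Build a CloudWatch console URL filtered to the rule job's log streams."""
    return (
        "https://{r}.console.aws.amazon.com/cloudwatch/home?region={r}"
        "#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={p};"
        "streamFilter=typeLogStreamPrefix"
    ).format(r=region, p=rule_job)


# Sample values for illustration only.
name = rule_job_name(
    "stalled-job-2021",
    "StalledTrainingRule",
    "arn:aws:sagemaker:us-west-2:123456789012:processing-job/example-1a2b3c4d",
)
url = cloudwatch_url(name, "us-west-2")
print(url)
```

With a real job, the rule name and ARN would come from the `RuleConfigurationName` and `RuleEvaluationJobArn` fields of each entry in `DebugRuleEvaluationStatuses`.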

Conclusion

This notebook showed how you can use the Debugger StalledTrainingRule built-in rule for your training job to take action on rule evaluation status changes. To find more information about Debugger, see Amazon SageMaker Debugger Developer Guide and the smdebug GitHub documentation.
