This is the Day 9 article of the mixi Group Advent Calendar 2019.

Overview

(Three-line summary)

I want to monitor the accuracy of machine learning training jobs and compare models easily.
Amazon SageMaker provides charts based on CloudWatch metrics, but they are hard to read ...
Collecting the metrics with the SageMaker SDK or with your own code, and storing the metric data and graphs in the same place (S3) as the model, worked well.

Challenge :innocent:

Amazon SageMaker provides CloudWatch Metrics-based charts for monitoring training job metrics (https://aws.amazon.com/jp/blogs/news/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/), and they now also appear in the job details in the management console. They are easy to set up, but personally I find them hard to use for monitoring algorithm metrics because of the plot granularity (output frequency), the scales, and the unit notation.

The horizontal axis is fixed to time
You cannot draw in epoch units
The time-axis graph is distorted by factors such as instance performance
You can manually narrow the time range to focus on a section, but the result is inaccurate
Plots are at least 1 minute apart, so the drawing is coarse
The vertical axis is scaled around the measured values of each metric, so the scales are not aligned between graphs
It is difficult to compare jobs visually
The operation feels clunky ...

Action 1: Use the SageMaker SDK API (Python) :relaxed:

Use the SageMaker SDK's TrainingJobAnalytics (https://sagemaker.readthedocs.io/en/stable/analytics.html#sagemaker.analytics.TrainingJobAnalytics) to fetch the data and control the drawing yourself. The data source is still CloudWatch, so the problem is not fundamentally solved, but **readability can be significantly improved**.

You can draw the graph in a Jupyter Notebook during training, or draw it from the code that calls the Estimator, either periodically or when training ends, and save it wherever you like.

`analytics.py`



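import matplotlib.pyplot as plt
from sagemaker.analytics import TrainingJobAnalytics

metric_names = ['train:loss', 'validation:loss']

# Fetch the metrics recorded for the training job as a DataFrame
metrics_dataframe = TrainingJobAnalytics(
    training_job_name=training_job_name,
    metric_names=metric_names,
    period=60,  # 1 minute is the minimum
).dataframe()

# Format the DataFrame (elided); the result is assumed to be metrics_dataframe_fixed
...

# DataFrame.plot returns a matplotlib Axes, so keep it in `ax` rather than `plt`
ax = metrics_dataframe_fixed.plot(
    kind='line',
    figsize=(20, 15),
    fontsize=18,
    x='timestamp',
    y=metric_names,
    xlim=[0, 2000],
    ylim=[0.1, 0.5],
    style=['b.-', 'r+-'],
    rot=45,
)
ax.figure.savefig('metrics_training_job_xxx.png')
plt.close(ax.figure)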
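
The formatting step is elided in the code above. As a rough sketch of what it might involve: `TrainingJobAnalytics.dataframe()` returns long-format rows with `timestamp`, `metric_name`, and `value` columns, while the plot call expects one column per metric. A pivot is one way to bridge the two. The helper name `to_wide` and the sample values are assumptions for illustration, not the author's code.

```python
import pandas as pd


def to_wide(metrics_long: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format metric rows into one column per metric name."""
    wide = metrics_long.pivot_table(
        index="timestamp",
        columns="metric_name",
        values="value",
        aggfunc="mean",  # average duplicates that share a timestamp
    )
    wide.columns.name = None  # drop the "metric_name" label on the column index
    return wide.reset_index()


# Made-up rows in the same shape as TrainingJobAnalytics(...).dataframe()
sample = pd.DataFrame({
    "timestamp": [0.0, 0.0, 60.0, 60.0],
    "metric_name": ["train:loss", "validation:loss", "train:loss", "validation:loss"],
    "value": [0.50, 0.55, 0.40, 0.47],
})
metrics_dataframe_fixed = to_wide(sample)
print(metrics_dataframe_fixed)
```

The resulting frame has a `timestamp` column followed by one column per metric, which is the shape the `plot` call with `x='timestamp'` and `y=metric_names` needs.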

What you can do

Convert the metric data to a DataFrame and adjust the drawing to your liking
You can specify the vertical and horizontal scales and the figure size
You can align the drawing start position
Everything is managed in code, so it is easy to tweak
You can take snapshots at fixed points during training and send periodic notifications
Storing the drawn graphs together with the model in its storage area (S3) helps with model evaluation

This method also works with SageMaker built-in algorithms.
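
Fixed scales and an aligned start position are what make jobs comparable at a glance. A minimal sketch of that idea, under assumptions: overlay the same metric from several jobs on one Axes with explicit limits. The function name `plot_jobs`, the job names, and the sample values are hypothetical; each entry is assumed to expose a `timestamp` series plus one series per metric (a dict of lists here, though a DataFrame indexes the same way).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt


def plot_jobs(job_frames, metric, xlim, ylim, out_path):
    """Overlay one metric from several jobs on a single Axes with fixed limits."""
    fig, ax = plt.subplots(figsize=(10, 6))
    for job_name, frame in job_frames.items():
        ax.plot(frame["timestamp"], frame[metric], marker=".", label=job_name)
    ax.set_xlim(*xlim)  # same horizontal range for every job
    ax.set_ylim(*ylim)  # same vertical scale for every job
    ax.set_xlabel("elapsed seconds")
    ax.set_ylabel(metric)
    ax.legend(loc="upper right")
    fig.savefig(out_path)
    plt.close(fig)
    return ax


# Made-up metric data for two hypothetical jobs
jobs = {
    "job-a": {"timestamp": [0, 60, 120], "validation:loss": [0.45, 0.38, 0.33]},
    "job-b": {"timestamp": [0, 60, 120], "validation:loss": [0.48, 0.36, 0.30]},
}
out_path = os.path.join(tempfile.gettempdir(), "compare_validation_loss.png")
ax = plot_jobs(jobs, "validation:loss", xlim=(0, 2000), ylim=(0.1, 0.5), out_path=out_path)
```

Keeping the limits as explicit arguments means every comparison image is drawn on the same scale, regardless of which jobs are included.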

What you cannot do

The data itself is unchanged (still CloudWatch-based)
The horizontal axis can only be time
The plot interval is unchanged (1 minute at the finest)

Action 2: Draw a graph in your own entry point or your own algorithm :blush:

SageMaker officially supports four ways of using it. When you use an ML framework container provided by Amazon, or [your own container](https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/your-algorithms.html), you can periodically draw graphs of the training progress with your own program code and send them to S3.

I feel a little guilty about using SageMaker without using the monitoring features it provides, but if I cannot get the output in the format I want, I have to produce it inside the container. (Since you write the entry point script or the ML algorithm yourself, adding graph drawing to code you already know is not much effort.)

What you can do

You can define the metric data freely
You can output it in epoch units, for example
You can draw it however you like
You can specify the vertical and horizontal scales and the figure size
You can align the drawing start position
Everything is managed in code, so it is easy to tweak
You can take snapshots at fixed points during training and send periodic notifications
Storing the drawn graphs together with the model in its storage area (S3) helps with model evaluation
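
As one illustration of defining the metric data yourself in epoch units, the sketch below writes a per-epoch history to a CSV file that can sit next to the graphs. It assumes the history is a dict of equal-length lists, which is the shape of `history.history` returned by Keras `model.fit`; the function name `save_history_csv`, the file name, and the values are hypothetical.

```python
import csv
import os
import tempfile


def save_history_csv(history, csv_path):
    """Write one row per epoch: the 1-based epoch number, then each metric."""
    names = sorted(history)
    n_epochs = len(history[names[0]])
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["epoch"] + names)
        for i in range(n_epochs):
            writer.writerow([i + 1] + [history[name][i] for name in names])
    return n_epochs


# Stand-in for `history.history` from Keras (values are made up)
history = {"loss": [0.9, 0.6, 0.4], "val_loss": [1.0, 0.7, 0.6]}
csv_path = os.path.join(tempfile.gettempdir(), "training_history.csv")
n_rows = save_history_csv(history, csv_path)

# Read the file back to confirm what was written
with open(csv_path, newline="") as f:
    rows = list(csv.reader(f))
```

Because the rows are keyed by epoch rather than by wall-clock time, jobs that ran on different instance types can still be compared point by point.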

What you cannot do

This method cannot be used with built-in algorithms, because you cannot modify their code

Process flow

Deciding where to send the graphs drawn inside the container, and how to share the graph destination (S3 path) between the inside and the outside of the container, is surprisingly tricky. The following approach is one example.

Define the ML model output destination for each training job on S3
Also define a conditions location under the same path and put a JSON file with information about the model there
Define a metrics location under the same path and use it to store the metric data and the drawn graphs
Pass the conditions path to the Estimator as `inputs` and start training
Run the training algorithm inside the SageMaker container
In the training job, read the JSON in conditions and build the model output destination path
Run training, output the progress log, and draw graphs
Upload the drawn graph data (snapshots) to the metrics location on S3
Process the graphs uploaded to S3 as you like
Monitor continuously, send notifications, and so on
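
The layout in the steps above can be sketched as a small helper that derives every location from the bucket and the training job name, so the caller and the container agree on the paths. This is only an illustration: the function name `job_paths`, the dictionary keys, and the example bucket and job names are hypothetical, while the `model/{training_job_name}/...` layout follows the article.

```python
def job_paths(bucket, training_job_name):
    """Build the per-job S3 locations for the model, conditions, and metrics."""
    base = "model/{}".format(training_job_name)
    conditions_key = "{}/conditions/training_job_config.json".format(base)
    metrics_prefix = "{}/metrics".format(base)
    return {
        # Object keys have no leading slash; URIs are what SageMaker inputs expect
        "model_uri": "s3://{}/{}".format(bucket, base),
        "conditions_key": conditions_key,
        "conditions_uri": "s3://{}/{}".format(bucket, conditions_key),
        "metrics_prefix": metrics_prefix,
        "metrics_uri": "s3://{}/{}".format(bucket, metrics_prefix),
    }


paths = job_paths("example-bucket", "job-2019-12-09")
for name, value in paths.items():
    print(name, value)
```

Keeping keys and URIs side by side avoids mixing them up: boto3 calls take the bucket and key separately, whereas the Estimator's `inputs` take full `s3://` URIs.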

Code example

`train_task.py`



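import json
import boto3

# Create the conditions file and record the training job name in it
dict_conditions = {"training_job_name": training_job_name}
# No leading '/' in the key, otherwise S3 creates an empty-named top-level prefix
s3_conditions_key = 'model/{}/conditions/training_job_config.json'.format(training_job_name)
boto3.resource('s3').Object(bucket, s3_conditions_key).put(Body=json.dumps(dict_conditions))

# Hand the conditions over to the SageMaker training job.
# `estimator` is assumed to be an already constructed Estimator instance;
# channel inputs are given as s3:// URIs.
s3_conditions_path = 's3://{}/{}'.format(bucket, s3_conditions_key)
estimator.fit(
    job_name=training_job_name,
    inputs={'train_data': s3_train_data_path, 'conditions': s3_conditions_path},
)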

`train_entrypoint.py`



import json
import boto3
import matplotlib
matplotlib.use('Agg')  # headless backend; the training container has no display
import matplotlib.pyplot as plt

# Get the training job name from the conditions passed by the caller of Estimator.fit
# (each key of the inputs dict becomes a directory under /opt/ml/input/data/ and the file is placed there)
input_conditions = '/opt/ml/input/data/conditions/training_job_config.json'
with open(input_conditions) as f:
    conditions = json.load(f)
training_job_name = conditions['training_job_name']

# Graph path definition
# (`metrics`, `output_path`, and `bucket` are assumed to be defined elsewhere in the script)
graph_name = 'training_history_{}.png'.format(metrics)
graph_outpath = '{}/{}'.format(output_path, graph_name)
s3_graph_outpath = 'model/{}/metrics/{}'.format(training_job_name, graph_name)

# Draw and save the graph (Keras example)
history = model.fit(...)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.legend(['training', 'validation'], loc='upper right')
plt.savefig(graph_outpath)
plt.clf()

# Upload the graph to the S3 bucket that stores the training job results (overwritten on each update)
boto3.resource('s3').Bucket(bucket).upload_file(graph_outpath, s3_graph_outpath)

As the code shows, the part that calls the SageMaker Estimator passes training_job_name in a JSON file, and from that shared information the metric output destination for each training job is built in a fixed format (s3://{bucket}/model/{training_job_name}/metrics/{graph_name}).

Digression: Draw with TensorBoard

In Action 2 you are free to write any code, so you can also output logs for TensorBoard, sync them to a designated S3 bucket, and draw them by pointing a TensorBoard instance, launched on a Notebook Instance for example, at the logs on S3.

`train_entrypoint_keras.py`



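import os
import boto3
from tensorflow import keras  # or `import keras`, depending on your environment

tensorboard_log_outpath = '{}/{}'.format(output_path, tensorboard_log_name)

tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir=tensorboard_log_outpath,
    histogram_freq=1)
callbacks = [tensorboard_callback]

model.fit(..., callbacks=callbacks)

# log_dir is a directory and upload_file handles one file at a time,
# so walk the directory and upload every log file under the S3 key prefix
s3_bucket = boto3.resource('s3').Bucket(bucket)
for root, _, files in os.walk(tensorboard_log_outpath):
    for name in files:
        local_path = os.path.join(root, name)
        rel_path = os.path.relpath(local_path, tensorboard_log_outpath)
        s3_bucket.upload_file(local_path, '{}/{}'.format(s3_tensorboard_log_outpath, rel_path))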

`notebook.py`


!tensorboard --logdir=s3://{bucket}/{s3_tensorboard_log_outpath}

You can also draw with other tools of your choice, but I think it is best to pick a well-balanced approach that takes the management cost into account.

Summary :blush:

As mentioned several times along the way, the options available to you depend on how you use SageMaker.

Built-in algorithm or Marketplace (https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/sagemaker-marketplace.html)
Use Action 1
You can fetch the metric data with TrainingJobAnalytics, draw it yourself, and improve readability
The data is constrained by the CloudWatch specification
Custom entry point or custom algorithm (https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/your-algorithms.html)
Use Action 1 or Action 2
You can collect metric data and draw graphs inside your own algorithm
You can integrate with other tools via S3 and so on

For both Action 1 and Action 2, I think management becomes easier if you **upload the metric data (DataFrame and logs) and the drawn graph images to the same S3 location as the model storage area**.

I want to organize the metrics under a common definition so that models can be compared and judged at a glance.

References

Easily monitor and visualize metrics during model training with Amazon SageMaker (https://aws.amazon.com/jp/blogs/news/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/)
Amazon SageMaker Developer Guide, Monitoring and Analyzing Training Jobs Using Metrics
Amazon SageMaker Developer Guide, Defining Metrics
SageMaker Python SDK : Analytics
Keras: Visualization of training history

[PYTHON] How to improve model metric monitoring in Amazon SageMaker
