Review: AWS AI and Machine Learning stacks up – InfoWorld
Amazon Web Services claims to have the broadest and most complete set of machine learning capabilities. I honestly don’t know how the company can claim those superlatives with a straight face: Yes, the AWS machine learning offerings are broad and fairly complete and rather impressive, but so are those of Google Cloud and Microsoft Azure.
Amazon SageMaker Clarify is the new add-on to the Amazon SageMaker machine learning ecosystem for Responsible AI. SageMaker Clarify integrates with SageMaker at three points: in the new Data Wrangler to detect data biases at import time, such as imbalanced classes in the training set, in the Experiments tab of SageMaker Studio to detect biases in the model after training and to explain the importance of features, and in the SageMaker Model Monitor, to detect bias shifts in a deployed model over time.
Historically, AWS has presented its services as cloud-only. That is starting to change, at least for big enterprises that can afford to buy racks of proprietary appliances such as AWS Outposts. It’s also changing in AWS’s industrial offerings, such as Amazon Monitron and AWS Panorama, which include some edge devices.
AWS Machine Learning Services
When I reviewed Amazon SageMaker in 2018, I thought it was quite good and that it had “significantly improved the utility of AWS for data scientists.” Little did I know then how much traction it would get and how much it would expand in scope.
When I looked at SageMaker again in April 2020, it was in a preview phase with seven major improvements and expansions, and I said that it was “good enough to use for end-to-end machine learning and deep learning: data preparation, model training, model deployment, and model monitoring.” I also said that the user experience still needed a little work.
There are now twelve parts in Amazon SageMaker: Studio, Autopilot, Ground Truth, JumpStart, Data Wrangler, Feature Store, Clarify, Debugger, Model Monitor, Distributed Training, Pipelines, and Edge Manager. Several of the new SageMaker features, such as Data Wrangler, are major improvements.
Amazon SageMaker Studio
Amazon SageMaker Studio is an integrated machine learning environment where you can build, train, deploy, and analyze your models all in the same application. The IDE is based on JupiterLab, and now supports both Python and R natively in notebook kernels. It has specific support for seven frameworks: Apache MXNet, Apache Spark, Chainer, PyTorch, Scikit-learn, SparkML Serving, and TensorFlow.
SageMaker Studio seems to be a wrapper around SageMaker Notebooks with a few additional features, including SageMaker JumpStart and a different launcher. Both take you to JupyterLab notebooks for actual calculations.
I showed lots of notebook examples in my April 2020 review, but only for Python notebooks. Since then, there are more samples in the repository. Plus, the sample repository is easier to reach from a notebook, and there is now support for R kernels in the notebooks, as shown in the screenshot below. Unlike Microsoft Azure Machine Learning notebooks, SageMaker does not support RStudio.
Amazon SageMaker Autopilot
In the April review I showed a SageMaker Autopilot sample, which took four hours to run. Looking at another Autopilot sample in the repository, for customer churn prediction, I see that it has been improved by adding a model explainability section. This is a welcome addition, as explainability is one facet of Responsible AI, although not the whole story. (See Amazon SageMaker Clarify, below.)
According to the notes in the sample, the enabling improvement for this in the SageMaker Python SDK, introduced in June 2020, was to allow Autopilot-generated models to be configured to return probabilities of each inference. Unfortunately, that means you need to retrain any Autopilot models produced on previous versions of the SDK if you want to add explainability to them.
Amazon SageMaker Ground Truth
As I discussed in April 2020, SageMaker Ground Truth is a semi-supervised learning process for data labeling that combines human annotations with automatic annotations. I don’t see any notable changes in the service since then.
Amazon SageMaker JumpStart
SageMaker JumpStart is a new “Getting Started” feature of SageMaker Studio, which should help newcomers to SageMaker. As you can see in the screenshot below, there are two new colored icons at the bottom of the left sidebar: the upper one brings up a list of solutions, model endpoints, or training jobs created with SageMaker JumpStart, and the lower one, SageMaker Components and Registries, brings up a list of projects, data wrangler flows, pipelines, experiments, trials, models, or endpoints, or access to the feature store.
The Browse JumpStart button in the SageMaker JumpStart Launched Assets panel brings up the browser tab at the right. The browser lets you look through end-to-end solutions tied to other AWS services, text models, vision models, built-in SageMaker algorithms, example notebooks, blogs, and video tutorials.
When you click on a solution square in the browser, you bring up a documentation screen for the solution, which includes a button to launch the actual solution. When you click on a model square that has a fine-tuning option, you should see both Deploy and Train buttons on the documentation screen for the model, when I brought up the BERT Large Cased text model the Train button was disabled, and had a note that said “Unfortunately, fine-tuning is not yet available for this model.”
Amazon SageMaker Data Wrangler
Amazon claims that the new SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning from weeks to minutes. It essentially gives you an interactive workspace where you can import data and try data transformations; on export you can generate a processing notebook.
Supported data transformations include joining and concatenating datasets; custom transforms and formulas; encoding categorical variables; featurizing text and date/time variables; formatting strings; handling outliers and missing values; managing rows, columns, and vectors; processing numeric variables; search and edit; parse value as type; and validate strings. Custom transformations support PySpark, Pandas, and PySpark (SQL) directly, and other Python libraries with
I’m not sure I buy the “weeks to minutes” claim, unless you already know what you’re doing and could write the code snippets yourself off the top of your head. I’d believe that most people could handle data preparation with SageMaker Data Wrangler in a few hours, given some knowledge of Pandas, PySpark, and machine learning data basics, as well as a feel for statistics.
I went through a demo of SageMaker Data Wrangler using a Titanic passenger survival dataset. It took me most of an afternoon, but cost under $2 despite using some sizable VMs for processing.
Amazon SageMaker Feature Store
Metadata and data sharing are two of the missing links in most machine learning data. SageMaker Feature Store allows you to fix that: It is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning features. As mentioned in the previous section, one way to generate a feature group in Feature Store is to save the output from a SageMaker Data Wrangler flow. Another way is to use a streaming data source such as Amazon Kinesis Data Firehose. Feature Store allows you to standardize your features (for example by converting them all to the same units of measure) and to use them consistently (for example by using the same data for training and inference).
Amazon SageMaker Clarify
SageMaker Clarify is Amazon’s Responsible AI offering. It integrates with SageMaker at three points: in SageMaker Data Wrangler to detect data biases, such as imbalanced classes in the training set; in the Experiments tab of SageMaker Studio to detect biases in the model and to explain the importance of features; and in the SageMaker Model Monitor, to detect bias shifts over time.
Amazon SageMaker Debugger
The SageMaker Debugger is a misnomer, but it’s a useful facility for monitoring and profiling training metrics and system resources during machine learning and deep learning training. It allows you to detect common training errors such as inadequate RAM or GPU memory, gradient values exploding or going to zero, over-utilized CPU or GPU, and error metrics starting to rise during training. When it detects specific conditions, it can stop the training or notify you, depending on how you have set up your rules.