Amazon Web Services Expert

MLS-C01

AWS Certified Machine Learning - Specialty

The AWS Certified Machine Learning - Specialty (MLS-C01) validates expertise in building, training, tuning, and deploying machine learning models on AWS. This specialty certification is designed for data scientists and ML engineers with at least two years of hands-on experience developing, architecting, and running ML/deep learning workloads on AWS.

The exam covers four domains: Data Engineering (20%), Exploratory Data Analysis (24%), Modeling (36%), and Machine Learning Implementation and Operations (20%). Candidates must demonstrate deep understanding of Amazon SageMaker, including training, tuning, and deploying models, as well as knowledge of ML algorithms (linear regression, logistic regression, decision trees, random forests, XGBoost, neural networks, CNNs, RNNs, and reinforcement learning).

Key skills tested include creating data repositories, implementing data ingestion and transformation solutions, performing data visualization and feature engineering, selecting appropriate ML algorithms and frameworks, training and tuning ML models, evaluating model performance using metrics such as accuracy, precision, recall, F1 score, and AUC-ROC, deploying models to production endpoints, and implementing A/B testing and model monitoring.

Note: This certification is being retired on March 31, 2026. AWS recommends the newer AWS Certified Machine Learning Engineer - Associate (MLA-C01) as a replacement path.

Updated May 2024 AI & Machine Learning

65

Questions

6

Practice Tests

75%

Pass Score



MLS-C01 Cheat Sheet

Quick reference guide - 5 sections


AWS Certified Machine Learning - Specialty (MLS-C01)

The MLS-C01 exam validates your ability to design, implement, deploy, and maintain machine learning solutions on AWS. This Specialty-level certification is intended for data scientists, ML engineers, and developers who have at least two years of hands-on experience developing, architecting, and running ML or deep learning workloads on the AWS Cloud. The exam covers the entire ML lifecycle from data preparation through model deployment and operational monitoring, with heavy emphasis on Amazon SageMaker and its ecosystem of tools. Candidates must demonstrate expertise in selecting appropriate ML algorithms, engineering features, training and tuning models, and deploying production-ready ML solutions that meet business requirements for accuracy, latency, cost, and scalability.

Exam Details

Exam Code	MLS-C01
Duration	180 minutes
Number of Questions	65 questions (50 scored + 15 unscored)
Passing Score	750 / 1000
Cost	$300 USD
Validity	3 years
Question Types	Multiple choice (single & multiple select), scenario-based
Testing Options	Pearson VUE testing center or online proctored
Recommended Experience	2+ years hands-on experience with ML/deep learning workloads on AWS
Certification Level	Specialty

Domain Weights

Domain	Weight
Domain 1: Data Engineering	20%
Domain 2: Exploratory Data Analysis	24%
Domain 3: Modeling	36%
Domain 4: Machine Learning Implementation and Operations	20%

Study Tips

Domain 3 (Modeling) carries the highest weight at 36%; dedicate the most study time to understanding SageMaker built-in algorithms, deep learning frameworks, hyperparameter tuning, and model evaluation metrics
Amazon SageMaker is the centerpiece of this exam; you must master every component including training jobs, hosting endpoints, built-in algorithms, Autopilot, Ground Truth, Data Wrangler, Feature Store, Pipelines, Model Monitor, and Clarify
Know the SageMaker built-in algorithms table cold; for each algorithm understand its type (supervised/unsupervised), input format (CSV, RecordIO, JSON), use case, key hyperparameters, and whether it supports GPU training
Data engineering questions focus on building data pipelines; understand S3, Kinesis (Streams, Firehose, Analytics), Glue, EMR, and how data flows from ingestion to training-ready datasets
Feature engineering is heavily tested; know techniques for handling missing data, encoding categorical variables, scaling numerical features, creating time-based features, and applying dimensionality reduction
Understand the complete ML lifecycle end-to-end from data collection and labeling through model deployment and monitoring; the exam tests your ability to connect these stages into production pipelines
Deep learning questions test your knowledge of CNNs for image tasks, RNNs/LSTMs for sequential data, and Transformers for NLP; know when to use each architecture and how to train them on SageMaker
Practice with real AWS scenarios; the exam is heavily scenario-based and expects you to recommend the most appropriate ML approach while considering accuracy, latency, cost, and operational complexity

Question Strategy Tips

Read the last sentence first to identify what is being asked, then read the full scenario with that context in mind; many questions embed the key requirement in the final line
Look for keywords like "real-time inference", "batch transform", "streaming data", "labeled data", "unlabeled data", or "anomaly detection" which point toward specific SageMaker features and algorithms
When two answers seem correct, choose the AWS-managed or SageMaker-native solution over custom implementations unless the question specifically requires a custom approach
SageMaker built-in algorithms are almost always preferred over custom frameworks when the use case matches; the exam favors managed solutions that reduce operational overhead
Pay attention to data format requirements; some algorithms require RecordIO/Protobuf for optimal performance while others accept CSV or JSON; the wrong format choice can be a distractor
Questions about model performance issues require you to distinguish between underfitting and overfitting; know the remediation techniques for each including more data, regularization, cross-validation, and architecture changes
Flag complex questions and return later; do not spend more than 2.5 minutes per question on the first pass
Use the full 180 minutes; ML scenarios require careful reading to identify constraints around data volume, latency requirements, cost budgets, and accuracy thresholds

Key Differences from Other AWS Exams

Machine Learning Specialty requires strong knowledge of ML theory including statistics, linear algebra, probability distributions, gradient descent, and evaluation metrics that are not tested in other AWS exams
Unlike Solutions Architect which focuses on infrastructure design, MLS-C01 expects you to understand the mathematical intuition behind algorithms and know when to apply specific techniques based on data characteristics
This exam tests data preprocessing and feature engineering at a much deeper level than any other AWS certification; you must understand normalization, standardization, one-hot encoding, TF-IDF, word embeddings, and image augmentation
SageMaker is tested more comprehensively than any single service in other Specialty exams; you must know the full service ecosystem including notebook instances, processing jobs, training jobs, tuning jobs, endpoints, and pipelines
The exam expects you to understand deep learning concepts including activation functions, backpropagation, learning rate schedules, batch normalization, dropout, and transfer learning
Model evaluation and selection questions require knowledge of precision, recall, F1 score, AUC-ROC, RMSE, MAE, and confusion matrices; know which metric is appropriate for each business scenario

Recommended Preparation Path

Step 1 - Foundation: Build a solid understanding of ML fundamentals including supervised learning, unsupervised learning, reinforcement learning, bias-variance tradeoff, and the complete ML lifecycle before diving into AWS-specific services
Step 2 - SageMaker Deep Dive: Study every SageMaker component including built-in algorithms, training infrastructure, hosting options, Ground Truth, Autopilot, Data Wrangler, Feature Store, Pipelines, Model Monitor, and Clarify
Step 3 - Data Services: Learn the AWS data engineering stack including S3 data lakes, Kinesis streaming, Glue ETL, EMR for big data processing, and Athena for ad-hoc querying; understand how to build end-to-end data pipelines
Step 4 - Hands-On Labs: Build end-to-end ML projects on SageMaker including data preparation, model training with built-in algorithms and custom frameworks, hyperparameter tuning, and endpoint deployment with auto-scaling
Step 5 - Practice Exams: Take multiple full-length practice exams under timed conditions; review every wrong answer thoroughly and understand why the correct answer is the most appropriate ML solution for the given scenario

Exam Day Checklist

Arrive 15 minutes early for testing center or start your online proctored check-in 30 minutes before the scheduled time
Bring two forms of valid identification (one with photo) for testing center; clear your workspace for online proctoring
You have 180 minutes for 65 questions, which gives you approximately 2 minutes and 46 seconds per question
Use the "Flag for Review" feature liberally on questions you are unsure about; you can return to them later
Read every word in the scenario carefully as questions often contain critical constraints about data volume, latency, accuracy, or budget requirements buried in the middle of the text
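Your score is reported on a scaled range of 100-1000 and you need 750 to pass; because the score is scaled and 15 of the 65 questions are unscored, it does not map exactly to a percentage, but answering roughly 72-75% correctly is a common rule of thumb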
Results are typically available within 1-5 business days through your AWS Certification account
If you do not pass, you can retake the exam after 14 days; there is no limit on the number of attempts
Request accommodations in advance if English is not your first language (extra 30 minutes available for non-native speakers)
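
The timing figures in this checklist and in the question strategy tips can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in this guide (180 minutes, 65 questions, a 2.5-minute first-pass cap); the variable names are illustrative.

```python
# Time-budget arithmetic for a 180-minute, 65-question exam.
TOTAL_MINUTES = 180
QUESTIONS = 65
FIRST_PASS_CAP_MIN = 2.5  # per-question cap suggested in the strategy tips

# Average time available per question, split into minutes and seconds.
seconds_per_question = TOTAL_MINUTES * 60 / QUESTIONS
minutes, seconds = divmod(int(seconds_per_question), 60)

# If every question takes the full first-pass cap, this much time remains for review.
first_pass_minutes = QUESTIONS * FIRST_PASS_CAP_MIN
review_minutes = TOTAL_MINUTES - first_pass_minutes

print(f"Average time per question: {minutes} min {seconds} s")
print(f"First pass at the cap: {first_pass_minutes} min, leaving {review_minutes} min for review")
```

Holding to the 2.5-minute cap on the first pass leaves 17.5 minutes to revisit flagged questions.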

Recommended AWS Whitepapers & Resources

AWS Machine Learning Lens (Well-Architected): Comprehensive guide to building ML workloads following AWS best practices; covers data management, model development, deployment, and operations across all four exam domains
Amazon SageMaker Developer Guide: Deep dive into all SageMaker features including built-in algorithms, training configurations, endpoint hosting, and MLOps capabilities; essential reading for Domains 1, 3, and 4
SageMaker Built-in Algorithms Documentation: Detailed documentation for each built-in algorithm including input formats, hyperparameters, instance recommendations, and tuning guidelines; critical for Domain 3
Big Data Analytics Options on AWS: Covers the data engineering stack including Kinesis, Glue, EMR, and data lake architectures; directly relevant to Domain 1 data pipeline questions
Power of ML Whitepaper: Introduction to ML concepts, common use cases, and how AWS services support each stage of the ML lifecycle; good foundational reading for all domains
Amazon SageMaker Model Monitor Documentation: Covers data drift detection, model quality monitoring, bias detection, and feature attribution drift; essential for Domain 4 operational questions

Domain 1: Data Engineering (20%)

This domain focuses on creating data repositories for machine learning, identifying and implementing data ingestion and transformation solutions, and preparing data for ML model training. You must understand how to build reliable, scalable data pipelines using AWS services that can handle both batch and streaming data. The domain tests your ability to select appropriate data storage formats, implement data labeling strategies, and ensure data quality throughout the ML lifecycle. A strong foundation in data engineering is essential because even the best ML algorithms produce poor results when fed low-quality or improperly formatted data.

Key AWS Data Services for ML

Understanding the role of each data service in the ML pipeline is critical. The following table maps services to their primary function in preparing data for machine learning.

Service	Category	ML Pipeline Role
Amazon S3	Storage	Central data lake for raw data, processed features, model artifacts, and training outputs; supports versioning and lifecycle policies
Amazon Kinesis Data Streams	Streaming	Real-time data ingestion with configurable shard capacity; retention from 24 hours (default) up to 365 days; consumers process records for real-time ML inference
Amazon Kinesis Data Firehose	Streaming ETL	Fully managed delivery stream to S3, Redshift, or OpenSearch; supports data transformation via Lambda; near-real-time delivery with buffering
Amazon Kinesis Data Analytics	Stream Processing	SQL or Apache Flink on streaming data; real-time anomaly detection with Random Cut Forest; sliding window aggregations for feature computation
AWS Glue	ETL / Catalog	Serverless ETL with PySpark; Glue Crawlers auto-discover schemas; Glue Data Catalog provides metadata repository; FindMatches ML transform for deduplication
Amazon EMR	Big Data	Managed Hadoop/Spark clusters for large-scale data processing; supports Spark MLlib for distributed ML; cost-effective with Spot instances
AWS Data Pipeline	Orchestration	Scheduled data movement and transformation workflows; supports dependencies between pipeline activities; integrates with S3, RDS, DynamoDB, and EMR
Amazon Athena	Query	Serverless SQL queries on S3 data; useful for ad-hoc data exploration and validation; supports Parquet, ORC, JSON, and CSV formats

SageMaker Data Preparation Tools

SageMaker provides a comprehensive set of tools for preparing data specifically for ML model training. Understanding when to use each tool is critical for the exam.

Tool	Purpose	Key Features
Ground Truth	Data Labeling	Human labeling workforce (Amazon Mechanical Turk, private, vendor); active learning reduces labeling cost by up to 70%; supports image, text, video, 3D point cloud
Data Wrangler	Data Preparation	Visual data preparation with 300+ built-in transforms; data quality insights; exports to SageMaker Pipeline, Processing, or Feature Store
Feature Store	Feature Management	Centralized feature repository; online store (low-latency reads) and offline store (S3-backed for training); feature versioning and sharing across teams
Processing Jobs	Data Processing	Run data preprocessing, postprocessing, and evaluation scripts; supports scikit-learn, Spark, and custom containers; fully managed infrastructure

Data Formats for ML Training

Choosing the right data format significantly impacts training speed and cost. SageMaker algorithms have specific format preferences that you must know for the exam.

Format	Type	Best For
CSV	Text	Small to medium datasets; widely supported; target variable must be in the first column with no header for SageMaker built-in algorithms
RecordIO/Protobuf	Binary	Optimized for SageMaker built-in algorithms; supports Pipe mode for streaming from S3; significantly faster training than CSV for large datasets
Parquet	Columnar	Columnar storage with compression; ideal for Spark-based processing on EMR; excellent for datasets with many columns where only a subset is needed
JSON / JSON Lines	Text	Flexible schema; used by some SageMaker algorithms; JSON Lines (one JSON object per line) is preferred for large datasets
Image Files	Binary	PNG, JPEG for image classification and object detection; use augmented manifest for metadata; RecordIO for efficient I/O in training
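
As an illustration of the CSV row in the table above, here is a minimal sketch that writes records with the target in the first column and no header row. The records, the field names, and the helper `to_training_csv` are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical records: two features plus a binary label.
records = [
    {"age": 34, "income": 52000, "churned": 0},
    {"age": 51, "income": 87000, "churned": 1},
]


def to_training_csv(rows, target, features):
    """Write rows as header-less CSV with the target in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Target first, then the features in a fixed order.
        writer.writerow([row[target]] + [row[name] for name in features])
    return buffer.getvalue()


csv_text = to_training_csv(records, target="churned", features=["age", "income"])
print(csv_text)
```

In practice the resulting text would be uploaded to S3 as the training channel; that step is omitted here.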

Batch vs Streaming Data Ingestion

Batch Ingestion: Use AWS Glue ETL jobs or EMR Spark jobs for scheduled processing of large datasets; data lands in S3 and is processed on a schedule (hourly, daily); ideal for training data preparation where real-time freshness is not required
Streaming Ingestion: Use Kinesis Data Streams for real-time data capture and Kinesis Data Firehose for near-real-time delivery to S3; essential for real-time feature computation and online inference pipelines
Lambda Architecture: Combine batch and streaming layers; use Kinesis for real-time features and Glue/EMR for comprehensive batch reprocessing; SageMaker Feature Store supports both online (real-time) and offline (batch) stores
Pipe Mode vs File Mode: SageMaker Pipe mode streams data directly from S3 during training without downloading the full dataset; reduces startup time and disk requirements; best for large datasets in RecordIO format

Data Labeling Strategies

SageMaker Ground Truth: Managed data labeling service that supports image classification, object detection, semantic segmentation, text classification, named entity recognition, and 3D point cloud labeling; use active learning to reduce labeling costs by automatically labeling high-confidence examples
Workforce Options: Amazon Mechanical Turk for large-scale public labeling tasks; private workforce for sensitive or domain-specific data; third-party vendors for specialized labeling requiring expert knowledge
Active Learning: Ground Truth uses a model trained on human-labeled data to automatically label remaining examples above a confidence threshold; human annotators only label the low-confidence examples; can reduce labeling costs by up to 70%
Annotation Consolidation: Multiple workers label the same data point; Ground Truth uses annotation consolidation algorithms to determine the correct label; increases label quality and reduces individual annotator bias
Augmented Manifests: JSON Lines format that combines S3 data references with labels inline; avoids copying data; supported by SageMaker training jobs for direct consumption of Ground Truth output
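
To make the augmented manifest idea concrete, the following sketch builds a small JSON Lines file in which each line pairs an S3 reference with an inline label. The bucket, object keys, and the attribute name `label` are assumptions for illustration, and real Ground Truth output also carries a metadata object per label that is left out here.

```python
import json

# Hypothetical S3 URIs and class labels; "label" is an attribute name chosen for this sketch.
labeled = [
    ("s3://example-bucket/images/cat_001.jpg", 0),
    ("s3://example-bucket/images/dog_001.jpg", 1),
]


def build_manifest(items, label_attribute="label"):
    """Return JSON Lines text: one JSON object per line, data reference plus inline label."""
    lines = [
        json.dumps({"source-ref": uri, label_attribute: label})
        for uri, label in items
    ]
    return "\n".join(lines) + "\n"


manifest = build_manifest(labeled)
print(manifest)
```

Because every line is a complete JSON object, the file can be streamed and split line by line without parsing the whole document.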

Data Quality and Validation

Data Quality Checks: Use SageMaker Data Wrangler to profile data and identify quality issues including missing values, duplicate records, outliers, and class imbalance before model training
Schema Validation: AWS Glue Crawlers automatically detect schema changes; use Glue Data Quality rules to enforce constraints on data freshness, completeness, uniqueness, and referential integrity
Data Versioning: Use S3 versioning for training datasets; SageMaker Experiments tracks dataset versions used for each training run; enables reproducibility of model training
Exam Tip: Questions about data quality often require you to identify the root cause of poor model performance; always consider data issues (missing values, incorrect labels, class imbalance, data leakage) before blaming the algorithm or hyperparameters

Domain 2: Exploratory Data Analysis (24%)

This domain covers sanitizing and preparing data for modeling, feature engineering, and analyzing and visualizing data to identify patterns and inform model selection. At 24% of the exam, this is the second-highest weighted domain and requires a solid understanding of statistics, data preprocessing techniques, and visualization tools. You must know how to handle common data quality issues, transform raw features into model-ready inputs, deal with class imbalance, and use statistical methods to understand data distributions and relationships between variables.

Feature Engineering Techniques

Feature engineering transforms raw data into features that better represent the underlying problem, improving model accuracy. The following techniques are frequently tested on the exam.

Technique	Category	Description & When to Use
One-Hot Encoding	Categorical	Convert categorical variables to binary columns; use for nominal categories with no ordinal relationship; can create high dimensionality with many categories
Label Encoding	Categorical	Assign integer values to categories; use for ordinal variables where order matters (e.g., low/medium/high); not suitable for nominal categories as it implies ordering
Target Encoding	Categorical	Replace category with mean of target variable; useful for high-cardinality categorical features; requires regularization to prevent overfitting
Min-Max Scaling	Numerical	Scale features to [0,1] range; preserves zero entries only when the feature minimum is 0, so sparse data is usually better served by max-abs scaling; sensitive to outliers; use when features have different units
Standardization (Z-score)	Numerical	Transform to mean=0, std=1; preferred for algorithms assuming normal distribution; less sensitive to outliers than Min-Max, though extreme values still shift the mean and standard deviation; recommended before PCA and many linear models
Log Transform	Numerical	Reduce right skewness; compress wide ranges; make multiplicative relationships additive; useful for income, price, and count data
Binning / Discretization	Numerical	Convert continuous variables to categorical bins; reduces noise; useful for age groups, price ranges; can lose information if bins are too coarse
Polynomial Features	Interaction	Create interaction terms and higher-degree features; captures non-linear relationships in linear models; increases dimensionality significantly
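
Three of the techniques in the table (one-hot encoding, min-max scaling, and standardization) can be sketched in plain Python. This is a teaching sketch under simple assumptions (non-constant numeric columns, small in-memory lists); the function names and sample data are illustrative, and production pipelines would normally use a library implementation.

```python
import statistics


def one_hot(values):
    """Map each category to a binary vector; columns follow the sorted category names."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def min_max_scale(values):
    """Rescale to [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


colors = ["red", "green", "red", "blue"]
incomes = [30_000, 45_000, 60_000, 1_000_000]  # the last value is an outlier

print(one_hot(colors))
print(min_max_scale(incomes))
print(standardize(incomes))
```

With the outlier present, min-max scaling squeezes the three ordinary incomes into a narrow band near 0, which is the outlier sensitivity noted in the table.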

Handling Missing Data

Mean/Median Imputation: Replace missing numerical values with the mean (for normally distributed data) or median (for skewed distributions); simple and fast but reduces variance and ignores relationships between features
Mode Imputation: Replace missing categorical values with the most frequent category; appropriate when the missing mechanism is random and the mode category is dominant
Forward/Backward Fill: Use the previous or next valid value in time-series data; maintains temporal patterns; appropriate when values change infrequently over time
KNN Imputation: Use K-nearest neighbors to impute missing values based on similar records; captures relationships between features; computationally expensive for large datasets
Indicator Variable: Create a binary column indicating whether the value was missing; preserves the information that the value was absent; useful when missingness itself is predictive
Dropping Records: Remove rows with missing values; only appropriate when missing data is a small percentage (less than 5%) and is missing completely at random (MCAR); can introduce bias if data is not MCAR
Exam Tip: The exam often presents scenarios where the correct approach depends on the percentage of missing data, the missing mechanism (MCAR, MAR, MNAR), and whether the feature is critical for prediction; always consider the impact on model bias
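
Median imputation and the indicator-variable technique are often combined. The sketch below shows one way to do that, assuming missing values are represented as `None`; the function name and sample data are illustrative.

```python
import statistics


def impute_with_indicator(values):
    """Fill None with the median of the observed values and flag which entries were missing."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


ages = [25, None, 40, 31, None, 90]
imputed, was_missing = impute_with_indicator(ages)
print(imputed)
print(was_missing)
```

The median is used rather than the mean because the sample is skewed by the value 90; the indicator column keeps the fact that a value was absent available to the model.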

Handling Outliers

Z-Score Method: Flag data points more than 3 standard deviations from the mean; assumes normal distribution; simple to implement but not robust for non-Gaussian data
IQR Method: Calculate Q1-1.5*IQR and Q3+1.5*IQR as boundaries; works well for skewed distributions; does not assume normality; preferred for exploratory analysis
Winsorization: Cap extreme values at a specified percentile (e.g., 1st and 99th); preserves all records while limiting the impact of extremes; useful when outliers represent valid but rare observations
Log Transformation: Compresses the range of values, reducing the impact of outliers; effective when the data has a right-skewed distribution; can bring the distribution closer to normal; requires positive values (use log(1+x) when zeros are present)
Random Cut Forest: SageMaker built-in algorithm for anomaly detection; assigns anomaly scores to data points; use Kinesis Data Analytics with Random Cut Forest for real-time streaming anomaly detection
Exam Tip: Do not blindly remove outliers; first determine if they are errors (remove or correct), rare but valid observations (keep and use robust methods), or a separate population (model separately); the exam tests your judgment on when to keep versus remove outliers
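
The difference between the Z-score and IQR methods shows up clearly on a small sample. The following sketch implements both under stated assumptions: quartiles come from Python's `statistics.quantiles` with the "inclusive" method (other quartile conventions give slightly different fences), and the readings are made-up values.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` population standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(readings))
print(zscore_outliers(readings))
```

On this sample the IQR rule flags 102 while the Z-score rule flags nothing, because the extreme value inflates the mean and standard deviation it is measured against.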

Handling Imbalanced Classes

SMOTE (Synthetic Minority Over-sampling): Generate synthetic examples of the minority class by interpolating between existing minority samples; preferred over simple duplication as it adds diversity; can create noisy samples near class boundaries
Random Oversampling: Duplicate minority class examples to match majority class size; simple but can lead to overfitting on duplicated samples; use with caution and monitor validation performance
Random Undersampling: Remove majority class examples to match minority class size; risks losing important information; best when the majority class has clear redundancy
Class Weights: Assign higher weight to minority class samples in the loss function; supported by several SageMaker algorithms (for example scale_pos_weight in XGBoost and positive_example_weight_mult in Linear Learner); no data modification needed; effectively penalizes misclassification of minority class more heavily
Threshold Adjustment: After training, adjust the classification threshold to favor the minority class; useful when the cost of false negatives is much higher than false positives (e.g., fraud detection)
Evaluation Metrics: Use precision, recall, F1 score, or AUC-ROC instead of accuracy for imbalanced datasets; accuracy can be misleading when one class dominates (99% accuracy by predicting majority class)
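
The accuracy trap and the class-weight idea can both be shown on a synthetic 99:1 dataset. This is a minimal sketch: the weight formula `n_samples / (n_classes * class_count)` is one common heuristic assumed here, not something prescribed by the exam, and all names are illustrative.

```python
from collections import Counter


def metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


y_true = [0] * 990 + [1] * 10      # 1% positive class
always_negative = [0] * 1000       # a "model" that only predicts the majority class

scores = metrics(y_true, always_negative)
weights = balanced_class_weights(y_true)
print(scores)
print(weights)
```

The majority-class predictor reaches 99% accuracy with zero recall, which is why recall, F1, or AUC-ROC are the safer metrics here; the heuristic gives the minority class a weight of 50.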

Dimensionality Reduction

PCA (Principal Component Analysis): SageMaker built-in algorithm; reduces dimensions by finding orthogonal components that explain maximum variance; works best on standardized features; regular mode for sparse data with a moderate number of observations and features, randomized mode (an approximation algorithm) for datasets with many observations and features
t-SNE: Non-linear dimensionality reduction for visualization; maps high-dimensional data to 2D/3D; reveals clusters and patterns not visible in PCA; computationally expensive; not suitable for feature reduction in training pipelines
Feature Selection: Remove irrelevant or redundant features using correlation analysis, mutual information, or recursive feature elimination; simpler than transformation-based methods; maintains interpretability of remaining features
Exam Tip: PCA is the go-to answer for dimensionality reduction on the MLS-C01 exam; know that it requires standardized input, the number of components to keep is based on explained variance ratio (typically 95%), and it is used both as preprocessing and as a SageMaker built-in algorithm
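
Choosing the number of components from the explained variance ratio is a cumulative-sum calculation. The sketch below assumes the eigenvalues of the covariance matrix are already known (the values shown are invented) and finds the smallest number of leading components that reaches a target ratio.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose cumulative explained-variance ratio reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues for a six-feature dataset.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]
print(components_for_variance(eigenvalues))
```

Each eigenvalue divided by the total is that component's explained variance ratio; here the first four components explain 90% and the fifth is needed to pass 95%.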

Text Data Preprocessing

Tokenization: Split text into individual words or subwords; sentence tokenization for document-level tasks; word tokenization for feature extraction; subword tokenization (BPE) for neural models handling out-of-vocabulary words
Stop Word Removal: Remove common words (the, is, at) that add noise; reduces feature space; be cautious in sentiment analysis where words like "not" are critical despite being common
Stemming / Lemmatization: Reduce words to root form; stemming (faster, rule-based, e.g., running to run) vs lemmatization (accurate, dictionary-based, e.g., better to good); reduces vocabulary size
TF-IDF: Term Frequency-Inverse Document Frequency; weighs word importance by frequency in document vs rarity across corpus; better than raw counts for text classification; produces sparse feature vectors
Word Embeddings: Dense vector representations (Word2Vec, GloVe, FastText); capture semantic relationships; SageMaker BlazingText implements Word2Vec; pre-trained embeddings available for transfer learning
N-grams: Capture word sequences (bigrams, trigrams); preserve local context; combined with TF-IDF for text classification; increase feature dimensionality significantly
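
Several of the steps above (tokenization, stop word removal, bigrams, and TF-IDF) fit in a short sketch. It uses whitespace tokenization, a tiny illustrative stop list, and the unsmoothed form `tf * log(N / df)`; libraries typically use smoothed variants, so their numbers will differ. Documents are assumed to be non-empty after stop word removal.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "on"}  # tiny illustrative list, not a standard one


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def bigrams(tokens):
    """Adjacent token pairs, preserving local word order."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the model is accurate", "the model is slow", "a fast accurate endpoint"]
doc_scores = tf_idf(docs)
print(doc_scores[1])
print(bigrams(tokenize(docs[2])))
```

In the second document "slow" outscores "model" because it appears in only one document, which is the rarity weighting that TF-IDF adds over raw counts.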

Image Data Preprocessing

Resizing: Standardize image dimensions to match model input requirements; most CNNs expect fixed-size inputs (e.g., 224x224 for ResNet); use bicubic or bilinear interpolation
Normalization: Scale pixel values from [0,255] to [0,1] or standardize with ImageNet mean/std; required for convergence in deep learning; must apply same normalization at inference time
Data Augmentation: Generate additional training samples through random flips, rotations, crops, color jitter, and scaling; reduces overfitting; critical when training data is limited; apply only to training set, never to validation or test
RecordIO Conversion: Convert images to RecordIO format for efficient I/O during SageMaker training; reduces S3 read overhead; essential for large image datasets with Pipe mode training
Exam Tip: Image augmentation is the exam-favorite answer for improving image classification accuracy when training data is limited; know the specific augmentation techniques and that they should only be applied to training data
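
The rule that augmentation applies only to training data while normalization applies everywhere can be sketched on a toy grayscale image stored as nested lists. The helper names and the 0.5 flip probability are assumptions for illustration; real pipelines would use an image library.

```python
import random


def scale_pixels(image):
    """Map 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [[pixel / 255 for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def augment(image, rng):
    """Flip with probability 0.5; intended for training images only."""
    return horizontal_flip(image) if rng.random() < 0.5 else image


image = [[0, 128, 255],
         [64, 32, 16]]
rng = random.Random(0)  # seeded so repeated runs behave the same way

train_sample = scale_pixels(augment(image, rng))  # augmentation, then normalization
validation_sample = scale_pixels(image)           # normalization only, no augmentation

print(train_sample)
print(validation_sample)
```

The same `scale_pixels` step must also run at inference time, otherwise the model sees inputs on a different scale from the one it was trained on.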

Statistical Analysis and Visualization

Distribution Analysis: Use histograms and density plots to understand feature distributions; identify skewness, multimodality, and the need for transformations; some algorithms and statistical tests assume approximately normal inputs or errors, so check the distribution before relying on them
Correlation Analysis: Pearson correlation for linear relationships; Spearman for monotonic relationships; identify multicollinearity between features; remove highly correlated features to reduce dimensionality
Scatter Plots / Pair Plots: Visualize relationships between pairs of features; identify non-linear patterns, clusters, and outliers; essential for understanding feature interactions
Box Plots: Show distribution, median, quartiles, and outliers; useful for comparing distributions across categories; quickly identify skewness and outlier presence
Amazon QuickSight: AWS managed BI service for interactive dashboards; ML-powered insights with anomaly detection and forecasting; integrates with S3, Athena, Redshift, and RDS data sources
SageMaker Notebooks: Jupyter notebooks with matplotlib, seaborn, and pandas for custom visualization; preferred for detailed EDA workflows; can connect to Athena, S3, and other AWS data sources
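
The distinction between Pearson and Spearman correlation is easiest to see on a monotonic but non-linear relationship. This sketch implements both from their definitions; the rank helper assumes there are no tied values, and the data is synthetic.

```python
import statistics


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Assign ranks 1..n by ascending value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but not linear

print(round(pearson(x, y), 3))
print(round(spearman(x, y), 3))
```

Spearman reports a perfect monotonic relationship while Pearson falls short of 1, since the cubic curve is not a straight line.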

Domain 3: Modeling (36%)

This is the highest-weighted domain at 36% of the exam, covering the selection, training, tuning, and evaluation of ML models. You must have deep knowledge of SageMaker built-in algorithms and when to use each one, understand deep learning architectures, know how to configure hyperparameter tuning jobs, apply regularization techniques, and evaluate model performance using appropriate metrics. This domain also covers training infrastructure choices including instance types, distributed training, and managed training features like Spot training and checkpointing.

SageMaker Built-in Algorithms

SageMaker provides optimized implementations of common ML algorithms. Knowing each algorithm, its type, input format, and ideal use case is critical for the exam.

Algorithm	Type	Use Case
Linear Learner	Supervised	Binary/multi-class classification and regression; supports L1/L2 regularization; handles high-dimensional sparse data efficiently
XGBoost	Supervised	Classification and regression with gradient boosted trees; best for structured/tabular data; highly configurable; supports CSV and LibSVM input
K-Nearest Neighbors (KNN)	Supervised	Classification and regression based on proximity; index-based for fast inference; sampling and dimensionality reduction for large datasets
Factorization Machines	Supervised	Recommendation systems and click-through prediction; captures pairwise feature interactions; excels with high-dimensional sparse data
Image Classification	Supervised (DL)	CNN-based image classification; supports transfer learning with pre-trained ResNet; fine-tuning and full training modes; RecordIO or augmented manifest input
Object Detection	Supervised (DL)	Detect and localize objects in images; the built-in algorithm uses the Single Shot multibox Detector (SSD) framework with a VGG or ResNet base network; transfer learning from pre-trained models; outputs bounding boxes with class labels
Semantic Segmentation	Supervised (DL)	Pixel-level classification of images; FCN, PSP, DeepLab architectures; used for autonomous driving, medical imaging, satellite analysis
BlazingText	Supervised / Unsupervised	Text classification (supervised) and Word2Vec embeddings (unsupervised); extremely fast training; single or multi-core CPU and GPU modes
Sequence to Sequence	Supervised (DL)	Machine translation, text summarization, speech-to-text; encoder-decoder architecture with attention mechanism; requires tokenized input
DeepAR	Supervised (DL)	Time-series forecasting using autoregressive RNN; trains on multiple related time series simultaneously; produces probabilistic forecasts with quantiles
K-Means	Unsupervised	Clustering data into K groups; web-scale implementation; uses mini-batch SGD; choose K using elbow method or silhouette score
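PCA	Unsupervised	Dimensionality reduction; regular mode for sparse data with a moderate number of observations and features, randomized mode (an approximation algorithm) for datasets with a large number of observations and features; useful as preprocessing before other algorithms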
Random Cut Forest	Unsupervised	Anomaly detection; assigns anomaly scores; integrated with Kinesis Analytics for real-time streaming anomaly detection; no labeled data required
LDA (Latent Dirichlet Allocation)	Unsupervised	Topic modeling; discovers hidden topics in document collections; each document is a mixture of topics; useful for content recommendation and organization
NTM (Neural Topic Model)	Unsupervised	Neural network-based topic modeling; often produces more coherent topics than LDA; supports GPU training for large document corpora
IP Insights	Unsupervised	Detect anomalous IP address usage patterns; identifies compromised credentials; learns associations between entities and IP addresses

Deep Learning Architectures

CNN (Convolutional Neural Network): Primary architecture for image tasks; convolutional layers extract spatial features through learnable filters; pooling layers reduce spatial dimensions; fully connected layers for final classification; key architectures include ResNet, VGG, Inception; use transfer learning with pre-trained models for small image datasets
RNN (Recurrent Neural Network): Process sequential data with memory of previous inputs; suffer from vanishing gradient problem for long sequences; suitable for short text classification and simple time-series; largely replaced by LSTM/GRU for longer sequences
LSTM (Long Short-Term Memory): Specialized RNN with gating mechanisms (forget, input, output gates) that mitigate the vanishing gradient problem; excellent for long sequences including time-series forecasting, speech recognition, and language modeling; bidirectional LSTM reads sequences in both directions for better context
GRU (Gated Recurrent Unit): Simplified version of LSTM with fewer parameters (reset and update gates); faster training than LSTM; comparable performance for many tasks; preferred when training time and computational resources are limited
Transformers: Attention-based architecture that processes all positions simultaneously; self-attention captures long-range dependencies without recurrence; foundation of BERT, GPT, and modern NLP; superior parallelization compared to RNNs
Autoencoders: Encoder-decoder architecture for unsupervised feature learning; compresses input to low-dimensional representation and reconstructs; variational autoencoders (VAE) for generative tasks; denoising autoencoders for robust feature extraction
GANs (Generative Adversarial Networks): Generator creates synthetic data while discriminator evaluates authenticity; used for data augmentation, image generation, and style transfer; training can be unstable (mode collapse); not heavily tested but know the concept

Hyperparameter Tuning

SageMaker Automatic Model Tuning: Managed hyperparameter optimization service; define objective metric, parameter ranges, and max training jobs; creates tuning job that launches multiple training jobs with different hyperparameter combinations
Bayesian Optimization: Default strategy for SageMaker tuning; builds probabilistic model of objective function; balances exploration and exploitation; more efficient than random or grid search for expensive training jobs
Random Search: Samples hyperparameters randomly from defined ranges; good baseline; more efficient than grid search in high dimensions; can run in parallel without dependencies between trials
Hyperband: Early stopping strategy that allocates resources to promising configurations; quickly discards poor performers; efficient for large hyperparameter spaces; supported by SageMaker tuning
Key Hyperparameters: Learning rate (most impactful), batch size, number of epochs, regularization strength (L1/L2), number of layers/neurons, dropout rate, momentum; know which hyperparameters to tune first for each algorithm
Warm Start: Initialize new tuning job from results of previous tuning jobs; reduces total training time; use when refining hyperparameters after initial broad search
Exam Tip: The exam frequently asks about choosing between Bayesian optimization and random search; Bayesian is preferred for expensive training jobs with few parameters, while random search is better for highly parallel exploration of large parameter spaces
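
To make the random search strategy concrete, here is a minimal sketch in plain Python. It is not the SageMaker tuning API: the parameter ranges, the toy_objective function (a stand-in for a validation loss returned by a training job), and the job counts are all assumptions chosen for illustration.

```python
import math
import random

def sample_config(rng):
    """Draw one hyperparameter combination from hypothetical ranges."""
    return {
        # Learning rate sampled log-uniformly between 1e-4 and 1e-1.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "max_depth": rng.randint(3, 10),
    }

def toy_objective(config):
    """Stand-in for a validation loss; a real job would train a model here."""
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2
    depth_term = 0.05 * (config["max_depth"] - 6) ** 2
    return lr_term + depth_term

def random_search(max_jobs, seed=0):
    """Evaluate max_jobs independent random trials and return the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(max_jobs)]
    scored = [(toy_objective(config), config) for config in trials]
    return min(scored, key=lambda pair: pair[0])

best_loss, best_config = random_search(max_jobs=20)
print(best_loss, best_config)
```

Because no trial depends on another, all 20 evaluations could run in parallel, which is the property the exam tip contrasts with Bayesian optimization, where each new trial is informed by earlier results.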

Regularization Techniques

L1 Regularization (Lasso): Adds absolute value of weights to loss function; produces sparse models by driving some weights to exactly zero; useful for feature selection; preferred when you expect many irrelevant features
L2 Regularization (Ridge): Adds squared weights to loss function; shrinks all weights toward zero without eliminating features; prevents any single feature from dominating; default choice for most algorithms
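Elastic Net: Combines L1 and L2 regularization; balances feature selection with weight shrinkage; useful when features are correlated; a mixing parameter controls the L1/L2 ratio (named alpha in some libraries and l1_ratio in scikit-learn, where alpha sets overall strength instead)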
Dropout: Randomly deactivate neurons during training; forces network to learn redundant representations; typical rates of 0.2-0.5; apply to hidden layers; do not apply during inference
Early Stopping: Monitor validation loss and stop training when it starts increasing; prevents overfitting; SageMaker supports early stopping in hyperparameter tuning jobs
Batch Normalization: Normalize layer inputs to reduce internal covariate shift; allows higher learning rates; acts as mild regularization; applied between layers in deep networks
Data Augmentation: Generate additional training samples through transformations; effective regularizer for image and text tasks; reduces overfitting by increasing effective training set size
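
The penalty terms described above can be sketched in a few lines. The weights, the strength lam, and the mixing value l1_ratio are hypothetical; the soft_threshold function shows one common mechanism (the proximal update for L1) by which small weights become exactly zero, and is not a claim about how any particular SageMaker algorithm is implemented.

```python
def l1_penalty(weights, lam):
    """Lasso term: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam, l1_ratio):
    """Weighted mix of the L1 and L2 terms; l1_ratio=1 is pure L1."""
    return (l1_ratio * l1_penalty(weights, lam)
            + (1 - l1_ratio) * l2_penalty(weights, lam))

def soft_threshold(w, t):
    """Shrink w toward zero by t; weights with |w| <= t become exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

weights = [0.5, -2.0, 0.05, 1.5]
print(l1_penalty(weights, lam=0.1))
print(l2_penalty(weights, lam=0.1))
print([soft_threshold(w, 0.1) for w in weights])
```

Note that the small weight 0.05 is eliminated by the threshold step while the larger weights are only shrunk, which is why L1 tends to produce sparse models.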

Model Evaluation Metrics

Metric	Task Type	When to Use
Accuracy	Classification	Balanced classes only; proportion of correct predictions; misleading for imbalanced datasets
Precision	Classification	When false positives are costly (spam filtering, content moderation); TP / (TP + FP)
Recall (Sensitivity)	Classification	When false negatives are costly (fraud detection, medical diagnosis); TP / (TP + FN)
F1 Score	Classification	Harmonic mean of precision and recall; when you need balance between FP and FN costs; good for imbalanced datasets
AUC-ROC	Classification	Area under the ROC curve, which plots true positive rate against false positive rate across all classification thresholds; threshold-independent; 0.5 = random, 1.0 = perfect
RMSE	Regression	Root Mean Squared Error; penalizes large errors more; same units as target; sensitive to outliers
MAE	Regression	Mean Absolute Error; more robust to outliers than RMSE; equal weight to all errors; use when outliers should not dominate
R-squared	Regression	Proportion of variance explained; typically 0 to 1 (higher is better); can be negative when the model fits worse than predicting the mean; easy to interpret
Silhouette Score	Clustering	Measures cluster cohesion and separation; -1 to 1 (higher is better); use to determine optimal K in K-Means
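
The formulas in the table can be checked with a short worked example. The confusion-matrix counts and regression values below are made up to show two points from the table: accuracy looks excellent on an imbalanced dataset even when recall is poor, and a single large error inflates RMSE more than MAE.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical fraud dataset: 20 positives out of 1,000 records.
fraud = classification_metrics(tp=8, fp=2, fn=12, tn=978)
print(fraud)

# Hypothetical regression predictions with one large miss.
y_true = [10, 12, 14, 16]
y_pred = [11, 12, 13, 26]
print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Here accuracy is 0.986 while recall is only 0.4, so a fraud-detection scenario like this would call for recall or F1 rather than accuracy.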

Training Infrastructure on SageMaker

Instance Selection: ml.m5 for general purpose CPU training; ml.c5 for compute-intensive tabular data (XGBoost, Linear Learner); ml.p3/ml.p4 for GPU deep learning; ml.g4dn for cost-effective GPU inference; match instance type to algorithm requirements
Distributed Training: Data parallelism splits mini-batches across multiple GPUs; model parallelism splits model layers across GPUs when model is too large for single GPU; SageMaker distributed training library simplifies both approaches
Managed Spot Training: Use EC2 Spot instances for up to 90% cost reduction; SageMaker handles interruptions with checkpointing; specify max wait time and max runtime; ideal for fault-tolerant training jobs
Checkpointing: Save model state periodically during training; enables resume from interruption (critical for Spot training); also useful for monitoring training progress and selecting best intermediate model
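Custom Containers: Bring your own Docker container with custom frameworks; the container must follow the SageMaker container contract (for example, reading inputs from and writing the model to the expected /opt/ml paths), which the SageMaker training toolkit can simplify; or extend pre-built SageMaker containers with additional dependencies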
Frameworks: SageMaker provides pre-built containers for TensorFlow, PyTorch, MXNet, Hugging Face, scikit-learn, and XGBoost; script mode allows custom training scripts with managed infrastructure
Exam Tip: Know that SageMaker built-in algorithms are already optimized and do not require custom containers; use pre-built framework containers for custom models; bring-your-own-container only when you need libraries not available in pre-built images
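
The checkpoint-and-resume pattern that makes Spot interruptions tolerable can be sketched without any AWS dependency. In the sketch below the training loop, the state.json file name, and the stop_after argument (used to simulate an interruption) are all hypothetical. In a SageMaker job the script would typically write to the local checkpoint directory that SageMaker syncs to S3 (commonly /opt/ml/checkpoints); a temporary directory stands in for it here.

```python
import json
import os
import tempfile

def load_checkpoint(checkpoint_dir):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, "state.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}

def save_checkpoint(checkpoint_dir, state):
    """Write the state atomically so an interruption cannot corrupt it."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    tmp_path = os.path.join(checkpoint_dir, "state.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(checkpoint_dir, "state.json"))

def train(checkpoint_dir, total_epochs, stop_after=None):
    """Resume from the last checkpoint and train the remaining epochs."""
    state = load_checkpoint(checkpoint_dir)
    epochs_run = 0
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epochs_run == stop_after:
            break  # simulate a Spot interruption
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}  # fake training step
        save_checkpoint(checkpoint_dir, state)
        epochs_run += 1
    return state, epochs_run

with tempfile.TemporaryDirectory() as ckpt_dir:
    first_state, first_run = train(ckpt_dir, total_epochs=5, stop_after=2)
    final_state, second_run = train(ckpt_dir, total_epochs=5)

print(first_state, final_state, second_run)
```

The second call only runs the three remaining epochs, which is the behavior that keeps an interrupted Spot job from starting over.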

Bias-Variance Tradeoff

Underfitting (High Bias): Model is too simple to capture patterns; high training error and high validation error; remedies include more features, more complex model, less regularization, longer training, polynomial features
Overfitting (High Variance): Model memorizes training data including noise; low training error but high validation error; remedies include more training data, regularization (L1/L2/dropout), simpler model, early stopping, cross-validation
Cross-Validation: K-fold cross-validation splits data into K folds and trains K models; provides robust performance estimate; reduces variance of evaluation metrics; standard K=5 or K=10
Learning Curves: Plot training and validation error vs training set size; converging curves indicate good fit; large gap indicates overfitting; both high indicates underfitting; essential diagnostic tool
Exam Tip: The exam frequently presents scenarios with performance issues and asks you to identify underfitting vs overfitting; always look at the gap between training and validation metrics to diagnose the problem before selecting a solution
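
The diagnostic reasoning above can be expressed as a small heuristic, together with a K-fold index splitter. The thresholds acceptable_error and max_gap are arbitrary assumptions for illustration; in practice they depend on the problem and the metric.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def diagnose(train_error, val_error, acceptable_error=0.10, max_gap=0.05):
    """Classify a fit from training and validation error (heuristic only)."""
    if train_error > acceptable_error:
        return "underfitting (high bias)"
    if val_error - train_error > max_gap:
        return "overfitting (high variance)"
    return "reasonable fit"

folds = list(k_fold_indices(n_samples=10, k=3))
print([len(val) for _, val in folds])
print(diagnose(0.30, 0.32))
print(diagnose(0.02, 0.25))
print(diagnose(0.05, 0.07))
```

Each sample lands in exactly one validation fold, so averaging the K validation scores uses every record for evaluation once.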

Domain 4: Machine Learning Implementation and Operations (20%)

This domain covers building, deploying, and maintaining ML solutions in production. You must understand the various SageMaker deployment options including real-time endpoints, batch transform, serverless inference, and asynchronous inference. The domain also tests your knowledge of MLOps practices including model monitoring, CI/CD pipelines, A/B testing, model registry, and operational security. Understanding how to monitor model quality, detect data drift, ensure fairness, and implement cost-effective auto-scaling strategies is essential for passing this section of the exam.

SageMaker Deployment Options

Choosing the right deployment option is critical and depends on latency requirements, traffic patterns, and cost constraints. The exam frequently tests your ability to match deployment types to use cases.

Deployment Type	Latency	Use Case & Characteristics
Real-time Endpoints	Milliseconds	Persistent endpoint for synchronous inference; always running; auto-scaling based on traffic; supports A/B testing with production variants; use for interactive applications requiring sub-second responses
Batch Transform	Minutes-Hours	Process entire datasets in S3; no persistent endpoint; pay only for processing time; ideal for overnight scoring, periodic reports, and large-scale batch predictions
Serverless Inference	Seconds (cold start)	Auto-scales to zero when idle; no minimum cost; cold start latency on first request; ideal for intermittent traffic patterns with unpredictable spikes; specify memory size and max concurrency
Asynchronous Inference	Seconds-Minutes	Requests are placed in an internal queue managed by SageMaker and processed asynchronously; results are written to S3, with optional SNS notifications; can scale to zero; ideal for large payloads (up to 1 GB), long processing times, and cost-sensitive workloads with tolerance for delayed responses
Multi-Model Endpoints	Milliseconds	Host multiple models on single endpoint; dynamically loads/unloads models from S3; reduces hosting cost for many models; ideal for multi-tenant SaaS applications with per-customer models
Multi-Container Endpoints	Milliseconds	Run multiple containers on single endpoint; serial inference pipeline or direct invocation; use for preprocessing, inference, and postprocessing in separate containers
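
One way to internalize the table is to write it as a decision function. The sketch below is a study aid under simplifying assumptions, not an official selection rule: the parameter names are hypothetical, the checks are applied in a fixed priority order, and real scenarios may weigh several factors at once.

```python
def choose_deployment(*, whole_dataset_in_s3=False, large_payload=False,
                      long_running=False, many_models=False,
                      multi_step_pipeline=False, intermittent_traffic=False):
    """Map workload traits to a SageMaker deployment type (heuristic)."""
    if whole_dataset_in_s3:
        return "Batch Transform"
    if large_payload or long_running:
        return "Asynchronous Inference"
    if many_models:
        return "Multi-Model Endpoints"
    if multi_step_pipeline:
        return "Multi-Container Endpoints"
    if intermittent_traffic:
        return "Serverless Inference"
    return "Real-time Endpoints"

print(choose_deployment(whole_dataset_in_s3=True))
print(choose_deployment(large_payload=True))
print(choose_deployment(intermittent_traffic=True))
print(choose_deployment())
```

The default branch is the persistent real-time endpoint, which matches the table: it is the choice when none of the cost-saving or special-case conditions apply and low latency is required.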

Model Monitoring and Drift Detection

SageMaker Model Monitor: Automatically monitors deployed models for data quality, model quality, bias, and feature attribution drift; creates baseline from training data; generates violation reports when drift is detected; integrates with CloudWatch for alerting
Data Quality Monitor: Detects statistical drift in input features; compares live inference data against training baseline; tracks metrics like mean, std, missing values, and distribution shape; alerts when features deviate beyond configurable thresholds
Model Quality Monitor: Tracks prediction accuracy over time by comparing predictions to ground truth labels; monitors metrics like accuracy, precision, recall, RMSE; requires ground truth labels to be uploaded to S3 and merged with predictions
Bias Drift Monitor: Uses SageMaker Clarify to detect changes in bias metrics over time; compares bias measured on live inference data and predictions against a baseline established at training time; essential for regulatory compliance and fairness requirements
Feature Attribution Drift: Uses SHAP values to monitor changes in feature importance; detects when the model begins relying on different features than during training; indicates potential concept drift requiring model retraining
Concept Drift: The relationship between input features and the target variable changes over time; Model Monitor detects symptoms but retraining is the remedy; implement automated retraining pipelines triggered by drift alerts
CloudWatch Integration: Model Monitor publishes metrics to CloudWatch; create alarms for threshold violations; trigger Lambda functions or Step Functions for automated remediation or retraining workflows
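
The baseline-versus-live comparison behind data quality monitoring can be illustrated with a deliberately simplified check on one numeric feature. Model Monitor computes its own statistics and constraints; the functions, the sample values, and the max_std_shift threshold below are assumptions used only to show the idea.

```python
import statistics

def build_baseline(training_values):
    """Summarize a training feature as mean and standard deviation."""
    return {"mean": statistics.fmean(training_values),
            "std": statistics.stdev(training_values)}

def check_drift(baseline, live_values, max_std_shift=2.0):
    """Flag a violation when the live mean moves too far from the baseline."""
    live_mean = statistics.fmean(live_values)
    std = baseline["std"] or 1e-12  # avoid division by zero for constant features
    shift = abs(live_mean - baseline["mean"]) / std
    return {"live_mean": live_mean, "shift_in_std": shift,
            "violation": shift > max_std_shift}

baseline = build_baseline([48, 50, 52, 49, 51, 50])
stable = check_drift(baseline, [49, 51, 50, 50])
drifted = check_drift(baseline, [58, 60, 62, 59])
print(stable)
print(drifted)
```

In a production setup the violation flag would correspond to a metric published to CloudWatch, where an alarm could trigger a retraining workflow.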

SageMaker Clarify

Pre-training Bias Detection: Analyze training data for statistical biases before model training; detects class imbalance, label imbalance, and feature-level biases across sensitive attributes (race, gender, age); generates bias reports with specific metrics
Post-training Bias Detection: Analyze model predictions for disparate impact across groups; measures metrics like Demographic Parity Difference, Equal Opportunity Difference, and Treatment Equality; compares predicted outcomes across sensitive groups
Feature Attribution (Explainability): Uses SHAP (SHapley Additive exPlanations) values to explain individual predictions; identifies which features contributed most to each prediction; supports both global and local explanations
Exam Tip: Clarify is the go-to answer for any exam question about model explainability, bias detection, or fairness; know that it integrates with SageMaker Pipelines, Model Monitor, and can be run as standalone processing jobs
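
Two of the simpler bias measures can be computed by hand to see what a bias report is quantifying. The sketch assumes the commonly documented definitions of class imbalance, (n_a - n_d) / (n_a + n_d), and of the difference in positive proportions between two groups; the group sizes and prediction counts are invented, and the Clarify documentation should be treated as the authority on exact metric definitions.

```python
def class_imbalance(n_a, n_d):
    """Pre-training measure: how unevenly two groups are represented."""
    return (n_a - n_d) / (n_a + n_d)

def positive_proportion_difference(pos_a, n_a, pos_d, n_d):
    """Difference in the share of positive outcomes between two groups."""
    return pos_a / n_a - pos_d / n_d

# Hypothetical dataset: group a has 800 records, group d has 200.
ci = class_imbalance(n_a=800, n_d=200)

# Hypothetical predictions: 400 of 800 positive for a, 50 of 200 for d.
gap = positive_proportion_difference(pos_a=400, n_a=800, pos_d=50, n_d=200)

print(ci, gap)
```

A value near zero on either measure indicates balance; here both are clearly positive, meaning group a is over-represented and also receives positive predictions at twice the rate of group d.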

MLOps with SageMaker

SageMaker Pipelines: Native CI/CD for ML workflows; define DAGs with steps for processing, training, evaluation, registration, and deployment; supports conditional logic and parameterization; integrates with SageMaker Experiments for tracking
Model Registry: Central repository for managing model versions; tracks model metadata, lineage, and approval status; supports approval workflows (PendingManualApproval, Approved, Rejected); version control for production models
SageMaker Projects: MLOps templates for end-to-end ML workflows; pre-built templates for model building, training, and deployment; integrates with CodeCommit, CodeBuild, and CodePipeline for source control and CI/CD
SageMaker Experiments: Track and organize training runs; log parameters, metrics, and artifacts for each trial; compare experiments to identify best-performing configurations; integrates with Pipelines for automated tracking
A/B Testing: Deploy multiple model variants on the same endpoint with configurable traffic distribution; compare performance metrics across variants; gradually shift traffic to the winning model; production variants support different instance types and counts
Blue/Green Deployment: Deploy new model version alongside existing version; validate new model before switching traffic; supports instant rollback if issues are detected; use with deployment guardrails in SageMaker
Canary Deployment: Route small percentage of traffic to new model version; monitor for errors and performance degradation; gradually increase traffic if metrics are healthy; automatic rollback if alarm triggers during bake period
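
Traffic distribution across production variants follows from their relative weights: each variant receives its weight divided by the sum of all weights. The sketch below shows that arithmetic and a simple router; the variant names and the pick_variant helper are hypothetical and do not reflect how the SageMaker service routes requests internally.

```python
def traffic_shares(variant_weights):
    """Convert variant weights into fractions of total traffic."""
    total = sum(variant_weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in variant_weights.items()}

def pick_variant(shares, u):
    """Map a uniform draw u in [0, 1) to a variant by cumulative share."""
    cumulative = 0.0
    for name, share in shares.items():
        cumulative += share
        if u < cumulative:
            return name
    return name  # guard against floating-point rounding at the top end

# Hypothetical canary-style split: 90% current model, 10% candidate.
shares = traffic_shares({"current-model": 9, "candidate-model": 1})
print(shares)
print(pick_variant(shares, 0.50), pick_variant(shares, 0.95))
```

Gradually shifting traffic then amounts to raising the candidate's weight in steps while watching the comparison metrics.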

Security for ML Workloads

IAM Roles: SageMaker execution roles define permissions for training jobs and endpoints; separate roles for notebook access, training, and deployment; follow least privilege principle; use resource-based policies for cross-account access
VPC Configuration: Run SageMaker training and hosting in your VPC; use VPC endpoints for S3 and SageMaker API access without internet; security groups control inbound/outbound traffic; private subnets for sensitive workloads
Network Isolation: Enable network isolation for training and inference containers; prevents containers from making outbound network calls; training data must be staged in S3 beforehand, and SageMaker transfers data and model artifacts on the container's behalf; strongest security posture for sensitive data
Encryption at Rest: S3 server-side encryption (SSE-S3 or SSE-KMS) for training data and model artifacts; EBS encryption for notebook instances and training volumes; KMS customer managed keys for full key control
Encryption in Transit: TLS encryption for all SageMaker API calls; optional inter-container traffic encryption for distributed training (must be enabled on the training job); HTTPS endpoints for inference; encrypted communication between SageMaker and S3
Exam Tip: Security questions on MLS-C01 focus on the combination of IAM roles, VPC configuration, network isolation, and encryption; the most secure answer typically combines all four; know that network isolation prevents the container from accessing the internet even within a VPC
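
The "combine all four" guidance can be turned into a checklist over a training job request. The dictionary below is shaped like a CreateTrainingJob request and uses field names from that API as I understand them (RoleArn, VpcConfig, EnableNetworkIsolation, EnableInterContainerTrafficEncryption, OutputDataConfig.KmsKeyId, ResourceConfig.VolumeKmsKeyId); all ARNs and IDs are placeholders, and the missing_controls function is a hypothetical helper, not part of any AWS SDK.

```python
def missing_controls(request):
    """Return the names of security controls absent from a job request."""
    vpc = request.get("VpcConfig", {})
    checks = {
        "execution role": bool(request.get("RoleArn")),
        "VPC configuration": bool(vpc.get("Subnets")) and bool(vpc.get("SecurityGroupIds")),
        "network isolation": request.get("EnableNetworkIsolation") is True,
        "encryption at rest": (
            bool(request.get("OutputDataConfig", {}).get("KmsKeyId"))
            and bool(request.get("ResourceConfig", {}).get("VolumeKmsKeyId"))
        ),
        "inter-container encryption": request.get("EnableInterContainerTrafficEncryption") is True,
    }
    return [name for name, ok in checks.items() if not ok]

# Placeholder values only; none of these identifiers refer to real resources.
hardened_request = {
    "RoleArn": "arn:aws:iam::111122223333:role/example-sagemaker-role",
    "VpcConfig": {"Subnets": ["subnet-example"], "SecurityGroupIds": ["sg-example"]},
    "EnableNetworkIsolation": True,
    "EnableInterContainerTrafficEncryption": True,
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/output",
                         "KmsKeyId": "example-kms-key-id"},
    "ResourceConfig": {"VolumeKmsKeyId": "example-kms-key-id"},
}
minimal_request = {"RoleArn": "arn:aws:iam::111122223333:role/example-sagemaker-role"}

print(missing_controls(hardened_request))
print(missing_controls(minimal_request))
```

A check like this could run in a CI step before a pipeline submits the job, so that a request lacking one of the controls is rejected early.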

Cost Optimization

Managed Spot Training: Use Spot instances for training jobs with up to 90% cost savings; configure checkpointing for fault tolerance; specify max wait time to control total job duration; ideal for long-running, fault-tolerant training jobs
SageMaker Savings Plans: Commit to consistent compute usage for 1 or 3 years; up to 64% savings on SageMaker ML instances; applies to notebook instances, training, processing, batch transform, and real-time inference
Auto-Scaling Endpoints: Configure target tracking scaling policies on real-time endpoints; scale based on the predefined SageMakerVariantInvocationsPerInstance metric (average invocations per instance per minute); set minimum and maximum instance counts; scale to match traffic patterns and avoid over-provisioning
Serverless Inference: Zero cost when idle; pay per inference request; ideal for workloads with intermittent traffic and tolerance for cold-start latency; eliminates always-on endpoint costs
Multi-Model Endpoints: Host hundreds of models on shared infrastructure; reduces per-model hosting cost significantly; models are loaded on demand from S3; ideal when individual models have low and infrequent traffic
Right-Sizing: Select the smallest instance type that meets performance requirements; use SageMaker Inference Recommender to benchmark model performance across instance types; avoid GPU instances for algorithms that do not benefit from GPU acceleration
Exam Tip: Cost optimization questions typically present a scenario with specific traffic patterns and ask for the most cost-effective deployment; match intermittent traffic to serverless, high-volume steady traffic to Savings Plans, many models to multi-model endpoints, and training to Spot instances
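
Two pieces of cost arithmetic from this section are easy to verify by hand. The first function uses the savings formula that Managed Spot Training reports, (1 - billable seconds / training seconds) x 100. The second is a simplified view of target tracking: the instance count needed to keep invocations per instance near a target, clamped to the configured minimum and maximum. The actual scaling service reacts through CloudWatch alarms and cooldowns, so treat the second function as an approximation; all input numbers are hypothetical.

```python
import math

def spot_savings_percent(training_seconds, billable_seconds):
    """Savings from Spot: share of training time that was not billed."""
    return (1 - billable_seconds / training_seconds) * 100

def instances_for_target(invocations_per_minute, target_per_instance,
                         min_instances, max_instances):
    """Instances needed to hold per-instance load near the target."""
    needed = math.ceil(invocations_per_minute / target_per_instance)
    return max(min_instances, min(max_instances, needed))

# Hypothetical job: one hour of training, 1,080 seconds billed.
print(spot_savings_percent(training_seconds=3600, billable_seconds=1080))

# Hypothetical endpoint: target of 1,000 invocations per instance per minute.
print(instances_for_target(4500, 1000, min_instances=1, max_instances=4))
print(instances_for_target(300, 1000, min_instances=2, max_instances=4))
```

The clamping shows why the minimum and maximum counts matter: the peak case is capped at 4 instances even though 5 would be needed, and the quiet case never drops below 2.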

AWS AI Services (No-Code ML)

For common ML tasks, AWS provides pre-trained AI services that require no ML expertise. Know when to recommend these instead of custom SageMaker solutions.

Amazon Rekognition: Image and video analysis including object detection, facial analysis, content moderation, and celebrity recognition; no training required; use when standard computer vision tasks meet business needs
Amazon Comprehend: Natural language processing including sentiment analysis, entity extraction, key phrase detection, language detection, and topic modeling; custom entity recognition for domain-specific needs
Amazon Translate: Neural machine translation between 75+ languages; supports real-time and batch translation; custom terminology for domain-specific vocabulary; integrates with S3 for document translation
Amazon Transcribe: Automatic speech recognition (ASR); converts audio to text; supports custom vocabularies and language models; streaming and batch transcription; medical transcription variant for healthcare
Amazon Polly: Text-to-speech with natural-sounding voices; supports SSML for pronunciation control; Neural TTS for lifelike speech; use for voice-enabled applications and content accessibility
Amazon Forecast: Time-series forecasting service; automatically selects best algorithm; no ML expertise required; use instead of DeepAR when you want a fully managed forecasting solution
Amazon Personalize: Real-time recommendation engine; supports user personalization, similar items, and personalized ranking; use instead of building custom recommendation models with Factorization Machines
Amazon Textract: Extract text, tables, and forms from documents; goes beyond simple OCR; understands document structure; use for invoice processing, form extraction, and document analysis
Exam Tip: The exam tests whether you know when to use a pre-trained AI service versus building a custom model on SageMaker; AI services are preferred when the use case matches their capabilities, reducing development time and cost significantly