MLS-C01
AWS Certified Machine Learning - Specialty
The AWS Certified Machine Learning - Specialty (MLS-C01) validates expertise in building, training, tuning, and deploying machine learning models on AWS. This specialty certification is designed for data scientists and ML engineers with at least two years of hands-on experience developing, architecting, and running ML/deep learning workloads on AWS.
The exam covers four domains: Data Engineering (20%), Exploratory Data Analysis (24%), Modeling (36%), and Machine Learning Implementation and Operations (20%). Candidates must demonstrate deep understanding of Amazon SageMaker, including training, tuning, and deploying models, as well as knowledge of ML algorithms (linear regression, logistic regression, decision trees, random forests, XGBoost, neural networks, CNNs, RNNs, and reinforcement learning).
Key skills tested include creating data repositories, implementing data ingestion and transformation solutions, performing data visualization and feature engineering, selecting appropriate ML algorithms and frameworks, training and tuning ML models, evaluating model performance using metrics like accuracy, precision, recall, F1 score, AUC-ROC, deploying models to production endpoints, and implementing A/B testing and model monitoring.
Note: This certification is being retired on March 31, 2026. AWS recommends the newer AWS Certified Machine Learning Engineer - Associate (MLA-C01) as a replacement path.
MLS-C01 Practice Exam 1
Comprehensive practice exam covering AWS Machine Learning data engineering, exploratory data analysis, modeling, and ML implementation and operations across 65 specialty-level questions.
MLS-C01 Practice Exam 2
Comprehensive practice exam covering AWS Machine Learning data engineering, exploratory data analysis, modeling, and ML implementation and operations across 65 specialty-level questions.
MLS-C01 Practice Exam 3
Comprehensive practice exam covering AWS Machine Learning data engineering, exploratory data analysis, modeling, and ML implementation and operations across 65 specialty-level questions.
MLS-C01 Practice Exam 4
Comprehensive practice exam covering AWS Machine Learning data engineering, exploratory data analysis, modeling, and ML implementation and operations across 65 specialty-level questions.
MLS-C01 Practice Exam 5
Comprehensive practice exam covering AWS Machine Learning data engineering, exploratory data analysis, modeling, and ML implementation and operations across 65 specialty-level questions.
MLS-C01 Practice Exam 6
Comprehensive practice exam covering AWS Machine Learning data engineering, exploratory data analysis, modeling, and ML implementation and operations across 65 specialty-level questions.
MLS-C01 Cheat Sheet
Quick summary - 5 sections
AWS Certified Machine Learning - Specialty (MLS-C01)
The MLS-C01 exam validates your ability to design, implement, deploy, and maintain machine learning solutions on AWS. This Specialty-level certification is intended for data scientists, ML engineers, and developers who have at least two years of hands-on experience developing, architecting, and running ML or deep learning workloads on the AWS Cloud. The exam covers the entire ML lifecycle from data preparation through model deployment and operational monitoring, with heavy emphasis on Amazon SageMaker and its ecosystem of tools. Candidates must demonstrate expertise in selecting appropriate ML algorithms, engineering features, training and tuning models, and deploying production-ready ML solutions that meet business requirements for accuracy, latency, cost, and scalability.
Exam Details
| Item | Details |
|---|---|
| Exam Code | MLS-C01 |
| Duration | 180 minutes |
| Number of Questions | 65 questions (50 scored + 15 unscored) |
| Passing Score | 750 / 1000 |
| Cost | $300 USD |
| Validity | 3 years |
| Question Types | Multiple choice (single & multiple select), scenario-based |
| Testing Options | Pearson VUE testing center or online proctored |
| Recommended Experience | 2+ years hands-on experience with ML/deep learning workloads on AWS |
| Certification Level | Specialty |
Domain Weights
| Domain | Weight |
|---|---|
| Domain 1: Data Engineering | 20% |
| Domain 2: Exploratory Data Analysis | 24% |
| Domain 3: Modeling | 36% |
| Domain 4: Machine Learning Implementation and Operations | 20% |
Study Tips
- Domain 3 (Modeling) carries the highest weight at 36%; dedicate the most study time to understanding SageMaker built-in algorithms, deep learning frameworks, hyperparameter tuning, and model evaluation metrics
- Amazon SageMaker is the centerpiece of this exam; you must master every component including training jobs, hosting endpoints, built-in algorithms, Autopilot, Ground Truth, Data Wrangler, Feature Store, Pipelines, Model Monitor, and Clarify
- Know the SageMaker built-in algorithms table cold; for each algorithm understand its type (supervised/unsupervised), input format (CSV, RecordIO, JSON), use case, key hyperparameters, and whether it supports GPU training
- Data engineering questions focus on building data pipelines; understand S3, Kinesis (Streams, Firehose, Analytics), Glue, EMR, and how data flows from ingestion to training-ready datasets
- Feature engineering is heavily tested; know techniques for handling missing data, encoding categorical variables, scaling numerical features, creating time-based features, and applying dimensionality reduction
- Understand the complete ML lifecycle end-to-end from data collection and labeling through model deployment and monitoring; the exam tests your ability to connect these stages into production pipelines
- Deep learning questions test your knowledge of CNNs for image tasks, RNNs/LSTMs for sequential data, and Transformers for NLP; know when to use each architecture and how to train them on SageMaker
- Practice with real AWS scenarios; the exam is heavily scenario-based and expects you to recommend the most appropriate ML approach while considering accuracy, latency, cost, and operational complexity
Question Strategy Tips
- Read the last sentence first to identify what is being asked, then read the full scenario with that context in mind; many questions embed the key requirement in the final line
- Look for keywords like "real-time inference", "batch transform", "streaming data", "labeled data", "unlabeled data", or "anomaly detection" which point toward specific SageMaker features and algorithms
- When two answers seem correct, choose the AWS-managed or SageMaker-native solution over custom implementations unless the question specifically requires a custom approach
- SageMaker built-in algorithms are almost always preferred over custom frameworks when the use case matches; the exam favors managed solutions that reduce operational overhead
- Pay attention to data format requirements; some algorithms require RecordIO/Protobuf for optimal performance while others accept CSV or JSON; the wrong format choice can be a distractor
- Questions about model performance issues require you to distinguish between underfitting and overfitting; know the remediation techniques for each including more data, regularization, cross-validation, and architecture changes
- Flag complex questions and return later; do not spend more than 2.5 minutes per question on the first pass
- Use the full 180 minutes; ML scenarios require careful reading to identify constraints around data volume, latency requirements, cost budgets, and accuracy thresholds
Key Differences from Other AWS Exams
- Machine Learning Specialty requires strong knowledge of ML theory including statistics, linear algebra, probability distributions, gradient descent, and evaluation metrics that are not tested in other AWS exams
- Unlike Solutions Architect which focuses on infrastructure design, MLS-C01 expects you to understand the mathematical intuition behind algorithms and know when to apply specific techniques based on data characteristics
- This exam tests data preprocessing and feature engineering at a much deeper level than any other AWS certification; you must understand normalization, standardization, one-hot encoding, TF-IDF, word embeddings, and image augmentation
- SageMaker is tested more comprehensively than any single service in other Specialty exams; you must know the full service ecosystem including notebook instances, processing jobs, training jobs, tuning jobs, endpoints, and pipelines
- The exam expects you to understand deep learning concepts including activation functions, backpropagation, learning rate schedules, batch normalization, dropout, and transfer learning
- Model evaluation and selection questions require knowledge of precision, recall, F1 score, AUC-ROC, RMSE, MAE, and confusion matrices; know which metric is appropriate for each business scenario
Recommended Preparation Path
- Step 1 - Foundation: Build a solid understanding of ML fundamentals including supervised learning, unsupervised learning, reinforcement learning, bias-variance tradeoff, and the complete ML lifecycle before diving into AWS-specific services
- Step 2 - SageMaker Deep Dive: Study every SageMaker component including built-in algorithms, training infrastructure, hosting options, Ground Truth, Autopilot, Data Wrangler, Feature Store, Pipelines, Model Monitor, and Clarify
- Step 3 - Data Services: Learn the AWS data engineering stack including S3 data lakes, Kinesis streaming, Glue ETL, EMR for big data processing, and Athena for ad-hoc querying; understand how to build end-to-end data pipelines
- Step 4 - Hands-On Labs: Build end-to-end ML projects on SageMaker including data preparation, model training with built-in algorithms and custom frameworks, hyperparameter tuning, and endpoint deployment with auto-scaling
- Step 5 - Practice Exams: Take multiple full-length practice exams under timed conditions; review every wrong answer thoroughly and understand why the correct answer is the most appropriate ML solution for the given scenario
Exam Day Checklist
- Arrive 15 minutes early for testing center or start your online proctored check-in 30 minutes before the scheduled time
- Bring two forms of valid identification (one with photo) for testing center; clear your workspace for online proctoring
- You have 180 minutes for 65 questions, which gives you approximately 2 minutes and 46 seconds per question
- Use the "Flag for Review" feature liberally on questions you are unsure about; you can return to them later
- Read every word in the scenario carefully as questions often contain critical constraints about data volume, latency, accuracy, or budget requirements buried in the middle of the text
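- Your score is reported on a scale of 100-1000 and you need 750 to pass; AWS uses scaled scoring and the 15 unscored questions do not count, so the passing score does not map to a fixed percentage of correct answers (roughly 72-75% is a common estimate, not an official figure)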
- Results are typically available within 1-5 business days through your AWS Certification account
- If you do not pass, you can retake the exam after 14 days; there is no limit on the number of attempts
- Request accommodations in advance if English is not your first language (extra 30 minutes available for non-native speakers)
Recommended AWS Whitepapers & Resources
- AWS Machine Learning Lens (Well-Architected): Comprehensive guide to building ML workloads following AWS best practices; covers data management, model development, deployment, and operations across all four exam domains
- Amazon SageMaker Developer Guide: Deep dive into all SageMaker features including built-in algorithms, training configurations, endpoint hosting, and MLOps capabilities; essential reading for Domains 1, 3, and 4
- SageMaker Built-in Algorithms Documentation: Detailed documentation for each built-in algorithm including input formats, hyperparameters, instance recommendations, and tuning guidelines; critical for Domain 3
- Big Data Analytics Options on AWS: Covers the data engineering stack including Kinesis, Glue, EMR, and data lake architectures; directly relevant to Domain 1 data pipeline questions
- Power of ML Whitepaper: Introduction to ML concepts, common use cases, and how AWS services support each stage of the ML lifecycle; good foundational reading for all domains
- Amazon SageMaker Model Monitor Documentation: Covers data drift detection, model quality monitoring, bias detection, and feature attribution drift; essential for Domain 4 operational questions
Domain 1: Data Engineering (20%)
This domain focuses on creating data repositories for machine learning, identifying and implementing data ingestion and transformation solutions, and preparing data for ML model training. You must understand how to build reliable, scalable data pipelines using AWS services that can handle both batch and streaming data. The domain tests your ability to select appropriate data storage formats, implement data labeling strategies, and ensure data quality throughout the ML lifecycle. A strong foundation in data engineering is essential because even the best ML algorithms produce poor results when fed low-quality or improperly formatted data.
Key AWS Data Services for ML
Understanding the role of each data service in the ML pipeline is critical. The following table maps services to their primary function in preparing data for machine learning.
| Service | Category | ML Pipeline Role |
|---|---|---|
| Amazon S3 | Storage | Central data lake for raw data, processed features, model artifacts, and training outputs; supports versioning and lifecycle policies |
| Amazon Kinesis Data Streams | Streaming | Real-time data ingestion with configurable shard capacity; retention from 24 hours (default) up to 365 days; consumers process records for real-time ML inference |
| Amazon Kinesis Data Firehose | Streaming ETL | Fully managed delivery stream to S3, Redshift, or OpenSearch; supports data transformation via Lambda; near-real-time delivery with buffering |
| Amazon Kinesis Data Analytics | Stream Processing | SQL or Apache Flink on streaming data; real-time anomaly detection with Random Cut Forest; sliding window aggregations for feature computation |
| AWS Glue | ETL / Catalog | Serverless ETL with PySpark; Glue Crawlers auto-discover schemas; Glue Data Catalog provides metadata repository; FindMatches ML transform for deduplication |
| Amazon EMR | Big Data | Managed Hadoop/Spark clusters for large-scale data processing; supports Spark MLlib for distributed ML; cost-effective with Spot instances |
| AWS Data Pipeline | Orchestration | Scheduled data movement and transformation workflows; supports dependencies between pipeline activities; integrates with S3, RDS, DynamoDB, and EMR |
| Amazon Athena | Query | Serverless SQL queries on S3 data; useful for ad-hoc data exploration and validation; supports Parquet, ORC, JSON, and CSV formats |
SageMaker Data Preparation Tools
SageMaker provides a comprehensive set of tools for preparing data specifically for ML model training. Understanding when to use each tool is critical for the exam.
| Tool | Purpose | Key Features |
|---|---|---|
| Ground Truth | Data Labeling | Human labeling workforce (Amazon Mechanical Turk, private, vendor); active learning reduces labeling cost by up to 70%; supports image, text, video, 3D point cloud |
| Data Wrangler | Data Preparation | Visual data preparation with 300+ built-in transforms; data quality insights; exports to SageMaker Pipeline, Processing, or Feature Store |
| Feature Store | Feature Management | Centralized feature repository; online store (low-latency reads) and offline store (S3-backed for training); feature versioning and sharing across teams |
| Processing Jobs | Data Processing | Run data preprocessing, postprocessing, and evaluation scripts; supports scikit-learn, Spark, and custom containers; fully managed infrastructure |
Data Formats for ML Training
Choosing the right data format significantly impacts training speed and cost. SageMaker algorithms have specific format preferences that you must know for the exam.
| Format | Type | Best For |
|---|---|---|
| CSV | Text | Small to medium datasets; widely supported; target variable must be in the first column with no header for SageMaker built-in algorithms |
| RecordIO/Protobuf | Binary | Optimized for SageMaker built-in algorithms; supports Pipe mode for streaming from S3; significantly faster training than CSV for large datasets |
| Parquet | Columnar | Columnar storage with compression; ideal for Spark-based processing on EMR; excellent for datasets with many columns where only a subset is needed |
| JSON / JSON Lines | Text | Flexible schema; used by some SageMaker algorithms; JSON Lines (one JSON object per line) is preferred for large datasets |
| Image Files | Binary | PNG, JPEG for image classification and object detection; use augmented manifest for metadata; RecordIO for efficient I/O in training |
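
The CSV requirement in the table above (label in the first column, no header row) can be sketched with the standard library alone. The column names and sample rows are made up for illustration, and `to_sagemaker_csv` is a hypothetical helper, not part of any AWS SDK.

```python
import csv
import io


def to_sagemaker_csv(rows, target, feature_order):
    """Serialize dict rows as header-less CSV with the target in column 0."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Label first, then features in a fixed, explicit order.
        writer.writerow([row[target]] + [row[name] for name in feature_order])
    return buffer.getvalue()


# Hypothetical sample data; "churned" is the label column.
rows = [
    {"age": 34, "plan": 2, "churned": 0},
    {"age": 51, "plan": 1, "churned": 1},
]
csv_text = to_sagemaker_csv(rows, target="churned", feature_order=["age", "plan"])
print(csv_text)
```

Keeping the feature order explicit matters because a header-less file gives the algorithm no other way to tell the columns apart.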
Batch vs Streaming Data Ingestion
- Batch Ingestion: Use AWS Glue ETL jobs or EMR Spark jobs for scheduled processing of large datasets; data lands in S3 and is processed on a schedule (hourly, daily); ideal for training data preparation where real-time freshness is not required
- Streaming Ingestion: Use Kinesis Data Streams for real-time data capture and Kinesis Data Firehose for near-real-time delivery to S3; essential for real-time feature computation and online inference pipelines
- Lambda Architecture: Combine batch and streaming layers; use Kinesis for real-time features and Glue/EMR for comprehensive batch reprocessing; SageMaker Feature Store supports both online (real-time) and offline (batch) stores
- Pipe Mode vs File Mode: SageMaker Pipe mode streams data directly from S3 during training without downloading the full dataset; reduces startup time and disk requirements; best for large datasets in RecordIO format
Data Labeling Strategies
- SageMaker Ground Truth: Managed data labeling service that supports image classification, object detection, semantic segmentation, text classification, named entity recognition, and 3D point cloud labeling; use active learning to reduce labeling costs by automatically labeling high-confidence examples
- Workforce Options: Amazon Mechanical Turk for large-scale public labeling tasks; private workforce for sensitive or domain-specific data; third-party vendors for specialized labeling requiring expert knowledge
- Active Learning: Ground Truth uses a model trained on human-labeled data to automatically label remaining examples above a confidence threshold; human annotators only review low-confidence predictions; can reduce labeling costs by 40-70%
- Annotation Consolidation: Multiple workers label the same data point; Ground Truth uses annotation consolidation algorithms to determine the correct label; increases label quality and reduces individual annotator bias
- Augmented Manifests: JSON Lines format that combines S3 data references with labels inline; avoids copying data; supported by SageMaker training jobs for direct consumption of Ground Truth output
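
As a rough illustration of annotation consolidation and augmented manifests, the sketch below picks a label by simple majority vote and writes one JSON Lines record per object. Majority voting is a deliberate simplification (Ground Truth applies its own consolidation algorithms), the bucket and label attribute name are hypothetical, and the metadata is reduced to a single illustrative confidence value rather than the full set of fields a real labeling job emits.

```python
import json
from collections import Counter


def consolidate(labels):
    """Pick the most common label and the share of workers who chose it."""
    winner, votes = Counter(labels).most_common(1)[0]
    return winner, votes / len(labels)


def manifest_line(s3_uri, label, confidence, label_attribute="label"):
    """Build one JSON Lines record pairing an S3 reference with its label."""
    record = {
        "source-ref": s3_uri,  # reference to the object; the data is not copied
        label_attribute: label,
        f"{label_attribute}-metadata": {"confidence": round(confidence, 2)},
    }
    return json.dumps(record)


# Hypothetical annotations from three workers per image.
worker_labels = {
    "s3://example-bucket/images/0001.jpg": ["cat", "cat", "dog"],
    "s3://example-bucket/images/0002.jpg": ["dog", "dog", "dog"],
}
lines = [
    manifest_line(uri, *consolidate(labels))
    for uri, labels in worker_labels.items()
]
print("\n".join(lines))
```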
Data Quality and Validation
- Data Quality Checks: Use SageMaker Data Wrangler to profile data and identify quality issues including missing values, duplicate records, outliers, and class imbalance before model training
- Schema Validation: AWS Glue Crawlers automatically detect schema changes; use Glue Data Quality rules to enforce constraints on data freshness, completeness, uniqueness, and referential integrity
- Data Versioning: Use S3 versioning for training datasets; SageMaker Experiments tracks dataset versions used for each training run; enables reproducibility of model training
- Exam Tip: Questions about data quality often require you to identify the root cause of poor model performance; always consider data issues (missing values, incorrect labels, class imbalance, data leakage) before blaming the algorithm or hyperparameters
Domain 2: Exploratory Data Analysis (24%)
This domain covers sanitizing and preparing data for modeling, feature engineering, and analyzing and visualizing data to identify patterns and inform model selection. At 24% of the exam, this is the second-highest weighted domain and requires a solid understanding of statistics, data preprocessing techniques, and visualization tools. You must know how to handle common data quality issues, transform raw features into model-ready inputs, deal with class imbalance, and use statistical methods to understand data distributions and relationships between variables.
Feature Engineering Techniques
Feature engineering transforms raw data into features that better represent the underlying problem, improving model accuracy. The following techniques are frequently tested on the exam.
| Technique | Category | Description & When to Use |
|---|---|---|
| One-Hot Encoding | Categorical | Convert categorical variables to binary columns; use for nominal categories with no ordinal relationship; can create high dimensionality with many categories |
| Label Encoding | Categorical | Assign integer values to categories; use for ordinal variables where order matters (e.g., low/medium/high); not suitable for nominal categories as it implies ordering |
| Target Encoding | Categorical | Replace category with mean of target variable; useful for high-cardinality categorical features; requires regularization to prevent overfitting |
| Min-Max Scaling | Numerical | Scale features to [0,1] range; zero entries stay zero only when the feature minimum is 0; sensitive to outliers; use when features have different units |
| Standardization (Z-score) | Numerical | Transform to mean=0, std=1; preferred for algorithms that assume centered, comparably scaled inputs; less distorted by outliers than Min-Max, although the mean and standard deviation are still affected by them; expected by PCA and many linear models |
| Log Transform | Numerical | Reduce right skewness; compress wide ranges; make multiplicative relationships additive; useful for income, price, and count data |
| Binning / Discretization | Numerical | Convert continuous variables to categorical bins; reduces noise; useful for age groups, price ranges; can lose information if bins are too coarse |
| Polynomial Features | Interaction | Create interaction terms and higher-degree features; captures non-linear relationships in linear models; increases dimensionality significantly |
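
A few of the techniques in the table can be tried out in plain Python. This is a minimal sketch with made-up data; in practice a library such as scikit-learn or SageMaker Data Wrangler would do this work, and the helper names here are illustrative only.

```python
import math
from statistics import mean, pstdev


def one_hot(values):
    """Return one binary column per category, plus the category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


def min_max(values):
    """Scale to [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


colors = ["red", "green", "red", "blue"]
incomes = [30_000, 45_000, 60_000, 1_000_000]  # one extreme value

encoded, categories = one_hot(colors)
scaled = min_max(incomes)
standardized = z_score(incomes)
log_incomes = [math.log1p(v) for v in incomes]

print(categories, encoded)
print([round(v, 3) for v in scaled])        # the outlier squeezes the rest near 0
print([round(v, 2) for v in log_incomes])   # the log transform compresses the range
```

The single large income pushes the other Min-Max values close to zero, which is the outlier sensitivity noted in the table; the log transform narrows the gap instead.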
Handling Missing Data
- Mean/Median Imputation: Replace missing numerical values with the mean (for normally distributed data) or median (for skewed distributions); simple and fast but reduces variance and ignores relationships between features
- Mode Imputation: Replace missing categorical values with the most frequent category; appropriate when the missing mechanism is random and the mode category is dominant
- Forward/Backward Fill: Use the previous or next valid value in time-series data; maintains temporal patterns; appropriate when values change infrequently over time
- KNN Imputation: Use K-nearest neighbors to impute missing values based on similar records; captures relationships between features; computationally expensive for large datasets
- Indicator Variable: Create a binary column indicating whether the value was missing; preserves the information that the value was absent; useful when missingness itself is predictive
- Dropping Records: Remove rows with missing values; only appropriate when missing data is a small percentage (less than 5%) and is missing completely at random (MCAR); can introduce bias if data is not MCAR
- Exam Tip: The exam often presents scenarios where the correct approach depends on the percentage of missing data, the missing mechanism (MCAR, MAR, MNAR), and whether the feature is critical for prediction; always consider the impact on model bias
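
Median imputation, the indicator variable, and forward fill from the list above can be sketched as follows. Missing values are represented as `None` and the sample data is invented; the helper names are illustrative.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill None with the median of observed values and flag missing entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


ages = [25, None, 40, 31, None, 90]
imputed, flags = impute_with_indicator(ages)
print(imputed)  # gaps filled with the median of 25, 31, 40, 90
print(flags)

sensor = [None, 1.0, None, None, 2.0]
print(forward_fill(sensor))
```

Keeping the `flags` column alongside the imputed feature lets a model learn from the missingness itself, which matters when data is not missing completely at random.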
Handling Outliers
- Z-Score Method: Flag data points more than 3 standard deviations from the mean; assumes normal distribution; simple to implement but not robust for non-Gaussian data
- IQR Method: Calculate Q1-1.5*IQR and Q3+1.5*IQR as boundaries; works well for skewed distributions; does not assume normality; preferred for exploratory analysis
- Winsorization: Cap extreme values at a specified percentile (e.g., 1st and 99th); preserves all records while limiting the impact of extremes; useful when outliers represent valid but rare observations
- Log Transformation: Compresses the range of values, reducing the impact of outliers; effective when the data has a right-skewed distribution; often brings the distribution closer to normal
- Random Cut Forest: SageMaker built-in algorithm for anomaly detection; assigns anomaly scores to data points; use Kinesis Data Analytics with Random Cut Forest for real-time streaming anomaly detection
- Exam Tip: Do not blindly remove outliers; first determine if they are errors (remove or correct), rare but valid observations (keep and use robust methods), or a separate population (model separately); the exam tests your judgment on when to keep versus remove outliers
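
The IQR rule and value capping can be sketched with the standard library. Note the assumption: the text describes winsorization with percentile limits, while this sketch caps at the IQR fences instead, because percentiles are not meaningful on an eight-point toy sample. The data is made up.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR) using inclusive quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_values(values, lower, upper):
    """Clamp every value into [lower, upper] instead of dropping records."""
    return [min(max(v, lower), upper) for v in values]


latencies_ms = [10, 12, 11, 13, 12, 11, 10, 300]
low, high = iqr_bounds(latencies_ms)
outliers = [v for v in latencies_ms if v < low or v > high]
capped = cap_values(latencies_ms, low, high)

print(low, high)
print(outliers)
print(capped)
```

Capping keeps all eight records, which fits the case where the extreme value is rare but valid; dropping it would fit the case where it is a measurement error.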
Handling Imbalanced Classes
- SMOTE (Synthetic Minority Over-sampling): Generate synthetic examples of the minority class by interpolating between existing minority samples; preferred over simple duplication as it adds diversity; can create noisy samples near class boundaries
- Random Oversampling: Duplicate minority class examples to match majority class size; simple but can lead to overfitting on duplicated samples; use with caution and monitor validation performance
- Random Undersampling: Remove majority class examples to match minority class size; risks losing important information; best when the majority class has clear redundancy
- Class Weights: Assign higher weight to minority class samples in the loss function; supported by several SageMaker built-in algorithms (for example, scale_pos_weight in XGBoost and positive_example_weight_mult in Linear Learner); no data modification needed; penalizes misclassification of the minority class more heavily
- Threshold Adjustment: After training, adjust the classification threshold to favor the minority class; useful when the cost of false negatives is much higher than false positives (e.g., fraud detection)
- Evaluation Metrics: Use precision, recall, F1 score, or AUC-ROC instead of accuracy for imbalanced datasets; accuracy can be misleading when one class dominates (99% accuracy by predicting majority class)
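
The accuracy trap described above is easy to reproduce. The sketch below scores a classifier that always predicts the majority class on a 99:1 dataset, then computes inverse-frequency class weights. The weighting formula (`n_samples / (n_classes * class_count)`) is one common heuristic, chosen here as an assumption rather than something the exam prescribes.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


y_true = [0] * 99 + [1]   # 1% positive class (e.g. fraud)
y_pred = [0] * 100        # model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
print(weights)
```

The model looks excellent by accuracy yet never finds a positive example, which is why recall, F1, or AUC-ROC is the better choice here.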
Dimensionality Reduction
- PCA (Principal Component Analysis): SageMaker built-in algorithm; reduces dimensions by finding orthogonal components that explain maximum variance; requires standardized features; regular mode for sparse data with a moderate number of observations and features, randomized mode (an approximation algorithm) for datasets with a large number of observations and features
- t-SNE: Non-linear dimensionality reduction for visualization; maps high-dimensional data to 2D/3D; reveals clusters and patterns not visible in PCA; computationally expensive; not suitable for feature reduction in training pipelines
- Feature Selection: Remove irrelevant or redundant features using correlation analysis, mutual information, or recursive feature elimination; simpler than transformation-based methods; maintains interpretability of remaining features
- Exam Tip: PCA is the go-to answer for dimensionality reduction on the MLS-C01 exam; know that it requires standardized input, the number of components to keep is based on explained variance ratio (typically 95%), and it is used both as preprocessing and as a SageMaker built-in algorithm
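
Choosing the number of components from the explained variance ratio comes down to a cumulative sum. The sketch below assumes the eigenvalues of the covariance matrix are already available (the values are invented) and uses the 95% threshold mentioned above.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Return the smallest k whose top-k eigenvalues explain >= threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value / total  # explained variance ratio of component k
        if cumulative >= threshold:
            return k, cumulative
    return len(eigenvalues), cumulative


# Hypothetical eigenvalues for six standardized features.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]
k, explained = components_for_variance(eigenvalues)
print(k, round(explained, 2))
```

With these values, four components explain 90% of the variance and the fifth is needed to cross 95%, so only one dimension can be dropped.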
Text Data Preprocessing
- Tokenization: Split text into individual words or subwords; sentence tokenization for document-level tasks; word tokenization for feature extraction; subword tokenization (BPE) for neural models handling out-of-vocabulary words
- Stop Word Removal: Remove common words (the, is, at) that add noise; reduces feature space; be cautious in sentiment analysis where words like "not" are critical despite being common
- Stemming / Lemmatization: Reduce words to root form; stemming (faster, rule-based, e.g., running to run) vs lemmatization (accurate, dictionary-based, e.g., better to good); reduces vocabulary size
- TF-IDF: Term Frequency-Inverse Document Frequency; weighs word importance by frequency in document vs rarity across corpus; better than raw counts for text classification; produces sparse feature vectors
- Word Embeddings: Dense vector representations (Word2Vec, GloVe, FastText); capture semantic relationships; SageMaker BlazingText implements Word2Vec; pre-trained embeddings available for transfer learning
- N-grams: Capture word sequences (bigrams, trigrams); preserve local context; combined with TF-IDF for text classification; increase feature dimensionality significantly
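
A bare-bones TF-IDF calculation shows why common words score low. This sketch uses whitespace tokenization, term frequency as count divided by document length, and an unsmoothed `log(N / df)` for the inverse document frequency; these are assumptions, and libraries typically apply smoothed variants. The three documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "the model is accurate",
    "the model is slow",
    "the endpoint is fast",
]
scores = tf_idf(docs)
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term:10s} {score:.4f}")
```

Words that appear in every document ("the", "is") get a score of zero, "model" scores low because two documents share it, and "accurate" scores highest because it is unique to the first document.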
Image Data Preprocessing
- Resizing: Standardize image dimensions to match model input requirements; most CNNs expect fixed-size inputs (e.g., 224x224 for ResNet); use bicubic or bilinear interpolation
- Normalization: Scale pixel values from [0,255] to [0,1] or standardize with ImageNet mean/std; required for convergence in deep learning; must apply same normalization at inference time
- Data Augmentation: Generate additional training samples through random flips, rotations, crops, color jitter, and scaling; reduces overfitting; critical when training data is limited; apply only to training set, never to validation or test
- RecordIO Conversion: Convert images to RecordIO format for efficient I/O during SageMaker training; reduces S3 read overhead; essential for large image datasets with Pipe mode training
- Exam Tip: Image augmentation is the exam-favorite answer for improving image classification accuracy when training data is limited; know the specific augmentation techniques and that they should only be applied to training data
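
The rule "same normalization everywhere, augmentation only in training" can be sketched without any imaging library. Images are represented as nested lists of 8-bit RGB tuples, the mean and standard deviation are the widely used ImageNet channel statistics, and the only augmentation shown is a horizontal flip; `preprocess` and its parameters are illustrative names.

```python
import random

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale one 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def horizontal_flip(image):
    """Mirror an image stored as rows of pixels."""
    return [list(reversed(row)) for row in image]


def preprocess(image, training, flip_probability=0.5, rng=random):
    """Augment only when training; normalize identically in both modes."""
    if training and rng.random() < flip_probability:
        image = horizontal_flip(image)
    return [[normalize_pixel(pixel) for pixel in row] for row in image]


# A 1x2 image: one black pixel followed by one white pixel.
image = [[(0, 0, 0), (255, 255, 255)]]
eval_out = preprocess(image, training=False)
train_out = preprocess(image, training=True, flip_probability=1.0)

print(eval_out[0][0])   # black pixel comes first at evaluation time
print(train_out[0][0])  # white pixel comes first after the flip
```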
Statistical Analysis and Visualization
- Distribution Analysis: Use histograms and density plots to understand feature distributions; identify skewness, multimodality, and the need for transformations; some algorithms and statistical tests assume approximately normal inputs or residuals
- Correlation Analysis: Pearson correlation for linear relationships; Spearman for monotonic relationships; identify multicollinearity between features; remove highly correlated features to reduce dimensionality
- Scatter Plots / Pair Plots: Visualize relationships between pairs of features; identify non-linear patterns, clusters, and outliers; essential for understanding feature interactions
- Box Plots: Show distribution, median, quartiles, and outliers; useful for comparing distributions across categories; quickly identify skewness and outlier presence
- Amazon QuickSight: AWS managed BI service for interactive dashboards; ML-powered insights with anomaly detection and forecasting; integrates with S3, Athena, Redshift, and RDS data sources
- SageMaker Notebooks: Jupyter notebooks with matplotlib, seaborn, and pandas for custom visualization; preferred for detailed EDA workflows; can connect to Athena, S3, and other AWS data sources
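
The difference between Pearson and Spearman correlation shows up clearly on a monotonic but non-linear relationship. This is a minimal sketch with hand-written helpers; the ranking function assumes no tied values, and the data is synthetic.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Assign ranks 1..n by ascending value; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but not linear

print(round(pearson(x, y), 3))
print(round(spearman(x, y), 3))
```

Spearman reports a perfect monotonic relationship while Pearson comes out lower, because the cubic curve is not a straight line.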
Domain 3: Modeling (36%)
This is the highest-weighted domain at 36% of the exam, covering the selection, training, tuning, and evaluation of ML models. You must have deep knowledge of SageMaker built-in algorithms and when to use each one, understand deep learning architectures, know how to configure hyperparameter tuning jobs, apply regularization techniques, and evaluate model performance using appropriate metrics. This domain also covers training infrastructure choices including instance types, distributed training, and managed training features like Spot training and checkpointing.
SageMaker Built-in Algorithms
SageMaker provides optimized implementations of common ML algorithms. Knowing each algorithm, its type, input format, and ideal use case is critical for the exam.
| Algorithm | Type | Use Case |
|---|---|---|
| Linear Learner | Supervised | Binary/multi-class classification and regression; supports L1/L2 regularization; handles high-dimensional sparse data efficiently |
| XGBoost | Supervised | Classification and regression with gradient boosted trees; best for structured/tabular data; highly configurable; supports CSV and LibSVM input |
| K-Nearest Neighbors (KNN) | Supervised | Classification and regression based on proximity; index-based for fast inference; sampling and dimensionality reduction for large datasets |
| Factorization Machines | Supervised | Recommendation systems and click-through prediction; captures pairwise feature interactions; excels with high-dimensional sparse data |
| Image Classification | Supervised (DL) | CNN-based image classification; supports transfer learning with pre-trained ResNet; fine-tuning and full training modes; RecordIO or augmented manifest input |
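| Object Detection | Supervised (DL) | Detect and localize objects in images; the built-in MXNet algorithm uses the Single Shot MultiBox Detector (SSD) framework with a VGG or ResNet base network; transfer learning from pre-trained models; outputs bounding boxes with class labels and confidence scores |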
| Semantic Segmentation | Supervised (DL) | Pixel-level classification of images; FCN, PSP, DeepLab architectures; used for autonomous driving, medical imaging, satellite analysis |
| BlazingText | Supervised / Unsupervised | Text classification (supervised) and Word2Vec embeddings (unsupervised); extremely fast training; single or multi-core CPU and GPU modes |
| Sequence to Sequence | Supervised (DL) | Machine translation, text summarization, speech-to-text; encoder-decoder architecture with attention mechanism; requires tokenized input |
| DeepAR | Supervised (DL) | Time-series forecasting using autoregressive RNN; trains on multiple related time series simultaneously; produces probabilistic forecasts with quantiles |
| K-Means | Unsupervised | Clustering data into K groups; web-scale implementation; uses mini-batch SGD; choose K using elbow method or silhouette score |
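| PCA | Unsupervised | Dimensionality reduction; regular mode for sparse data with a moderate number of observations and features, randomized mode for datasets with a large number of observations and features (uses an approximation algorithm); useful as preprocessing before other algorithms |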
| Random Cut Forest | Unsupervised | Anomaly detection; assigns anomaly scores; integrated with Kinesis Analytics for real-time streaming anomaly detection; no labeled data required |
| LDA (Latent Dirichlet Allocation) | Unsupervised | Topic modeling; discovers hidden topics in document collections; each document is a mixture of topics; useful for content recommendation and organization |
| NTM (Neural Topic Model) | Unsupervised | Neural network-based topic modeling; often produces more coherent topics than LDA; supports GPU training for large document corpora |
| IP Insights | Unsupervised | Detect anomalous IP address usage patterns; identifies compromised credentials; learns associations between entities and IP addresses |
Deep Learning Architectures
- CNN (Convolutional Neural Network): Primary architecture for image tasks; convolutional layers extract spatial features through learnable filters; pooling layers reduce spatial dimensions; fully connected layers for final classification; key architectures include ResNet, VGG, Inception; use transfer learning with pre-trained models for small image datasets
- RNN (Recurrent Neural Network): Process sequential data with memory of previous inputs; suffer from vanishing gradient problem for long sequences; suitable for short text classification and simple time-series; largely replaced by LSTM/GRU for longer sequences
- LSTM (Long Short-Term Memory): Specialized RNN with gating mechanisms (forget, input, output gates) that mitigate the vanishing gradient problem; excellent for long sequences including time-series forecasting, speech recognition, and language modeling; bidirectional LSTM reads sequences in both directions for better context
- GRU (Gated Recurrent Unit): Simplified version of LSTM with fewer parameters (reset and update gates); faster training than LSTM; comparable performance for many tasks; preferred when training time and computational resources are limited
- Transformers: Attention-based architecture that processes all positions simultaneously; self-attention captures long-range dependencies without recurrence; foundation of BERT, GPT, and modern NLP; superior parallelization compared to RNNs
- Autoencoders: Encoder-decoder architecture for unsupervised feature learning; compresses input to low-dimensional representation and reconstructs; variational autoencoders (VAE) for generative tasks; denoising autoencoders for robust feature extraction
- GANs (Generative Adversarial Networks): Generator creates synthetic data while discriminator evaluates authenticity; used for data augmentation, image generation, and style transfer; training can be unstable (mode collapse); not heavily tested but know the concept
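
The claim that a GRU has fewer parameters than an LSTM follows directly from counting gate blocks. The sketch below is a back-of-the-envelope count under one stated assumption: each block has input weights, recurrent weights, and a single bias vector. Some frameworks keep two bias vectors per block, so their reported totals differ slightly. The function names and layer sizes are hypothetical.

```python
def recurrent_layer_params(input_size, hidden_size, num_blocks):
    """Parameters for a recurrent layer built from identical gate blocks.

    Assumes each block has input weights (hidden x input), recurrent
    weights (hidden x hidden), and one bias vector (hidden).
    """
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return num_blocks * per_block


def lstm_params(input_size, hidden_size):
    # Forget, input, and output gates plus the candidate cell state: 4 blocks.
    return recurrent_layer_params(input_size, hidden_size, 4)


def gru_params(input_size, hidden_size):
    # Reset and update gates plus the candidate hidden state: 3 blocks.
    return recurrent_layer_params(input_size, hidden_size, 3)


lstm_count = lstm_params(input_size=128, hidden_size=256)  # 394240
gru_count = gru_params(input_size=128, hidden_size=256)    # 295680
```

Under this assumption a GRU layer always has three quarters of the parameters of an LSTM layer of the same size, which is where the faster training comes from.
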
Hyperparameter Tuning
- SageMaker Automatic Model Tuning: Managed hyperparameter optimization service; define objective metric, parameter ranges, and max training jobs; creates tuning job that launches multiple training jobs with different hyperparameter combinations
- Bayesian Optimization: Default strategy for SageMaker tuning; builds probabilistic model of objective function; balances exploration and exploitation; more efficient than random or grid search for expensive training jobs
- Random Search: Samples hyperparameters randomly from defined ranges; good baseline; more efficient than grid search in high dimensions; can run in parallel without dependencies between trials
- Hyperband: Early stopping strategy that allocates resources to promising configurations; quickly discards poor performers; efficient for large hyperparameter spaces; supported by SageMaker tuning
- Key Hyperparameters: Learning rate (most impactful), batch size, number of epochs, regularization strength (L1/L2), number of layers/neurons, dropout rate, momentum; know which hyperparameters to tune first for each algorithm
- Warm Start: Initialize new tuning job from results of previous tuning jobs; reduces total training time; use when refining hyperparameters after initial broad search
- Exam Tip: The exam frequently asks about choosing between Bayesian optimization and random search; Bayesian is preferred for expensive training jobs with few parameters, while random search is better for highly parallel exploration of large parameter spaces
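
Random search is simple enough to sketch in a few lines. The example below is an illustration only, not SageMaker's tuning implementation: `toy_validation_loss` is a hypothetical stand-in for a real training job, and the parameter ranges are assumptions. It also shows why learning rates are usually sampled on a logarithmic scale, so that each order of magnitude is explored equally.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Sample so that every order of magnitude in [low, high] is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_validation_loss(learning_rate, dropout):
    """Hypothetical objective standing in for a training job's validation loss."""
    return (math.log10(learning_rate) + 2) ** 2 + (dropout - 0.3) ** 2


def random_search(num_trials, seed=0):
    """Run independent trials; each trial could be launched in parallel."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        params = {
            "learning_rate": sample_log_uniform(rng, 1e-5, 1e-1),
            "dropout": rng.uniform(0.0, 0.5),
        }
        trials.append((toy_validation_loss(**params), params))
    best = min(trials, key=lambda trial: trial[0])
    return best, trials


(best_loss, best_params), all_trials = random_search(num_trials=50)
```

Because no trial depends on another, all 50 could run at once. Bayesian optimization gives up some of that parallelism in exchange for using earlier results to choose later trials.
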
Regularization Techniques
- L1 Regularization (Lasso): Adds absolute value of weights to loss function; produces sparse models by driving some weights to exactly zero; useful for feature selection; preferred when you expect many irrelevant features
- L2 Regularization (Ridge): Adds squared weights to loss function; shrinks all weights toward zero without eliminating features; prevents any single feature from dominating; default choice for most algorithms
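- Elastic Net: Combines L1 and L2 regularization; balances feature selection with weight shrinkage; useful when features are correlated; a mixing parameter controls the L1/L2 ratio (called alpha in some libraries and l1_ratio in scikit-learn, where alpha is the overall penalty strength)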
- Dropout: Randomly deactivate neurons during training; forces network to learn redundant representations; typical rates of 0.2-0.5; apply to hidden layers; do not apply during inference
- Early Stopping: Monitor validation loss and stop training when it starts increasing; prevents overfitting; SageMaker supports early stopping in hyperparameter tuning jobs
- Batch Normalization: Normalize layer inputs to reduce internal covariate shift; allows higher learning rates; acts as mild regularization; applied between layers in deep networks
- Data Augmentation: Generate additional training samples through transformations; effective regularizer for image and text tasks; reduces overfitting by increasing effective training set size
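
Why L1 produces exact zeros while L2 only shrinks weights can be seen from the per-weight update each penalty induces. The sketch below uses the proximal form of each penalty (soft-thresholding for L1, multiplicative shrinkage for L2); the function names, weights, and penalty strength are assumptions chosen for illustration.

```python
def l1_proximal_step(weight, strength):
    """Soft-thresholding: the proximal update for the penalty strength * |w|."""
    if weight > strength:
        return weight - strength
    if weight < -strength:
        return weight + strength
    return 0.0  # weights inside [-strength, strength] are set exactly to zero


def l2_proximal_step(weight, strength):
    """Proximal update for the penalty (strength / 2) * w**2: scale toward zero."""
    return weight / (1.0 + strength)


weights = [2.0, -0.5, 0.05, -0.01]
l1_result = [l1_proximal_step(w, 0.1) for w in weights]
l2_result = [l2_proximal_step(w, 0.1) for w in weights]
```

The two small weights become exactly zero under L1, which is the feature-selection effect, while L2 leaves every weight non-zero but smaller.
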
Model Evaluation Metrics
| Metric | Task Type | When to Use |
|---|---|---|
| Accuracy | Classification | Balanced classes only; proportion of correct predictions; misleading for imbalanced datasets |
| Precision | Classification | When false positives are costly (spam filtering, content moderation); TP / (TP + FP) |
| Recall (Sensitivity) | Classification | When false negatives are costly (fraud detection, medical diagnosis); TP / (TP + FN) |
| F1 Score | Classification | Harmonic mean of precision and recall; when you need balance between FP and FN costs; good for imbalanced datasets |
| AUC-ROC | Classification | Area under ROC curve; threshold-independent; the curve plots true positive rate against false positive rate across all thresholds; 0.5 = random, 1.0 = perfect |
| RMSE | Regression | Root Mean Squared Error; penalizes large errors more; same units as target; sensitive to outliers |
| MAE | Regression | Mean Absolute Error; robust to outliers; equal weight to all errors; use when outliers should not dominate |
| R-squared | Regression | Proportion of variance explained; at most 1 (higher is better) and usually between 0 and 1; negative when the model fits worse than always predicting the mean; easy to interpret |
| Silhouette Score | Clustering | Measures cluster cohesion and separation; -1 to 1 (higher is better); use to determine optimal K in K-Means |
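
The formulas in the table can be checked with a small worked example. The sketch below computes the classification metrics from confusion-matrix counts and compares RMSE with MAE on data containing one outlier. The counts and values are made up for illustration, and the functions assume non-zero denominators.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Metrics from confusion-matrix counts (assumes non-zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Imbalanced example: 20 positives out of 1,000 records.
fraud = classification_metrics(tp=8, fp=2, fn=12, tn=978)

# Regression example where one prediction is far off.
y_true = [10, 10, 10, 10]
y_pred = [11, 9, 10, 20]
regression = {"rmse": rmse(y_true, y_pred), "mae": mae(y_true, y_pred)}
```

Accuracy here is 0.986 even though recall is only 0.4, which is why accuracy is misleading on imbalanced data. In the regression example the single outlier pushes RMSE well above the MAE of 3.0.
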
Training Infrastructure on SageMaker
- Instance Selection: ml.m5 for general purpose CPU training; ml.c5 for compute-intensive tabular data (XGBoost, Linear Learner); ml.p3/ml.p4 for GPU deep learning; ml.g4dn for cost-effective GPU inference; match instance type to algorithm requirements
- Distributed Training: Data parallelism splits mini-batches across multiple GPUs; model parallelism splits model layers across GPUs when model is too large for single GPU; SageMaker distributed training library simplifies both approaches
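- Managed Spot Training: Use EC2 Spot instances for up to 90% cost reduction; SageMaker restarts interrupted jobs when capacity returns, and training resumes from the last checkpoint if checkpointing to S3 is configured; specify max runtime and a max wait time that is at least as long; ideal for fault-tolerant training jobs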
- Checkpointing: Save model state periodically during training; enables resume from interruption (critical for Spot training); also useful for monitoring training progress and selecting best intermediate model
- Custom Containers: Bring your own Docker container with custom frameworks; must implement SageMaker training toolkit interface; or extend pre-built SageMaker containers with additional dependencies
- Frameworks: SageMaker provides pre-built containers for TensorFlow, PyTorch, MXNet, Hugging Face, scikit-learn, and XGBoost; script mode allows custom training scripts with managed infrastructure
- Exam Tip: Know that SageMaker built-in algorithms are already optimized and do not require custom containers; use pre-built framework containers for custom models; bring-your-own-container only when you need libraries not available in pre-built images
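
Checkpointing only helps with Spot interruptions if the training script looks for an existing checkpoint when it starts. The sketch below shows that resume pattern with a toy training loop. The file name `state.json`, the `stop_after` argument that simulates an interruption, and the fake weight update are all assumptions for illustration. A temporary directory stands in for the container's local checkpoint directory (by default `/opt/ml/checkpoints`), which SageMaker syncs with S3.

```python
import json
import os
import tempfile


def load_checkpoint(checkpoint_dir):
    """Return the saved state if a checkpoint exists, else a fresh state."""
    path = os.path.join(checkpoint_dir, "state.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weight": 0.0}


def save_checkpoint(checkpoint_dir, state):
    """Write to a temp file first so an interruption never leaves a partial file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    tmp_path = os.path.join(checkpoint_dir, "state.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(checkpoint_dir, "state.json"))


def train(checkpoint_dir, total_epochs, stop_after=None):
    """Toy loop that resumes from the last completed epoch."""
    state = load_checkpoint(checkpoint_dir)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state  # simulate a Spot interruption
        state = {"epoch": epoch + 1, "weight": state["weight"] + 0.1}
        save_checkpoint(checkpoint_dir, state)
    return state


with tempfile.TemporaryDirectory() as checkpoint_dir:
    interrupted = train(checkpoint_dir, total_epochs=5, stop_after=2)
    resumed = train(checkpoint_dir, total_epochs=5)
```

The second call picks up at epoch 2 rather than epoch 0, so only the remaining three epochs are repeated after the simulated interruption.
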
Bias-Variance Tradeoff
- Underfitting (High Bias): Model is too simple to capture patterns; high training error and high validation error; remedies include more features, more complex model, less regularization, longer training, polynomial features
- Overfitting (High Variance): Model memorizes training data including noise; low training error but high validation error; remedies include more training data, regularization (L1/L2/dropout), simpler model, early stopping, cross-validation
- Cross-Validation: K-fold cross-validation splits data into K folds and trains K models; provides robust performance estimate; reduces variance of evaluation metrics; standard K=5 or K=10
- Learning Curves: Plot training and validation error vs training set size; converging curves indicate good fit; large gap indicates overfitting; both high indicates underfitting; essential diagnostic tool
- Exam Tip: The exam frequently presents scenarios with performance issues and asks you to identify underfitting vs overfitting; always look at the gap between training and validation metrics to diagnose the problem before selecting a solution
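
K-fold cross-validation comes down to rotating which slice of the data is held out. The sketch below generates the index splits only; `k_fold_indices` is a hypothetical helper, and it assumes the data has already been shuffled (or is safe to split in order).

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Fold sizes differ by at most one when num_samples is not divisible by k.
    """
    indices = list(range(num_samples))
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(num_samples=10, k=5))
```

Each sample appears in exactly one validation fold, so averaging the K validation scores uses every record for evaluation once without ever evaluating a model on data it was trained on.
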
Domain 4: Machine Learning Implementation and Operations (20%)
This domain covers building, deploying, and maintaining ML solutions in production. You must understand the various SageMaker deployment options including real-time endpoints, batch transform, serverless inference, and asynchronous inference. The domain also tests your knowledge of MLOps practices including model monitoring, CI/CD pipelines, A/B testing, model registry, and operational security. Understanding how to monitor model quality, detect data drift, ensure fairness, and implement cost-effective auto-scaling strategies is essential for passing this section of the exam.
SageMaker Deployment Options
Choosing the right deployment option is critical and depends on latency requirements, traffic patterns, and cost constraints. The exam frequently tests your ability to match deployment types to use cases.
| Deployment Type | Latency | Use Case & Characteristics |
|---|---|---|
| Real-time Endpoints | Milliseconds | Persistent endpoint for synchronous inference; always running; auto-scaling based on traffic; supports A/B testing with production variants; use for interactive applications requiring sub-second responses |
| Batch Transform | Minutes-Hours | Process entire datasets in S3; no persistent endpoint; pay only for processing time; ideal for overnight scoring, periodic reports, and large-scale batch predictions |
| Serverless Inference | Seconds (cold start) | Auto-scales to zero when idle; no minimum cost; cold start latency on first request; ideal for intermittent traffic patterns with unpredictable spikes; specify memory size and max concurrency |
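| Asynchronous Inference | Seconds-Minutes | Requests are placed in an internal queue managed by SageMaker and processed asynchronously; request and response payloads are stored in S3, with optional SNS notifications on completion; can scale to zero when the queue is empty; ideal for large payloads (up to 1 GB), long processing times, and cost-sensitive workloads with tolerance for delayed responses |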
| Multi-Model Endpoints | Milliseconds | Host multiple models on single endpoint; dynamically loads/unloads models from S3; reduces hosting cost for many models; ideal for multi-tenant SaaS applications with per-customer models |
| Multi-Container Endpoints | Milliseconds | Run multiple containers on single endpoint; serial inference pipeline or direct invocation; use for preprocessing, inference, and postprocessing in separate containers |
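
One way to rehearse the table is to turn it into a decision function. The sketch below is a study-aid heuristic, not an AWS rule: the argument names and the order in which the conditions are checked are assumptions, and real scenarios may combine several requirements.

```python
def choose_deployment(
    offline_scoring=False,
    large_payload_or_long_running=False,
    many_models=False,
    intermittent_traffic=False,
):
    """Map scenario traits to the deployment type the table suggests."""
    if offline_scoring:
        # Whole dataset already in S3, no need for a persistent endpoint.
        return "Batch Transform"
    if large_payload_or_long_running:
        # Requests are queued and callers can tolerate delayed responses.
        return "Asynchronous Inference"
    if many_models:
        # Many low-traffic models sharing one endpoint.
        return "Multi-Model Endpoints"
    if intermittent_traffic:
        # Idle periods make scale-to-zero worth the cold-start latency.
        return "Serverless Inference"
    # Steady, interactive traffic needing sub-second responses.
    return "Real-time Endpoints"


nightly_scoring = choose_deployment(offline_scoring=True)
per_tenant_models = choose_deployment(many_models=True)
interactive_app = choose_deployment()
```
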
Model Monitoring and Drift Detection
- SageMaker Model Monitor: Automatically monitors deployed models for data quality, model quality, bias, and feature attribution drift; creates baseline from training data; generates violation reports when drift is detected; integrates with CloudWatch for alerting
- Data Quality Monitor: Detects statistical drift in input features; compares live inference data against training baseline; tracks metrics like mean, std, missing values, and distribution shape; alerts when features deviate beyond configurable thresholds
- Model Quality Monitor: Tracks prediction accuracy over time by comparing predictions to ground truth labels; monitors metrics like accuracy, precision, recall, RMSE; requires ground truth labels to be uploaded to S3 and merged with predictions
- Bias Drift Monitor: Uses SageMaker Clarify to detect changes in bias metrics over time; monitors both pre-training data bias and post-training prediction bias; essential for regulatory compliance and fairness requirements
- Feature Attribution Drift: Uses SHAP values to monitor changes in feature importance; detects when the model begins relying on different features than during training; indicates potential concept drift requiring model retraining
- Concept Drift: The relationship between input features and the target variable changes over time; Model Monitor detects symptoms but retraining is the remedy; implement automated retraining pipelines triggered by drift alerts
- CloudWatch Integration: Model Monitor publishes metrics to CloudWatch; create alarms for threshold violations; trigger Lambda functions or Step Functions for automated remediation or retraining workflows
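
The idea behind a data quality baseline can be sketched without any AWS calls. The example below is a deliberately simplified stand-in, not how Model Monitor computes its constraints: it records the mean and standard deviation of one feature from training data, then flags a violation when the live mean moves more than a chosen number of standard deviations. The function names, the threshold, and the sample values are all assumptions.

```python
import statistics


def build_baseline(training_values):
    """Summarize one feature from the training data."""
    return {
        "mean": statistics.mean(training_values),
        "std": statistics.stdev(training_values),
    }


def mean_drift_violation(baseline, live_values, max_std_shift=3.0):
    """Return (violated, shift) where shift is measured in baseline std units."""
    shift = abs(statistics.mean(live_values) - baseline["mean"]) / baseline["std"]
    return shift > max_std_shift, shift


baseline = build_baseline([10, 12, 11, 9, 10, 11, 10, 12])

stable_violation, stable_shift = mean_drift_violation(baseline, [10, 11, 12, 10])
drift_violation, drift_shift = mean_drift_violation(baseline, [18, 19, 20, 21])
```

In a production setup the violation flag would correspond to a CloudWatch alarm that starts a retraining workflow, as described in the list above.
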
SageMaker Clarify
- Pre-training Bias Detection: Analyze training data for statistical biases before model training; detects class imbalance, label imbalance, and feature-level biases across sensitive attributes (race, gender, age); generates bias reports with specific metrics
- Post-training Bias Detection: Analyze model predictions for disparate impact across groups; measures metrics like Demographic Parity Difference, Equal Opportunity Difference, and Treatment Equality; compares predicted outcomes across sensitive groups
- Feature Attribution (Explainability): Uses SHAP (SHapley Additive exPlanations) values to explain individual predictions; identifies which features contributed most to each prediction; supports both global and local explanations
- Exam Tip: Clarify is the go-to answer for any exam question about model explainability, bias detection, or fairness; know that it integrates with SageMaker Pipelines, Model Monitor, and can be run as standalone processing jobs
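
Two of the simpler bias metrics can be computed by hand to make the definitions concrete. The sketch below follows the usual definitions of Class Imbalance (a pre-training metric) and Difference in Positive Proportions in Predicted Labels (a post-training metric), where facet "a" is the advantaged group and every other record belongs to the disadvantaged group. The function names and toy data are assumptions, and this is not the Clarify implementation.

```python
def positive_rate(labels):
    """Fraction of labels equal to 1."""
    return sum(labels) / len(labels)


def class_imbalance(facets, advantaged="a"):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means both facets are equally represented."""
    n_a = sum(1 for f in facets if f == advantaged)
    n_d = len(facets) - n_a
    return (n_a - n_d) / (n_a + n_d)


def dppl(facets, predictions, advantaged="a"):
    """Difference in predicted positive rates between the two facets."""
    preds_a = [p for f, p in zip(facets, predictions) if f == advantaged]
    preds_d = [p for f, p in zip(facets, predictions) if f != advantaged]
    return positive_rate(preds_a) - positive_rate(preds_d)


facets = ["a"] * 6 + ["d"] * 4
predictions = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

ci = class_imbalance(facets)    # (6 - 4) / 10 = 0.2
gap = dppl(facets, predictions)  # 4/6 - 1/4
```

A positive gap means the advantaged facet receives favorable predictions more often; values near zero on both metrics indicate balanced data and balanced outcomes.
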
MLOps with SageMaker
- SageMaker Pipelines: Native CI/CD for ML workflows; define DAGs with steps for processing, training, evaluation, registration, and deployment; supports conditional logic and parameterization; integrates with SageMaker Experiments for tracking
- Model Registry: Central repository for managing model versions; tracks model metadata, lineage, and approval status; supports approval workflows (PendingManualApproval, Approved, Rejected); version control for production models
- SageMaker Projects: MLOps templates for end-to-end ML workflows; pre-built templates for model building, training, and deployment; integrates with CodeCommit, CodeBuild, and CodePipeline for source control and CI/CD
- SageMaker Experiments: Track and organize training runs; log parameters, metrics, and artifacts for each trial; compare experiments to identify best-performing configurations; integrates with Pipelines for automated tracking
- A/B Testing: Deploy multiple model variants on the same endpoint with configurable traffic distribution; compare performance metrics across variants; gradually shift traffic to the winning model; production variants support different instance types and counts
- Blue/Green Deployment: Deploy new model version alongside existing version; validate new model before switching traffic; supports instant rollback if issues are detected; use with deployment guardrails in SageMaker
- Canary Deployment: Route small percentage of traffic to new model version; monitor for errors and performance degradation; gradually increase traffic if metrics are healthy; automatic rollback if alarm triggers during bake period
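
Traffic splitting between production variants is driven by relative weights: each variant receives its weight divided by the sum of all weights. The sketch below shows that arithmetic for an A/B test and for a canary-style ramp; the variant names and the ramp schedule are hypothetical.

```python
def traffic_shares(variant_weights):
    """Convert variant weights into the fraction of traffic each receives."""
    total = sum(variant_weights.values())
    return {name: weight / total for name, weight in variant_weights.items()}


# A/B test: the challenger receives 10% of requests.
ab_split = traffic_shares({"champion": 9, "challenger": 1})

# Canary-style ramp: raise the new variant's weight while metrics stay healthy.
ramp = [
    traffic_shares({"current": 100 - pct, "new": pct})["new"]
    for pct in (5, 25, 50, 100)
]
```

Because shares depend only on the ratio of weights, weights of 9 and 1 behave the same as 0.9 and 0.1.
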
Security for ML Workloads
- IAM Roles: SageMaker execution roles define permissions for training jobs and endpoints; separate roles for notebook access, training, and deployment; follow least privilege principle; use resource-based policies for cross-account access
- VPC Configuration: Run SageMaker training and hosting in your VPC; use VPC endpoints for S3 and SageMaker API access without internet; security groups control inbound/outbound traffic; private subnets for sensitive workloads
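- Network Isolation: Enable network isolation for training and inference containers; prevents containers from making any outbound network calls, including calls to S3 and other AWS services; SageMaker itself copies input data from S3 into the container and uploads model artifacts afterward, so all inputs must already be staged in S3; strongest security posture for sensitive data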
- Encryption at Rest: S3 server-side encryption (SSE-S3 or SSE-KMS) for training data and model artifacts; EBS encryption for notebook instances and training volumes; KMS customer managed keys for full key control
- Encryption in Transit: TLS encryption for all SageMaker API calls; optional inter-container traffic encryption for distributed training (off by default and can increase training time); HTTPS endpoints for inference; encrypted communication between SageMaker and S3
- Exam Tip: Security questions on MLS-C01 focus on the combination of IAM roles, VPC configuration, network isolation, and encryption; the most secure answer typically combines all four; know that network isolation prevents the container from accessing the internet even within a VPC
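
The four controls from the exam tip map onto specific fields of a training job request. The fragment below sketches only the security-related fields of a CreateTrainingJob request, so it is not a complete request. Every ARN, ID, and bucket name is a placeholder, and `missing_controls` is a hypothetical helper added to show how the combination could be checked.

```python
# Security-related fields only; a real request also needs a job name,
# algorithm specification, input data configuration, and stopping condition.
secure_training_job_settings = {
    # IAM: execution role assumed by the training job (placeholder ARN).
    "RoleArn": "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole",
    # VPC: run the job inside private subnets (placeholder IDs).
    "VpcConfig": {
        "SecurityGroupIds": ["sg-EXAMPLE"],
        "Subnets": ["subnet-EXAMPLE"],
    },
    # Network isolation: block outbound calls from the container.
    "EnableNetworkIsolation": True,
    # Encryption in transit between nodes in distributed training.
    "EnableInterContainerTrafficEncryption": True,
    # Encryption at rest for model artifacts and the training volume.
    "OutputDataConfig": {
        "S3OutputPath": "s3://example-bucket/output/",
        "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
        "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    },
}


def missing_controls(settings):
    """List which of the four security controls are absent from the settings."""
    checks = {
        "execution role": bool(settings.get("RoleArn")),
        "VPC configuration": bool(settings.get("VpcConfig", {}).get("Subnets")),
        "network isolation": settings.get("EnableNetworkIsolation") is True,
        "encryption at rest": bool(settings.get("OutputDataConfig", {}).get("KmsKeyId"))
        and bool(settings.get("ResourceConfig", {}).get("VolumeKmsKeyId")),
    }
    return [name for name, ok in checks.items() if not ok]


gaps = missing_controls(secure_training_job_settings)
```
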
Cost Optimization
- Managed Spot Training: Use Spot instances for training jobs with up to 90% cost savings; configure checkpointing for fault tolerance; specify max wait time to control total job duration; ideal for long-running, fault-tolerant training jobs
- SageMaker Savings Plans: Commit to consistent compute usage for 1 or 3 years; up to 64% savings on SageMaker ML instances; applies to notebook instances, training, processing, batch transform, and real-time inference
- Auto-Scaling Endpoints: Configure target tracking scaling policies on real-time endpoints; scale based on the predefined SageMakerVariantInvocationsPerInstance metric (average invocations per minute per instance); set minimum and maximum instance counts; scale to match traffic patterns and avoid over-provisioning
- Serverless Inference: Zero cost when idle; pay per inference request; ideal for workloads with intermittent traffic and tolerance for cold-start latency; eliminates always-on endpoint costs
- Multi-Model Endpoints: Host hundreds of models on shared infrastructure; reduces per-model hosting cost significantly; models are loaded on demand from S3; ideal when individual models have low and infrequent traffic
- Right-Sizing: Select the smallest instance type that meets performance requirements; use SageMaker Inference Recommender to benchmark model performance across instance types; avoid GPU instances for algorithms that do not benefit from GPU acceleration
- Exam Tip: Cost optimization questions typically present a scenario with specific traffic patterns and ask for the most cost-effective deployment; match intermittent traffic to serverless, high-volume steady traffic to Savings Plans, many models to multi-model endpoints, and training to Spot instances
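
Two pieces of arithmetic come up repeatedly in cost scenarios. The sketch below shows the Spot savings percentage derived from training time versus billable time, and a rough estimate of how many instances a target tracking policy would settle on for a given load. The second function is an approximation under the assumption that load spreads evenly across instances; the function names and numbers are illustrative.

```python
import math


def spot_savings_percent(training_seconds, billable_seconds):
    """Savings from Managed Spot Training: (1 - billable / training) * 100."""
    return (1 - billable_seconds / training_seconds) * 100


def estimated_instances(invocations_per_minute, target_per_instance,
                        min_instances, max_instances):
    """Approximate instance count that keeps per-instance load near the target."""
    needed = math.ceil(invocations_per_minute / target_per_instance)
    return max(min_instances, min(max_instances, needed))


savings = spot_savings_percent(training_seconds=3600, billable_seconds=1080)

quiet_hours = estimated_instances(100, 1000, min_instances=1, max_instances=4)
peak_hours = estimated_instances(4500, 1000, min_instances=1, max_instances=4)
```

At peak the estimate would be five instances, but the configured maximum of four caps it, which is a reminder that the maximum instance count bounds cost and capacity at the same time.
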
AWS AI Services (No-Code ML)
For common ML tasks, AWS provides pre-trained AI services that require no ML expertise. Know when to recommend these instead of custom SageMaker solutions.
- Amazon Rekognition: Image and video analysis including object detection, facial analysis, content moderation, and celebrity recognition; no training required; use when standard computer vision tasks meet business needs
- Amazon Comprehend: Natural language processing including sentiment analysis, entity extraction, key phrase detection, language detection, and topic modeling; custom entity recognition for domain-specific needs
- Amazon Translate: Neural machine translation between 75+ languages; supports real-time and batch translation; custom terminology for domain-specific vocabulary; integrates with S3 for document translation
- Amazon Transcribe: Automatic speech recognition (ASR); converts audio to text; supports custom vocabularies and language models; streaming and batch transcription; medical transcription variant for healthcare
- Amazon Polly: Text-to-speech with natural-sounding voices; supports SSML for pronunciation control; Neural TTS for lifelike speech; use for voice-enabled applications and content accessibility
- Amazon Forecast: Time-series forecasting service; automatically selects best algorithm; no ML expertise required; use instead of DeepAR when you want a fully managed forecasting solution
- Amazon Personalize: Real-time recommendation engine; supports user personalization, similar items, and personalized ranking; use instead of building custom recommendation models with Factorization Machines
- Amazon Textract: Extract text, tables, and forms from documents; goes beyond simple OCR; understands document structure; use for invoice processing, form extraction, and document analysis
- Exam Tip: The exam tests whether you know when to use a pre-trained AI service versus building a custom model on SageMaker; AI services are preferred when the use case matches their capabilities, reducing development time and cost significantly