MLA-C01 Domain 1 Complete Guide: Data Preparation for 28% of the Exam

Domain 1, Data Preparation for Machine Learning, carries the largest weight on MLA-C01 at 28%. The exam rewards engineers who can ingest data in the right format, store it in the right service, transform it with the right tool, and keep it compliant. This guide walks through the three task statements of Domain 1 and the AWS services behind each.
- Why Domain 1 Is the Heaviest at 28%
- Task 1.1: Choosing Among Six Data Formats
- Task 1.1: Storage Is a Three-Way Choice of S3, EFS, and FSx
- Task 1.1: The Cast of Streaming Ingestion
- Task 1.2: Standard Cleaning and Feature Engineering
- Task 1.2: Four Transformation Tools – Glue, DataBrew, EMR, Data Wrangler
- Feature Store: A Two-Layer Online and Offline Design
- Ground Truth: Three Labeling Workforces
- Task 1.3: Finishing with Bias Metrics and Compliance
- High-Frequency Checklist: Self-Diagnosis for Exam Day
- Conclusion: Master the Data, Master MLA-C01
Why Domain 1 Is the Heaviest at 28%
In real machine learning work, data preparation is where most of the time goes, and AWS reflects that in the weighting. Domain 1 breaks into three task statements: ingesting and storing data, transforming data and engineering features, and ensuring data integrity for modeling.
| Task | Theme | What is tested |
| Task 1.1 | Ingest and store data | Format selection, storage selection, streaming ingestion |
| Task 1.2 | Transform data and engineer features | Cleaning, encoding, choosing the right transformation tool |
| Task 1.3 | Ensure data integrity and prepare for modeling | Bias metrics, encryption and anonymization, compliance |
Task 1.1: Choosing Among Six Data Formats
| Format | Structure | Strong when |
| Parquet | Columnar | Analytical queries, reading a subset of columns, compression |
| ORC | Columnar | Analytics in the Hive ecosystem |
| CSV | Row-based text | Small data exchange, human inspection |
| JSON | Semi-structured text | Nested structures, API integration, flexible schema |
| Avro | Row-based binary | Schema evolution, record-level streaming |
| RecordIO | Binary (protobuf-encoded records) | Training input for SageMaker built-in algorithms |
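To make the row-versus-columnar distinction in the table concrete, here is a minimal sketch in plain Python. It is a toy model only: the sample records and the `to_columnar` helper are made up for illustration, and real Parquet or ORC files add encoding, compression, and metadata on top of this idea.

```python
# Toy illustration of row-based vs. columnar layout (not a real Parquet reader).
rows = [
    {"user_id": 1, "country": "JP", "amount": 120.0},
    {"user_id": 2, "country": "US", "amount": 80.5},
    {"user_id": 3, "country": "JP", "amount": 42.0},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)

# Row layout: every full record is visited even though only "amount" is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is read; other columns are untouched.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals agree, but the columnar version never touches `user_id` or `country`. That is the property that makes columnar formats attractive for analytical queries over a subset of columns.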
Task 1.1: Storage Is a Three-Way Choice of S3, EFS, and FSx
Amazon S3 is the default data lake for ML: durable, cheap, and the source for most SageMaker training jobs. Amazon EFS suits shared file access across instances, and Amazon FSx for Lustre shines when you need high-throughput, low-latency access to large training datasets. Match the access pattern to the service rather than defaulting to S3 every time.
Task 1.1: The Cast of Streaming Ingestion
For streaming data, know the roles: Amazon Kinesis Data Streams for real-time capture that your own consumers read, Amazon Data Firehose for buffered, near-real-time delivery into destinations such as S3 or Redshift with light transformation, and Amazon MSK (managed Kafka) for high-scale event pipelines. Questions often ask which one fits a latency or transformation requirement.
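One way to picture why a delivery service behaves differently from a raw stream is to model buffering: records accumulate until a count or age threshold is reached, then go out as one batch. The sketch below is a simplified toy, not the Firehose API. The class name `ToyDeliveryBuffer`, its thresholds, and the sample events are all hypothetical, and the age check only runs when a new record arrives, whereas a managed service flushes on a timer.

```python
class ToyDeliveryBuffer:
    """Hypothetical in-memory buffer that batches records by count or age."""

    def __init__(self, max_records, max_seconds):
        self.max_records = max_records
        self.max_seconds = max_seconds
        self.pending = []
        self.opened_at = None

    def add(self, record, now):
        """Buffer one record; return a batch if a threshold was crossed, else None."""
        if not self.pending:
            self.opened_at = now  # the first record in a batch starts the clock
        self.pending.append(record)
        full = len(self.pending) >= self.max_records
        stale = now - self.opened_at >= self.max_seconds
        if full or stale:
            batch, self.pending = self.pending, []
            return batch
        return None


buffer = ToyDeliveryBuffer(max_records=3, max_seconds=60)
events = [("a", 0), ("b", 5), ("c", 10), ("d", 20), ("e", 90)]  # (record, timestamp)

batches = []
for record, timestamp in events:
    batch = buffer.add(record, timestamp)
    if batch is not None:
        batches.append(batch)

print(batches)
```

The first batch is released because it filled up, the second because it aged out. Buffering like this trades latency for fewer, larger writes, which is the trade-off behind the latency-requirement questions.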
Task 1.2: Standard Cleaning and Feature Engineering
Expect the classics: handling missing values, removing duplicates and outliers, scaling and normalization, one-hot and label encoding for categoricals, and binning. The exam tests whether you know which technique fits which data problem, not how to derive it.
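As a quick refresher on three of these techniques, the sketch below implements median imputation, min-max scaling, and one-hot encoding using only the Python standard library. The sample `ages` and `plans` data and the helper names are invented for illustration. In practice a library or one of the AWS tools would do this work.

```python
from statistics import median

# Made-up sample data: None marks a missing value.
ages = [34, None, 29, 41, None, 52]
plans = ["basic", "pro", "basic", "enterprise", "pro", "basic"]


def impute_median(values):
    """Replace missing values with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


ages_filled = impute_median(ages)
ages_scaled = min_max_scale(ages_filled)
categories, plans_encoded = one_hot(plans)

print(ages_filled)
print(categories, plans_encoded[1])
```

Median imputation is used here rather than the mean because it is less sensitive to outliers. One-hot encoding suits the `plans` column because the categories have no meaningful numeric order for a label encoding to preserve.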
Task 1.2: Four Transformation Tools – Glue, DataBrew, EMR, Data Wrangler
| Tool | Main user | Character |
| AWS Glue | Data engineer | Serverless ETL, scheduled and automated as jobs |
| AWS Glue DataBrew | Analyst | Visual, no-code data cleaning and profiling |
| Amazon EMR | Data engineer | Spark and Hadoop at scale for big data processing |
| SageMaker Data Wrangler | ML practitioner | Visual feature prep that flows straight into ML pipelines |
Feature Store: A Two-Layer Online and Offline Design
SageMaker Feature Store keeps features consistent between training and inference. The offline store (backed by S3) serves batch training and historical lookups, while the online store provides low-latency reads for real-time inference. Knowing this split, and why it prevents training-serving skew, is a frequent exam point.
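The online and offline split can be pictured with a small in-memory model: every write is appended to a history (the offline role), while a key-value map keeps only the latest record per ID (the online role). This is a conceptual sketch, not the SageMaker Feature Store API. `ToyFeatureStore` and its methods `ingest`, `latest`, and `history` are hypothetical names chosen for illustration.

```python
class ToyFeatureStore:
    """Hypothetical two-layer store: append-only history plus latest-value lookup."""

    def __init__(self):
        self.offline = []  # full history, the role the S3-backed offline store plays
        self.online = {}   # record_id -> most recent record, for low-latency reads

    def ingest(self, record_id, event_time, features):
        row = {"record_id": record_id, "event_time": event_time, **features}
        self.offline.append(row)
        current = self.online.get(record_id)
        # Only overwrite the online value if this record is at least as recent.
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = row

    def latest(self, record_id):
        """Online-style read: the newest features for one ID."""
        return self.online.get(record_id)

    def history(self, record_id):
        """Offline-style read: every version ever written for one ID."""
        return [r for r in self.offline if r["record_id"] == record_id]


store = ToyFeatureStore()
store.ingest("user-1", event_time=1, features={"clicks_7d": 3})
store.ingest("user-1", event_time=2, features={"clicks_7d": 5})
store.ingest("user-2", event_time=1, features={"clicks_7d": 9})

print(store.latest("user-1"))
print(len(store.history("user-1")))
```

Because both layers are fed by the same write path, training (reading history) and inference (reading the latest value) see features computed the same way. That shared write path is the mechanism that guards against training-serving skew.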
Ground Truth: Three Labeling Workforces
For labeling, SageMaker Ground Truth offers three workforce options: the Amazon Mechanical Turk public workforce, a private workforce of your own employees, and vendor-managed workforces from the AWS Marketplace. Choose based on data sensitivity and cost. Sensitive data points toward a private workforce.
Task 1.3: Finishing with Bias Metrics and Compliance
SageMaker Clarify measures pre-training bias, such as imbalance between groups in the dataset, so you can catch it before modeling. Combine that with encryption using AWS KMS, anonymization of sensitive fields, and compliance controls. The exam expects you to treat data integrity and privacy as part of preparation, not an afterthought.
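Two pre-training bias metrics commonly associated with Clarify are class imbalance (CI) and difference in proportions of labels (DPL). The sketch below computes them by hand, assuming the usual definitions: CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where n is the number of rows in each facet and q is the share of positive labels in that facet. The counts are invented and the function names are illustrative. This does not call Clarify itself.

```python
def class_imbalance(n_a, n_d):
    """CI: how unevenly the two facets are represented; 0 means balanced."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """DPL: gap between the positive-label rates of the two facets."""
    return pos_a / n_a - pos_d / n_d


# Made-up dataset: facet a has 800 rows (400 positive), facet d has 200 rows (50 positive).
n_a, pos_a = 800, 400
n_d, pos_d = 200, 50

ci = class_imbalance(n_a, n_d)
dpl = difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d)

print(f"CI={ci:.2f} DPL={dpl:.2f}")
```

A CI well above zero signals that facet d is underrepresented. A positive DPL signals that facet a receives the favorable label more often. Either is a cue to rebalance or investigate before training.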
High-Frequency Checklist: Self-Diagnosis for Exam Day
Conclusion: Master the Data, Master MLA-C01
At 28%, Domain 1 is the single biggest lever on your score. If you can confidently choose formats, storage, ingestion, transformation tools, and integrity controls, you have built the foundation the rest of the exam stands on.




