2026.06.16 2026.07.04

MLA-C01 Domain 1 Complete Guide: Data Preparation, Worth 28% of the Exam

swiftwand

Domain 1, Data Preparation for Machine Learning, carries the largest weight on MLA-C01 at 28%. The exam rewards engineers who can ingest data in the right format, store it in the right service, transform it with the right tool, and keep it compliant. This guide walks through the three task statements of Domain 1 and the AWS services behind each.

Why Domain 1 Is the Heaviest at 28%
Task 1.1: Choosing Among Six Data Formats
Task 1.1: Storage Is a Three-Way Choice of S3, EFS, and FSx
Task 1.1: The Cast of Streaming Ingestion
Task 1.2: Standard Cleaning and Feature Engineering
Task 1.2: Four Transformation Tools – Glue, DataBrew, EMR, Data Wrangler
Feature Store: A Two-Layer Online and Offline Design
Ground Truth: Three Labeling Workforces
Task 1.3: Finishing with Bias Metrics and Compliance
High-Frequency Checklist: Self-Diagnosis for Exam Day
Conclusion: Master the Data, Master MLA-C01


Why Domain 1 Is the Heaviest at 28%

In real machine learning work, data preparation is where most of the time goes, and AWS reflects that in the weighting. Domain 1 breaks into three task statements: ingesting and storing data, transforming data and engineering features, and ensuring data integrity for modeling.

Task	Theme	What is tested
Task 1.1	Ingest and store data	Format selection, storage selection, streaming ingestion
Task 1.2	Transform data and engineer features	Cleaning, encoding, choosing the right transformation tool
Task 1.3	Ensure data integrity and prepare for modeling	Bias metrics, encryption and anonymization, compliance

Task 1.1: Choosing Among Six Data Formats

Format	Structure	Strong when
Parquet	Columnar	Analytical queries, reading a subset of columns, compression
ORC	Columnar	Analytics in the Hive ecosystem
CSV	Row-based text	Small data exchange, human inspection
JSON	Semi-structured text	Nested structures, API integration, flexible schema
Avro	Row-based binary	Schema evolution, record-level streaming
RecordIO	Binary	Training input for SageMaker built-in algorithms
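
To make the row-based versus columnar distinction concrete, here is a toy sketch in plain Python. It is not the real Parquet or ORC encoding (those add compression, encodings, and metadata), and the records and field names are invented for illustration. It only shows why reading one column touches fewer values when the data is laid out by column.

```python
# Toy illustration only: hypothetical records, not a real file format.
records = [
    {"user_id": 1, "country": "JP", "spend": 120.0},
    {"user_id": 2, "country": "US", "spend": 75.5},
    {"user_id": 3, "country": "JP", "spend": 310.25},
]


def to_row_layout(rows):
    """Row-based (CSV/Avro style): the fields of each record sit together."""
    return [tuple(r.values()) for r in rows]


def to_columnar_layout(rows):
    """Columnar (Parquet/ORC style): the values of each column sit together."""
    return {name: [r[name] for r in rows] for name in rows[0]}


def read_column_from_rows(row_layout, col_index):
    """Scan every record to pull out one field; count the values touched."""
    touched = 0
    values = []
    for row in row_layout:
        touched += len(row)  # the whole record is read to reach one field
        values.append(row[col_index])
    return values, touched


def read_column_from_columns(col_layout, name):
    """Read just the requested column; count the values touched."""
    values = col_layout[name]
    return values, len(values)


row_values, row_touched = read_column_from_rows(to_row_layout(records), 2)
col_values, col_touched = read_column_from_columns(to_columnar_layout(records), "spend")
```

Both reads return the same `spend` values, but the row layout touches every field of every record while the columnar layout touches only the one column. That gap is the intuition behind picking Parquet for "read a subset of columns" questions.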

Task 1.1: Storage Is a Three-Way Choice of S3, EFS, and FSx

Amazon S3 is the default data lake for ML: durable, cheap, and the source for most SageMaker training jobs. Amazon EFS suits shared file access across instances, and Amazon FSx for Lustre shines when you need high-throughput, low-latency access to large training datasets. Match the access pattern to the service rather than defaulting to S3 every time.
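
As a rough sketch of how the storage choice shows up in practice, the helpers below build input-channel dictionaries in the shape that the SageMaker CreateTrainingJob request uses for `InputDataConfig`, as far as I understand that API. The bucket name, file system ID, and paths are hypothetical placeholders; check field names against the current API reference before relying on them.

```python
def s3_channel(name, s3_uri):
    """Input channel that reads training data from an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def file_system_channel(name, file_system_id, file_system_type, directory_path):
    """Input channel that mounts EFS or FSx for Lustre, read-only."""
    if file_system_type not in ("EFS", "FSxLustre"):
        raise ValueError("file_system_type must be 'EFS' or 'FSxLustre'")
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "FileSystemAccessMode": "ro",
                "DirectoryPath": directory_path,
            }
        },
    }


# Hypothetical identifiers, for illustration only.
default_lake = s3_channel("train", "s3://example-ml-bucket/train/")
high_throughput = file_system_channel(
    "train", "fs-0123456789abcdef0", "FSxLustre", "/fsx/train"
)
```

The point of the sketch is that the same training job can read from any of the three services; only the data source block changes, so the decision really is about access pattern and throughput.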

Task 1.1: The Cast of Streaming Ingestion

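For streaming data, know the roles: Amazon Kinesis Data Streams for real-time capture that your own consumers read, Amazon Data Firehose for buffered, near-real-time delivery into destinations such as S3 or Redshift with optional light transformation, and Amazon MSK (Amazon Managed Streaming for Apache Kafka) for high-scale event pipelines, typically when the team already builds on Kafka. Questions often ask which one fits a given latency or transformation requirement.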
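
Here is a minimal sketch of the producer side for Kinesis Data Streams. `send_event` calls `put_record` with the `StreamName`, `Data`, and `PartitionKey` parameters that the boto3 Kinesis client accepts. The stream name, event fields, and the `FakeKinesis` stand-in are assumptions for illustration, so the example runs without AWS credentials; in real use you would pass `boto3.client("kinesis")` instead.

```python
import json


def send_event(kinesis_client, stream_name, event, partition_key_field="device_id"):
    """Serialize one event as JSON and put it on a Kinesis data stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records sharing a partition key are routed to the same shard.
        PartitionKey=str(event[partition_key_field]),
    )


class FakeKinesis:
    """Hypothetical stand-in for boto3.client('kinesis'); records calls in memory."""

    def __init__(self):
        self.records = []

    def put_record(self, StreamName, Data, PartitionKey):
        self.records.append((StreamName, Data, PartitionKey))
        return {"ShardId": "shardId-000000000000", "SequenceNumber": str(len(self.records))}


fake = FakeKinesis()
event = {"device_id": "sensor-7", "temp_c": 21.4}
response = send_event(fake, "clickstream-demo", event)
```

Choosing the partition key is the main design decision here: a key with many distinct values spreads load across shards, while a single hot key concentrates it on one.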

Task 1.2: Standard Cleaning and Feature Engineering

Expect the classics: handling missing values, removing duplicates and outliers, scaling and normalization, one-hot and label encoding for categoricals, and binning. The exam tests whether you know which technique fits which data problem, not how to derive it.
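
Three of these techniques can be sketched in a few lines of standard-library Python. This is a teaching sketch with made-up values, not how Glue or Data Wrangler implement them; in practice a library such as pandas or scikit-learn would do this work.

```python
from statistics import median


def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categories as 0/1 vectors, one position per sorted category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


ages = [20, None, 40, 30]
filled = impute_median(ages)        # [20, 30, 40, 30]
scaled = min_max_scale(filled)      # [0.0, 0.5, 1.0, 0.5]
categories, encoded = one_hot(["red", "blue", "red"])
# categories == ["blue", "red"]; encoded == [[0, 1], [1, 0], [0, 1]]
```

Median imputation is used here because it is less sensitive to outliers than the mean, which is the kind of "which technique fits which problem" reasoning the exam looks for.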

Task 1.2: Four Transformation Tools – Glue, DataBrew, EMR, Data Wrangler

Tool	Main user	Character
AWS Glue	Data engineer	Serverless ETL, scheduled and automated as jobs
AWS Glue DataBrew	Analyst	Visual, no-code data cleaning and profiling
Amazon EMR	Data engineer	Spark and Hadoop at scale for big data processing
SageMaker Data Wrangler	ML practitioner	Visual feature prep that flows straight into ML pipelines

Feature Store: A Two-Layer Online and Offline Design

SageMaker Feature Store keeps features consistent between training and inference. The offline store (backed by S3) serves batch training and historical lookups, while the online store provides low-latency reads for real-time inference. Knowing this split, and why it prevents training-serving skew, is a frequent exam point.
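
The two-layer idea can be modeled with a toy in-memory class. This is not the SageMaker Feature Store API; the class, method names, and sample features are hypothetical, and the real offline store lives in S3. It only shows the behavior: one write feeds both layers, the online layer keeps the latest record per ID, and the offline layer keeps the full history.

```python
class ToyFeatureStore:
    """Toy model of an online/offline feature store (illustration only)."""

    def __init__(self):
        self.online = {}   # record_id -> latest row, for low-latency lookups
        self.offline = []  # append-only history, for training and audits

    def put_record(self, record_id, event_time, features):
        row = {"record_id": record_id, "event_time": event_time, **features}
        self.offline.append(row)
        current = self.online.get(record_id)
        # Only a record at least as new as the stored one replaces it online.
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = row

    def get_online(self, record_id):
        """What a real-time inference request would see."""
        return self.online.get(record_id)

    def get_history(self, record_id):
        """What a batch training job would see."""
        return [r for r in self.offline if r["record_id"] == record_id]


store = ToyFeatureStore()
store.put_record("cust-1", 100, {"orders_30d": 2})
store.put_record("cust-1", 200, {"orders_30d": 5})
store.put_record("cust-1", 150, {"orders_30d": 3})  # late-arriving, older event

latest = store.get_online("cust-1")
history = store.get_history("cust-1")
```

Because training and inference read features produced by the same write path, the feature definitions cannot drift apart, which is the mechanism behind avoiding training-serving skew.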

Ground Truth: Three Labeling Workforces

For labeling, SageMaker Ground Truth offers three workforce options: the Amazon Mechanical Turk public workforce, a private workforce of your own employees, and vendor-managed workforces from the AWS Marketplace. Choose based on data sensitivity and cost. Sensitive data points toward a private workforce.

Task 1.3: Finishing with Bias Metrics and Compliance

SageMaker Clarify measures pre-training bias, with metrics such as class imbalance (CI) and difference in proportions of labels (DPL), so you can catch imbalance before modeling. Combine that with encryption using AWS KMS, anonymization of sensitive fields, and compliance controls. The exam expects you to treat data integrity and privacy as part of preparation, not an afterthought.
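
Two pre-training bias metrics that Clarify reports, class imbalance (CI) and difference in proportions of labels (DPL), are simple enough to compute by hand. The sketch below follows the definitions as I understand them from the Clarify documentation: facet a is the advantaged group and facet d the disadvantaged group. The counts are invented for illustration.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means the two facets are equally represented."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """DPL = q_a - q_d, where q is the share of positive labels within each facet."""
    return pos_a / n_a - pos_d / n_d


# Hypothetical dataset: 800 rows in facet a (400 positive), 200 in facet d (50 positive).
ci = class_imbalance(800, 200)
dpl = difference_in_proportions_of_labels(400, 800, 50, 200)
```

With these counts CI is 0.6 (facet a is heavily over-represented) and DPL is 0.25 (facet a receives positive labels more often). Values near zero on both metrics suggest a more balanced dataset.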

High-Frequency Checklist: Self-Diagnosis for Exam Day

Conclusion: Master the Data, Master MLA-C01

At 28%, Domain 1 is the single biggest lever on your score. If you can confidently choose formats, storage, ingestion, transformation tools, and integrity controls, you have built the foundation the rest of the exam stands on.

#AWS #AWS Certification #Feature Store #MLA-C01 #SageMaker
