
MLA-C01 Domain 1 Complete Guide: Data Preparation for 28% of the Exam

swiftwand

Domain 1, Data Preparation for Machine Learning, carries the largest weight on MLA-C01 at 28%. The exam rewards engineers who can ingest data in the right format, store it in the right service, transform it with the right tool, and keep it compliant. This guide walks through the three task statements of Domain 1 and the AWS services behind each.


Why Domain 1 Is the Heaviest at 28%

In real machine learning work, data preparation is where most of the time goes, and AWS reflects that in the weighting. Domain 1 breaks into three task statements: ingesting and storing data, transforming data and engineering features, and ensuring data integrity for modeling.

| Task | Theme | What is tested |
| --- | --- | --- |
| Task 1.1 | Ingest and store data | Format selection, storage selection, streaming ingestion |
| Task 1.2 | Transform data and engineer features | Cleaning, encoding, choosing the right transformation tool |
| Task 1.3 | Ensure data integrity and prepare for modeling | Bias metrics, encryption and anonymization, compliance |

Task 1.1: Choosing Among Six Data Formats

| Format | Structure | Strong when |
| --- | --- | --- |
| Parquet | Columnar | Analytical queries, reading a subset of columns, compression |
| ORC | Columnar | Analytics in the Hive ecosystem |
| CSV | Row-based text | Small data exchange, human inspection |
| JSON | Semi-structured text | Nested structures, API integration, flexible schema |
| Avro | Row-based binary | Schema evolution, record-level streaming |
| RecordIO | Binary | Training input for SageMaker built-in algorithms |
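
To see why columnar formats such as Parquet and ORC favor reading a subset of columns, here is a minimal pure-Python sketch of the two layouts. It is a toy illustration under stated assumptions: the sample records and the helper names `to_columnar` and `column_from_rows` are made up for this example, and real Parquet or ORC files add encoding, compression, and metadata on top of this idea.

```python
# Toy illustration only: made-up sample data, not a real file format.
rows = [
    {"user_id": 1, "country": "JP", "purchase": 120.0},
    {"user_id": 2, "country": "US", "purchase": 80.5},
    {"user_id": 3, "country": "JP", "purchase": 42.0},
]


def to_columnar(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def column_from_rows(records, name):
    """Row layout: every full record is visited to extract one field."""
    return [record[name] for record in records]


columnar = to_columnar(rows)

# Columnar layout: the values of one column already sit together,
# so an aggregate over that column never touches the other fields.
total = sum(columnar["purchase"])
print(total)
```

In the row layout, extracting `purchase` means walking every record in full, which is the access pattern CSV, JSON, and Avro impose. In the columnar layout the same values are stored side by side, which is also why they compress well.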

Task 1.1: Storage Is a Three-Way Choice of S3, EFS, and FSx

Amazon S3 is the default data lake for ML: durable, cheap, and the source for most SageMaker training jobs. Amazon EFS suits shared file access across instances, and Amazon FSx for Lustre shines when you need high-throughput, low-latency access to large training datasets. Match the access pattern to the service rather than defaulting to S3 every time.

Task 1.1: The Cast of Streaming Ingestion

For streaming data, know the roles: Amazon Kinesis Data Streams for real-time capture, Amazon Data Firehose for delivery into S3 or Redshift with light transformation, and Amazon MSK (managed Kafka) for high-scale event pipelines. Questions often ask which one fits a latency or transformation requirement.
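
The difference between record-by-record capture and batched delivery can be sketched in a few lines. The class below is a hypothetical toy, not the Firehose API: `MicroBatchBuffer`, `max_records`, and `delivered` are names invented for this example, and it flushes on record count only, whereas the real service buffers by size and time.

```python
class MicroBatchBuffer:
    """Hypothetical toy buffer: collects records and flushes them in batches."""

    def __init__(self, max_records, transform=None):
        self.max_records = max_records
        # Optional light transformation applied to each record on arrival.
        self.transform = transform or (lambda record: record)
        self.pending = []
        self.delivered = []  # stands in for batches written to a destination

    def put(self, record):
        """Accept one record and flush once the buffer is full."""
        self.pending.append(self.transform(record))
        if len(self.pending) >= self.max_records:
            self.flush()

    def flush(self):
        """Deliver whatever is pending as one batch."""
        if self.pending:
            self.delivered.append(list(self.pending))
            self.pending.clear()


buffer = MicroBatchBuffer(max_records=3, transform=str.upper)
for event in ["click", "view", "click", "purchase"]:
    buffer.put(event)
buffer.flush()  # deliver the final partial batch
print(buffer.delivered)
```

Batching trades latency for fewer, larger writes. That trade-off is usually what a question is probing when it sets a latency requirement against a delivery requirement.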

Task 1.2: Standard Cleaning and Feature Engineering

Expect the classics: handling missing values, removing duplicates and outliers, scaling and normalization, one-hot and label encoding for categoricals, and binning. The exam tests whether you know which technique fits which data problem, not how to derive it.
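
As a quick refresher, the sketch below implements four of these techniques in plain Python: median imputation, min-max scaling, one-hot encoding, and label encoding. The function names and sample values are made up for illustration. In practice you would more likely reach for a library or one of the AWS tools covered in the next section.

```python
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot(categories):
    """Encode each category as a 0/1 vector, columns in sorted order."""
    vocabulary = sorted(set(categories))
    return [[1 if c == v else 0 for v in vocabulary] for c in categories]


def label_encode(categories):
    """Map each category to an integer index in sorted order."""
    index = {v: i for i, v in enumerate(sorted(set(categories)))}
    return [index[c] for c in categories]


ages = [20, None, 40, 30]
filled = impute_median(ages)
scaled = min_max_scale(filled)

colors = ["red", "blue", "red"]
print(filled, scaled, one_hot(colors), label_encode(colors))
```

Note the design difference between the two encoders: label encoding implies an order among categories, while one-hot encoding does not. That is why one-hot is the usual choice for nominal categories with few distinct values.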

Task 1.2: Four Transformation Tools – Glue, DataBrew, EMR, Data Wrangler

| Tool | Main user | Character |
| --- | --- | --- |
| AWS Glue | Data engineer | Serverless ETL, scheduled and automated as jobs |
| AWS Glue DataBrew | Analyst | Visual, no-code data cleaning and profiling |
| Amazon EMR | Data engineer | Spark and Hadoop at scale for big data processing |
| SageMaker Data Wrangler | ML practitioner | Visual feature prep that flows straight into ML pipelines |

Feature Store: A Two-Layer Online and Offline Design

SageMaker Feature Store keeps features consistent between training and inference. The offline store (backed by S3) serves batch training and historical lookups, while the online store provides low-latency reads for real-time inference. Knowing this split, and why it prevents training-serving skew, is a frequent exam point.
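
The two-layer idea can be sketched with a toy in-memory class. This is not the SageMaker SDK: `ToyFeatureStore` and its methods are hypothetical names for this example, and the sketch assumes the offline layer is an append-only history while the online layer keeps only the latest record per ID.

```python
class ToyFeatureStore:
    """Hypothetical in-memory model of an online/offline feature store."""

    def __init__(self):
        self.offline = []  # append-only history, used for training sets
        self.online = {}   # latest record per ID, used for inference

    def ingest(self, record_id, event_time, features):
        """Write one record to both layers from a single ingestion path."""
        row = {"id": record_id, "event_time": event_time, **features}
        self.offline.append(row)
        current = self.online.get(record_id)
        # Keep only the most recent record in the online layer.
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = row

    def get_online(self, record_id):
        """Low-latency style lookup: a single key read."""
        return self.online.get(record_id)

    def history(self, record_id):
        """Batch style lookup: scan the full history for one ID."""
        return [row for row in self.offline if row["id"] == record_id]


store = ToyFeatureStore()
store.ingest("u1", 1, {"avg_spend": 10.0})
store.ingest("u1", 2, {"avg_spend": 12.5})
print(store.get_online("u1"), len(store.history("u1")))
```

Because both layers are fed by the same `ingest` call, training and inference read values produced by the same feature logic. That shared path is the property that guards against training-serving skew.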

Ground Truth: Three Labeling Workforces

For labeling, SageMaker Ground Truth offers three workforce options: the public Amazon Mechanical Turk workforce, a private workforce made up of your own employees or contractors, and vendor-managed workforces from the AWS Marketplace. Choose based on data sensitivity and cost: sensitive data points toward a private workforce.

Task 1.3: Finishing with Bias Metrics and Compliance

SageMaker Clarify computes pre-training bias metrics, such as class imbalance (CI) and difference in proportions of labels (DPL), so you can catch imbalance in the dataset before modeling. Combine that with encryption using AWS KMS, anonymization of sensitive fields, and compliance controls. The exam expects you to treat data integrity and privacy as part of preparation, not an afterthought.
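
Two of the pre-training bias metrics that Clarify reports are simple enough to compute by hand: class imbalance (CI) and difference in proportions of labels (DPL). The sketch below follows their standard definitions for two facets, `a` and `d`. The function names and the counts are made up for illustration, and this is not the Clarify API.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d), where n_a and n_d are facet sizes.

    Ranges from -1 to 1; 0 means both facets are equally represented.
    """
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """DPL = q_a - q_d, the gap in positive-label rates between facets."""
    return pos_a / n_a - pos_d / n_d


# Hypothetical dataset: 800 rows in facet a, 200 rows in facet d.
ci = class_imbalance(n_a=800, n_d=200)

# 400 positive labels in facet a, 50 positive labels in facet d.
dpl = difference_in_proportions_of_labels(pos_a=400, n_a=800, pos_d=50, n_d=200)

print(ci, dpl)
```

A CI far from zero says one facet dominates the dataset. A DPL far from zero says the positive label is distributed unevenly across facets. Both are reasons to rebalance or resample before training.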

High-Frequency Checklist: Self-Diagnosis for Exam Day

Conclusion: Master the Data, Master MLA-C01

At 28%, Domain 1 is the single biggest lever on your score. If you can confidently choose formats, storage, ingestion, transformation tools, and integrity controls, you have built the foundation the rest of the exam stands on.
