You need to create a data pipeline for a new application. Your application will stream data that needs to be enriched and cleaned. Eventually, the data will be used to train machine learning models. You need to determine the appropriate data manipulation methodology and which Google Cloud services to use in this pipeline. What should you choose?
Answer: A
Explanation:
Streaming data requiring enrichment and cleaning before ML training suggests an ETL (Extract, Transform, Load) approach, with a focus on real-time processing and a data warehouse for ML.
Option A: ETL with Dataflow (streaming transformations) and BigQuery (storage/ML training) is Google's recommended pattern for streaming pipelines. Dataflow handles enrichment/cleaning, and BigQuery supports ML model training (BigQuery ML).
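To make the Option A pattern concrete, here is a minimal sketch of the kind of enrichment and cleaning logic a Dataflow job would apply to each streamed record. This is illustrative only: the functions are written as plain Python so the logic is visible, but in a real pipeline they would be wrapped in Apache Beam transforms (e.g. `beam.Map`, `beam.Filter`) and executed on Dataflow before the rows are written to BigQuery. The field names and the `REGION_LOOKUP` table are hypothetical.

```python
# Hypothetical sketch of Dataflow-style enrich/clean steps as plain
# Python callables. In a real streaming pipeline these would run inside
# Apache Beam transforms on Dataflow; all names here are illustrative.
import json

# Illustrative static lookup used to enrich raw events.
REGION_LOOKUP = {"us": "North America", "de": "Europe"}

def parse_event(raw: str) -> dict:
    """Deserialize one raw JSON message from the stream."""
    return json.loads(raw)

def clean(event: dict) -> bool:
    """Keep only records with the fields ML training needs."""
    return event.get("user_id") is not None and event.get("country") in REGION_LOOKUP

def enrich(event: dict) -> dict:
    """Add a derived 'region' field before loading into BigQuery."""
    return {**event, "region": REGION_LOOKUP[event["country"]]}

raw_messages = [
    '{"user_id": 1, "country": "us"}',
    '{"user_id": null, "country": "us"}',   # dropped: missing user_id
    '{"user_id": 2, "country": "de"}',
]

# Beam equivalent: Map(parse_event) | Filter(clean) | Map(enrich)
rows = [enrich(e) for e in map(parse_event, raw_messages) if clean(e)]
print(rows)
```

The cleaned, enriched rows would then land in a BigQuery table, where BigQuery ML can train directly against them with SQL.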
Option B: ETL with Cloud Data Fusion to Cloud Storage is batch-oriented and lacks streaming focus. Cloud Storage isn't ideal for ML training directly.
Option C: ELT (load then transform) with Cloud Storage to Bigtable is misaligned: Bigtable is a NoSQL database built for low-latency key-value workloads, not for ML training or post-load transformation.
Option D: ELT with Cloud SQL to Analytics Hub is for relational data and data sharing, not streaming or ML.
Reference: Google Cloud Documentation - 'Dataflow: ETL Patterns' (https://cloud.google.com/dataflow/docs/guides), 'BigQuery ML' (https://cloud.google.com/bigquery-ml).