You are designing a pipeline to process data files that arrive in Cloud Storage by 3:00 am each day. Data processing is performed in stages, where the output of one stage becomes the input of the next. Each stage takes a long time to run. Occasionally a stage fails, and you have to address the problem. You need to ensure that the final output is generated as quickly as possible. What should you do?
Using Cloud Composer to design the processing pipeline as a Directed Acyclic Graph (DAG) is the most suitable approach because:
Fault tolerance: Cloud Composer (based on Apache Airflow) allows for handling failures at specific stages. You can clear the state of a failed task and rerun it without reprocessing the entire pipeline.
Stage-based processing: DAGs are ideal for workflows with interdependent stages where the output of one stage serves as input to the next.
Efficiency: Because each stage takes a long time to run, rerunning only the failed stage and the stages after it, rather than the whole pipeline, produces the final output sooner.
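The rerun behavior described above can be illustrated with a small toy model. This is a plain-Python sketch, not Cloud Composer or Airflow code: the `run_pipeline` function, the `state` dictionary, the stage names, and the `gs://example-bucket/input.csv` path are all hypothetical and exist only to show why keeping per-stage state lets a retry skip stages that already succeeded.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(stages: List[Stage], initial_input: str, state: Dict[str, str]) -> str:
    """Run stages in order, reusing stored output for stages that already succeeded."""
    data = initial_input
    for name, func in stages:
        if name in state:
            # Stage succeeded on an earlier attempt: reuse its output.
            data = state[name]
            continue
        # A failure raises here and leaves earlier results in `state`.
        data = func(data)
        state[name] = data
    return data


def demo() -> Tuple[str, Dict[str, int]]:
    """Simulate one failed run followed by a rerun that resumes at the failed stage."""
    calls = {"extract": 0, "transform": 0, "load": 0}

    def extract(path: str) -> str:
        calls["extract"] += 1
        return f"extracted({path})"

    def transform(data: str) -> str:
        calls["transform"] += 1
        if calls["transform"] == 1:
            # Hypothetical transient failure on the first attempt only.
            raise RuntimeError("simulated stage failure")
        return f"transformed({data})"

    def load(data: str) -> str:
        calls["load"] += 1
        return f"loaded({data})"

    stages: List[Stage] = [("extract", extract), ("transform", transform), ("load", load)]
    state: Dict[str, str] = {}
    source = "gs://example-bucket/input.csv"  # hypothetical path

    try:
        run_pipeline(stages, source, state)
    except RuntimeError:
        pass  # in practice, this is where you would address the problem

    # Second attempt: "extract" is skipped because its output is already stored.
    result = run_pipeline(stages, source, state)
    return result, calls


if __name__ == "__main__":
    final_output, call_counts = demo()
    print(final_output)
    print(call_counts)
```

In this sketch the expensive first stage runs once even though the pipeline is started twice. In Airflow, the scheduler tracks per-task state in a similar spirit, and clearing a failed task is what triggers the rerun from that point.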