A company processes 500 GB of audience and advertising data daily, storing CSV files in Amazon S3 with schemas registered in AWS Glue Data Catalog. They need to convert these files to Apache Parquet format and store them in an S3 bucket.
The solution requires a long-running workflow with 15 GiB of memory that processes the audience data and the advertising data concurrently, followed by a correlation process that begins only after those first two processes complete.
AWS Glue Workflows coordinate multiple ETL jobs through triggers. They support both parallel execution and sequential dependencies, which fits this case: the first two processes run concurrently, and the correlation step starts only after both have completed. Because AWS Glue is serverless, this comes with minimal operational overhead.
"Use AWS Glue Workflows to orchestrate multiple ETL jobs in sequence or in parallel, supporting conditional triggers and dependency management."
-- Ace the AWS Certified Data Engineer - Associate Certification - version 2 - apple.pdf
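The parallel-then-sequential pattern described above can be sketched with boto3 trigger definitions. This is a minimal illustration, not a complete deployment: the workflow name and the three job names are hypothetical, and the Glue jobs themselves are assumed to exist already. An on-demand trigger starts both jobs together, and a conditional trigger with an `AND` predicate starts the correlation job only after both have succeeded.

```python
def build_triggers(workflow, parallel_jobs, final_job):
    """Return create_trigger keyword arguments for a fan-out/fan-in workflow."""
    # Start trigger: launches every parallel job when the workflow run starts.
    start = {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"JobName": job} for job in parallel_jobs],
    }
    # Join trigger: fires only when ALL parallel jobs have SUCCEEDED.
    join = {
        "Name": f"{workflow}-correlate",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Logical": "AND",
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in parallel_jobs
            ],
        },
        "Actions": [{"JobName": final_job}],
    }
    return [start, join]


def deploy(glue, workflow, triggers):
    """Create the workflow and its triggers using a boto3 Glue client."""
    glue.create_workflow(Name=workflow)
    for trigger in triggers:
        glue.create_trigger(**trigger)


# Hypothetical names, chosen for illustration only.
WORKFLOW = "audience-ads-workflow"
TRIGGERS = build_triggers(
    WORKFLOW,
    parallel_jobs=["audience-to-parquet", "advertising-to-parquet"],
    final_job="correlate-audience-ads",
)
```

To apply the definitions, one would pass a real client, for example `deploy(boto3.client("glue"), WORKFLOW, TRIGGERS)`, and then start a run with `start_workflow_run(Name=WORKFLOW)`. Keeping the trigger definitions as plain dictionaries makes the dependency graph easy to inspect before anything is created in AWS.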