Microsoft Exam DP-203 Topic 8 Question 37 Discussion

Actual exam question for Microsoft's DP-203 exam

Question #: 37
Topic #: 8

You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day.

You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times.

What should you include in the solution?

A. Partition by DateTime fields.

B. Sink to Azure Queue storage.

C. Include a watermark column.

D. Use a JSON format for physical data storage.


Suggested Answer: B

The Databricks ABS-AQS connector uses Azure Queue Storage (AQS) to provide an optimized file source that lets you find new files written to an Azure Blob storage (ABS) container without repeatedly listing all of the files.

This provides two major advantages:

Lower latency: no need to list nested directory structures on ABS, which is slow and resource intensive.

Lower costs: no more costly LIST API requests made to ABS.

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/aqs
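To make the two advantages concrete, here is a toy, in-memory sketch of the idea. It does not use the ABS-AQS connector or any Azure API; `ToyContainer`, `discover_by_listing`, and `discover_by_queue` are hypothetical names invented for this illustration. It only compares how many entries are scanned when new files are found by re-listing everything versus by draining a notification queue.

```python
from collections import deque


class ToyContainer:
    """Hypothetical stand-in for a blob container with an attached event queue."""

    def __init__(self):
        self.files = []
        self.queue = deque()
        self.entries_listed = 0  # total entries returned by full listings

    def write(self, name):
        self.files.append(name)
        self.queue.append(name)  # a notification is enqueued for each new file

    def list_all(self):
        # A full listing returns every file, old and new.
        self.entries_listed += len(self.files)
        return list(self.files)


def discover_by_listing(container, seen):
    """Find new files by listing everything and filtering out known names."""
    new = [f for f in container.list_all() if f not in seen]
    seen.update(new)
    return new


def discover_by_queue(container):
    """Find new files by draining the notification queue."""
    new = []
    while container.queue:
        new.append(container.queue.popleft())
    return new


container = ToyContainer()
seen = set()
queue_reads = 0

for batch in range(3):
    for i in range(100):
        container.write(f"events/batch={batch}/part-{i}.json")
    by_listing = discover_by_listing(container, seen)
    by_queue = discover_by_queue(container)
    queue_reads += len(by_queue)
    assert by_listing == by_queue  # both approaches find the same new files

print(container.entries_listed, queue_reads)  # 600 300
```

Both approaches discover the same 300 files, but the listing approach scans 600 entries over three rounds and keeps growing with the container, while the queue approach reads each notification once.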

by Helaine at May 06, 2022, 08:11 AM


James Sandman

4 years ago

I think the answer is also A, but with a different justification:

1) A Microsoft article states: "When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases." (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition)

2) On calculating the number of partitions, another article states: "Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows. Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000 partitions." (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool)

Thus, if the condition for optimal compression and performance of the clustered columnstore index is [2.4 billion / (60 * number of partitions)] >= 1,000,000, then the answer is 40. Any larger number of partitions leaves fewer than 1,000,000 rows per partition in each distribution.
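The arithmetic in this comment can be checked with a short sketch. The 2.4 billion row count, the 60 distributions, and the 1 million row minimum are the figures quoted in the comment; the function names are hypothetical, and the 100-partition case is taken from the quoted article only for comparison.

```python
TOTAL_ROWS = 2_400_000_000   # row count used in the comment
DISTRIBUTIONS = 60           # dedicated SQL pool splits each table into 60 distributions
MIN_ROWS = 1_000_000         # minimum rows per partition per distribution


def rows_per_partition_per_distribution(total_rows, partitions,
                                        distributions=DISTRIBUTIONS):
    """Rows that land in one partition of one distribution, assuming an even spread."""
    return total_rows // (partitions * distributions)


def max_partitions(total_rows, distributions=DISTRIBUTIONS, min_rows=MIN_ROWS):
    """Largest partition count that still keeps min_rows in every partition."""
    return total_rows // (distributions * min_rows)


for partitions in (40, 100):
    rows = rows_per_partition_per_distribution(TOTAL_ROWS, partitions)
    print(partitions, rows, rows >= MIN_ROWS)

print(max_partitions(TOTAL_ROWS))
```

With 40 partitions each partition holds exactly 1,000,000 rows per distribution, while 100 partitions would leave only 400,000, so 40 is the largest count that meets the threshold under these assumptions.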

upvoted 2 times


Swapnil Pal

4 years ago

The answer is A: 2.4 billion / 40 = 60 million rows per partition, which is optimal (60 million rows spread over 60 distributions is 1 million rows each).

upvoted 1 time
