The code block displayed below contains an error. Once executed, the code block below should divide DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.
Code block:
transactionsDf.coalesce(14, ("storeId", "transactionDate"))
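The error: coalesce() only reduces the number of partitions and accepts no column arguments, so partitioning by columns requires repartition() instead, i.e. transactionsDf.repartition(14, "storeId", "transactionDate"). As a minimal sketch of the idea (a pure-Python analogy, not Spark's actual partitioner; data and helper name are invented for illustration), repartitioning by columns amounts to hashing the key columns into a fixed number of buckets:

```python
# Pure-Python analogy of repartition(14, "storeId", "transactionDate"):
# assign each row to one of 14 buckets by hashing the key columns.
# Spark uses its own hash partitioner; this only illustrates the concept.

def partition_rows(rows, num_partitions=14):
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        key = (row["storeId"], row["transactionDate"])
        buckets[hash(key) % num_partitions].append(row)
    return buckets

# Invented sample data: 5 stores x 2 dates = 10 rows.
rows = [
    {"storeId": s, "transactionDate": d}
    for s in range(5)
    for d in ("2020-01-01", "2020-01-02")
]
buckets = partition_rows(rows)
assert len(buckets) == 14                       # exactly 14 parts
assert sum(len(b) for b in buckets) == len(rows)  # no rows lost
```

Note that rows sharing the same (storeId, transactionDate) key always land in the same bucket, which is the point of column-based repartitioning.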
transactionsDf.select('storeId').dropDuplicates().count()
Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.
transactionsDf.select(count('storeId')).dropDuplicates()
No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame containing the number of non-null values in column storeId. dropDuplicates() has no effect in this context, since that single-row result contains no duplicates to drop.
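For reference, count('storeId') counts non-null entries rather than unique ones. A pure-Python analogy (sample values invented for illustration):

```python
# Analog of pyspark.sql.functions.count('storeId'): non-null entries only.
values = [1, None, 2, None, 2]
non_null_count = sum(1 for v in values if v is not None)
print(non_null_count)  # 3 -- counts 1, 2, 2; ignores the two None entries
```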
transactionsDf.dropDuplicates().agg(count('storeId'))
Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not take only column storeId into consideration; it eliminates full-row duplicates instead. The resulting count can therefore exceed the number of unique storeId values.
transactionsDf.distinct().select('storeId').count()
Wrong. transactionsDf.distinct() identifies unique rows across all columns, not rows that are unique with respect to column storeId alone. Duplicate values may therefore remain in that column, so the count would not represent the number of unique values in it.
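A minimal pure-Python analogy (not Spark; sample rows invented for illustration) of why distinct full rows differ from distinct values in one column:

```python
# Rows as (storeId, transactionDate) tuples. Two rows share storeId 1 but
# differ in transactionDate, so both survive a full-row distinct.
rows = [(1, "2020-01-01"), (1, "2020-01-02"),
        (2, "2020-01-01"), (2, "2020-01-01")]

distinct_rows = set(rows)                    # analog of transactionsDf.distinct()
store_ids_after = [r[0] for r in distinct_rows]
unique_store_ids = {r[0] for r in rows}      # analog of select('storeId').dropDuplicates()

print(len(store_ids_after))   # 3 -- storeId 1 still appears twice
print(len(unique_store_ids))  # 2 -- the actual number of unique values
```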
transactionsDf.select(distinct('storeId')).count()
False. There is no distinct function in pyspark.sql.functions; distinct() exists only as a DataFrame method.