Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
transactionsDf.select('storeId').dropDuplicates().count()
Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.
transactionsDf.select(count('storeId')).dropDuplicates()
No. transactionsDf.select(count('storeId')) returns a single-row DataFrame holding the number of non-null values in storeId. Because that result has only one row, dropDuplicates() has no effect here, and the expression returns a DataFrame rather than a number.
transactionsDf.dropDuplicates().agg(count('storeId'))
Incorrect. transactionsDf.dropDuplicates() removes only rows that are duplicated across all columns; it does not deduplicate with respect to column storeId alone. count('storeId') then counts the non-null storeId values in the remaining rows, which may still contain repeats.
transactionsDf.distinct().select('storeId').count()
Wrong. transactionsDf.distinct() keeps rows that are unique across all columns, not rows that are unique with respect to column storeId alone. Duplicate values can therefore remain in the column, so the count does not represent the number of unique values in it.
transactionsDf.select(distinct('storeId')).count()
False. There is no distinct function in pyspark.sql.functions; distinct() is a method of DataFrame.