Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?
Schema of first partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
Schema of second partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)
This is a very tricky question: it requires knowledge both about schema merging and about how schemas are handled when reading Parquet files.
spark.read.option('mergeSchema', 'true').parquet(filePath)
Correct. The DataFrameReader's mergeSchema option works well here, since the columns that appear in both partitions have matching data types. Note that mergeSchema would fail if a column with the same name appeared in both partitions with different data types.
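A minimal sketch of the correct answer in use (assuming spark is an active SparkSession and filePath points at the two-partition Parquet directory described above):

# Hedged sketch: read both partitions and merge their schemas.
df = spark.read.option('mergeSchema', 'true').parquet(filePath)
df.printSchema()  # should list all eight distinct columns exactly once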
spark.read.parquet(filePath)
Incorrect. While this would read in the data from both partitions, only the schema of the Parquet file that is read in first would be considered, so columns that appear only in the second partition (e.g. tax_id) would be lost.
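A quick way to see the difference, as a hedged sketch under the same assumptions as above, is to compare the column sets of a plain read and a merged read:

# Hedged sketch: columns dropped by the plain read show up in the difference.
cols_plain = set(spark.read.parquet(filePath).columns)
cols_merged = set(spark.read.option('mergeSchema', 'true').parquet(filePath).columns)
print(cols_merged - cols_plain)  # which columns are missing depends on which file's schema the plain read picks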
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx + 1
df
Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all the data, it requires that both partitions have the exact same number of columns, matched by position, with identical data types - which is not the case here.
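As an aside (not one of the answer options), a loop like the one above could be made to work on Spark 3.1+ with DataFrame.unionByName and allowMissingColumns=True, which fills columns missing from one side with nulls instead of failing; a hedged sketch:

# Hedged sketch: tolerate differing schemas by matching columns by name.
df = None
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    df = df_temp if df is None else df.unionByName(df_temp, allowMissingColumns=True)
df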
spark.read.parquet(filePath, mergeSchema='y')
False. While using the mergeSchema option is the correct way to solve this problem, and it can even be passed to DataFrameReader.parquet() as in this code block, the option accepts the value True as a boolean or as a string. 'y' is not a valid value.
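For comparison, a hedged sketch of the keyword-argument form with a valid value (same assumptions as above):

# Hedged sketch: mergeSchema passed directly to DataFrameReader.parquet().
df = spark.read.parquet(filePath, mergeSchema=True)  # the string 'true' also works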
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how='outer')
    nx = nx + 1
df
No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated - and the question says that all columns included in the partitions should appear exactly once.
More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium
Static notebook | Dynamic notebook: See test 3, Question 37 (Databricks import instructions)
The code block shown below should return a copy of DataFrame transactionsDf with an added column cos. This column should have the values in column value converted to degrees and having
the cosine of those converted values taken, rounded to two decimals. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__, round(__3__(__4__(__5__)),2))
Correct code block:
transactionsDf.withColumn('cos', round(cos(degrees(transactionsDf.value)),2))
This question is especially confusing because the answer options col('cos') and 'cos' look so similar. Similar-looking answer options can also appear in the exam and, just as in this question, you need to pay attention to the details to identify the correct answer option.
The first answer option to throw out is the one that starts with withColumnRenamed: the question speaks specifically of adding a column. The withColumnRenamed operator only renames an existing column, however, so you cannot use it here.
Next, you will have to decide what should be in gap 2, the first argument of transactionsDf.withColumn(). Looking at the documentation (linked below), you can find out that the first argument of
withColumn actually needs to be a string with the name of the column to be added. So, any answer that includes col('cos') as the option for gap 2 can be disregarded.
This leaves you with two possible answers. The real difference between them is where the cos and degrees methods go: either into gaps 3 and 4, or vice versa. From the question you can find out that the new column should have 'the values in column value converted to degrees and having the cosine of those converted values taken'. This prescribes a clear order of operations: first, you convert the values from column value to degrees, and then you take the cosine of those values. So, the inner parenthesis (gap 4) should contain the degrees method and, logically, gap 3 then holds the cos method. This leaves you with just one possible correct answer.
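Putting it together, a minimal runnable sketch (assuming transactionsDf exists with an integer column value, as in the sample data used throughout this test):

from pyspark.sql.functions import cos, degrees, round as spark_round

# Hedged sketch: add column 'cos' = round(cos(degrees(value)), 2).
# spark_round is only an alias to avoid shadowing Python's built-in round().
result = transactionsDf.withColumn('cos', spark_round(cos(degrees(transactionsDf.value)), 2))
result.select('value', 'cos').show()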
More info: pyspark.sql.DataFrame.withColumn --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 49 (Databricks import instructions)
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier
whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
Output of correct code block:
+----------------------------------+------+
|itemName |col |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy |
|Outdoors Backpack |green |
|Outdoors Backpack |summer|
|Outdoors Backpack |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways of solving gap questions and filtering out wrong answers: you do not always have to start filtering from the first gap, but can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are actually synonyms in PySpark, so using either of those is fine. The answer options to this gap therefore do
not help us in selecting the right answer.
The second gap is more interesting. One answer option includes 'Sports'.isin(col('Supplier')). This construct does not work, since Python's string type does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable; we have not set this variable, so this is not a viable answer. That leaves you with the answer options that include col('supplier').contains('Sports') and col('supplier').isin('Sports'). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here.
We would use the isin operator if we wanted to filter for supplier names that exactly match any entry in a list of supplier names.
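For illustration, a hedged sketch using the sample data above - isin checks for exact membership in a list rather than for a substring:

from pyspark.sql.functions import col

# Hedged sketch: isin() keeps rows whose supplier exactly matches a list entry.
itemsDf.filter(col('supplier').isin(['Sports Company Inc.', 'YetiX'])).show()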
Finally, we are left with two answers that both fill the third gap with 'itemName' and fill the fourth gap with either explode('attributes') or 'attributes'. While both are correct Spark syntax, only explode('attributes') will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row - this is exactly what the explode() operator does. One answer option also includes array_explode(), which is not a valid operator in PySpark.
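Putting the gaps together, a minimal runnable sketch of the complete answer (assuming itemsDf exists as sampled above):

from pyspark.sql.functions import col, explode

# Hedged sketch: keep suppliers whose name contains 'Sports', then emit one
# attribute per row next to the item name; the exploded column is named 'col'.
itemsDf.filter(col('supplier').contains('Sports')).select('itemName', explode('attributes')).show(truncate=False)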
More info: pyspark.sql.functions.explode --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 39 (Databricks import instructions)