Cloudera CCA175 Exam Questions

Status: RETIRED

Exam Name: CCA Spark and Hadoop Developer

Exam Code: CCA175

Related Certification(s): Cloudera Certified Associate Certification

Certification Provider: Cloudera

Number of CCA175 practice questions in our database: 96 (updated: 16-08-2024)

Expected CCA175 Exam Topics, as suggested by Cloudera:

Topic 1: Understand the fundamentals of querying datasets in Spark / Write the results back into HDFS using Spark
Topic 2: Write queries that calculate aggregate statistics / Load data from HDFS for use in Spark applications
Topic 3: Use metastore tables as an input source or an output sink for Spark applications / Filter data using Spark
Topic 4: Generate reports by using queries against loaded data / Produce ranked or sorted data
Topic 5: Perform standard extract, transform, load (ETL) processes on data using the Spark API / Join disparate datasets using Spark
Topic 6: Use Spark SQL to interact with the metastore programmatically in your applications / Read and write files in a variety of file formats

Olga

10 months ago

Passing the Cloudera CCA Spark and Hadoop Developer exam was a great achievement for me, and I attribute my success to practicing with Pass4Success practice questions. The exam tested my knowledge of writing queries that calculate aggregate statistics and loading data from HDFS for use in Spark applications. One question that I found particularly tricky was about writing results back into HDFS using Spark. Despite my initial uncertainty, I was able to answer it correctly and pass the exam.

Larae

11 months ago

My exam experience was successful as I passed the Cloudera CCA Spark and Hadoop Developer exam. The topics of loading data from HDFS for use in Spark applications were crucial for the exam. One question that I remember was about understanding the fundamentals of querying datasets in Spark. It was a challenging question, but I was able to answer it correctly and pass the exam.

Daisy

11 months ago

Just passed the CCA Spark and Hadoop Developer exam! Be prepared for hands-on questions on Spark SQL transformations. Focus on understanding window functions and their applications. Thanks to Pass4Success for the spot-on practice questions that helped me prepare efficiently!

Sharmaine

12 months ago

I recently passed the Cloudera CCA Spark and Hadoop Developer exam with the help of Pass4Success practice questions. The exam covered topics such as querying datasets in Spark and writing results back into HDFS. One question that stood out to me was related to writing queries that calculate aggregate statistics. I was a bit unsure of the answer, but I managed to pass the exam.

Free Cloudera CCA175 Exam Actual Questions

Note: Premium questions for CCA175 were last updated on 16-08-2024.

Question #1

Problem Scenario 32 : You have been given three files as below.

spark3/sparkdir1/file1.txt

spark3/sparkdir2/file2.txt

spark3/sparkdir3/file3.txt

Each file contains some text.

spark3/sparkdir1/file1.txt

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework

spark3/sparkdir2/file2.txt

The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

spark3/sparkdir3/file3.txt

This approach takes advantage of data locality (nodes manipulating the data they have access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking

Now write Spark code in Scala that loads all three files from HDFS and does a word count, filtering out the following words. The result should be sorted by word count in descending order.

Filter words ("a","the","an", "as", "a","with","this","these","is","are","in", "for", "to","and","The","of")

Also make sure you load all three files as a single RDD (all three files must be loaded using a single API call).

You have also been given the following codec:

import org.apache.hadoop.io.compress.GzipCodec

Please use the above codec to compress the output when saving it to HDFS.

A. Solution:
Step 1: Create all three files in HDFS (here using Hue). Alternatively, create them in the local filesystem first and then upload them to HDFS.
Step 2: Load the content of all three files with a single call.
val content = sc.textFile("spark3/sparkdir1/file1.txt,spark3/sparkdir2/file2.txt,spark3/sparkdir3/file3.txt") // load all three files as one RDD
Step 3: Split each line and create an RDD of words.
val flatContent = content.flatMap(line => line.split(" "))
Step 4: Trim surrounding whitespace from each word.
val trimmedContent = flatContent.map(word => word.trim)
Step 5: Create an RDD of the words that need to be removed.
val removeRDD = sc.parallelize(List("a", "the", "an", "as", "a", "with", "this", "these", "is", "are", "in", "for", "to", "and", "The", "of"))
Step 6: Filter the RDD so that it keeps only words that are not present in removeRDD.
val filtered = trimmedContent.subtract(removeRDD)
Step 7: Create a pair RDD of (word, 1) tuples.
val pairRDD = filtered.map(word => (word, 1))
Step 8: Do the word count on the pair RDD.
val wordCount = pairRDD.reduceByKey(_ + _)
Step 9: Swap each pair to (count, word).
val swapped = wordCount.map(item => item.swap)
Step 10: Sort the content by count in descending order.
val sortedOutput = swapped.sortByKey(false)
Step 11: Save the output as a text file.
sortedOutput.saveAsTextFile("spark3/result")
Step 12: Save the compressed output.
import org.apache.hadoop.io.compress.GzipCodec
sortedOutput.saveAsTextFile("spark3/compressedresult", classOf[GzipCodec])

B. Solution:
Step 1: Create all three files in HDFS (here using Hue). Alternatively, create them in the local filesystem first and then upload them to HDFS.
Step 2: Load the content of all three files with a single call.
val content = sc.textFile("spark3/sparkdir1/file1.txt,spark3/sparkdir2/file2.txt,spark3/sparkdir3/file3.txt") // load all three files as one RDD
Step 3: Split each line and create an RDD of words.
val flatContent = content.flatMap(line => line.split(" "))
Step 4: Trim surrounding whitespace from each word.
val trimmedContent = flatContent.map(word => word.trim)
Step 5: Create an RDD of the words that need to be removed.
val removeRDD = sc.parallelize(List("a", "the", "an", "as", "a", "with", "this", "these", "is", "are", "in", "for", "to", "and", "The", "of"))
Step 6: Filter the RDD so that it keeps only words that are not present in removeRDD.
val filtered = trimmedContent.subtract(removeRDD)
Step 7: Create a pair RDD of (word, 1) tuples.
val pairRDD = filtered.map(word => (word, 1))
Step 8: Do the word count on the pair RDD.
val wordCount = pairRDD.reduceByKey(_ + _)
Step 9: Swap each pair to (count, word).
val swapped = wordCount.map(item => item.swap)

Correct Answer: A
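
The word-count steps in solution A can be traced without a cluster. The following is a minimal plain-Python sketch of the same logic (split, trim, filter, count, swap, sort descending); it does not use Spark, and the function name word_count and the two sample lines are hypothetical, chosen only for illustration. Unlike the Scala steps, it also drops empty tokens.

```python
from collections import Counter

# Stop words copied from the problem statement.
STOP_WORDS = {"a", "the", "an", "as", "with", "this", "these",
              "is", "are", "in", "for", "to", "and", "The", "of"}


def word_count(lines, stop_words=STOP_WORDS):
    """Return (count, word) pairs, highest count first."""
    # Steps 3-4 (flatMap + trim): split each line on spaces, strip each token.
    words = (w.strip() for line in lines for w in line.split(" "))
    # Steps 5-6 (subtract): drop stop words; empty tokens are dropped too,
    # which is an addition of this sketch.
    kept = (w for w in words if w and w not in stop_words)
    # Steps 7-8 (map to (word, 1) + reduceByKey): count each word.
    counts = Counter(kept)
    # Steps 9-10 (swap + sortByKey(false)): order by count, descending.
    return sorted(((n, w) for w, n in counts.items()), reverse=True)


# Hypothetical sample input standing in for the three HDFS files.
sample = ["Hadoop splits files into large blocks",
          "Hadoop distributes the blocks across nodes in a cluster"]
result = word_count(sample)
for count, word in result:
    print(count, word)
```

Sorting the swapped (count, word) tuples in reverse mirrors sortByKey(false); ties on the count are broken here by the word itself, which Spark does not guarantee.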

Question #2

Problem Scenario 79 : You have been given a MySQL DB with the following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.orders

table=retail_db.order_items

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish the following activities.

1. Copy the "retail_db.products" table to HDFS in a directory p93_products

2. Filter out all the empty prices

3. Sort all the products based on price in both ascending and descending order.

4. Sort all the products based on price as well as product_id in descending order.

5. Use the functions below to do data ordering or ranking and fetch the top 10 elements: top(), takeOrdered(), sortByKey()

A. Solution:
Step 1: Import a single table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93_products -m 1
Note: Make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2: Read the data from one of the partitions created by the above command.
hadoop fs -cat p93_products/part-m-00000
Step 3: Load this directory as an RDD using Spark and Python (open a pyspark terminal and do the following).
productsRDD = sc.textFile('p93_products')
Step 4: Filter out empty prices, if any exist.
# filter out lines with an empty price
nonempty_lines = productsRDD.filter(lambda x: len(x.split(',')[4]) > 0)
Step 5: Sort the data by product_price in ascending order.
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(',')[4]), line.split(',')[2])).sortByKey()
for line in sortedPriceProducts.collect(): print(line)
Step 6: Sort the data by product_price in descending order.
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(',')[4]), line.split(',')[2])).sortByKey(False)
for line in sortedPriceProducts.collect(): print(line)
Step 7: Get the name of the highest-priced product.
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(',')[4]), line.split(',')[2])).sortByKey(False).take(1)
print(sortedPriceProducts)
Step 8: Sort the data by product_price as well as product_id in descending order.
# Don't forget to cast the strings
# Tuple as key: ((price, id), name)
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line ...
print(sortedPriceProducts)
Step 9: Sort the data by product_price as well as product_id in descending order, using the top() function.
# Don't forget to cast the strings
# Tuple as key: ((price, id), name)
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.s ...
print(sortedPriceProducts)
Step 10: Sort the data by product_price ascending and product_id ascending, using the takeOrdered() function.
# Don't forget to cast the strings
# Tuple as key: ((price, id), name)
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.split(',')[4]), int(line.split(',')[0])), line.split(',')[2])).takeOrdered(10, lambda t: (t[0][0], t[0][1]))
Step 11: Sort the data by product_price descending and product_id ascending, using the takeOrdered() function.
# Don't forget to cast the strings
# Tuple as key: ((price, id), name)
# Negating the value with minus (-) gives descending order; this works only for numeric values.
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.split(',')[4]), int(line.split(',')[0])), line.split(',')[2])).takeOrdered(10, lambda t: (-t[0][0], t[0][1]))

B. Solution:
Step 1: Import a single table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93_products -m 1
Note: Make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2: Read the data from one of the partitions created by the above command.
hadoop fs -cat p93_products/part-m-00000
Step 3: Load this directory as an RDD using Spark and Python (open a pyspark terminal and do the following).
productsRDD = sc.textFile('p93_products')
Step 4: Filter out empty prices, if any exist.
# filter out lines with an empty price
nonempty_lines = productsRDD.filter(lambda x: len(x.split(',')[4]) > 0)
Step 5: Sort the data by product_price in ascending order.
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(',')[4]), line.split(',')[2])).sortByKey()
for line in sortedPriceProducts.collect(): print(line)
Step 6: Sort the data by product_price in descending order.
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(',')[4]), line.split(',')[2])).sortByKey(False)
for line in sortedPriceProducts.collect(): print(line)
Step 7: Get the name of the highest-priced product.
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(',')[4]), line.split(',')[2])).sortByKey(False).take(1)
print(sortedPriceProducts)
Step 8: Sort the data by product_price as well as product_id in descending order.
# Don't forget to cast the strings
# Tuple as key: ((price, id), name)
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line ...
print(sortedPriceProducts)
Step 9: Sort the data by product_price as well as product_id in descending order, using the top() function.
# Don't forget to cast the strings
# Tuple as key: ((price, id), name)
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.s ...
print(sortedPriceProducts)
Step 10: Sort the data by product_price descending and product_id ascending, using the takeOrdered() function.
# Don't forget to cast the strings
# Tuple as key: ((price, id), name)
# Negating the value with minus (-) gives descending order; this works only for numeric values.
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.split(',')[4]), int(line.split(',')[0])), line.split(',')[2])).takeOrdered(10, lambda t: (-t[0][0], t[0][1]))

Correct Answer: A
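
The ordering behaviour of top() and takeOrdered() with a composite ((price, id), name) key can be illustrated locally. The sketch below is plain Python, not PySpark: heapq.nlargest and heapq.nsmallest stand in for RDD.top and RDD.takeOrdered, and the five sample rows are hypothetical records that only follow the column positions the solution indexes (field 0 = id, field 2 = name, field 4 = price).

```python
import heapq

# Hypothetical rows; the third one has an empty price and must be filtered out.
lines = [
    "1,2,Tent,,59.98,img",
    "2,2,Ball,,129.99,img",
    "3,2,Jersey,,,img",
    "4,2,Cleat,,129.99,img",
    "5,2,Cap,,24.99,img",
]


def parse(line):
    """Build the ((price, id), name) record, casting the strings."""
    fields = line.split(",")
    return ((float(fields[4]), int(fields[0])), fields[2])


# Same predicate as the filter step in the solution.
non_empty = [l for l in lines if len(l.split(",")[4]) > 0]
records = [parse(l) for l in non_empty]

# Stand-in for top(2): largest keys first (price desc, then id desc).
top2 = heapq.nlargest(2, records)

# Stand-in for takeOrdered(2, key): price asc, then id asc.
asc2 = heapq.nsmallest(2, records, key=lambda t: (t[0][0], t[0][1]))

# Negating the numeric price flips only that component:
# price desc, id asc.
desc_price_asc_id = heapq.nsmallest(
    2, records, key=lambda t: (-t[0][0], t[0][1]))

print(top2)
print(asc2)
print(desc_price_asc_id)
```

The two products priced 129.99 show the difference: top() returns id 4 before id 2, while the negated-price key returns id 2 before id 4.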

Question #3

Problem Scenario 80 : You have been given a MySQL DB with the following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.products

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish the following activities.

1. Copy the "retail_db.products" table to HDFS in a directory p93_products

2. Now sort the products data by product price per category; use the product_category_id column to group by category

A. Solution:
Step 1: Import a single table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93_products
Note: Make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2: Read the data from one of the partitions created by the above command.
hadoop fs -cat p93_products/part-m-00000
Step 3: Load this directory as an RDD using Spark and Python (open a pyspark terminal and do the following).
productsRDD = sc.textFile('p93_products')
Step 4: Filter out empty prices, if any exist.
# filter out lines with an empty price
nonempty_lines = productsRDD.filter(lambda x: len(x.split(',')[4]) > 0)
Step 5: Create a data set like (categoryId, (id, name, price)).
mappedRDD = nonempty_lines.map(lambda line: (line.split(',')[1], (line.split(',')[0], line.split(',')[2], float(line.split(',')[4]))))
for line in mappedRDD.collect(): print(line)
Step 6: Group all records by categoryId, which is the key of mappedRDD. This produces output like (categoryId, iterable of all records for that categoryId).
groupByCategoryId = mappedRDD.groupByKey()
for line in groupByCategoryId.collect(): print(line)
Step 7: Sort the data in each category by price in ascending order.
# sorted() sorts an iterable; the key argument selects the field to sort on, which here is the price.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2])).take(5)
Step 8: Sort the data in each category by price in descending order.
# sorted() sorts an iterable; the key argument selects the field to sort on, which here is the price.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2], reverse=True)).take(5)

B. Solution:
Step 1: Import a single table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93_products
Note: Make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2: Read the data from one of the partitions created by the above command.
hadoop fs -cat p93_products/part-m-00000
Step 3: Load this directory as an RDD using Spark and Python (open a pyspark terminal and do the following).
productsRDD = sc.textFile('p93_products')
Step 4: Filter out empty prices, if any exist.
# filter out lines with an empty price
nonempty_lines = productsRDD.filter(lambda x: len(x.split(',')[4]) > 0)
Step 5: Create a data set like (categoryId, (id, name, price)).
mappedRDD = nonempty_lines.map(lambda line: (line.split(',')[1], (line.split(',')[0], line.split(',')[2], float(line.split(',')[4]))))
for line in mappedRDD.collect(): print(line)
Step 6: Sort the data in each category by price in ascending order.
# sorted() sorts an iterable; the key argument selects the field to sort on, which here is the price.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2])).take(5)
Step 7: Sort the data in each category by price in descending order.
# sorted() sorts an iterable; the key argument selects the field to sort on, which here is the price.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2], reverse=True)).take(5)

Correct Answer: A
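
Solution A depends on the grouping step that solution B omits. The following plain-Python sketch shows the same map, group-by-key, and per-group sort without Spark; the helper name group_sorted and the sample rows are hypothetical, and only the column positions (0 = id, 1 = category id, 2 = name, 4 = price) follow the solution.

```python
from collections import defaultdict

# Hypothetical rows; the fourth has an empty price and is skipped.
lines = [
    "1,2,Tent,,59.98,img",
    "2,2,Ball,,129.99,img",
    "3,3,Cap,,24.99,img",
    "4,3,Glove,,,img",
    "5,3,Bat,,89.00,img",
]


def group_sorted(rows, descending=False):
    """Group (id, name, price) records by category, sorted by price."""
    groups = defaultdict(list)
    for row in rows:
        fields = row.split(",")
        if len(fields[4]) == 0:
            continue  # same predicate as the empty-price filter
        # map step: key = category id, value = (id, name, price)
        groups[fields[1]].append((fields[0], fields[2], float(fields[4])))
    # groupByKey is implicit in the dict; sorted() orders each group by price.
    return {cat: sorted(items, key=lambda v: v[2], reverse=descending)
            for cat, items in groups.items()}


ascending = group_sorted(lines)
descending = group_sorted(lines, descending=True)
print(ascending)
print(descending)
```

Without the grouping step there is nothing to iterate per category, which is why the steps in solution B cannot run as written.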

Question #4

Problem Scenario 94 : You have to run your Spark application on YARN with 20 GB of memory per executor and 50 executors. Please replace XXX, YYY, ZZZ.

export HADOOP_CONF_DIR=XXX

./bin/spark-submit \
--class com.hadoopexam.MyTask \
XXX \
--deploy-mode cluster \ # can be client for client mode
YYY \
ZZZ \
/path/to/hadoopexam.jar \
1000

A. Solution:
XXX: --master yarn
YYY: --executor-memory 20G
ZZZ: --num-executors 50

B. Solution:
XXX: --master yarn
YYY: --executor-memory 40G
ZZZ: --num-executors 80

Correct Answer: A
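
With the placeholders filled in from solution A, the full command line can be assembled as follows. This is only a sketch: the helper spark_submit_args is hypothetical, and it merely builds the argument list from the values given in the question; it does not launch Spark.

```python
import shlex


def spark_submit_args(main_class, jar, app_args,
                      executor_memory="20G", num_executors=50,
                      deploy_mode="cluster"):
    """Build the spark-submit argument list for a YARN run."""
    return [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", "yarn",                     # XXX
        "--deploy-mode", deploy_mode,           # can be "client"
        "--executor-memory", executor_memory,   # YYY
        "--num-executors", str(num_executors),  # ZZZ
        jar,
        *app_args,
    ]


cmd = spark_submit_args("com.hadoopexam.MyTask",
                        "/path/to/hadoopexam.jar", ["1000"])
command_line = shlex.join(cmd)
print(command_line)
```

Keeping the arguments as a list avoids shell quoting problems if the list is later passed to subprocess.run.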

Question #5

Problem Scenario 88 : You have been given the three files below.

product.csv (Create this file in hdfs)

productID,productCode,name,quantity,price,supplierid

1001,PEN,Pen Red,5000,1.23,501

1002,PEN,Pen Blue,8000,1.25,501

1003,PEN,Pen Black,2000,1.25,501

1004,PEC,Pencil 2B,10000,0.48,502

1005,PEC,Pencil 2H,8000,0.49,502

1006,PEC,Pencil HB,0,9999.99,502

2001,PEC,Pencil 3B,500,0.52,501

2002,PEC,Pencil 4B,200,0.62,501

2003,PEC,Pencil 5B,100,0.73,501

2004,PEC,Pencil 6B,500,0.47,502

supplier.csv

supplierid,name,phone

501,ABC Traders,88881111

502,XYZ Company,88882222

503,QQ Corp,88883333

products_suppliers.csv

productID,supplierID

2001,501

2002,501

2003,501

2004,502

2001,503

Now accomplish all the queries given in the solution.

1. The same product can be supplied by multiple suppliers. Find each product and its price according to each supplier.

2. Find the names of all suppliers who supply 'Pencil 3B'.

3. Find all the products that are supplied by ABC Traders.

A. Solution:
Step 1: The same product can be supplied by multiple suppliers. Find each product and its price according to each supplier.
val results = sqlContext.sql("""SELECT products.name AS 'Product Name', price, suppliers.name AS 'Supplier Name'
FROM products_suppliers
JOIN products ON products_suppliers.productID = products.productID
JOIN suppliers ON products_suppliers.supplierID = suppliers.supplierID""")
results.show()
Step 2: Find the names of all suppliers who supply 'Pencil 3B'.
val results = sqlContext.sql("""SELECT p.name AS 'Product Name', s.name AS 'Supplier Name'
FROM products_suppliers AS ps
Step 3: Find all the products that are supplied by ABC Traders.
val results = sqlContext.sql("""SELECT p.name AS 'Product Name', s.name AS 'Supplier Name'
FROM products AS p, products_suppliers AS ps, suppliers AS s
WHERE p.productID = ps.productID AND ps.supplierID = s.supplierID
AND s.name = 'ABC Traders'""")
results.show()

B. Solution:
Step 1: The same product can be supplied by multiple suppliers. Find each product and its price according to each supplier.
val results = sqlContext.sql("""SELECT products.name AS 'Product Name', price, suppliers.name AS 'Supplier Name'
FROM products_suppliers
JOIN products ON products_suppliers.productID = products.productID
JOIN suppliers ON products_suppliers.supplierID = suppliers.supplierID""")
results.show()
Step 2: Find the names of all suppliers who supply 'Pencil 3B'.
val results = sqlContext.sql("""SELECT p.name AS 'Product Name', s.name AS 'Supplier Name'
FROM products_suppliers AS ps
JOIN products AS p ON ps.productID = p.productID
JOIN suppliers AS s ON ps.supplierID = s.supplierID
WHERE p.name = 'Pencil 3B'""")
results.show()
Step 3: Find all the products that are supplied by ABC Traders.
val results = sqlContext.sql("""SELECT p.name AS 'Product Name', s.name AS 'Supplier Name'
FROM products AS p, products_suppliers AS ps, suppliers AS s
WHERE p.productID = ps.productID AND ps.supplierID = s.supplierID
AND s.name = 'ABC Traders'""")
results.show()

Correct Answer: B
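
The three joins can be checked against the sample data without a Spark cluster. The sketch below loads the rows from the question into an in-memory SQLite database; SQLite is only a stand-in for Spark SQL here (an assumption of this example, since dialect details such as alias quoting differ), and the ORDER BY clauses are added just to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Schema and rows follow product.csv, supplier.csv, products_suppliers.csv.
cur.executescript("""
CREATE TABLE products (productID INTEGER, productCode TEXT, name TEXT,
                       quantity INTEGER, price REAL, supplierid INTEGER);
CREATE TABLE suppliers (supplierID INTEGER, name TEXT, phone TEXT);
CREATE TABLE products_suppliers (productID INTEGER, supplierID INTEGER);
INSERT INTO products VALUES
  (1001,'PEN','Pen Red',5000,1.23,501), (1002,'PEN','Pen Blue',8000,1.25,501),
  (1003,'PEN','Pen Black',2000,1.25,501), (1004,'PEC','Pencil 2B',10000,0.48,502),
  (1005,'PEC','Pencil 2H',8000,0.49,502), (1006,'PEC','Pencil HB',0,9999.99,502),
  (2001,'PEC','Pencil 3B',500,0.52,501), (2002,'PEC','Pencil 4B',200,0.62,501),
  (2003,'PEC','Pencil 5B',100,0.73,501), (2004,'PEC','Pencil 6B',500,0.47,502);
INSERT INTO suppliers VALUES
  (501,'ABC Traders','88881111'), (502,'XYZ Company','88882222'),
  (503,'QQ Corp','88883333');
INSERT INTO products_suppliers VALUES
  (2001,501), (2002,501), (2003,501), (2004,502), (2001,503);
""")

JOIN = """FROM products_suppliers AS ps
JOIN products AS p ON ps.productID = p.productID
JOIN suppliers AS s ON ps.supplierID = s.supplierID"""

# Query 1: each product with its price and supplier.
per_supplier = cur.execute(
    f"SELECT p.name, p.price, s.name {JOIN} ORDER BY p.name, s.name"
).fetchall()

# Query 2: suppliers of 'Pencil 3B'.
pencil_3b_suppliers = [row[0] for row in cur.execute(
    f"SELECT s.name {JOIN} WHERE p.name = 'Pencil 3B' ORDER BY s.name")]

# Query 3: products supplied by ABC Traders.
abc_products = [row[0] for row in cur.execute(
    f"SELECT p.name {JOIN} WHERE s.name = 'ABC Traders' ORDER BY p.name")]

print(per_supplier)
print(pencil_3b_suppliers)
print(abc_products)
conn.close()
```

Because product 2001 appears twice in products_suppliers, 'Pencil 3B' is listed under two suppliers, which is the case query 2 is meant to surface.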
