
Databricks Certified Generative AI Engineer Associate Exam - Topic 4 Question 9 Discussion

Actual exam question for the Databricks Certified Generative AI Engineer Associate exam
Question #: 9
Topic #: 4

A Generative AI Engineer is creating an LLM-based application. The documents for its retriever have been chunked to a maximum of 512 tokens each. The Generative AI Engineer knows that cost and latency are more important than quality for this application. They have several context length levels to choose from.

Which will fulfill their need?
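For background on the setup above, chunking documents to a fixed token maximum is typically done with a token-based splitter. Below is a minimal sketch, assuming tiktoken's cl100k_base encoding; the question does not name a tokenizer, so treat the encoding choice as an assumption.

```python
# A minimal sketch of max-512-token chunking with tiktoken (encoding assumed).
import tiktoken

MAX_TOKENS = 512
enc = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Greedily split text into pieces of at most max_tokens tokens."""
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

chunks = chunk_document("some long document text " * 400)
print(len(chunks), "chunks")  # each chunk decodes from at most 512 tokens
```

Splitting on token boundaries rather than characters keeps each chunk aligned with the model's token budget, which matters when matching chunks to a context window.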

Suggested Answer: A

When cost and latency take priority over quality, the goal is the smallest, cheapest model whose context window still fits the workload. Here's why A is the suggested answer:

Fitting the retrieved chunks: The retriever returns chunks of up to 512 tokens, and the model's context window must hold a full chunk plus the user's query and any prompt instructions. A context length of 514 (option A) is the smallest offered window that still accommodates an entire 512-token chunk with room left over.

Minimizing cost and latency: Among the options whose windows fit the chunks, the smallest context length and model size yields the lowest per-query cost and latency, which matches the engineer's stated priorities.

Why the other options are not ideal:

D (context length 512): Matching the chunk size exactly leaves no headroom; a single full chunk would saturate the window before the query or the prompt template is added.

Larger context lengths (the discussion below mentions an option with 32,768 tokens and a 14GB model): These far exceed what 512-token chunks require, and the larger models bring higher cost and slower responses, the opposite of the stated priorities.

Choosing the smallest context window that still fits a full chunk plus the query keeps the application fast and cheap without breaking retrieval, which is exactly what option A provides.
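The reasoning above reduces to a filter-then-minimize rule. Below is a minimal sketch of that rule in Python. The context lengths and option D's 0.13GB size are quoted in the discussion below; the size for option A, the letter assignment for the 32,768-token option, and the two-token query allowance are illustrative assumptions, not the verbatim answer choices.

```python
# Sketch: keep models whose context window fits a full 512-token chunk plus
# the query, then pick the smallest remaining model for cost and latency.
CHUNK_MAX = 512      # maximum retriever chunk size, from the question
QUERY_BUDGET = 2     # hypothetical minimal allowance for query/prompt tokens

candidates = {
    "A": {"context_length": 514, "size_gb": 0.44},     # size assumed
    "C": {"context_length": 32_768, "size_gb": 14.0},  # letter assumed
    "D": {"context_length": 512, "size_gb": 0.13},
}

# Filter: the window must hold one full chunk plus the query budget.
viable = {
    name: spec
    for name, spec in candidates.items()
    if spec["context_length"] >= CHUNK_MAX + QUERY_BUDGET
}

# Minimize: cost and latency outrank quality, so take the smallest model.
best = min(viable, key=lambda name: viable[name]["size_gb"])
print(best)  # -> A; D's window is saturated by a single chunk
```

Any realistic query budget gives the same outcome: a window of exactly 512 tokens is already full after one chunk, so the smallest model with a strictly larger window wins.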


Contribute your Thoughts:

Dannette
3 months ago
512 tokens? Really? That seems too short for anything useful.
upvoted 0 times
Elouise
3 months ago
Definitely D! Smallest model and good enough for speed.
upvoted 0 times
Regenia
3 months ago
But isn't 512 a bit limiting for some tasks?
upvoted 0 times
Ezekiel
4 months ago
I agree, cost and latency matter more here.
upvoted 0 times
Chanel
4 months ago
Context length 512 is the way to go!
upvoted 0 times
Muriel
4 months ago
I’m a bit confused about the context lengths. I thought longer context lengths were better for quality, but here they want to prioritize cost. Maybe D is the safest bet?
upvoted 0 times
Darrin
4 months ago
I think I saw a similar question where the focus was on balancing performance and cost. I wonder if option A could be a contender since it's just over 512 tokens.
upvoted 0 times
Lillian
4 months ago
I'm not entirely sure, but I feel like context length 512 is what we practiced with in class. It seems to fit the requirements.
upvoted 0 times
Hector
5 months ago
I remember that for cost and latency, smaller models are usually better, so I think option D might be the right choice.
upvoted 0 times
Daniela
5 months ago
I'm leaning towards option A. Even though it's a bit larger than option D, the 514 context length might be a better match for the 512 token chunking mentioned in the question. I'll have to weigh the trade-offs between size and context length.
upvoted 0 times
Precious
5 months ago
Okay, I think I've got it. The key here is that cost and latency are more important than quality, so we want the smallest model possible. Option D with a 0.13GB model size and 384 embedding dimension seems like the best fit for the requirements.
upvoted 0 times
Kate
5 months ago
Hmm, I'm a bit confused here. The question mentions that the documents have been chunked to 512 tokens, so I'm not sure if option D with a context length of 512 would be the best choice. I'll have to think this through a bit more.
upvoted 0 times
Dante
5 months ago
This seems pretty straightforward. Since cost and latency are more important than quality, I'd go with the smallest model option, which is D.
upvoted 0 times
Hortencia
9 months ago
Haha, I love how the options just keep getting more and more absurd. 14GB for a model? They must be running this on a supercomputer!
upvoted 0 times
Luther
8 months ago
Agreed, option D seems like the best choice for their needs.
upvoted 0 times
Lucy
8 months ago
I think the engineer should go with option D, it's more cost-effective.
upvoted 0 times
Brandon
8 months ago
I know right, I can't imagine the processing power needed for that.
upvoted 0 times
Paulina
8 months ago
Yeah, 14GB for a model is crazy!
upvoted 0 times
Junita
10 months ago
32,768 tokens? Are they trying to build Skynet or something? I think they need to dial it back a bit and focus on the practical needs.
upvoted 0 times
Amie
8 months ago
I think option D with context length 512 would be more practical.
upvoted 0 times
Jesus
9 months ago
Yeah, they should focus on cost and latency.
upvoted 0 times
Dacia
9 months ago
I agree, 32,768 tokens seems excessive.
upvoted 0 times
Robt
10 months ago
Hmm, 514 tokens might work, but that extra cost and size is probably not worth it. I'd go with the 384 embedding dimension option.
upvoted 0 times
Shawna
8 months ago
I think the 384 embedding dimension with 512 tokens is the way to go for this application.
upvoted 0 times
Cassie
9 months ago
Yeah, the extra cost and size for the 514 tokens might not be worth it.
upvoted 0 times
Jeff
9 months ago
I agree, the 384 embedding dimension option seems like the best choice.
upvoted 0 times
Stephaine
10 months ago
Wow, 512 tokens per chunk? That's really compact. I guess they're going for speed and efficiency, not the highest quality.
upvoted 0 times
Mattie
9 months ago
Yeah, I agree. The smaller model size and lower embedding dimension would definitely help with cost and latency.
upvoted 0 times
Ashanti
10 months ago
I think option D with context length 512 would be the best choice for speed and efficiency.
upvoted 0 times
Cherry
11 months ago
But the Generative AI Engineer mentioned that cost and latency are more important than quality, so a smaller model like option D might be more efficient.
upvoted 0 times
Maile
11 months ago
I disagree, I believe option A with context length 514 is more suitable for this application.
upvoted 0 times
Cherry
11 months ago
I think option D with context length 512 would be the best choice.
upvoted 0 times
