Databricks Certified Generative AI Engineer Associate Exam - Topic 2 Question 26 Discussion

Actual exam question for Databricks's Databricks Certified Generative AI Engineer Associate exam

Question #: 26
Topic #: 2

[All Databricks Certified Generative AI Engineer Associate Questions]

A Generative Al Engineer is creating an LLM-based application. The documents for its retriever have been chunked to a maximum of 512 tokens each. The Generative Al Engineer knows that cost and latency are more important than quality for this application. They have several context length levels to choose from.

Which will fulfill their need?

Acontext length 514; smallest model is 0.44GB and embedding dimension 768

Bcontext length 2048: smallest model is 11GB and embedding dimension 2560

Ccontext length 32768: smallest model is 14GB and embedding dimension 4096

Dcontext length 512: smallest model is 0.13GB and embedding dimension 384

Show Suggested Answer

Suggested Answer: D

When prioritizing cost and latency over quality in a Large Language Model (LLM)-based application, it is crucial to select a configuration that minimizes both computational resources and latency while still providing reasonable performance. Here's why D is the best choice:

Context length: The context length of 512 tokens aligns with the chunk size used for the documents (maximum of 512 tokens per chunk). This is sufficient for capturing the needed information and generating responses without unnecessary overhead.

Smallest model size: The model with a size of 0.13GB is significantly smaller than the other options. This small footprint ensures faster inference times and lower memory usage, which directly reduces both latency and cost.

Embedding dimension: While the embedding dimension of 384 is smaller than the other options, it is still adequate for tasks where cost and speed are more important than precision and depth of understanding.

This setup achieves the desired balance between cost-efficiency and reasonable performance in a latency-sensitive, cost-conscious application.

by Katina at Apr 01, 2026, 06:13 AM

Limited Time Offer

25%

2 months ago

I remember we discussed how smaller models can reduce latency, so I think option D might be the best choice here.

upvoted 0 times

...