An education company wants to build a private tutor application. The application will give users the ability to enter text or provide a picture of a question. The application will respond with a written answer and an explanation of the written answer.
Which model type meets these requirements?
Comprehensive and Detailed Explanation From Exact AWS AI documents:
A multimodal large language model (LLM) can:
Accept both text and image inputs
Understand visual and textual context
Generate coherent written explanations
AWS generative AI guidance positions multimodal LLMs as the best choice for applications requiring cross-modal understanding and text generation.
Why the other options are incorrect:
Computer vision (A) does not generate text explanations.
Diffusion models (C) generate images.
Text-to-speech (D) converts text to audio.
AWS AI document references:
Multimodal Foundation Models on AWS
Building AI Tutors with Generative Models
Currently there are no comments in this discussion, be the first to comment!