Microsoft AI-103 Exam - Topic 5 Question 2 Discussion

Actual exam question for Microsoft's AI-103 exam

Question #: 2
Topic #: 5

You are creating an agent workflow in a Microsoft Foundry project to support natural voice interactions.

The agent must receive continuous audio input, convert the input into text for reasoning, and then return spoken responses to a user. The workflow must meet the following requirements:

• Support turn-taking dynamics, where the agent begins to generate the speech output before the user finishes speaking.

• Operate with low latency to maintain a conversational experience.

You need to enable both speech to text and text to speech in a real-time agent interaction.

What should you do?

A. Use an embeddings model to encode the audio, and then decode the audio into text and speech.

B. Use batch transcription to convert the audio input and return text responses from the agent.

C. Use speech translation to convert the audio into another language and return the translated text.

D. Use real-time speech to text for incoming audio and text to speech for agent responses.

Suggested Answer: D

The correct answer is D: use real-time speech to text for incoming audio and text to speech for agent responses. The workflow requires continuous audio input, low-latency transcription for reasoning, and spoken output back to the user. Real-time speech to text in Azure Speech in Foundry Tools is designed to transcribe streaming audio as it arrives, which covers the incoming-audio side of the interaction. Text to speech provides the outbound path: it turns the agent's generated answer into spoken audio.
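The loop described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not Azure code: `recognize_stream`, `agent_reply`, and `synthesize` are hypothetical stand-ins for the real-time speech to text service, the agent's reasoning step, and the text to speech service. Audio chunks are modeled as already-heard words, and an empty string marks the silence that ends a turn.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def recognize_stream(chunks: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Hypothetical stand-in for real-time speech to text.

    Yields ("partial", text) while the user is still speaking and
    ("final", text) when an empty chunk (silence) ends the utterance.
    """
    words: list[str] = []
    for chunk in chunks:
        if chunk:
            words.append(chunk)
            yield ("partial", " ".join(words))
        elif words:
            yield ("final", " ".join(words))
            words = []
    if words:  # the stream ended without trailing silence
        yield ("final", " ".join(words))


def agent_reply(text: str) -> str:
    """Hypothetical stand-in for the agent's reasoning step."""
    return f"You said: {text}"


def synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for text to speech: one audio chunk per word."""
    for word in text.split():
        yield word.encode("utf-8")


def run_voice_loop(chunks: Iterable[str]) -> list[tuple[str, object]]:
    """Record every event in order: partial text, final text, audio out."""
    events: list[tuple[str, object]] = []
    for kind, text in recognize_stream(chunks):
        events.append((kind, text))
        if kind == "final":
            # Reason over the transcript, then stream the spoken reply.
            for audio in synthesize(agent_reply(text)):
                events.append(("audio", audio))
    return events


events = run_voice_loop(["book", "a", "table", ""])
```

In this sketch the partial results arrive while the simulated user is still talking, which is the property that separates the real-time path from batch transcription. A production agent would replace the three stand-ins with the actual service clients.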

This pattern aligns with Microsoft's real-time voice-agent architecture. The Voice Live API overview explains that low-latency speech-to-speech systems integrate speech recognition, generative reasoning, and text-to-speech functionality to create natural voice experiences. It also identifies contact centers as a key scenario and highlights low perceived latency for end users.

Embeddings do not decode audio into conversational speech. Batch transcription introduces file-oriented delay and is not suitable for turn-taking. Speech translation is only appropriate when translating between languages and does not provide the required reasoning-plus-spoken-response loop.

Reference topics: Azure Speech in Foundry Tools, real-time speech to text, text to speech, voice agents, low-latency interaction, and conversational turn-taking.
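To see why file-oriented batch transcription conflicts with turn-taking, a toy timing model may help. Everything here is an assumption made for illustration: the function names are hypothetical, and the chunk length and processing times are invented round numbers, not measurements of any Azure service.

```python
def first_text_delay_batch(chunk_seconds: float, n_chunks: int,
                           processing_seconds: float) -> float:
    """Batch: the whole recording must exist before transcription starts."""
    return chunk_seconds * n_chunks + processing_seconds


def first_text_delay_streaming(chunk_seconds: float,
                               per_chunk_processing_seconds: float) -> float:
    """Streaming: the first partial result can follow the first chunk."""
    return chunk_seconds + per_chunk_processing_seconds


# Illustrative values only: a 10-second utterance split into 0.5 s chunks.
batch_delay = first_text_delay_batch(0.5, 20, 2.0)
streaming_delay = first_text_delay_streaming(0.5, 0.25)
print(batch_delay, streaming_delay)
```

Under these assumed numbers, the batch path produces no text until after the user has finished speaking, while the streaming path produces text during the utterance. The absolute values do not matter; the point is that the batch delay grows with utterance length and the streaming delay does not.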

by Floyd at May 30, 2026, 09:14 PM
