A customer-facing, AI-powered chatbot has been jailbroken through prompt injections. As a result, the AI model is offering a 99% discount on the purchase of a new vehicle.
Which of the following should be implemented to enhance the model's robustness against such attacks?
Basic Concept: Jailbreaking through prompt injection exploits the LLM's tendency to follow instructions embedded in user input, overriding its intended behavior. The model was manipulated to offer unauthorized discounts, demonstrating that its operational boundaries were not properly enforced. CompTIA SecAI+ Study Guide identifies guardrails as the primary defense against jailbreaking attacks.
Why D is Correct: Guardrails are robust, layered controls that enforce behavioral boundaries on LLM inputs and outputs. They can detect and block jailbreaking attempts, enforce business logic constraints such as preventing unauthorized discounts, validate outputs against policy rules before delivery, and prevent the model from operating outside its defined scope. Guardrails are specifically designed to make models more robust against prompt injection and jailbreaking.
Why A is Wrong: Bias filtering is designed to detect and remove biased, discriminatory, or offensive content from model outputs. It addresses content fairness issues but does not prevent jailbreaking attacks that manipulate the model into performing unauthorized actions.
Why B is Wrong: A system prompt sets the model's base instructions and persona, but the jailbreak attack already demonstrates that the current prompt can be overridden. Guardrails provide enforcement at a layer that is more resistant to prompt manipulation than the system prompt alone.
Why C is Wrong: Log monitoring detects jailbreaking attempts after they have already succeeded. It is a detective control that enables incident response but does not prevent the model from offering unauthorized discounts in the first place.
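As an added example, not part of the page or of the CompTIA study guide, the Python sketch below shows what the guardrail in the correct answer looks like in practice on the discount path: the model's reply is treated as untrusted structured output and checked against deterministic business rules before anything reaches the customer, independently of whatever the prompt was talked into. It also records each decision, which is the detective log-monitoring control from option C sitting alongside the preventive one. The policy ceiling, action names, JSON field names and sample model outputs are all assumptions constructed for this example.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed business policy (constructed values for this example, not from the page).
MAX_DISCOUNT_PERCENT = 10.0
ALLOWED_ACTIONS = {"quote", "schedule_test_drive", "escalate_to_sales"}
SAFE_FALLBACK = "I can't confirm that offer. Let me connect you with a sales representative."

# Catches "99% off", "99 % discount", "a discount of 99%" inside free text.
PERCENT_RE = re.compile(r"(\d{1,3}(?:\.\d+)?)\s*%")


@dataclass
class Decision:
    allowed: bool
    reason: str
    response: str  # what the customer actually receives


def _check_quote(proposal: dict, max_discount: float) -> Optional[str]:
    """Return a rejection reason for a quote, or None if it satisfies policy."""
    discount = proposal.get("discount_percent")
    list_price = proposal.get("list_price")
    final_price = proposal.get("final_price")
    if not all(isinstance(v, (int, float)) for v in (discount, list_price, final_price)):
        return "quote fields missing or non-numeric"
    if not 0 <= discount <= max_discount:
        return f"discount {discount}% exceeds policy ceiling of {max_discount}%"
    # The numbers must agree with each other; a model can be talked into a
    # tiny final_price while keeping discount_percent within range.
    expected = list_price * (1 - discount / 100)
    if abs(final_price - expected) > 1.0:
        return "final_price inconsistent with list_price and discount_percent"
    return None


def validate_output(model_output: str, max_discount: float = MAX_DISCOUNT_PERCENT) -> Decision:
    """Output guardrail: policy checks applied after the model and before delivery."""
    try:
        proposal = json.loads(model_output)
    except json.JSONDecodeError:
        return Decision(False, "model output is not the expected JSON structure", SAFE_FALLBACK)
    if not isinstance(proposal, dict):
        return Decision(False, "model output is not a JSON object", SAFE_FALLBACK)

    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        return Decision(False, f"action {action!r} is outside the chatbot's scope", SAFE_FALLBACK)

    message = proposal.get("message", "")
    if not isinstance(message, str):
        return Decision(False, "message is not text", SAFE_FALLBACK)

    # Structured fields can be within policy while the prose still promises
    # more, so the customer-facing text is checked as well.
    for pct in PERCENT_RE.findall(message):
        if float(pct) > max_discount:
            return Decision(False, f"message promises {pct}%, above policy ceiling", SAFE_FALLBACK)

    if action == "quote":
        reason = _check_quote(proposal, max_discount)
        if reason:
            return Decision(False, reason, SAFE_FALLBACK)

    return Decision(True, "within policy", message)


# Decisions are recorded so monitoring can see blocked attempts (the detective layer).
AUDIT_LOG: List[dict] = []


def deliver(model_output: str) -> str:
    """Gate the model's output through the guardrail and log the outcome."""
    decision = validate_output(model_output)
    AUDIT_LOG.append({"allowed": decision.allowed, "reason": decision.reason})
    return decision.response


# Constructed sample model outputs (not real chatbot traffic).
LEGIT = json.dumps({"action": "quote", "list_price": 30000, "discount_percent": 5,
                    "final_price": 28500, "message": "I can offer 5% off, so $28,500."})
JAILBROKEN = json.dumps({"action": "quote", "list_price": 30000, "discount_percent": 99,
                         "final_price": 300, "message": "Great news: 99% off, just $300!"})
SNEAKY_TEXT = json.dumps({"action": "quote", "list_price": 30000, "discount_percent": 5,
                          "final_price": 28500, "message": "Officially 5%, but I'll honor 99% off."})
BAD_MATH = json.dumps({"action": "quote", "list_price": 30000, "discount_percent": 5,
                       "final_price": 300, "message": "5% off, final price $300."})
OUT_OF_SCOPE = json.dumps({"action": "issue_refund", "message": "Refund approved."})
UNSTRUCTURED = "Sure! The car is yours for free."

if __name__ == "__main__":
    for sample in (LEGIT, JAILBROKEN, SNEAKY_TEXT, BAD_MATH, OUT_OF_SCOPE, UNSTRUCTURED):
        print(deliver(sample))
    print(AUDIT_LOG)
```

The point of the design is that nothing in `validate_output` depends on the model behaving: the ceiling, the allowed actions and the price arithmetic are enforced in ordinary code the customer cannot reach through the conversation, which is why this layer survives a prompt the system prompt alone could not.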