AI red teaming is the deliberate, systematic attempt to break an AI system — probing it from the perspective of a malicious actor to surface security weaknesses, harmful outputs, and policy violations before they reach production. The term is borrowed from traditional cybersecurity red teaming, where an offensive team (the "red team") attacks their own organization's defenses so that the defensive team (the "blue team") can improve them. For AI systems, this adversarial mindset is applied to the unique failure modes of machine learning models: models that can be manipulated through language, that may leak training data, and that can produce dangerous or deceptive content under the right conditions.
Unlike traditional red teaming — which primarily targets network infrastructure, software vulnerabilities, and human social engineering vectors — AI red teaming must account for a fundamentally different attack surface. Traditional penetration testing operates on deterministic systems where a given input reliably produces a given output. AI models are probabilistic: the same prompt can yield different results across sessions, and subtle wording changes can cause wildly different behaviors. AI red teamers must therefore test not just specific exploits but entire categories of model behavior, often requiring creativity and domain expertise rather than off-the-shelf tooling.
**Five Core Attack Categories**
1. **Prompt injection** — Crafting inputs that override a model's system instructions, causing it to ignore safety guardrails, reveal confidential configuration, or act on behalf of an attacker rather than the legitimate user.
2. **Jailbreaking** — Using role-play scenarios, hypothetical framings, multi-step manipulation, or adversarial prompt structures to bypass content safety policies and elicit outputs the model was explicitly trained to refuse.
3. **Data poisoning** — Inserting malicious or misleading examples into a model's training or fine-tuning dataset to degrade performance, introduce backdoors, or bias the model toward specific harmful outputs at inference time.
4. **Model extraction** — Systematically querying a model to reconstruct a functional approximation of its weights or decision boundaries, enabling competitors or attackers to steal proprietary AI capabilities without authorization.
5. **Adversarial inputs** — Applying mathematically crafted perturbations to images, audio, or text that are imperceptible to humans but reliably cause the model to misclassify, mistranscribe, or produce incorrect outputs — a concern especially in high-stakes domains like medical imaging or fraud detection.
**Who Performs AI Red Teaming**
AI red teaming is conducted by three main groups. Internal security teams with AI expertise run continuous assessments as models are updated, integrating red teaming into the MLOps pipeline. AI safety researchers — employed by organizations like OpenAI, Anthropic, Google DeepMind, and government bodies such as the UK AI Safety Institute — perform pre-release evaluations of frontier models to assess risks at the capability level. Third-party auditors and specialized AI security firms provide independent assessments, offering an outside-in perspective that internal teams may miss due to familiarity bias.
Emerging regulatory frameworks are formalizing these requirements. The EU AI Act mandates adversarial testing for high-risk AI systems. The U.S. Executive Order on AI (2023) required red team evaluations for powerful foundation models before public release. NIST's AI Risk Management Framework includes adversarial testing as a core component of the "Measure" function. As enterprises deploy AI agents with access to sensitive systems, AI red teaming is transitioning from a pre-deployment activity to a continuous security discipline — one as fundamental as penetration testing is for traditional software.