For years, content moderation teams have struggled with the rigidity of traditional machine learning models. Static classifiers, expensive retraining cycles, and opaque decision-making have made it difficult to keep up with rapidly evolving online behavior.
Advancements in large language models (LLMs) such as GPT-5, Gemini 3, and Claude Opus 4.1 enable more flexible, transparent, and high-precision moderation workflows. Written by expert T&S practitioners, this practical guide covers the prompt engineering and performance considerations needed to reach 90%+ accuracy.
Why LLMs Are Now a Viable Option For Content Moderation
Before we get started, let's quickly cover why you'd want to use LLMs for content moderation at all. LLMs have reached a level of accuracy, affordability, and ease of use that makes them preferable to traditional machine learning (ML) models in many ways. Costs are especially reasonable when using distilled or non-reasoning variants for classification tasks, often comparable to ML models when deployed at scale through commercial tools. Additionally, LLMs can moderate nuanced content in ways that traditional ML can't.
While in-house ML models may still win on raw compute cost, they require extensive engineering and data science resources. They are also inherently less adaptable, often requiring full retraining to adjust to new definitions of harm or emerging platform behaviors. LLMs, by contrast, are prompt-driven — meaning custom updates can be made in minutes, not months.
LLMs:
- Good for complex, evolving, or nuanced decisions
- Can be customized with just a few examples
- Adaptable to policy changes, unique policies, or emerging language without retraining
- Provide rationales for their outputs
- Completely customizable without data science resources
- Can be more expensive than traditional ML (though this depends on the model) but cheaper than human moderation
- Slower decision-making speed / latency (still <1 second)
Traditional ML / fixed classifiers:
- Good for well-defined, static classification tasks
- Requires extensive labeled training data
- Must be retrained for each policy update
- Offers limited transparency
- Customization is expensive and time-consuming
- Can be the cheaper option when done in-house
- Faster decision-making speed / latency
Real-World Examples of Using LLMs for Content Moderation
Companies across industries are already using LLMs for content moderation at scale:
- Pinterest uses LLMs to identify emerging violative content patterns and measure policy prevalence, allowing their T&S team to respond to new threats and make strategic decisions faster.
- DoorDash deployed SafeChat, an LLM-powered feature that decreased low- and medium-severity safety incidents in chat by 50%.
- Etsy uses LLMs to understand and classify their vast inventory of handmade and vintage items, improving both search relevance and policy enforcement across millions of unique product listings.
These companies built in-house solutions, but many T&S teams don't have the engineering resources or timeline for custom development. Commercial tools like Musubi's PolicyAI provide the same capabilities without the build overhead, letting you test policies, compare models, and deploy to production in days instead of months.
The Two Levers: Model Selection and Prompt Engineering
LLM-based moderation gives you two primary ways to improve performance:
Lever 1: Model Selection
Different models have different strengths, and the right choice depends on your priorities:
- Large frontier models (e.g., GPT-5, Claude Opus, Gemini 3) offer the highest accuracy and can handle complex, nuanced policies with edge cases. They're slower and more expensive but may be worth it for high-stakes decisions or when you need strong reasoning capabilities.
- Small, efficient models (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash) are faster and cheaper, making them ideal for high-volume classification where speed matters and policies are more straightforward. They can achieve excellent accuracy on well-defined policies.
- Open-source safety-tuned models like GPT-OSS Safeguard and Nvidia's Nemotron are purpose-built for content moderation and can be self-hosted for data privacy. They offer a middle ground: good accuracy on common safety policies with lower costs than frontier models, but they require more technical support to host and use.
The key tradeoff is accuracy vs. latency vs. price: bigger models will be more accurate (especially on complex policies) but slower and more expensive, even with identical prompts. Most teams start with a mid-tier model and scale up or down based on actual performance needs.
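One way to keep this tradeoff flexible is to treat model choice as configuration rather than code. Below is a minimal sketch of that idea; the model names, tier labels, and policy routing are illustrative assumptions, not recommendations.

```python
# Illustrative sketch: map each policy to a model tier so models can be swapped
# per policy as real accuracy, latency, and cost numbers come in.
MODEL_TIERS = {
    "frontier": "gpt-5",                 # highest accuracy; slowest and most expensive
    "mid": "gpt-4o-mini",                # fast and cheap; a common starting point
    "self_hosted": "gpt-oss-safeguard",  # placeholder for a self-hosted safety-tuned model
}

POLICY_ROUTING = {
    "hate_speech": "mid",           # well-defined policy: start small, scale up if accuracy lags
    "medical_misinfo": "frontier",  # nuanced, high-stakes: pay for stronger reasoning
}

def model_for(policy_name: str) -> str:
    """Return the model to use for a given policy, defaulting to the mid tier."""
    return MODEL_TIERS[POLICY_ROUTING.get(policy_name, "mid")]
```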
Lever 2: Prompt / Policy Engineering
Once you've selected a model, your prompt is the primary way to tune performance. Even with the same model, a well-engineered prompt can improve accuracy by 20-30 percentage points. The rest of this guide focuses on prompt engineering because it's where T&S teams have the most direct control.
Introducing Policy Engineering
Policy engineering is the process of translating human-written moderation guidelines into LLM-readable instructions. It’s a hybrid of policy design, prompt development, debugging, and iteration.
For LLM moderation, policy engineering is the core lever for improving performance. Unlike traditional ML pipelines where changes require engineers and T&S experts to work together, T&S teams can now test, update, and refine LLM-based systems directly. This shift enables rapid prototyping, A/B testing of policy variants, and continuous improvement — all without engineering.
This is especially helpful for startups and lean Trust & Safety teams, or in situations where speed to update is critical.
How to Format Policies for LLMs
LLMs understand natural language — but how you format that language makes a big difference. There are three effective styles:
- Simplified Natural Language
- Uses plain English
- Easy to write and interpret
- May lack precision for edge cases
- Structured Format with Examples
- Clearly separates violations and non-violations
- Uses headers, bullets, and examples
- Most effective for consistent classification
- Rule-Based Logic
- Provides conditional logic (e.g., IF/THEN)
- Works best with models capable of basic reasoning
- Ideal for policies with lots of exceptions or multi-step qualifiers
Best Practices for Policy Formatting
Do:
- Use plain English, not legalese or platform jargon.
- Keep the policies as concise as possible.
- Clearly describe what counts as a violation or exception.
- Use Markdown, bullets, and sections.
- Include examples of both violations and non-violations.
- Define key terms clearly.
- Evaluate early and adjust based on real results.
Avoid:
- Vague terms like “inappropriate” or “offensive” without definitions.
- Policies that reference data the model can’t access (e.g., intent, user history).
- Dense, unstructured prose (LLMs degrade when parsing dense blocks of text.)
It’s possible to ask an LLM to help you rewrite a policy in one of these styles, which can make the policy engineering process go faster. However, even when using reasoning models, we’ve found that LLMs often need multiple prompts and reminders on best practices (for example, reminders to define key terms, or to cut out extraneous language).
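As a rough sketch of what that can look like, the snippet below asks an LLM to do the first-pass rewrite with those reminders baked into the instruction. It assumes the OpenAI Python SDK and an illustrative model name; treat the output as a draft to review and edit, not a finished policy.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_INSTRUCTIONS = """Rewrite the policy below into a structured, LLM-readable format:
- Use Markdown headers for Violations, Not Violations, and Examples.
- Define every key term (e.g., list protected groups explicitly).
- Replace vague words like "inappropriate" with concrete criteria.
- Cut extraneous or aspirational language; keep it as concise as possible.
- Do not reference data a classifier cannot see (e.g., intent, user history)."""

human_policy = "Everyone has a right to use our platform free of harassment..."  # your existing policy text

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; any capable model works
    messages=[
        {"role": "system", "content": REWRITE_INSTRUCTIONS},
        {"role": "user", "content": human_policy},
    ],
)
print(response.choices[0].message.content)  # review, edit, and test before using in production
```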
Example of Translating Human Policies
Sometimes the policies that are the best for humans (nuanced, evocative, outlining the spirit of the rule) are the worst for LLMs. Let’s take a look at a great human-readable policy and how it might look when rewritten for an LLM.
Original Policy: Reddit
“Everyone has a right to use Reddit free of harassment, bullying, and threats of violence. Communities and users that incite violence or that promote hate based on identity or vulnerability will be banned.”
LLM Challenges:
- Ambiguous terms: “promote hate,” “vulnerability”
- No definition of identities that are applicable (e.g., protected classes)
Version A: Simplified Natural Language
This version is easy to write and easy to understand, but may still be ambiguous without definitions or examples.
Do not allow posts that:
- Attack someone based on race, gender, religion, or other protected classes
- Encourage violence against any group
- Use slurs or hate terms targeting specific communities
Version B: Structured Format with Examples
This format will yield more precise results and is easier to audit. Examples help the model handle edge cases and nuanced scenarios, but they can require maintenance over time.
## Policy: Hate Speech
### Violations:
- Using slurs or derogatory terms targeting **protected groups**:
- Race or ethnicity
- Religion
- Gender or gender identity
- Sexual orientation
- Disability
- Promoting or glorifying violence against **protected groups**
### Not Violations:
- Educational content about hate speech
- Quoting slurs to critique them
- Reclaimed language
### Examples:
- Violation: “All [ethnic group] are criminals.”
- Safe: “Wow, I can't believe someone said ‘All [ethnic group] are criminals.’”
Version C: Rule-Based Logic
This format is also precise and easy to audit, but may be too rigid for some scenarios. It also may require LLMs capable of basic reasoning, depending on how complicated the logic gets.
If a post:
1. Targets a group based on protected characteristics AND
2. Uses slurs, dehumanizing language, or incites violence
→ Label as Hate Speech
Protected characteristics include:
- Race, ethnicity, religion, gender, gender identity, sexual orientation, disability
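To make this concrete, here is a minimal sketch of how one of these policy versions might be wired into an actual classification call. It assumes the OpenAI Python SDK, an illustrative small model, and a hypothetical file path for the policy text; asking for JSON output is one common pattern, not the only one.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The policy is whichever version (A, B, or C) you're testing; the path is hypothetical.
POLICY = open("policies/hate_speech_version_b.md").read()

SYSTEM_PROMPT = f"""You are a content moderation classifier.
Apply ONLY the policy below to the user-submitted content.

{POLICY}

Respond with JSON: {{"decision": "violation" | "not_violation", "rationale": "<one sentence>"}}"""

def classify(content: str) -> dict:
    """Classify a single piece of content against the policy."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; swap models to compare accuracy, latency, and cost
        temperature=0,        # keep decisions as repeatable as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
    )
    # In production, harden this parse (or use a structured-output feature if the model offers one).
    return json.loads(response.choices[0].message.content)

print(classify("Example post text goes here."))
```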
The Process of Prompt Engineering
Start with a single policy area and build a golden dataset of 100–200 labeled examples — including clear violations, clear non-violations, and edge cases. (A minimal evaluation sketch follows the steps below.)
- For each example, ask the LLM to:
- Make a decision: Violation / Not a Violation
- Explain its reasoning briefly
- Then perform a side-by-side review:
- Are mismatches due to human error? (Sometimes, yes: we've found examples where an LLM convinced us to change our label in the golden set.)
- Are model errors caused by a vague policy or mistakes in formatting? (One fun example I debugged: the policy was written as “no hate speech,” and the LLM couldn't decide whether this meant “no hate speech allowed” or “no hate speech is present,” so it gave answers based on both interpretations in the same golden set results.)
- Is performance consistent across edge cases?
- Refine and iterate:
- Simplify or clarify policy instructions
- Add representative examples and counterexamples
- Reorder prompts — order matters (for example, accuracy can improve when the most common policy violations are put first, or the most severe are put first)
- A/B test different prompt styles in production
- Set up escalation points or targeted audit mechanisms for cases where the LLM underperforms
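Here is the minimal evaluation sketch referenced above. It assumes a JSONL golden set with hypothetical field names, and it reuses the `classify()` function from the earlier sketch (imported here from a hypothetical `moderation.py`).

```python
import json

from moderation import classify  # the classify() sketch above, saved as a hypothetical moderation.py

# Hypothetical golden-set format: one JSON object per line with "content" and
# "label", where label is "violation" or "not_violation".
golden = [json.loads(line) for line in open("golden_set.jsonl")]

correct, mismatches = 0, []
for example in golden:
    result = classify(example["content"])
    if result["decision"] == example["label"]:
        correct += 1
    else:
        # Keep the model's rationale so the side-by-side review can distinguish
        # human labeling error, policy ambiguity, and genuine model error.
        mismatches.append({**example, **result})

print(f"Accuracy: {correct / len(golden):.1%} on {len(golden)} examples")
for m in mismatches[:20]:
    print(m["label"], "->", m["decision"], "|", m["rationale"], "|", m["content"][:80])
```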

Performance Considerations
With a strong prompt and a clean golden set, LLMs can achieve accuracy in the mid-to-high 90s on text-only moderation — often on par with expert human reviewers. LLMs aren’t perfect, though, and moderation teams should remain cautious in a few areas:
- There is always a risk of false positives/negatives, which is especially important in high-stakes domains (e.g., medical misinformation or political speech).
- They can’t assess intent, history, or off-platform behavior unless explicitly provided, and this gets prohibitively expensive and clunky (we recommend a mix of LLM content-labeling and account-level custom ML models that are behavior-aware such as Musubi’s AiMod).
- Commercial LLMs are trained for safety, which means they may be more conservative on some policies than you expect. These tendencies stem from RLHF (reinforcement learning from human feedback) and safety tuning aimed at generic use. You can overcome them with fine-tuning or custom instructions, but off-the-shelf behavior may reflect broader safety guardrails not designed for nuanced T&S use cases. That said, you can use this to your advantage if you want high recall on critical policy areas.
- LLM models can get confused with very long policy documents. If you have lots of complicated policies, you may need to break them into separate prompts for better accuracy (which costs more per review).
- As great as it sounds to have an LLM “escalate to a human” when it has trouble, this is difficult to do in practice because LLMs tend to be overconfident. There are a few methods to try, but use caution and experiment; one common approach is sketched below. Setting up checks and balances (such as manual auditing) is important here.
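A minimal sketch of that approach, with heavy caveats: ask the model to report a confidence score alongside its decision and route anything below a threshold to human review. Self-reported confidence is not calibrated, so the threshold has to be tuned against your golden set rather than trusted at face value. The field names, threshold, and `moderation` import are illustrative assumptions.

```python
from moderation import classify  # the classify() sketch above, with the prompt extended to also
                                 # return "confidence": a number between 0 and 1

CONFIDENCE_THRESHOLD = 0.8  # illustrative; calibrate against golden-set results rather than guessing

def route(content: str) -> str:
    """Return 'auto_action', 'auto_allow', or 'human_review'."""
    result = classify(content)
    confidence = result.get("confidence", 0.0)
    # Treat self-reported confidence as a rough signal only: LLMs often report
    # high confidence even when they're wrong, so keep manual audits in place.
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "auto_action" if result["decision"] == "violation" else "auto_allow"
```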
As with any moderation workflow, LLMs work best when there are checks and balances to ensure performance: for example, regularly auditing a random sample of decisions and reviewing user moderation appeals. For more information, read The Top Challenges of Using LLMs for Moderation (and How to Overcome Them).
How to Get Started Using AI for Content Moderation
For the first time, Trust & Safety professionals can design, test, and iterate on policy enforcement themselves — no engineering bottlenecks required. Policy engineering is the bridge: turning human policy into machine-readable rules, backed by fast feedback loops and scalable evaluation. The teams that adopt this mindset will ship better policies, faster — and build safer, smarter online platforms in the process.
Building in-house: You can call commercial LLMs directly using their APIs, which is relatively straightforward but locks you into one AI ecosystem, a potential issue as new models constantly emerge. You'll need to build your own testing infrastructure, manage model switching, and handle production deployment.
Using a platform: Tools like Musubi's PolicyAI are designed specifically for policy engineers, allowing you to write policies in any style, add examples, test against golden datasets, and compare models side-by-side before deploying to production, all without writing code. If you want to move faster without the engineering overhead, contact us for a demo.