How to use LLMs for Content Moderation (in 2026)

In This Guide:

    Nine months ago, when I wrote the first version of this guide, most Trust & Safety teams I talked to were still trying to decide whether to use LLMs for moderation at all. Today the conversation has moved on. The interesting questions now are about how to do this work well: which models to pick, how to structure policies, how to iterate quickly, which tools to use, and how to make systems auditable enough to defend to leadership and regulators. The answers have changed faster than I expected.

    A quick map of what's changed: more platforms are public about their LLM moderation systems and the architectures behind them. Manual prompt engineering matters less because automatic prompt optimization works and models are less brittle. Model choice has gotten harder, not easier, with more options and faster change. Agentic AI is genuinely transforming what a single practitioner can do without engineering support. And auditability has gone from nice-to-have to table stakes as more teams move LLMs into load-bearing positions in their stack.

    This is a hands-on guide for T&S practitioners actually doing the work. I'll cover each of these shifts in turn, with practical advice for where to invest your time.

    What platforms are sharing publicly

    Good case studies used to be hard to find. Now the examples are everywhere, and they're substantive. Here are two that I find especially interesting.

    Pinterest published a detailed paper this year on their Decision Quality Evaluation Framework, which uses golden datasets curated by subject matter experts to benchmark both human moderators and LLM agents on the same scale. Their core insight: "prompt optimization" remains a subjective art rather than a data-driven science without a rigorous evaluation framework to measure changes against. Their framework also addresses something I haven't seen written about as cleanly elsewhere — separating engineering instability (pipeline bugs, dependency changes, inference non-determinism) from genuine model quality drift caused by evolving content.

    DoorDash has been even more open about its moderation architecture, with two relevant systems documented on their engineering blog. The first is SafeChat, a layered moderation system for messages between Dashers and customers that has contributed to a roughly 50% reduction in low and medium-severity safety incidents since deployment. SafeChat uses a cascade architecture: a low-cost, high-recall filter automatically cleared about 90 percent of messages with minimal latency. Messages not cleared proceeded to a fast, low-cost large language model with higher precision, identifying 99.8 percent of messages as safe. The remaining messages were evaluated by a more precise, higher-cost LLM. The same team has also published on an "LLM-as-jury" pattern for moderating AI-generated content on their homepage, where three LLMs vote independently and any single veto blocks the content. It's a useful pattern to know about for any team that needs high recall on hard cases.

    The takeaway: LLM moderation isn't experimental anymore. Teams are sharing real architectures, metrics, and failure modes, and there's enough out there now that you can ground your own design decisions in what's actually working at other platforms.

    Policy engineering: what's the same, what's different

    The practice of policy engineering (translating human-written moderation guidelines into LLM-readable instructions) has evolved, but the fundamentals haven't.

    The fundamentals:

    • Be concrete about what counts as a violation and what doesn't (vague terms like "inappropriate" or "offensive" without definitions still produce inconsistent results).
    • Include examples of both violations and non-violations. A well-chosen counterexample is often more useful than another rule.
    • Define your terms. If you refer to "hate speech" or "protected classes", spell out exactly what that means for your platform.
    • Avoid asking the LLM to assess things it can't see, like intent, user history, off-platform behavior. None of this is visible from a single piece of content, and if your policy depends on these signals, you need a different architecture.

    There are also three formatting styles that work well for policy prompts, and which one to use depends on your policy's complexity and your tolerance for maintenance:

    • Simplified natural language is the most concise option — a short list of what's allowed and what isn't, written the way you'd explain it to a new hire. It's quick to write and easy to update, but it can leave room for ambiguity on edge cases. Good for policies that are genuinely simple or where you're early in the iteration loop.
    • Structured format with examples uses clear sections (violations, non-violations, examples of each) usually formatted with headers and bullets. It's the most reliable format for consistent classification because the examples train the model on exactly the edge cases you care about. The tradeoff is maintenance: as your policy evolves, your examples need to evolve with it.
    • Rule-based logic uses conditional structure (if the content does X and Y, then Z). It's the most precise format and the easiest to audit, but it's also the most brittle. It works best with reasoning-capable models, and it can feel rigid for policies that depend heavily on context. Good for policies with lots of exceptions or multi-step qualifiers.

      I recommend building or using systems that can switch between these formats depending on the policy and use-case. You want flexibility, not being locked into one system.

    What's different about policy engineering now is the workflow. The old practical advice was: write a prompt, test it against a golden set, iterate by hand, repeat. The "iterate by hand" step is where things have changed:

    • Automatic prompt optimization is now a viable starting point. Tools that take a golden dataset and a target metric and search for better prompt phrasings work well enough to use in production for many policies. They won't always beat a carefully hand-tuned prompt by a domain expert, but they get you to "good" much faster than starting from scratch.
    • Few-shot example selection can sometimes matter more than wording tweaks. Modern models are robust to small changes in phrasing. They're much more sensitive to which examples you put in the prompt. If you have a golden dataset, spending time curating the few-shot examples is usually a higher-leverage activity than rewording the policy itself.
    • Eval-driven iteration is the new core loop. The biggest change in how I work today is that almost everything starts with the evaluation, not the prompt. Build the golden set first. Build the metric you actually care about (often precision and recall at different decision thresholds, not just overall accuracy). Then write a prompt (or let an optimizer write one) and let the eval tell you what to fix.

    Choosing a model in 2026

    There are three main categories of model to consider: large frontier reasoning models, small efficient classifier models, and open-source safety-tuned models.

    • Start small for high-volume, well-defined policies. Small classifier-tier models are now accurate enough for most clear policies (spam, explicit violations, well-defined categories). If you're processing millions of pieces of content a day, this is where you start.
    • Use frontier reasoning models for hard cases. Edge cases, nuanced policies (hate speech, harassment, context-dependent harms), and decisions that need to be defensible to leadership or regulators benefit from a model that can reason.
    • Consider open-source safety-tuned models if data privacy or cost is critical. Self-hosting requires real engineering investment, but for high-volume use cases or sensitive content, it can be the right call.
    • Don't fall in love with one model. The frontier moves every few months. Build your system so you can swap models without rewriting your policies or your evals.
    • Many teams now use a cascade: a small model handles the easy 90%+ of decisions, and a frontier model handles the rest. DoorDash's SafeChat is a public example of this pattern in production. It's the same logic as triage in any other system: cheap and fast where you can, expensive and careful where you must.

    (For a deeper look at when LLMs are and aren't the right tool, and when rule-based systems or fixed ML classifiers could be a better fit, see our piece on rule-based vs. fixed ML vs. LLM moderation.)

    What agentic AI actually changes for practitioners

    I think this agentic AI is the most significant capability change for T&S practitioners since LLMs themselves arrived.

    Until recently, prompt engineering was genuinely a slog. I'd iterate on a policy for hours, tabbing back and forth between the policy and the eval, looking at disagreement cases, trying to figure out what to change, and running it again. A lot of the early advice about LLM moderation (including a lot of what I wrote myself) was effectively about how to make that slog more bearable: format the policy this way, structure your examples like this, and order your rules carefully.

    Agentic workflows have made the slog mostly go away. Instead of tabbing between windows, you can have a conversation. "Run this policy against my golden set. Show me the disagreement cases. What patterns do you see? Propose three policy revisions and tell me what each would change." The agent handles the mechanics; you focus on the policy decisions. What used to feel like a chore now feels fun and fast.

    This isn't about any specific product. It's about a general capability shift. Any platform that exposes the right primitives (read a policy, run an eval, compare results, update a policy, etc) can be driven from an agent today. If you're building internally, building toward agent-compatible APIs (or MCP servers specifically) is one of the highest-leverage architecture decisions you can make right now.

    The caveat is that agents amplify whatever you point them at. An agent running against a bad eval will optimize for the wrong thing very efficiently. An agent updating a policy without strong version control will lose your work. The fundamentals of evaluation and process discipline matter more, not less, when you're moving faster.

    Maintaining LLM moderation systems over time

    As more teams have moved from "we have an LLM doing some moderation" to "the LLM is a load-bearing part of our trust and safety stack," the operational discipline around maintenance has had to catch up. Here are the practices I see working:

    • Version your policies like code. Every change should have a version number, a timestamp, an author, and ideally a note about what changed and why. When a decision goes wrong, you need to be able to answer "which version of which policy made this call?"
    • Run regression evals on every policy change. Before a new policy version goes live, it should be evaluated against the golden set and any existing decisions should be re-scored. A policy change that improves recall on the target category but quietly tanks precision on a related one is the kind of thing only an eval will catch.
    • Refresh your golden set on a schedule. A golden set frozen on day one becomes obsolete as user behavior evolves. Quarterly refreshes are a reasonable starting point; for fast-moving platforms, monthly is better.
    • Benchmark across models periodically. New models arrive constantly. A quarterly benchmark across the current top options (frontier, mid-tier, small, safety-tuned) gives you a sense of whether the model you chose six months ago is still the right one.
    • Audit a random sample of production decisions. Automated metrics tell you the average; spot-checking decisions tells you what's actually happening to your users. This is also where you'll catch the failure modes that don't show up in your golden set.
    • Treat user appeals as a quality signal. Appeals data is one of the best sources of policy gaps and edge cases. Read them. Pattern-match them. Feed the patterns back into your golden set.

    Performance considerations and honest limits

    A well-engineered prompt with a clean golden set and a current model can hit high-90s accuracy on text moderation tasks. That's roughly on par with what expert human reviewers achieve. But "accuracy" isn't the whole story:

    • Confidence calibration is improving, and it matters. A lot of teams were wary of using LLMs for anything but the easy decisions, because an LLM can give you a confident answer that's totally wrong. We've been investing in methods for getting LLMs to produce accurate confidence scores on their decisions, and others in the field are working on this too. When it works, it lets you route the uncertain cases to humans and let the model handle the rest. If your tooling doesn't support reliable confidence routing, that's worth pushing on, because it's the difference between an LLM system that can only handle easy cases and one that can handle hard ones responsibly.
    • LLMs can't see what isn't in front of them. A single piece of content won't give the LLM information about itent, account history, off-platform behavior, or coordinated patterns across users. For these, you need a different layer of the system that looks at account-level signals, behavioral models, or human review on flagged accounts.
    • Commercial models have safety training that may not match your policy. Frontier models tend to be conservative on certain categories because of how they're tuned. This can help you (high recall on critical harms) or hurt you (false positives on legitimate content that touches a sensitive topic). Test, don't assume.
    • Long policy documents degrade accuracy. If your policy is more than a few hundred words, consider splitting it into focused prompts or using a routing layer that picks the relevant section.

    Where to go from here

    If you're starting fresh, the highest-leverage things to invest in are:

    1. A good golden dataset. Everything else builds on this. Spend real time on it.
    2. A clear, well-formatted policy. Plain language, defined terms, examples, counterexamples.
    3. An eval-driven iteration loop. Automate the loop so each change is fast to test.
    4. Version control and audit logging from day one. Future-you will thank present-you.
    5. A model swap strategy. Don't lock yourself into a single provider or model.

    For a deeper look at architecture, build vs. buy decisions, and the operational practices behind a mature LLM moderation system, read the companion implementation guide. If you're still weighing whether LLMs are the right tool at all, our piece comparing rule-based, fixed ML, and LLM-based moderation is a good place to start.

    The shape of this work has changed a lot in nine months. The fundamentals of clear policies, good evals, and honest measurement, haven't. Get those right, and the rest of the tooling is in service to them.

    How we're approaching this at Musubi

    A note on where I'm coming from: I'm a T&S practitioner first. I was tired of clunky tooling, slow iteration, vendor systems I couldn't customize, and no good way to manage policies as they evolved, so I joined Musubi to help design and build the things I wanted to use. The opinions in this article reflect how I think about the work. Other people in the field have different approaches and many of them are also right.

    Musubi has an LLM-based moderation platform, built around the practices in this article: policy versioning, golden datasets, eval-driven iteration, automatic prompt optimization with side-by-side suggestions, model comparison and benchmarking, and an MCP server so you can drive everything from an agent if you prefer. We built it because this is the workflow I wanted, not because it's the only way to do this well. If it sounds like the kind of tool that would help your team,  I'd love to chat. If you'd rather build your own, I hope this article is useful regardless.