How to Use LLMs for Content Moderation

For years, content moderation teams have struggled with the rigidity of traditional machine learning models. Static classifiers, expensive retraining cycles, and opaque decision-making have made it difficult to keep up with rapidly evolving online behavior.

Recent advancements in large language models (LLMs) such as GPT-4 and Claude enable more flexible, transparent, and high-precision moderation workflows. In this post, we explore how LLMs are reshaping content moderation, introduce the emerging practice of policy engineering, and outline practical steps for implementation.

Why LLMs Are Now a Viable Option

LLMs have reached a level of accuracy, affordability, and ease of use that makes them preferable to traditional machine learning (ML) models for many moderation tasks. Costs are especially reasonable when using distilled or non-reasoning variants for classification, and are often comparable to ML models when deployed at scale through commercial tools.

While in-house ML models may still win on raw compute cost, they require extensive engineering and data science resources. They are also inherently less adaptable, often requiring full retraining to adjust to new definitions of harm or emerging platform behaviors. LLMs, by contrast, are prompt-driven: policy updates can be made in minutes, not months.

Introducing Policy Engineering

Policy engineering is the process of translating human-written moderation guidelines into LLM-readable instructions. It’s a hybrid of policy design, prompt development, debugging, and iteration.

For LLM moderation, policy engineering is the core lever for improving performance. Unlike traditional ML pipelines, where changes require engineers and Trust & Safety (T&S) experts to work together, T&S teams can now test, update, and refine LLM-based systems directly. This shift enables rapid prototyping, A/B testing of policy variants, and continuous improvement, all without engineering involvement.

This is especially helpful for startups and lean T&S teams, or in any situation where update speed is critical.
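
To make that concrete, here is a minimal sketch of what a prompt-driven setup can look like: the policy lives in one editable string, and changing enforcement means editing that string, not retraining a model. The policy wording, output format, and helper name below are illustrative placeholders, not a prescribed template.

```python
# Minimal sketch: a moderation prompt is just policy text plus the content to review.
# The policy wording and output format below are illustrative placeholders.
HATE_SPEECH_POLICY = """\
## Policy: Hate Speech
Violations: slurs or derogatory terms targeting protected groups,
or promoting violence against them.
Not violations: educational content, quoting slurs to critique them.
"""

def build_moderation_prompt(content: str) -> str:
    """Assemble the prompt; T&S teams can edit the policy text without touching other code."""
    return (
        f"{HATE_SPEECH_POLICY}\n"
        "Review the post below against the policy above. Answer with exactly one line:\n"
        "'Violation' or 'Not a Violation', followed by a one-sentence reason.\n\n"
        f"Post: {content}"
    )

print(build_moderation_prompt("example post text"))
```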

How to Format Policies for LLMs

LLMs understand natural language — but how you format that language makes a big difference. There are three effective styles:

  1. Simplified Natural Language
    • Uses plain English
    • Easy to write and interpret
    • May lack precision for edge cases
  2. Structured Format with Examples
    • Clearly separates violations and non-violations
    • Uses headers, bullets, and examples
    • Most effective for consistent classification
  3. Rule-Based Logic
    • Provides conditional logic (e.g., IF/THEN)
    • Works best with models capable of basic reasoning
    • Ideal for policies with lots of exceptions or multi-step qualifiers

Best Practices for Policy Formatting

Do:

  • Use plain English, not legalese or platform jargon.
  • Keep the policies as concise as possible.
  • Clearly describe what counts as a violation or exception.
  • Use Markdown, bullets, and sections.
  • Include examples of both violations and non-violations.
  • Define key terms clearly.
  • Evaluate early and adjust based on real results.

Avoid:

  • Vague terms like “inappropriate” or “offensive” without definitions.
  • Policies that reference data the model can’t access (e.g., intent, user history).
  • Dense, unstructured prose (accuracy degrades when LLMs parse dense blocks of text).

It’s possible to ask an LLM to help you rewrite a policy in one of these styles, which can make the policy engineering process go faster. However, even when using reasoning models, we’ve found that LLMs often need multiple prompts and reminders on best practices (for example, reminders to define key terms, or to cut out extraneous language).
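
For illustration, here is one way such a rewrite request might look. The reminder list and prompt wording are examples of the kind of nudges we mean, not a prescribed template.

```python
# Illustrative only: the reminder list and prompt wording are examples, not a canonical template.
REWRITE_REMINDERS = [
    "Define every key term you use.",
    "Cut extraneous or evocative language; keep the policy concise.",
    "Separate violations from non-violations using headers and bullets.",
    "Include examples of both violations and non-violations.",
]

def build_rewrite_prompt(human_policy: str) -> str:
    """Ask an LLM to convert a human-readable policy into a structured, LLM-friendly version."""
    reminders = "\n".join(f"- {r}" for r in REWRITE_REMINDERS)
    return (
        "Rewrite the following moderation policy in a structured format with examples, "
        "so that another LLM can apply it consistently.\n"
        f"Follow these rules:\n{reminders}\n\n"
        f"Policy:\n{human_policy}"
    )
```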

Example of Translating Human Policies

Sometimes the policies that are the best for humans (nuanced, evocative, outlining the spirit of the rule) are the worst for LLMs. Let’s take a look at a great human-readable policy and how it might look when rewritten for an LLM.

Original Policy: Reddit

“Everyone has a right to use Reddit free of harassment, bullying, and threats of violence. Communities and users that incite violence or that promote hate based on identity or vulnerability will be banned.”

LLM Challenges:

  • Ambiguous terms: “promote hate,” “vulnerability”
  • No definition of which identities are covered (e.g., protected classes)

Version A: Simplified Natural Language

This version is easy to write and easy to understand, but may still be ambiguous without definitions or examples.

Do not allow posts that:
- Attack someone based on race, gender, religion, or other protected classes
- Encourage violence against any group
- Use slurs or hate terms targeting specific communities

Version B: Structured Format with Examples

This format yields more precise results and is easier to audit. Examples help steer the model on edge cases and nuanced scenarios, but they can require maintenance over time.
## Policy: Hate Speech

### Violations:

- Using slurs or derogatory terms targeting **protected groups**:
  - Race or ethnicity
  - Religion
  - Gender or gender identity
  - Sexual orientation
  - Disability
- Promoting or glorifying violence against **protected groups**

### Not Violations:
- Educational content about hate speech
- Quoting slurs to critique them
- Reclaimed language

### Examples:
- Violation: “All [ethnic group] are criminals.”
- Not a violation: “Wow, I can't believe someone said ‘All [ethnic group] are criminals.’”

Version C: Rule-Based Logic

This format is also precise and easy to audit, but may be too rigid for some scenarios. It also may require LLMs capable of basic reasoning, depending on how complicated the logic gets.

If a post:
1. Targets a group based on protected characteristics AND
2. Uses slurs, dehumanizing language, or incites violence
→ Label as Hate Speech

Protected characteristics include:
- Race, ethnicity, religion, gender, gender identity, sexual orientation, disability

The Prompt Engineering Process

Start with a single policy area and build a golden set of 100–200 labeled examples, including clear violations, clear non-violations, and edge cases.
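
A golden set doesn't need special tooling to start: a list of labeled examples with notes on why each label was chosen is enough. The field names below are illustrative; use whatever your review tooling expects.

```python
# One possible golden set format: labeled examples plus notes on why each label was chosen.
GOLDEN_SET = [
    {
        "text": "All [ethnic group] are criminals.",
        "label": "Violation",
        "notes": "Dehumanizing generalization about a protected group.",
    },
    {
        "text": "My history teacher explained why that slur is harmful.",
        "label": "Not a Violation",
        "notes": "Educational context; edge case worth keeping.",
    },
]
```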

  1. For each example, ask the LLM to:
    • Make a decision: Violation / Not a Violation
    • Explain its reasoning briefly
  2. Then perform a side-by-side review (a minimal evaluation sketch follows this list):
    • Are mismatches due to human error? (Sometimes yes; we’ve found examples where an LLM convinced us to change a label in the golden set)
    • Are model errors caused by vague policy or mistakes in formatting? (One fun example we debugged: the policy was written as “no hate speech,” and the LLM couldn’t decide whether this meant “no hate speech allowed” or “no hate speech is present,” so it gave answers based on both interpretations within the same golden set results)
    • Is performance consistent across edge cases?
  3. Refine and iterate:
    • Simplify or clarify policy instructions
    • Add representative examples and counterexamples
    • Reorder prompt content; order matters (for example, accuracy can improve when the most common policy violations are listed first, or when the most severe are)
    • A/B test different prompt styles in production
    • Set up escalation points or specific audit mechanisms for cases where the LLM underperforms
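
Here is a minimal evaluation sketch for the first two steps, under a couple of assumptions: `classify` stands in for whatever model call you are testing (any function that takes post text and returns a decision plus a short reasoning string), and the golden set uses the shape from the earlier sketch.

```python
# Minimal evaluation sketch. `classify` is any function that takes post text and returns
# (decision, reasoning) -- for example, a thin wrapper around your LLM call.
from typing import Callable, Tuple

def evaluate(golden_set: list, classify: Callable[[str], Tuple[str, str]]) -> None:
    mismatches = []
    for example in golden_set:
        decision, reasoning = classify(example["text"])
        if decision != example["label"]:
            mismatches.append((example, decision, reasoning))
    accuracy = 1 - len(mismatches) / len(golden_set)
    print(f"Accuracy: {accuracy:.1%} ({len(mismatches)} mismatches)")
    # Review each mismatch side by side: human error, vague policy, or a real model miss?
    for example, decision, reasoning in mismatches:
        print(f"\nText: {example['text']}")
        print(f"Human label: {example['label']}  |  LLM decision: {decision}")
        print(f"LLM reasoning: {reasoning}")
```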

Performance Considerations

With a strong prompt and a clean golden set, LLMs can achieve accuracy in the mid-to-high 90s on text-only moderation, often on par with expert human reviewers. LLMs aren’t perfect, though, and moderation teams should remain cautious in a few areas:

  • LLMs may hallucinate or lack up-to-date information. There is always a risk of false positives/negatives, which is especially important in high-stakes domains (e.g., medical misinformation or political speech).
  • They can’t assess intent, history, or off-platform behavior unless that context is explicitly provided, which quickly becomes expensive and clunky (we recommend a mix of LLM content labeling and account-level custom ML models that are behavior-aware, such as Musubi’s AiMod).
  • Commercial LLMs are trained for safety, so they may be more conservative on some policies than you expect. These tendencies stem from RLHF (reinforcement learning from human feedback) and safety tuning focused on generic use. You can overcome them with fine-tuning or custom instructions, but off-the-shelf behavior may reflect broader safety guardrails not designed for nuanced T&S use cases. That said, you can use this to your advantage if you want high recall on critical policy areas.
  • LLMs can get confused by very long policy documents. If you have many complicated policies, you may need to break them into separate prompts for better accuracy, at the cost of more calls per review (see the sketch after this list).
  • As great as it sounds to have an LLM “escalate to a human” when it has trouble, this is difficult to do in practice, because LLMs tend to be overconfident. There are a few methods to try, but use caution and experiment. Setting up checks and balances (such as manual auditing) is important here.
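
On the long-policy point above, here is a sketch of the per-policy approach. `classify_against` is a hypothetical helper (not a real library call) that sends a single policy plus the post to your model of choice; the trade-off is one LLM call, and its cost, per policy area.

```python
# Sketch of splitting a long policy document into one call per policy area.
POLICIES = {
    "hate_speech": "structured hate speech policy text goes here",
    "harassment": "structured harassment policy text goes here",
    "violent_threats": "structured violent threats policy text goes here",
}

def moderate(content: str, classify_against) -> dict:
    """Return one decision per policy area instead of one decision from a single giant prompt."""
    return {name: classify_against(policy, content) for name, policy in POLICIES.items()}
```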

As with any moderation workflow, LLMs work best when there are checks and balances to ensure performance, such as regular audits of randomly sampled decisions and reviews of user moderation appeals.

How to Get Started Using LLMs for Content Moderation

For the first time, Trust & Safety professionals can design, test, and iterate on policy enforcement themselves, with no engineering bottlenecks. Policy engineering is the bridge: it turns human policy into machine-readable rules, backed by fast feedback loops and scalable evaluation. The teams that adopt this mindset will ship better policies faster, and build safer, smarter online platforms in the process.

Commercial LLMs can be called through an API, which is relatively easy to set up, but coding directly against a single provider can lock you into one AI ecosystem, a real constraint when new models are being released constantly.
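
As a rough sketch, here is what a direct call can look like, using OpenAI's Python SDK as one example provider; the model name and prompt wording are placeholders. Keeping the provider call behind one small function at least makes it easier to swap models later.

```python
# Minimal sketch using OpenAI's Python SDK as one example provider (pip install openai;
# expects OPENAI_API_KEY in the environment). The model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def classify(policy: str, content: str) -> str:
    """Send a policy plus a post to the model and return its raw decision text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you are evaluating
        temperature=0,        # more deterministic output helps with auditing
        messages=[
            {"role": "system", "content": policy},
            {
                "role": "user",
                "content": f"Post: {content}\n"
                           "Answer 'Violation' or 'Not a Violation' with a one-sentence reason.",
            },
        ],
    )
    return response.choices[0].message.content
```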

Musubi’s PolicyAI tool allows policy engineers to easily write policies in any style they like, add examples, test against golden data sets, and compare policies and models side-by-side for performance, before seamlessly moving into production. If this is something you’d like to learn more about, contact us for a demo.