How to audit your fixed ML classifier

    Many teams running a fixed machine learning classifier (whether that's a commercial tool like Hive or AWS Rekognition or an in-house model) run into friction. I've been that Trust & Safety (T&S) leader. When I was running moderation at Grindr and OkCupid, I had fixed ML classifiers that couldn't handle the nuance our policies needed. Categories were too broad and decisions were blunt, which meant I had to keep throwing more people at the problem if I wanted high moderation quality. I talk to teams in this situation all the time now, and I recognize it immediately, because I lived it.

    No moderation system is perfect. Leaders are always balancing budget, time, and what technology is available. The exercises below are one way to understand the hidden costs of running with a less-than-ideal setup. It's easy to keep going with what you have and patch over the gaps, but sometimes it helps to step back and see the full impact.

    Below are four common symptoms, with a diagnostic exercise for each. But before any of that, start with your moderators. They will almost certainly know where the problems are. Ask them two things:

    • What cases keep showing up in their queue that feel repetitive and obvious? Which decisions can they make in two seconds, and wish were just automated? This tells you whether your classifier is doing its job. If moderators are spending significant time on decisions the model should handle, either it's routing too broadly and creating work it shouldn't, or it's missing things people are catching manually.
    • How much of their day is that, versus the genuinely hard calls where they have to think, reason, calibrate, and feel meaningfully useful? This tells you whether your team's judgment is being used well. Moderators doing rote work all day burn out, and it's one of the reasons good people leave. That cost almost never gets traced back to the moderation stack.

    Sometimes that conversation tells you everything you need to know. If you're getting signals that you should keep digging, the diagnostics below help you locate the problem more precisely.

    Symptom 1: False positives and overly broad categories

    What this looks like:

    • A high volume of flagged content that needs a human second pass before anything gets actioned.
    • A lot of user appeals, resulting in your moderation team overturning initial decisions.

    Why it happens. Fixed classifiers are trained on a specific definition of violating content. An off-the-shelf model is trained on a generic definition of violating content, which may not match your policies or community exactly. Even a custom in-house classifier can become outdated or misaligned over time.

    When policy categories are too broad or all-inclusive, your only way to have a more nuanced moderation outcome is to send all flagged content to moderators to sort through. The frustrating thing is that those reversals (overrides, escalations, or appeal resolutions, depending on what your team calls them) never feed back into the model to make it better next week. So your team is making the same moderation decisions over and over again.

    How to figure this out.

    • Of everything the classifier routes to your human team, calculate what percentage they actually confirm as a violation. A low confirmation rate means your classifier is working as a queue, not a decision-maker.
    • Check your user appeals overturn rate. If a high percentage of appeals are winning, the classifier is out of step with what your community expects, not just what your policy says.
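
    The two rates above can be computed from a simple export of review and appeal outcomes. The following is a minimal sketch, assuming you can pull one record per flagged item and one per appeal; the record types and field names (ReviewedItem, Appeal, confirmed_violation, overturned) are hypothetical, and the sample data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewedItem:
    """One classifier-flagged item after human review (hypothetical schema)."""
    item_id: str
    confirmed_violation: bool  # True if the moderator agreed with the flag


@dataclass
class Appeal:
    """One user appeal of an enforcement action (hypothetical schema)."""
    item_id: str
    overturned: bool  # True if the original decision was reversed


def confirmation_rate(reviewed: list[ReviewedItem]) -> float:
    """Share of classifier-routed items that humans confirmed as violations."""
    if not reviewed:
        raise ValueError("no reviewed items in the sample")
    return sum(r.confirmed_violation for r in reviewed) / len(reviewed)


def appeal_overturn_rate(appeals: list[Appeal]) -> float:
    """Share of appeals that ended with the original decision reversed."""
    if not appeals:
        raise ValueError("no appeals in the sample")
    return sum(a.overturned for a in appeals) / len(appeals)


# Made-up sample: four flagged items, three appeals.
reviewed = [
    ReviewedItem("a", True),
    ReviewedItem("b", False),
    ReviewedItem("c", False),
    ReviewedItem("d", False),
]
appeals = [Appeal("b", True), Appeal("c", True), Appeal("e", False)]

print(f"confirmation rate: {confirmation_rate(reviewed):.0%}")
print(f"appeal overturn rate: {appeal_overturn_rate(appeals):.0%}")
```

    Running both rates per policy category, rather than across the whole queue, usually shows which categories are acting as a queue and which are making real decisions.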

    One creator monetization platform (~3.5M items per month) ran this analysis and found $340K per year in false positive cost once they included the support tickets and appeals nobody had been attributing to moderation. That was enough for the exec team to pay attention.

    If your team is already doing second-pass review on most of what comes through, you may not need to run this exercise. You already know the confirmation rate is low, and the headcount required to keep up is telling that story. Skip to the calculator and put a dollar figure on what you're already dealing with.
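
    For a rough first pass at that dollar figure, the arithmetic can be sketched as below. This is an illustrative model, not the method the platform above used: the function name, its parameters, and every input value are assumptions, and the example numbers are invented and are not meant to reproduce the $340K figure.

```python
def annual_false_positive_cost(
    flagged_per_month: int,
    confirmation_rate: float,
    review_minutes_per_item: float,
    moderator_cost_per_hour: float,
    fp_tickets_per_month: int,
    cost_per_ticket: float,
) -> float:
    """Estimate yearly spend on content that was flagged but not violating.

    All parameters are assumptions you would replace with your own data:
    - review cost covers moderator time spent on unconfirmed flags
    - ticket cost covers support tickets and appeals caused by wrong actions
    """
    if not 0.0 <= confirmation_rate <= 1.0:
        raise ValueError("confirmation_rate must be between 0 and 1")
    false_positives = flagged_per_month * (1 - confirmation_rate)
    review_hours = false_positives * review_minutes_per_item / 60
    review_cost = review_hours * moderator_cost_per_hour
    ticket_cost = fp_tickets_per_month * cost_per_ticket
    return 12 * (review_cost + ticket_cost)


# Invented inputs for illustration only.
cost = annual_false_positive_cost(
    flagged_per_month=10_000,
    confirmation_rate=0.25,
    review_minutes_per_item=1.5,
    moderator_cost_per_hour=30.0,
    fp_tickets_per_month=200,
    cost_per_ticket=5.0,
)
print(f"estimated annual false positive cost: ${cost:,.0f}")
```

    The ticket and appeal term is the part teams most often leave out, because that work tends to sit in a support budget rather than the moderation budget.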

    Symptom 2: False negatives and the adaptation gap

    What this looks like:

    • New slang, viral trends, or world events shift what content means, and enforcement doesn't catch up for weeks.
    • Content your moderation team is catching manually that the classifier never flagged.

    Why it happens. A fixed classifier knows what it was trained on. Getting it to recognize new patterns means new labeled data, retraining, and redeployment. That cycle can take weeks, sometimes months. In the meantime, the model enforces what it knew the last time it was trained.

    How to figure this out.

    • Ask your moderation team what they've been catching manually in the last 30 days that didn't come through the classifier. Every team has a running list of this in their heads.
    • Pick something specific (a recent news event, a piece of slang circulating in your community) and ask how long it took to get coverage on it, and what that process actually looked like.
    • Look at the number of actioned user reports that didn't get picked up by the classifier.
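
    Two of these checks reduce to simple numbers: the share of actioned user reports the classifier never flagged, and the number of days between a new pattern appearing and the classifier covering it. A minimal sketch follows, assuming you can export actioned reports with a flag indicating whether the classifier caught the item; ActionedReport, its fields, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionedReport:
    """A user report that ended in enforcement (hypothetical schema)."""
    item_id: str
    classifier_flagged: bool  # True if the classifier had also flagged the item


def classifier_miss_share(reports: list[ActionedReport]) -> float:
    """Share of actioned user reports that the classifier did not flag."""
    if not reports:
        raise ValueError("no actioned reports in the sample")
    missed = sum(not r.classifier_flagged for r in reports)
    return missed / len(reports)


def coverage_lag_days(first_seen: date, covered_on: date) -> int:
    """Days between a new pattern appearing and enforcement covering it."""
    if covered_on < first_seen:
        raise ValueError("covered_on must not be earlier than first_seen")
    return (covered_on - first_seen).days


# Made-up sample: five actioned reports, three of which the classifier missed.
reports = [
    ActionedReport("r1", False),
    ActionedReport("r2", True),
    ActionedReport("r3", False),
    ActionedReport("r4", False),
    ActionedReport("r5", True),
]
miss_share = classifier_miss_share(reports)
lag = coverage_lag_days(date(2024, 3, 1), date(2024, 3, 29))

print(f"actioned reports missed by classifier: {miss_share:.0%}")
print(f"coverage lag: {lag} days")
```

    User reports only surface the misses that someone noticed and bothered to report, so the miss share is best read as a lower bound on what the classifier is not catching.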

    Symptom 3: Your moderation team keeps growing

    What this looks like:

    • Headcount grows roughly in step with content volume. Every time the platform doubles, the team has to double too.
    • Hiring, training, and attrition are a permanent line item in your operating cost.
    • When you forecast next year's budget, moderation is dominated by people.

    Why it happens. A fixed classifier has a ceiling. It catches what it was trained to catch, at the precision it was trained to catch it. Once your team is doing second-pass review on most flagged content, and manually catching things the classifier missed, the only lever you have for more coverage or better quality is more moderators. In that setup, the classifier isn't really making moderation decisions; it's just deciding what to put in front of your team.

    How to figure this out.

    • Plot moderator headcount against content volume over the last 24 months. If the lines move together, automation isn't absorbing the growth.
    • Look at what percentage of your moderation budget is people. If it's the vast majority, your classifier is functioning more like a queue than a decision-maker.
    • Ask yourself: if content doubled in the next six months, what's the plan? If the only answer is "hire more," that's the gap.
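
    If you would rather compute the first two checks than eyeball a chart, a sketch like the following works on monthly figures. It assumes two equal-length series (content volume and moderator headcount per month) plus your people and total moderation spend; the function names and all numbers are illustrative. The Pearson correlation shows whether the lines move together, and items handled per moderator is an added complementary check: if that ratio stays flat while volume grows, people are absorbing the growth.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def items_per_moderator(volume: list[float], headcount: list[float]) -> list[float]:
    """Monthly content volume divided by moderator headcount."""
    return [v / h for v, h in zip(volume, headcount)]


def people_share_of_budget(people_cost: float, total_cost: float) -> float:
    """Fraction of the moderation budget spent on people."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return people_cost / total_cost


# Invented figures: volume triples and headcount triples with it.
volume = [1_000_000, 1_500_000, 2_000_000, 3_000_000]
headcount = [10, 15, 20, 30]

r = pearson(volume, headcount)
ratios = items_per_moderator(volume, headcount)
share = people_share_of_budget(people_cost=850_000, total_cost=1_000_000)

print(f"correlation of headcount with volume: {r:.2f}")
print(f"items per moderator by month: {ratios}")
print(f"people share of budget: {share:.0%}")
```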

    Symptom 4: You can't tell why the classifier made the call it did

    What this looks like:

    • A piece of content gets actioned and nobody on your team can reconstruct why.
    • Something obvious slips through and there's no way to ask the model what it was looking at.
    • When you spot a pattern of bad calls, the only workaround is a coarse threshold change, a manual override list, or waiting for the next retrain.

    Why it happens. Most fixed classifiers return a category and a confidence score, not a reason. Even the teams that built them often can't tell you why a specific decision came back the way it did. So when you find a problem, like a creator getting wrongly demonetized or a category underperforming on a specific community, you can't isolate the cause and you can't surgically fix it. Your options are blunt: lower the threshold, queue the category for human review, or wait for a retrain.

    How to figure this out.

    • Pull 20 of your most frustrating misclassifications from the last quarter, in both directions. For each, ask whether anyone on your team or your vendor can explain why the model made that call. If not, you're flying blind.
    • Take a recent appeal that was overturned. Can you trace what feature or signal caused the original action, or do you only know the score was 0.83?
    • Ask your vendor what the path is to fix a specific category of false positive. If the answer involves "retraining" or "a future model release," your iteration loop is months, not days.
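
    To keep the first exercise honest, it helps to record the answer for each case and tally the result by direction. This is a small sketch under assumed names: Misclassification, its fields, and the sample cases are hypothetical, and "explained" simply records whether anyone on your team or at your vendor could account for the call.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Misclassification:
    """One reviewed bad call from the audit sample (hypothetical schema)."""
    item_id: str
    direction: str   # "false_positive" or "false_negative"
    explained: bool  # True if team or vendor could explain the model's call


def explained_share_by_direction(cases: list[Misclassification]) -> dict[str, float]:
    """For each direction, the share of cases someone could explain."""
    totals = Counter(c.direction for c in cases)
    explained = Counter(c.direction for c in cases if c.explained)
    return {d: explained[d] / totals[d] for d in totals}


# Made-up audit sample: three false positives, two false negatives.
cases = [
    Misclassification("m1", "false_positive", True),
    Misclassification("m2", "false_positive", False),
    Misclassification("m3", "false_positive", False),
    Misclassification("m4", "false_negative", False),
    Misclassification("m5", "false_negative", False),
]
shares = explained_share_by_direction(cases)

for direction, share in shares.items():
    print(f"{direction}: {share:.0%} explainable")
```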

    What to do with what you find

    Run these exercises over 30 days and you'll know where your stack stands. Some gaps are fixable within your current setup, like a clearer policy, better thresholds, or a more consistent brief for your team. Others are structural limits of how fixed classifiers work, and tuning won't change them.

    One option, if it's helpful for you: I run a complimentary second opinion for teams on a sample of their hardest historical decisions. After a quick NDA, send 100–200 cases and your written policy, and I'll return a side-by-side of how a policy-aware system would have decided each one, with reasoning. You get a calibrated read on where your policy needs work and where you may need a different tool, and it's useful even if you don't end up changing anything. No integration, no commitment. Just send me an email and we'll set something up.