AI Content Moderation & Safety API Guide 2026 - Protect Your Platform at Scale

Every platform that accepts user-generated content needs content moderation. Manual review doesn't scale. AI-powered moderation APIs give you automated, real-time content filtering that catches hate speech, violence, sexual content, PII leaks, and more before it reaches your users. This guide covers the most widely used moderation APIs as of 2026, with implementation patterns for production systems.

Types of Content Moderation

Content moderation isn't one thing — it's several distinct problems:

Type	What It Catches	API Options
Toxicity / Hate	Hate speech, harassment, threats	OpenAI Moderation, Perspective, Azure
Sexual content	NSFW text and images	OpenAI Moderation, Azure, custom
Violence / Gore	Violent content, self-harm references	OpenAI Moderation, Azure
PII detection	SSN, email, phone, credit card	Presidio, AWS Macie, custom NER
Spam / Misinformation	Spam, coordinated attacks	Custom classifiers, LLM-based
Image moderation	NSFW images, violence in images	Azure, Google Safe Search, Clarifai

OpenAI Moderation API

The simplest and most widely used moderation API. Free to use and covers the most common categories:

from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",
    input="Your text to check here"
)

result = response.results[0]

# Check if flagged
print(f"Flagged: {result.flagged}")

# Individual category scores
categories = result.categories
scores = result.category_scores

print(f"Hate: {scores.hate:.4f} (flagged: {categories.hate})")
print(f"Harassment: {scores.harassment:.4f}")
print(f"Self-harm: {scores.self_harm:.4f}")
print(f"Sexual: {scores.sexual:.4f}")
print(f"Violence: {scores.violence:.4f}")

# Multi-modal moderation (text + image)
response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "Check this image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]
)

Moderation Categories

Category	What It Detects
harassment	Content that expresses, incites, or promotes harassing language
harassment/threatening	Harassment content that includes violence or serious harm
hate	Content that expresses, incites, or promotes hate based on identity
hate/threatening	Hate content that includes violence or serious harm
self-harm	Content that promotes or encourages self-harm
self-harm/intent	Content expressing intent to engage in self-harm
self-harm/instructions	Content providing instructions on self-harm
sexual	Content meant to arouse sexual excitement
sexual/minors	Sexual content involving minors
violence	Content depicting death, violence, or physical injury
violence/graphic	Violent content with graphic depictions
illicit	Content that gives advice or instructions on committing illicit acts (omni models only)
illicit/violent	Illicit content that also references violence or procuring a weapon (omni models only)

OpenAI's Moderation API is free, which makes it an easy first line of defense for any user-facing AI application. One API call covers the most dangerous content categories.
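
A result is easier to log and route once the flagged categories are collected into a plain list. The sketch below assumes the categories and scores have already been converted to dictionaries keyed by the API's category names (for instance via the SDK models' `model_dump(by_alias=True)`). The helper name `summarize_flags` and the sample values are illustrative only.

```python
def summarize_flags(categories: dict, scores: dict) -> list:
    """Return (category, score) pairs for flagged categories, highest score first."""
    flagged = [
        (name, scores.get(name, 0.0))
        for name, is_flagged in categories.items()
        if is_flagged  # None (category not evaluated) counts as not flagged
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up sample values shaped like a single moderation result
sample_categories = {"hate": False, "harassment": True, "violence": True, "self-harm": None}
sample_scores = {"hate": 0.02, "harassment": 0.81, "violence": 0.93, "self-harm": 0.001}

print(summarize_flags(sample_categories, sample_scores))
```

Sorting by score puts the strongest signal first, which is usually the one worth showing in a review queue.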

Google Perspective API

Google's Perspective API, built by Jigsaw, focuses specifically on toxicity detection in comments and conversations:

import requests

PERSPECTIVE_API_KEY = "YOUR_KEY"
url = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={PERSPECTIVE_API_KEY}"

data = {
    "comment": {"text": "User's comment here"},
    "requestedAttributes": {
        "TOXICITY": {},
        "SEVERE_TOXICITY": {},
        "IDENTITY_ATTACK": {},
        "INSULT": {},
        "PROFANITY": {},
        "THREAT": {}
    },
    "languages": ["en"],
    "doNotStore": True
}

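response = requests.post(url, json=data, timeout=10)
response.raise_for_status()  # Surface quota or auth errors instead of a KeyError below
scores = response.json()["attributeScores"]

# Use a distinct loop variable so the request payload `data` is not overwritten
for attribute, attribute_data in scores.items():
    score = attribute_data["summaryScore"]["value"]
    print(f"{attribute}: {score:.3f}")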

Perspective's strengths: it's specifically trained on conversational text (comments, forums, chat) and provides granular sub-categories. The free tier allows 1 request/second, which is sufficient for many applications.
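
Staying under a 1 request/second quota is easiest with a small client-side throttle. The following is a minimal single-threaded sketch, not part of any Perspective client library. The class name `MinIntervalThrottle` and its parameters are assumptions made for illustration, and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class MinIntervalThrottle:
    """Block callers so that calls are spaced at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable for testing
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the interval, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = MinIntervalThrottle(min_interval=1.0)
# Call throttle.wait() immediately before each requests.post(...) to Perspective.
```

If several workers share one API key, the throttle would need to be shared as well (for example behind a lock or a queue), which this sketch does not attempt.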

Azure Content Safety

Microsoft's offering is the most comprehensive for enterprise deployments:

from azure.ai.contentsafety import ContentSafetyClient
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint="https://your-endpoint.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("YOUR_KEY")
)

from azure.ai.contentsafety.models import AnalyzeTextOptions

request = AnalyzeTextOptions(
    text="Text to analyze",
    categories=["Hate", "Sexual", "Violence", "SelfHarm"],
    # Text analysis returns four levels (0, 2, 4, 6) by default;
    # request the finer 0-7 scale explicitly.
    output_type="EightSeverityLevels",
)

response = client.analyze_text(request)

for result in response.categories_analysis:
    print(f"{result.category}: severity={result.severity}")
    # Severity: 0 (safe) to 7 (most severe)

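Azure's distinguishing feature is graded severity instead of a binary flagged/not-flagged result. Text analysis reports four levels (0, 2, 4, 6) by default and a 0-7 scale when eight-level output is requested. Either way you can set custom thresholds: for example, block at severity 5+ and flag for review at 3+.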
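
The thresholds mentioned above can be sketched as a small mapping from severity to action. This is an illustration under the stated cut-offs (block at 5+, review at 3+), not part of the Azure SDK. The function names and the sample severities are hypothetical.

```python
def action_for_severity(severity: int, block_at: int = 5, review_at: int = 3) -> str:
    """Map a single severity value to allow / human_review / block."""
    if severity >= block_at:
        return "block"
    if severity >= review_at:
        return "human_review"
    return "allow"


def actions_by_category(severities: dict, block_at: int = 5, review_at: int = 3) -> dict:
    """Apply the same cut-offs to every category in a {category: severity} mapping."""
    return {
        category: action_for_severity(severity, block_at, review_at)
        for category, severity in severities.items()
    }


# Made-up severities, one per requested category
sample = {"Hate": 0, "Sexual": 2, "Violence": 4, "SelfHarm": 6}
print(actions_by_category(sample))
```

The same cut-offs work with either output scale, since a four-level result simply never produces the odd values.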

PII Detection & Redaction

Preventing PII leaks is a separate but equally important moderation concern:

Microsoft Presidio (Open Source)

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Analyze text for PII
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text="My SSN is 123-45-6789 and email is john@example.com",
    language="en"
)

for result in results:
    print(f"Found {result.entity_type}: {result.score:.2f}")

# Anonymize/redact PII
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(
    text="My SSN is 123-45-6789 and email is john@example.com",
    analyzer_results=results
)

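print(anonymized.text)
# By default each detected span is replaced with its entity type in angle
# brackets, e.g. <EMAIL_ADDRESS>. SSNs are reported as US_SSN, and the
# recognizer's validation may reject obvious sample numbers like the one above.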

Presidio detects 30+ PII entity types including US SSNs, credit cards, phone numbers, email addresses, dates, medical license numbers, and more. It runs locally, so no data leaves your infrastructure.
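
Conceptually, redaction takes the detected spans (entity type plus start and end offsets) and substitutes a placeholder for each one. The sketch below illustrates that idea with plain Python and is not Presidio's implementation. The `Span` class, the helper names, and the sample text are hypothetical, and it assumes spans do not overlap.

```python
from dataclasses import dataclass


@dataclass
class Span:
    entity_type: str
    start: int
    end: int  # exclusive


def span_for(text: str, entity_type: str, value: str) -> Span:
    """Build a span for the first occurrence of `value` (stand-in for a real detector)."""
    start = text.index(value)
    return Span(entity_type, start, start + len(value))


def redact(text: str, spans: list) -> str:
    """Replace each span with <ENTITY_TYPE>, working right to left so offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "Email john@example.com or call 555-0100"
spans = [
    span_for(text, "EMAIL_ADDRESS", "john@example.com"),
    span_for(text, "PHONE_NUMBER", "555-0100"),
]
print(redact(text, spans))
```

Processing spans from the end of the string backwards avoids recomputing offsets after each substitution.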

Using LLMs for PII Detection

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class PIIExtraction(BaseModel):
    has_pii: bool
    pii_types: list[str]  # ["email", "phone", "ssn"]
    redacted_text: str

response = client.responses.parse(
    model="gpt-5.4-mini",
    input=[
        {"role": "system", "content": "Detect and redact PII in the text. Replace PII with [TYPE]."},
        {"role": "user", "content": "Call me at 555-1234 or email john@company.com"}
    ],
    text_format=PIIExtraction,
)

result = response.output_parsed
print(result.has_pii)        # e.g. True
print(result.pii_types)      # e.g. ["phone", "email"]
print(result.redacted_text)  # e.g. "Call me at [PHONE] or email [EMAIL]"

Moderation API Comparison

Feature	OpenAI	Google Perspective	Azure Content Safety	Presidio
Cost	Free	Free (1 req/s)	Pay per call	Free (self-host)
PII detection	No	No	Limited	Yes (30+ types)
Image moderation	Yes (omni)	No	Yes	No
Custom thresholds	Score-based	Score-based	0-7 severity	Configurable
Languages	Multi	Primarily English	Multi	Multi
Self-hosted	No	No	No	Yes

Production Patterns

1. Multi-Layer Moderation Pipeline

Layer multiple moderation systems for comprehensive coverage:

class ModerationPipeline:
    def __init__(self, openai_client, analyzer, threshold=0.7):
        self.openai = openai_client
        self.analyzer = analyzer
        self.threshold = threshold
    
    def moderate(self, text: str) -> dict:
        """Run multi-layer moderation."""
        results = {
            "text": text,
            "passed": True,
            "flags": [],
            "actions": []
        }
        
        # Layer 1: PII detection (runs locally, before any external API call).
        # Note: this only flags PII; the unredacted text is still sent to Layer 2 below.
        pii_results = self.analyzer.analyze(text=text, language="en")
        if pii_results:
            results["flags"].append({
                "type": "pii",
                "details": [{"type": r.entity_type, "score": r.score} for r in pii_results]
            })
            results["actions"].append("redact_pii")
        
        # Layer 2: OpenAI Moderation (toxicity, violence, sexual, hate)
        moderation = self.openai.moderations.create(
            model="omni-moderation-latest",
            input=text
        )
        mod_result = moderation.results[0]
        
        if mod_result.flagged:
            results["passed"] = False
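            # Iterating the result model yields (attribute_name, value) pairs,
            # so names use underscores (e.g. "self_harm"), not the API's "self-harm"
            for cat, flagged in mod_result.categories: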
                if flagged:
                    results["flags"].append({
                        "type": cat,
                        "score": getattr(mod_result.category_scores, cat)
                    })
            results["actions"].append("block")
        
        # Layer 3: Custom business rules
        # (e.g., competitor mentions, brand safety, etc.)
        
        return results
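
Layer 3 is left as a comment above. One possible shape for a simple business rule is a whole-word, case-insensitive term check. This is a sketch only: the function name `find_blocked_terms` and the term list are hypothetical, and real brand-safety rules are usually more involved.

```python
import re


def find_blocked_terms(text: str, blocked_terms: list) -> list:
    """Return the blocked terms that appear in `text` as whole words, ignoring case."""
    hits = []
    for term in blocked_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


# Hypothetical competitor name and spam phrase
BLOCKED_TERMS = ["AcmeRival", "free crypto"]

hits = find_blocked_terms("Switch to acmerival today!", BLOCKED_TERMS)
print(hits)
```

Inside a pipeline, a non-empty result could be appended to the flags list (for example with a "business_rule" type) and mapped to whichever action fits the rule.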

2. Threshold-Based Decisioning

Don't just use binary flagged/not-flagged. Set graduated thresholds:

def decide_action(moderation_result, thresholds):
    """Decide action based on score thresholds."""
    actions = {}
    
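    # Dump with aliases so keys match the API's names ("sexual/minors").
    # Iterating the model directly yields attribute names ("sexual_minors"),
    # which would silently miss slash-style keys in `thresholds`.
    scores = moderation_result.category_scores.model_dump(by_alias=True)
    for category, score in scores.items():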
        if score >= thresholds.get(category, {}).get("block", 0.9):
            actions[category] = "block"
        elif score >= thresholds.get(category, {}).get("review", 0.5):
            actions[category] = "human_review"
        else:
            actions[category] = "allow"
    
    return actions

# Example configuration
THRESHOLDS = {
    "sexual/minors": {"block": 0.1, "review": 0.01},  # Zero tolerance
    "violence": {"block": 0.7, "review": 0.3},
    "hate": {"block": 0.8, "review": 0.4},
    "sexual": {"block": 0.8, "review": 0.5},
    "harassment": {"block": 0.7, "review": 0.4},
}
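
Per-category actions still have to be collapsed into one decision for the content as a whole. A reasonable default, sketched below, is to take the strictest action and record which categories triggered it. The function name `final_decision` and the sample input are illustrative assumptions.

```python
# Actions ordered from least to most strict
SEVERITY_ORDER = ["allow", "human_review", "block"]


def final_decision(actions: dict) -> dict:
    """Collapse {category: action} into the strictest action and its triggering categories."""
    if not actions:
        return {"action": "allow", "categories": []}
    strictest = max(actions.values(), key=SEVERITY_ORDER.index)
    triggered = sorted(
        category
        for category, action in actions.items()
        if action == strictest and action != "allow"
    )
    return {"action": strictest, "categories": triggered}


sample_actions = {"violence": "human_review", "hate": "allow", "sexual/minors": "block"}
print(final_decision(sample_actions))
```

Keeping the triggering categories alongside the action gives reviewers and logs the reason for the decision, not just the outcome.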

3. Moderation with User Context

The same content can be acceptable or harmful depending on context:

def moderate_with_context(text, user_context):
    """Consider user context in moderation decisions."""
    
    # Base moderation (assumes a moderate(text) -> dict helper, such as a
    # bound ModerationPipeline.moderate method, is in scope)
    result = moderate(text)
    
    # The channel sets the baseline: more lenient in private conversations,
    # stricter for public-facing content
    channel_multipliers = {"dm": 1.3, "public_post": 0.8}
    multiplier = channel_multipliers.get(user_context["channel"], 1.0)
    
    # Escalate for repeat offenders on top of the channel baseline, so a
    # channel rule cannot overwrite the stricter setting
    if user_context["previous_violations"] > 2:
        multiplier *= 0.7  # Lower thresholds
    
    result["threshold_multiplier"] = multiplier
    return result
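
A threshold multiplier only has an effect once it is applied to the threshold table. The sketch below shows one way to do that, assuming the nested `{category: {"block": x, "review": y}}` layout used earlier. The function name and the `never_loosen` parameter are hypothetical additions, included so that zero-tolerance categories are never relaxed by a lenient context.

```python
def scale_thresholds(thresholds: dict, multiplier: float,
                     never_loosen: tuple = ("sexual/minors",)) -> dict:
    """Return a scaled copy of `thresholds`; values are capped at 1.0.

    Categories in `never_loosen` ignore multipliers above 1.0, so they can
    only be tightened, never relaxed.
    """
    scaled = {}
    for category, levels in thresholds.items():
        factor = min(multiplier, 1.0) if category in never_loosen else multiplier
        scaled[category] = {
            level: min(1.0, value * factor) for level, value in levels.items()
        }
    return scaled


base = {
    "sexual/minors": {"block": 0.1, "review": 0.01},
    "violence": {"block": 0.7, "review": 0.3},
}

print(scale_thresholds(base, 1.3))  # lenient context, e.g. a DM
print(scale_thresholds(base, 0.7))  # strict context, e.g. a repeat offender
```

Note that a threshold capped at 1.0 effectively disables that action for the category, so large multipliers deserve a second look.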

4. Async Moderation for High Volume

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def moderate_batch(texts, batch_size=50):
    """Moderate many texts concurrently."""
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        tasks = [
            async_client.moderations.create(
                model="omni-moderation-latest",
                input=text
            )
            for text in batch
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        
        for text, response in zip(batch, responses):
            if isinstance(response, Exception):
                results.append({"text": text, "error": str(response)})
            else:
                results.append({
                    "text": text,
                    "flagged": response.results[0].flagged,
                    "categories": response.results[0].categories
                })
    
    return results
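
Fixed-size batches wait for the slowest call in each batch before starting the next. An alternative is to cap the number of in-flight calls with a semaphore and retry transient failures. The following is a generic sketch, with a stand-in worker instead of a real API call. The names `gather_bounded` and `fake_moderate` and the retry settings are assumptions, not SDK features.

```python
import asyncio


async def gather_bounded(items, worker, max_concurrency=10, retries=2, base_delay=0.5):
    """Run worker(item) for every item with at most `max_concurrency` in flight.

    Failed calls are retried with exponential backoff. If all attempts fail,
    the last exception is returned in place of a result, so one failure does
    not sink the whole batch. Results keep the order of `items`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            for attempt in range(retries + 1):
                try:
                    return await worker(item)
                except Exception as exc:
                    if attempt == retries:
                        return exc
                    # The slot stays occupied during backoff, which also eases load
                    await asyncio.sleep(base_delay * 2 ** attempt)

    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_moderate(text):
    """Stand-in for a moderation call; flags any text containing 'bad'."""
    await asyncio.sleep(0)
    return {"text": text, "flagged": "bad" in text}


results = asyncio.run(gather_bounded(["hello", "bad words"], fake_moderate, max_concurrency=2))
print(results)
```

In real use the worker would wrap the async moderation call, and the retry condition would likely be narrowed to rate-limit and timeout errors.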

Moderating LLM Outputs

Don't just moderate user input — moderate model output too. LLMs can generate harmful content, especially when users craft adversarial prompts:

async def safe_generate(client, model, messages, **kwargs):
    """Generate with output moderation. Expects an AsyncOpenAI client."""
    
    # 1. Moderate input
    user_message = messages[-1]["content"]
    input_check = await client.moderations.create(
        model="omni-moderation-latest",
        input=user_message
    )
    input_result = input_check.results[0]
    if input_result.flagged:
        # Collect the names of the categories that triggered the flag
        flagged = [name for name, hit in input_result.categories if hit]
        return {"error": "Input violates safety guidelines", "flagged_categories": flagged}
    
    # 2. Generate response
    response = await client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )
    output_text = response.choices[0].message.content
    
    # 3. Moderate output
    output_check = await client.moderations.create(
        model="omni-moderation-latest",
        input=output_text
    )
    if output_check.results[0].flagged:
        # Log the incident for analysis (log_safety_incident is an
        # application-defined helper, not part of the OpenAI SDK)
        log_safety_incident(messages, output_text, output_check.results[0])
        return {"error": "Response filtered for safety", "fallback": "I can't help with that."}
    
    return {"response": output_text}

Common Pitfalls

Only moderating user input: Always moderate model output too. Prompt injection can cause LLMs to generate harmful content.
Using binary thresholds: Score-based thresholds with graduated actions (allow/review/block) let you treat borderline content differently from clear violations.
Ignoring PII in prompts: Users paste sensitive data into chatbots constantly. Detect and redact it before processing.
Not logging moderation decisions: You need logs to tune thresholds and understand false positives and false negatives.
Over-blocking: Aggressive moderation frustrates users. Reserve automatic blocking for high-confidence scores and send borderline cases to human review.
Not handling moderation API failures: Decide in advance whether to fail open (allow) or fail closed (block) when the moderation API is down.
English-only moderation: Harmful content in other languages can slip through models trained mostly on English.
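
The fail-open versus fail-closed choice is easier to reason about when it is an explicit parameter and not an accident of exception handling. Here is a minimal sketch, assuming a `check(text)` callable that returns True when content is flagged. The wrapper name, the return shape, and the simulated outage are all hypothetical.

```python
def moderate_with_fallback(check, text, fail_closed=True):
    """Call check(text) -> bool (True means flagged); apply a fixed policy if it raises."""
    try:
        flagged = check(text)
    except Exception as exc:
        # Moderation unavailable: fail closed blocks the content, fail open allows it
        return {
            "allowed": not fail_closed,
            "reason": f"moderation_unavailable: {type(exc).__name__}",
        }
    return {"allowed": not flagged, "reason": "flagged" if flagged else "clean"}


def broken_check(text):
    """Simulates an outage of the moderation service."""
    raise TimeoutError("moderation API timed out")


print(moderate_with_fallback(broken_check, "hello", fail_closed=True))
print(moderate_with_fallback(broken_check, "hello", fail_closed=False))
```

Recording the reason alongside the decision also covers the logging pitfall: outages show up in the same records used to tune thresholds.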

Conclusion

Content moderation is a non-negotiable part of any AI application that handles user-generated content. Start with OpenAI's free Moderation API for toxicity and harmful content detection. Add Presidio for PII protection. Use Azure Content Safety if you need enterprise-grade severity scoring. And always moderate both input and output — a safe LLM application is one that filters content in both directions.

The key principle: moderation should be layered, graduated, and logged. No single API catches everything, binary decisions create bad user experiences, and you can't improve what you don't measure.