Guide May 13, 2026

AI Content Moderation & Safety API Guide 2026

Protect your platform from harmful content. OpenAI Moderation, Google Perspective, Azure Content Safety, PII detection, and production moderation patterns.

Every platform that accepts user-generated content needs content moderation. Manual review doesn't scale. AI-powered moderation APIs give you automated, real-time content filtering that catches hate speech, violence, sexual content, PII leaks, and more — before it reaches your users. This guide covers every major moderation API in 2026, with implementation patterns for production systems.

Types of Content Moderation

Content moderation isn't one thing — it's several distinct problems:

Type | What It Catches | API Options
Toxicity / Hate | Hate speech, harassment, threats | OpenAI Moderation, Perspective, Azure
Sexual content | NSFW text and images | OpenAI Moderation, Azure, custom
Violence / Gore | Violent content, self-harm references | OpenAI Moderation, Azure
PII detection | SSN, email, phone, credit card | Presidio, AWS Macie, custom NER
Spam / Misinformation | Spam, coordinated attacks | Custom classifiers, LLM-based
Image moderation | NSFW images, violence in images | Azure, Google Safe Search, Clarifai

OpenAI Moderation API

OpenAI's Moderation API is the simplest option to integrate. It is free to use and covers the most common categories:

from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",
    input="Your text to check here"
)

result = response.results[0]

# Check if flagged
print(f"Flagged: {result.flagged}")

# Individual category scores
categories = result.categories
scores = result.category_scores

print(f"Hate: {scores.hate:.4f} (flagged: {categories.hate})")
print(f"Harassment: {scores.harassment:.4f}")
print(f"Self-harm: {scores.self_harm:.4f}")
print(f"Sexual: {scores.sexual:.4f}")
print(f"Violence: {scores.violence:.4f}")

# Multi-modal moderation (text + image)
response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "Check this image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]
)

Moderation Categories

Category | What It Detects
harassment | Content that expresses, incites, or promotes harassing language
harassment/threatening | Harassment content that includes violence or serious harm
hate | Content that expresses, incites, or promotes hate based on identity
hate/threatening | Hate content that includes violence or serious harm
self-harm | Content that promotes or encourages self-harm
self-harm/intent | Content expressing intent to engage in self-harm
self-harm/instructions | Content providing instructions on self-harm
sexual | Content meant to arouse sexual excitement
sexual/minors | Sexual content involving minors
violence | Content depicting death, violence, or physical injury
violence/graphic | Violent content with graphic depictions

OpenAI's Moderation API is free. There's no reason not to add it as a first line of defense for any user-facing AI application. It takes one API call and catches the most dangerous content categories.

Google Perspective API

Google's Perspective API, built by Jigsaw, focuses specifically on toxicity detection in comments and conversations:

import requests

PERSPECTIVE_API_KEY = "YOUR_KEY"
url = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={PERSPECTIVE_API_KEY}"

data = {
    "comment": {"text": "User's comment here"},
    "requestedAttributes": {
        "TOXICITY": {},
        "SEVERE_TOXICITY": {},
        "IDENTITY_ATTACK": {},
        "INSULT": {},
        "PROFANITY": {},
        "THREAT": {}
    },
    "languages": ["en"],
    "doNotStore": True
}

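response = requests.post(url, json=data, timeout=10)
response.raise_for_status()  # surface quota and auth errors instead of a KeyError below
scores = response.json()["attributeScores"]

# Use a separate loop variable so the request payload `data` is not overwritten
for attribute, attribute_data in scores.items():
    score = attribute_data["summaryScore"]["value"]
    print(f"{attribute}: {score:.3f}")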

Perspective's strengths: it's specifically trained on conversational text (comments, forums, chat) and provides granular sub-categories. The free tier allows 1 request/second, which is sufficient for many applications.
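
With a quota of 1 request/second, a small client-side throttle can help keep a worker under the limit. The following is a minimal sketch, assuming a single-threaded caller. `Throttle` and `flagged_attributes` are illustrative names, not part of the Perspective API, and the parsing helper relies only on the `attributeScores` / `summaryScore` / `value` response shape used above.

```python
import time


class Throttle:
    """Block so that calls are spaced at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def flagged_attributes(attribute_scores, threshold=0.8):
    """Return {attribute: score} for attributes at or above `threshold`."""
    return {
        name: body["summaryScore"]["value"]
        for name, body in attribute_scores.items()
        if body["summaryScore"]["value"] >= threshold
    }
```

Call `throttle.wait()` immediately before each `requests.post(...)`, then pass `response.json()["attributeScores"]` to `flagged_attributes`. The 0.8 default is an arbitrary starting point, not a recommended value.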

Azure Content Safety

Microsoft's offering is the most comprehensive for enterprise deployments:

from azure.ai.contentsafety import ContentSafetyClient
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint="https://your-endpoint.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("YOUR_KEY")
)

from azure.ai.contentsafety.models import AnalyzeTextOptions

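request = AnalyzeTextOptions(
    text="Text to analyze",
    categories=["Hate", "Sexual", "Violence", "SelfHarm"],
    # Default output is four levels (0, 2, 4, 6); request eight for the full 0-7 scale
    output_type="EightSeverityLevels",
)

response = client.analyze_text(request)

for result in response.categories_analysis:
    print(f"{result.category}: severity={result.severity}")
    # Severity: 0 (safe) to 7 (most severe)

Azure's distinguishing feature is a severity level per category instead of a binary flagged/not-flagged verdict. Text analysis returns four levels (0, 2, 4, 6) by default and the full 0-7 scale when you request eight levels. This lets you set custom thresholds: block at severity 5+, flag for review at 3+.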
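
The thresholds mentioned above can be sketched as a small mapping function. This is an illustrative helper under the stated cutoffs (review at 3+, block at 5+), not part of the Azure SDK. The function names are hypothetical, and the input is assumed to be a plain `{category: severity}` dict built from `response.categories_analysis`.

```python
def action_for_severity(severity, review_at=3, block_at=5):
    """Map a severity level (0-7) to allow / human_review / block."""
    if severity >= block_at:
        return "block"
    if severity >= review_at:
        return "human_review"
    return "allow"


def actions_by_category(severities, review_at=3, block_at=5):
    """Apply the same cutoffs to every category in a {category: severity} dict."""
    return {
        category: action_for_severity(severity, review_at, block_at)
        for category, severity in severities.items()
    }
```

For example, `actions_by_category({r.category: r.severity for r in response.categories_analysis})` would give one action per analyzed category.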

PII Detection & Redaction

Preventing PII leaks is a separate but equally important moderation concern:

Microsoft Presidio (Open Source)

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Analyze text for PII
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text="My SSN is 123-45-6789 and email is john@example.com",
    language="en"
)

for result in results:
    print(f"Found {result.entity_type}: {result.score:.2f}")

# Anonymize/redact PII
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(
    text="My SSN is 123-45-6789 and email is john@example.com",
    analyzer_results=results
)

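print(anonymized.text)
# Detected entities are replaced with <ENTITY_TYPE> placeholders such as
# <US_SSN> and <EMAIL_ADDRESS>; exact results depend on the installed recognizers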

Presidio detects 30+ PII entity types including SSN, credit cards, phone numbers, email addresses, dates of birth, medical record numbers, and more. It runs locally — no data leaves your infrastructure.

Using LLMs for PII Detection

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class PIIExtraction(BaseModel):
    has_pii: bool
    pii_types: list[str]  # ["email", "phone", "ssn"]
    redacted_text: str

response = client.responses.parse(
    model="gpt-5.4-mini",
    input=[
        {"role": "system", "content": "Detect and redact PII in the text. Replace PII with [TYPE]."},
        {"role": "user", "content": "Call me at 555-1234 or email john@company.com"}
    ],
    text_format=PIIExtraction,
)

result = response.output_parsed
# Model output can vary between runs; typical values are shown
print(result.has_pii)        # e.g. True
print(result.pii_types)      # e.g. ["phone", "email"]
print(result.redacted_text)  # e.g. "Call me at [PHONE] or email [EMAIL]"

Moderation API Comparison

Feature | OpenAI | Google Perspective | Azure Content Safety | Presidio
Cost | Free | Free (1 req/s) | Pay per call | Free (self-host)
PII detection | No | No | Limited | Yes (30+ types)
Image moderation | Yes (omni) | No | Yes | No
Custom thresholds | Score-based | Score-based | 0-7 severity | Configurable
Languages | Multi | Primarily English | Multi | Multi
Self-hosted | No | No | No | Yes

Production Patterns

1. Multi-Layer Moderation Pipeline

Layer multiple moderation systems for comprehensive coverage:

class ModerationPipeline:
    def __init__(self, openai_client, analyzer, threshold=0.7):
        self.openai = openai_client
        self.analyzer = analyzer
        self.threshold = threshold
    
    def moderate(self, text: str) -> dict:
        """Run multi-layer moderation."""
        results = {
            "text": text,
            "passed": True,
            "flags": [],
            "actions": []
        }
        
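        # Layer 1: PII detection (run first so leaks are caught early).
        # Note: this only records the PII. Apply the "redact_pii" action to the text
        # before Layer 2 if raw PII must not be sent to an external API.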
        pii_results = self.analyzer.analyze(text=text, language="en")
        if pii_results:
            results["flags"].append({
                "type": "pii",
                "details": [{"type": r.entity_type, "score": r.score} for r in pii_results]
            })
            results["actions"].append("redact_pii")
        
        # Layer 2: OpenAI Moderation (toxicity, violence, sexual, hate)
        moderation = self.openai.moderations.create(
            model="omni-moderation-latest",
            input=text
        )
        mod_result = moderation.results[0]
        
        if mod_result.flagged:
            results["passed"] = False
            for cat, flagged in mod_result.categories:
                if flagged:
                    results["flags"].append({
                        "type": cat,
                        "score": getattr(mod_result.category_scores, cat)
                    })
            results["actions"].append("block")
        
        # Layer 3: Custom business rules
        # (e.g., competitor mentions, brand safety, etc.)
        
        return results
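
Layer 3 is left as a comment in the pipeline. One way to fill it in, sketched here with simple regular expressions, is a rule table that produces flags in the same shape as the other layers. The rule names and patterns (`acme`, `globex`) are placeholders for illustration, not recommendations.

```python
import re

# Hypothetical rules for illustration; replace with your own policy
BUSINESS_RULES = {
    "competitor_mention": re.compile(r"\b(acme|globex)\b", re.IGNORECASE),
    "external_link": re.compile(r"https?://\S+", re.IGNORECASE),
}


def check_business_rules(text, rules=BUSINESS_RULES):
    """Return one flag dict per rule whose pattern matches the text."""
    flags = []
    for name, pattern in rules.items():
        matches = pattern.findall(text)
        if matches:
            flags.append({"type": name, "matches": matches})
    return flags
```

Inside `moderate`, this could be wired in with `results["flags"].extend(check_business_rules(text))`. Whether a rule hit should block or only flag is a policy decision left to the caller.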

2. Threshold-Based Decisioning

Don't just use binary flagged/not-flagged. Set graduated thresholds:

def decide_action(moderation_result, thresholds):
    """Decide action based on score thresholds."""
    actions = {}
    
    for category, score in moderation_result.category_scores:
        if score >= thresholds.get(category, {}).get("block", 0.9):
            actions[category] = "block"
        elif score >= thresholds.get(category, {}).get("review", 0.5):
            actions[category] = "human_review"
        else:
            actions[category] = "allow"
    
    return actions

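# Example configuration
# Keys must match the names yielded when iterating category_scores, which are
# Python attribute names ("sexual_minors"), not API names ("sexual/minors")
THRESHOLDS = {
    "sexual_minors": {"block": 0.1, "review": 0.01},  # Zero tolerance
    "violence": {"block": 0.7, "review": 0.3},
    "hate": {"block": 0.8, "review": 0.4},
    "sexual": {"block": 0.8, "review": 0.5},
    "harassment": {"block": 0.7, "review": 0.4},
}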
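
`decide_action` returns one action per category, but a request usually needs a single verdict. A minimal sketch of one way to reduce them, assuming the strictest action should win; `ACTION_RANK` and `final_verdict` are illustrative names:

```python
# Higher rank means stricter action
ACTION_RANK = {"allow": 0, "human_review": 1, "block": 2}


def final_verdict(actions):
    """Return (verdict, categories) where verdict is the strictest per-category action."""
    if not actions:
        return "allow", []
    verdict = max(actions.values(), key=ACTION_RANK.__getitem__)
    # Report which categories triggered the verdict, unless everything was allowed
    triggered = sorted(
        category
        for category, action in actions.items()
        if action == verdict and action != "allow"
    )
    return verdict, triggered
```

Usage would look like `verdict, categories = final_verdict(decide_action(result, THRESHOLDS))`.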

3. Moderation with User Context

The same content can be acceptable or harmful depending on context:

def moderate_with_context(text, user_context):
    """Consider user context in moderation decisions."""
    
    # Base moderation; `moderate` is your base function,
    # e.g. the moderate method of the ModerationPipeline above
    result = moderate(text)
    
    # The channel sets the baseline: more lenient in private conversations,
    # stricter for public-facing content
    multiplier = {"dm": 1.3, "public_post": 0.8}.get(user_context["channel"], 1.0)
    
    # Escalate for repeat offenders in every channel, so a DM
    # never overrides the stricter setting
    if user_context["previous_violations"] > 2:
        multiplier = min(multiplier, 0.7)  # Lower thresholds
    
    result["threshold_multiplier"] = multiplier
    return result
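
The function above sets `threshold_multiplier` but does not show how it is consumed. One possible reading, an assumption rather than something the guide specifies, is to scale every cutoff in a threshold table like the one from pattern 2 and clamp the result to the valid score range. `scale_thresholds` is a hypothetical helper name.

```python
def scale_thresholds(thresholds, multiplier):
    """Return a copy of `thresholds` with every cutoff multiplied and clamped to [0, 1]."""
    return {
        category: {
            level: min(1.0, max(0.0, cutoff * multiplier))
            for level, cutoff in levels.items()
        }
        for category, levels in thresholds.items()
    }
```

Note that a multiplier above 1 can push a cutoff up to 1.0, which effectively disables that action, so zero-tolerance categories are probably better left out of scaling altogether.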

4. Async Moderation for High Volume

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def moderate_batch(texts, batch_size=50):
    """Moderate many texts concurrently."""
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        tasks = [
            async_client.moderations.create(
                model="omni-moderation-latest",
                input=text
            )
            for text in batch
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        
        for text, response in zip(batch, responses):
            if isinstance(response, Exception):
                results.append({"text": text, "error": str(response)})
            else:
                results.append({
                    "text": text,
                    "flagged": response.results[0].flagged,
                    "categories": response.results[0].categories
                })
    
    return results
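
Fixed-size batches wait for the slowest call in each batch, and a failed call is reported without a second attempt. A variation, sketched below, caps concurrency with a semaphore and retries with exponential backoff. `moderate_one` is an assumed coroutine function supplied by the caller, and the retry settings are illustrative defaults.

```python
import asyncio


async def moderate_with_limits(texts, moderate_one, max_concurrency=10,
                               retries=2, base_delay=0.5):
    """Run `moderate_one(text)` for every text with a concurrency cap and retries."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(text):
        async with semaphore:
            for attempt in range(retries + 1):
                try:
                    return {"text": text, "result": await moderate_one(text)}
                except Exception as exc:  # narrow to rate-limit/timeout errors in practice
                    if attempt == retries:
                        return {"text": text, "error": str(exc)}
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    await asyncio.sleep(base_delay * 2 ** attempt)

    # gather preserves input order, so results line up with `texts`
    return await asyncio.gather(*(run(text) for text in texts))
```

With the client above, `moderate_one` could be `lambda text: async_client.moderations.create(model="omni-moderation-latest", input=text)`.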

Moderating LLM Outputs

Don't just moderate user input — moderate model output too. LLMs can generate harmful content, especially when users craft adversarial prompts:

async def safe_generate(client, model, messages, **kwargs):
    """Generate with input and output moderation. Expects an AsyncOpenAI client."""
    
    # 1. Moderate input
    user_message = messages[-1]["content"]
    input_check = await client.moderations.create(
        model="omni-moderation-latest",
        input=user_message
    )
    if input_check.results[0].flagged:
        return {"error": "Input violates safety guidelines", "flagged_categories": [...]}
    
    # 2. Generate response
    response = await client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )
    output_text = response.choices[0].message.content
    
    # 3. Moderate output
    output_check = await client.moderations.create(
        model="omni-moderation-latest",
        input=output_text
    )
    if output_check.results[0].flagged:
        # Log the incident for analysis (log_safety_incident is your own logging hook)
        log_safety_incident(messages, output_text, output_check.results[0])
        return {"error": "Response filtered for safety", "fallback": "I can't help with that."}
    
    return {"response": output_text}
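
`log_safety_incident` is called above but never defined. A hypothetical implementation is sketched below. It assumes structured JSON logs and stores content hashes rather than raw text, which is one possible privacy trade-off and not the only reasonable one. The logger name and record fields are illustrative.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("moderation")


def build_incident_record(messages, output_text, flagged_categories):
    """Build a JSON-serializable record that avoids storing raw content."""
    def digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message_count": len(messages),
        "last_input_sha256": digest(messages[-1]["content"]),
        "output_sha256": digest(output_text),
        "flagged_categories": sorted(flagged_categories),
    }


def log_safety_incident(messages, output_text, moderation_result):
    """Hypothetical implementation of the hook called in safe_generate."""
    # Iterating the categories object is assumed to yield (name, flagged) pairs
    flagged = [name for name, hit in moderation_result.categories if hit]
    record = build_incident_record(messages, output_text, flagged)
    logger.warning(json.dumps(record))
```

If reviewers need the original text to judge false positives, the raw content could instead go to an access-controlled store keyed by these hashes.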

Common Pitfalls

  1. Only moderating user input — Always moderate model output too. Prompt injection can cause LLMs to generate harmful content
  2. Using binary thresholds — Score-based thresholds with graduated actions (allow/review/block) are far more effective
  3. Ignoring PII in prompts — Users paste sensitive data into chatbots constantly. Detect and redact before processing
  4. Not logging moderation decisions — You need logs to tune thresholds and understand false positives/negatives
  5. Over-blocking — Aggressive moderation frustrates users. Start conservative with human review for borderline cases
  6. Not handling moderation API failures — If the moderation API is down, what happens? Fail open (allow) or fail closed (block)?
  7. English-only moderation — Harmful content in other languages slips through English-trained models
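
Pitfall 6 can be made an explicit, configurable decision instead of an accident of exception handling. A minimal sketch, assuming `moderate` is any callable that returns a result dict shaped like the pipeline's (`passed`, `flags`, `actions`); the function name and action labels are illustrative:

```python
def moderate_or_fallback(text, moderate, fail_closed=True):
    """Call `moderate(text)`; on failure, apply the configured failure policy."""
    try:
        return moderate(text)
    except Exception as exc:  # in practice, catch the client's timeout/connection errors
        return {
            "text": text,
            # Fail closed: treat content as not passed. Fail open: let it through.
            "passed": not fail_closed,
            "flags": [{"type": "moderation_unavailable", "details": str(exc)}],
            "actions": ["queue_for_review"] if fail_closed else ["allow_and_recheck"],
        }
```

Failing closed is usually the safer default for public posts, while failing open with a later recheck may be acceptable for low-risk surfaces. Either way, the `moderation_unavailable` flag keeps outages visible in the logs.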

Conclusion

Content moderation is a non-negotiable part of any AI application that handles user-generated content. Start with OpenAI's free Moderation API for toxicity and harmful content detection. Add Presidio for PII protection. Use Azure Content Safety if you need enterprise-grade severity scoring. And always moderate both input and output — a safe LLM application is one that filters content in both directions.

The key principle: moderation should be layered, graduated, and logged. No single API catches everything, binary decisions create bad user experiences, and you can't improve what you don't measure.