AI Content Moderation & Safety API Guide 2026
Protect your platform from harmful content. OpenAI Moderation, Google Perspective, Azure Content Safety, PII detection, and production moderation patterns.
Every platform that accepts user-generated content needs content moderation, and manual review doesn't scale. AI-powered moderation APIs provide automated, real-time filtering that catches hate speech, violence, sexual content, PII leaks, and more before it reaches your users. This guide covers the major moderation APIs as of 2026, with implementation patterns for production systems.
Types of Content Moderation
Content moderation isn't one thing — it's several distinct problems:
| Type | What It Catches | API Options |
|---|---|---|
| Toxicity / Hate | Hate speech, harassment, threats | OpenAI Moderation, Perspective, Azure |
| Sexual content | NSFW text and images | OpenAI Moderation, Azure, custom |
| Violence / Gore | Violent content, self-harm references | OpenAI Moderation, Azure |
| PII detection | SSN, email, phone, credit card | Presidio, Amazon Comprehend (text), AWS Macie (data stored in S3), custom NER |
| Spam / Misinformation | Spam, coordinated attacks | Custom classifiers, LLM-based |
| Image moderation | NSFW images, violence in images | Azure, Google Safe Search, Clarifai |
OpenAI Moderation API
OpenAI's moderation endpoint is one of the simplest moderation APIs to adopt. It is free to use and covers the most common categories:
from openai import OpenAI
client = OpenAI()
response = client.moderations.create(
model="omni-moderation-latest",
input="Your text to check here"
)
result = response.results[0]
# Check if flagged
print(f"Flagged: {result.flagged}")
# Individual category scores
categories = result.categories
scores = result.category_scores
print(f"Hate: {scores.hate:.4f} (flagged: {categories.hate})")
print(f"Harassment: {scores.harassment:.4f}")
print(f"Self-harm: {scores.self_harm:.4f}")
print(f"Sexual: {scores.sexual:.4f}")
print(f"Violence: {scores.violence:.4f}")
# Multi-modal moderation (text + image)
response = client.moderations.create(
model="omni-moderation-latest",
input=[
{"type": "text", "text": "Check this image"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
)
Moderation Categories
| Category | What It Detects |
|---|---|
| harassment | Content that expresses, incites, or promotes harassing language |
| harassment/threatening | Harassment content that includes violence or serious harm |
| hate | Content that expresses, incites, or promotes hate based on identity |
| hate/threatening | Hate content that includes violence or serious harm |
| self-harm | Content that promotes or encourages self-harm |
| self-harm/intent | Content expressing intent to engage in self-harm |
| self-harm/instructions | Content providing instructions on self-harm |
| sexual | Content meant to arouse sexual excitement |
| sexual/minors | Sexual content involving minors |
| violence | Content depicting death, violence, or physical injury |
| violence/graphic | Violent content with graphic depictions |
OpenAI's Moderation API is free. There's no reason not to add it as a first line of defense for any user-facing AI application. It takes one API call and catches the most dangerous content categories.
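To act on a result, it often helps to reduce it to the categories that were actually flagged, together with their scores. The sketch below assumes the raw JSON shape of a moderation result: a `categories` map of booleans and a `category_scores` map of floats, keyed by the category names in the table above. The helper name `flagged_categories` and the sample payload are illustrative and not part of the OpenAI SDK.

```python
def flagged_categories(result: dict) -> list[tuple[str, float]]:
    """Return (category, score) pairs for flagged categories, highest score first."""
    categories = result.get("categories", {})
    scores = result.get("category_scores", {})
    flagged = [
        (name, float(scores.get(name, 0.0)))
        for name, is_flagged in categories.items()
        if is_flagged
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hand-written sample payload for illustration (not real API output)
sample_result = {
    "flagged": True,
    "categories": {"harassment": True, "hate": False, "violence": True},
    "category_scores": {"harassment": 0.91, "hate": 0.02, "violence": 0.64},
}

for name, score in flagged_categories(sample_result):
    print(f"{name}: {score:.2f}")
```

Sorting by score puts the strongest signal first, which is convenient when only one reason is shown to a reviewer.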
Google Perspective API
Google's Perspective API, built by Jigsaw, focuses specifically on toxicity detection in comments and conversations:
import requests
PERSPECTIVE_API_KEY = "YOUR_KEY"
url = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={PERSPECTIVE_API_KEY}"
data = {
"comment": {"text": "User's comment here"},
"requestedAttributes": {
"TOXICITY": {},
"SEVERE_TOXICITY": {},
"IDENTITY_ATTACK": {},
"INSULT": {},
"PROFANITY": {},
"THREAT": {}
},
"languages": ["en"],
"doNotStore": True
}
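response = requests.post(url, json=data, timeout=10)
response.raise_for_status()  # Surface HTTP errors such as a bad key or exceeded quota
scores = response.json()["attributeScores"]
# Use a separate loop variable so the request payload `data` is not overwritten
for attribute, attribute_data in scores.items():
    score = attribute_data["summaryScore"]["value"]
    print(f"{attribute}: {score:.3f}")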
Perspective's strengths: it's specifically trained on conversational text (comments, forums, chat) and provides granular sub-categories. The free tier allows 1 request/second, which is sufficient for many applications.
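Staying under a 1 request/second quota usually means spacing calls out on the client side. The following is a minimal sketch of one way to do that; the `Throttle` class is a hypothetical helper, not part of the Perspective API or of `requests`, and the injectable `clock` and `sleep` parameters exist only to make it easy to test.

```python
import time


class Throttle:
    """Space out calls so that at most one starts per `min_interval` seconds."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # Start time of the previous call, if any

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record when this call is considered to have started
        self._last_call = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each `requests.post(...)` keeps a single-threaded client within the quota. A multi-threaded or multi-process client would need a shared limiter instead.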
Azure Content Safety
Microsoft's offering is the most comprehensive for enterprise deployments:
from azure.ai.contentsafety import ContentSafetyClient
from azure.core.credentials import AzureKeyCredential
client = ContentSafetyClient(
endpoint="https://your-endpoint.cognitiveservices.azure.com/",
credential=AzureKeyCredential("YOUR_KEY")
)
from azure.ai.contentsafety.models import AnalyzeTextOptions
request = AnalyzeTextOptions(
text="Text to analyze",
categories=["Hate", "Sexual", "Violence", "SelfHarm"]
)
response = client.analyze_text(request)
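for result in response.categories_analysis:
    print(f"{result.category}: severity={result.severity}")
# Severity: 0 (safe) to 7 (most severe). By default the text API reports a
# trimmed four-level scale (0, 2, 4, 6) unless eight-level output is requested.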
Azure's unique advantage: severity scores on a 0-7 scale instead of binary flagged/not-flagged. This lets you set custom thresholds — block at severity 5+, flag for review at 3+.
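The example thresholds above (block at 5+, review at 3+) can be expressed as a small mapping function. This is a sketch only: `action_for_severity` and its parameter names are illustrative, and the default cut-offs simply mirror the numbers in the paragraph above.

```python
def action_for_severity(severity: int, block_at: int = 5, review_at: int = 3) -> str:
    """Map a 0-7 severity level to 'allow', 'review', or 'block'."""
    if not 0 <= severity <= 7:
        raise ValueError(f"severity must be between 0 and 7, got {severity}")
    if severity >= block_at:
        return "block"
    if severity >= review_at:
        return "review"
    return "allow"


for level in (0, 2, 4, 6):
    print(level, action_for_severity(level))
```

Keeping the cut-offs as parameters makes it straightforward to use stricter values for some categories than for others.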
PII Detection & Redaction
Preventing PII leaks is a separate but equally important moderation concern:
Microsoft Presidio (Open Source)
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
# Analyze text for PII
analyzer = AnalyzerEngine()
results = analyzer.analyze(
text="My SSN is 123-45-6789 and email is john@example.com",
language="en"
)
for result in results:
    print(f"Found {result.entity_type}: {result.score:.2f}")
# Anonymize/redact PII
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(
text="My SSN is 123-45-6789 and email is john@example.com",
analyzer_results=results
)
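print(anonymized.text)
# For example: "My SSN is <US_SSN> and email is <EMAIL_ADDRESS>"
# Placeholders use Presidio's entity names; the exact output depends on which
# recognizers fire (the SSN recognizer may reject well-known sample numbers).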
Presidio detects 30+ PII entity types including SSN, credit cards, phone numbers, email addresses, dates of birth, medical record numbers, and more. It runs locally — no data leaves your infrastructure.
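Conceptually, redaction is a matter of replacing detected character spans with placeholders. The sketch below illustrates the idea with a hand-rolled `redact` function; it is not Presidio's implementation. It assumes each detection carries `start`, `end`, and `entity_type` (the fields Presidio's analyzer results expose) and that spans do not overlap. `Span` and `redact` are hypothetical names.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int        # Index of the first character of the match
    end: int          # Index one past the last character of the match
    entity_type: str  # Label to use in the placeholder


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with <ENTITY_TYPE>, assuming spans do not overlap."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        redacted = redacted[:span.start] + f"<{span.entity_type}>" + redacted[span.end:]
    return redacted


text = "Email john@example.com or call 555-0100"
email, phone = "john@example.com", "555-0100"
spans = [
    Span(text.index(email), text.index(email) + len(email), "EMAIL_ADDRESS"),
    Span(text.index(phone), text.index(phone) + len(phone), "PHONE_NUMBER"),
]
print(redact(text, spans))
```

Processing spans from the end of the string backwards avoids having to recompute offsets as the text changes length.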
Using LLMs for PII Detection
from openai import OpenAI
from pydantic import BaseModel
client = OpenAI()
class PIIExtraction(BaseModel):
    has_pii: bool
    pii_types: list[str]  # ["email", "phone", "ssn"]
    redacted_text: str
response = client.responses.parse(
model="gpt-5.4-mini",
input=[
{"role": "system", "content": "Detect and redact PII in the text. Replace PII with [TYPE]."},
{"role": "user", "content": "Call me at 555-1234 or email john@company.com"}
],
text_format=PIIExtraction,
)
result = response.output_parsed
print(result.has_pii) # True
print(result.pii_types) # ["phone", "email"]
print(result.redacted_text) # "Call me at [PHONE] or email [EMAIL]"
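Because an LLM's redaction is not guaranteed to be complete, a cheap deterministic check on the returned text can catch obvious misses. The sketch below covers only email addresses, using a deliberately simple pattern; `EMAIL_RE` and `leaked_emails` are illustrative names, and the regex is an assumption that will not match every valid address.

```python
import re

# Simple, permissive email pattern (an approximation, not a full RFC 5322 matcher)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def leaked_emails(redacted_text: str) -> list[str]:
    """Return any email-like strings still present after redaction."""
    return EMAIL_RE.findall(redacted_text)


print(leaked_emails("Call me at [PHONE] or email [EMAIL]"))
print(leaked_emails("Call me at [PHONE] or email john@company.com"))
```

If the list is non-empty, the text can be routed through a second redaction pass or rejected rather than stored.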
Moderation API Comparison
| Feature | OpenAI | Google Perspective | Azure Content Safety | Presidio |
|---|---|---|---|---|
| Cost | Free | Free (1 req/s) | Pay per call | Free (self-host) |
| PII detection | No | No | Limited | Yes (30+ types) |
| Image moderation | Yes (omni) | No | Yes | No |
| Custom thresholds | Score-based | Score-based | 0-7 severity | Configurable |
| Languages | Multi | Primarily English | Multi | Multi |
| Self-hosted | No | No | No | Yes |
Production Patterns
1. Multi-Layer Moderation Pipeline
Layer multiple moderation systems for comprehensive coverage:
class ModerationPipeline:
    def __init__(self, openai_client, analyzer, threshold=0.7):
        self.openai = openai_client
        self.analyzer = analyzer
        self.threshold = threshold

    def moderate(self, text: str) -> dict:
        """Run multi-layer moderation."""
        results = {
            "text": text,
            "passed": True,
            "flags": [],
            "actions": []
        }
        # Layer 1: PII detection (runs first so leaks are flagged early; note that
        # this sketch still sends the original, unredacted text to Layer 2)
        pii_results = self.analyzer.analyze(text=text, language="en")
        if pii_results:
            results["flags"].append({
                "type": "pii",
                "details": [{"type": r.entity_type, "score": r.score} for r in pii_results]
            })
            results["actions"].append("redact_pii")
        # Layer 2: OpenAI Moderation (toxicity, violence, sexual, hate)
        moderation = self.openai.moderations.create(
            model="omni-moderation-latest",
            input=text
        )
        mod_result = moderation.results[0]
        if mod_result.flagged:
            results["passed"] = False
            # Iterating the SDK model yields (attribute_name, value) pairs
            for cat, flagged in mod_result.categories:
                if flagged:
                    results["flags"].append({
                        "type": cat,
                        "score": getattr(mod_result.category_scores, cat)
                    })
            results["actions"].append("block")
        # Layer 3: Custom business rules
        # (e.g., competitor mentions, brand safety, etc.)
        return results
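Layer 3 is left as a comment in the pipeline above. One simple form a business rule could take is a blocked-term check, sketched below. The function name `find_blocked_terms` and the sample terms are hypothetical, and the whole-word matching assumes the terms are plain words made of letters or digits.

```python
import re


def find_blocked_terms(text: str, blocked_terms: list[str]) -> list[str]:
    """Return the blocked terms that appear in `text` as whole words (case-insensitive)."""
    found = []
    for term in blocked_terms:
        # \b anchors keep "acmecorp" from matching inside "acmecorporation"
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Illustrative terms only
print(find_blocked_terms("Try AcmeCorp instead!", ["acmecorp", "globex"]))
```

Inside the pipeline, a non-empty result could append a flag of type `"business_rule"` and an action such as `"human_review"`, depending on policy.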
2. Threshold-Based Decisioning
Don't just use binary flagged/not-flagged. Set graduated thresholds:
def decide_action(moderation_result, thresholds):
    """Decide action based on score thresholds."""
    actions = {}
    for category, score in moderation_result.category_scores:
        if score >= thresholds.get(category, {}).get("block", 0.9):
            actions[category] = "block"
        elif score >= thresholds.get(category, {}).get("review", 0.5):
            actions[category] = "human_review"
        else:
            actions[category] = "allow"
    return actions
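# Example configuration
# Keys use the SDK's attribute names (underscores, e.g. "sexual_minors"),
# because that is what iterating category_scores yields; a "sexual/minors"
# key would never match and would silently fall back to the defaults.
THRESHOLDS = {
    "sexual_minors": {"block": 0.1, "review": 0.01},  # Zero tolerance
    "violence": {"block": 0.7, "review": 0.3},
    "hate": {"block": 0.8, "review": 0.4},
    "sexual": {"block": 0.8, "review": 0.5},
    "harassment": {"block": 0.7, "review": 0.4},
}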
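`decide_action` returns one action per category, but a request ultimately needs a single outcome. A minimal sketch of one way to collapse them, assuming the most severe action should win; `SEVERITY_ORDER` and `overall_decision` are illustrative names.

```python
# Rank actions so the most restrictive one can be selected
SEVERITY_ORDER = {"allow": 0, "human_review": 1, "block": 2}


def overall_decision(actions: dict[str, str]) -> str:
    """Collapse per-category actions into one decision; the most severe wins."""
    if not actions:
        return "allow"
    return max(actions.values(), key=SEVERITY_ORDER.__getitem__)


print(overall_decision({"hate": "allow", "violence": "human_review"}))
print(overall_decision({"hate": "block", "violence": "human_review"}))
```

An unknown action string raises `KeyError`, which is intentional here: a typo in the configuration should fail loudly rather than be treated as "allow".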
3. Moderation with User Context
The same content can be acceptable or harmful depending on context:
def moderate_with_context(text, user_context):
    """Consider user context in moderation decisions."""
    # Base moderation; `moderate` is assumed to be defined elsewhere
    # (for example, ModerationPipeline.moderate from pattern 1)
    result = moderate(text)
    multiplier = 1.0
    # Be more lenient in private conversations
    if user_context["channel"] == "dm":
        multiplier = 1.3  # Higher thresholds
    # Always strict for public-facing content
    elif user_context["channel"] == "public_post":
        multiplier = 0.8
    # Escalate for repeat offenders; applied last so the channel cannot relax it
    if user_context["previous_violations"] > 2:
        multiplier = min(multiplier, 0.7)  # Lower thresholds
    result["threshold_multiplier"] = multiplier
    return result
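The function above records a `threshold_multiplier` but does not show how it is applied. One plausible approach, sketched here under the assumption that thresholds use the `{"block": ..., "review": ...}` layout from pattern 2, is to scale every threshold and cap the result at 1.0. `scale_thresholds` is a hypothetical helper.

```python
def scale_thresholds(thresholds: dict, multiplier: float) -> dict:
    """Return a copy of `thresholds` with every value scaled and capped at 1.0."""
    return {
        category: {
            level: min(1.0, value * multiplier)
            for level, value in levels.items()
        }
        for category, levels in thresholds.items()
    }


base = {"violence": {"block": 0.7, "review": 0.3}}
print(scale_thresholds(base, 0.7))  # Stricter: lower thresholds
print(scale_thresholds(base, 1.3))  # More lenient: higher thresholds
```

A threshold that reaches the 1.0 cap effectively disables that action, so lenient multipliers deserve extra care on high-risk categories.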
4. Async Moderation for High Volume
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI()
async def moderate_batch(texts, batch_size=50):
    """Moderate many texts concurrently."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        tasks = [
            async_client.moderations.create(
                model="omni-moderation-latest",
                input=text
            )
            for text in batch
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        for text, response in zip(batch, responses):
            if isinstance(response, Exception):
                results.append({"text": text, "error": str(response)})
            else:
                results.append({
                    "text": text,
                    "flagged": response.results[0].flagged,
                    "categories": response.results[0].categories
                })
    return results
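Fixed-size batches wait for the slowest request in each batch before starting the next. An alternative technique is to cap the number of in-flight requests with an `asyncio.Semaphore`, so that a new request starts as soon as any slot frees up. The sketch below uses a stand-in coroutine instead of a real API call; `gather_bounded` and `fake_moderate` are hypothetical names.

```python
import asyncio


async def gather_bounded(items, worker, max_concurrency=10):
    """Run worker(item) for every item with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    # Results come back in input order; exceptions are returned, not raised
    return await asyncio.gather(*(run_one(item) for item in items), return_exceptions=True)


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_moderate(text):
        # Stand-in for a moderation call; tracks how many run at once
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return {"text": text, "flagged": False}

    texts = [f"msg {i}" for i in range(20)]
    results = await gather_bounded(texts, fake_moderate, max_concurrency=5)
    return results, peak


results, peak = asyncio.run(_demo())
print(len(results), "results; peak concurrency:", peak)
```

In real use, `worker` would wrap the `async_client.moderations.create(...)` call shown above.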
Moderating LLM Outputs
Don't just moderate user input — moderate model output too. LLMs can generate harmful content, especially when users craft adversarial prompts:
async def safe_generate(client, model, messages, **kwargs):
    """Generate with input and output moderation.

    `client` must be an AsyncOpenAI instance, because every API call is awaited.
    """
    # 1. Moderate input (assumes the last message is the user's, with string content)
    user_message = messages[-1]["content"]
    input_check = await client.moderations.create(
        model="omni-moderation-latest",
        input=user_message
    )
    if input_check.results[0].flagged:
        return {"error": "Input violates safety guidelines", "flagged_categories": [...]}
    # 2. Generate response
    response = await client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )
    output_text = response.choices[0].message.content
    # 3. Moderate output
    output_check = await client.moderations.create(
        model="omni-moderation-latest",
        input=output_text
    )
    if output_check.results[0].flagged:
        # Log the incident for analysis
        # (log_safety_incident is assumed to be provided by your application)
        log_safety_incident(messages, output_text, output_check.results[0])
        return {"error": "Response filtered for safety", "fallback": "I can't help with that."}
    return {"response": output_text}
Common Pitfalls
- Only moderating user input — Always moderate model output too. Prompt injection can cause LLMs to generate harmful content
- Using binary thresholds — Score-based thresholds with graduated actions (allow/review/block) are far more effective
- Ignoring PII in prompts — Users paste sensitive data into chatbots constantly. Detect and redact before processing
- Not logging moderation decisions — You need logs to tune thresholds and understand false positives/negatives
- Over-blocking — Aggressive moderation frustrates users. Block only high-confidence violations at first and route borderline cases to human review
- Not handling moderation API failures — If the moderation API is down, what happens? Fail open (allow) or fail closed (block)?
- English-only moderation — Harmful content in other languages slips through English-trained models
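The fail-open versus fail-closed question from the list above can be made an explicit, testable policy instead of an accident of exception handling. A minimal sketch, assuming the moderation check is wrapped in a callable that returns `True` when content is flagged; `check_with_fallback` is a hypothetical helper.

```python
from typing import Callable


def check_with_fallback(check: Callable[[str], bool], text: str, fail_closed: bool = True) -> dict:
    """Run `check(text)`; if it raises, apply the fail-closed or fail-open policy."""
    try:
        flagged = check(text)
    except Exception as exc:
        # Moderation is unavailable: block when failing closed, allow when failing open
        return {"allowed": not fail_closed, "flagged": None, "error": str(exc)}
    return {"allowed": not flagged, "flagged": flagged, "error": None}


def unavailable(text: str) -> bool:
    # Simulates an outage of the moderation service
    raise TimeoutError("moderation API unavailable")


print(check_with_fallback(unavailable, "hello"))
print(check_with_fallback(unavailable, "hello", fail_closed=False))
```

Returning the error alongside the decision also supports the logging pitfall above: outages show up in the same records as ordinary moderation decisions.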
Conclusion
Content moderation is a non-negotiable part of any AI application that handles user-generated content. Start with OpenAI's free Moderation API for toxicity and harmful content detection. Add Presidio for PII protection. Use Azure Content Safety if you need enterprise-grade severity scoring. And always moderate both input and output — a safe LLM application is one that filters content in both directions.
The key principle: moderation should be layered, graduated, and logged. No single API catches everything, binary decisions create bad user experiences, and you can't improve what you don't measure.