Atlas AI
Content Safety · medium · v1.0.0 · System

Output Toxicity Filter

Detects toxic, harmful, hateful, or harassing content in model responses and blocks it before display.

📘Clone & start observing

Creates a Guideline policy. Observation only — nothing is blocked until you promote to Strict.

Mode on clone: log
Defaults to template name. Customise to distinguish multiple instances of the same template.
Leave empty to apply broadly via the template's default data-classification / risk-tier filters.
Rationale

User-facing AI systems must not produce harmful content. This is both a safety and a brand requirement.

Example violation
The model generates a response containing racial slurs or harassment.
Triggers (1)
  • output: Scan responses for toxic content
Detectors (1)
  • classifier: toxicity-classifier
    Multi-label toxicity classifier
Actions (2)
  • block: Replace with safe refusal
  • log: Record category and confidence
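
As a rough illustration of how the trigger, detector, and actions listed above might fit together, the sketch below applies the log and block actions to a single model response. Everything in it is hypothetical: the `apply_actions` function, the `scores` dictionary standing in for the output of `toxicity-classifier`, the refusal text, and the `>=` comparison are assumptions for illustration, not the product's API.

```python
from dataclasses import dataclass, field

SAFE_REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal text


@dataclass
class Outcome:
    text: str                                 # what the user would see
    log: list = field(default_factory=list)   # (category, confidence) records


def apply_actions(response, scores, mode, threshold=0.7):
    """Apply the log and block actions to one response.

    scores: dict mapping category -> confidence (stand-in for classifier output).
    mode:   "log" (observe only) or "block".
    """
    # Log action: record every category at or above the threshold (assumed rule).
    hits = [(cat, conf) for cat, conf in sorted(scores.items()) if conf >= threshold]
    outcome = Outcome(text=response, log=hits)
    # Block action: only replaces the text when the policy is in block mode.
    if hits and mode == "block":
        outcome.text = SAFE_REFUSAL
    return outcome
```

In `log` mode the response passes through unchanged and only the record is produced, which matches the observation-only behaviour of a freshly cloned policy.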
Tunable parameters (2)
Toxicity threshold
basic · number
Lower = more aggressive blocking.
Default: 0.7
Blocked categories
basic · list
Which toxicity categories to block.
Default: ["hate","harassment","self_harm","violence"]
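
To make the interaction between the two parameters concrete, here is a minimal sketch of a block decision, assuming a response is blocked when any listed category scores at or above the threshold. The `should_block` function, the score dictionary, and the `profanity` category are hypothetical; only the two default values come from this template.

```python
DEFAULT_THRESHOLD = 0.7
DEFAULT_BLOCKED = ("hate", "harassment", "self_harm", "violence")


def should_block(scores, threshold=DEFAULT_THRESHOLD, blocked_categories=DEFAULT_BLOCKED):
    """Return the blocked categories whose score meets the threshold.

    An empty list means the response would not be blocked.
    Categories missing from `scores` are treated as 0.0 (an assumption).
    """
    return [c for c in blocked_categories if scores.get(c, 0.0) >= threshold]


# "profanity" is a made-up category that is not in the blocked list,
# so it never triggers a block regardless of its score.
scores = {"hate": 0.65, "profanity": 0.95}
print(should_block(scores))                 # []
print(should_block(scores, threshold=0.6))  # ['hate']
```

The second call shows why a lower threshold blocks more aggressively: the same 0.65 score for `hate` is ignored at 0.7 but caught at 0.6.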
Regulatory references
EU AI Act Art. 5
Template defaults (suggested target after promotion)
Suggested mode
block
Risk tiers
Data classifications
Departments

Cloned policies start in Guideline mode. Use the promotion wizard to flip to Strict once you trust the false-positive rate.
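
One way to judge the false-positive rate before promotion is to have reviewers label a sample of the responses flagged while the policy is in Guideline mode, then compute the share that were flagged incorrectly. The sketch below assumes a hypothetical review format (one boolean per flagged response) and an arbitrary 5% target; neither comes from the product. Strictly speaking, the share of wrong flags is a false discovery rate, used here as a practical proxy.

```python
def flagged_false_positive_share(reviews):
    """Share of flagged responses that reviewers judged to be benign.

    reviews: list of booleans, True if the flagged response really was toxic.
    """
    if not reviews:
        raise ValueError("no reviewed samples")
    false_positives = sum(1 for is_toxic in reviews if not is_toxic)
    return false_positives / len(reviews)


def ready_to_promote(reviews, max_share=0.05):
    """True if the wrong-flag share is at or below the (illustrative) target."""
    return flagged_false_positive_share(reviews) <= max_share
```

For example, 1 benign response among 100 reviewed flags gives a share of 0.01, which would pass the illustrative 5% target.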