Mode on clone: log
Content Safety · medium · v1.0.0 · System
Output Toxicity Filter
Detects toxic, harmful, hateful, or harassing content in model responses and blocks it before display.
📘Clone & start observing
Creates a Guideline policy. Observation only — nothing is blocked until you promote to Strict.
Defaults to template name. Customise to distinguish multiple instances of the same template.
Leave empty to apply broadly via the template's default data-classification / risk-tier filters.
Rationale
User-facing AI systems must not produce harmful content. This is both a safety and a brand requirement.
Example violation
Model generates response containing racial slurs or harassment
Triggers (1)
- output: Scan responses for toxic content
Detectors (1)
- classifier: toxicity-classifier (Multi-label toxicity classifier)
Actions (2)
- block: Replace with safe refusal
- log: Record category and confidence
Tunable parameters (2)
Toxicity threshold
basic · number
Lower values block more aggressively.
Default: 0.7
Blocked categories
basic · list
Which toxicity categories to block.
Default: ["hate","harassment","self_harm","violence"]
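How the two parameters might combine can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: it assumes the classifier returns one score between 0 and 1 per category, and that a category counts as a hit when its score is at or above the threshold. The function name flagged_categories and the score dictionary shape are hypothetical.

```python
from typing import Dict, List, Sequence

# Defaults copied from the template's tunable parameters.
DEFAULT_THRESHOLD = 0.7
DEFAULT_BLOCKED_CATEGORIES = ("hate", "harassment", "self_harm", "violence")


def flagged_categories(
    scores: Dict[str, float],
    threshold: float = DEFAULT_THRESHOLD,
    blocked_categories: Sequence[str] = DEFAULT_BLOCKED_CATEGORIES,
) -> List[str]:
    """Return the blocked categories whose score reaches the threshold.

    Assumption: `scores` maps category name to a 0..1 classifier score,
    and "score >= threshold" is the trigger rule. Categories that are not
    in `blocked_categories` are ignored, however high they score.
    """
    return [
        category
        for category in blocked_categories
        if scores.get(category, 0.0) >= threshold
    ]
```

With the defaults, scores of {"hate": 0.82, "violence": 0.4} flag only "hate". Lowering the threshold to 0.3 flags "violence" as well, which is the sense in which a lower value blocks more aggressively.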
Regulatory references
EU AI Act Art. 5
Template defaults (suggested target after promotion)
Suggested mode
block
Risk tiers
—
Data classifications
—
Departments
—
Cloned policies start in Guideline mode. Use the promotion wizard to flip to Strict once you trust the false-positive rate.
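The difference between the two modes can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's code: the mode strings "guideline" and "strict", the apply_policy function, the log record fields, and the refusal text are all hypothetical. It only mirrors the behaviour described on this page: Guideline mode records a hit without changing the response, while Strict mode also replaces the response with a safe refusal.

```python
from typing import Dict, Optional, Tuple

# Hypothetical refusal text; the real wording is not given in the template.
SAFE_REFUSAL = "Sorry, I can't help with that."


def apply_policy(
    response: str,
    flagged: Dict[str, float],
    mode: str,
) -> Tuple[str, Optional[dict]]:
    """Return (text to display, log record or None).

    Assumption: `flagged` maps each triggered category to its confidence,
    and `mode` is either "guideline" (observe only) or "strict" (block).
    """
    if not flagged:
        # Nothing triggered: show the response unchanged, nothing to log.
        return response, None

    # The log action records category and confidence in both modes.
    record = {
        "categories": sorted(flagged),
        "confidence": max(flagged.values()),
        "mode": mode,
    }

    if mode == "strict":
        # The block action replaces the response with a safe refusal.
        return SAFE_REFUSAL, record

    # Guideline mode: observation only, the response is still displayed.
    return response, record
```

Reviewing the records collected in Guideline mode is one way to estimate the false-positive rate before promoting the policy.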