Content Safety · medium · v1.0.0 · System

Output Toxicity Filter

Detects toxic, harmful, hateful, or harassing content in model responses and blocks it before display.

📘 Clone & start observing

Creates a Guideline policy. Observation only — nothing is blocked until you promote to Strict.

Mode on clone: log

Policy name

Defaults to template name. Customise to distinguish multiple instances of the same template.

Leave empty to apply broadly via the template's default data-classification / risk-tier filters.

Rationale

User-facing AI systems must not produce harmful content. This is both a safety and a brand requirement.

Example violation

The model generates a response containing racial slurs or harassment.

Triggers (1)

Detectors (1)

Actions (2)

Tunable parameters (2)

Toxicity threshold

basic · number

Lower values block more aggressively.

Default: 0.7

Blocked categories

basic · list

Which toxicity categories to block.

Default: ["hate","harassment","self_harm","violence"]
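A minimal sketch of how the two tunable parameters might combine, for illustration only. It assumes the detector returns one score per category in the range 0 to 1, and that a score at or above the threshold counts as a violation; neither detail is stated by the template. The function name violated_categories and the shape of the scores dictionary are hypothetical.

```python
from typing import Dict, List, Optional

# Defaults copied from the template's tunable parameters.
DEFAULT_THRESHOLD = 0.7
DEFAULT_BLOCKED = ["hate", "harassment", "self_harm", "violence"]


def violated_categories(
    scores: Dict[str, float],
    threshold: float = DEFAULT_THRESHOLD,
    blocked: Optional[List[str]] = None,
) -> List[str]:
    """Return the blocked categories whose score meets the threshold.

    Assumption: `scores` maps category name -> score in [0, 1].
    Categories missing from `scores` are treated as 0.0.
    Categories not listed in `blocked` are ignored, however high they score.
    """
    if blocked is None:
        blocked = DEFAULT_BLOCKED
    return [c for c in blocked if scores.get(c, 0.0) >= threshold]


# Hypothetical detector output for one model response.
example_scores = {"hate": 0.82, "violence": 0.40, "sexual": 0.95}

print(violated_categories(example_scores))                 # default settings
print(violated_categories(example_scores, threshold=0.3))  # more aggressive
```

Under these assumptions, lowering the threshold from 0.7 to 0.3 adds "violence" to the result, which matches the note that lower values block more aggressively. The "sexual" score is ignored in both calls because that category is not in the blocked list.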

Regulatory references

EU AI Act Art. 5

Template defaults (suggested target after promotion)

Suggested mode

block

Risk tiers

—

Data classifications

—

Departments

—

Cloned policies start in Guideline mode. Use the promotion wizard to flip to Strict once you trust the false-positive rate.
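The difference between the two modes can be sketched as follows. This is an illustration under assumptions, not the product's implementation: it takes "log" to be the mode value for Guideline and "block" the value for Strict, and it assumes a blocked response is also recorded. The function name apply_policy and its return shape are hypothetical.

```python
from typing import List, Optional, Tuple


def apply_policy(
    response: str,
    violations: List[str],
    mode: str = "log",
) -> Tuple[Optional[str], bool]:
    """Return (text_to_display, violation_recorded).

    Assumed semantics:
      - "log"   (Guideline): record the violation, show the response anyway.
      - "block" (Strict):    record the violation, suppress the response.
    """
    if mode not in ("log", "block"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not violations:
        # Nothing detected: pass the response through in either mode.
        return response, False
    if mode == "block":
        return None, True
    return response, True


# The same detection result handled in each mode.
print(apply_policy("some reply", ["harassment"], mode="log"))
print(apply_policy("some reply", ["harassment"], mode="block"))
```

Running a policy in "log" mode first yields the recorded violations needed to estimate the false-positive rate before switching to "block".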