
Guardrail Definition Specification

File naming: guardrails/<guardrail_id>.guardrail.md

Audience: Platform engineers, AI safety teams, product owners


Overview

A note on naming: The term guardrail is borrowed from the AI safety world, where it primarily connotes blocking and filtering. In AML, the term is intentionally broader: a guardrail is any runtime-invoked backend that inspects, transforms, tags, or enriches agent data. Use cases range from PII redaction and content moderation to topic classification, sentiment scoring, and metadata injection. If the name feels too narrow for your use case, the underlying concept is simply a runtime pipeline hook.

A guardrail is a special-purpose backend that the agent runtime invokes automatically against agent inputs and outputs — before passing data to the model, or before returning results to the caller. Think of a guardrail as a special tool that the runtime calls, rather than one the model decides to call.

Guardrails share the same invocation architecture as tools: they receive a structured payload, call a transport backend, and return a structured result. The key differences from tools are:

  • Invoked by the runtime, not the model. The model is unaware that guardrails are running.
  • Description is informational only. Unlike tool descriptions, meta.description is never injected into the model's context.
  • Input is implicit. The agent runtime automatically passes parameters of compatible content types from the agent's interface to the guardrail — no input schema needs to be declared.
  • Result determines pipeline behavior. The guardrail's output drives a decision: halt, transform the payload, annotate it, or continue.

A guardrail definition file maps a stable identifier (e.g., pii-scan) to a transport backend, invocation policy, and result type. Agent files reference guardrails by ID — they never copy guardrail configuration inline. When a guardrail backend changes, only the guardrail definition file changes; the compiler re-validates all referencing agents automatically.


How guardrails are invoked

When the agent runtime reaches a guardrail-instrumented pipeline position (defined in the agent's guardrails section), it passes the relevant content as the guardrail's input payload. Four positions are supported:

| Position | Content scanned | Primary threat addressed |
| --- | --- | --- |
| input | Agent input fields matching behaviour.content_types, before the model sees them | Direct injection, unsafe user input |
| tool_input | Tool call arguments generated by the model, before dispatch | Injected instructions crafted by a compromised model |
| tool_output | Tool results returned to the runtime, before the model processes them | Indirect prompt injection via retrieved content |
| output | Agent output fields matching behaviour.content_types, before returning to the caller | Model-generated harmful content, PII leakage |

For input and output positions, the runtime collects all parameters from the agent's interface that match the guardrail's declared behaviour.content_types:

  • A guardrail with behaviour.content_types: ["text"] receives all string-typed fields from the agent's input or output object.
  • A guardrail with behaviour.content_types: ["image"] receives all binary image-typed fields.
  • A guardrail with behaviour.content_types: ["text", "image"] receives both.

For tool_input and tool_output positions, the content is the serialised tool call arguments (as a JSON object) or the tool result (as a JSON object). The behaviour.content_types matching logic applies the same way — a guardrail with content_types: ["text"] receives string-valued fields from the tool argument or result object.

If no parameters at the attached position match the guardrail's behaviour.content_types, it is a hard validation error at compile time.
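The collection rule above amounts to a simple filter over the agent's interface fields. A minimal Python illustration, assuming text maps to string fields (with numbers coerced to text, per the note in the behaviour section) and image maps to binary fields; the field names here are hypothetical:

```python
def collect_fields(interface_fields: dict, content_types: list[str]) -> dict:
    """Collect the fields a guardrail will receive at its attached position,
    based on its declared behaviour.content_types. Types trivially
    convertible to text (e.g., numbers) are treated as text."""
    matched = {}
    for name, value in interface_fields.items():
        if "text" in content_types and isinstance(value, (str, int, float)):
            matched[name] = str(value)
        elif "image" in content_types and isinstance(value, bytes):
            matched[name] = value
    return matched

fields = {"question": "What is my balance?", "attempt": 2, "avatar": b"\x89PNG"}
collect_fields(fields, ["text"])
# {'question': 'What is my balance?', 'attempt': '2'}
```

A guardrail declaring ["text", "image"] would receive all three fields; one declaring only ["image"] would receive just the binary field, and an empty match at compile time is the hard error described above.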

The guardrail receives a normalized input object of the form:

{
  "content": { "<field_name>": "<field_value>", ... },
  "position": "input",
  "agent_id": "support-agent",
  "run_id": "run-abc123"
}

The guardrail response follows the standard output format described below.


Design decisions

AML's guardrail spec makes two deliberate choices that diverge from how managed AI platforms typically expose guardrails. This section explains the reasoning so the constraints are understood rather than just followed.

Only rest-api and lambda are supported transport types

Platforms such as AWS Bedrock, Azure AI Content Safety, and GCP Natural Language API offer native guardrail or moderation endpoints with their own request/response shapes, versioning schemes, and severity scales. Supporting them as first-class transport types would require the AML runtime to understand each provider's response envelope — and update that logic every time a provider changes their API.

Instead, AML defines a standard output format and requires all guardrail backends to return that shape. Provider-specific calls — including any normalisation from a provider's native scale (e.g., Azure's 0–6) to AML's 0–10 scale — belong in a Lambda function or REST API that the team owns and versions independently.

This makes the runtime simple and stable. It also makes the normalisation logic explicit, testable, and auditable: it lives in a Lambda you control, not inside the platform.

Using AWS Bedrock, Azure or GCP?

Write a Lambda that calls the provider API, maps the response to the AML output format, and configure this guardrail to call that Lambda. The Lambda's IAM role can hold the provider credentials. The Lambda ARN is pinned in the guardrail definition — versioning and rollback are handled via Lambda aliases or versions.
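A minimal sketch of such a wrapper, assuming an Azure-style 0–6 severity scale; call_content_safety is a hypothetical stand-in for the real provider SDK call, and the returned dict follows the standard output format defined later in this spec:

```python
def normalise_severity(provider_score: int, provider_max: int = 6) -> int:
    """Map a provider-native severity (e.g., Azure's 0-6) onto AML's 0-10 scale."""
    return round(provider_score * 10 / provider_max)

def call_content_safety(text: str) -> dict:
    """Stand-in for the real provider SDK call (hypothetical response shape)."""
    return {"severity": 4, "categories": [{"name": "violence", "severity": 4}]}

def handler(event, context):
    provider = call_content_safety(event["content"])
    return {
        "result_type": "score",
        "severity": normalise_severity(provider["severity"]),
        "triggered": None,  # derived by the runtime from the agent's threshold
        "category_scores": {c["name"]: normalise_severity(c["severity"])
                            for c in provider.get("categories", [])},
        "content": None,
        "annotations": None,
        "enrichment": None,
        "raw": provider,    # preserve the native response for the audit trail
    }
```

Because the scale mapping lives in this function, it can be unit-tested and versioned alongside the Lambda rather than hidden inside the platform.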

Built-in guardrails omit transport — no special type needed

Platform-built checks (jailbreak detection, token budget enforcement, etc.) follow the same pattern as tool type: "function": they are in-process by definition and do not need an external call. Declaring their absence via a transport: { type: "platform-native" } block would be noise — it adds syntax to say "no transport".

The rule is simple: if transport is present, the guardrail makes an external call and invocation is required. If transport is absent, the runtime invokes it in-process. No special transport type, no exception-handling in the schema.


File structure

---
[YAML front matter — all structured fields]
---

[optional prose — what this guardrail checks, when to use it, usage guidance]

The Markdown body is optional and intended for the guardrail registry documentation UI. It does not affect runtime behavior.


YAML front matter — complete field reference

Top-level required fields

spec_version: "1.2"
The AML format version. Must equal a platform-approved version string.

guardrail_id: "pii-scan"
Stable, immutable identifier for this guardrail entry. Used as the value of ref in agent definition guardrails sections. Lowercase kebab-case. Must match ^[a-z0-9_-]{3,64}$. Once published, the guardrail_id cannot change — create a new entry for a different check.

version: "1.0.0"
Semantic version of this guardrail definition (MAJOR.MINOR.PATCH). Increment whenever transport config or behavior changes.

status: "active"
Lifecycle state. Enum: active | deprecated | disabled. A deprecated guardrail triggers a lint warning on any agent that references it. A disabled guardrail causes a hard compile error.


meta — descriptive metadata (required)

meta:
  name: "PII Scan (Bedrock)"              # Required — display name in the registry UI
  description: >                          # Optional — informational only, never sent to the model
    Detects and redacts personally identifiable information in agent inputs
    and outputs using AWS Bedrock Guardrails.
  owner: "ai-safety-team"                 # Optional
  tags: ["pii", "privacy", "gdpr"]        # Optional — searchable labels
  last_updated: "2026-04-12"             # Optional — ISO 8601 date

meta.name is the only required sub-field. All others are optional but recommended.

meta.description is informational only — it is displayed in the guardrail registry UI and used for documentation, but is never injected into the model's context. The model has no knowledge that guardrails are running.


behaviour — guardrail contract (required)

behaviour:
  result_type: "score"
  content_types: ["text"]

Groups the two fields that define what this guardrail does and what content it can process.

behaviour.result_type declares what kind of output this guardrail produces. The runtime uses this to determine how to interpret the result and which call-site parameters in the agent definition are valid.

| result_type | What it returns | Effect | Typical use cases |
| --- | --- | --- | --- |
| score | A 0–10 severity integer | The agent compares the severity against its configured severity_threshold; if met or exceeded, on_fail triggers | Content moderation, jailbreak detection, prompt injection |
| transform | Modified content | The original content is replaced with the guardrail's output before the pipeline continues | PII redaction, text anonymization |
| annotate | Metadata tags | Key-value tags attached to the payload for downstream routing or observability; the payload itself is unchanged | Language detection, topic classification |
| enrich | Content to append | Additional context injected alongside the existing payload | Disclaimer injection, compliance watermarking |

behaviour.content_types lists the content modalities this guardrail is able to process. Supported values: text | image | video | document.

The runtime uses this to match the guardrail against agent interface parameters. At the attached pipeline position, only parameters whose type matches an entry in behaviour.content_types are passed to the guardrail. Attaching a guardrail to a position where no parameters match its declared content types is a hard compile error. Note that types trivially convertible to text, such as numbers, are treated as text by guardrails.


transport — invocation details (required for external guardrails)

The transport block defines how the guardrail is called. Two transport types are supported, both identical in definition to their tool transport counterparts:

| Type | Description |
| --- | --- |
| rest-api | HTTP/REST endpoint |
| lambda | AWS Lambda function |

Refer to Transport & Credentials for the full field reference, credential schemes, and secret sources.

Provider-specific backends (AWS Bedrock Guardrails, Azure AI Content Safety, GCP Natural Language, etc.) are not directly supported as transport types. Wrap them in a Lambda function or REST API that calls the provider and returns a response conforming to the standard output format. This keeps the AML runtime agnostic to provider-specific response envelopes and normalisation logic.

Platform-built checks follow the same pattern as tool type: "function": omit the transport block entirely. When transport is absent, the runtime invokes the guardrail in-process. Built-in guardrails never time out, never require credentials, and do not need an invocation block.


invocation — execution settings (required when transport is present)

invocation:
  timeout_ms: 300
  on_timeout:
    severity: 10                  # Severity returned to the agent if the call times out
  on_provider_error:
    severity: 10                  # Severity returned to the agent on provider 5xx or network failure
  retry_policy:
    max_attempts: 2
    backoff_ms: 100

| Field | Default | Description |
| --- | --- | --- |
| timeout_ms | 500 | Maximum time in milliseconds to wait for the transport response. |
| on_timeout.severity | 10 | The severity value (0–10) returned to the agent when the call times out and the retry policy is exhausted. |
| on_provider_error.severity | 10 | The severity value (0–10) returned to the agent on a provider error (5xx, network failure) after retries. |
| retry_policy.max_attempts | 1 | Total call attempts including the initial one. |
| retry_policy.backoff_ms | 100 | Initial backoff delay in milliseconds between retries (exponential). |
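With exponential backoff, the delays inserted between attempts can be computed as follows (a sketch assuming the initial backoff doubles after each failed attempt):

```python
def retry_delays(max_attempts: int, backoff_ms: int) -> list[int]:
    """Delays (ms) before each retry; the first attempt has no delay.
    Exponential: the initial backoff doubles after every failed attempt."""
    return [backoff_ms * 2 ** i for i in range(max_attempts - 1)]

retry_delays(max_attempts=2, backoff_ms=100)   # [100]
retry_delays(max_attempts=4, backoff_ms=100)   # [100, 200, 400]
retry_delays(max_attempts=1, backoff_ms=100)   # [] — no retries at all
```

Note that with the default max_attempts of 1, the retry policy never activates and a timeout immediately yields the on_timeout severity.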

Severity-based error policy

Rather than a binary fail_open / fail_closed on the guardrail, error conditions return a synthetic severity score. The agent's configured severity_threshold determines whether that synthetic score triggers a pipeline action — all decision logic stays on the agent side.

The conventional mapping is:

| Synthetic severity | Equivalent intent |
| --- | --- |
| 10 | Equivalent to fail_closed — any configured threshold will trigger |
| 0 | Equivalent to fail_open — no threshold will trigger |
| 5 | Permissive default — triggers on strict agents, passes on lenient ones |

For security-critical guardrails (PII, prompt injection, unsafe content) set both on_timeout.severity and on_provider_error.severity to 10. For supplementary or quality checks, consider 5 to allow lenient agents to continue while strict agents still block.
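The decision itself is a single comparison on the agent side. A sketch, with illustrative threshold values:

```python
def guardrail_triggers(severity: int, severity_threshold: int) -> bool:
    """True when on_fail should fire: severity meets or exceeds the threshold."""
    return severity >= severity_threshold

# Timeout on a security-critical guardrail (synthetic severity 10):
guardrail_triggers(10, severity_threshold=3)   # True — strict agent blocks
guardrail_triggers(10, severity_threshold=9)   # True — even lenient agents block

# Timeout on a quality check with synthetic severity 5:
guardrail_triggers(5, severity_threshold=3)    # True — strict agent blocks
guardrail_triggers(5, severity_threshold=6)    # False — lenient agent continues
```

This is why the guardrail definition only chooses a synthetic severity: the same error maps to different outcomes depending on each referencing agent's threshold.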


fallback — fallback guardrail (optional)

fallback:
  enabled: true
  fallback_guardrail_id: "pii-scan-lite"   # guardrail_id of the fallback guardrail
  emit_warning: true                       # Log a warning event when fallback activates

| Field | Default | Description |
| --- | --- | --- |
| enabled | false | Whether to activate a fallback when the primary transport is unavailable. |
| fallback_guardrail_id | — | The guardrail_id of the guardrail to invoke as a fallback. The fallback guardrail's own definition — including its transport, invocation policy, and error behavior — governs how it is called. |
| emit_warning | true | Emit a structured warning event to the audit log when the fallback activates. |

The fallback guardrail is typically a built-in (no transport) check that provides lighter-weight coverage and never fails. Its behaviour.result_type must match the primary guardrail's.


Standard guardrail output format

AML defines a canonical response format for all guardrails. The runtime enforces this format for agents compiled and deployed with AML Studio. When implementing a custom guardrail transport manually (e.g., a rest-api or lambda backend), the service should return a response conforming to this shape.

{
  "result_type": "score",
  "severity": 7,
  "triggered": true,
  "category_scores": {
    "violence": 7,
    "harassment": 1
  },
  "content": null,
  "annotations": null,
  "enrichment": null,
  "raw": { }
}

| Field | Present when | Description |
| --- | --- | --- |
| result_type | Always | Mirrors the guardrail's declared result_type. |
| severity | score | Normalized 0–10 severity. Provider-native scales (e.g., Azure 0–6, OpenAI 0–1) must be mapped to 0–10 by the guardrail backend before returning, per the design decisions above. |
| triggered | score | true if severity meets or exceeds the agent's configured severity_threshold. |
| category_scores | score (optional) | Per-category severity map (0–10 each). Useful for audit and routing decisions. |
| content | transform | The modified content that replaces the original field value in the pipeline. |
| annotations | annotate | Key-value metadata tags added to the payload. Downstream agents or routing logic may read these. |
| enrichment | enrich | Content to be appended alongside the existing payload before the next pipeline stage. |
| raw | Always | Full provider-native response, preserved for audit trail and debugging. |
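When implementing a custom backend, a small self-check against this shape can catch malformed responses before deployment. A sketch of such a validator (not the platform's actual validation logic):

```python
# Which payload field each result_type is required to populate.
REQUIRED_BY_TYPE = {
    "score": "severity",
    "transform": "content",
    "annotate": "annotations",
    "enrich": "enrichment",
}

def validate_response(resp: dict) -> list[str]:
    """Return a list of problems with a guardrail response; empty means valid."""
    errors = []
    rt = resp.get("result_type")
    if rt not in REQUIRED_BY_TYPE:
        errors.append(f"unknown result_type: {rt!r}")
        return errors
    key = REQUIRED_BY_TYPE[rt]
    if resp.get(key) is None:
        errors.append(f"{rt} response must set {key!r}")
    if rt == "score":
        sev = resp.get("severity")
        if not (isinstance(sev, int) and 0 <= sev <= 10):
            errors.append("severity must be an integer in 0-10")
    if "raw" not in resp:
        errors.append("raw is required for the audit trail")
    return errors

validate_response({"result_type": "score", "severity": 7, "raw": {}})  # []
```

Running this in the backend's own test suite keeps contract drift from surfacing as runtime pipeline failures.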

Agent call-site reference

The guardrails section in an agent definition attaches guardrails to pipeline positions. The same per-result_type parameters apply at all four positions (input, tool_input, tool_output, output). This section documents the expected call-site configuration for each type.

Note

The agent definition specification (01-agent-definition.md) is the authoritative reference for the guardrails section structure. This section documents the parameters that are specific to each guardrail result_type.

score guardrails

guardrails:
  input:
    - ref: "content-moderation"
      severity_threshold: 3        # Trigger on_fail if the returned severity >= 3
      on_fail: "block"             # block | warn | log | escalate
  tool_output:
    - ref: "indirect-injection-scan"
      severity_threshold: 6
      on_fail: "block"
  output:
    - ref: "content-moderation"
      severity_threshold: 6        # More permissive on output
      on_fail: "warn"

| Parameter | Required | Description |
| --- | --- | --- |
| severity_threshold | yes | Integer 0–10. If the guardrail's returned severity is greater than or equal to this value, on_fail is triggered. |
| on_fail | yes | block — halt the pipeline and return an error to the caller. warn — log a warning event and continue. log — silently record the event and continue. escalate — route the run to the matching rule in orchestration.escalation. |
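The resulting call-site behavior for score guardrails amounts to a small dispatch on the triggered result; a sketch, with illustrative action labels:

```python
def apply_score_result(triggered: bool, on_fail: str) -> str:
    """Resolve the pipeline action for a score guardrail at its call site."""
    if not triggered:
        return "continue"
    return {
        "block": "halt-with-error",
        "warn": "log-warning-and-continue",
        "log": "log-silently-and-continue",
        "escalate": "route-to-escalation-rule",
    }[on_fail]

apply_score_result(triggered=True, on_fail="block")    # "halt-with-error"
apply_score_result(triggered=False, on_fail="block")   # "continue"
```

The same guardrail can therefore block on input but merely warn on output, as in the example above, purely through call-site configuration.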

transform guardrails

guardrails:
  input:
    - ref: "pii-redact"
      on_fail: "apply"             # apply | reject
  tool_output:
    - ref: "pii-redact"            # strip PII from tool results before model sees them
      on_fail: "apply"

| Parameter | Required | Description |
| --- | --- | --- |
| on_fail | yes | apply — replace the original content with the guardrail's transformed output and continue. reject — halt the pipeline if the guardrail returns any transformation, or if the transformation fails or cannot be applied. |

annotate and enrich guardrails

guardrails:
  output:
    - ref: "language-detect"
      on_fail: "skip"              # skip | fail_closed

| Parameter | Required | Description |
| --- | --- | --- |
| on_fail | yes | skip — if the guardrail fails, continue the pipeline without the annotation or enrichment. fail_closed — if the guardrail fails, halt the pipeline. |

Validation rules

Hard validation failures

  • Missing any required field (spec_version, guardrail_id, version, status, behaviour, meta.name).
  • behaviour.result_type is not one of score | transform | annotate | enrich.
  • behaviour.content_types contains a value other than text | image | video | document.
  • transport.type is present but not one of rest-api | lambda.
  • transport.credentials missing when transport is present.
  • invocation missing when transport is present.
  • on_timeout.severity or on_provider_error.severity is outside the 0–10 range.
  • fallback.fallback_guardrail_id does not resolve to a registered guardrail.
  • Agent attaches a guardrail to a position where no parameters at that position match its behaviour.content_types.
  • Agent call-site severity_threshold absent for a score guardrail.

Lint warnings

  • on_timeout.severity or on_provider_error.severity set to 0 for a score guardrail (effectively fail_open for all agents).
  • fallback.enabled: false for guardrails with a transport block.
  • status: "deprecated" without a meta.last_updated date.

Full example

---
spec_version: "1.2"
guardrail_id: "pii-scan"
version: "1.0.0"
status: "active"

behaviour:
  result_type: "transform"
  content_types: ["text"]

meta:
  name: "PII Scan (Bedrock)"
  owner: "ai-safety-team"
  last_updated: "2026-04-12"
  description: >
    Detects and redacts personally identifiable information — names, email addresses,
    phone numbers, national identifiers, and similar — using AWS Bedrock Guardrails.
    Suitable for agents operating under GDPR, HIPAA, or any data policy that restricts
    exposure of personal data.
  tags: ["pii", "privacy", "gdpr", "hipaa"]

transport:
  type: "lambda"
  function_arn: "arn:aws:lambda:eu-west-1:123456789:function:pii-scan-v3"
  invocation_type: "RequestResponse"
  payload_format: "json"
  credentials:
    scheme: "iam-role"

invocation:
  timeout_ms: 300
  on_timeout:
    severity: 10
  on_provider_error:
    severity: 10
  retry_policy:
    max_attempts: 2
    backoff_ms: 100

fallback:
  enabled: true
  fallback_guardrail_id: "pii-scan-lite"
  emit_warning: true
---

Uses AWS Bedrock Guardrails to detect and redact personally identifiable information
before the model processes input and before responses are returned to callers.

## Recommended agent configuration

```yaml
guardrails:
  input:
    - ref: "pii-scan"
      on_fail: "apply"          # Apply the redacted version before the model sees the data
  output:
    - ref: "pii-scan"
      on_fail: "reject"         # If output redaction fails, halt rather than leak PII
```

## Relationship to policies.pii_redaction

The policies.pii_redaction: true flag in the agent definition and a ref: "pii-scan" guardrail can coexist and both run. They serve complementary roles:

| | policies.pii_redaction: true | ref: "pii-scan" guardrail |
| --- | --- | --- |
| What runs it | Platform-built redaction pass (always available) | External transport (Bedrock, Azure, etc.) |
| Configuration | None — on or off | Full transport config, version pinning, fallback |
| Precision | Broad coverage, heuristic-based | Provider-specific, often tunable |
| Failure mode | Never errors (platform-built) | Can fail — governed by on_provider_error |

In a defense-in-depth configuration, set both: pii_redaction: true for baseline coverage that never fails, and a transport-backed guardrail for higher-precision detection on sensitive workflows.