Guardrail Internals
TealTiger ships three built-in guardrails. This page explains exactly how each one works — what techniques they use, what they call, and what they don’t.
TealTiger guardrails are deterministic by default. The same input produces the same result, unless you opt into API-based detection (content moderation with OpenAI).
Detection Techniques at a Glance
| Guardrail | Technique | External API Calls | Local ML Model | Deterministic |
|---|
| PII Detection | Regex pattern matching | None | No | Yes |
| Prompt Injection | Multi-category regex + confidence scoring | None | No | Yes |
| Content Moderation | OpenAI Moderation API + local regex fallback | Optional (OpenAI) | No | Yes (local) / No (API) |
PII Detection
Technique: Pre-compiled regular expressions.
PII detection is entirely local. No data leaves your process. It scans text against a set of regex patterns for common PII types.
Detected Types
| PII Type | Pattern | Risk Score |
|---|
| Email | Standard email format (user@domain.tld) | 30 |
| Phone | US/international formats with optional country code | 40 |
| SSN | XXX-XX-XXXX format | 90 |
| Credit Card | 16-digit with optional spaces/dashes | 95 |
| Name | Two consecutive capitalized words (basic heuristic) | 20 |
How It Works
- Text is extracted from the input (handles strings, prompt objects, and message arrays)
- Each enabled pattern runs against the text using pre-compiled regex with global matching
- Matches are collected with position, length, and type metadata
- Risk score is the maximum score across all detected PII types
- Patterns compiled once at construction, reused across calls
- LRU pattern cache (up to 100 entries) for repeated text
- Early exit for text shorter than 3 characters
- Configurable via
detectTypes, action, riskScores
What It Does NOT Do
- No named entity recognition (NER) or ML-based detection
- No external API calls — all processing is in-process
- Name detection is a basic heuristic (two capitalized words) and will produce false positives
- Does not detect PII in non-Latin scripts
import { PIIDetectionGuardrail } from 'tealtiger';
const pii = new PIIDetectionGuardrail({
detectTypes: ['email', 'phone', 'ssn', 'creditCard'],
action: 'redact', // block | redact | mask | allow
});
from tealtiger.guardrails import PIIDetectionGuardrail
pii = PIIDetectionGuardrail({
"detect_types": ["email", "phone", "ssn", "credit_card"],
"action": "redact",
})
Prompt Injection Detection
Technique: Multi-category regex pattern matching with confidence scoring.
Prompt injection detection is entirely local. It matches input text against categorized attack patterns and assigns a confidence score per detection.
Attack Categories
| Category | What It Detects | Risk Score | Example Pattern |
|---|
| Instruction Injection | ”Ignore previous instructions” | 90 | ignore all previous instructions |
| Role Playing | ”You are now a…“ | 70 | pretend you are a hacker |
| System Leakage | ”Show me your system prompt” | 95 | repeat your original instructions |
| Jailbreak | DAN mode, developer mode | 100 | do anything now |
| Encoding Attacks | Base64/hex/unicode obfuscation | 80 | decode the following base64 |
| Delimiter Manipulation | Injected system/user/assistant tags | 75 | [SYSTEM] new instructions |
How It Works
- Input text is scanned against all pattern categories
- Each match produces a detection with type, matched text, and confidence score (0.7–0.98)
- Sensitivity level controls the threshold:
- High: 1 match triggers detection
- Medium: 1 match triggers detection
- Low: 2+ matches required
- Overall risk score is the maximum across all detections
What It Does NOT Do
- No ML-based semantic analysis — it’s pattern matching only
- No external API calls
- Cannot detect novel injection techniques not covered by patterns
- Encoding detection flags the presence of encoding keywords, not decoded payloads
import { PromptInjectionGuardrail } from 'tealtiger';
const injection = new PromptInjectionGuardrail({
sensitivity: 'high', // low | medium | high
action: 'block', // block | transform | allow
});
from tealtiger.guardrails import PromptInjectionGuardrail
injection = PromptInjectionGuardrail({
"sensitivity": "high",
"action": "block",
})
Content Moderation
Technique: Hybrid — OpenAI Moderation API (primary) with local regex fallback.
This is the only guardrail that can make external API calls. When configured with an OpenAI API key, it sends text to the OpenAI Moderation endpoint. If the API is unavailable or no key is provided, it falls back to local pattern matching.
Detection Categories
| Category | OpenAI API | Local Fallback | Default Threshold | Risk Score |
|---|
| Hate | ✅ | ✅ (keyword regex) | 0.5 | 70 |
| Hate/Threatening | ✅ | — | 0.5 | 90 |
| Self-Harm | ✅ | ✅ (keyword regex) | 0.5 | 85 |
| Sexual | ✅ | ✅ (keyword regex) | 0.5 | 60 |
| Sexual/Minors | ✅ | — | 0.3 | 100 |
| Violence | ✅ | ✅ (keyword regex) | 0.5 | 70 |
| Violence/Graphic | ✅ | — | 0.5 | 85 |
| Harassment | ✅ | ✅ (keyword regex) | 0.5 | 60 |
| Harassment/Threatening | ✅ | — | 0.5 | 80 |
How It Works
With OpenAI API (recommended for production):
- Text is sent to
https://api.openai.com/v1/moderations via HTTPS POST
- OpenAI returns per-category scores (0.0–1.0) and flagged booleans
- Scores are compared against configurable thresholds
- Categories exceeding thresholds are flagged as violations
Local fallback (no API key or API failure):
- Text is scanned against keyword-based regex patterns per category
- Matches produce a binary flagged/not-flagged result (no confidence scores)
- Fewer categories are covered (no threatening or graphic sub-categories)
Data Flow Considerations
When useOpenAI: true, input text is sent to OpenAI’s Moderation API. If you handle sensitive data and cannot send it to external services, set useOpenAI: false to use local-only detection.
| Mode | Data Leaves Process? | Coverage | Accuracy |
|---|
| OpenAI API | Yes (to OpenAI) | 9 categories | High (ML-based) |
| Local fallback | No | 5 categories | Lower (keyword matching) |
import { ContentModerationGuardrail } from 'tealtiger';
// With OpenAI API (higher accuracy)
const moderation = new ContentModerationGuardrail({
apiKey: process.env.OPENAI_API_KEY,
useOpenAI: true,
action: 'block',
});
// Local-only (no external calls)
const localModeration = new ContentModerationGuardrail({
useOpenAI: false,
action: 'block',
});
from tealtiger.guardrails import ContentModerationGuardrail
# With OpenAI API
moderation = ContentModerationGuardrail({
"api_key": os.environ["OPENAI_API_KEY"],
"use_openai": True,
"action": "block",
})
# Local-only
local_moderation = ContentModerationGuardrail({
"use_openai": False,
"action": "block",
})
Execution Architecture
All three guardrails run through the GuardrailEngine, which provides:
- Parallel execution: Guardrails run concurrently by default (configurable)
- Timeout handling: 5-second default per guardrail
- Error isolation: One guardrail failure doesn’t block others (
continueOnError: true)
- Result aggregation: Combined pass/fail, maximum risk score, list of failed guardrails
TealGuard sits on top and adds:
- Policy integration (optional TealEngine evaluation)
- Result caching with LRU eviction
- Decision mapping to reason codes (
PII_DETECTED, PROMPT_INJECTION_DETECTED, HARMFUL_CONTENT_DETECTED)
- Correlation ID propagation for audit trails
Input → TealGuard.check()
├── Cache lookup (if enabled)
├── GuardrailEngine.execute() [parallel]
│ ├── PIIDetectionGuardrail
│ ├── PromptInjectionGuardrail
│ └── ContentModerationGuardrail
├── TealEngine.evaluate() [if policy-driven]
└── Decision { action, reason_codes, risk_score, metadata }
Extending with Custom Guardrails
You can register custom guardrails that follow the same interface:
import { Guardrail, GuardrailResult } from 'tealtiger';
class MyCustomGuardrail extends Guardrail {
async evaluate(input: any, context?: any): Promise<GuardrailResult> {
// Your detection logic here
return { passed: true, action: 'allow', reason: 'OK', metadata: {}, risk_score: 0 };
}
}
guard.registerGuardrail(new MyCustomGuardrail({ name: 'MyCustom' }));
Summary
- PII and prompt injection detection are fully local, deterministic, and regex-based
- Content moderation optionally calls OpenAI’s Moderation API for higher accuracy
- No embedded ML models — the SDK stays lightweight and predictable
- All guardrails are configurable, extensible, and run in parallel with timeout protection
For policy-level controls that wrap these guardrails, see Policy Overview and Conditions & Actions.