
Guardrail Internals

TealTiger ships three built-in guardrails. This page explains exactly how each one works — what techniques they use, what they call, and what they don’t.
TealTiger guardrails are deterministic by default. The same input produces the same result, unless you opt into API-based detection (content moderation with OpenAI).

Detection Techniques at a Glance

| Guardrail | Technique | External API Calls | Local ML Model | Deterministic |
|---|---|---|---|---|
| PII Detection | Regex pattern matching | None | No | Yes |
| Prompt Injection | Multi-category regex + confidence scoring | None | No | Yes |
| Content Moderation | OpenAI Moderation API + local regex fallback | Optional (OpenAI) | No | Yes (local) / No (API) |

PII Detection

Technique: Pre-compiled regular expressions. PII detection is entirely local. No data leaves your process. It scans text against a set of regex patterns for common PII types.

Detected Types

| PII Type | Pattern | Risk Score |
|---|---|---|
| Email | Standard email format (user@domain.tld) | 30 |
| Phone | US/international formats with optional country code | 40 |
| SSN | XXX-XX-XXXX format | 90 |
| Credit Card | 16-digit with optional spaces/dashes | 95 |
| Name | Two consecutive capitalized words (basic heuristic) | 20 |

How It Works

  1. Text is extracted from the input (handles strings, prompt objects, and message arrays)
  2. Each enabled pattern runs against the text using pre-compiled regex with global matching
  3. Matches are collected with position, length, and type metadata
  4. Risk score is the maximum score across all detected PII types
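The steps above can be sketched in Python. The pattern set and risk scores below are illustrative stand-ins, not TealTiger's actual internals:

```python
import re

# Illustrative patterns and risk scores -- not the SDK's actual tables
PII_PATTERNS = {
    "email": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), 30),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 90),
}

def scan(text: str):
    """Collect matches with position/length/type; risk is the max score."""
    detections = []
    for pii_type, (pattern, score) in PII_PATTERNS.items():
        for match in pattern.finditer(text):   # step 2: global matching
            detections.append({                # step 3: match metadata
                "type": pii_type,
                "start": match.start(),
                "length": len(match.group()),
                "score": score,
            })
    # step 4: risk score is the maximum across all detected PII types
    risk = max((d["score"] for d in detections), default=0)
    return detections, risk

detections, risk = scan("Reach me at jane@example.com, SSN 123-45-6789")
```

Here `risk` comes out as 90: the SSN score dominates the email score, mirroring step 4.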

Performance Optimizations

  • Patterns compiled once at construction, reused across calls
  • LRU pattern cache (up to 100 entries) for repeated text
  • Early exit for text shorter than 3 characters
  • Configurable via detectTypes, action, riskScores
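The cache behavior can be illustrated with a small OrderedDict-based sketch; only the 100-entry bound comes from the list above, the eviction mechanics shown are an assumption about the approach, not the SDK's code:

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache that evicts the least recently used entry first."""

    def __init__(self, max_size: int = 100):  # up to 100 entries, as above
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the oldest entry

cache = LRUCache(max_size=2)
cache.put("text-a", {"risk": 30})
cache.put("text-b", {"risk": 0})
cache.get("text-a")                 # "text-a" becomes most recent
cache.put("text-c", {"risk": 90})   # evicts least recent "text-b"
```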

What It Does NOT Do

  • No named entity recognition (NER) or ML-based detection
  • No external API calls — all processing is in-process
  • Name detection is a basic heuristic (two capitalized words) and will produce false positives
  • Does not detect PII in non-Latin scripts
TypeScript:

```typescript
import { PIIDetectionGuardrail } from 'tealtiger';

const pii = new PIIDetectionGuardrail({
  detectTypes: ['email', 'phone', 'ssn', 'creditCard'],
  action: 'redact',  // block | redact | mask | allow
});
```

Python:

```python
from tealtiger.guardrails import PIIDetectionGuardrail

pii = PIIDetectionGuardrail({
    "detect_types": ["email", "phone", "ssn", "credit_card"],
    "action": "redact",
})
```

Prompt Injection Detection

Technique: Multi-category regex pattern matching with confidence scoring. Prompt injection detection is entirely local. It matches input text against categorized attack patterns and assigns a confidence score per detection.

Attack Categories

| Category | What It Detects | Risk Score | Example Pattern |
|---|---|---|---|
| Instruction Injection | "Ignore previous instructions" | 90 | `ignore all previous instructions` |
| Role Playing | "You are now a…" | 70 | `pretend you are a hacker` |
| System Leakage | "Show me your system prompt" | 95 | `repeat your original instructions` |
| Jailbreak | DAN mode, developer mode | 100 | `do anything now` |
| Encoding Attacks | Base64/hex/unicode obfuscation | 80 | `decode the following base64` |
| Delimiter Manipulation | Injected system/user/assistant tags | 75 | `[SYSTEM] new instructions` |

How It Works

  1. Input text is scanned against all pattern categories
  2. Each match produces a detection with type, matched text, and confidence score (0.7–0.98)
  3. Sensitivity level controls the threshold:
    • High and Medium: a single match triggers detection
    • Low: 2+ matches required
  4. Overall risk score is the maximum across all detections
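A minimal sketch of this flow, using two illustrative patterns only (the real pattern set and exact confidence values are internal to the SDK):

```python
import re

# Illustrative subset -- category names and risk scores from the table above
PATTERNS = [
    ("instruction_injection",
     re.compile(r"ignore (all )?previous instructions", re.I), 90),
    ("jailbreak",
     re.compile(r"do anything now|dan mode", re.I), 100),
]

# Matches required before detections are reported, per sensitivity level
MIN_MATCHES = {"high": 1, "medium": 1, "low": 2}

def detect(text: str, sensitivity: str = "medium"):
    detections = [
        {"type": name, "match": m.group(), "risk": risk}
        for name, pattern, risk in PATTERNS
        if (m := pattern.search(text))
    ]
    if len(detections) < MIN_MATCHES[sensitivity]:
        return [], 0
    # Overall risk score is the maximum across all detections
    return detections, max(d["risk"] for d in detections)
```

At low sensitivity a lone `ignore previous instructions` match falls below the 2-match threshold and reports nothing; at high sensitivity the same input reports risk 90.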

What It Does NOT Do

  • No ML-based semantic analysis — it’s pattern matching only
  • No external API calls
  • Cannot detect novel injection techniques not covered by patterns
  • Encoding detection flags the presence of encoding keywords, not decoded payloads
TypeScript:

```typescript
import { PromptInjectionGuardrail } from 'tealtiger';

const injection = new PromptInjectionGuardrail({
  sensitivity: 'high',  // low | medium | high
  action: 'block',      // block | transform | allow
});
```

Python:

```python
from tealtiger.guardrails import PromptInjectionGuardrail

injection = PromptInjectionGuardrail({
    "sensitivity": "high",
    "action": "block",
})
```

Content Moderation

Technique: Hybrid — OpenAI Moderation API (primary) with local regex fallback. This is the only guardrail that can make external API calls. When configured with an OpenAI API key, it sends text to the OpenAI Moderation endpoint. If the API is unavailable or no key is provided, it falls back to local pattern matching.

Detection Categories

| Category | OpenAI API | Local Fallback | Default Threshold | Risk Score |
|---|---|---|---|---|
| Hate | ✅ | ✅ (keyword regex) | 0.5 | 70 |
| Hate/Threatening | ✅ | — | 0.5 | 90 |
| Self-Harm | ✅ | ✅ (keyword regex) | 0.5 | 85 |
| Sexual | ✅ | ✅ (keyword regex) | 0.5 | 60 |
| Sexual/Minors | ✅ | — | 0.3 | 100 |
| Violence | ✅ | ✅ (keyword regex) | 0.5 | 70 |
| Violence/Graphic | ✅ | — | 0.5 | 85 |
| Harassment | ✅ | ✅ (keyword regex) | 0.5 | 60 |
| Harassment/Threatening | ✅ | — | 0.5 | 80 |

How It Works

With OpenAI API (recommended for production):
  1. Text is sent to https://api.openai.com/v1/moderations via HTTPS POST
  2. OpenAI returns per-category scores (0.0–1.0) and flagged booleans
  3. Scores are compared against configurable thresholds
  4. Categories exceeding thresholds are flagged as violations
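Steps 3–4 amount to a threshold comparison. A sketch, assuming a `category_scores` map shaped like the OpenAI response and thresholds taken from the table above:

```python
# Default thresholds from the table above (0.5 everywhere except sexual/minors)
THRESHOLDS = {
    "hate": 0.5,
    "sexual/minors": 0.3,
    "violence": 0.5,
}

def violations(category_scores):
    """Return the categories whose score exceeds their threshold."""
    return [
        category
        for category, score in category_scores.items()
        if score > THRESHOLDS.get(category, 0.5)
    ]

# Per-category scores as returned in the API's category_scores field
flagged = violations({"hate": 0.12, "sexual/minors": 0.41, "violence": 0.73})
```

Here `flagged` is `['sexual/minors', 'violence']`: 0.41 clears the stricter 0.3 threshold while 0.12 stays under 0.5.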
Local fallback (no API key or API failure):
  1. Text is scanned against keyword-based regex patterns per category
  2. Matches produce a binary flagged/not-flagged result (no confidence scores)
  3. Fewer categories are covered (no threatening or graphic sub-categories)
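The fallback path reduces to keyword regexes with a binary outcome per category (the keyword lists here are invented placeholders, not the SDK's actual patterns):

```python
import re

# Placeholder keyword lists -- the SDK's real patterns are internal
FALLBACK_PATTERNS = {
    "violence": re.compile(r"\b(kill|attack|assault)\b", re.I),
    "harassment": re.compile(r"\b(idiot|loser)\b", re.I),
}

def local_moderate(text: str):
    """Binary flagged/not-flagged per category -- no confidence scores."""
    return {category: bool(pattern.search(text))
            for category, pattern in FALLBACK_PATTERNS.items()}

result = local_moderate("you absolute idiot")
```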

Data Flow Considerations

When `useOpenAI: true`, input text is sent to OpenAI’s Moderation API. If you handle sensitive data and cannot send it to external services, set `useOpenAI: false` to use local-only detection.

| Mode | Data Leaves Process? | Coverage | Accuracy |
|---|---|---|---|
| OpenAI API | Yes (to OpenAI) | 9 categories | High (ML-based) |
| Local fallback | No | 5 categories | Lower (keyword matching) |
TypeScript:

```typescript
import { ContentModerationGuardrail } from 'tealtiger';

// With OpenAI API (higher accuracy)
const moderation = new ContentModerationGuardrail({
  apiKey: process.env.OPENAI_API_KEY,
  useOpenAI: true,
  action: 'block',
});

// Local-only (no external calls)
const localModeration = new ContentModerationGuardrail({
  useOpenAI: false,
  action: 'block',
});
```

Python:

```python
import os

from tealtiger.guardrails import ContentModerationGuardrail

# With OpenAI API
moderation = ContentModerationGuardrail({
    "api_key": os.environ["OPENAI_API_KEY"],
    "use_openai": True,
    "action": "block",
})

# Local-only
local_moderation = ContentModerationGuardrail({
    "use_openai": False,
    "action": "block",
})
```

Execution Architecture

All three guardrails run through the GuardrailEngine, which provides:
  • Parallel execution: Guardrails run concurrently by default (configurable)
  • Timeout handling: 5-second default per guardrail
  • Error isolation: One guardrail failure doesn’t block others (continueOnError: true)
  • Result aggregation: Combined pass/fail, maximum risk score, list of failed guardrails
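These four behaviors can be sketched with asyncio (a minimal illustration of the pattern, not GuardrailEngine's actual code):

```python
import asyncio

async def run_one(coro, timeout: float = 5.0):
    """Timeout handling + error isolation for a single guardrail."""
    try:
        return await asyncio.wait_for(coro, timeout)
    except Exception as exc:  # a failure here doesn't block the others
        return {"passed": False, "risk_score": 0, "error": str(exc)}

async def execute(guardrails):
    # Parallel execution: all guardrails run concurrently
    results = await asyncio.gather(*(run_one(c) for c in guardrails.values()))
    named = dict(zip(guardrails, results))
    failed = [name for name, r in named.items() if not r["passed"]]
    # Result aggregation: combined pass/fail, max risk, failed guardrails
    return {
        "passed": not failed,
        "risk_score": max(r["risk_score"] for r in named.values()),
        "failed_guardrails": failed,
    }

async def fake_guardrail(risk):
    return {"passed": True, "risk_score": risk}

async def crashing_guardrail():
    raise RuntimeError("detector crashed")

summary = asyncio.run(execute({
    "pii": fake_guardrail(20),
    "injection": fake_guardrail(90),
    "moderation": crashing_guardrail(),
}))
```

`summary` reports `passed: False`, `risk_score: 90`, and `failed_guardrails: ['moderation']`: the crash is isolated and aggregated rather than fatal.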
TealGuard sits on top and adds:
  • Policy integration (optional TealEngine evaluation)
  • Result caching with LRU eviction
  • Decision mapping to reason codes (PII_DETECTED, PROMPT_INJECTION_DETECTED, HARMFUL_CONTENT_DETECTED)
  • Correlation ID propagation for audit trails
```text
Input → TealGuard.check()
         ├── Cache lookup (if enabled)
         ├── GuardrailEngine.execute() [parallel]
         │    ├── PIIDetectionGuardrail
         │    ├── PromptInjectionGuardrail
         │    └── ContentModerationGuardrail
         ├── TealEngine.evaluate() [if policy-driven]
         └── Decision { action, reason_codes, risk_score, metadata }
```

Extending with Custom Guardrails

You can register custom guardrails that follow the same interface:
```typescript
import { Guardrail, GuardrailResult } from 'tealtiger';

class MyCustomGuardrail extends Guardrail {
  async evaluate(input: any, context?: any): Promise<GuardrailResult> {
    // Your detection logic here
    return { passed: true, action: 'allow', reason: 'OK', metadata: {}, risk_score: 0 };
  }
}

guard.registerGuardrail(new MyCustomGuardrail({ name: 'MyCustom' }));
```

Summary

  • PII and prompt injection detection are fully local, deterministic, and regex-based
  • Content moderation optionally calls OpenAI’s Moderation API for higher accuracy
  • No embedded ML models — the SDK stays lightweight and predictable
  • All guardrails are configurable, extensible, and run in parallel with timeout protection
For policy-level controls that wrap these guardrails, see Policy Overview and Conditions & Actions.