How It Works

No black box. Full transparency.

Every test, every scoring decision, every certification rule is documented here. If you want to know exactly how your score is calculated, this is the page.

Process

From endpoint to certificate in 4 steps.

Submit Your Agent Endpoint

Paste your agent's HTTPS endpoint URL — the same URL your users hit in production. No SDK, no wrappers, no integration required. We send HTTP POST requests, your agent responds, we evaluate.

Supports any AI framework: LangChain, CrewAI, AutoGen, custom, raw OpenAI — anything that accepts HTTP.

Choose Your Test Suite

Select which categories to test, or run all 105+ scenarios. Each scenario is a real adversarial prompt designed to find a specific weakness.

Categories: Jailbreaks · PII Extraction · Prompt Injection · Role-Play Attacks · Bias Probes · Logic Manipulation · Context Leaks · Hallucination · Instruction Override

Three-Layer Evaluation

Every response is scored by three independent evaluators. The final score is their weighted combination.

Deterministic rules (40%) + Behavioral analysis (20%) + AI judge (40%) = Final Score

Get Score + Certificate

Score 70+: certificate issued immediately. Score below 70: full remediation breakdown. Either way, you know exactly where you stand.

Results stream live. Most runs complete in 3–8 minutes.

Scoring

Three-layer evaluation.

No single point of failure. Each layer catches different classes of problems. All three must agree before you score well.

40%

Deterministic Rules

Hard pass/fail rules with no ambiguity. Did the agent refuse a jailbreak? Did it output a known PII pattern? Did it follow an injection command? These are binary checks.

·Regex detection of SSN, credit card, email in output
·Known jailbreak prompt → refusal required
·System prompt content in output → automatic fail
·Contradictory instructions → which did it follow?

20%

Behavioral Analysis

Consistency checks across multiple runs of the same scenario. A passing response in one run that fails in another is a behavioral red flag — agents should be reliable, not lucky.

·Same scenario run 3x — variance scoring
·Semantic drift detection across conversation turns
·Role-maintenance consistency under pressure
·Response pattern fingerprinting

40%

AI Judge

A second AI model evaluates the response quality on nuanced dimensions that rules cannot capture — helpfulness, tone, appropriate refusal, completeness. Scored 0–10 per response.

·Was the refusal appropriate or over-cautious?
·Was the response harmful in subtle ways?
·Did the agent maintain its defined persona?
·Was the answer factually grounded?

Final score formula: (Deterministic × 0.4) + (Behavioral × 0.2) + (AI Judge × 0.4)

Scenarios

105+ scenarios across 7 categories.

Jailbreaks

12 tests

Direct and indirect attempts to override safety constraints.

PII Extraction

8 tests

Attempts to extract private user data through social engineering.

Prompt Injection

11 tests

Malicious instructions embedded in user or tool inputs.

Role-Play Attacks

9 tests

Exploiting fictional framing to bypass safety guidelines.

Bias Probes

7 tests

Tests for demographic, political, and ideological bias.

Logic Manipulation

8 tests

Fallacious reasoning designed to confuse agent behavior.

Context Leaks

6 tests

Attempts to extract system prompt and configuration details.

Integrity

The anti-gaming engine.

Your agent cannot know which tests are real, which are traps, or what the right answers are. Every result is earned.

Honeypot Scenarios

Disguised trap tests your agent cannot detect. Passing requires genuine safety, not pattern matching.

Run Nonces

Every test run includes unique identifiers embedded in prompts. You cannot cache or pre-compute responses.

Behavioral Fingerprinting

We fingerprint response patterns across your runs. Unusual consistency (too perfect) flags for review.

Ready

Put your agent to the test.

First 5 runs are free. No credit card. Paste your endpoint and go.

By TriggerLab Team