Try Grounded AI
User Guide
Everything you need to test AI responses for hallucinations, interpret GR ratings, and integrate Grounded into your QA workflow.
What is Try Grounded AI?
Try Grounded AI is an AI hallucination testing platform built by KiwiQA. It lets you paste any AI-generated response and run up to nine independent validation checks — returning a structured verdict, a GR reliability score, and specific recommended actions in under 60 seconds.
Unlike developer-focused tools that require SDK integration and engineering setup, Try Grounded AI works for QA testers, product managers, consultants, and compliance teams with no code required.
Who is it for?
| ROLE | USE CASE |
|---|---|
| QA Engineers | Validate AI responses before each release. Catch hallucinations in automated test suites. |
| Product Managers | Spot-check AI features before shipping. Generate evidence for stakeholder sign-off. |
| Security Engineers | Test AI outputs for compliance with security policies and correct regulatory references. |
| Consultants | Audit client AI systems. Produce branded PDF reports as deliverables. |
| Compliance / Legal | Verify AI responses meet industry-specific accuracy standards before deployment. |
| Developers | Integrate via API into CI/CD pipelines. Run regression tests on model updates. |
What problems does it solve?
- AI models hallucinate facts, citations, dates, and figures — even in production.
- Traditional test suites check for software bugs, not factual accuracy.
- Manual review of AI responses is inconsistent and doesn't scale.
- Regulated industries (healthcare, legal, finance) face real liability from hallucinated AI outputs.
Quick Start
Your first hallucination test takes under 60 seconds. No setup. No SDK. No integrations required.
GR Rating System
Every test returns a GR (Grounded Reliability) rating — a score from 0 to 100 mapped to five bands. GR is the primary output of Try Grounded AI and the metric your team should use to gate AI deployments.
| RATING | SCORE | LABEL | MEANING |
|---|---|---|---|
| GR-5 | 88–100 | Verified | All checks passed. Safe to deploy. Use with confidence. |
| GR-4 | 76–87 | Reliable | Minor issues only. Deploy with standard monitoring. |
| GR-3 | 60–75 | Conditional | Review flagged findings before shipping. Consider rewriting the prompt. |
| GR-2 | 42–59 | High Risk | Significant hallucination signals. Do not deploy without remediation. |
| GR-1 | 0–41 | Critical | Severe hallucination detected. Block release immediately. |
How the GR score is calculated
The GR score is a weighted composite of up to nine validation layers. Each of the seven core layers contributes a percentage of the total score:
| LAYER | WEIGHT | TYPE |
|---|---|---|
| Multi-Model Consensus | 25% | LLM |
| Consistency | 22% | LLM |
| Reference Grounding | 20% | LLM |
| Confidence Audit | 16% | LLM |
| Domain Rules | 9% | NON-LLM |
| Custom Rules | 9% | NON-LLM |
| Semantic Drift | 7% | LLM |
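Read as a weighted average, the table above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the production scoring code; in particular, normalising by the total weight of the layers present is an assumption:

```python
# Layer weights from the table above.
WEIGHTS = {
    "consensus": 0.25,      # Multi-Model Consensus
    "consistency": 0.22,
    "grounding": 0.20,      # Reference Grounding
    "confidence": 0.16,     # Confidence Audit
    "domain_rules": 0.09,
    "custom_rules": 0.09,
    "drift": 0.07,          # Semantic Drift
}

def composite_score(layer_scores: dict) -> float:
    """Weighted average of per-layer scores (each 0-100). Layers
    missing from layer_scores are skipped and the remaining weights
    renormalised -- an assumption for illustration."""
    total_weight = sum(WEIGHTS[name] for name in layer_scores)
    weighted = sum(WEIGHTS[name] * score for name, score in layer_scores.items())
    return round(weighted / total_weight, 1)
```

A response scoring 100 on every layer yields a composite of 100; weak scores on the heavily weighted LLM layers pull the composite down fastest.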
High-risk domain multiplier
For responses in healthcare, legal, finance, or government contexts, a 0.92× multiplier is applied to the final score. This raises the bar for these industries where hallucinations carry real-world liability.
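Putting the bands and the multiplier together, the mapping can be sketched as follows (band boundaries come from the GR rating table; the function itself is illustrative):

```python
HIGH_RISK_MULTIPLIER = 0.92   # healthcare, legal, finance, government

# Band floors from the GR rating table.
BANDS = [(88, "GR-5"), (76, "GR-4"), (60, "GR-3"), (42, "GR-2"), (0, "GR-1")]

def gr_rating(score: float, high_risk: bool = False) -> str:
    """Map a 0-100 score to its GR band, applying the high-risk
    multiplier first when the response is in a regulated domain."""
    if high_risk:
        score *= HIGH_RISK_MULTIPLIER
    for floor, band in BANDS:
        if score >= floor:
            return band
    return "GR-1"
```

For example, a raw score of 92 in a healthcare context becomes 92 × 0.92 = 84.64 and lands in GR-4 rather than GR-5.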
The 9 Validation Layers
Every test runs up to 9 independent checks. Each layer is designed to catch a different type of hallucination or reliability failure.
Structured Data Fidelity (9th Layer)
The 9th layer lets you attach any CRM export, CSV, or JSON dataset to your Response Audit. Grounded automatically parses every field, detects data types, computes aggregates, and verifies every AI claim against your actual data — no manual field mapping required.
HOW TO USE IT
- Open Response Audit and fill in Steps 01 and 02.
- Scroll to Step 04 — Attach CRM or structured data and click Enable.
- Upload a CSV or JSON file, or paste the data directly. Max 500KB.
- Run the audit. The Data Attribution tab will show a field-by-field verification with MATCHED / MISMATCH / INFERRED status for every claim.
SUPPORTED FORMATS
| Format | Example | Multi-record? |
|---|---|---|
| HubSpot contacts CSV | First Name, Email, Lifecycle Stage, ... | Yes — all rows |
| Salesforce export | Id, Name, Status, Industry, ... | Yes — all rows |
| Custom CSV | account_name, health_score, sla_breach, ... | Yes — all rows |
| JSON object | {"account": "Acme", "score": 85} | Single record |
| JSON array | [{"account":"Acme"}, {"account":"TechCo"}] | Yes — all rows |
| TSV / pipe-separated | field1 \| field2 \| field3 | Yes — all rows |
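As a rough illustration of how format auto-detection can work (Grounded's actual parser is internal; this sketch only mirrors the behaviour described above):

```python
import csv
import json

def parse_structured(text: str) -> list:
    """Best-effort parse of a pasted dataset: JSON first, then
    delimited text with comma / tab / pipe auto-detection."""
    text = text.strip()
    if text.startswith(("[", "{")):            # JSON object or array
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    header = text.splitlines()[0]
    # Pick whichever delimiter splits the header into the most fields.
    delimiter = max(",\t|", key=header.count)
    return list(csv.DictReader(text.splitlines(), delimiter=delimiter))
```

A single JSON object is treated as one record; CSV, TSV, pipe-separated text, and JSON arrays all come back as multi-record lists.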
HOW GROUNDED VERIFIES DATA
Grounded uses a three-tier deterministic engine — no LLM arithmetic:
- Type detection — every field is auto-classified as boolean, numeric, date, enum, or ID based on its values.
- Aggregate computation — sums, averages, counts, and group-bys are computed in code before any LLM call. Results are exact.
- Claim verification — Claude receives the computed ground truth and judges whether the AI's claims match it.
Example: if you ask "What is the average days open for SLA breached tickets?" and your CSV has 20 ticket records, Grounded computes the exact average (e.g. 14.5) and verifies the AI's answer against that — not against Claude's own recollection.
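The worked example above can be reproduced in a few lines. This is a sketch of the tier-2 idea (aggregates computed in code, never by an LLM), with illustrative field names (`sla_breach`, `days_open`), not Grounded's engine:

```python
import csv
import io

def average_days_open(csv_text: str) -> float:
    """Compute an exact aggregate in code: the ground truth that
    claim verification is later checked against. Field names
    (sla_breach, days_open) are illustrative."""
    rows = csv.DictReader(io.StringIO(csv_text))
    days = [float(r["days_open"]) for r in rows if r["sla_breach"] == "true"]
    return sum(days) / len(days)

tickets = (
    "ticket_id,sla_breach,days_open\n"
    "T1,true,12\n"
    "T2,false,3\n"
    "T3,true,17\n"
)
average_days_open(tickets)   # exact: (12 + 17) / 2 = 14.5
```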
Running a Response Audit
Use Single Test to validate one AI response at a time. This is the most common workflow for spot-checks, pre-release validation, and testing during development.
Test Suite mode
Test Suite lets you build a named set of test cases and run them as a group. Use this for sprint regression testing or pre-release checklists.
- Switch to Test Suite tab within New Test.
- Name your suite (e.g. "Sprint 14 — AI Chatbot").
- Add individual test cases with question, response, and optional reference.
- Click Run Suite. Each case runs through the full 9-layer pipeline.
- View a suite summary report with aggregate GR score and per-case breakdown.
Batch Testing
Batch Testing lets you upload a file of AI responses and test all of them in one operation. Use this for regression testing after model updates, auditing large content sets, or testing a full FAQ or knowledge base.
Supported file formats
| FORMAT | REQUIRED COLUMNS | OPTIONAL |
|---|---|---|
| .csv | question, aiResponse | refDoc |
| .json | question, aiResponse | refDoc |
CSV example
question,aiResponse,refDoc
"What is the SG rate for 2024-25?","The rate is 11%",""
"What does TGA stand for?","Therapeutic Goods Administration",""Running a batch
Custom Rule Sets
Custom Rule Sets are Try Grounded AI's most powerful differentiator. They let you upload your own verified facts — things only your organisation knows — and have them checked automatically on every test.
What to put in Custom Rules
- Founding year, headquarters, company name variations
- Product pricing, plan limits, feature availability
- Regulatory references specific to your jurisdiction
- Internal policy figures (thresholds, limits, rates)
- Correct names for people, products, and services
File format
| COLUMN | DESCRIPTION | EXAMPLE |
|---|---|---|
| claim | The incorrect statement to watch for | "KiwiQA was founded in 2010" |
| correct_answer | The verified correct fact | "KiwiQA was founded in 2020" |
| source | Where this fact can be verified | "kiwiqa.ai/about" |
CSV example
claim,correct_answer,source
"KiwiQA founded 2010","Founded in 2020","kiwiqa.ai/about"
"NZ-based","Headquartered in Sydney","kiwiqa.ai/about"
"manual testing only","Manual + automated QA","kiwiqa.ai/services"Adding rules manually
You can also add individual rules without a file by using the manual entry form in the Custom Rules tab. Enter the claim, correct answer, and source URL, then click Add.
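Conceptually, a custom rule check is a deterministic lookup of known-bad claims. The sketch below uses naive case-insensitive substring matching as a stand-in for Grounded's internal rule engine:

```python
import csv
import io

def check_custom_rules(response: str, rules_csv: str) -> list:
    """Flag any known-incorrect claim found in a response. Naive
    substring matching -- illustrative, not the product's engine."""
    hits = []
    for rule in csv.DictReader(io.StringIO(rules_csv)):
        if rule["claim"].lower() in response.lower():
            hits.append({"claim": rule["claim"],
                         "correct_answer": rule["correct_answer"],
                         "source": rule["source"]})
    return hits
```

Each hit carries the verified correct answer and its source, so it can be dropped straight into a defect report.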
Test History & Risk Profile
Test History
All tests — single, suite, and batch — are automatically saved to Test History. Use it to track results over time, re-run previous tests, and export reports.
- Search: Filter by question text or batch name.
- Filter: Show only PASS, WARN, FAIL, or batch results.
- Pagination: 20 results per page for easy navigation.
- Export CSV: Download your full history as a spreadsheet.
- Re-run: Click ↺ Re-run on any row to load it back into the test form.
Risk Profile
The Risk Profile dashboard gives you an aggregate view of your AI reliability over time. It's designed for team leads, QA managers, and compliance officers who need a summary view.
- Trend chart: GR score over time with selectable date ranges.
- Layer breakdown: Which validation layers fail most often.
- Regression alerts: Automatic flag when your average GR score drops.
- Period comparison: Compare this week vs last week.
- Export PDF: Download a formatted Risk Profile report.
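The regression-alert idea can be sketched as a period-over-period comparison of average GR scores (the 5-point threshold here is illustrative, not the dashboard's actual trigger):

```python
def regression_alert(prev_scores: list, curr_scores: list,
                     drop_threshold: float = 5.0) -> bool:
    """Flag a regression when the average GR score drops by more
    than drop_threshold points between two periods."""
    prev_avg = sum(prev_scores) / len(prev_scores)
    curr_avg = sum(curr_scores) / len(curr_scores)
    return prev_avg - curr_avg > drop_threshold
```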
Team Workspace
The Team feature lets you invite colleagues to share your workspace. All team members share the same run quota, test history, and custom rules — there's no per-seat pricing.
Inviting a colleague
Managing team members
- Re-send: Generate a fresh invite link for a PENDING member.
- Remove: Revoke a member's access to your workspace.
- Member count: Shown as a badge on the Team nav item.
API Access & CI/CD
The Try Grounded AI API lets you run hallucination checks programmatically — from CI/CD pipelines, automated test suites, or your own applications. Available on Team plan and above.
Authentication
Generate an API key in the API Keys tab of the dashboard. Keys use the grnd_ prefix. Pass your key as a Bearer token in the Authorization header.
Run a test via API
curl -X POST https://grounded-topaz.vercel.app/api/v1/detect \
  -H "Authorization: Bearer grnd_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the SG rate for 2024-25?",
    "aiResponse": "The superannuation guarantee rate is 11%.",
    "refDoc": ""
  }'
Response format
{
  "score": 42,
  "grRating": "GR-2",
  "risk": "HIGH",
  "verdict": "FAIL",
  "finding": "Incorrect SG rate — domain rule violation detected.",
  "stages": {
    "consistency": { "score": 38 },
    "grounding": { "score": 45 },
    "confidence": { "score": 61 },
    "consensus": { "score": 33 },
    "drift": { "score": 80 },
    "deterministic": { "score": 12, "findings": ["SG rate mismatch"] }
  }
}
GitHub Actions example
- name: Run hallucination check
  run: |
    RESULT=$(curl -s -X POST \
      -H "Authorization: Bearer ${{ secrets.GROUNDED_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{"question":"What is the SG rate for 2024-25?","aiResponse":"The superannuation guarantee rate is 11%.","refDoc":""}' \
      https://grounded-topaz.vercel.app/api/v1/detect)
    SCORE=$(echo "$RESULT" | jq '.score')
    if [ "$SCORE" -lt 60 ]; then
      echo "GR score $SCORE is below threshold. Blocking deployment."
      exit 1
    fi
Interpreting Results
The InsightReport is the full output of a single test. Here's how to read each section.
The GR score
The top-level score (0–100) and GR rating (GR-1 to GR-5) are your primary signals. Use them to decide whether the AI response is safe to use:
| GR RATING | RECOMMENDED ACTION |
|---|---|
| GR-5 | Deploy with confidence. All checks passed. |
| GR-4 | Deploy with standard monitoring. Review any flagged layers. |
| GR-3 | Review the Recommended Action. Rewrite the prompt or add constraints. |
| GR-2 | Do not deploy. Address the flagged issues and re-test. |
| GR-1 | Block release. Escalate to team lead. Investigate source of hallucination. |
Per-layer scores
Each validation layer returns an individual score. A score below 60 on any layer is a warning signal even if the overall GR rating is acceptable.
The Recommended Action
Every test includes a plain-English recommended action explaining what was found and what to do about it. This is designed to be readable by non-engineers and can be included directly in defect reports.
Exporting reports
- PDF export: Timestamped, formatted report suitable for client delivery or compliance documentation.
- Re-run: Load a previous test back into the form to test a revised AI response.
- CSV export: From Test History, export all results as a spreadsheet for analysis.
Frequently Asked Questions
Does Try Grounded AI connect to my AI model?
No. You paste the AI response manually. Try Grounded AI never connects to your AI model, API keys, system prompt, or production environment. Your data stays private.
How accurate is the GR scoring?
The scoring model was calibrated against a 100-case golden dataset with known correct answers. Post-calibration accuracy is approximately 75–80%. Domain Rules and Custom Rules layers are deterministic — they are 100% accurate for the facts you provide.
What's the difference between Domain Rules and Custom Rules?
Domain Rules are pre-built by KiwiQA — 21+ verified facts across 12 industries. Custom Rules are facts you upload yourself. Both run with zero LLM involvement and are deterministic.
How many runs do I get?
Free plan: 50 evaluation runs. Starter: 500 runs/month at $29/month. Team: 5,000 runs/month at $149/month. Plus 10 bonus runs for each successful referral.
Can I use Try Grounded AI for regulated industries?
Yes. Healthcare, legal, and finance responses are automatically detected and a 0.92× risk multiplier is applied. This raises the bar for GR-5 qualification in high-stakes industries.
Does it work with any LLM?
Yes. You paste any text — the source model doesn't matter. It works with GPT-4o, Claude, Gemini, Llama, Mistral, or any model you're using.
Is my data stored?
Test results (question, score, GR rating, findings) are stored in your account for Test History. The AI response content is not stored beyond the analysis session for individual tests.
Can I integrate it into my CI/CD pipeline?
Yes — API access is available on Team plan. See the API Access section above or visit /api-docs for the full reference.
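Beyond the GitHub Actions example in the API section, the same gate can be driven from a Python test suite. A sketch, assuming the `/api/v1/detect` endpoint shown above and a key in the `GROUNDED_API_KEY` environment variable:

```python
import json
import os
import urllib.request

API_URL = "https://grounded-topaz.vercel.app/api/v1/detect"

def gate(result: dict, threshold: int = 60) -> bool:
    """Pass/fail decision on a detect response, mirroring the GR-3
    floor used in the GitHub Actions example."""
    return result["score"] >= threshold

def run_check(question: str, ai_response: str, ref_doc: str = "") -> dict:
    """Call the detect endpoint with a Bearer API key."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps({"question": question, "aiResponse": ai_response,
                         "refDoc": ref_doc}).encode(),
        headers={"Authorization": f"Bearer {os.environ['GROUNDED_API_KEY']}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# With the sample response from the API section, a GR-2 result
# fails the gate:
sample = {"score": 42, "grRating": "GR-2", "verdict": "FAIL"}
gate(sample)   # False -- blocks the pipeline
```

In a test suite you would call `assert gate(run_check(question, answer))` per case, so any response below GR-3 fails the build.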
What is the Team workspace?
Team workspace lets multiple colleagues share one account — pooled run quota, shared test history, and shared custom rules. No per-seat pricing. Available on all plans.
How do I report a bug or get support?
Click Get Help in the dashboard sidebar to open the support chatbot. For urgent issues email hello@kiwiqa.ai.