DOCUMENTATION

Try Grounded AI
User Guide

Everything you need to test AI responses for hallucinations, interpret GR ratings, and integrate Grounded into your QA workflow.

Version 1.0 · March 2026 · By KiwiQA
OVERVIEW

What is Try Grounded AI?

Try Grounded AI is an AI hallucination testing platform built by KiwiQA. It lets you paste any AI-generated response and run up to 9 independent validation checks — returning a structured verdict, a GR reliability score, and specific recommended actions in under 60 seconds.

Unlike developer-focused tools that require SDK integration and engineering setup, Try Grounded AI works for QA testers, product managers, consultants, and compliance teams with no code required.

Who is it for?

ROLE | USE CASE
QA Engineers | Validate AI responses before each release. Catch hallucinations in automated test suites.
Product Managers | Spot-check AI features before shipping. Generate evidence for stakeholder sign-off.
Security Engineers | Test AI outputs for compliance with security policies and correct regulatory references.
Consultants | Audit client AI systems. Produce branded PDF reports as deliverables.
Compliance / Legal | Verify AI responses meet industry-specific accuracy standards before deployment.
Developers | Integrate via API into CI/CD pipelines. Run regression tests on model updates.

What problems does it solve?

  • AI models hallucinate facts, citations, dates, and figures — even in production.
  • Traditional test suites check for software bugs, not factual accuracy.
  • Manual review of AI responses is inconsistent and doesn't scale.
  • Regulated industries (healthcare, legal, finance) face real liability from hallucinated AI outputs.
In 100% of the test cases used to calibrate the scoring model, Try Grounded AI caught hallucinations in AI responses that had passed conventional QA checks.
GETTING STARTED

Quick Start

Your first hallucination test takes under 60 seconds. No setup. No SDK. No integrations required.

1
Create a free account
Go to grounded-topaz.vercel.app/sign-up. Enter your email and create a password. You receive 50 free evaluation runs — no credit card required.
2
Open the dashboard
After signing in you land on the Home screen. Click New Test in the left sidebar.
3
Paste the question
Enter the exact prompt that was sent to your AI. The more precise, the more accurate the verdict.
4
Paste the AI response
Copy the AI-generated response and paste it into the second field. Works with GPT, Claude, Gemini, Llama, or any LLM.
5
Click Run test
The 9-layer validation runs automatically. Results appear in under 60 seconds.
6
Read your GR verdict
You receive a GR-1 to GR-5 rating, a score out of 100, per-layer findings, and a recommended action.
Click Load example → in the Run Test form to pre-fill a real hallucination scenario and see how the results look before testing your own content.
CORE CONCEPTS

GR Rating System

Every test returns a GR (Grounded Reliability) rating — a score from 0 to 100 mapped to five bands. GR is the primary output of Try Grounded AI and the metric your team should use to gate AI deployments.

RATING | SCORE | LABEL | MEANING
GR-5 | 88–100 | Verified | All checks passed. Safe to deploy. Use with confidence.
GR-4 | 76–87 | Reliable | Minor issues only. Deploy with standard monitoring.
GR-3 | 60–75 | Conditional | Review flagged findings before shipping. Consider rewriting the prompt.
GR-2 | 42–59 | High Risk | Significant hallucination signals. Do not deploy without remediation.
GR-1 | 0–41 | Critical | Severe hallucination detected. Block release immediately.

How the GR score is calculated

The GR score is a weighted composite of the validation layers that ran. Each layer contributes a percentage of the total score:

LAYER | WEIGHT | TYPE
Multi-Model Consensus | 25% | LLM
Consistency | 22% | LLM
Reference Grounding | 20% | LLM
Confidence Audit | 16% | LLM
Domain Rules | 9% | NON-LLM
Custom Rules | 9% | NON-LLM
Semantic Drift | 7% | LLM
18% of your GR score comes from non-LLM checks — fully deterministic, zero AI subjectivity. This makes the score auditable and reproducible.

High-risk domain multiplier

For responses in healthcare, legal, finance, or government contexts, a 0.92× multiplier is applied to the final score. This raises the bar for these industries where hallucinations carry real-world liability.
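The scoring rules above can be sketched in a few lines. This is an illustrative model, not the actual implementation: the weights, band floors, and 0.92× multiplier come from the tables in this guide, while the function name and the normalization over the layers that actually ran are assumptions.

```python
# Illustrative sketch of the GR scoring described above.
# Weights and band floors are taken from this guide's tables;
# normalizing over the layers that ran is an assumption.

WEIGHTS = {
    "consensus": 0.25, "consistency": 0.22, "grounding": 0.20,
    "confidence": 0.16, "domain_rules": 0.09, "custom_rules": 0.09,
    "drift": 0.07,
}

# (floor, rating) pairs, highest band first
BANDS = [(88, "GR-5"), (76, "GR-4"), (60, "GR-3"), (42, "GR-2"), (0, "GR-1")]

def gr_score(layer_scores: dict, high_risk: bool = False):
    """Weighted composite of per-layer scores (each 0-100), with the
    0.92x multiplier applied for high-risk domains."""
    total = sum(WEIGHTS[name] for name in layer_scores)
    score = sum(WEIGHTS[name] * s for name, s in layer_scores.items()) / total
    if high_risk:
        score *= 0.92
    score = round(score)
    rating = next(label for floor, label in BANDS if score >= floor)
    return score, rating
```

This also shows how the multiplier raises the bar: a raw 95 in a healthcare context becomes 87 and drops out of GR-5, while only raw scores of roughly 96+ keep GR-5 after the 0.92× adjustment.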

CORE CONCEPTS

The 9 Validation Layers

Every test runs up to 9 independent checks. Each layer is designed to catch a different type of hallucination or reliability failure.

LAYER 01: Consistency Check
LLM CHECK · 22%

Rephrases your question 3 ways and checks whether the AI gives the same facts each time. Inconsistency is a primary hallucination signal.

  • Facts that change when reworded
  • Numbers that vary across phrasings
  • Self-contradicting answers
LAYER 02: Reference Grounding
LLM CHECK · 20%

When a reference document is provided, checks every claim in the AI response against it. Flags anything unsupported or contradicted by your source.

  • Claims not in your policy or spec
  • Facts that contradict your document
  • Knowledge gaps the AI silently filled
LAYER 03: Confidence Audit
LLM CHECK · 16%

Detects when the AI sounds more certain than it should — overconfident language is a strong signal for fabricated content.

  • Invented statistics stated as fact
  • Definitive claims on uncertain topics
  • Missing hedging language
LAYER 04: Multi-Model Consensus
LLM CHECK · 25%

Sends the same question to GPT-4o independently. When two models disagree on a specific fact, that disagreement is flagged as a hallucination signal.

  • Facts where GPT-4o gives a different answer
  • Figures or dates both models can't agree on
  • Model-specific confabulations
LAYER 05: Semantic Drift
LLM CHECK · 7%

Measures whether the AI stayed on topic or wandered into unverifiable territory it wasn't asked about.

  • Responses that answer a different question
  • Unrequested training-data content
  • Tangential claims inserted to seem complete
LAYER 06: Domain Rules
NON-LLM · ZERO AI · 9%

Zero-LLM. Checks the response against 21+ verified facts across 12 industries — healthcare, legal, finance, HR, security, and more. Deterministic and auditable.

  • Wrong drug dosages or tax rates
  • Invented court citations
  • Incorrect software specs or compliance claims
LAYER 07: Custom Rule Sets
NON-LLM · ZERO AI · 9%

Zero-LLM. You upload your own verified facts. Grounded checks every response against them automatically — with no AI involvement, so results are reproducible.

  • Wrong founding year or company details
  • Incorrect pricing or plan limits
  • Any company-specific fact your AI gets wrong
NEW FEATURE

Structured Data Fidelity (9th Layer)

The 9th layer lets you attach any CRM export, CSV, or JSON dataset to your Response Audit. Grounded automatically parses every field, detects data types, computes aggregates, and verifies every AI claim against your actual data — no manual field mapping required.

HOW TO USE IT

  1. Open Response Audit and fill in Steps 01 and 02.
  2. Scroll to Step 04 — Attach CRM or structured data and click Enable.
  3. Upload a CSV or JSON file, or paste the data directly. Max 500KB.
  4. Run the audit. The Data Attribution tab will show a field-by-field verification with MATCHED / MISMATCH / INFERRED status for every claim.

SUPPORTED FORMATS

FORMAT | EXAMPLE | MULTI-RECORD?
HubSpot contacts CSV | First Name, Email, Lifecycle Stage, ... | Yes — all rows
Salesforce export | Id, Name, Status, Industry, ... | Yes — all rows
Custom CSV | account_name, health_score, sla_breach, ... | Yes — all rows
JSON object | {"account": "Acme", "score": 85} | Single record
JSON array | [{"account":"Acme"}, {"account":"TechCo"}] | Yes — all rows
TSV / pipe-separated | field1 field2 field3 | Yes — all rows

HOW GROUNDED VERIFIES DATA

Grounded uses a three-tier verification engine in which all parsing and arithmetic are deterministic (no LLM arithmetic):

  1. Type detection — every field is auto-classified as boolean, numeric, date, enum, or ID based on its values.
  2. Aggregate computation — sums, averages, counts, and group-bys are computed in code before any LLM call. Results are exact.
  3. Claim verification — Claude receives the computed ground truth and judges whether the AI's claims match it.

Example: if you ask "What is the average days open for SLA breached tickets?" and your CSV has 20 ticket records, Grounded computes the exact average (e.g. 14.5) and verifies the AI's answer against that — not against Claude's own recollection.
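The worked example above can be reproduced with ordinary code. Below is a minimal sketch of tiers 1 and 2 (type coercion and aggregate computation) on a toy dataset; the field names `sla_breach` and `days_open` are hypothetical, in the style of the custom CSV example earlier in this section.

```python
import csv, io, statistics

# Minimal sketch of tiers 1-2: parse the CSV, coerce field types in code,
# and compute the aggregate deterministically (no LLM arithmetic).
# Field names (sla_breach, days_open) are hypothetical examples.

raw = """ticket_id,sla_breach,days_open
T-1,true,12
T-2,false,3
T-3,true,17
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Tier 1: treat sla_breach as boolean, days_open as numeric.
breached = [float(r["days_open"]) for r in rows if r["sla_breach"] == "true"]

# Tier 2: exact aggregate, computed before any LLM call.
avg_days_open = statistics.mean(breached)   # 14.5 for this toy data
```

Tier 3 would then hand the computed value to the judging model as ground truth, so the AI's claimed average is checked against 14.5 rather than against the model's own recollection.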

HOW TO

Running a Response Audit

Use Single Test to validate one AI response at a time. This is the most common workflow for spot-checks, pre-release validation, and testing during development.

1
Navigate to New Test
Click New Test in the left sidebar.
2
Enter the question
Paste the exact prompt sent to the AI. Be precise — vague questions reduce scoring accuracy.
3
Paste the AI response
Copy the full AI response. Don't truncate it — partial responses score differently.
4
(Optional) Add a reference document
Click + Add reference document to upload or paste a policy doc, product spec, or clinical guideline. This enables the Reference Grounding layer.
5
Click Run test
The 9-layer pipeline executes. You'll see a real-time progress bar showing each layer completing.
6
Review the InsightReport
Your full results appear — GR rating, score, per-layer breakdown, flagged claims, and recommended action.

Test Suite mode

Test Suite lets you build a named set of test cases and run them as a group. Use this for sprint regression testing or pre-release checklists.

  1. Switch to Test Suite tab within New Test.
  2. Name your suite (e.g. "Sprint 14 — AI Chatbot").
  3. Add individual test cases with question, response, and optional reference.
  4. Click Run Suite. Each case runs through the full 9-layer pipeline.
  5. View a suite summary report with aggregate GR score and per-case breakdown.
Completed test suites are saved to Test History and can be exported as PDF reports.
HOW TO

Batch Testing

Batch Testing lets you upload a file of AI responses and test all of them in one operation. Use this for regression testing after model updates, auditing large content sets, or testing a full FAQ or knowledge base.

Supported file formats

FORMAT | REQUIRED COLUMNS | OPTIONAL
.csv | question, aiResponse | refDoc
.json | question, aiResponse | refDoc

CSV example

question,aiResponse,refDoc
"What is the SG rate for 2024-25?","The rate is 11%",""
"What does TGA stand for?","Therapeutic Goods Administration",""
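Before uploading, a batch file in the format above can be sanity-checked locally. This is a minimal sketch; the `validate_batch_csv` helper is illustrative and not part of the product.

```python
import csv, io

# Illustrative pre-upload check for the batch CSV format documented above.
REQUIRED = {"question", "aiResponse"}

def validate_batch_csv(text: str) -> list:
    """Parse a batch CSV and fail fast if required columns are missing."""
    reader = csv.DictReader(io.StringIO(text))
    cols = set(reader.fieldnames or [])
    missing = REQUIRED - cols
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)

rows = validate_batch_csv(
    'question,aiResponse,refDoc\n'
    '"What does TGA stand for?","Therapeutic Goods Administration",""\n'
)
```

Catching a missing `aiResponse` column locally avoids burning batch runs on rows that would fail the preview step.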

Running a batch

1
Go to Batch Tests
Click Batch Tests in the left sidebar.
2
Name your batch
Give it a descriptive name — it appears in Test History and PDF exports.
3
Upload your file
Click Upload CSV or JSON. Maximum 50 rows on Starter, unlimited on Team.
4
Review the preview
Check the parsed rows before running. Fix any format issues in your file.
5
Click Run Batch
Each row runs through the full 9-layer pipeline sequentially.
6
Export results
Download the full report as CSV or PDF.
Batch testing counts against your monthly run quota — one row = one run. A 50-row batch uses 50 runs.
HOW TO

Custom Rule Sets

Custom Rule Sets are the most powerful differentiator in Try Grounded AI. They let you upload your own verified facts — things only your organisation knows — and have them checked automatically on every test.

What to put in Custom Rules

  • Founding year, headquarters, company name variations
  • Product pricing, plan limits, feature availability
  • Regulatory references specific to your jurisdiction
  • Internal policy figures (thresholds, limits, rates)
  • Correct names for people, products, and services

File format

COLUMN | DESCRIPTION | EXAMPLE
claim | The incorrect statement to watch for | "KiwiQA was founded in 2010"
correct_answer | The verified correct fact | "KiwiQA was founded in 2020"
source | Where this fact can be verified | "kiwiqa.ai/about"

CSV example

claim,correct_answer,source
"KiwiQA founded 2010","Founded in 2020","kiwiqa.ai/about"
"NZ-based","Headquartered in Sydney","kiwiqa.ai/about"
"manual testing only","Manual + automated QA","kiwiqa.ai/services"
Custom Rules run with zero LLM involvement — checks are deterministic string-matching. Results are identical every time and fully auditable.
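As a mental model, the zero-LLM check can be sketched as plain substring matching. Case-insensitive matching and the phrase-style claim below are assumptions; the product's exact matching strategy may differ.

```python
# Sketch of a deterministic custom-rule check: flag any response that
# contains a known-incorrect claim. Case-insensitive substring matching
# is an assumption about the matching strategy.

RULES = [
    {"claim": "founded in 2010",
     "correct_answer": "Founded in 2020",
     "source": "kiwiqa.ai/about"},
]

def check_custom_rules(response: str, rules=RULES) -> list:
    """Return every rule whose incorrect claim appears in the response."""
    text = response.lower()
    return [r for r in rules if r["claim"].lower() in text]

violations = check_custom_rules("KiwiQA was founded in 2010.")
```

Because no LLM is involved, the same response always yields the same violations, which is what makes the results auditable.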

Adding rules manually

You can also add individual rules without a file by using the manual entry form in the Custom Rules tab. Enter the claim, correct answer, and source URL, then click Add.

HOW TO

Test History & Risk Profile

Test History

All tests — single, suite, and batch — are automatically saved to Test History. Use it to track results over time, re-run previous tests, and export reports.

  • Search: Filter by question text or batch name.
  • Filter: Show only PASS, WARN, FAIL, or batch results.
  • Pagination: 20 results per page for easy navigation.
  • Export CSV: Download your full history as a spreadsheet.
  • Re-run: Click ↺ Re-run on any row to load it back into the test form.

Risk Profile

The Risk Profile dashboard gives you an aggregate view of your AI reliability over time. It's designed for team leads, QA managers, and compliance officers who need a summary view.

  • Trend chart: GR score over time with selectable date ranges.
  • Layer breakdown: Which layers are failing most often.
  • Regression alerts: Automatic flag when your average GR score drops.
  • Period comparison: Compare this week vs last week.
  • Export PDF: Download a formatted Risk Profile report.
HOW TO

Team Workspace

The Team feature lets you invite colleagues to share your workspace. All team members share the same run quota, test history, and custom rules — there's no per-seat pricing.

Team workspaces are different from referral sign-ups. Invited team members join your existing workspace — they don't create a separate account with a separate quota.

Inviting a colleague

1
Go to Team in the sidebar
Click Team under CONFIGURATION.
2
Enter their email address
Type your colleague's work email and click Create invite link.
3
Share the invite link
Copy the generated link and send it to your colleague, or click Open in mail to draft an email.
4
They accept the invite
Your colleague clicks the link, signs in or creates an account, and is automatically added to your workspace.
5
Status updates to Accepted
Refresh the Team page — their status changes from PENDING to ACCEPTED.
Invite links are single-use and not eligible for referral bonus runs. Share only with your intended colleague.

Managing team members

  • Re-send: Generate a fresh invite link for a PENDING member.
  • Remove: Revoke a member's access to your workspace.
  • Member count: Shown as a badge on the Team nav item.
DEVELOPER

API Access & CI/CD

The Try Grounded AI API lets you run hallucination checks programmatically — from CI/CD pipelines, automated test suites, or your own applications. Available on Team plan and above.

Authentication

Generate an API key in the API Keys tab of the dashboard. Keys use the grnd_ prefix. Pass your key as a Bearer token in the Authorization header.

Run a test via API

curl -X POST https://grounded-topaz.vercel.app/api/v1/detect \
  -H "Authorization: Bearer grnd_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the SG rate for 2024-25?",
    "aiResponse": "The superannuation guarantee rate is 11%.",
    "refDoc": ""
  }'

Response format

{
  "score": 42,
  "grRating": "GR-2",
  "risk": "HIGH",
  "verdict": "FAIL",
  "finding": "Incorrect SG rate — domain rule violation detected.",
  "stages": {
    "consistency": { "score": 38 },
    "grounding":   { "score": 45 },
    "confidence":  { "score": 61 },
    "consensus":   { "score": 33 },
    "drift":       { "score": 80 },
    "deterministic": { "score": 12, "findings": ["SG rate mismatch"] }
  }
}
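The same call can be made from Python using only the standard library. This is a sketch: the endpoint, headers, and payload fields mirror the curl example above, while the helper names are illustrative.

```python
import json
import urllib.request

API_URL = "https://grounded-topaz.vercel.app/api/v1/detect"

def build_request(api_key: str, question: str, ai_response: str,
                  ref_doc: str = "") -> urllib.request.Request:
    """Build the POST request shown in the curl example above."""
    body = json.dumps({"question": question,
                       "aiResponse": ai_response,
                       "refDoc": ref_doc}).encode()
    return urllib.request.Request(
        API_URL, data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})

req = build_request("grnd_your_key_here",
                    "What is the SG rate for 2024-25?",
                    "The superannuation guarantee rate is 11%.")

# Sending the request requires a valid key:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

A caller would then gate on `result["score"]` or `result["grRating"]` exactly as in the CI example that follows.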

GitHub Actions example

- name: Run hallucination check
  run: |
    RESULT=$(curl -s -X POST \
      -H "Authorization: Bearer ${{ secrets.GROUNDED_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{"question": "What is the SG rate for 2024-25?", "aiResponse": "The superannuation guarantee rate is 11%.", "refDoc": ""}' \
      https://grounded-topaz.vercel.app/api/v1/detect)

    SCORE=$(echo "$RESULT" | jq '.score')
    if [ "$SCORE" -lt 60 ]; then
      echo "GR score $SCORE is below threshold. Blocking deployment."
      exit 1
    fi
Full API reference with all parameters, error codes, and SDK examples is available at /api-docs.
BEST PRACTICES

Interpreting Results

The InsightReport is the full output of a single test. Here's how to read each section.

The GR score

The top-level score (0–100) and GR rating (GR-1 to GR-5) are your primary signals. Use them to decide whether the AI response is safe to use:

GR RATING | RECOMMENDED ACTION
GR-5 | Deploy with confidence. All checks passed.
GR-4 | Deploy with standard monitoring. Review any flagged layers.
GR-3 | Review the Recommended Action. Rewrite the prompt or add constraints.
GR-2 | Do not deploy. Address the flagged issues and re-test.
GR-1 | Block release. Escalate to team lead. Investigate the source of the hallucination.

Per-layer scores

Each layer returns an individual score. A score below 60 on any layer is a warning signal even if the overall GR rating is acceptable.

The Recommended Action

Every test includes a plain-English recommended action explaining what was found and what to do about it. This is designed to be readable by non-engineers and can be included directly in defect reports.

Exporting reports

  • PDF export: Timestamped, formatted report suitable for client delivery or compliance documentation.
  • Re-run: Load a previous test back into the form to test a revised AI response.
  • CSV export: From Test History, export all results as a spreadsheet for analysis.
FAQ

Frequently Asked Questions

Does Try Grounded AI connect to my AI model?

No. You paste the AI response manually. Try Grounded AI never connects to your AI model, API keys, system prompt, or production environment. Your data stays private.

How accurate is the GR scoring?

The scoring model was calibrated against a 100-case golden dataset with known correct answers. Post-calibration accuracy is approximately 75–80%. Domain Rules and Custom Rules layers are deterministic — they are 100% accurate for the facts you provide.

What's the difference between Domain Rules and Custom Rules?

Domain Rules are pre-built by KiwiQA — 21+ verified facts across 12 industries. Custom Rules are facts you upload yourself. Both run with zero LLM involvement and are deterministic.

How many runs do I get?

Free plan: 50 evaluation runs. Starter: 500 runs/month at $29/month. Team: 5,000 runs/month at $149/month. Plus 10 bonus runs for each successful referral.

Can I use Try Grounded AI for regulated industries?

Yes. Healthcare, legal, and finance responses are automatically detected and a 0.92× risk multiplier is applied. This raises the bar for GR-5 qualification in high-stakes industries.

Does it work with any LLM?

Yes. You paste any text — the source model doesn't matter. It works with GPT-4o, Claude, Gemini, Llama, Mistral, or any model you're using.

Is my data stored?

Test results (question, score, GR rating, findings) are stored in your account for Test History. The AI response content is not stored beyond the analysis session for individual tests.

Can I integrate it into my CI/CD pipeline?

Yes — API access is available on Team plan. See the API Access section above or visit /api-docs for the full reference.

What is the Team workspace?

Team workspace lets multiple colleagues share one account — pooled run quota, shared test history, and shared custom rules. No per-seat pricing. Available on all plans.

How do I report a bug or get support?

Click Get Help in the dashboard sidebar to open the support chatbot. For urgent issues email hello@kiwiqa.ai.

Still have questions?
Our team at KiwiQA is happy to help.