DOCUMENTATION

Try Grounded AI
User Guide

Everything you need to test AI responses for hallucinations, interpret GR ratings, and integrate Grounded into your QA workflow.

Version 2.0 · April 2026 · By KiwiQA · 10 Detection Layers

OVERVIEW

What is Try Grounded AI?

Try Grounded AI is an AI hallucination testing platform built by KiwiQA. It lets you paste any AI-generated response and run up to 10 independent detection layers — returning a structured verdict, a GR reliability score, and specific recommended actions in under 60 seconds.

Unlike developer-focused tools that require SDK integration and engineering setup, Try Grounded AI works for QA testers, product managers, consultants, and compliance teams with no code required.

Who is it for?

ROLE	USE CASE
QA Engineers	Validate AI responses before each release. Catch hallucinations in automated test suites.
Product Managers	Spot-check AI features before shipping. Generate evidence for stakeholder sign-off.
Security Engineers	Test AI outputs for compliance with security policies and correct regulatory references.
Consultants	Audit client AI systems. Produce branded PDF reports as deliverables.
Compliance / Legal	Verify AI responses meet industry-specific accuracy standards before deployment.
Developers	Integrate via API into CI/CD pipelines. Run regression tests on model updates.

✓

Try Grounded AI caught hallucinations in AI responses that had passed conventional QA checks in 100% of the test cases used to calibrate the scoring model.

GETTING STARTED

Quick Start

Your first hallucination test takes under 60 seconds. No setup. No SDK. No integrations required.

1. Create a free account

Go to the sign-up page. Enter your email and create a password. You receive 50 free evaluation runs, with no credit card required.

2. Open the dashboard

After signing in you land on the Home screen. Click New Test in the left sidebar.

3. Paste the question

Enter the exact prompt that was sent to your AI. The more precise the prompt, the more accurate the verdict.

4. Paste the AI response

Copy the AI-generated response and paste it into the second field. Works with GPT, Claude, Gemini, Llama, or any LLM.

5. Click Run test

The 10-layer validation runs automatically. Results appear in under 60 seconds.

6. Read your GR verdict

You receive a GR-1 to GR-5 rating, a score out of 100, per-layer findings, and a recommended action.

✓

Click Load example → in the Run Test form to pre-fill a real hallucination scenario and see how the results look before testing your own content.

CORE CONCEPTS

GR Rating System

Every test returns a GR (Grounded Reliability) rating — a score from 0 to 100 mapped to five bands. GR is the primary output of Try Grounded AI and the metric your team should use to gate AI deployments.

RATING	SCORE	LABEL	MEANING
GR-5	88–100	Verified	All checks passed. Safe to deploy. Use with confidence.
GR-4	76–87	Reliable	Minor issues only. Deploy with standard monitoring.
GR-3	60–75	Conditional	Review flagged findings before shipping. Consider rewriting the prompt.
GR-2	42–59	High Risk	Significant hallucination signals. Do not deploy without remediation.
GR-1	0–41	Critical	Severe hallucination detected. Block release immediately.
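The band table above can be expressed as a small lookup. This is a minimal sketch for readers who want to reproduce the mapping in their own tooling; the function name `gr_rating` is hypothetical and is not part of the Try Grounded AI API.

```python
# Lower bounds, ratings, and labels copied from the GR rating table above.
GR_BANDS = [
    (88, "GR-5", "Verified"),
    (76, "GR-4", "Reliable"),
    (60, "GR-3", "Conditional"),
    (42, "GR-2", "High Risk"),
    (0, "GR-1", "Critical"),
]


def gr_rating(score):
    """Return (rating, label) for a score between 0 and 100 (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Bands are ordered from highest to lowest, so the first match wins.
    for lower_bound, rating, label in GR_BANDS:
        if score >= lower_bound:
            return rating, label
    raise AssertionError("unreachable: the last band starts at 0")


print(gr_rating(83))  # ('GR-4', 'Reliable')
print(gr_rating(42))  # ('GR-2', 'High Risk')
```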

High-risk domain multiplier

For responses in healthcare, legal, finance, or government contexts, a 0.92× multiplier is applied to the final score after all layer calculations. This raises the bar for these industries where hallucinations carry real-world liability. A score of 83 in a healthcare response becomes 76 after the multiplier — just at the GR-4 pass threshold.

CORE CONCEPTS

How Scoring Works

The GR score is a dynamic weighted average of up to 10 detection layers. Weights are not fixed — they adjust automatically depending on which optional layers activate for a given test. Layers that are skipped (not applicable) are excluded entirely, and the remaining weights renormalise to sum to 100%.

ℹ

The key principle: the most direct signal always gets the highest weight. When structured data is provided, field-by-field comparison dominates. When a reference document is uploaded, grounding dominates. When neither is present, Model Agreement and Consistency share the lead.

Final Score Formula

Score = clamp(0–100,  round( (weighted_avg − penalties) × domain_multiplier ))

where  weighted_avg = Σ(layer_score × weight) / Σ(active_weights)
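The formula can be read as the following sketch. It is an illustration of the arithmetic only, not the production implementation: the function name, argument names, and layer keys are assumptions, and Python's built-in `round` (which rounds exact halves to the nearest even integer) may differ from the rounding rule the platform uses.

```python
def final_score(layer_scores, weights, penalties=0.0, domain_multiplier=1.0):
    """Sketch of: clamp(0-100, round((weighted_avg - penalties) * domain_multiplier)).

    layer_scores: score per active layer (skipped layers are simply left out).
    weights: weight per layer; only the weights of active layers are used.
    """
    active = [name for name in layer_scores if name in weights]
    if not active:
        raise ValueError("at least one active layer is required")
    total_weight = sum(weights[name] for name in active)
    weighted_avg = sum(layer_scores[name] * weights[name] for name in active) / total_weight
    raw = round((weighted_avg - penalties) * domain_multiplier)
    return max(0, min(100, raw))


# Standard-scenario weights from the layer weight table; layer keys are illustrative.
weights = {"consistency": 22, "confidence": 16, "model_agreement": 25,
           "semantic_drift": 7, "domain_rules": 18}
scores = {name: 83 for name in weights}  # every active layer scores 83

print(final_score(scores, weights))                          # 83
print(final_score(scores, weights, domain_multiplier=0.92))  # 76
```

The second call reproduces the healthcare example from the GR Rating System section: a weighted average of 83 becomes 76 after the 0.92× multiplier.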

Layer weights by scenario

The table below shows how weights shift across the three main test configurations:

LAYER	Standard (no optional inputs)	With Reference Doc	With Structured Data
L1 Consistency	22%	19%	13%
L2 Doc Grounding	—	20%	12%
L3 Confidence	16%	14%	10%
L4 Model Agreement	25%	22%	15%
L5 Semantic Drift	7%	6%	5%
L6 Domain Rules	18%	16%	12%
L7 Custom Rules	incl. in L6	incl. in L6	incl. in L6
L8 RAG Citation Map	—	incl. in L2	—
L9 Structured Data Fidelity	—	—	30%
L10 Source Attribution	—	—	12% (when citations detected)
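As printed, the Standard column sums to 88 and the With Reference Doc column to 97, so the sketch below assumes the table lists weights before renormalisation and that active weights are rescaled proportionally to total 100%. The `renormalise` helper and the short layer keys are hypothetical.

```python
# Weights copied from the table above (percent values, before renormalisation).
STANDARD = {"L1": 22, "L3": 16, "L4": 25, "L5": 7, "L6": 18}
WITH_REFERENCE_DOC = {"L1": 19, "L2": 20, "L3": 14, "L4": 22, "L5": 6, "L6": 16}


def renormalise(weights):
    """Rescale the active layer weights so they sum to 100 (assumed behaviour)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one layer must be active")
    return {layer: 100 * weight / total for layer, weight in weights.items()}


for name, table in [("standard", STANDARD), ("with reference doc", WITH_REFERENCE_DOC)]:
    effective = renormalise(table)
    print(name, {layer: round(weight, 1) for layer, weight in effective.items()})
```

Dividing by the sum of active weights, as the Final Score Formula does, gives the same result as applying these rescaled percentages directly.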

Post-score penalties

Three adjustments are applied after the weighted average is computed:

PENALTY	AMOUNT	TRIGGER
Suspicious claims	−8 pts each, max −24	L3 Confidence flags a claim as vague, circular, or unverifiable (e.g. "according to studies…").
Fabricated citations	−15 pts each, max −30	L10 Source Attribution confirms a named citation does not exist or is misattributed. The most severe penalty.
High-risk domain multiplier	× 0.92 (−8% effective)	Detected domain is Healthcare, Legal, Finance, or Government.
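The caps in the penalty table can be sketched as follows. The function names are hypothetical, and the domain strings are assumed to match the four high-risk domains listed above.

```python
HIGH_RISK_DOMAINS = {"healthcare", "legal", "finance", "government"}


def post_score_penalties(suspicious_claims, fabricated_citations):
    """Total points to subtract from the weighted average, with per-type caps."""
    suspicious = min(suspicious_claims * 8, 24)       # -8 pts each, max -24
    fabricated = min(fabricated_citations * 15, 30)   # -15 pts each, max -30
    return suspicious + fabricated


def domain_multiplier(domain):
    """Return 0.92 for high-risk domains and 1.0 otherwise."""
    return 0.92 if domain.strip().lower() in HIGH_RISK_DOMAINS else 1.0


print(post_score_penalties(suspicious_claims=4, fabricated_citations=1))  # 39
print(domain_multiplier("Healthcare"))  # 0.92
```

Four suspicious claims hit the 24-point cap, so the total is 24 + 15 = 39 points before the multiplier is applied.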

Deterministic checks (L6 Domain Rules)

The Domain Rules layer runs without any LLM and relies on fully deterministic pattern matching. The combined L6 score is the average of three sub-checks: Generic Rules, Domain-specific Rules, and Custom Rules.

FINDING TYPE	SEVERITY	HOW IT FIRES
Arithmetic errors	HIGH	Validates A+B=C, A−B=C, A÷B≈C with 2% tolerance. Each error: −25 pts.
Count mismatches	HIGH / MED	AI claims "5 reasons" but lists fewer or more. Each mismatch: −10 to −20 pts.
List numbering gaps	MED	Numbered list skips an item (1, 2, 4 — missing 3). Penalty: −10 pts.
Overconfident language	MED	Flags: definitively, unquestionably, always, never, guaranteed, 100%, it is proven that. Penalty: −5 pts.
Temporal claims	MED	Time-sensitive language without a date anchor: "currently", "as of today", "the current CEO". Penalty: −8 pts each.
Future date references	HIGH	References years 2030+ as present facts. Penalty: −10 pts each.
Unverifiable statistics	LOW	Unsourced percentage claims. Flagged for review — no automatic score penalty.
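Two of the checks in the table are easy to illustrate with plain pattern matching. The sketch below is not KiwiQA's rule engine: the regular expressions are assumptions, and the 2% tolerance is assumed to be relative to the expected result and applied to all three operations.

```python
import re

# Phrase list copied from the "Overconfident language" row above.
OVERCONFIDENT = ["definitively", "unquestionably", "always", "never",
                 "guaranteed", "100%", "it is proven that"]

# Matches "A + B = C", "A - B = C", and "A ÷ B ≈ C" with optional decimals.
ARITHMETIC = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-−÷/])\s*(\d+(?:\.\d+)?)\s*[=≈]\s*(\d+(?:\.\d+)?)"
)


def arithmetic_errors(text, tolerance=0.02):
    """Return every arithmetic statement whose claimed result is off by more than 2%."""
    errors = []
    for match in ARITHMETIC.finditer(text):
        a, op, b, claimed = match.groups()
        a, b, claimed = float(a), float(b), float(claimed)
        if op == "+":
            expected = a + b
        elif op in "-−":
            expected = a - b
        elif b == 0:
            continue  # division by zero cannot be validated
        else:
            expected = a / b
        if abs(claimed - expected) > tolerance * abs(expected):
            errors.append(match.group(0))
    return errors


def overconfident_phrases(text):
    """Return the listed phrases that appear as whole words or phrases in the text."""
    lowered = text.lower()
    return [phrase for phrase in OVERCONFIDENT
            if re.search(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", lowered)]


print(arithmetic_errors("Revenue was 120 + 35 = 150 in total."))  # ['120 + 35 = 150']
print(overconfident_phrases("This is guaranteed to always work."))  # ['always', 'guaranteed']
```

Because nothing here calls a model, the same input always produces the same findings, which is the property the layer depends on.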

CORE CONCEPTS

The 10 Detection Layers

Every test runs up to 10 independent checks. L1 and L3–L6 always run. L2 and L7–L10 are optional and activate only when the relevant input is present (reference document, custom rules, structured data, or detected citations).

LAYER 01 · Consistency

LLM CHECK · 22%

Rephrases your question and checks whether the AI gives the same facts each time. Inconsistency is the primary hallucination signal: if an AI changes its answer when the question is reworded, the original answer was likely fabricated.

Facts that change when reworded
Numbers that vary across phrasings
Self-contradicting answers

LAYER 02 · Doc Grounding · OPTIONAL

LLM CHECK · 20%

When a reference document is provided, checks every claim in the AI response against it. Flags anything unsupported or contradicted by your source. Activates the RAG Citation Map (L8) automatically.

Claims not in your policy or spec
Facts that contradict your document
Knowledge gaps the AI silently filled

LAYER 03 · Confidence Audit

LLM CHECK · 16%

Detects when the AI sounds more certain than it should. Overconfident language is a strong signal for fabricated content. Each suspicious claim flagged here carries a −8 pt penalty on the final score (max −24).

Invented statistics stated as fact
Definitive claims on uncertain topics
Missing hedging language

LAYER 04 · Model Agreement

LLM CHECK · 25%

Sends the same question to a secondary AI model independently. When the two models disagree on a specific fact, that disagreement is flagged as a hallucination signal. This is the highest-weighted standard layer because cross-model disagreement is the strongest predictor of hallucination.

Facts where GPT-4o gives a different answer
Figures or dates the two models disagree on
Model-specific confabulations

LAYER 05 · Semantic Drift

LLM CHECK · 7%

Measures whether the AI stayed on topic or wandered into unverifiable territory it was not asked about.

Responses that answer a different question
Unrequested training-data content
Tangential claims inserted to seem complete

LAYER 06 · Domain Rules

NON-LLM · ZERO AI · 18%

Zero-LLM. A deterministic rule engine checks the AI response for specific error patterns: arithmetic errors, count mismatches, overconfident language, temporal claims, and future date references. Results are 100% reproducible. See "How Scoring Works" above for the full penalty table.

Arithmetic errors
Count mismatches
Overconfident language
Temporal claims without date anchors

LAYER 07 · Custom Rules · OPTIONAL

NON-LLM · ZERO AI · Incl. in L6

Zero-LLM. You upload your own verified facts (things only your organisation knows), and Grounded checks every response against them automatically. The layer score is max(0, 100 − violations × 20), where violations is the number of rules the response breaks. See the Custom Rule Sets section for setup.

Wrong company facts
Incorrect pricing or plan limits
Any fact your AI gets wrong repeatedly

LAYER 08 · RAG Citation Map · OPTIONAL

LLM CHECK · Incl. in L2

Maps reference document passages to AI claims. Flags assertions that have no corresponding evidence in the uploaded document. Activates automatically when a reference document is provided (alongside L2).

Unsupported assertions
Claims the AI inserted beyond your document
Evidence mapping per claim

LAYER 09 · Structured Data Fidelity · OPTIONAL

LLM CHECK · 30% (when active)

The most direct accuracy signal available. Compares AI output field-by-field against structured data you provide (CSV, JSON, tables). When active, this layer carries 30% weight, higher than any other single layer, because field-by-field ground-truth comparison is more reliable than any LLM inference check.

Field mismatches vs source data
Wrong aggregates or totals
Inferred vs stated field values

LAYER 10 · Source Attribution · NEW · OPTIONAL

LLM CHECK · 12% (when active)

Detects fabricated citations and misattributed sources. When the AI response contains named references (authors, URLs, study titles), this layer verifies them. Each fabricated citation carries a −15 pt penalty on the final score (max −30). This is the most severe single penalty in the scoring system.

Fabricated citations
Misattributed sources
Invented study references or report IDs

FEATURE GUIDE

Structured Data Fidelity (L9)

The 9th layer lets you attach any CRM export, CSV, or JSON dataset to your Response Audit. Grounded automatically parses every field, detects data types, computes aggregates, and verifies every AI claim against your actual data — no manual field mapping required.

How to use it

Open Response Audit and fill in Steps 01 and 02.
Scroll to Step 04 — Attach CRM or structured data and click Enable.
Upload a CSV or JSON file, or paste the data directly. Max 500 KB.
Run the audit. The Data Attribution tab shows a field-by-field verification with MATCHED / MISMATCH / INFERRED status for every claim.

Supported formats

FORMAT	EXAMPLE	MULTI-RECORD?
HubSpot contacts CSV	`First Name, Email, Lifecycle Stage, ...`	Yes — all rows
Salesforce export	`Id, Name, Status, Industry, ...`	Yes — all rows
Custom CSV	`account_name, health_score, sla_breach`	Yes — all rows
JSON object	`{"account": "Acme", "score": 85}`	Single record
JSON array	`[{"account":"Acme"}, {"account":"TechCo"}]`	Yes — all rows
TSV / pipe-separated	`field1 field2 field3`	Yes — all rows

How Grounded verifies data

A three-step pipeline in which all arithmetic is computed deterministically in code, never by an LLM:

Type detection — every field is auto-classified as boolean, numeric, date, enum, or ID based on its values.
Aggregate computation — sums, averages, counts, and group-bys are computed in code before any LLM call. Results are exact.
Claim verification — the computed ground truth is passed to the LLM to judge whether the AI's claims match it.

ℹ

Example: if you ask "What is the average days open for SLA-breached tickets?" and your CSV has 20 ticket records, Grounded computes the exact average (e.g. 14.5) and verifies the AI's answer against that number — not against the model's own recollection.
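The aggregate step in the example above can be sketched with the standard library. The column names (`ticket_id`, `sla_breach`, `days_open`) and the sample rows are hypothetical. The final comparison is a simple numeric stand-in for the claim verification step, which the platform delegates to an LLM.

```python
import csv
import io
from statistics import mean

# Hypothetical ticket export; a real file would come from your CRM or helpdesk.
SAMPLE_CSV = """ticket_id,sla_breach,days_open
T-1,true,12
T-2,false,3
T-3,true,17
T-4,false,5
"""


def average_days_open_for_breached(csv_text):
    """Compute the exact average of days_open over rows where sla_breach is true."""
    rows = csv.DictReader(io.StringIO(csv_text))
    breached = [float(row["days_open"]) for row in rows
                if row["sla_breach"].strip().lower() == "true"]
    return mean(breached) if breached else None


def claim_matches(claimed, ground_truth, tolerance=0.01):
    """Numeric stand-in for the verification step (the tolerance is an assumption)."""
    return ground_truth is not None and abs(claimed - ground_truth) <= tolerance


truth = average_days_open_for_breached(SAMPLE_CSV)
print(truth)                       # 14.5
print(claim_matches(14.5, truth))  # True
print(claim_matches(12.0, truth))  # False
```

Computing the figure in code first means the verdict never depends on a model doing arithmetic correctly.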

HOW TO

Running a Response Audit

Use Single Test to validate one AI response at a time. This is the most common workflow for spot-checks, pre-release validation, and testing during development.

1. Navigate to New Test

Click New Test in the left sidebar.

2. Enter the question

Paste the exact prompt sent to the AI. Be precise: vague questions reduce scoring accuracy.

3. Paste the AI response

Copy the full AI response. Don't truncate it, because partial responses score differently.

4. (Optional) Add a reference document

Click + Add reference document to upload or paste a policy doc, product spec, or clinical guideline. This enables the Doc Grounding (L2) and RAG Citation Map (L8) layers.

5. (Optional) Attach structured data

Enable Step 04 to attach a CSV or JSON dataset. This activates Structured Data Fidelity (L9), which becomes the dominant scoring signal.

6. Click Run test

The 10-layer pipeline executes. A real-time progress bar shows each layer completing.

7. Review the InsightReport

Your full results appear: GR rating, score, per-layer breakdown, flagged claims, and recommended action.

Test Suite mode

Test Suite lets you build a named set of test cases and run them as a group. Use this for sprint regression testing or pre-release checklists.

Switch to the Test Suite tab within New Test.
Name your suite (e.g. "Sprint 14 — AI Chatbot").
Add individual test cases with question, response, and optional reference.
Click Run Suite. Each case runs through the full 10-layer pipeline.
View a suite summary report with aggregate GR score and per-case breakdown.

ℹ

Completed test suites are saved to Test History and can be exported as PDF reports.

HOW TO

Batch Testing

Batch Testing lets you upload a file of AI responses and test all of them in one operation. Use this for regression testing after model updates, auditing large content sets, or testing a full FAQ or knowledge base.

Supported file formats

FORMAT	REQUIRED COLUMNS	OPTIONAL
`.csv`	`question`, `aiResponse`	`refDoc`
`.json`	`question`, `aiResponse`	`refDoc`

CSV example

question,aiResponse,refDoc
"What is the SG rate for 2024-25?","The rate is 11%",""
"What does TGA stand for?","Therapeutic Goods Administration",""

Running a batch

1. Go to Batch Tests

Click Batch Tests in the left sidebar.

2. Name your batch

Give it a descriptive name; it appears in Test History and PDF exports.

3. Upload your file

Click Upload CSV or JSON. Maximum 50 rows on Starter, unlimited on Team.

4. Review the preview

Check the parsed rows before running. Fix any format issues in your file.

5. Click Run Batch

Each row runs through the full 10-layer pipeline sequentially.

6. Export results

Download the full report as CSV or PDF.

⚠

Batch testing counts against your monthly run quota — one row = one run. A 50-row batch uses 50 runs.
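Before uploading, a batch file can be checked locally against the column requirements and the row limit described above. The sketch below is a hypothetical pre-flight check, not part of the product. It assumes that empty `question` or `aiResponse` cells are invalid, which the guide does not state.

```python
import csv
import io

REQUIRED_COLUMNS = {"question", "aiResponse"}

# The CSV example from this section.
SAMPLE_BATCH = '''question,aiResponse,refDoc
"What is the SG rate for 2024-25?","The rate is 11%",""
"What does TGA stand for?","Therapeutic Goods Administration",""
'''


def parse_batch_csv(csv_text, max_rows=None):
    """Return the parsed rows, raising ValueError on format problems.

    max_rows: pass 50 for the Starter plan limit, or None for no limit.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    rows = []
    for index, row in enumerate(reader, start=1):
        question = (row["question"] or "").strip()
        ai_response = (row["aiResponse"] or "").strip()
        if not question or not ai_response:
            raise ValueError(f"data row {index}: question and aiResponse must be non-empty")
        rows.append({"question": question, "aiResponse": ai_response,
                     "refDoc": row.get("refDoc") or ""})
    if max_rows is not None and len(rows) > max_rows:
        raise ValueError(f"{len(rows)} rows exceeds the limit of {max_rows}")
    return rows


rows = parse_batch_csv(SAMPLE_BATCH, max_rows=50)
print(f"{len(rows)} rows parsed; this batch would use {len(rows)} runs")
```

Since one row equals one run, the row count returned here is also the quota the batch will consume.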

HOW TO

Custom Rule Sets

Custom Rule Sets are the most powerful differentiator in Try Grounded AI. They let you upload your own verified facts (things only your organisation knows) and have them checked automatically on every test.

What to put in Custom Rules

Founding year, headquarters, company name variations
Product pricing, plan limits, feature availability
Regulatory references specific to your jurisdiction
Internal policy figures (thresholds, limits, rates)
Correct names for people, products, and services

File format

COLUMN	DESCRIPTION	EXAMPLE
`claim`	The incorrect statement to watch for	"KiwiQA was founded in 2010"
`correct_answer`	The verified correct fact	"KiwiQA was founded in 2020"
`source`	Where this fact can be verified	"kiwiqa.ai/about"

Scoring

The Custom Rules score is computed as max(0, 100 − violations × 20), where violations is the number of custom rules the response breaks. This means: 0 violations = 100, 1 violation = 80, 2 violations = 60, 3+ violations = 40 or below, reaching 0 at 5 violations. Custom Rules contribute to the combined L6 score alongside Domain Rules and Generic Rules.

✓

Custom Rules run with zero LLM involvement — checks are deterministic string-matching. Results are identical every time and fully auditable.
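The file format and scoring rule above can be combined into a short sketch. The matching strategy shown here (case-insensitive substring matching after collapsing whitespace) is an assumption about what "string-matching" means in practice, and all function names are hypothetical.

```python
import csv
import io

# Rule file in the three-column format described above (sample row from the table).
RULES_CSV = '''claim,correct_answer,source
"KiwiQA was founded in 2010","KiwiQA was founded in 2020","kiwiqa.ai/about"
'''


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(text.lower().split())


def load_rules(csv_text):
    return list(csv.DictReader(io.StringIO(csv_text)))


def find_violations(response, rules):
    """Return every rule whose incorrect claim appears in the response."""
    haystack = normalise(response)
    return [rule for rule in rules if normalise(rule["claim"]) in haystack]


def custom_rules_score(violation_count):
    """Scoring rule from this section: max(0, 100 - violations * 20)."""
    return max(0, 100 - violation_count * 20)


rules = load_rules(RULES_CSV)
violations = find_violations("Our records show KiwiQA was founded in 2010.", rules)
for rule in violations:
    print(f"Violation: {rule['claim']!r}; expected {rule['correct_answer']!r} ({rule['source']})")
print(custom_rules_score(len(violations)))  # 80
```

Keeping the `source` column in each finding is what makes the result auditable: a reviewer can trace every violation back to where the correct fact is recorded.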

HOW TO

Test History & Risk Profile

Test History

All tests — single, suite, and batch — are automatically saved to Test History. Use it to track results over time, re-run previous tests, and export reports.

Search: Filter by question text or batch name.
Filter: Show only PASS, WARN, FAIL, or batch results.
Export CSV: Download your full history as a spreadsheet.
Re-run: Click ↺ Re-run on any row to load it back into the test form.

Risk Profile

The Risk Profile dashboard gives you an aggregate view of your AI reliability over time. It is designed for team leads, QA managers, and compliance officers who need a summary view.

Overview tab: Score trend bars, detection layer health bars, top finding types, industry risk, and priority action plan — all in one view.
Detection Layers tab: Full table showing avg score, fail rate, and test count per layer.
Finding Types tab: Breakdown of deterministic findings (arithmetic errors, count mismatches, etc.) with counts and remediation advice.
Score Trend tab: Historical score chart with regression/improvement detection.
Industry tab: Avg GR score broken down by detected domain.
Worst Tests tab: Bottom 20 tests by score with finding type labels.
Action Plan tab: Auto-generated P1/P2/P3 recommendations based on your weakest layers and finding patterns.
Export PDF: Full formatted Risk Profile report ready to share with stakeholders.
Regression badge: Fires automatically when the last-7-day average drops more than 10 points vs the prior period.
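The regression badge rule can be sketched as a comparison of two averages. This sketch assumes "the prior period" means the seven days immediately before the last seven; the function name and the input shape (scores grouped by date) are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def regression_detected(scores_by_day, today, window_days=7, threshold=10.0):
    """Return True when the recent average is more than `threshold` points below the prior one.

    scores_by_day: dict mapping a date to the list of GR scores recorded that day.
    """
    recent_start = today - timedelta(days=window_days - 1)
    prior_start = recent_start - timedelta(days=window_days)
    recent = [score for day, scores in scores_by_day.items()
              if recent_start <= day <= today for score in scores]
    prior = [score for day, scores in scores_by_day.items()
             if prior_start <= day < recent_start for score in scores]
    if not recent or not prior:
        return False  # not enough history to compare
    return mean(prior) - mean(recent) > threshold


history = {
    date(2026, 4, 2): [82, 78],   # prior period, average 80
    date(2026, 4, 10): [66, 64],  # last 7 days, average 65
}
print(regression_detected(history, today=date(2026, 4, 14)))  # True
```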

HOW TO

Team Workspace

The Team feature lets you invite colleagues to share your workspace. All team members share the same run quota, test history, and custom rules — there is no per-seat pricing.

ℹ

Team workspaces are different from referral sign-ups. Invited team members join your existing workspace — they do not create a separate account with a separate quota.

Inviting a colleague

1. Go to Team in the sidebar

Click Team under TOOLS.

2. Enter their email address

Type your colleague's work email and click Create invite link.

3. Share the invite link

Copy the generated link and send it, or click Open in mail to draft an email.

4. They accept the invite

Your colleague clicks the link, signs in or creates an account, and is added to your workspace.

5. Status updates to Accepted

Refresh the Team page. Their status changes from PENDING to ACCEPTED.

⚠

Invite links are single-use and not eligible for referral bonus runs. Share only with your intended colleague.

DEVELOPER

API Access & CI/CD

The Try Grounded AI API lets you run hallucination checks programmatically — from CI/CD pipelines, automated test suites, or your own applications. Available on Team plan and above.

Authentication

Generate an API key in the API Keys tab of the dashboard. Keys use the grnd_ prefix. Pass your key as a Bearer token in the Authorization header.

Run a test via API

curl -X POST https://grounded-topaz.vercel.app/api/v1/detect \
  -H "Authorization: Bearer grnd_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the SG rate for 2024-25?",
    "aiResponse": "The superannuation guarantee rate is 11%.",
    "refDoc": ""
  }'

Response format

{
  "score": 42,
  "grRating": "GR-2",
  "risk": "HIGH",
  "verdict": "FAIL",
  "finding": "Incorrect SG rate — domain rule violation detected.",
  "stages": {
    "consistency":   { "score": 38 },
    "grounding":     { "score": 45 },
    "confidence":    { "score": 61 },
    "modelAgreement":{ "score": 33 },
    "semanticDrift": { "score": 80 },
    "deterministic": { "score": 12, "findings": ["SG rate mismatch"] }
  }
}
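The same request can be made from Python using only the standard library. This is a sketch under the assumption that the endpoint, headers, and field names match the curl example and response format above; the helper names (`build_request`, `run_test`, `should_block`) and the 90-second timeout are hypothetical.

```python
import json
import urllib.request

API_URL = "https://grounded-topaz.vercel.app/api/v1/detect"  # from the curl example


def build_request(api_key, question, ai_response, ref_doc=""):
    """Build the POST request without sending it."""
    payload = json.dumps(
        {"question": question, "aiResponse": ai_response, "refDoc": ref_doc}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def run_test(api_key, question, ai_response, ref_doc="", timeout=90):
    """Send the request and return the parsed JSON result."""
    request = build_request(api_key, question, ai_response, ref_doc)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def should_block(result, threshold=60):
    """Gate a release on the overall score (60 is the lower bound of GR-3)."""
    return result["score"] < threshold


# Usage (requires a real key, so it is left as a comment):
# result = run_test("grnd_your_key_here",
#                   "What is the SG rate for 2024-25?",
#                   "The superannuation guarantee rate is 11%.")
# if should_block(result):
#     raise SystemExit(f"GR score {result['score']} is below threshold")
```

Separating `build_request` from `run_test` keeps the payload construction testable without network access.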

GitHub Actions example

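- name: Run hallucination check
  run: |
    # Replace the payload with the question and AI response under test.
    RESULT=$(curl -sS --fail -X POST \
      -H "Authorization: Bearer ${{ secrets.GROUNDED_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{"question": "What is the SG rate for 2024-25?", "aiResponse": "The superannuation guarantee rate is 11%.", "refDoc": ""}' \
      https://grounded-topaz.vercel.app/api/v1/detect)

    # jq -e exits non-zero if the response has no score, which fails the step.
    SCORE=$(echo "$RESULT" | jq -er '.score')
    if [ "$SCORE" -lt 60 ]; then
      echo "GR score $SCORE is below threshold. Blocking deployment."
      exit 1
    fi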

ℹ

Full API reference with all parameters, error codes, and SDK examples is available at /api-docs.

BEST PRACTICES

Interpreting Results

The InsightReport is the full output of a single test. Here is how to read each section.

The GR score

GR RATING	RECOMMENDED ACTION
GR-5	Deploy with confidence. All checks passed.
GR-4	Deploy with standard monitoring. Review any flagged layer.
GR-3	Review the Recommended Action. Rewrite the prompt or add constraints.
GR-2	Do not deploy. Address flagged issues and re-test.
GR-1	Block release. Escalate to team lead. Investigate source of hallucination.

Per-layer scores

Each of the 10 layers returns an individual score. A score below 60 on any layer is a warning signal even if the overall GR rating is acceptable. Layers that were skipped (not applicable for this test) show as N/A.
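The per-layer warning rule can be applied to the `stages` object shown in the API response format. The sketch assumes skipped layers are either absent or carry a null score; the guide only says they display as N/A, so that representation is an assumption, as is the helper name.

```python
def layers_below(stages, threshold=60):
    """Return {layer: score} for every layer scoring under the threshold.

    Layers with no score (assumed to mean N/A) are ignored.
    """
    return {
        name: stage["score"]
        for name, stage in stages.items()
        if stage.get("score") is not None and stage["score"] < threshold
    }


# The "stages" object from the sample API response.
stages = {
    "consistency": {"score": 38},
    "grounding": {"score": 45},
    "confidence": {"score": 61},
    "modelAgreement": {"score": 33},
    "semanticDrift": {"score": 80},
    "deterministic": {"score": 12, "findings": ["SG rate mismatch"]},
}

print(layers_below(stages))
# {'consistency': 38, 'grounding': 45, 'modelAgreement': 33, 'deterministic': 12}
```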

The Recommended Action

Every test includes a plain-English recommended action explaining what was found and what to do about it. It is designed to be readable by non-engineers and can be included directly in defect reports.

Exporting reports

PDF export: Timestamped, formatted report suitable for client delivery or compliance documentation.
Re-run: Load a previous test back into the form to test a revised AI response.
CSV export: From Test History, export all results as a spreadsheet for analysis.

FAQ

Frequently Asked Questions

Does Try Grounded AI connect to my AI model?

No. You paste the AI response manually. Try Grounded AI never connects to your AI model, API keys, system prompt, or production environment. Your data stays private.

How accurate is the GR scoring?

The scoring model was calibrated against a 100-case golden dataset with known correct answers. Post-calibration accuracy is approximately 75–80%. Domain Rules and Custom Rules layers are deterministic — they are 100% accurate for the facts you provide.

What is the difference between Domain Rules and Custom Rules?

Domain Rules are pre-built by KiwiQA — patterns for overconfidence, arithmetic errors, temporal claims, and domain-specific checks across 12 industries. Custom Rules are facts you upload yourself. Both run with zero LLM involvement and are deterministic.

Why does the score sometimes seem lower than expected?

Check the post-score penalties: suspicious claims (−8 each), fabricated citations (−15 each), and the 0.92× domain multiplier for Healthcare/Legal/Finance/Government. These apply after the weighted average and can significantly reduce a score that looks acceptable at the layer level.

How many runs do I get?

Free plan: 50 evaluation runs. Starter: 500 runs/month at $29/month. Team: 5,000 runs/month at $149/month. Plus 10 bonus runs for each successful referral.

Can I use Try Grounded AI for regulated industries?

Yes. Healthcare, legal, finance, and government responses are automatically detected and a 0.92× risk multiplier is applied. This raises the bar for GR-5 qualification in high-stakes industries.

Does it work with any LLM?

Yes. You paste any text — the source model does not matter. It works with GPT-4o, Claude, Gemini, Llama, Mistral, or any model you are using.

Is my data stored?

Test results (question, score, GR rating, findings) are stored in your account for Test History. The AI response content is not stored beyond the analysis session for individual tests.

Can I integrate it into my CI/CD pipeline?

Yes. API access is available on the Team plan and above. See the API Access section above or visit /api-docs for the full reference.

What is the Team workspace?

Team workspace lets multiple colleagues share one account — pooled run quota, shared test history, and shared custom rules. No per-seat pricing. Available on all plans.

How do I report a bug or get support?

Click Get Help in the dashboard sidebar to open the support chatbot. For urgent issues email hello@kiwiqa.ai.

Still have questions?

Our team at KiwiQA is happy to help.
