Try Grounded AI
User Guide
Everything you need to test AI responses for hallucinations, interpret GR ratings, and integrate Grounded into your QA workflow.
What is Try Grounded AI?
Try Grounded AI is an AI hallucination testing platform built by KiwiQA. It lets you paste any AI-generated response and run up to 10 independent detection layers — returning a structured verdict, a GR reliability score, and specific recommended actions in under 60 seconds.
Unlike developer-focused tools that require SDK integration and engineering setup, Try Grounded AI works for QA testers, product managers, consultants, and compliance teams with no code required.
Who is it for?
| ROLE | USE CASE |
|---|---|
| QA Engineers | Validate AI responses before each release. Catch hallucinations in automated test suites. |
| Product Managers | Spot-check AI features before shipping. Generate evidence for stakeholder sign-off. |
| Security Engineers | Test AI outputs for compliance with security policies and correct regulatory references. |
| Consultants | Audit client AI systems. Produce branded PDF reports as deliverables. |
| Compliance / Legal | Verify AI responses meet industry-specific accuracy standards before deployment. |
| Developers | Integrate via API into CI/CD pipelines. Run regression tests on model updates. |
Quick Start
Your first hallucination test takes under 60 seconds. No setup. No SDK. No integrations required.
GR Rating System
Every test returns a GR (Grounded Reliability) rating — a score from 0 to 100 mapped to five bands. GR is the primary output of Try Grounded AI and the metric your team should use to gate AI deployments.
| RATING | SCORE | LABEL | MEANING |
|---|---|---|---|
| GR-5 | 88–100 | Verified | All checks passed. Safe to deploy. Use with confidence. |
| GR-4 | 76–87 | Reliable | Minor issues only. Deploy with standard monitoring. |
| GR-3 | 60–75 | Conditional | Review flagged findings before shipping. Consider rewriting the prompt. |
| GR-2 | 42–59 | High Risk | Significant hallucination signals. Do not deploy without remediation. |
| GR-1 | 0–41 | Critical | Severe hallucination detected. Block release immediately. |
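As a rough illustration, the band table above can be expressed as a small lookup. The function name `gr_band` is hypothetical, and treating fractional scores by their lower bound (so 87.5 stays in GR-4) is an assumption; the guide only lists whole-number ranges.

```python
# GR bands copied from the table above: (minimum score, rating, label).
GR_BANDS = [
    (88, "GR-5", "Verified"),
    (76, "GR-4", "Reliable"),
    (60, "GR-3", "Conditional"),
    (42, "GR-2", "High Risk"),
    (0, "GR-1", "Critical"),
]


def gr_band(score: float) -> tuple[str, str]:
    """Return (rating, label) for a score in the 0-100 range."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Bands are ordered from highest to lowest, so the first match wins.
    for minimum, rating, label in GR_BANDS:
        if score >= minimum:
            return rating, label
    raise AssertionError("unreachable: the last band starts at 0")


print(gr_band(83))  # ('GR-4', 'Reliable')
print(gr_band(42))  # ('GR-2', 'High Risk')
```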
High-risk domain multiplier
For responses in healthcare, legal, finance, or government contexts, a 0.92× multiplier is applied to the final score after all layer calculations. This raises the bar in industries where hallucinations carry real-world liability. For example, a score of 83 on a healthcare response becomes 76 after the multiplier (83 × 0.92 = 76.36, rounded to 76), which sits exactly on the GR-4 pass threshold.
How Scoring Works
The GR score is a dynamic weighted average of up to 10 detection layers. Weights are not fixed — they adjust automatically depending on which optional layers activate for a given test. Layers that are skipped (not applicable) are excluded entirely, and the remaining weights renormalise to sum to 100%.
Final Score Formula
Layer weights by scenario
The table below shows how weights shift across the three main test configurations:
| LAYER | Standard (no optional inputs) | With Reference Doc | With Structured Data |
|---|---|---|---|
| L1 Consistency | 22% | 19% | 13% |
| L2 Doc Grounding | — | 20% | 12% |
| L3 Confidence | 16% | 14% | 10% |
| L4 Model Agreement | 25% | 22% | 15% |
| L5 Semantic Drift | 7% | 6% | 5% |
| L6 Domain Rules | 18% | 16% | 12% |
| L7 Custom Rules | incl. in L6 | incl. in L6 | incl. in L6 |
| L8 RAG Citation Map | — | incl. in L2 | — |
| L9 Structured Data Fidelity | — | — | 30% |
| L10 Source Attribution | — | — | 12% (when citations detected) |
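The renormalisation described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, the layer scores are made up, and the table values are treated as raw weights that get rescaled (the Standard column as listed adds up to 88, so the sketch divides by the sum of the active weights).

```python
def renormalise(raw_weights: dict[str, float]) -> dict[str, float]:
    """Scale the weights of the active layers so they sum to 1.0."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one active layer with a positive weight is required")
    return {layer: weight / total for layer, weight in raw_weights.items()}


def weighted_average(scores: dict[str, float], raw_weights: dict[str, float]) -> float:
    """Weighted average over the layers that produced a score; skipped layers are left out."""
    active = {layer: raw_weights[layer] for layer in scores}
    weights = renormalise(active)
    return sum(scores[layer] * weights[layer] for layer in scores)


# Standard column of the table above, treated as raw weights (an assumption).
STANDARD = {"L1": 22, "L3": 16, "L4": 25, "L5": 7, "L6": 18}

# Hypothetical layer scores for one test.
scores = {"L1": 90, "L3": 70, "L4": 80, "L5": 100, "L6": 60}
print(round(weighted_average(scores, STANDARD), 1))  # 78.2
```

Dividing by the sum of the active weights means a skipped layer never drags the score down; its share is redistributed proportionally across the layers that did run.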
Post-score penalties
Three adjustments are applied after the weighted average is computed:
| PENALTY | AMOUNT | TRIGGER |
|---|---|---|
| Suspicious claims | −8 pts each, max −24 | L3 Confidence flags a claim as vague, circular, or unverifiable (e.g. "according to studies…"). |
| Fabricated citations | −15 pts each, max −30 | L10 Source Attribution confirms a named citation does not exist or is misattributed. The most severe penalty. |
| High-risk domain multiplier | × 0.92 (−8% effective) | Detected domain is Healthcare, Legal, Finance, or Government. |
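The adjustments in the table above can be sketched in a few lines. The function name is hypothetical, and two details are assumptions rather than documented behaviour: the point penalties are subtracted before the multiplier is applied, and the result is clamped to the 0 to 100 range.

```python
def apply_post_score_adjustments(
    weighted_avg: float,
    suspicious_claims: int = 0,
    fabricated_citations: int = 0,
    high_risk_domain: bool = False,
) -> float:
    """Apply the three post-score adjustments to a weighted average."""
    # -8 points per suspicious claim, capped at -24.
    suspicious_penalty = min(8 * suspicious_claims, 24)
    # -15 points per fabricated citation, capped at -30.
    citation_penalty = min(15 * fabricated_citations, 30)
    score = weighted_avg - suspicious_penalty - citation_penalty
    # Assumed order: the 0.92 multiplier is applied after the point penalties.
    if high_risk_domain:
        score *= 0.92
    return max(0.0, min(100.0, score))


# Weighted average of 90, two suspicious claims, one fabricated citation, finance domain:
# 90 - 16 - 15 = 59, then 59 * 0.92 = 54.28
print(round(apply_post_score_adjustments(90, 2, 1, True), 2))  # 54.28
```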
Deterministic checks (L6 Domain Rules)
The Domain Rules layer runs without any LLM; it relies entirely on deterministic pattern matching. The combined L6 score is the average of three sub-checks: Generic Rules, Domain-specific Rules, and Custom Rules.
| FINDING TYPE | SEVERITY | HOW IT FIRES |
|---|---|---|
| Arithmetic errors | HIGH | Validates A+B=C, A−B=C, A÷B≈C with 2% tolerance. Each error: −25 pts. |
| Count mismatches | HIGH / MED | AI claims "5 reasons" but lists fewer or more. Each mismatch: −10 to −20 pts. |
| List numbering gaps | MED | Numbered list skips an item (1, 2, 4 — missing 3). Penalty: −10 pts. |
| Overconfident language | MED | Flags: definitively, unquestionably, always, never, guaranteed, 100%, it is proven that. Penalty: −5 pts. |
| Temporal claims | MED | Time-sensitive language without a date anchor: "currently", "as of today", "the current CEO". Penalty: −8 pts each. |
| Future date references | HIGH | References years 2030+ as present facts. Penalty: −10 pts each. |
| Unverifiable statistics | LOW | Unsourced percentage claims. Flagged for review — no automatic score penalty. |
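To give a feel for what deterministic pattern matching looks like, here is a minimal sketch of three of the checks in the table above. It is not KiwiQA's rule engine: the function names and regular expressions are illustrative assumptions, and applying the 2% tolerance as a relative difference is one plausible reading of the table.

```python
import re

# Terms listed in the "Overconfident language" row of the table.
OVERCONFIDENT_TERMS = [
    "definitively", "unquestionably", "always", "never",
    "guaranteed", "100%", "it is proven that",
]


def find_overconfident_language(text: str) -> list[str]:
    """Return the listed terms that appear in the text as whole words or phrases."""
    lowered = text.lower()
    found = []
    for term in OVERCONFIDENT_TERMS:
        # Lookarounds stop "never" matching inside "nevertheless" or "100%" inside "1100%".
        pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
        if re.search(pattern, lowered):
            found.append(term)
    return found


def find_numbering_gaps(text: str) -> list[int]:
    """Return item numbers missing from a numbered list such as 1, 2, 4."""
    numbers = [int(n) for n in re.findall(r"^\s*(\d+)[.)]\s", text, flags=re.MULTILINE)]
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in present]


def within_tolerance(expected: float, claimed: float, tolerance: float = 0.02) -> bool:
    """True if the claimed result is within the relative tolerance of the computed one."""
    if expected == 0:
        return claimed == 0
    return abs(claimed - expected) <= tolerance * abs(expected)


print(find_overconfident_language("Our uptime is guaranteed and the service never fails."))
# ['never', 'guaranteed']
print(find_numbering_gaps("1. Plan\n2. Build\n4. Ship"))  # [3]
print(within_tolerance(120 + 35, 160))  # False: 160 is more than 2% away from 155
```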
The 10 Detection Layers
Every test runs up to 10 independent checks. Layers L1–L7 always run. L8–L10 activate only when the relevant optional input is present (reference document, structured data, or detected citations).
Structured Data Fidelity (L9)
The 9th layer lets you attach any CRM export, CSV, or JSON dataset to your Response Audit. Grounded automatically parses every field, detects data types, computes aggregates, and verifies every AI claim against your actual data — no manual field mapping required.
How to use it
- Open Response Audit and fill in Steps 01 and 02.
- Scroll to Step 04 — Attach CRM or structured data and click Enable.
- Upload a CSV or JSON file, or paste the data directly. Max 500 KB.
- Run the audit. The Data Attribution tab shows a field-by-field verification with MATCHED / MISMATCH / INFERRED status for every claim.
Supported formats
| FORMAT | EXAMPLE | MULTI-RECORD? |
|---|---|---|
| HubSpot contacts CSV | First Name, Email, Lifecycle Stage, ... | Yes — all rows |
| Salesforce export | Id, Name, Status, Industry, ... | Yes — all rows |
| Custom CSV | account_name, health_score, sla_breach | Yes — all rows |
| JSON object | {"account": "Acme", "score": 85} | Single record |
| JSON array | [{"account":"Acme"}, {"account":"TechCo"}] | Yes — all rows |
| TSV / pipe-separated | field1 field2 field3 | Yes — all rows |
How Grounded verifies data
A three-tier deterministic engine — no LLM arithmetic:
- Type detection — every field is auto-classified as boolean, numeric, date, enum, or ID based on its values.
- Aggregate computation — sums, averages, counts, and group-bys are computed in code before any LLM call. Results are exact.
- Claim verification — the computed ground truth is passed to the LLM to judge whether the AI's claims match it.
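The first two stages above can be approximated with the standard library alone. The sketch below is an assumption-laden illustration rather than Grounded's parser: it handles only boolean, numeric, and text columns (the platform also detects date, enum, and ID fields), and the names `detect_type` and `summarise` are hypothetical. The sample columns are taken from the Custom CSV row in the formats table.

```python
import csv
import io
from collections import Counter


def detect_type(values: list[str]) -> str:
    """Classify a column from its non-empty, stripped values."""
    if not values:
        return "empty"
    if all(v.lower() in {"true", "false", "yes", "no"} for v in values):
        return "boolean"
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "text"


def summarise(csv_text: str) -> dict:
    """Compute exact per-column aggregates in code, before any model sees the data."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {"row_count": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        values = [(row.get(name) or "").strip() for row in rows]
        values = [v for v in values if v]
        kind = detect_type(values)
        info = {"type": kind}
        if kind == "numeric":
            numbers = [float(v) for v in values]
            info["sum"] = sum(numbers)
            info["average"] = sum(numbers) / len(numbers)
        elif kind in {"boolean", "text"}:
            # Value counts double as a simple group-by.
            info["counts"] = dict(Counter(values))
        summary["columns"][name] = info
    return summary


DATA = """account_name,health_score,sla_breach
Acme,85,false
TechCo,60,true
Globex,75,false
"""

summary = summarise(DATA)
print(summary["row_count"])                      # 3
print(summary["columns"]["health_score"]["sum"])  # 220.0
print(summary["columns"]["sla_breach"]["counts"])  # {'false': 2, 'true': 1}
```

A summary like this is the kind of computed ground truth that the third stage would compare the AI's claims against.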
Running a Response Audit
Use Single Test to validate one AI response at a time. This is the most common workflow for spot-checks, pre-release validation, and testing during development.
Test Suite mode
Test Suite lets you build a named set of test cases and run them as a group. Use this for sprint regression testing or pre-release checklists.
- Switch to the Test Suite tab within New Test.
- Name your suite (e.g. "Sprint 14 — AI Chatbot").
- Add individual test cases with question, response, and optional reference.
- Click Run Suite. Each case runs through the full 10-layer pipeline.
- View a suite summary report with aggregate GR score and per-case breakdown.
Batch Testing
Batch Testing lets you upload a file of AI responses and test all of them in one operation. Use this for regression testing after model updates, auditing large content sets, or testing a full FAQ or knowledge base.
Supported file formats
| FORMAT | REQUIRED COLUMNS | OPTIONAL |
|---|---|---|
| .csv | question, aiResponse | refDoc |
| .json | question, aiResponse | refDoc |
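Before uploading, it can help to check a batch file locally against the required columns listed above. The following is a minimal sketch; the function name `validate_batch_csv` is hypothetical, and treating an empty `question` or `aiResponse` as a problem is an assumption about what the uploader expects.

```python
import csv
import io

REQUIRED_COLUMNS = {"question", "aiResponse"}


def validate_batch_csv(csv_text: str) -> list[str]:
    """Return a list of problems found in a batch CSV; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - header
    if missing:
        return [f"missing required columns: {sorted(missing)}"]
    problems = []
    for record_no, row in enumerate(reader, start=1):
        for column in sorted(REQUIRED_COLUMNS):
            if not (row.get(column) or "").strip():
                problems.append(f"record {record_no}: empty {column}")
    return problems


SAMPLE = '''question,aiResponse,refDoc
"What is the SG rate for 2024-25?","The rate is 11%",""
"What does TGA stand for?","Therapeutic Goods Administration",""
'''

print(validate_batch_csv(SAMPLE))                    # []
print(validate_batch_csv("question,refDoc\nQ1,\n"))  # ["missing required columns: ['aiResponse']"]
```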
CSV example
question,aiResponse,refDoc
"What is the SG rate for 2024-25?","The rate is 11%",""
"What does TGA stand for?","Therapeutic Goods Administration",""
Running a batch
Custom Rule Sets
Custom Rule Sets are the most powerful differentiator in Try Grounded AI. They let you upload your own verified facts, including things only your organisation knows, and have them checked automatically on every test.
What to put in Custom Rules
- Founding year, headquarters, company name variations
- Product pricing, plan limits, feature availability
- Regulatory references specific to your jurisdiction
- Internal policy figures (thresholds, limits, rates)
- Correct names for people, products, and services
File format
| COLUMN | DESCRIPTION | EXAMPLE |
|---|---|---|
| claim | The incorrect statement to watch for | "KiwiQA was founded in 2010" |
| correct_answer | The verified correct fact | "KiwiQA was founded in 2020" |
| source | Where this fact can be verified | "kiwiqa.ai/about" |
Scoring
The Custom Rules sub-score is calculated as max(0, 100 − violations × 20): 0 violations = 100, 1 violation = 80, 2 violations = 60, and 3 or more violations = 40 or below, reaching 0 at 5 violations. Custom Rules contribute to the combined L6 score alongside Domain Rules and Generic Rules.
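The scoring rule is simple enough to check by hand, but a short sketch makes the floor at zero explicit. The function names are hypothetical, and `combined_l6_score` assumes an equal-weight mean with all three sub-checks present, following the description in the Deterministic checks section.

```python
def custom_rules_score(violations: int) -> int:
    """Custom Rules sub-score: 100 minus 20 per violation, floored at 0."""
    if violations < 0:
        raise ValueError("violations cannot be negative")
    return max(0, 100 - violations * 20)


def combined_l6_score(generic: float, domain_specific: float, custom: float) -> float:
    """Combined L6 score as the plain average of the three sub-checks."""
    return (generic + domain_specific + custom) / 3


for violations in range(7):
    print(violations, custom_rules_score(violations))
# 0 100, 1 80, 2 60, 3 40, 4 20, 5 0, 6 0

print(combined_l6_score(100, 70, custom_rules_score(3)))  # 70.0
```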
Test History & Risk Profile
Test History
All tests — single, suite, and batch — are automatically saved to Test History. Use it to track results over time, re-run previous tests, and export reports.
- Search: Filter by question text or batch name.
- Filter: Show only PASS, WARN, FAIL, or batch results.
- Export CSV: Download your full history as a spreadsheet.
- Re-run: Click ↺ Re-run on any row to load it back into the test form.
Risk Profile
The Risk Profile dashboard gives you an aggregate view of your AI reliability over time. It is designed for team leads, QA managers, and compliance officers who need a summary view.
- Overview tab: Score trend bars, detection layer health bars, top finding types, industry risk, and priority action plan — all in one view.
- Detection Layers tab: Full table showing avg score, fail rate, and test count per layer.
- Finding Types tab: Breakdown of deterministic findings (arithmetic errors, count mismatches, etc.) with counts and remediation advice.
- Score Trend tab: Historical score chart with regression/improvement detection.
- Industry tab: Avg GR score broken down by detected domain.
- Worst Tests tab: Bottom 20 tests by score with finding type labels.
- Action Plan tab: Auto-generated P1/P2/P3 recommendations based on your weakest layers and finding patterns.
- Export PDF: Full formatted Risk Profile report ready to share with stakeholders.
- Regression badge: Fires automatically when the last-7-day average drops more than 10 points vs the prior period.
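The regression badge rule can be sketched as a comparison of two averages. This is an illustration under assumptions: the "prior period" is taken to be the seven days immediately before the last seven, a window with no tests never triggers the badge, and the function name is hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def regression_detected(
    results: list[tuple[date, float]],
    today: date,
    window_days: int = 7,
    threshold: float = 10.0,
) -> bool:
    """True if the recent average score fell more than `threshold` points below the prior window."""
    recent_start = today - timedelta(days=window_days)
    prior_start = today - timedelta(days=2 * window_days)
    recent = [score for day, score in results if recent_start < day <= today]
    prior = [score for day, score in results if prior_start < day <= recent_start]
    if not recent or not prior:
        return False  # Not enough history to compare.
    return mean(prior) - mean(recent) > threshold


history = [
    (date(2025, 1, 3), 85), (date(2025, 1, 6), 85),    # prior window, average 85
    (date(2025, 1, 10), 70), (date(2025, 1, 14), 72),  # last 7 days, average 71
]
print(regression_detected(history, today=date(2025, 1, 15)))  # True: a 14-point drop
```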
Team Workspace
The Team feature lets you invite colleagues to share your workspace. All team members share the same run quota, test history, and custom rules — there is no per-seat pricing.
Inviting a colleague
API Access & CI/CD
The Try Grounded AI API lets you run hallucination checks programmatically — from CI/CD pipelines, automated test suites, or your own applications. Available on Team plan and above.
Authentication
Generate an API key in the API Keys tab of the dashboard. Keys use the grnd_ prefix. Pass your key as a Bearer token in the Authorization header.
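For callers who prefer Python to curl, the same authenticated request can be assembled with the standard library. This sketch assumes the endpoint URL and field names used in the curl example in this section; the helper name `build_detect_request` is hypothetical, and the prefix check is a local convenience rather than documented server behaviour.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this guide.
API_URL = "https://grounded-topaz.vercel.app/api/v1/detect"


def build_detect_request(
    api_key: str, question: str, ai_response: str, ref_doc: str = ""
) -> urllib.request.Request:
    """Build a POST request with the key passed as a Bearer token."""
    if not api_key.startswith("grnd_"):
        raise ValueError("API keys are expected to start with 'grnd_'")
    payload = {"question": question, "aiResponse": ai_response, "refDoc": ref_doc}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_detect_request(
    "grnd_your_key_here",
    "What is the SG rate for 2024-25?",
    "The superannuation guarantee rate is 11%.",
)
print(request.get_method(), request.full_url)

# Sending the request needs network access and a valid key:
# with urllib.request.urlopen(request, timeout=60) as response:
#     result = json.load(response)
```

Reading the key from an environment variable or secret store, rather than hard-coding it, keeps it out of source control.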
Run a test via API
curl -X POST https://grounded-topaz.vercel.app/api/v1/detect \
  -H "Authorization: Bearer grnd_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the SG rate for 2024-25?",
    "aiResponse": "The superannuation guarantee rate is 11%.",
    "refDoc": ""
  }'
Response format
{
  "score": 42,
  "grRating": "GR-2",
  "risk": "HIGH",
  "verdict": "FAIL",
  "finding": "Incorrect SG rate — domain rule violation detected.",
  "stages": {
    "consistency": { "score": 38 },
    "grounding": { "score": 45 },
    "confidence": { "score": 61 },
    "modelAgreement": { "score": 33 },
    "semanticDrift": { "score": 80 },
    "deterministic": { "score": 12, "findings": ["SG rate mismatch"] }
  }
}
GitHub Actions example
- name: Run hallucination check
  run: |
    # Replace '{}' with a JSON body containing question, aiResponse, and (optionally) refDoc.
    RESULT=$(curl -s -X POST \
      -H "Authorization: Bearer ${{ secrets.GROUNDED_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d '{}' \
      https://grounded-topaz.vercel.app/api/v1/detect)
    SCORE=$(echo "$RESULT" | jq '.score')
    # Fail the job when the score is below 60, the lower bound of GR-3.
    if [ "$SCORE" -lt 60 ]; then
      echo "GR score $SCORE is below threshold. Blocking deployment."
      exit 1
    fi
Interpreting Results
The InsightReport is the full output of a single test. Here is how to read each section.
The GR score
| GR RATING | RECOMMENDED ACTION |
|---|---|
| GR-5 | Deploy with confidence. All checks passed. |
| GR-4 | Deploy with standard monitoring. Review any flagged layer. |
| GR-3 | Review the Recommended Action. Rewrite the prompt or add constraints. |
| GR-2 | Do not deploy. Address flagged issues and re-test. |
| GR-1 | Block release. Escalate to team lead. Investigate source of hallucination. |
Per-layer scores
Each of the 10 layers returns an individual score. A score below 60 on any layer is a warning signal even if the overall GR rating is acceptable. Layers that were skipped (not applicable for this test) show as N/A.
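When results arrive through the API, the per-layer warning rule can be applied in a few lines. The sketch below assumes the `stages` structure shown in the Response format example and assumes that a skipped layer is either absent or has no numeric score; how N/A layers are actually encoded in the API response is not specified here. The function name is hypothetical.

```python
LAYER_WARNING_THRESHOLD = 60


def weak_layers(result: dict, threshold: int = LAYER_WARNING_THRESHOLD) -> dict[str, float]:
    """Return the layers whose score is below the threshold; layers without a score are ignored."""
    weak = {}
    for name, stage in result.get("stages", {}).items():
        score = stage.get("score")
        if isinstance(score, (int, float)) and score < threshold:
            weak[name] = score
    return weak


# Stage scores copied from the Response format example in this guide.
result = {
    "score": 42,
    "grRating": "GR-2",
    "stages": {
        "consistency": {"score": 38},
        "grounding": {"score": 45},
        "confidence": {"score": 61},
        "modelAgreement": {"score": 33},
        "semanticDrift": {"score": 80},
        "deterministic": {"score": 12, "findings": ["SG rate mismatch"]},
    },
}

print(weak_layers(result))
# {'consistency': 38, 'grounding': 45, 'modelAgreement': 33, 'deterministic': 12}
```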
The Recommended Action
Every test includes a plain-English recommended action explaining what was found and what to do about it. It is designed to be readable by non-engineers and can be included directly in defect reports.
Exporting reports
- PDF export: Timestamped, formatted report suitable for client delivery or compliance documentation.
- Re-run: Load a previous test back into the form to test a revised AI response.
- CSV export: From Test History, export all results as a spreadsheet for analysis.
Frequently Asked Questions
Does Try Grounded AI connect to my AI model?
No. You paste the AI response manually. Try Grounded AI never connects to your AI model, API keys, system prompt, or production environment. Your data stays private.
How accurate is the GR scoring?
The scoring model was calibrated against a 100-case golden dataset with known correct answers. Post-calibration accuracy is approximately 75–80%. Domain Rules and Custom Rules layers are deterministic — they are 100% accurate for the facts you provide.
What is the difference between Domain Rules and Custom Rules?
Domain Rules are pre-built by KiwiQA — patterns for overconfidence, arithmetic errors, temporal claims, and domain-specific checks across 12 industries. Custom Rules are facts you upload yourself. Both run with zero LLM involvement and are deterministic.
Why does the score sometimes seem lower than expected?
Check the post-score penalties: suspicious claims (−8 each), fabricated citations (−15 each), and the 0.92× domain multiplier for Healthcare/Legal/Finance/Government. These apply after the weighted average and can significantly reduce a score that looks acceptable at the layer level.
How many runs do I get?
Free plan: 50 evaluation runs. Starter: 500 runs/month at $29/month. Team: 5,000 runs/month at $149/month. You also earn 10 bonus runs for each successful referral.
Can I use Try Grounded AI for regulated industries?
Yes. Healthcare, legal, finance, and government responses are automatically detected, and a 0.92× risk multiplier is applied. This raises the bar for GR-5 qualification in high-stakes industries.
Does it work with any LLM?
Yes. You paste any text — the source model does not matter. It works with GPT-4o, Claude, Gemini, Llama, Mistral, or any model you are using.
Is my data stored?
Test results (question, score, GR rating, findings) are stored in your account for Test History. The AI response content is not stored beyond the analysis session for individual tests.
Can I integrate it into my CI/CD pipeline?
Yes. API access is available on the Team plan and above. See the API Access section above or visit /api-docs for the full reference.
What is the Team workspace?
Team workspace lets multiple colleagues share one account — pooled run quota, shared test history, and shared custom rules. No per-seat pricing. Available on all plans.
How do I report a bug or get support?
Click Get Help in the dashboard sidebar to open the support chatbot. For urgent issues, email hello@kiwiqa.ai.