This is a hands-on, data-first tutorial for engineers and researchers who want to test bold claims like "Gemini 2.0 Flash achieved 0.7% hallucination on basic summarization" or "Gemini 3.1 Pro cut hallucinations by 38 points." You will get a reproducible evaluation plan, a step-by-step execution roadmap, and diagnostic checks to explain why published numbers often disagree. I assume you care about exact model versions, test dates, and measurement bias rather than vendor marketing language.
What You Will Confirm in 30 Days: Concrete Outcomes for This Replication
In one month you will be able to:
- Re-run a controlled comparison between two model versions (for example, Gemini 2.0 Flash and Gemini 3.1 Pro) using the same prompts, decoding settings, and human evaluation rubric.
- Produce a transparent hallucination rate with confidence intervals and inter-annotator agreement numbers.
- Decide whether a reported 38-point reduction is reproducible under strict controls, and identify the technical levers most likely to produce the difference.
- Document methodological differences that explain conflicting headline numbers so you can make an evidence-based procurement or research decision.
Before You Start: Data, Models, and Tools Required to Test LLM Hallucinations
What do you need to run a meaningful replication? Here is a minimal kit and some optional extras that matter a lot for results.
Required
- Access to the specific model checkpoints you want to compare (model name and timestamp recorded). If vendor access is restricted, record the API version and model tag exactly.
- A clear definition of "hallucination" for your task. Is it any unsupported factual claim, or only verifiable falsehoods? Write the rubric before you run any examples.
- A dataset of evaluation prompts that match the task. For "basic summarization" use source texts and gold summaries; for factual QA use source documents and an answer key.
- An annotation platform or spreadsheet, plus at least two human raters per example for agreement measurement.
- Code to call models with deterministic settings: record temperature, top_p, top_k, max tokens, and seeds.
Highly recommended
- Versioned prompt templates and system messages saved in a repo (so prompts don't drift).
- Logging of API responses with full metadata: timestamps, model version, request parameters.
- Statistical tools for proportion tests and bootstrap CIs (Python scipy/statsmodels or R).
- A budget for at least 200 labeled examples per model if you want robust, publication-quality claims.
Tools and resources checklist
| Purpose | Suggested tools |
| --- | --- |
| Model calls and orchestration | HTTP client + vendor SDK, or LangChain-style orchestrator |
| Annotation | Label Studio, Prodigy, Google Sheets |
| Stats and plots | Python (pandas, scipy, statsmodels), R |
| Reproducibility | Git for prompts and config, environment.yml |

Your LLM Evaluation Roadmap: 9 Steps from Dataset to Statistical Significance
Follow these steps in order to avoid common methodological traps that inflate or suppress hallucination rates.

Define the measurement unit
Is the metric "percentage of outputs with at least one hallucinated fact" or "percentage of facts that are hallucinated"? Pick one and stick with it. For comparisons across models, per-output is easier and less noisy.
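The distinction is easy to see in code. A minimal sketch, assuming an illustrative annotation schema (the `total_facts`/`bad_facts` field names are invented here, not a standard):

```python
# Hypothetical annotation records: per output, how many facts were checked
# and how many were hallucinated. Field names are illustrative.
annotations = [
    {"total_facts": 5, "bad_facts": 0},
    {"total_facts": 4, "bad_facts": 1},
    {"total_facts": 6, "bad_facts": 2},
]

# Per-output rate: fraction of outputs containing at least one hallucinated fact.
per_output = sum(a["bad_facts"] > 0 for a in annotations) / len(annotations)

# Per-fact rate: fraction of all annotated facts that are hallucinated.
per_fact = sum(a["bad_facts"] for a in annotations) / sum(a["total_facts"] for a in annotations)

print(f"per-output: {per_output:.2f}, per-fact: {per_fact:.2f}")
```

Note how the two metrics diverge on the same data (67% vs 20% here), which is one reason headline numbers from different reports are not directly comparable.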
Create a matched test set
Use the same prompts and source material for both models. If the vendor claims used proprietary data, note that and try to reconstruct comparable prompts. How many examples? If you expect a large effect (30+ points), you can detect it with fewer examples, but annotation noise still argues for 200+ cases.
Lock the prompt and decoding settings
Set temperature, top_p, and sampling seeds. If the published claim doesn't report these, test multiple settings. Ask: did the published experiment use temperature 0.0 (deterministic) or higher sampling? Logging matters because sampling increases variability and usually raises hallucination rates.
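A minimal logging sketch under these assumptions: the parameter names mirror common vendor APIs, but your SDK's exact fields may differ, and `log_call` is a hypothetical helper, not part of any vendor library.

```python
import json
import time

# Illustrative decoding config; record whatever your vendor API actually accepts.
DECODING = {"temperature": 0.0, "top_p": 1.0, "top_k": 1, "max_tokens": 512, "seed": 42}

def log_call(model_tag: str, prompt: str, response_text: str, path: str = "calls.jsonl"):
    """Append one fully parameterized record per API call for reproducibility."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model_tag": model_tag,   # the exact model string the API reports serving
        "params": DECODING,
        "prompt": prompt,
        "response": response_text,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSONL keeps every call auditable even if a later run crashes mid-way.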
Run a pilot
Collect 30-50 examples per model and label them. Compute preliminary hallucination rates and inter-annotator agreement. If agreement is low, refine the rubric before scaling.
Scale to target sample size
Decide your target sample size with a power calculation. Example: to detect a 38-point drop from 45% to 7% with alpha=0.05 and power=0.8, you only need about 20 examples per group in ideal conditions. In practice, set 200+ examples to account for annotation noise and dataset heterogeneity.
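The "about 20" figure comes from the classical normal-approximation formula for a two-proportion test, which can be sketched like this (scipy is assumed only for the z-quantiles):

```python
from math import sqrt, ceil
from scipy.stats import norm

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Classical normal-approximation sample size per group for a two-proportion test."""
    z_a = norm.ppf(1 - alpha / 2)          # critical value for two-sided alpha
    z_b = norm.ppf(power)                  # critical value for desired power
    p_bar = (p1 + p2) / 2                  # pooled proportion under H0
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_group(0.45, 0.07))  # -> 20 per group in ideal conditions
```

This is the theoretical floor; annotation noise and dataset heterogeneity are why the practical recommendation stays at 200+.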
Blind annotation
Mask which model generated each output. If you must provide model metadata to annotators, randomize the order of examples so bias is minimized.
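One way to produce a blinded annotation sheet plus a separate unblinding key; the file names and two-column schema here are illustrative, not a standard format:

```python
import csv
import random

# Hypothetical outputs from two models, keyed by model name.
outputs = {"model_A": ["summary a1", "summary a2"], "model_B": ["summary b1", "summary b2"]}

rng = random.Random(42)  # fixed seed so the blinding assignment is reproducible
rows = [(model, i, text) for model, texts in outputs.items() for i, text in enumerate(texts)]
rng.shuffle(rows)

# Annotators see only blind_id and text; the key file maps ids back to models.
with open("annotation_sheet.csv", "w", newline="") as sheet, \
     open("blinding_key.csv", "w", newline="") as key:
    sheet_w, key_w = csv.writer(sheet), csv.writer(key)
    sheet_w.writerow(["blind_id", "text"])
    key_w.writerow(["blind_id", "model", "example_idx"])
    for blind_id, (model, idx, text) in enumerate(rows):
        sheet_w.writerow([blind_id, text])
        key_w.writerow([blind_id, model, idx])
```

Keeping the key in a separate file that annotators never open is the simplest blinding discipline to enforce.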
Compute metrics and confidence intervals
Report point estimates, 95% confidence intervals (bootstrap or exact binomial), and a two-proportion test p-value. Also report Cohen's kappa and raw disagreement rates.
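The proportion statistics are a few lines with statsmodels; the counts below are illustrative (45% vs 7% on 200 examples each, matching the running example):

```python
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

# Illustrative counts: hallucinated outputs out of n labeled examples per model.
x = [90, 14]    # model A: 90/200 = 45%; model B: 14/200 = 7%
n = [200, 200]

for name, xi, ni in zip(["A", "B"], x, n):
    lo, hi = proportion_confint(xi, ni, alpha=0.05, method="wilson")
    print(f"model {name}: {xi/ni:.1%} (95% Wilson CI {lo:.1%}-{hi:.1%})")

# Two-proportion z-test for the difference between the models.
stat, pval = proportions_ztest(x, n)
print(f"two-proportion z = {stat:.2f}, p = {pval:.2e}")
```

Report all three numbers (point estimate, interval, p-value) per model pair; the intervals are what let a reader judge whether two headline rates genuinely differ.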
Run sensitivity analyses
Recompute results under alternative definitions of hallucination and show how the headline number moves. Did Gemini 3.1 Pro's 38-point improvement depend on a narrow definition? Show that explicitly.
Document everything
Save prompts, instructions, random seeds, and the full annotation spreadsheet. Publish analysis scripts so others can reproduce your pipeline.
Avoid These 7 Evaluation Mistakes That Inflate Hallucination Gains
What commonly causes inflated claims like "0.7% hallucination"? Ask these questions before you accept any headline.
- Was the dataset cherry-picked? If examples came from sources the model was fine-tuned on, hallucination will be artificially low.
- Was the metric too narrow? Some reports count only major factual errors and ignore small inaccuracies that matter for downstream use.
- Were prompts or system messages engineered differently across models? Even a single extra constraint can suppress hallucinations.
- Was sampling turned off? Deterministic decoding (temperature 0) reduces variability and can lower hallucination rates compared with sampled outputs.
- Were annotations single-rater and unblinded? That introduces confirmation bias in favor of the new model.
- Were outputs post-processed or filtered? Removing uncertain outputs before scoring inflates apparent accuracy.
- Was the reported measure averaged across heterogeneous tasks? Aggregation can hide task-specific weaknesses.
Pro Evaluation Strategies: Fine-grained Metrics, Calibration, and Counterfactuals
Once you have the baseline replication, these techniques dig deeper into how and why the model changed.
Decompose hallucination types
Separate hallucinations into categories: invented entities, wrong dates, incorrect numbers, or partial hallucinations (mixing true and false). Which class drove the 38-point improvement? For example, improvements on named-entity hallucinations suggest better retrieval or grounding layers, while number errors point to calibration or training data differences.
Measure calibration and confidence
Ask models to provide confidence scores or use logit-based proxies. Does the newer model abstain more or express uncertainty instead of fabricating facts? Track coverage: did the model decline more often, and are declined cases excluded from the reported metric?
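A small sketch of why coverage matters: excluding abstentions changes the headline rate. The records below are invented for illustration.

```python
# Hypothetical scored outputs: whether the model abstained, and whether the
# answer (when given) contained a hallucination.
outputs = [
    {"abstained": False, "hallucinated": True},
    {"abstained": True,  "hallucinated": False},
    {"abstained": False, "hallucinated": False},
    {"abstained": False, "hallucinated": False},
]

answered = [o for o in outputs if not o["abstained"]]
coverage = len(answered) / len(outputs)
rate_answered = sum(o["hallucinated"] for o in answered) / len(answered)
rate_all = sum(o["hallucinated"] for o in outputs) / len(outputs)
print(f"coverage {coverage:.0%}, rate on answered {rate_answered:.1%}, rate overall {rate_all:.1%}")
```

Always report both the answered-only rate and coverage; a model that abstains aggressively can look dramatically better on the former alone.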
Run ablations
Change only one variable at a time: decoding settings, prompt phrasing, retrieval augmentation on/off. Which change reduces hallucinations most? If adding a retrieval step reproduces the 38-point drop, that is a clear mechanism.
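A helper that enumerates one-variable-at-a-time configurations might look like the sketch below; the baseline values and lever names are illustrative, and each yielded config would be fed to your own generate-and-score pipeline.

```python
# One-variable-at-a-time ablation grid. Baseline and levers are illustrative.
baseline = {"temperature": 0.0, "prompt": "v1", "retrieval": False}
levers = {"temperature": [0.0, 0.7], "prompt": ["v1", "v2"], "retrieval": [False, True]}

def ablation_configs(baseline: dict, levers: dict):
    """Yield (changed_lever, config) pairs differing from baseline in exactly one lever."""
    for key, values in levers.items():
        for value in values:
            if value != baseline[key]:
                cfg = dict(baseline)
                cfg[key] = value
                yield key, cfg

for lever, cfg in ablation_configs(baseline, levers):
    print(lever, cfg)
```

Because each config differs from the baseline in exactly one lever, any change in hallucination rate is attributable to that lever alone.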
Counterfactual testing
Create perturbations: swap names, dates, or locations in source text. Does the model still hallucinate the same way? Robust reductions should generalize across perturbations, not just to shared training data.
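A minimal entity-swap perturbation can be sketched with a regex substitution; the substitution table here is invented, and a real study would swap entities programmatically per document.

```python
import re

# Illustrative substitution table: entity in source text -> counterfactual entity.
SWAPS = {"Paris": "Lyon", "2019": "2021", "Alice": "Priya"}

def perturb(text: str) -> str:
    """Swap known entities so memorized training text no longer matches the source."""
    pattern = re.compile("|".join(re.escape(k) for k in SWAPS))
    return pattern.sub(lambda m: SWAPS[m.group(0)], text)

print(perturb("Alice visited Paris in 2019."))  # -> "Priya visited Lyon in 2021."
```

If a model's summaries remain faithful to the perturbed source rather than reverting to the original entities, the hallucination reduction is grounding, not memorization.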
Longitudinal checks
Test model behavior across multiple dates. A drop reported on a single date could be due to a specific model snapshot or an evaluation bug. Record the date and time of each API call.
When Replication Fails: Diagnosing Discrepancies Between Published and Observed Rates
Not reproducing a 0.7% hallucination rate or a 38-point drop is common. Here are diagnostic questions and quick fixes.
- Did you use the same exact prompts and model tag? Slight prompt phrasing changes can have outsized effects. Fix: run the original public prompts verbatim if available.
- Were you comparing deterministic vs sampled outputs? Fix: match decoding settings exactly and re-run.
- Was the published evaluation unblinded or single-rater? Fix: collect blinded, multi-rater annotations for a fair comparison.
- Did the published metric exclude "abstains" or truncated outputs? Fix: adopt the same inclusion criteria or report both inclusive and exclusive metrics.
- Is the dataset representative? If the claim used in-distribution examples, your out-of-distribution test will show worse performance. Fix: stratify by in- vs out-of-distribution and report both.
- Could annotation instructions differ subtly? Fix: align rubrics, provide examples, and re-train raters until kappa > 0.6.
When numbers disagree, explain why
Publish a reproducibility table that lists each methodological difference and an estimate of its likely effect size. For example:
| Difference | Effect on hallucination rate |
| --- | --- |
| Deterministic decoding vs sampling | Est. -5 to -15 percentage points |
| Single-rater vs 3 raters (majority vote) | Variable; may inflate or deflate by a few points |
| Data overlap with training set | Large; can reduce measured hallucination by 10-40 points |

Label which differences are plausible explanations and which require further experiments. That is more valuable than a single headline.
Examples, Calculations, and Quick Reference
Sample size example
How many examples to detect a 38-point drop? Using a standard two-proportion power calculation with alpha=0.05 and power=0.8, if Model A has 45% hallucination and Model B has 7% (difference = 0.38), the theoretical sample size per group is small - on the order of 20 - because the effect is large. Practical advice: use at least 200 examples per model to account for annotation noise and population heterogeneity.
Confidence interval for a proportion
For a measured hallucination rate p_hat = x/n, compute a 95% Wilson interval or bootstrap it. Always report the interval; a single point estimate without uncertainty is meaningless for model selection.
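A percentile-bootstrap sketch for the interval, using only the standard library; the 14-of-200 labels match the 7% example elsewhere in this guide:

```python
import random

def bootstrap_ci(labels, n_boot=10_000, seed=0):
    """Percentile bootstrap 95% CI for a rate computed from 0/1 labels."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(labels)
    rates = sorted(sum(rng.choices(labels, k=n)) / n for _ in range(n_boot))
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

lo, hi = bootstrap_ci([1] * 14 + [0] * 186)   # 7% observed rate, n = 200
print(f"95% bootstrap CI: {lo:.1%}-{hi:.1%}")
```

For a simple binomial proportion the Wilson interval is cheaper and nearly identical; the bootstrap earns its keep once you move to per-fact or stratified metrics where no closed form exists.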

Inter-annotator agreement
Compute Cohen's kappa or Krippendorff's alpha. If kappa < 0.4, your rubric is unclear. If kappa > 0.7, you can be reasonably confident in the labels.
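Cohen's kappa for two raters giving binary hallucination labels is short enough to compute directly; the rater labels below are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled independently at their base rates.
    expected = sum(counts_a[c] * counts_b[c] for c in set(rater_a) | set(rater_b)) / n ** 2
    return (observed - expected) / (1 - expected)

a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]   # rater A: 1 = hallucinated
b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]   # rater B
print(round(cohens_kappa(a, b), 2))  # -> 0.58
```

Note that 80% raw agreement here yields kappa of only 0.58, because both raters label "no hallucination" most of the time; this is why raw agreement alone overstates label quality.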
Final Checklist Before You Publish Your Comparison
- Did you lock and record model versions and timestamps?
- Are decoding parameters saved and matched across models?
- Are prompts and system messages saved in version control?
- Are inclusion/exclusion criteria for outputs explicit?
- Did you blind annotators and have multiple raters per item?
- Do you report point estimates, CIs, and inter-annotator agreement?
- Did you run sensitivity analyses for major methodological choices?
Will you reproduce every vendor claim? Maybe not. But using this practical evaluation pipeline you can convert headline numbers into testable hypotheses. If a vendor claims 0.7% hallucination or a 38-point improvement, your job is to check the assumptions: dataset overlap, prompt engineering, decoding settings, and annotation procedures. When results differ, you can point to specific methodological levers that explain the gap rather than accepting or rejecting the claim outright.
If you want, I can generate: (1) a template annotation rubric for basic summarization hallucinations, (2) a set of 200 example prompts to test, or (3) code snippets for the statistical tests and bootstrap confidence intervals. Which would you like next?