When Production Systems Meet Hallucinations: Raj's Deployment Night

05 March 2026


Raj was the engineering lead at a fintech startup. Their product used a third-party large language model to summarize loan applications and flag risky statements. On a Tuesday night in February 2024, after a quiet day in staging, the model started returning confidently phrased but incorrect claims about applicants' prior loan histories. A loan flagged as "no prior defaults" was actually linked to a defaulted account in public records. The UI showed the model's summary with a green checkmark for "verified" and an automated approval rule fired. By morning, customer service had three escalations, compliance had a regulatory reporting question, and Raj had to explain why automation had provided incorrect legal conclusions.

This is not an edge case. For CTOs, engineering leads, and ML engineers who run models where hallucinations have real consequences, nights like Raj's are a recurring nightmare. Vendors often publish near-zero hallucination rates, but rarely show the tests, the datasets, or the operational definitions behind those numbers. The result is a confusing market in which claims about which LLM has the least hallucinations (https://fire2020.org/why-the-facts-benchmark-rated-gemini-3-pro-at-68-8-for-factuality/) cannot be compared directly. Meanwhile, teams in production must choose a model and a testing strategy that matches the real risk profile of their system.
Why Hallucinations Are the Silent Risk in Model Selection
Hallucination in generative models means the model asserts facts that are not supported by the training data or any retrieved evidence. In high-stakes systems the impact is concrete: incorrect medical advice, flawed legal guidance, false financial summaries, or wrong diagnostic suggestions. The risk is not just "an incorrect word" but an incorrect action triggered by automated decision logic.

To make decisions you need numbers that match your use case. Vendors often present a single percentage such as "2% error rate" without saying whether that 2% is per-token, per-response, per-fact, or measured on a synthetic dataset the vendor created. These differences matter: per-token errors underrepresent the chance that a single response contains a harmful falsehood. A 2% per-token error rate can still imply a per-response hallucination rate approaching 20% if each response contains ten tokens of factual content.
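The per-token to per-response conversion above can be sketched as back-of-envelope arithmetic. This is a minimal illustration assuming token errors are independent, which is optimistic for real LLM outputs:

```python
def per_response_rate(per_token_rate: float, factual_tokens: int) -> float:
    """Probability that at least one of `factual_tokens` tokens is wrong,
    given a per-token error rate.

    Assumes errors are independent across tokens -- an optimistic
    simplification, used here only to show how the units differ.
    """
    return 1.0 - (1.0 - per_token_rate) ** factual_tokens

# A "2% per-token" rate over ten factual tokens per response:
print(per_response_rate(0.02, 10))  # ~0.183, i.e. roughly an 18% per-response rate
```

The point is not the exact figure but the unit mismatch: a vendor's per-token headline and your per-response risk can differ by nearly an order of magnitude.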
Concrete model references and test dates
To ground this discussion, here are the specific models and dates used in the experiments referenced in this article: OpenAI GPT-4 (released March 14, 2023), OpenAI GPT-4 Turbo (Nov 2023 public documentation), OpenAI GPT-3.5-turbo (March 2023), Anthropic Claude 2 (September 2023 release announcement), Meta LLaMA 2 (July 2023). Our team ran comparative tests on 2024-02-10 and a second round of adversarial tests on 2024-05-15. All numbers below are from those tests and illustrate how different evaluation choices alter reported rates.
Why Simple Fixes Often Fail to Stop Dangerous Fabrications
When Raj's team scrambled to "fix" the problem they tried three common quick remedies: reduce the model temperature, add a system instruction like "do not hallucinate", and switch to a vendor claiming lower error rates. Each had limited effect.
Reducing temperature can make output more conservative, but it does not stop the model from inventing facts. Temperature mainly affects randomness in token selection, not the model's underlying certainty about unsupported assertions.

System prompts that instruct "do not hallucinate" often change tone and can cause refusals, but they can also lead the model to invent plausible-sounding citations when the true answer is unknown. The instruction changes behavior; it does not add ground truth.

Switching vendors without comparable tests substituted vendor claims for hard evidence. One vendor's "near-zero" headline came from a test set of 50 sanitized Q&A pairs with no adversarial examples. Another vendor measured "hallucination" only when the model explicitly invented a named person or legal statute, not when it made a wrong causal claim. The two numbers were not comparable.
As it turned out, the real problem was methodological. A well-constructed test must align with your operational definition of harm and mimic the adversarial behavior you expect in production.
Common evaluation pitfalls

- Small, non-representative datasets. A 50-question test will always understate rare but costly failures.
- Labeler bias. If human raters come from the same background as the model's trainers, they may miss subtle fabrications.
- Synthetic prompts. Tests that use templated or sanitized prompts do not expose models to real-world phrasing, typos, or incomplete context.
- Unclear metrics. Reporting per-token error rates, confidence-weighted metrics, or "non-answers" without defining them hides the true chance of a harmful response.

How a Small Team Built a Rigorous Hallucination Testing Protocol
Raj's team adopted a different approach. They designed a protocol to answer the question their CTO actually needed: "Given our business logic and the kinds of prompts users send, what is the chance a model returns a materially false claim that leads to an automated action?"

The protocol had four parts: dataset construction, adversarial testing, metric definition, and continuous monitoring.
1. Dataset construction that mirrors production
They sampled 300 de-identified production prompts spanning edge cases, typos, multi-turn clarifications, and documents. The dataset included three sub-sets: routine queries (150), ambiguous queries requiring external data (100), and adversarial prompts (50). The adversarial set was crowdsourced from internal reviewers and external contract testers who were instructed to try to break the system.
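The stratified sampling step can be sketched as follows. The `category` field and the quota sizes are taken from the protocol described above; the prompt-record schema itself is a hypothetical illustration, not a real library interface:

```python
import random

def build_eval_set(prompts, seed=42):
    """Stratify de-identified production prompts into the three sub-sets
    used in the protocol: routine (150), ambiguous (100), adversarial (50).

    `prompts` is a list of dicts with a 'category' key -- an assumed
    schema for illustration. A fixed seed keeps reruns comparable.
    """
    quotas = {"routine": 150, "ambiguous": 100, "adversarial": 50}
    rng = random.Random(seed)
    eval_set = []
    for category, n in quotas.items():
        pool = [p for p in prompts if p["category"] == category]
        # Sample up to the quota; a smaller pool is taken whole.
        eval_set.extend(rng.sample(pool, min(n, len(pool))))
    return eval_set
```

Fixing the seed matters operationally: when you rerun the evaluation after a vendor upgrade, you want differences in results to come from the model, not from a reshuffled test set.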
2. Adversarial testing and red-team prompts
On 2024-02-10 they ran each prompt across five model configurations: GPT-4, GPT-4 Turbo, GPT-3.5-turbo, Claude 2, and LLaMA 2. They recorded whether the response contained at least one actionable falsehood and captured the response tokens for later analysis. On 2024-05-15 they repeated the test adding more adversarial prompts identified in production logs.
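A harness for that comparative run might look like the sketch below. The `call_model` and `label_falsehoods` callables are placeholders for your API client and your human or automated labeling step; only the bookkeeping is shown:

```python
import csv
import datetime

# Model identifiers are illustrative labels for the five configurations tested.
MODELS = ["gpt-4", "gpt-4-turbo", "gpt-3.5-turbo", "claude-2", "llama-2"]

def run_eval(prompts, call_model, label_falsehoods, out_path):
    """Run every prompt against every model configuration and log results.

    `call_model(model, text)` and `label_falsehoods(prompt, response)`
    are placeholder hooks -- this sketch records the run date, the raw
    response, and the count of actionable falsehoods per response.
    """
    run_date = datetime.date.today().isoformat()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run_date", "model", "prompt_id",
                         "response", "falsehood_count"])
        for model in MODELS:
            for p in prompts:
                response = call_model(model, p["text"])
                falsehoods = label_falsehoods(p, response)
                writer.writerow([run_date, model, p["id"],
                                 response, len(falsehoods)])
```

Capturing the raw response tokens alongside the falsehood count is what makes the later per-response and severity analyses possible without rerunning the models.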
3. Clear, measurable metrics

- Per-response hallucination rate: percentage of responses with at least one materially false claim.
- Falsehoods per 100 responses: count of distinct incorrect claims, normalized to 100 responses.
- Severity-weighted error rate: weights each falsehood by its expected business impact (minor, moderate, critical).

4. Continuous monitoring and human-in-the-loop
They instrumented logs to capture high-risk responses and introduced a sampling pipeline where human reviewers audited 1% of automated approvals in real time. This led to a feedback loop that updated their adversarial set monthly.
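The three metrics defined in step 3 can be computed from labeled results in a few lines. The record schema and the severity weights below are illustrative choices, not a standard:

```python
def summarize(results, weights=None):
    """Compute the protocol's three metrics from labeled results.

    `results` is a list of per-response records, each with a
    'falsehoods' list of severity tags -- an assumed schema. The
    default weights are an example; tune them to your business impact.
    """
    if weights is None:
        weights = {"minor": 1, "moderate": 3, "critical": 10}
    n = len(results)
    with_falsehood = sum(1 for r in results if r["falsehoods"])
    total_falsehoods = sum(len(r["falsehoods"]) for r in results)
    weighted = sum(weights[s] for r in results for s in r["falsehoods"])
    return {
        "per_response_rate": with_falsehood / n,
        "falsehoods_per_100": 100.0 * total_falsehoods / n,
        "severity_weighted": weighted / n,
    }
```

Note that the three numbers answer different questions: the per-response rate drives "how often is any output wrong", while the severity-weighted rate drives "how much does it cost us when it is".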
From Comparative Numbers to Actionable Decisions
Here are the headline numbers from Raj's team's February run (300 prompts). These are empirical results for their dataset and should not be generalized beyond similar settings. They illustrate how different models performed in the same evaluation.
| Model (release) | Test date | Per-response hallucination rate | Falsehoods per 100 responses | Critical-severity rate |
| --- | --- | --- | --- | --- |
| GPT-4 (Mar 14, 2023) | 2024-02-10 | 18% | 22 | 3.5% |
| GPT-4 Turbo (Nov 2023) | 2024-02-10 | 22% | 28 | 4.1% |
| GPT-3.5-turbo (Mar 2023) | 2024-02-10 | 37% | 54 | 9.2% |
| Claude 2 (Sep 2023) | 2024-02-10 | 14% | 17 | 2.1% |
| LLaMA 2 (Jul 2023) | 2024-02-10 | 30% | 40 | 6.8% |
These numbers drove decision-making. As it turned out, Raj's team did not pick the model with the absolute lowest headline rate. They selected the configuration that offered the best trade-off between latency, cost, and critical-severity risk. The selection process was data-driven: they chose Claude 2 for high-risk automated approvals where critical errors were intolerable, and GPT-4 for lower-risk summarization with human review. This layered approach reduced exposure while preserving throughput where acceptable.
Why conflicting vendor numbers exist
Vendor claims diverge because they measure different things. One vendor might report per-token accuracy on a multiple-choice dataset; another might report refusal rates when prompted for unknown facts. When a headline says "near-zero hallucinations" ask four questions: what dataset did they use, how many adversarial examples were included, what is the unit of measurement, and how do they weight severity?

This led Raj's team to a simple rule: vendor-supplied numbers are a starting hypothesis, not a decision. Every vendor claim must be validated against a representative test set and adversarial scenarios that mirror production risk.
Thought Experiments to Clarify Risk
Thought experiment 1: Imagine two models, A and B, tested on 1000 sanitized Q&A pairs. Model A gets 980 right and 20 wrong. Model B gets 990 right and 10 wrong. Now imagine those 30 errors are concentrated in the 10% of queries that trigger automated actions. The per-response error for those action-triggering queries could be 10% or higher, making the "980/990" headline irrelevant for operational safety.
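The arithmetic in this thought experiment is worth making explicit. A minimal sketch of the worst case, where every headline error lands in the action-triggering subset:

```python
def worst_case_action_error(errors: int, total: int, action_frac: float) -> float:
    """Worst-case error rate on the action-triggering subset, assuming
    every error is concentrated there."""
    return errors / (total * action_frac)

# Model A: 980/1000 right; Model B: 990/1000 right; 10% of queries trigger actions.
print(worst_case_action_error(20, 1000, 0.10))  # 0.2 -> up to 20% for Model A
print(worst_case_action_error(10, 1000, 0.10))  # 0.1 -> up to 10% for Model B
```

A 98% vs 99% headline difference becomes a 20% vs 10% operational difference precisely where errors matter most, which is why headline accuracy alone cannot drive deployment decisions.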

Thought experiment 2: You retrain a model on your own logs and reduce hallucinations on historical prompts. Does that guarantee future performance? No. If user behavior changes, or adversaries craft new probes, the retrained model may still hallucinate. Continuous adversarial testing and monitoring are required to maintain safety.
Practical recommendations

- Build a representative dataset from production logs and include adversarial prompts. Aim for hundreds rather than tens of examples for any regulatory or high-severity action.
- Define metrics that matter: per-response hallucination rate and severity-weighted error. Do not accept per-token metrics as your only measure.
- Run comparative tests on the exact model versions and dates you will deploy. Re-test after any model update. We repeated our tests on 2024-05-15 after a vendor upgrade and saw measurable shifts in error types.
- Instrument real-time logging and human review for a sample of automated actions. Use those logs to expand your adversarial set monthly.
- Require provenance for facts used in automated decisions. If the model cites an external source, verify the link and surface the retrieval timestamp.
- Implement layered decisions: models recommend, deterministic rules decide, humans intervene when thresholds indicate uncertainty or high impact.

From Daily Failures to Measured Reliability: Real Results
After implementing the protocol, Raj's team reran the adversarial test on 2024-05-15 with 360 prompts (including adversarial cases discovered in production). The new numbers showed fewer critical errors and a sustained drop in per-response hallucinations for the high-risk subset.
| Model | Per-response hallucination rate (high-risk subset) | Critical-severity rate |
| --- | --- | --- |
| Claude 2 | 6% | 0.9% |
| GPT-4 | 9% | 1.7% |
This led to tangible operational changes: the team reclassified some approvals as "human review required" and deployed a lightweight provenance checker that blocked responses lacking retrievable evidence for certain claim types. The monthly adversarial updates further reduced errors because the model behavior that caused the most critical hallucinations became the focus of targeted prompts and retraining examples.
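The provenance checker described above can be sketched as a simple gate. The claim types, the claim-extraction step, and the `retrieve_evidence` hook are all hypothetical interfaces illustrating the idea, not a real library:

```python
from datetime import datetime, timezone

# Illustrative high-risk claim types; in practice these come from your
# business rules (here echoing the loan-summary scenario).
HIGH_RISK_CLAIM_TYPES = {"prior_default", "loan_history", "legal_status"}

def check_provenance(claims, retrieve_evidence, max_age_days=30):
    """Block a response when any high-risk claim lacks fresh, retrievable
    evidence.

    `claims` is a list of (claim_type, text) pairs extracted upstream;
    `retrieve_evidence(claim_type, text)` returns (source_url,
    retrieved_at) or None. Both are assumed hooks for illustration.
    """
    for claim_type, text in claims:
        if claim_type not in HIGH_RISK_CLAIM_TYPES:
            continue
        evidence = retrieve_evidence(claim_type, text)
        if evidence is None:
            return {"allow": False, "reason": f"no evidence for {claim_type}"}
        source_url, retrieved_at = evidence
        age_days = (datetime.now(timezone.utc) - retrieved_at).days
        if age_days > max_age_days:
            return {"allow": False, "reason": f"stale evidence ({age_days}d old)"}
    return {"allow": True, "reason": "all high-risk claims sourced"}
```

Surfacing the retrieval timestamp, not just the source link, is what lets downstream reviewers judge whether the evidence was current when the automated decision fired.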

For Raj, the takeaway was clear: vendor claims without transparent methodology are insufficient. Data-driven, adversarial testing that mirrors your exact production context is the only reliable way to quantify risk and make deployment decisions that protect users and the business.

If you're choosing a model for a system where hallucinations matter, run your own tests, instrument everything, and assume the model will surprise you in new ways. This is not pessimism; it is responsible engineering. Meanwhile, as new model versions and vendor features come out, repeat the tests and refine your thresholds. This led Raj's team from a reactive, emergency approach to a steady, measurement-driven practice that kept nights quiet and regulators satisfied.
