A confidence threshold is a numerical cutoff that determines when an AI system should act on its own versus defer to a human. In customer support, it controls whether an AI reply is sent directly to a customer or held for agent review. Getting this number right is one of the most important tuning decisions when deploying AI in a support team.
How AI Confidence Scores Work
When an AI generates a reply, it does not produce just the text — it also produces a score reflecting how certain the model is that its answer is correct and appropriate. In customer support systems, this score is typically normalised to a 0–100 range.
The factors that affect the score vary by system, but typically include:
- How closely the customer's query matched information in the knowledge base
- How many relevant articles or data sources were found
- How specific and unambiguous the query was
- Whether the AI had to infer context versus find an explicit answer
- Whether external data (such as order information) was successfully retrieved
A query like "What are your shipping times to London?" against a knowledge base containing a "Delivery Times" article with a clear answer might score 91. A query like "I am a bit confused and not sure what to do about my order" might score 34, because it is too vague, allows too many possible responses, and gives too little context.
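As a rough illustration of how such factors could be folded into one number, here is a minimal sketch that takes a weighted average of five signals and scales it to 0–100. The `QuerySignals` fields, the weights, and the `confidence_score` function are all hypothetical; real systems compute and weight these signals in their own ways.

```python
from dataclasses import dataclass


@dataclass
class QuerySignals:
    """Hypothetical per-query signals, each already scaled to 0.0-1.0."""
    kb_match: float          # how closely the query matched knowledge base content
    source_coverage: float   # how many relevant sources were found, as a share
    specificity: float       # how specific and unambiguous the query was
    explicit_answer: float   # 1.0 if an explicit answer was found, lower if inferred
    external_data_ok: float  # 1.0 if order data was retrieved, 0.0 if the lookup failed


# Illustrative weights only (an assumption); they sum to 1.0.
WEIGHTS = {
    "kb_match": 0.35,
    "source_coverage": 0.15,
    "specificity": 0.20,
    "explicit_answer": 0.20,
    "external_data_ok": 0.10,
}


def confidence_score(signals: QuerySignals) -> int:
    """Combine the signals into a single 0-100 score using a weighted average."""
    total = sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())
    clamped = min(max(total, 0.0), 1.0)  # guard against out-of-range inputs
    return round(clamped * 100)
```

With weights like these, a specific question that matches a clear article lands near the top of the range, while a vague question with a failed order lookup lands near the bottom.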
What Is a Confidence Threshold?
The confidence threshold is the score at or above which the AI acts autonomously. If you set your threshold at 80:
- Score of 80 or above: AI sends the reply directly to the customer
- Score below 80: AI drafts the reply and holds it for human review
This is sometimes called the auto-send threshold or autonomy threshold. The concept appears in any AI system that must decide between acting and asking for help — medical diagnosis tools, fraud detection systems, autonomous vehicles, and content moderation platforms all use equivalent mechanisms.
In customer support, the threshold gives you precise control over the balance between efficiency (fewer human touches) and accuracy (fewer mistakes reaching customers).
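The decision rule itself is small enough to sketch in a few lines. The function name `route_reply` and the two return labels are assumptions made for illustration; the comparison follows the "80 or above" rule described in this section.

```python
def route_reply(score: float, threshold: float = 80) -> str:
    """Return "auto_send" when the score meets the threshold, else "hold_for_review"."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    # A score exactly at the threshold counts as meeting it.
    return "auto_send" if score >= threshold else "hold_for_review"
```

For example, `route_reply(91)` sends the reply directly, while `route_reply(34)` holds the draft for an agent.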
Why the Confidence Threshold Matters More Than the AI Model
Many teams focus on which AI model to use. The threshold typically matters more for day-to-day accuracy in customer support.
| Configuration | Threshold | Practical Result |
|---|---|---|
| Excellent AI, threshold set too high | 95 | AI almost never auto-sends; team still reviews everything manually |
| Good AI, threshold well-calibrated | 78 | AI auto-sends 72% of tickets correctly; team handles the remaining 28% (edge cases) |
| Good AI, threshold set too low | 40 | AI sends many uncertain replies; errors reach customers; trust erodes |
The middle configuration — a well-calibrated threshold on a good (not perfect) AI — produces the best operational outcome. Over-trusting the AI by setting a low threshold damages customer experience. Excessive caution by setting a very high threshold defeats the purpose of automation entirely.
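The trade-off can be measured directly if you have a set of AI replies that humans have already rated. The sketch below assumes each reply is a `(score, is_correct)` pair and reports, for a candidate threshold, the share of replies that would auto-send and the error rate among them. The function name and data shape are hypothetical.

```python
def evaluate_threshold(rated_replies, threshold):
    """Return (auto_send_rate, error_rate_among_auto_sent) for one threshold.

    rated_replies: list of (score, is_correct) pairs rated by humans.
    """
    total = len(rated_replies)
    if total == 0:
        raise ValueError("need at least one rated reply")
    # Replies at or above the threshold would have been sent without review.
    auto_sent = [is_correct for score, is_correct in rated_replies if score >= threshold]
    auto_send_rate = len(auto_sent) / total
    errors_sent = sum(1 for is_correct in auto_sent if not is_correct)
    error_rate = errors_sent / len(auto_sent) if auto_sent else 0.0
    return auto_send_rate, error_rate
```

Running this for several thresholds on the same data shows the pattern in the table: a very high threshold sends almost nothing, and a very low one sends more replies but lets more errors through.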
How to Find Your Optimal Confidence Threshold
There is no universal correct answer. The right threshold depends on several factors:
1. Your brand risk tolerance. A healthcare SaaS company where a wrong reply could cause harm should run a higher threshold than a streetwear brand. When the cost of an incorrect AI reply is high (financial, legal, or reputational), set a higher threshold.
2. Your ticket mix. Simple, factual questions such as order tracking and opening hours are well-suited to lower thresholds. Complex complaints and escalations need higher thresholds.
3. Your knowledge base quality. A sparse knowledge base leads to lower confidence scores across the board. Improving the knowledge base raises scores without touching the threshold at all.
The calibration process:
Start at threshold 80. Run the AI in draft-only mode for one week — all replies are drafted but none are sent automatically. Review the drafts and rate them: correct and appropriate, correct but tone off, or incorrect.
After one week, sort your drafts by AI confidence score. Find the score band where incorrect replies start appearing consistently. Set your auto-send threshold five points above that band.
Most well-configured systems settle between 72 and 85 for general support queries.
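The calibration steps above can be sketched as a small script. Several details here are assumptions that the process leaves open: scores are grouped into 5-point bands, a band counts as "consistently" wrong when more than 10% of its drafts were rated incorrect, bands with fewer than five drafts are ignored, and "five points above that band" is read as the top edge of the band plus five. The function name `suggest_threshold` is hypothetical.

```python
from collections import defaultdict

BAND_WIDTH = 5            # assumption: group scores into 5-point bands
MAX_ERROR_RATE = 0.10     # assumption: above 10% incorrect counts as "consistently" wrong
MIN_DRAFTS_PER_BAND = 5   # assumption: skip bands with too few drafts to judge
SAFETY_MARGIN = 5         # from the text: five points above the band


def suggest_threshold(rated_drafts):
    """Suggest an auto-send threshold from one week of rated drafts.

    rated_drafts: iterable of (score, rating) where rating is one of
    "correct", "tone_off", or "incorrect". Only "incorrect" counts as an error.
    Returns None if no band exceeds the error rate.
    """
    totals = defaultdict(int)
    incorrect = defaultdict(int)
    for score, rating in rated_drafts:
        band_start = int(score // BAND_WIDTH) * BAND_WIDTH
        totals[band_start] += 1
        if rating == "incorrect":
            incorrect[band_start] += 1

    # Walk from the highest band downwards and stop at the first problem band.
    for band_start in sorted(totals, reverse=True):
        if totals[band_start] < MIN_DRAFTS_PER_BAND:
            continue
        if incorrect[band_start] / totals[band_start] > MAX_ERROR_RATE:
            return min(band_start + BAND_WIDTH + SAFETY_MARGIN, 100)
    return None
```

Treat the output as a starting point for discussion rather than a final setting; the band width and error tolerance change the result and should reflect your own risk tolerance.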
Threshold by Ticket Category
Advanced deployments use category-specific thresholds rather than a single number applied to all ticket types.
| Ticket Category | Recommended Threshold | Rationale |
|---|---|---|
| Order tracking / WISMO ("where is my order") | 75 | Factual, verifiable, low risk if slightly imprecise |
| Shipping times | 78 | Simple FAQs with consistent answers |
| Returns policy | 80 | Needs to match policy exactly |
| Refund requests | 85 | Financial commitment — higher bar |
| Complaints | 90+ | Emotional, complex — almost always benefit from draft review |
| Account issues | 85 | Security-adjacent — err on the side of caution |
This tiered approach maximises automation where it is safe (WISMO) while keeping humans involved where mistakes are costly (complaints, refunds).
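In code, a tiered setup is often just a lookup table. The sketch below uses the values from the table above; the category keys, the `should_auto_send` function, and the choice to give unknown categories the strictest threshold are assumptions. The "90+" for complaints is represented by its floor of 90.

```python
# Thresholds taken from the table above; keys are hypothetical category labels.
CATEGORY_THRESHOLDS = {
    "order_tracking": 75,
    "shipping_times": 78,
    "returns_policy": 80,
    "refund_request": 85,
    "complaint": 90,      # table says "90+"; 90 is used as the floor here
    "account_issue": 85,
}

# Assumption: an unrecognised category gets the strictest bar.
DEFAULT_THRESHOLD = 90


def should_auto_send(category: str, score: float) -> bool:
    """Return True when the score meets the threshold for this ticket category."""
    threshold = CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    return score >= threshold
```

The same score can therefore lead to different outcomes: a 76 on an order-tracking query auto-sends, while a 76 on a refund request is held for review.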
What Happens Below the Threshold: Three Options
A well-designed AI support system handles low-confidence queries in one of three ways:
Option 1: Draft for review. The AI writes a provisional reply and flags it for the agent. The agent reviews, edits if necessary, and approves. This is appropriate for most business contexts.
Option 2: Acknowledge and escalate. The AI sends a holding message to the customer ("Our team will be in touch shortly") and notifies the agent immediately. This prevents dead silence when agents are offline or busy.
Option 3: Ask a clarifying question. The AI asks the customer to provide more information before generating a full reply. This is useful when the query is genuinely ambiguous and more context would produce a better answer.
What a properly calibrated AI should never do: guess confidently when it is not confident. The confidence threshold prevents that failure mode.
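One way to picture how a system might pick among the three options is the sketch below. The `Action` enum, the `choose_action` function, its two boolean inputs, and the order of precedence (clarify first, then escalate if no agent is available, otherwise draft) are all assumptions for illustration, not a description of any particular product.

```python
from enum import Enum


class Action(Enum):
    AUTO_SEND = "auto_send"
    DRAFT_FOR_REVIEW = "draft_for_review"
    ACKNOWLEDGE_AND_ESCALATE = "acknowledge_and_escalate"
    ASK_CLARIFYING_QUESTION = "ask_clarifying_question"


def choose_action(score: float, threshold: float,
                  query_is_ambiguous: bool, agents_available: bool) -> Action:
    """Pick what to do with a reply, given its score and the current context."""
    if score >= threshold:
        return Action.AUTO_SEND
    # Below the threshold: never send the uncertain answer itself.
    if query_is_ambiguous:
        return Action.ASK_CLARIFYING_QUESTION
    if not agents_available:
        return Action.ACKNOWLEDGE_AND_ESCALATE
    return Action.DRAFT_FOR_REVIEW
```

Whatever precedence you choose, the key property is that no branch below the threshold sends the uncertain answer to the customer.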
How Kriseena Implements Confidence Thresholds
Kriseena calculates a confidence score for every AI reply using a combination of:
- Cosine similarity between the query and matching knowledge base articles via pgvector
- Number of relevant document chunks retrieved
- Whether Shopify or WooCommerce order data was successfully fetched and verified
- Query specificity and completeness
The score is stored with every message and visible in the dashboard for auditing purposes. This transparency allows teams to review decisions and understand exactly why the AI chose to send or hold any given reply.
You set a threshold in the Widget Settings panel. Replies at or above the threshold are sent automatically. Replies below it are held in the inbox with the AI draft pre-populated, waiting for one-click approval. This design means agents only read and approve replies; they never start from a blank cursor.
Common Threshold Mistakes
Setting it and forgetting it. Your ticket mix changes as your business grows. A threshold calibrated in January may be wrong by June if you have added new product lines or changed your returns policy. Review quarterly.
Starting with auto-send. Always start with draft mode. Build evidence that the AI is accurate on your specific content before giving it autonomy over what reaches customers.
Using a single threshold for all ticket types. Different query categories carry different risk profiles. WISMO at 75 and refund confirmation at 85 is a sensible split. Treating them identically leaves money on the table (over-cautious on WISMO) or creates risk (under-cautious on refunds).
Ignoring the review queue composition. If agents are approving 95% of drafts without reading them, the threshold is too high: too much volume is routing to human review unnecessarily. If agents are mostly writing from scratch, the drafts reaching the queue are not usable, which points to a threshold that is too low or to gaps in the knowledge base. The right steady state: agents carefully review 15–30% of drafts, and the rest are one-click approvals.
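Queue composition is easy to track if each reviewed draft is logged with its outcome. The sketch below assumes three hypothetical outcome labels ("approved_as_is", "edited", "rewritten") and treats edited or rewritten drafts as a proxy for the ones that needed careful review, then checks that share against the 15–30% range mentioned above.

```python
from collections import Counter


def review_queue_summary(outcomes):
    """Summarise review outcomes as shares of the queue.

    outcomes: list of strings, each "approved_as_is", "edited", or "rewritten".
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no review outcomes to summarise")
    # Assumption: edited or rewritten drafts stand in for "carefully reviewed".
    careful_share = (counts["edited"] + counts["rewritten"]) / total
    return {
        "approved_as_is": counts["approved_as_is"] / total,
        "rewritten": counts["rewritten"] / total,
        "careful_review": careful_share,
        "in_target_range": 0.15 <= careful_share <= 0.30,
    }
```

A summary that sits outside the target range for several weeks in a row is a signal to revisit the threshold or the knowledge base.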
Key Takeaways
- A confidence threshold is the score at or above which an AI acts autonomously; below it, a human reviews
- Threshold calibration is often more impactful than AI model selection for day-to-day accuracy
- Start at 80 and run draft-only mode for at least one week before enabling auto-send
- Use category-specific thresholds for high-risk ticket types such as refunds and complaints
- Review your threshold quarterly as your ticket mix and knowledge base evolve
- The goal is not maximum auto-send rate — it is maximum correct auto-send rate
Frequently Asked Questions
What happens when the AI scores exactly at the threshold value? Behaviour varies by platform. In Kriseena, a score exactly at the threshold triggers auto-send — it is treated as meeting the requirement. You should treat scores within five points of your threshold as borderline and monitor them closely during calibration, since small changes in query phrasing can push them either side of the line.
Can I see individual confidence scores for each AI reply? Yes, in systems like Kriseena that expose this data. Each message record includes the confidence score, which helps you audit decisions and identify patterns — for instance, that queries about a specific product line consistently score low because the knowledge base article is too thin or out of date.
Does a high confidence score guarantee the AI reply is correct? No. A high confidence score means the AI found strong matches in the knowledge base and is certain about its reply relative to those sources. If the knowledge base contains incorrect information, the AI will confidently generate an incorrect reply. Confidence scores measure consistency with sources, not absolute truth. This is why knowledge base quality is the foundation of any AI support deployment.
What is the difference between a confidence threshold and a similarity threshold? A similarity threshold is a lower-level concept — it is the minimum cosine similarity score required for a knowledge base chunk to be included in the AI's context window. A confidence threshold is the higher-level concept — it is the overall score that triggers autonomous sending. The similarity threshold feeds into the confidence calculation.
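To make the two levels concrete, here is a minimal sketch in plain Python. It filters knowledge base chunks by a similarity threshold and then derives a stand-in confidence score from what survives. The threshold values, the function names, and the `toy_confidence` formula (best similarity scaled to 0–100) are assumptions; a production system would combine more signals and would typically compute similarity in the vector store.

```python
import math

SIMILARITY_THRESHOLD = 0.75   # hypothetical: minimum similarity for a chunk to be used
CONFIDENCE_THRESHOLD = 80     # hypothetical: minimum overall score for auto-send


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunks):
    """Keep only (text, similarity) pairs at or above the similarity threshold.

    chunks: list of (text, vector) pairs.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    return [(text, sim) for text, sim in scored if sim >= SIMILARITY_THRESHOLD]


def toy_confidence(retrieved):
    """Stand-in confidence: the best similarity scaled to 0-100, or 0 if nothing matched."""
    if not retrieved:
        return 0
    return round(max(sim for _, sim in retrieved) * 100)
```

The similarity threshold decides which chunks the AI may use; the confidence threshold is then applied once, to the overall score, to decide whether the reply is sent.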
Should I lower the threshold if agents are too slow to review the draft queue? No. If agents are overwhelmed by the review queue, the solution is to improve staffing or reduce overall ticket volume — not to lower the threshold and risk sending uncertain replies to customers. Lowering the threshold trades short-term agent workload reduction for long-term damage to customer trust. That is a poor trade-off in any context.