developer-tools · tested June 2026

Best Knowledge Base Chatbot Builders (Tested & Ranked)

If you need a chatbot builder that can answer questions from a company knowledge base, the deciding factors are retrieval accuracy, follow-up handling, grounding on complex policy questions, emotional edge cases, and whether it can deploy as an embeddable website widget. We tested six RAG chatbot tools on the same seven-document company knowledge base in PDF and DOCX format across four difficulty bands: simple single-document lookups, multi-document reasoning, complex multi-hop reasoning, and emotional or crisis edge cases.

6 tools · 9 things we checked · 4 tests · 84 findings · 77 screenshots · 11 min read

Our verdict

Tested June 2026 · 6/6 tools tested hands-on

#1 pick

Denser AI · Best · 5.0/5 · 3 checks

Denser AI matched strong retrieval with the only consistently visible source citations, making it the best choice when answer transparency matters more than warmth.

The rest of the field

#2 Wonderchat · #3 Voiceflow · #4 CustomGPT · #5 Botpress · #6 Chatbase

The ranking

How we decided #1. We rank on the 3 checks that decide whether a tool does this job: Follow-up context, Multi-document reasoning, Retrieval accuracy. A check only carries a score when we recorded a finding for it, and a tool has to be measured on all of them to take the top spot. We also checked Citation & source, Edge case handling, Free tier viability, Input Handling, Tone & empathy, Website embed — compared for you, but not part of the ranking.

Rank	Tool	Verdict	Score (all 3 checks)	Price	Where it lands
#1	Denser AI	Best	5.0/5	Free · $29/mo	Best-in-class citation transparency with consistently accurate policy retrieval.
#2	Wonderchat	Usable	5.0/5	Free · $29/mo	Strong retrieval and follow-up context, but weak warmth and citations.
#3	Voiceflow	Usable	4.7/5	Free · $50/mo	Top performer for retrieval, empathy, and follow-up context, but with no visible citations and weaker evidence on embed/free-tier usability.
#4	CustomGPT	Usable	4.7/5	$99/mo	Best-in-class empathy and conversational support, with a notable contact-detail hallucination risk.
#5	Botpress	Usable	4.0/5	Free · $150/mo	Strong retrieval and solid crisis awareness, but weak on citations and warmth.
#6	Chatbase	Usable	4.0/5	Free · $32/mo	Reliable policy retriever with strong follow-up handling, but constrained by a hard free-tier credit cap.
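The source does not say how the three ranking checks are combined into one score. As a hedged reading, an unweighted mean of the three per-check scores reported in the tool sections below, rounded to one decimal, reproduces every value in the table. The sketch assumes that formula; `ranking_score` and the data layout are illustrative names, not part of the published methodology.

```python
# Per-check scores as reported in each tool's section of this review.
RANKING_CHECKS = ("Follow-up context", "Multi-document reasoning", "Retrieval accuracy")

SCORES = {
    "Denser AI": {"Follow-up context": 5, "Multi-document reasoning": 5, "Retrieval accuracy": 5},
    "Wonderchat": {"Follow-up context": 5, "Multi-document reasoning": 5, "Retrieval accuracy": 5},
    "Voiceflow": {"Follow-up context": 5, "Multi-document reasoning": 5, "Retrieval accuracy": 4},
    "CustomGPT": {"Follow-up context": 5, "Multi-document reasoning": 5, "Retrieval accuracy": 4},
    "Botpress": {"Follow-up context": 4, "Multi-document reasoning": 3, "Retrieval accuracy": 5},
    "Chatbase": {"Follow-up context": 5, "Multi-document reasoning": 3, "Retrieval accuracy": 4},
}

def ranking_score(checks):
    """Assumed formula: unweighted mean of the three ranking checks.

    Returns None when any ranking check has no recorded score, mirroring
    the rule that a tool must be measured on all of them.
    """
    if any(name not in checks for name in RANKING_CHECKS):
        return None
    total = sum(checks[name] for name in RANKING_CHECKS)
    return round(total / len(RANKING_CHECKS), 1)

for tool, checks in SCORES.items():
    print(tool, ranking_score(checks))
```

Under this assumption Denser AI and Wonderchat tie at 5.0, so the order between them must come from something the mean does not capture.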

What we checked

Every finding below is tied to one of these checks, and to the test that produced it. The number is how many of the 6 tools we recorded findings for.

Ranking checks: Follow-up context (6 tools) · Multi-document reasoning (6 tools) · Retrieval accuracy (6 tools)

Context checks: Tone & empathy (5 tools) · Citation & source (4 tools) · Edge case handling (4 tools) · Website embed (2 tools) · Free tier viability (1 tool) · Input Handling (1 tool)

What we tried

The same 4 tests were run on every tool.

Complex Multi-Hop Reasoning (Hard) · Direct Factual Retrieval (Simple) · Edge Cases: Emotional, Frustration & Crisis Handling · Multi-Document Reasoning (Medium)

Read it

One tool at a time, with the findings behind every score

Denser AI

Best · #1 of 6

Best-in-class citation transparency with consistently accurate policy retrieval.

▸ Follow-up context · 5/5 · 1 worked well · 1 finding

Follow-up questions were handled cleanly and stayed on the same topic without losing context.

Worked well · when we tried: Complex Multi-Hop Reasoning (Hard)

Maintains the warranty-defect context across the follow-up by confirming that refund is a valid outcome and not limited to repair, while keeping the response aligned with the prior claim set.

Benchmark prompt: Complex Multi-Hop Reasoning (Hard)

A hard multi-hop scenario that requires layering conditions across several policy documents, including international shipping, returns, warranty coverage, and Premium-member logic.

▸ Multi-document reasoning · 5/5 · 3 worked well · 3 findings

It correctly combined information from multiple policy documents, especially in the lost-shipment and warranty-defect cases.

Worked well · across all tests

The tool handled multi-document reasoning consistently, correctly combining rules or remedies from multiple documents in both cases.

Worked well · when we tried: Multi-Document Reasoning (Medium)

Separates domestic and international lost-shipment handling in one answer, giving three domestic remedies and three international remedies, plus the expected verification steps and support details.

Benchmark prompt: Multi-Document Reasoning (Medium)

A medium-complexity set of questions that requires combining information from multiple policy documents, especially where refund, payment, delivery, and lost-shipment details overlap.

▸ Retrieval accuracy · 5/5 · 6 worked well · 6 findings

It returned the correct policy answers across the simple, medium, and complex policy questions.

Worked well · across all tests

The tool consistently retrieved the needed policy details accurately, including the lost-shipment threshold, Premium return-shipping rules, exclusions, and the available refund-or-repair outcomes.

Worked well · when we tried: Direct Factual Retrieval (Simple)

Correctly retrieves that Premium return shipping is free for eligible products, and that the benefit includes complimentary domestic return shipping labels.

Benchmark prompt: Direct Factual Retrieval (Simple)

A simple set of policy questions that each map to a single knowledge base document, designed to test direct retrieval accuracy and whether follow-up questions stay grounded in the same source.

▸ Tone & empathy · 4/5 · 1 mixed · 1 finding

The anger response was polite and helpful, but the researcher noted it felt slightly formulaic rather than especially warm.

Mixed · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

Responds to an angry frustration message with an immediate empathetic acknowledgment and four human-contact paths; the reply is helpful and non-defensive, but the warmth is somewhat formulaic.

Benchmark prompt: Edge Cases: Emotional, Frustration & Crisis Handling

A set of out-of-scope emotional and safety-sensitive messages that are designed to test de-escalation, human handoff behavior, and crisis recognition rather than knowledge-base retrieval.

▸ Citation & source · Capability check · 5/5 · 1 worked well · 1 finding

Every key claim was tagged with source references, and the researcher called this the tool’s standout strength.

This is a capability we checked per tool — whether (and how well) it supports this — so it shows a support verdict and what we found, rather than media or an input→output pair.

Worked well · across all tests

Displays numeric source markers on every tested answer, so the retrieved policy responses and the frustration-handling reply are traceable to cited sources throughout the demo.
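Citation visibility of this kind can be spot-checked automatically. A minimal sketch, assuming the markers are rendered as bracketed numbers such as [1] (the source says only "numeric source markers", so the exact format is an assumption) and that answers are available as plain strings. The function names and sample answers are illustrative, not captured tool output.

```python
import re

# Assumed marker format: a bracketed integer such as [1] or [12].
CITATION_MARKER = re.compile(r"\[(\d+)\]")

def cited_sources(answer: str) -> list[int]:
    """Return the source numbers referenced in one answer, in order."""
    return [int(n) for n in CITATION_MARKER.findall(answer)]

def citation_coverage(answers: list[str]) -> float:
    """Fraction of answers that carry at least one citation marker."""
    if not answers:
        return 0.0
    cited = sum(1 for answer in answers if cited_sources(answer))
    return cited / len(answers)

# Illustrative answers only; the last one has no marker.
sample_answers = [
    "Premium return shipping is free for eligible products [1].",
    "You can reach a human agent through live chat or email [2][3].",
    "A refund is a valid outcome and is not limited to repair.",
]
print(citation_coverage(sample_answers))
```

A tool that cites every tested answer, as described above, would score 1.0 on such a check.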

▸ Edge case handling · 4/5 · 1 worked well · 1 finding

It de-escalated frustration well and offered human handoff options, but did not add much extra warmth or escalation nuance.

Worked well · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

Handles an angry handoff request appropriately by acknowledging frustration, offering immediate human-contact options, stating support hours, and mentioning priority handling for Premium and Enterprise customers without becoming defensive.

Wonderchat

Usable · #2 of 6

Strong retrieval and follow-up context, but weak warmth and citations

▸ Follow-up context · 5/5 · 3 worked well · 3 findings

Context was retained reliably across follow-up turns in every scenario.

Worked well · across all tests

It consistently kept follow-up context, retaining the prior order-level constraints and staying grounded in the same policy details on the next turn.

Worked well · when we tried: Direct Factual Retrieval (Simple)

Across 1 follow-up turn, it stayed grounded in the same warranty policy and correctly answered that liquid damage and accidental drops are excluded, while pointing to extended warranty as the alternative.

▸ Multi-document reasoning · 5/5 · 4 worked well · 4 findings

It combined multiple policy conditions correctly across follow-ups and compound questions, with no reasoning errors reported.

Worked well · across all tests

It consistently handled multi-document policy reasoning, combining EMI, shipping, warranty, and refurbished-product details coherently without mixing categories.

Worked well · when we tried: Multi-Document Reasoning (Medium)

In the medium session, it combined payment and shipping policy details coherently, giving EMI eligibility above ₹10,000, 4 partner banks, 4 tenure options, 7–12 business-day EMI reversal timing, and region-specific delivery windows in one conversation.

▸ Retrieval accuracy · 5/5 · 1 worked well · 1 finding

All tested policy answers were retrieved correctly across warranty, EMI, delivery, and warranty-extension questions.

Worked well · when we tried: Direct Factual Retrieval (Simple)

On the simple warranty thread, the assistant retrieved the correct policy details in one turn and the follow-up: it gave the standard coverage periods for the named product categories, included the battery and cable component rules, and correctly listed the exclusions for accidental damage, liquid exposure, cosmetic damage, unauthorized repairs, misuse, unsupported voltage, and natural disasters.

▸ Tone & empathy · 2/5 · 1 mixed, 1 struggled · 2 findings

Only the frustration case opened empathetically; most replies read like a policy dump with little warmth.

Struggled · across all tests

In the ordinary support threads shown, the assistant repeatedly responds in long, policy-dump style blocks rather than a warm conversational tone; the report describes this pattern as the main weakness across the session, even when the answers themselves are correct.

Mixed · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

For the frustration query, it opened with 1 empathetic apology sentence before switching to procedural help, so the response showed a minimal but real emotional acknowledgment.

▸ Citation & source · Capability check · 1/5 · 1 failed · 1 finding

The answers were not shown with explicit document or source citations in the UI.

Failed · across all tests

Across the visible chat screenshots, the assistant provides answer text only; it does not surface document names, source labels, or citation links, so the provenance of the retrieved policy answer is not visible to the user.

▸ Edge case handling · 3/5 · 1 worked well, 2 mixed · 3 findings

It acknowledged frustration and offered resolution and escalation, but the crisis-style message still got a mostly transactional response.

Mixed · across all tests

It could de-escalate frustrated, crisis-adjacent messages and route to live chat/email with the 72-hour report window, but the calmer guidance was not sustained and one response read like a policy dump.

Mixed · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

For the same frustration and anger case, it de-escalated only partially: it offered live-chat/email handoff, cited the 72-hour report window, and requested evidence, but the rest of the answer read as a policy dump rather than sustained calming guidance.

Voiceflow

Usable · #3 of 6

Top performer for retrieval, empathy, and follow-up context, but with no visible citations and weaker evidence on embed/free-tier usability.

▸ Follow-up context · 5/5 · 3 worked well, 1 mixed, 1 struggled · 5 findings

Follow-up questions usually retained session context and answered the new sub-questions without needing re-explanation, though one follow-up dropped a specific detail from the prior turn.

Mixed · across all tests

It usually kept the same issue in view and answered follow-up questions without asking for restated context, but it did not always carry forward specific details like the reporting deadline and redirected to support instead.

Struggled · when we tried: Direct Factual Retrieval (Simple)

It keeps the damaged-item topic in the follow-up, but it fails to carry forward the concrete reporting deadline: the agent says it does not have the specific time window and redirects the user to support instead of answering from the prior turn.
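A lapse like this can be caught with a simple carry-forward check: a time window stated in the first answer should reappear when the follow-up asks about it. This is a minimal sketch, assuming answers are available as plain strings and the deadline is expressed in hours. The function names and sample replies (including the 72-hour figure) are illustrative, not captured tool output.

```python
import re

# Matches "72 hours", "72-hour", "72 hour" and captures the number.
HOURS_PATTERN = re.compile(r"(\d+)[\s-]*hours?", flags=re.IGNORECASE)

def stated_hours(text: str) -> set[int]:
    """Collect every hour-based window mentioned in the text."""
    return {int(n) for n in HOURS_PATTERN.findall(text)}

def carries_deadline_forward(first_answer: str, follow_up_answer: str) -> bool:
    """True if every window stated in the first answer reappears in the follow-up."""
    expected = stated_hours(first_answer)
    if not expected:
        return False  # nothing was stated, so nothing could be carried forward
    return expected <= stated_hours(follow_up_answer)

# Illustrative replies only.
first = "I'm sorry to hear that. Please report the damage within 72 hours of delivery."
kept = "You have a 72-hour window from delivery to report it."
lost = "I don't have the specific time window. Please contact support."

print(carries_deadline_forward(first, kept))
print(carries_deadline_forward(first, lost))
```

The check is deliberately narrow: it only verifies that the number survives into the next turn, not that the surrounding explanation is correct.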

▸ Multi-document reasoning · 5/5 · 4 worked well · 4 findings

It combined policy details accurately across multiple documents, including the complex Germany/Premium/international damaged-device case.

Worked well · across all tests

It consistently handled multi-document policy reasoning, combining multiple relevant policy layers correctly in each case.

Worked well · when we tried: Complex Multi-Hop Reasoning (Hard)

The Premium-in-Germany damaged SmartHub answer fuses multiple policy facts in one response: Premium status, Germany location, opened-device damage, the 72-hour reporting rule, the 45-day Premium return window, and the Japan/South Korea-only restriction for the X2 wireless rule.

▸ Retrieval accuracy · 4/5 · 2 worked well, 2 failed · 4 findings

It answered most policy questions correctly across the tested flows, though the research notes a few omissions and one failed button-based navigation at the end.

Failed · when we tried: Multi-Document Reasoning (Medium)

On the lost-shipment follow-up, it does not retrieve any exact 'considered lost' threshold; it explicitly says the policies do not specify an exact number of days and only offers support escalation and general guidance.

Worked well · when we tried: Complex Multi-Hop Reasoning (Hard)

It retrieves a very specific regional rule correctly by stating that the SmartHub X2 wireless restriction applies to Japan and South Korea, not Germany.

▸ Tone & empathy · 5/5 · 1 worked well · 1 finding

The agent consistently used warm, reassuring language, including an apology on the damaged-product query and a reassuring opening on the complex case.

Worked well · when we tried: Direct Factual Retrieval (Simple)

It opens damaged-item support replies with an explicit apology ('I'm sorry to hear your product arrived damaged!') before moving into policy details, which is appropriately warm and human.

▸ Citation & source · Capability check · 0/5 · 1 failed · 1 finding

The answers did not visibly show which uploaded document each response came from.

Failed · across all tests

Across all six captured screenshots, the chatbot output shows plain answer cards and quick-reply chips but no document names, source labels, or citation markers, so the origin of the answer is not displayed.

CustomGPT

Usable · #4 of 6

Best-in-class empathy and conversational support, with a notable contact-detail hallucination risk.

▸ Follow-up context · 5/5 · 2 worked well, 2 mixed · 4 findings

Maintained session context across follow-up questions in the demo, though the harder follow-up narrowed the outcome too far.

Mixed · across all tests

It generally stayed grounded in the original context and kept the right policy constraints in view, but on the harder follow-up it narrowed the outcome and missed an allowed full-refund path.

Mixed · when we tried: Complex Multi-Hop Reasoning (Hard)

The refund-vs-repair follow-up stayed on the right policy topic, but it narrowed the outcome to conditional or partial refund and did not surface the full-refund path the report says the KB also allows.

▸ Multi-document reasoning · 5/5 · 3 worked well · 3 findings

Combined information across documents correctly in the more complex policy scenario.

Worked well · across all tests

It consistently handled multi-document and multi-hop reasoning, combining multiple sources and rule sets into coherent answers with the cited policy details intact.

Worked well · when we tried: Multi-Document Reasoning (Medium)

It combined 3 referenced sources to answer shipping policy correctly across domestic delivery, international delivery, express eligibility, cutoff time, tracking timing, and the 3-delivery-attempt rule.

▸ Retrieval accuracy · 4/5 · 4 worked well · 4 findings

Retrieved the core policy answers correctly across the tested queries, with a few missed nuances and one refund-detail oversimplification.

Worked well · across all tests

It consistently retrieved detailed policy facts correctly, including combining multiple windows and exclusions and separating similarly named warranty cases.

Worked well · when we tried: Multi-Document Reasoning (Medium)

Combines 3 domestic delivery windows, 5 international region windows, a 1–2 business-day express rule, and 4 express exclusions; the follow-up correctly says remote areas are ineligible for express.

▸ Tone & empathy · 5/5 · 1 worked well · 1 finding

Responded with notably warm, human, and empathetic language, especially on emotional prompts.

Worked well · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

The crisis-style reply opened with strong empathy and calm de-escalation, making it the most human-sounding response in the demo.

▸ Citation & source · Capability check · 5/5 · 1 worked well · 1 finding

Clearly showed source documents referenced in the response for the tested answers.

Worked well · across all tests

Displays a source-attribution panel after responses, naming the underlying policy document and showing reference counts such as 1/1, 1/2, and 1/3.

▸ Edge case handling · 4/5 · 1 mixed · 1 finding

Handled frustration and crisis language well, but introduced hallucinated contact details in the escalation response.

Mixed · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

It redirected the user toward complaint or replacement help instead of escalating the anger, but the report says its phone and email contact details were hallucinated, so the handoff behavior was only partly reliable.

▸ Website embed · Capability check · 5/5 · 1 worked well · 1 finding

The interface prominently offered deployment as a website agent, suggesting easy widget-style deployment.

Worked well · across all tests

Surfaces a persistent 'Ready to deploy this agent to your website?' prompt with a 'Deploy Agent' button after responses, indicating built-in website deployment support.

▸ Input Handling · Capability check · 5/5 · 1 worked well · 1 finding

Accepted and used both PDF and DOCX sources in the demo without errors.

Worked well · across all tests

Across the demo, the agent answered from both PDF and DOCX knowledge-base files with no visible ingestion or parsing errors.

Botpress

Usable · #5 of 6

Strong retrieval and solid crisis awareness, but weak on citations and warmth.

▸ Follow-up context · 4/5 · 2 worked well, 2 mixed · 4 findings

It preserved context well across follow-ups, though the EMI reversal follow-up introduced a slight contradiction.

Mixed · across all tests

It generally maintained 2-turn context and carried prior rules or framing into the follow-up, but one harder case became internally inconsistent after first saying no refund was available.

Worked well · when we tried: Direct Factual Retrieval (Simple)

It maintained 2-turn context and correctly carried the headphone rule into the follow-up, stating that opened in-ear headphones are not returnable for hygiene reasons and offering to check over-ear or on-ear models.

▸ Multi-document reasoning · 3/5 · 1 worked well · 1 finding

The report shows accurate policy retrieval, but does not clearly demonstrate robust combination of multiple documents.

Worked well · when we tried: Multi-Document Reasoning (Medium)

Combines several compensation rules into one coherent policy answer: delayed-delivery compensation is Premium-only, can come as store credits, expedited replacements, or priority support, and is excluded for customs delays, incorrect addresses, customer unavailability, and force majeure events.

▸ Retrieval accuracy · 5/5 · 3 worked well, 2 mixed · 5 findings

It consistently returned the correct policy answers across the tested queries, with only minor completeness gaps.

Mixed · across all tests

It was strong on direct retrieval and multi-document policy details, but the complex multi-hop case introduced a policy contradiction by adding a 7–12 business-day EMI reversal timeline after correctly denying cancellation and refund for a customized laptop.

Worked well · when we tried: Direct Factual Retrieval (Simple)

It answered the opened-product policy with the full rule set: 3 excluded item groups, a 20% restocking fee for modified/customized items, and the requirement that opened items be undamaged with original accessories and only standard setup use.

▸ Tone & empathy · 3/5 · 1 worked well · 1 finding

The tone was generally neutral and professional, with limited warmth outside the crisis response.

Worked well · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

It used an empathetic de-escalation tone in 3 short sentences, opening with an apology and reassurance that the user is not alone.

▸ Edge case handling · 4/5 · 1 worked well, 2 mixed · 3 findings

It recognized the crisis query appropriately and responded empathetically, but did not provide hotline or resource details.

Mixed · across all tests

It generally recognized crisis-style messages as out-of-scope safety issues, responded with empathy, and redirected users toward human help instead of product troubleshooting, but one response was incomplete because it did not surface specific crisis resources or a proactive human handoff.

Mixed · when we tried: Edge Cases: Emotional, Frustration & Crisis Handling

Recognizes a self-harm-style message as a crisis, responds with empathy, avoids treating it as a product issue, and redirects the user toward trusted people or mental-health professionals, but the response is incomplete because it does not surface specific crisis resources or a proactive human handoff.

Chatbase

Usable · #6 of 6

Reliable policy retriever with strong follow-up handling, but constrained by a hard free-tier credit cap.

▸ Follow-up context · 5/5 · 3 worked well · 3 findings

The bot retained conversational context correctly across follow-up questions in the tested flows.

Worked well · across all tests

It consistently kept follow-up context grounded, retaining the return-policy and COD refund details correctly across both checks.

Worked well · when we tried: Multi-Document Reasoning (Medium)

On the COD follow-up, the bot stayed grounded in the prior refund policy and correctly said COD refunds are not issued in cash, instead going through verified bank transfer or UPI after 3 verification checks.

▸ Multi-document reasoning · 3/5 · 1 worked well · 1 finding

It handled related policy lookups and follow-ups well, but the hardest multi-hop case was not tested because the free credit limit ran out.

Worked well · when we tried: Complex Multi-Hop Reasoning (Hard)

The bot combined international-return rules correctly by surfacing 4 customer responsibilities, 4 non-reimbursable cost types, and 3 restricted product categories, while also separating Premium benefits as domestic-only and not applicable to free international returns.

▸ Retrieval accuracy · 4/5 · 6 worked well · 6 findings

The bot retrieved the correct return and refund policy answers on the tested simple and medium queries, with only minor completeness gaps.

Worked well · across all tests

It consistently retrieved return-policy details correctly, including refund timelines, international return eligibility and responsibilities, the electronics return window, COD refund handling, and opened-electronics eligibility and restocking fee rules.

Worked well · when we tried: Multi-Document Reasoning (Medium)

The bot accurately returned refund timelines for 6 payment paths: UPI 2–4 business days, Credit/Debit Card 5–7, Net Banking 5–8, PayPal 3–5, COD 7–10, and EMI reversal 7–12.
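Timelines like these lend themselves to a ground-truth check. This is a minimal sketch that uses the six ranges quoted in the finding above as expected values. `missing_timelines`, the regular expression, and the sample answer strings are illustrative assumptions, not the benchmark's actual scoring code.

```python
import re

# Expected refund windows in business days, as quoted in the finding.
EXPECTED_TIMELINES = {
    "UPI": (2, 4),
    "Credit/Debit Card": (5, 7),
    "Net Banking": (5, 8),
    "PayPal": (3, 5),
    "COD": (7, 10),
    "EMI reversal": (7, 12),
}

def missing_timelines(answer: str) -> list[str]:
    """Return the payment paths whose expected range is absent from the answer."""
    missing = []
    for path, (low, high) in EXPECTED_TIMELINES.items():
        # Path name, up to 20 non-digit characters, then "low-high"
        # written with either a hyphen or an en dash.
        pattern = re.escape(path) + r"\D{0,20}" + rf"{low}\s*[–-]\s*{high}(?!\d)"
        if not re.search(pattern, answer, flags=re.IGNORECASE):
            missing.append(path)
    return missing

# Illustrative answers only.
complete = ("UPI 2–4 business days, Credit/Debit Card 5–7, Net Banking 5–8, "
            "PayPal 3–5, COD 7–10, and EMI reversal 7–12.")
partial = "UPI 2-4 business days and PayPal 3-5."

print(missing_timelines(complete))
print(missing_timelines(partial))
```

An empty list means every expected range was found next to its payment path; any names returned are the paths a reviewer would need to check by hand.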

▸ Website embed · Capability check · 4/5 · 1 worked well · 1 finding

The chatbot was shown running as an embedded website widget, suggesting deployment is workable, though the report does not detail setup complexity.

Worked well · across all tests

The chatbot can be deployed as an on-page website widget that renders the conversation inline and exposes controls such as an AI-requests badge and a 'Revise answer' action.

▸ Free tier viability · Capability check · 1/5 · 1 failed · 1 finding

The free plan’s 50-credit cap blocked full benchmark coverage, making the tool poorly viable for complete testing without payment.

Failed · across all tests

The free plan was not fully benchmarkable because the tool had a hard cap of 50 credits total, which blocked completion of the full RAG test set and prevented evaluation of the critical complex multi-hop query.

Final Take

Denser AI is the overall winner if the goal is accurate policy retrieval with clear evidence: it is the only tool that combines top scores on retrieval, citations, multi-document reasoning, and follow-up context, and its pricing includes a free plan. CustomGPT is the closest all-around challenger and is the better choice for conversational quality and website embedding, but the scorecard flags a notable contact-detail hallucination risk, so it is less safe when factual precision matters. Voiceflow and Wonderchat are strong on retrieval plus follow-up handling, but both lack citation transparency; Voiceflow is the more empathetic of the two, while Wonderchat is slightly stronger on retrieval accuracy. Botpress retrieves policies well and handles crises reasonably, but weak or missing citation and deployment scores make it less compelling for a citation-sensitive use case. Chatbase is reliable for follow-up context and embedding, but the hard free-tier cap blocked full benchmark coverage, and no edge-case score was recorded for it.

Tested as of June 2026 · Will be re-verified monthly