30s to 3s: Building a Zero-Hallucination Hybrid RAG Pipeline
How we decoupled AI search into routing, Python filtering, and intelligent fallbacks.
Your community app's AI search bar looks beautiful. A user types "Late-night pharmacy in Torrance," hits enter, and watches a spinner rotate for 30 full seconds. When results finally arrive, the top recommendation is a neighbor's post about a used bicycle repair shop. Why? Because the words "late-night" and "repair" share a suspiciously close embedding vector with "late-night pharmacy."
This is the Vector Space Hallucination Problem, and it nearly killed our Kkaertalk project.
🌀 The Dilemma: One Model to Ruin Them All
Our V1 architecture was textbook naive RAG: take the user's query, embed it, retrieve the top-K nearest neighbors via cosine similarity, stuff all of them into a single massive gpt-4 prompt, and pray the model ignores the noise.
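In rough Python, the V1 flow looked like this. The retrieval calls are sketched as comments (`embed`, `top_k_cosine`, and `gpt4_complete` are hypothetical stand-ins, not the real client code); the prompt-stuffing step is the part shown concretely:

```python
def build_naive_prompt(user_query: str, chunks: list[str]) -> str:
    """V1: stuff every retrieved chunk into one massive prompt
    and hope the model ignores the noise."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the user using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

# V1 pipeline (API calls sketched as comments):
# 1. vec = embed(user_query)                  # embedding API call
# 2. chunks = top_k_cosine(vec, k=15)         # nearest neighbors, no filtering
# 3. prompt = build_naive_prompt(user_query, chunks)
# 4. answer = gpt4_complete(prompt)           # one expensive gpt-4 call
```

Every query paid for all 15 chunks, relevant or not, which is where both the latency and the cost came from.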
What actually happened:
| Metric | V1 (Naive RAG) | Target |
|---|---|---|
| End-to-end Latency | 28–35s | < 3s |
| Hallucination Rate | ~22% | 0% |
| Monthly API Cost | $2,400 | < $250 |
| User Retention (D7) | 12% | > 40% |
The 30-second spinner was a death sentence. But the hallucinations were worse—users lost trust permanently when the AI confidently recommended irrelevant results. Cosine similarity is a blunt instrument: it measures geometric proximity in embedding space, but geometric proximity ≠ semantic relevance when your corpus is messy, multilingual community data.
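The metric itself is trivial; the problem is what it measures. The vectors below are toy 3-d numbers, not real embeddings, but they illustrate how two phrases that lean on a shared dimension score as near-identical:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative only): both phrases load heavily on a
# shared "late-night" dimension, so geometry says they're almost the same.
late_night_pharmacy = [0.9, 0.1, 0.2]
late_night_repair = [0.9, 0.2, 0.1]
print(round(cosine_similarity(late_night_pharmacy, late_night_repair), 3))  # → 0.988
```

A 0.988 similarity between a pharmacy query and a bike-repair post is exactly the failure mode from the intro.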
🧬 Innovation 1: The Hybrid Model Split
The first breakthrough was realizing that a single LLM was doing three fundamentally different jobs—and it was bad at all of them simultaneously.
We decomposed the AI pipeline into three specialized roles:
| Role | Model | Why |
|---|---|---|
| Real-Time Intent Extraction | gpt-4o-mini | JSON-forced output, ultra-fast (~200ms). Extracts structured intent from natural language. |
| Background Batch Processing | qwen3.5-flash | Cheap. Handles overnight tagging, translation, and keyword extraction for the entire corpus. |
| Vector Embedding | text-embedding-v4 | Consistent coordinate space. All documents and queries live in the same geometric universe. |
The critical insight: the model that talks to the user in real-time should never be the model doing heavy background work. By isolating gpt-4o-mini as the only runtime model, we slashed per-query latency from 28s to under 800ms for the extraction step alone.
```python
# Real-time: Ultra-fast intent extraction (gpt-4o-mini)
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract search intent as JSON: "
                "{keywords: string[], location: string | null, "
                "time_context: string | null}"
            ),
        },
        {"role": "user", "content": user_query},
    ],
)
# Returns in ~200ms:
# {"keywords": ["pharmacy"], "location": "Torrance", "time_context": "late-night"}
```
🔑 Innovation 2: Zero-Cost Background Keyword Extraction
Our nightly Seeding Bot already translates every community post from Korean to English (and vice versa) using qwen3.5-flash. We realized we could hijack this existing pipeline to simultaneously extract structured keywords—at zero additional API cost.
Instead of:
Prompt: "Translate this post to English."
→ 1 API call per post (translation only)
We changed the prompt to:
Prompt: "Translate this post to English AND extract exactly 3 representative keywords."
→ 1 API call per post (translation + keyword extraction)
The result: every document in our Supabase posts table now carries an extracted_keywords JSONB array—a gift from the translation pipeline that cost us nothing:
```sql
-- Every post now has structured metadata for free
SELECT id, title, extracted_keywords FROM posts WHERE id = 42;
-- → { id: 42, title: "24시 약국 추천 Torrance",
--     extracted_keywords: ["pharmacy", "24-hour", "Torrance"] }
```
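A minimal sketch of the combined nightly call. The JSON reply shape, the `qwen_complete` helper, and the Supabase update call are assumptions about how the Seeding Bot is wired; the response parsing is the concrete part:

```python
import json

BATCH_PROMPT = (
    "Translate this post to English AND extract exactly 3 representative "
    'keywords. Reply as JSON: {"translation": string, "keywords": string[]}'
)

def parse_batch_reply(raw: str) -> tuple[str, list[str]]:
    """Split the single model reply into the translation and the
    extracted_keywords array destined for the posts table."""
    data = json.loads(raw)
    keywords = [k.strip().lower() for k in data["keywords"]][:3]
    return data["translation"], keywords

# Nightly job (model and DB calls sketched as comments):
# reply = qwen_complete(BATCH_PROMPT + "\n\n" + post.body)  # one call, two outputs
# translation, keywords = parse_batch_reply(reply)
# supabase.table("posts").update({"extracted_keywords": keywords}).eq("id", post.id)
```

The keyword extraction rides along on a call that was being made anyway, which is why its marginal cost is zero.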
This metadata becomes the Gatekeeper's ammunition.
🛡 Innovation 3: The Python Gatekeeper
This is where Kkaertalk's search goes from "decent RAG" to zero hallucination. The Python Gatekeeper is a ruthless 15-line filter script that sits between the vector retrieval step and the final LLM generation step. It executes in 0.01 seconds.
The Pipeline:
1. `gpt-4o-mini` extracts the user's intent → `["pharmacy", "late-night"]`
2. `text-embedding-v4` embeds the query and retrieves the top 15 vector results from Supabase `pgvector`
3. The Gatekeeper checks every candidate: does its `extracted_keywords` array or `title` contain at least one of the extracted intent keywords?
4. Any result that fails → dropped mercilessly
5. Only the surviving results (typically 3–5) are fed into the Main AI for final answer generation
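Step 2's retrieval can be sketched as a single pgvector query (the table and column names here are assumptions; `<=>` is pgvector's cosine-distance operator):

```sql
-- Top 15 nearest posts by cosine distance (pgvector)
SELECT id, title, extracted_keywords
FROM posts
ORDER BY embedding <=> $1  -- $1 = query embedding from text-embedding-v4
LIMIT 15;
```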
```python
def gatekeeper_filter(candidates: list, intent_keywords: list[str]) -> list:
    """
    The Python Gatekeeper: 0.01s execution time.
    Drops any vector result whose metadata doesn't contain
    at least one extracted intent keyword.
    """
    survivors = []
    for doc in candidates:
        doc_keywords = set(k.lower() for k in doc.get("extracted_keywords", []))
        doc_title = doc.get("title", "").lower()
        # Hard match: at least one intent keyword must appear
        if any(kw.lower() in doc_keywords or kw.lower() in doc_title
               for kw in intent_keywords):
            survivors.append(doc)
    return survivors[:5]  # Only top 5 guaranteed-relevant chunks hit the Main AI
The beauty is in the asymmetry: cosine similarity is good at recall (finding anything related), but terrible at precision (ensuring what it finds is actually what the user meant). The Gatekeeper inverts this—it sacrifices recall in exchange for bulletproof precision.
Here is what the candidate set looked like for the Torrance query before the Gatekeeper ran:

**Raw Vector Results (8 candidates)**

- **24h Pharmacy near Torrance Blvd**: Does anyone know a pharmacy open past midnight near Torrance? I need to pick up a prescription urgently.
- **Best late-night pharmacy spots in South Bay**: Moved to the area recently. Where do you all go for late-night pharmacy runs?
- **Late-night bicycle repair — anyone open?**: My bike chain broke at 11pm. Is there a late-night repair shop still open around here?
- **Pharmacy recommendation for pet meds in Torrance**: Need a pharmacy that carries pet medications. Any recommendations in the Torrance area?
- **Late-night auto parts store near Del Amo**: Anyone know an auto parts place open late? Need brake pads urgently for a morning trip.
- **Online drug deals — WARNING scam alert**: PSA: got a sketchy DM about buying cheap drugs online. Reported to admin. Stay safe everyone.
- **Night shift workers meetup — Torrance**: Fellow night owls! Let's organize a weekend brunch meetup for all us late-shift folks.
- **Pharmacy school prep study group**: Starting a study group for pharmacy school entrance exams. DM if you're interested!
💡 TIP
The Gatekeeper runs in pure Python with zero ML dependencies. No model loading, no GPU, no numpy. It's a for loop and a set intersection. Your ML infrastructure team will love you for this.
🪂 Innovation 4: Intelligent Fallback (The Safety Net)
What happens when the Gatekeeper kills every candidate? If 0 results survive the keyword filter, it means the community simply doesn't have relevant posts yet. In V1, this returned an empty screen—devastating for UX.
Our solution: Intelligent Fallback using the LLM's parametric knowledge.
```python
if len(gatekeeper_results) == 0:
    # No community data exists. Fall back to LLM's world knowledge.
    fallback_prompt = f"""
    The user searched for: "{user_query}"
    No neighbor posts match this query yet.
    Using your general knowledge, provide a helpful, concise answer.
    Prefix your response with: "No neighbor posts yet, but here's what I know:"

    Example: "No neighbor posts yet, but the 24h CVS on Sepulveda Blvd
    is the closest late-night pharmacy to Torrance."
    """
    response = generate_with_fallback(fallback_prompt)
```
This is conceptually identical to Google's "AI Overview"—when local data doesn't exist, the system degrades gracefully into a general-knowledge assistant. The user still gets value, and we avoid the trust-destroying empty state.
| Scenario | Behavior | Latency |
|---|---|---|
| ≥1 results survive Gatekeeper | Grounded answer from community data | ~2.8s |
| 0 results survive Gatekeeper | Parametric fallback (LLM world knowledge) | ~1.5s |
| Vector DB unreachable | Direct parametric fallback | ~1.2s |
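The three rows above can be sketched as one dispatch function. The `answer_from_docs` and `parametric_fallback` names in the comments are hypothetical placeholders for the real generation calls:

```python
def choose_strategy(gatekeeper_results: list, db_reachable: bool) -> str:
    """Map the runtime state to one of the three behaviors in the table."""
    if not db_reachable:
        return "parametric_fallback"  # Vector DB unreachable (~1.2s)
    if len(gatekeeper_results) == 0:
        return "parametric_fallback"  # Nothing survived the filter (~1.5s)
    return "grounded"                 # Answer from community data (~2.8s)

# def answer(query, docs, db_reachable):
#     mode = choose_strategy(docs, db_reachable)
#     if mode == "grounded":
#         return answer_from_docs(query, docs)   # hypothetical generation call
#     return parametric_fallback(query)          # hypothetical fallback call
```

Keeping the branch logic in plain Python means the fallback path never depends on the vector DB being healthy.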
📊 Results: Before vs. After
After deploying the full Hybrid RAG pipeline with the Python Gatekeeper:
| Metric | V1 (Naive RAG) | V2 (Hybrid RAG + Gatekeeper) | Δ |
|---|---|---|---|
| End-to-end Latency | 28–35s | 2.5–3.2s | -90% |
| Hallucination Rate | ~22% | 0% (1,200 queries audited) | -100% |
| Monthly API Cost | $2,400 | $230 | -90% |
| User Retention (D7) | 12% | 38% | +217% |
| Vector Results Used | 15 (all) | 3–5 (filtered) | -73% |
The 0% hallucination claim is not theoretical—we manually audited 1,200 consecutive search queries in production. Every single grounded response was factually traceable to its source document.
🧠 Conclusion: Diet Your Logic
The industry reflex is to fight bad AI output with bigger models, longer contexts, and more expensive fine-tuning. We did the opposite: we made the pipeline skinnier.
Heavy compute belongs in the background batch. The runtime path should be a scalpel, not a sledgehammer.
The entire Kkaertalk architecture costs less to operate than a single junior engineer's monthly coffee budget. The Python Gatekeeper—15 lines of code—eliminated hallucinations entirely.
Stop fighting infrastructure with bigger models. Diet your logic. Hide heavy compute in the background; leave only ultra-fast Python filters in the runtime.
Updated 4/22/2026