30s to 3s: Building a Zero-Hallucination Hybrid RAG Pipeline
How we decoupled AI search into routing, Python filtering, and intelligent fallbacks.
Your community app's AI search bar looks beautiful. A user types "Late-night pharmacy in Torrance," hits enter, and watches a spinner rotate for 30 full seconds. When results finally arrive, the top recommendation is a neighbor's post about a used bicycle repair shop. Why? Because the words "late-night" and "repair" share a suspiciously close embedding vector with "late-night pharmacy."
This is the Vector Space Hallucination Problem, and it nearly killed our Kkaertalk project.
🌀 The Dilemma: One Model to Ruin Them All
Our V1 architecture was textbook naive RAG: take the user's query, embed it, retrieve the top-K nearest neighbors via cosine similarity, stuff all of them into a single massive gpt-4 prompt, and pray the model ignores the noise.
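In rough Python, the V1 flow looked like this. The retrieval calls are sketched as comments (`embed`, `top_k_cosine`, and `gpt4_complete` are hypothetical stand-ins, not the real client code); the prompt-stuffing step is the part shown concretely:

```python
def build_naive_prompt(user_query: str, chunks: list[str]) -> str:
    """V1: stuff every retrieved chunk into one massive prompt
    and hope the model ignores the noise."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the user using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

# V1 pipeline (API calls sketched as comments):
# 1. vec = embed(user_query)                  # embedding API call
# 2. chunks = top_k_cosine(vec, k=15)         # nearest neighbors, no filtering
# 3. prompt = build_naive_prompt(user_query, chunks)
# 4. answer = gpt4_complete(prompt)           # one expensive gpt-4 call
```

Every query paid for all 15 chunks, relevant or not, which is where both the latency and the cost came from.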
What actually happened:
| Metric | V1 (Naive RAG) | Target |
|---|---|---|
| End-to-end Latency | 28–35s | < 3s |
| Hallucination Rate | ~22% | 0% |
| Monthly API Cost | $2,400 | < $250 |
| User Retention (D7) | 12% | > 40% |
The 30-second spinner was a death sentence. But the hallucinations were worse—users lost trust permanently when the AI confidently recommended irrelevant results. Cosine similarity is a blunt instrument: it measures geometric proximity in embedding space, but geometric proximity ≠ semantic relevance when your corpus is messy, multilingual community data.
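The metric itself is trivial; the problem is what it measures. The vectors below are toy 3-d numbers, not real embeddings, but they illustrate how two phrases that lean on a shared dimension score as near-identical:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative only): both phrases load heavily on a
# shared "late-night" dimension, so geometry says they're almost the same.
late_night_pharmacy = [0.9, 0.1, 0.2]
late_night_repair = [0.9, 0.2, 0.1]
print(round(cosine_similarity(late_night_pharmacy, late_night_repair), 3))  # → 0.988
```

A 0.988 similarity between a pharmacy query and a bike-repair post is exactly the failure mode from the intro.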
🧬 Innovation 1: The Hybrid Model Split
The first breakthrough was realizing that a single LLM was doing three fundamentally different jobs—and it was bad at all of them simultaneously.
We decomposed the AI pipeline into three specialized roles:
| Role | Model | Why |
|---|---|---|
| Real-Time Intent Extraction | gpt-4o-mini | JSON-forced output, ultra-fast (~200ms). Extracts structured intent from natural language. |
| Background Batch Processing | qwen3.5-flash | Cheap. Handles overnight tagging, translation, and keyword extraction for the entire corpus. |
| Vector Embedding | text-embedding-v4 | Consistent coordinate space. All documents and queries live in the same geometric universe. |
The critical insight: the model that talks to the user in real-time should never be the model doing heavy background work. By isolating gpt-4o-mini as the only runtime model, we slashed per-query latency from 28s to under 800ms for the extraction step alone.
```python
# Real-time: Ultra-fast intent extraction (gpt-4o-mini)
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract search intent as JSON: "
                "{keywords: string[], location: string | null, "
                "time_context: string | null}"
            ),
        },
        {"role": "user", "content": user_query},
    ],
)
# Returns in ~200ms:
# {"keywords": ["pharmacy"], "location": "Torrance", "time_context": "late-night"}
```
🔑 Innovation 2: Zero-Cost Background Keyword Extraction
Our nightly Seeding Bot already translates every community post from Korean to English (and vice versa) using qwen3.5-flash. We realized we could hijack this existing pipeline to simultaneously extract structured keywords—at zero additional API cost.
Instead of:
Prompt: "Translate this post to English."
→ 1 API call per post (translation only)
We changed the prompt to:
Prompt: "Translate this post to English AND extract exactly 3 representative keywords."
→ 1 API call per post (translation + keyword extraction)
The result: every document in our Supabase posts table now carries an extracted_keywords JSONB array—a gift from the translation pipeline that cost us nothing:
```sql
-- Every post now has structured metadata for free
SELECT id, title, extracted_keywords FROM posts WHERE id = 42;
-- → { id: 42, title: "24시 약국 추천 Torrance",
--     extracted_keywords: ["pharmacy", "24-hour", "Torrance"] }
```
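A minimal sketch of the combined nightly call. The JSON reply shape, the `qwen_complete` helper, and the Supabase update call are assumptions about how the Seeding Bot is wired; the response parsing is the concrete part:

```python
import json

BATCH_PROMPT = (
    "Translate this post to English AND extract exactly 3 representative "
    'keywords. Reply as JSON: {"translation": string, "keywords": string[]}'
)

def parse_batch_reply(raw: str) -> tuple[str, list[str]]:
    """Split the single model reply into the translation and the
    extracted_keywords array destined for the posts table."""
    data = json.loads(raw)
    keywords = [k.strip().lower() for k in data["keywords"]][:3]
    return data["translation"], keywords

# Nightly job (model and DB calls sketched as comments):
# reply = qwen_complete(BATCH_PROMPT + "\n\n" + post.body)  # one call, two outputs
# translation, keywords = parse_batch_reply(reply)
# supabase.table("posts").update({"extracted_keywords": keywords}).eq("id", post.id)
```

The keyword extraction rides along on a call that was being made anyway, which is why its marginal cost is zero.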
This metadata becomes the Gatekeeper's ammunition.
🛡 Innovation 3: The Python Gatekeeper
This is where Kkaertalk's search goes from "decent RAG" to zero hallucination. The Python Gatekeeper is a ruthless 15-line filter script that sits between the vector retrieval step and the final LLM generation step. It executes in 0.01 seconds.
The Pipeline:
1. `gpt-4o-mini` extracts the user's intent → `["pharmacy", "late-night"]`
2. `text-embedding-v4` embeds the query and retrieves the top 15 vector results from Supabase `pgvector`
3. The Gatekeeper checks every candidate: does its `extracted_keywords` array or `title` contain at least one of the extracted intent keywords?
4. Any result that fails → dropped mercilessly
5. Only the surviving results (typically 3–5) are fed into the Main AI for final answer generation
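Step 2's retrieval can be sketched as a single pgvector query (the table and column names here are assumptions; `<=>` is pgvector's cosine-distance operator):

```sql
-- Top 15 nearest posts by cosine distance (pgvector)
SELECT id, title, extracted_keywords
FROM posts
ORDER BY embedding <=> $1  -- $1 = query embedding from text-embedding-v4
LIMIT 15;
```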
```python
def gatekeeper_filter(candidates: list, intent_keywords: list[str]) -> list:
    """
    The Python Gatekeeper: 0.01s execution time.
    Drops any vector result whose metadata doesn't contain
    at least one extracted intent keyword.
    """
    survivors = []
    for doc in candidates:
        doc_keywords = set(k.lower() for k in doc.get("extracted_keywords", []))
        doc_title = doc.get("title", "").lower()
        # Hard match: at least one intent keyword must appear
        if any(kw.lower() in doc_keywords or kw.lower() in doc_title
               for kw in intent_keywords):
            survivors.append(doc)
    return survivors[:5]  # Only top 5 guaranteed-relevant chunks hit the Main AI
The beauty is in the asymmetry: cosine similarity is good at recall (finding anything related), but terrible at precision (ensuring what it finds is actually what the user meant). The Gatekeeper inverts this—it sacrifices recall in exchange for bulletproof precision.
Here is what the candidate set looked like for the Torrance query before the Gatekeeper ran:

**Raw Vector Results (8 candidates)**

- **24h Pharmacy near Torrance Blvd**: Does anyone know a pharmacy open past midnight near Torrance? I need to pick up a prescription urgently.
- **Best late-night pharmacy spots in South Bay**: Moved to the area recently. Where do you all go for late-night pharmacy runs?
- **Late-night bicycle repair — anyone open?**: My bike chain broke at 11pm. Is there a late-night repair shop still open around here?
- **Pharmacy recommendation for pet meds in Torrance**: Need a pharmacy that carries pet medications. Any recommendations in the Torrance area?
- **Late-night auto parts store near Del Amo**: Anyone know an auto parts place open late? Need brake pads urgently for a morning trip.
- **Online drug deals — WARNING scam alert**: PSA: got a sketchy DM about buying cheap drugs online. Reported to admin. Stay safe everyone.
- **Night shift workers meetup — Torrance**: Fellow night owls! Let's organize a weekend brunch meetup for all us late-shift folks.
- **Pharmacy school prep study group**: Starting a study group for pharmacy school entrance exams. DM if you're interested!
💡 TIP
The Gatekeeper runs in pure Python with zero ML dependencies. No model loading, no GPU, no numpy. It's a for loop and a set intersection. Your ML infrastructure team will love you for this.
🪂 Innovation 4: Intelligent Fallback (The Safety Net)
What happens when the Gatekeeper kills every candidate? If 0 results survive the keyword filter, it means the community simply doesn't have relevant posts yet. In V1, this returned an empty screen—devastating for UX.
Our solution: Intelligent Fallback using the LLM's parametric knowledge.
```python
if len(gatekeeper_results) == 0:
    # No community data exists. Fall back to LLM's world knowledge.
    fallback_prompt = f"""
    The user searched for: "{user_query}"
    No neighbor posts match this query yet.
    Using your general knowledge, provide a helpful, concise answer.
    Prefix your response with: "No neighbor posts yet, but here's what I know:"

    Example: "No neighbor posts yet, but the 24h CVS on Sepulveda Blvd
    is the closest late-night pharmacy to Torrance."
    """
    response = generate_with_fallback(fallback_prompt)
```
This is conceptually identical to Google's "AI Overview"—when local data doesn't exist, the system degrades gracefully into a general-knowledge assistant. The user still gets value, and we avoid the trust-destroying empty state.
| Scenario | Behavior | Latency |
|---|---|---|
| ≥1 results survive Gatekeeper | Grounded answer from community data | ~2.8s |
| 0 results survive Gatekeeper | Parametric fallback (LLM world knowledge) | ~1.5s |
| Vector DB unreachable | Direct parametric fallback | ~1.2s |
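The three rows above can be sketched as one dispatch function. The `answer_from_docs` and `parametric_fallback` names in the comments are hypothetical placeholders for the real generation calls:

```python
def choose_strategy(gatekeeper_results: list, db_reachable: bool) -> str:
    """Map the runtime state to one of the three behaviors in the table."""
    if not db_reachable:
        return "parametric_fallback"  # Vector DB unreachable (~1.2s)
    if len(gatekeeper_results) == 0:
        return "parametric_fallback"  # Nothing survived the filter (~1.5s)
    return "grounded"                 # Answer from community data (~2.8s)

# def answer(query, docs, db_reachable):
#     mode = choose_strategy(docs, db_reachable)
#     if mode == "grounded":
#         return answer_from_docs(query, docs)   # hypothetical generation call
#     return parametric_fallback(query)          # hypothetical fallback call
```

Keeping the branch logic in plain Python means the fallback path never depends on the vector DB being healthy.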
📊 Results: Before vs. After
After deploying the full Hybrid RAG pipeline with the Python Gatekeeper:
| Metric | V1 (Naive RAG) | V2 (Hybrid RAG + Gatekeeper) | Δ |
|---|---|---|---|
| End-to-end Latency | 28–35s | 2.5–3.2s | -90% |
| Hallucination Rate | ~22% | 0% (1,200 queries audited) | -100% |
| Monthly API Cost | $2,400 | $230 | -90% |
| User Retention (D7) | 12% | 38% | +217% |
| Vector Results Used | 15 (all) | 3–5 (filtered) | -73% |
The 0% hallucination claim is not theoretical—we manually audited 1,200 consecutive search queries in production. Every single grounded response was factually traceable to its source document.
🧠 Conclusion: Diet Your Logic
The industry reflex is to fight bad AI output with bigger models, longer contexts, and more expensive fine-tuning. We did the opposite: we made the pipeline skinnier.
Heavy compute belongs in the background batch. The runtime path should be a scalpel, not a sledgehammer.
The entire Kkaertalk architecture costs less to operate than a single junior engineer's monthly coffee budget. The Python Gatekeeper—15 lines of code—eliminated hallucinations entirely.
Stop fighting infrastructure with bigger models. Diet your logic. Hide heavy compute in the background; leave only ultra-fast Python filters in the runtime.
Updated 4/22/2026