People have told me my blogs are too long. I like long blogs; they don’t. Too bad. I still needed to solve this problem, and sure, I could write shorter blogs, but there’s no fun in that.

I wanted search on my blog: search the content, and put an LLM on top of it so that I (and anyone else) can just hit samitmohan.github.io/ask and ask any blog-related question.

I had never built a RAG. So I watched some Stanford videos, read some docs, and used ChatGPT to plan the project.

ChatGPT doesn’t know about samitmohan.github.io - it’s a general model that can’t answer questions about my blog. I could fine-tune a model, but that’s expensive and overkill. RAG is simpler: give the model my blog content as context at query time, no fine-tuning needed.

finetune


what this project does

The goal was to build a retrieval-augmented question-answering system that:

  • Indexes all my blog posts (17 posts, ~1100 chunks).
  • Retrieves relevant text or code chunks based on semantic similarity + intent-aware reranking + lexical boosting.
  • Generates grounded, streaming answers with citations using Groq’s llama-3.3-70b-versatile - always on, no local setup needed.
  • Measures itself with an evaluation pipeline (MRR, Recall, Precision over 20 golden queries).

architecture

GitHub Pages UI (/ask)
        |
        v
FastAPI backend (Render)
        |
        v
FAISS vector store (1111 chunks, 384-dim)
        |
        v
Intent-aware retrieval + hybrid reranking
        |
        v
Groq API (llama-3.3-70b) -- streaming SSE
        |
        v
Answer + citations streamed to browser

arch

FAISS isn’t optimal for all scales, but for 1111 chunks it works fine.

repo structure

rag-for-blogs/
├── pyproject.toml          # Pinned deps, dev extras (pytest)
├── config.py               # Centralized settings (Groq, embeddings, chunking params)
├── .env.example            # GROQ_API_KEY, GITHUB_TOKEN
├── data/
│   └── faiss/              # Pre-built FAISS index + metadata (committed to git)
├── ingestion/
│   ├── load_all_blogs.py   # Fetch posts from GitHub API
│   ├── parse_markdown.py   # Frontmatter + structure extraction
│   ├── clean_text.py       # Normalize without summarizing
│   └── chunker.py          # Overlap-aware chunking (~200 token target, 50 overlap)
├── embeddings/
│   ├── embed_chunks.py     # FastEmbed (BAAI/bge-small-en-v1.5, 384-dim)
│   └── vector_store.py     # FAISS index + metadata persistence
├── retrieval/
│   └── retrieve.py         # Intent classification, reranking, lexical boosting
├── generation/
│   ├── groq_client.py      # Groq SDK wrapper (streaming + blocking)
│   ├── answer.py           # LLM call
│   ├── answer_with_citations.py  # Streaming SSE + citation attachment
│   ├── prompt_builder.py   # System + user prompt construction
│   └── formatter.py        # Post-processing
├── eval/
│   ├── golden_set.json     # 20 hand-curated test queries
│   └── evaluate.py         # MRR@k, Recall@k, Precision@k
├── tests/                  # 32 tests (chunker, retrieval, embeddings, integration)
├── backend/
│   └── app.py              # FastAPI + streaming endpoint
└── index.html              # Frontend with SSE consumption

This is how the UI looks

ragop


phase 1: ingestion & corpus creation

Goal: Turn blog content into clean, inspectable retrieval units.

phase1 phase1step

1. loading blog data

All my blog posts already exist as Markdown files in my GitHub repository. All I had to do was write a simple load_all_blogs.py script that fetches them via the GitHub API.

2. parsing markdown

Markdown can be messy. parse_markdown.py extracts structured data:

  • Frontmatter: Title, date, categories.
  • Structure: Section headings, paragraphs, code blocks, and links.

I avoided a heavyweight Markdown AST parser and instead relied on careful regex and structure matching. This gave me more control, but it was trickier to get right than I initially expected (I had to lean on GPT).
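For illustration, frontmatter extraction with a single regex can look roughly like this (a hedged sketch of the idea, not the exact parse_markdown.py logic; `split_frontmatter` is my name for it):

```python
import re

# Matches a YAML-style frontmatter block fenced by `---` at the top of a post.
FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n", re.DOTALL)

def split_frontmatter(markdown: str):
    """Return (frontmatter_dict, body) for a markdown post.
    Illustrative sketch: only handles flat `key: value` lines."""
    match = FRONTMATTER_RE.match(markdown)
    if not match:
        return {}, markdown
    meta = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, markdown[match.end():]
```

The real script also has to pull out headings, code fences, and links, which is where the regex approach gets fiddly.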

3. cleaning without summarizing

The script clean_text.py handles normalization:

  • Normalizes whitespace.
  • Removes non-content “junk”.
  • Does not rewrite or summarize.
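A minimal sketch of what “normalize without summarizing” means in practice (assumed behaviour, not the exact clean_text.py script):

```python
import re

# Assumed cleaning rules: strip invisible junk characters, collapse runs of
# spaces/tabs, cap blank lines -- but never paraphrase the content itself.
JUNK = re.compile(r"[\u200b\u00ad]")   # zero-width space, soft hyphen
SPACES = re.compile(r"[ \t]+")

def clean(text: str) -> str:
    text = JUNK.sub("", text)
    text = SPACES.sub(" ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse 3+ newlines to one blank line
    return text.strip()
```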

parsing

RAG quality starts with data integrity. If the source data is lost or mutated, retrieval breaks. This part was pretty easy.

4. chunking (the part i got wrong the first time)

A chunk is the smallest piece of text that should be retrieved on its own to answer a question.

Chunking took me a while to get right. Your embedding model doesn’t care about your paragraph breaks. It cares about semantic density. My first chunker was dead simple - one paragraph = one chunk. The result? 2851 chunks with a median size of 16 words. Sixteen words. Barely a sentence.

If someone asks “how does KV cache work?”, and the answer spans 3 consecutive paragraphs, my old chunker would retrieve maybe 1 of those 3 paragraphs. The embedding for a 16-word chunk is noisy; there’s not enough semantic signal for FAISS to work with.

The fix was obvious once I measured it:

  1. Merge consecutive text items within a section into one buffer
  2. Split by approximate token count (~200 tokens target, 50 token overlap)
  3. Never split code blocks (they need to be intact or they’re useless)
  4. Prepend [Section: {name}] to each chunk so the embedding model knows context

Without overlap, a key sentence at a chunk boundary disappears from both chunks. With 50 tokens of overlap, boundary information shows up in both neighbors.

The core splitting logic:

def _split_text_with_overlap(text: str, target: int, overlap: int) -> List[str]:
    words = text.split()
    if len(words) <= target:
        return [text]

    chunks = []
    start = 0
    while start < len(words):
        end = start + target
        chunk_words = words[start:end]
        chunks.append(" ".join(chunk_words))
        start += target - overlap
        if start >= len(words):
            break
    return chunks

No tiktoken, no BPE, just split(). Word count is a good enough proxy for token count when you’re chunking blog posts, not training a tokenizer.
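To sanity-check the overlap behaviour, here is the splitter exercised on a 300-word dummy text (function repeated from above so the snippet runs on its own):

```python
from typing import List

def _split_text_with_overlap(text: str, target: int, overlap: int) -> List[str]:
    words = text.split()
    if len(words) <= target:
        return [text]

    chunks = []
    start = 0
    while start < len(words):
        end = start + target
        chunk_words = words[start:end]
        chunks.append(" ".join(chunk_words))
        start += target - overlap
        if start >= len(words):
            break
    return chunks

# 300 dummy words, target=200, overlap=50 -> two chunks that
# share the 50 boundary words (w150..w199).
text = " ".join(f"w{i}" for i in range(300))
chunks = _split_text_with_overlap(text, target=200, overlap=50)
```

The last 50 words of chunk 1 are exactly the first 50 words of chunk 2, which is the boundary-safety property the overlap buys you.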

Interactive: Sliding Window Chunking

Drag the slider to change overlap. Watch how chunks share boundary text (highlighted in yellow).


Result: Chunk count dropped from 2851 to 1111. Same 17 blog posts, each chunk now has enough context for the embedding model to produce a meaningful vector. Retrieval quality jumped - numbers in Phase 5.

chunk


phase 2: embeddings & vector store

  • Store chunks in a searchable vector index. For RAG, chunks are stored as vectors where semantically related text clusters together - the vectors for “king” and “queen” land nearby because they’re related.
  • Use cosine similarity to measure distance between a query vector and stored chunk vectors, assigning a score to each.

embeddings

I started with the Sentence Transformers library for generating embeddings:

SentenceTransformerEmbeddings("all-MiniLM-L6-v2")
  • 384-dimensional vectors, i.e. each chunk becomes a single point in 384-dimensional space (generally, more dimensions = more expressive vector space = better semantic search, though with diminishing returns and increased compute cost)
  • Fast inference
  • Runs locally
  • Excellent performance for semantic search

Both text and code use the same model, flagged via metadata to distinguish them during retrieval. I prefix text chunks with "Explain the following text:\n" and code chunks with "Code example:\n" before embedding. Same model, but the prefix biases the embedding vector toward the right semantic neighborhood.

Update: SentenceTransformers didn’t work out in the long run. I wanted others to use my /ask feature, and while creating query embeddings with SentenceTransformerEmbeddings is fast, it takes too much memory (my backend is deployed on Render, and it kept crashing).

So I switched to fastembed TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

It’s fast and lightweight - the ONNX weights are only ~200 MB, so it doesn’t eat much memory (deployable on Render).

Both models work the same way: they take the chunks you just created -> convert them into vector embeddings -> store them in a database so you can search with cosine similarity.

Embedding

vector store (FAISS)

search

Store these vector embeddings in a database.

I use FAISS (Facebook AI Similarity Search) to store the vectors efficiently. Since FAISS only stores the vectors themselves, I maintain a parallel metadata store:

vector  -->  embedding (384-dim, L2-normalized)
payload -->  { id, text, metadata: {post_title, section, date, url, type} }

I normalize vectors before adding them to FAISS. L2 distance then becomes equivalent to cosine distance, which is what you want for semantic similarity. Without normalization, a longer chunk could appear “closer” due to larger embedding magnitude, not relevance.
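The equivalence is easy to verify in pure Python: for unit vectors, squared L2 distance is exactly 2 − 2·cosine, so ranking by either gives the same order (a toy sketch, independent of FAISS):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (what I do before adding to FAISS)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 1.0, 0.5])
# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b = 2 - 2*cos(a, b)
```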

Query Example:

  • Query: “Why do we use softmax in MNIST?”
  • Returns:
    • Distance: 0.29 (the lower the score, the more similar the stored chunk is to the query)
    • Chunk: “Softmax is used in MNIST classification…”

search2 search3

At this point: blog documents -> cleaned -> chunks -> embeddings -> searchable database.

The query/question is converted into an embedding the same way.

The two are compared with cosine similarity: a lower distance means closer, but the lowest-distance chunk isn’t guaranteed to be the correct answer.

Interactive: Cosine Similarity

Drag the blue vector (chunk) to change its angle relative to the red vector (query). Cosine similarity = 1 when vectors point the same way, 0 when perpendicular, -1 when opposite.

Cosine similarity alone isn’t enough. That’s where the retrieval pipeline comes in.


phase 3: retrieval control & reranking

retrieval

Raw semantic search isn’t always enough. User intent varies.

  • If I ask “show me code” - I want code blocks.
  • If I ask “why does this work” - I want conceptual explanations.

To handle this, retrieve.py implements a multi-stage pipeline:

query
  |
  v
intent classification (code indicators vs text indicators)
  |
  v
embed query (same BGE model)
  |
  v
FAISS search (top 10 candidates)
  |
  v
reranking:
  - intent match:   distance *= 0.5 (boost)
  - intent mismatch: distance *= 2.0 (penalize)
  - section name match: distance *= 0.8
  - lexical overlap:  distance *= (1.0 - 0.3 * coverage_ratio)
  - exact phrase match: distance *= 0.8
  - drop chunks < 5 words
  |
  v
deduplicate (one chunk per post+section)
  |
  v
confidence check (best distance < 1.0)
  |
  v
top-k chunks

Interactive: Reranking Pipeline

Query: "show me the attention code"   Detected intent: code

Click each step to see how retrieval scores change. Yellow rows = changed in this step. Lower distance = more relevant.

Pure vector search returns “semantically similar” results, but that’s not always what you want. Someone asking “show me the code for attention” doesn’t want a paragraph explaining attention - they want the actual class SelfAttention(nn.Module) block. The intent classifier is dead simple (just keyword matching), but the 0.5x boost / 2.0x penalty makes a real difference.

The lexical overlap part is a poor man’s BM25. I count what fraction of query words appear in the chunk text, and boost accordingly. If every query word appears in the chunk, distance gets multiplied by 0.7. If you search for “softmax” and a chunk literally contains “softmax”, it should float to the top even if some other chunk is slightly closer in embedding space.
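A condensed sketch of those multipliers in action (helper names like `query_intent` and `coverage` are illustrative, not the actual retrieve.py API):

```python
# Hypothetical keyword list -- the real classifier's indicators may differ.
CODE_WORDS = {"code", "snippet", "implementation", "function", "class"}

def query_intent(query: str) -> str:
    """Dead-simple keyword matching, as described above."""
    return "code" if set(query.lower().split()) & CODE_WORDS else "text"

def coverage(query: str, chunk_text: str) -> float:
    """Fraction of query words that literally appear in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk_text.lower().split())) / len(q) if q else 0.0

def rerank(query: str, candidates: list) -> list:
    intent = query_intent(query)
    kept = []
    for cand in candidates:
        if len(cand["text"].split()) < 5:               # drop tiny chunks
            continue
        d = cand["distance"]
        d *= 0.5 if cand["type"] == intent else 2.0     # intent boost / penalty
        if any(w in cand["section"].lower() for w in query.lower().split()):
            d *= 0.8                                    # section-name match
        d *= 1.0 - 0.3 * coverage(query, cand["text"])  # lexical boost
        if query.lower() in cand["text"].lower():
            d *= 0.8                                    # exact phrase match
        kept.append({**cand, "distance": d})
    return sorted(kept, key=lambda c: c["distance"])
```

Run the “show me the attention code” example through it and the code chunk wins even though its raw FAISS distance started out worse.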

Filtering by intent type cuts a lot of irrelevant results before the LLM ever sees them.

retrieval2


phase 4: prompting & generation

Simple RAG just dumps chunks to the LLM and hopes for the best. That works poorly.

simpleLLM

Adding a system prompt that explains the context and gives answer instructions makes a real difference.

prompt construction

Context is king. I explicitly tell the LLM what each retrieved chunk represents to avoid confusion:

[TEXT | source=ai_engineering_KV_Cache_0]
KV caching stores previously computed key-value pairs so the model
doesn't recompute attention for prior tokens during autoregressive generation...

[CODE | source=math_for_ai_Self_Attention_Code_0]
class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        ...

Then I append the user’s question. This simple formatting prevents the model from “guessing” the context type. It also gives the model source IDs it can reference, which maps directly to the citations we return.
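Assembling those labelled context blocks is just string formatting; a sketch (`build_user_prompt` is illustrative, the real prompt_builder.py may differ):

```python
def build_user_prompt(question: str, chunks: list) -> str:
    """Format retrieved chunks as [TYPE | source=...] blocks, then the question.
    Sketch of the idea, not the repo's exact prompt layout."""
    blocks = []
    for chunk in chunks:
        tag = "CODE" if chunk["type"] == "code" else "TEXT"
        blocks.append(f"[{tag} | source={chunk['source']}]\n{chunk['text']}")
    context = "\n\n".join(blocks)
    return f"{context}\n\nQuestion: {question}"
```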

promptLLM

generation (Groq + streaming)

This was the biggest pain point of v1. I was running Ollama locally with qwen2.5:7b, which worked great on my laptop but was completely useless for anyone else. The LLM was offline ~90% of the time because it required my laptop to be running + tunneled through ngrok. Not a great look.

The fix was embarrassingly simple: Groq. Free tier, fast inference, always available. I swapped Ollama for Groq’s API running llama-3.3-70b-versatile (10x the parameters of my old local model) and the whole deployment headache just… disappeared.

from groq import Groq

client = Groq(api_key=GROQ_API_KEY)

def generate_stream(prompt):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["user"]},
        ],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

The other big upgrade: streaming. In v1, the user clicked search, waited 5-10 seconds staring at “workin…”, then got the full answer dumped at once. Now the frontend consumes Server-Sent Events and renders tokens as they arrive, with a cursor animation. Feels like ChatGPT.

The SSE format is simple:

data: {"token": "KV"}
data: {"token": " caching"}
data: {"token": " stores"}
...
data: {"done": true, "citations": [...]}

The frontend uses ReadableStream to consume this, debounces markdown re-rendering at ~80ms intervals (so we’re not calling marked.parse() on every single token), and removes the streaming cursor when done arrives.
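The same stream is easy to consume outside the browser too; a minimal Python sketch of parsing those `data:` lines, handy for testing the endpoint (`parse_sse` is my name for it, not part of the repo):

```python
import json

def parse_sse(lines):
    """Collect token events into the answer text and grab the final
    citations payload. Test-client sketch, not the actual frontend code."""
    tokens, citations = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue  # SSE allows comments and blank keep-alive lines
        event = json.loads(line[len("data: "):])
        if event.get("done"):
            citations = event.get("citations", [])
        elif "token" in event:
            tokens.append(event["token"])
    return "".join(tokens), citations
```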

citations & grounding

The system is designed to never hallucinate sources. Instead, it:

  • Tracks exactly which chunks were used to generate the answer.
  • Returns specific citations alongside the answer.

Example Response:

Answer: Softmax converts logits into probabilities…

Sources: 600+ leetcode questions: lessons_intro_0 (text, section=intro)

This was fairly easy since I store metadata alongside the vectors: whichever chunks produce the answer, their sources get returned with it.


phase 5: measuring retrieval quality

This is the part most RAG tutorials skip, and it’s the part that matters most. You can build the prettiest pipeline in the world, but if you can’t measure whether it actually retrieves the right chunks, you’re just vibing.

I built an evaluation pipeline (eval/evaluate.py) that:

  1. Loads 20 hand-written test queries from eval/golden_set.json
  2. Each query has ground-truth annotations: which blog post should be retrieved, which section, what intent
  3. Runs each query through the full retrieval pipeline (no LLM, just the retriever)
  4. Computes three standard IR metrics: MRR@k, Recall@k, Precision@k

MRR (Mean Reciprocal Rank): where does the first correct result appear? If it’s #1, score is 1.0. If it’s #3, score is 0.33. If it’s not in top-k at all, score is 0.

Recall@k: for what fraction of queries did we get at least one correct result in top k?

Precision@k: of the top k results, what fraction are actually relevant?
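All three metrics are a few lines each; a sketch consistent with the definitions above (evaluate.py’s actual signatures may differ):

```python
def mrr_at_k(results, relevant, k=5):
    """Reciprocal rank of the first relevant result in the top k (0 if none)."""
    for rank, r in enumerate(results[:k], start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(results, relevant, k=5):
    """1.0 if at least one relevant result is in the top k, else 0.0.
    Averaged over queries this becomes the fraction reported above."""
    return 1.0 if any(r in relevant for r in results[:k]) else 0.0

def precision_at_k(results, relevant, k=5):
    """Fraction of the top k results that are actually relevant."""
    top = results[:k]
    return sum(r in relevant for r in top) / len(top) if top else 0.0
```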

The golden set:

{
  "query": "How does KV cache work in transformers?",
  "relevant_posts": ["AI engineering"],
  "relevant_sections": ["KV Cache", "KV Caching: Core Idea"],
  "intent": "text"
}

Writing these queries is the hard part. You have to actually know your own content well enough to say “this query should hit this section of this post.” I wrote 20 of them covering different posts, intents, and difficulty levels.

the results

This is what made the chunking rewrite worth it. Same embedding model, same retrieval logic, same everything - only difference is how the text gets chunked.

Retrieval Metrics: Old Chunking vs New Chunking

The numbers:

============================================================
  Retrieval Evaluation  |  20 queries
============================================================
Metric              @1       @3       @5
-----------------------------------------
MRR              0.450    0.558    0.571
Recall           0.450    0.700    0.750
Precision        0.450    0.250    0.180
-----------------------------------------

MRR@1 went from 0.10 to 0.45 - that’s a 4.5x improvement. Meaning the correct result went from almost never being #1 to being #1 nearly half the time.

Recall@5 went from 0.45 to 0.75 - three out of four queries now get a relevant result in the top 5. On the old chunking, more than half the queries missed entirely.

The 5 remaining misses are queries like “What is RLHF?” where the relevant content exists in a larger merged chunk with a section title that doesn’t exactly match. Fixable with better section-level metadata, but 75% recall is solid for a personal blog search engine.

The lesson here is what Karpathy keeps saying: don’t just train, evaluate. The eval pipeline is 150 lines of code and a JSON file with 20 entries. It took maybe an hour to write. But it’s the difference between “I think my RAG works” and “I can show you that MRR@1 improved by 4.5x when I changed chunking strategy.” One of these is a side project, the other is engineering.


persistence & deployment

persistence

Since this is deployed on my website, it shouldn’t re-index and re-embed all my blogs every time a user visits samitmohan.github.io/ask. I do it once and store the result in the /data folder.

The FAISS index and metadata are persisted to disk and committed to git. On restart:

  • No re-embedding required.
  • No re-indexing required.
  • Instant retrieval.

deployment (the v1 struggle and the fix)

v1 was a mess deployment-wise. I had this Rube Goldberg setup:

  • Backend on Render (free tier) - handles retrieval, always on
  • LLM on my laptop via Ollama - only works when my laptop is open
  • ngrok tunnel connecting my laptop’s Ollama to the Render backend

forwarding

It technically worked but the LLM was offline 90% of the time. Not useful for anyone who isn’t me, sitting at my desk, with Ollama running.

v2 is much simpler: Backend on Render, LLM on Groq. Both always on. No tunnels, no local dependencies, no dual-mode architecture. Just set GROQ_API_KEY as an env var on Render and you’re done.

render

The streaming endpoint (/query/stream) does retrieval and generation in one request, streaming tokens back as SSE. The frontend handles the rest. No websockets, no polling, just a standard HTTP response with text/event-stream content type.

@app.post("/query/stream")
def query_stream_endpoint(req: RetrieveRequest):
    retrieved = retrieve(req.query, vector_store, k=req.top_k)
    if not is_confident(retrieved):
        def empty_stream():
            yield f"data: {json.dumps({'done': True, 'citations': [], 'no_results': True})}\n\n"
        return StreamingResponse(empty_stream(), media_type="text/event-stream")

    chunks = [r["chunk"] for r in retrieved]
    return StreamingResponse(
        answer_with_citations_stream(req.query, chunks),
        media_type="text/event-stream",
    )

Final Pipeline:

flow

  • GitHub Pages UI captures input.
  • FastAPI backend processes the request.
  • FAISS performs retrieval.
  • Reranker optimizes results based on intent + lexical overlap.
  • Groq (llama-3.3-70b) streams the response.
  • UI renders tokens as they arrive + source cards.

testing

32 tests across 4 files. Not aiming for 100% coverage, just testing the parts that matter for an ML pipeline:

  • Chunker tests (13): metadata correctness, size bounds, overlap verification, code block integrity, edge cases (empty input, consecutive merging, text-code-text sequences)
  • Retrieval tests (11): intent classification, confidence thresholding, code heuristic detection, reranking boosts correct intent, deduplication
  • Embedding tests (4): deterministic output, 384-dim verification, empty input handling, different prefixes produce different embeddings
  • Integration tests (2): end-to-end ingest-embed-query pipeline with synthetic docs, confidence filtering on unrelated queries

uv run python -m pytest tests/ -v
# 32 passed in 0.31s

The slow tests (anything that loads the embedding model) are marked with @pytest.mark.slow so you can skip them during development with -m "not slow".


final thoughts

Lessons learnt about building RAG:

  • RAG is mostly data engineering, not model selection. The chunking rewrite improved retrieval by 4.5x. The embedding model stayed the same.
  • Measure everything. The eval pipeline is the most valuable file in the repo. Without it, the chunking rewrite would have been a guess.
  • Chunking strategy matters more than the embedding model. 16-word chunks with a perfect embedding model will still lose to 200-word chunks with a mediocre one.
  • Intent control beats larger embeddings. A simple keyword-based intent classifier + distance multipliers outperforms just throwing more dimensions at the problem.
  • Don’t over-engineer deployment. v1 had tunnels, dual modes, and local LLM. v2 has one env var. Use hosted inference APIs when they’re free and good enough.

You can test it at samitmohan.github.io/ask

Source code: github.com/samitmohan/rag-for-blogs

# Run it locally
git clone https://github.com/samitmohan/rag-for-blogs
cd rag-for-blogs
export GROQ_API_KEY=gsk_your_key_here
uv sync --all-extras
uv run uvicorn backend.app:app --reload --port 8000
# open localhost:8000

demo

appendix: how to measure RAG accuracy

accuracy

The eval pipeline in this repo measures retrieval quality, not generation quality. That’s intentional - retrieval is the bottleneck. If you retrieve the right chunks, a 70B model will almost certainly generate a good answer. If you retrieve garbage, no amount of prompt engineering will save you.

To measure generation quality you’d need human evaluation or an LLM-as-judge setup (feed the answer + ground truth to GPT-4 and ask it to score). That’s a project for another day.

  • Another, much simpler project (a Customer Support Bot) that I built to learn RAG basics (LangChain for everything) can be found here
