Most RAG tutorials hand you a LangChain one-liner and call it a day. That’s fine for prototyping, but it leaves you blind to the decisions happening under the hood. What chunking strategy is it using? How is similarity actually computed? What does the prompt look like before it hits the LLM?

I wanted to understand those decisions, so I built a RAG pipeline from scratch using Wikipedia articles as a knowledge base, a local embedding model, FAISS for vector search, and Claude for generation. No frameworks, no orchestration layers. Every component is explicit.

This post walks through the architecture choices I made and the tradeoffs behind each one.

The full notebook is available on my GitHub.

The pipeline at a glance

The system has six stages:

  1. Ingest 15 Wikipedia articles on AI/ML topics via the MediaWiki API
  2. Chunk each article into ~1,024-character segments at sentence boundaries
  3. Embed each chunk using all-MiniLM-L6-v2 (a local sentence-transformer)
  4. Index the embeddings in a FAISS flat inner-product index
  5. Retrieve the top-K chunks most similar to a user query
  6. Generate an answer via Claude, grounded in the retrieved context

Each stage involves a meaningful design choice. Let me go through them.

Data ingestion: why the MediaWiki API instead of a HuggingFace dump

The wikimedia/wikipedia dataset on HuggingFace is a common starting point, but the full English dump is roughly 20 GB. Even streaming it and filtering by title match is slow and wasteful when you only need 15 specific articles.

Instead, I fetch articles directly from the MediaWiki API by exact title. This gives clean plaintext extracts, respects Wikipedia’s rate limits with a polite User-Agent header, and keeps the data loading step under a minute. The tradeoff is that you need to know your article titles upfront, but for a curated domain-specific corpus, that’s exactly what you want.
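The fetch itself is a single GET per title. Here's a minimal sketch using only the standard library; the User-Agent string and function names are my illustration, not the notebook's exact code:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def build_url(title: str) -> str:
    """Build the API URL for a plaintext extract of one article by exact title."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",      # TextExtracts extension
        "explaintext": "1",      # return plaintext, not HTML
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def fetch_article(title: str) -> str:
    """Fetch one article's plaintext (network call; not run at import time)."""
    req = urllib.request.Request(
        build_url(title),
        headers={"User-Agent": "rag-demo/0.1 (contact: you@example.com)"},  # polite UA
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        pages = json.load(resp)["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```

Fifteen calls like this finish in seconds, versus streaming gigabytes to find the same text.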

Chunking: sentence-boundary splitting

Chunking is where a lot of RAG pipelines quietly go wrong. The simplest approach is to split every N characters, but that cuts mid-sentence, which means your chunks contain incomplete thoughts. When those fragments get retrieved, the LLM has to work with broken context.

My approach splits on sentence boundaries. The function uses a regex to identify sentence endings (periods, exclamation marks, question marks followed by whitespace), then greedily accumulates sentences until the next one would push the chunk past 1,024 characters. This produces chunks that average around 1,163 characters, each one containing complete sentences.

import re

# Split on sentence endings (., !, ?) followed by whitespace.
sentences = re.split(r'(?<=[.!?]) +', text)
chunks, current = [], ""

for sentence in sentences:
    if len(current) + len(sentence) <= chunk_size:
        current += " " + sentence
    else:
        if current.strip():
            chunks.append(current.strip())
        current = sentence

# Don't drop the final, partially filled chunk.
if current.strip():
    chunks.append(current.strip())

There are more sophisticated options here. Token-based splitting would align chunk size with the embedding model’s context window. Overlapping windows would reduce the chance of splitting a relevant passage across two chunks. Semantic chunking groups text by topic shifts rather than length. Each of these adds complexity. For 15 articles and 625 total chunks, sentence-boundary splitting is a reasonable baseline that keeps the pipeline easy to reason about.

Embedding model: all-MiniLM-L6-v2

I chose sentence-transformers/all-MiniLM-L6-v2 for a few reasons. It runs entirely on CPU with no API calls or costs. It produces 384-dimensional vectors, which are compact enough to keep the FAISS index small while still capturing meaningful semantic relationships. And it’s widely benchmarked, so its behavior is well-understood.

The embedding step encodes all 625 chunks in batches of 64. On a Colab CPU runtime, this takes about a minute. Each chunk becomes a 384-dimensional float vector that positions it in a semantic space where related concepts are nearby.

FAISS indexing: why normalize + inner product

This is the part that trips people up if they haven’t looked under the hood. FAISS offers several index types, and the choice matters.

I use IndexFlatIP, a flat (brute-force) inner-product index. On its own, inner product is not the same as cosine similarity. But there’s a trick: if you L2-normalize every vector to unit length first, then the inner product between any two vectors equals their cosine similarity. This is because for unit vectors a and b, the dot product a·b = ‖a‖ ‖b‖ cos(θ) simplifies to just cos(θ).

faiss.normalize_L2(embeddings_matrix)
index = faiss.IndexFlatIP(embedding_dim)
index.add(embeddings_matrix)
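You can verify the normalize-then-dot-product identity in a few lines of numpy, independently of FAISS:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# Cosine similarity from the definition: a.b / (|a| |b|)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product after L2-normalizing each vector to unit length
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
inner = a_unit @ b_unit

assert np.isclose(cosine, inner)  # identical up to float rounding
```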

Why not use IndexFlatL2 (Euclidean distance) instead? Cosine similarity measures the angle between vectors, ignoring magnitude. This matters because embedding models don’t guarantee that all vectors have the same norm. Two chunks about the same topic could have different magnitudes depending on text length or vocabulary. Cosine similarity treats them as equivalent; Euclidean distance does not.

The “flat” part means the index does exact search: it compares the query against every stored vector. For 625 vectors this is instant. If the corpus grew to millions of vectors, I’d switch to an approximate index like HNSW (hierarchical navigable small world) or IVF (inverted file index), which trade a small accuracy loss for orders-of-magnitude faster lookup.

Retrieval: top-K and what the scores mean

The retrieval function embeds the user query, normalizes it, and calls index.search() to get the K nearest vectors. The returned scores are cosine similarities ranging from -1 to 1 (though in practice, for text embeddings, they’re usually between 0 and 0.8).
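Conceptually, index.search on a flat inner-product index is just a matrix-vector product followed by a top-K sort. A numpy equivalent (variable names are mine, not the notebook's):

```python
import numpy as np

def top_k(chunk_vecs: np.ndarray, query_vec: np.ndarray, k: int = 3):
    """chunk_vecs: (n, d), L2-normalized; query_vec: (d,), L2-normalized."""
    scores = chunk_vecs @ query_vec   # cosine similarity against every chunk
    idx = np.argsort(-scores)[:k]     # indices of the k highest scores
    return idx, scores[idx]

# Toy example with 2-D unit vectors; the query is closest to chunk 0.
chunks = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
query = np.array([0.9, 0.1])
query /= np.linalg.norm(query)
idx, scores = top_k(chunks, query, k=2)  # idx -> [0, 2]
```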

I set K=3, meaning the LLM gets three chunks as context. This is a balance: too few chunks and you risk missing relevant information; too many and you dilute the context with marginally related text, which can confuse the generation step or simply waste tokens.

A sanity check in the notebook confirms the similarity metric works as expected:

  • “What is a transformer model?” vs. a relevant definition: 0.58 cosine similarity
  • The same question vs. “The Eiffel Tower is in Paris”: 0.05

That order-of-magnitude difference in scores gives confidence that the embedding space is doing its job.

Prompt design: grounding the LLM

The generation step uses a system prompt that constrains Claude to answer only from the provided context:

You are a knowledgeable assistant that answers questions
strictly based on the provided context.
If the answer cannot be found in the context, say:
"I don't have enough information in the provided context
to answer this."
Do not use any external knowledge beyond what is given.

This is the core value proposition of RAG. Without this grounding instruction, the LLM would happily answer from its training data, which might be outdated or wrong for your specific domain. With it, the model is forced to work with the evidence you’ve retrieved, and it’s explicitly told to admit when the context doesn’t contain an answer.
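The user message pairs the retrieved chunks with the question. A minimal sketch of that assembly step (the formatting and names are my illustration, not the notebook's exact code):

```python
def build_user_message(query: str, retrieved_chunks: list[str]) -> str:
    """Concatenate numbered context chunks and the question into one message."""
    context = "\n\n".join(
        f"[Context {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# This message is then sent to Claude alongside the grounding system prompt.
```

Numbering the chunks makes it easy to ask the model to cite which context block supported its answer.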

The off-topic test demonstrates this working: when asked “What is the capital of France?”, the system correctly declines to answer rather than pulling from general knowledge.

What I’d change in a production system

This pipeline is deliberately minimal. Here’s what I’d add for production use:

Overlapping chunks. A 100-character overlap between consecutive chunks ensures that passages near chunk boundaries aren’t lost. This is a low-cost improvement with meaningful retrieval gains.
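As a character-level sketch of what overlap adds, assuming a simple sliding window (a production version would still respect sentence boundaries):

```python
def chunk_with_overlap(text: str, chunk_size: int = 1024, overlap: int = 100) -> list[str]:
    """Slide a chunk_size window over the text, stepping chunk_size - overlap,
    so consecutive chunks share `overlap` characters at their boundary."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

A passage that straddles a boundary now appears intact in at least one of the two overlapping chunks.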

A re-ranking stage. Bi-encoder embedding search is fast but imprecise. Adding a cross-encoder re-ranker (like cross-encoder/ms-marco-MiniLM-L-6-v2) as a second pass scores each query-chunk pair jointly, which is slower but more accurate. The typical pattern is to retrieve 10-20 candidates with FAISS, then re-rank to the final top-3.
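The retrieve-then-rerank pattern can be sketched with the scoring model injected as a callable, so a real cross-encoder's predict method can be dropped in later; the wiring below is my sketch, not code from the notebook:

```python
def rerank(query: str, candidates: list[str], scorer, final_k: int = 3) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the final_k best."""
    scores = scorer([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:final_k]]

# In the real pipeline, scorer would be something like:
#   CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict
```

FAISS supplies 10-20 cheap candidates; the cross-encoder pass then pays the joint-scoring cost only on that shortlist.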

Evaluation metrics. I eyeballed the results here, which doesn’t scale. RAGAS or similar frameworks can measure faithfulness (does the answer match the context?), answer relevance (does it address the question?), and context recall (did retrieval find the right chunks?). These metrics turn qualitative judgment into quantifiable signals.

Approximate FAISS indexes. IndexFlatIP is fine at this scale, but if the corpus grew to hundreds of thousands of documents, switching to IndexIVFFlat or IndexHNSWFlat would keep query latency low.

Takeaway

Building RAG from scratch forces you to confront decisions that frameworks hide. Why this chunk size? Why cosine over Euclidean? Why three retrieved chunks instead of five? When you control every component, you can reason about failure modes: bad retrieval, diluted context, prompt leakage. That understanding is what separates someone who can use a RAG framework from someone who can debug and improve one.

The full notebook with runnable code is on my GitHub.


Saylee Pradhan

Software engineer turned AI specialist, exploring the intersection of code quality, LLM evaluation, and intelligent system design.