How to Ship a Production
Most LLM applications look great in a high-fidelity demo. Then they reach real users and start failing in predictable yet damaging ways.
They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.
In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).
Why RAG Alone Does Not Equal Production-Ready
The Architecture You Are Building
Project Setup and Structure
How to Build the RAG Layer with FAISS
How to Add the LLM Call with Structured Output
How to Add Guardrails: Retrieval Gate and Fallbacks
FastAPI App: Creating the /answer Endpoint
How to Add Beginner-Friendly Evals
What to Improve Next: Realistic Upgrades
Why RAG Alone Does Not Equal Production-Ready
Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.
Production issues usually arise from the silent failures in the system surrounding the model:
Weak retrieval: If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.
Lack of visibility: Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.
Fragility: A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.
No regression testing: In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.
We’ll solve each of these issues systematically in this guide.
Prerequisites
This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.
Knowledge
You should be comfortable with:
Python fundamentals (functions, modules, virtual environments)
Basic HTTP + JSON (requests, response payloads)
APIs with FastAPI (what an endpoint is and how to run a server)
High-level LLM concepts (prompting, temperature, structured outputs)
Tools + Accounts
You’ll need:
Python 3.10+
A working OpenAI-compatible API key (OpenAI or any provider that supports the same request/response shape)
A local environment where you can run a FastAPI app (Mac/Linux/Windows)
What This Tutorial Covers (and What It Doesn’t)
We’ll build a production-minded baseline:
A FAISS-backed retriever with a persisted index + metadata
A retrieval gate to prevent “forced hallucination”
Structured JSON outputs so your backend is stable
Fallback behavior for timeouts and provider errors
A small eval harness to prevent regressions
We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, or background jobs. These appear only in the roadmap at the end.
The Architecture You Are Building
The flow of our application follows a disciplined path so every answer is grounded in evidence:
User query: The user submits a question via a FastAPI endpoint.
Retrieval: The system embeds the question and retrieves the top-k most similar document chunks.
The retrieval gate: We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.
Augmentation and generation: If the gate passes, we send a context-augmented prompt to the LLM.
Structured response: The model returns a JSON object containing the answer, sources used, and a confidence level.
Project Setup and Structure
To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.
Project Structure
    .
    ├── app.py           # FastAPI entry point and API logic
    ├── rag.py           # FAISS index, persistence, and document retrieval
    ├── llm.py           # LLM API interface and JSON parsing
    ├── prompts.py       # Centralized prompt templates
    ├── data/            # Source .txt documents
    ├── index/           # Persisted FAISS index and metadata
    └── evals/           # Evaluation dataset and runner script
        ├── eval_set.json
        └── run_evals.py
Install Dependencies
First, create a virtual environment to isolate your project:
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv
Configure the Environment
Create a .env file in the root directory. We are targeting OpenAI-compatible providers:
    OPENAI_API_KEY=your_actual_api_key_here
    OPENAI_BASE_URL=https://api.openai.com/v1
    OPENAI_MODEL=gpt-4o-mini
Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().
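The three variables above can be checked once at startup so a missing key fails loudly instead of surfacing later as a confusing 401. The following is a minimal sketch, not part of the tutorial's files: the Settings class and load_settings() helper are hypothetical names, and the defaults simply mirror the example .env values.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three variables and fail fast if the API key is missing."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; check your .env file.")
    # Strip a trailing slash so f"{base_url}/embeddings" never contains "//".
    base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1").rstrip("/")
    model = env.get("OPENAI_MODEL", "gpt-4o-mini")
    return Settings(api_key=api_key, base_url=base_url, model=model)
```

Passing a plain dict instead of os.environ makes the helper easy to exercise in tests without touching the real environment.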
How to Build the RAG Layer with FAISS
In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.
What is FAISS (and What Does It Do)?
FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:
“Given this question embedding, which document chunks are closest to it?”
In this tutorial, we use IndexFlatIP (an exact inner-product index) and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product equals cosine similarity, giving us a stable score between -1 and 1 that we can use for a retrieval gate.
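To see why normalisation turns the inner product into cosine similarity, here is a small standard-library sketch. It does not use FAISS; l2_normalize() is an illustrative stand-in for what faiss.normalize_L2 does to each row, and the two vectors are made-up values.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Textbook cosine similarity: dot product divided by both lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


query = [3.0, 4.0]
doc = [4.0, 3.0]

# Inner product of the normalised vectors...
score = inner_product(l2_normalize(query), l2_normalize(doc))
# ...matches the cosine similarity of the raw vectors.
print(round(score, 4), round(cosine(query, doc), 4))  # 0.96 0.96
```

Because the score no longer depends on vector length, a single threshold can be applied to every query.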
Chunking Strategy With Overlap
We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half and lose its meaning. With an overlap of, for example, 200 characters, the end of one chunk and the beginning of the next share context.
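The effect of overlap is easier to see at a tiny scale. This standalone sketch uses the same sliding-window idea with a window of 10 characters and an overlap of 4; sliding_chunks() is an illustrative name, not a function from the tutorial's files.

```python
from typing import List


def sliding_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    step = max(1, size - overlap)  # how far the window start moves each time
    return [text[i : i + size] for i in range(0, len(text), step)]


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", size=10, overlap=4)
print(chunks[:3])  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv']

# The last 4 characters of one chunk reappear at the start of the next.
print(chunks[0][-4:], chunks[1][:4])  # ghij ghij
```

A larger overlap preserves more context across boundaries but produces more chunks to embed and store.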
Implementation of rag.py
    import json
    import os
    from typing import Dict, List

    import faiss
    import numpy as np
    import requests
    from dotenv import load_dotenv

    load_dotenv()

    INDEX_PATH = "index/faiss.index"
    META_PATH = "index/meta.json"


    def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
        # Slide a window of `size` characters, advancing by size - overlap each step
        chunks = []
        step = max(1, size - overlap)
        for i in range(0, len(text), step):
            chunk = text[i : i + size].strip()
            if chunk:
                chunks.append(chunk)
        return chunks


    def embed_texts(texts: List[str]) -> np.ndarray:
        # Note: If your provider is not OpenAI-compatible, change this URL and headers
        url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
        headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
        payload = {"input": texts, "model": "text-embedding-3-small"}
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        # If your provider uses a different response format, change the line below
        vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
        return vectors


    def build_index() -> None:
        all_chunks: List[str] = []
        metadata: List[Dict] = []

        # Nothing to index yet: create the folder so the user can add documents
        if not os.path.exists("data"):
            os.makedirs("data")
            return

        for file in os.listdir("data"):
            if not file.endswith(".txt"):
                continue
            with open(f"data/{file}", "r", encoding="utf-8") as f:
                text = f.read()
            chunks = chunk_text(text)
            all_chunks.extend(chunks)
            # metadata[i] describes the vector stored at position i in the index
            for c in chunks:
                metadata.append({"source": file, "text": c})

        if not all_chunks:
            return

        embeddings = embed_texts(all_chunks)
        faiss.normalize_L2(embeddings)  # in place; inner product now equals cosine similarity
        dim = embeddings.shape[1]
        index = faiss.IndexFlatIP(dim)
        index.add(embeddings)

        os.makedirs("index", exist_ok=True)
        faiss.write_index(index, INDEX_PATH)
        with open(META_PATH, "w", encoding="utf-8") as f:
            json.dump(metadata, f, ensure_ascii=False)


    def load_index():
        if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
            raise FileNotFoundError(
                "FAISS index not found. Add .txt files to data/ and run build_index()."
            )
        index = faiss.read_index(INDEX_PATH)
        with open(META_PATH, "r", encoding="utf-8") as f:
            metadata = json.load(f)
        return index, metadata


    def retrieve(query: str, k: int = 5) -> List[Dict]:
        # The index is reloaded from disk on every call: simple, but slow for large indexes
        index, metadata = load_index()
        q_emb = embed_texts([query])
        faiss.normalize_L2(q_emb)
        scores, ids = index.search(q_emb, k)

        results = []
        for score, idx in zip(scores[0], ids[0]):
            if idx == -1:  # FAISS pads with -1 when fewer than k vectors exist
                continue
            m = metadata[idx]
            results.append(
                {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
            )
        return results


    # Convenience entry point: `python rag.py` builds the index from data/
    if __name__ == "__main__":
        build_index()
How to Add the LLM Call with Structured Output
A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.
We solve this with structured output: instruct the model to return a strict JSON object, then parse it safely.
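"Parse it safely" can go one step beyond json.loads: check that the result is an object and that each field has the expected type. The sketch below is one possible validator, assuming the four-key response shape used in this tutorial; parse_model_json() is a hypothetical helper, not part of llm.py.

```python
import json
from typing import Any, Dict

ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def parse_model_json(raw: str) -> Dict[str, Any]:
    """Parse model output into the expected shape, or return a refusal."""
    refusal = {"answer": "", "refusal": True, "confidence": "low", "sources": []}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return refusal  # conversational filler or truncated JSON
    if not isinstance(data, dict):
        return refusal  # valid JSON, but not an object

    answer = data.get("answer")
    confidence = data.get("confidence")
    sources = data.get("sources")
    return {
        "answer": answer if isinstance(answer, str) else "",
        # Only a real JSON `true` counts; the string "false" must not.
        "refusal": data.get("refusal") is True,
        "confidence": (
            confidence
            if isinstance(confidence, str) and confidence in ALLOWED_CONFIDENCE
            else "low"
        ),
        # Drop anything that is not a string so callers can rely on the type.
        "sources": [s for s in sources if isinstance(s, str)]
        if isinstance(sources, list)
        else [],
    }
```

Whatever the model sends back, the caller always receives the same four keys with predictable types.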
Implementation of llm.py
    import json
    import os
    from typing import Any, Dict

    import requests


    def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
        # Note: Change URL/Headers if using a non-OpenAI compatible provider
        url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
        headers = {
            "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": os.getenv("OPENAI_MODEL"),
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "response_format": {"type": "json_object"},
            "temperature": 0,
        }

        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=30)
            resp.raise_for_status()
            content = resp.json()["choices"][0]["message"]["content"]
            parsed = json.loads(content)
            # Fill in any missing keys so the response shape is always the same
            parsed.setdefault("answer", "")
            parsed.setdefault("refusal", False)
            parsed.setdefault("confidence", "medium")
            parsed.setdefault("sources", [])
            return parsed
        except (requests.Timeout, requests.ConnectionError):
            # Fallback 1: the provider could not be reached in time
            return {
                "answer": "The system is temporarily unavailable (network issue). Please try again.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "error_type": "network_error",
            }
        except Exception:
            # Fallback 2: HTTP errors, malformed JSON, unexpected response shape
            return {
                "answer": "A system error occurred while generating the answer.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "error_type": "unknown_error",
            }
How to Add Guardrails: Retrieval Gate and Fallbacks
Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.
The Retrieval Gate: How It Works and How to Add It
In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.
The solution is the retrieval gate:
Retrieve top-k chunks and get the top similarity score
If the score is below a threshold (for example 0.30), refuse immediately
Only call the LLM when retrieval is strong enough to ground the answer
A threshold of 0.30 is a reasonable starting point when using normalised cosine similarity, but you should tune it using evals (next section).
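The gate decision can be isolated in a small pure function, which makes it easy to unit test without calling any API. This is a minimal sketch under the assumption that each retrieval result is a dict with a "score" key, as in rag.py; passes_gate() and DEFAULT_THRESHOLD are illustrative names.

```python
from typing import Dict, List

DEFAULT_THRESHOLD = 0.30  # starting point from the text; tune it with evals


def passes_gate(results: List[Dict], threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True only when the best retrieved chunk is similar enough."""
    if not results:
        return False  # nothing retrieved, so nothing to ground an answer in
    top_score = max(r["score"] for r in results)
    return top_score >= threshold


weak = [{"score": 0.12}, {"score": 0.08}]
strong = [{"score": 0.57}, {"score": 0.31}]
print(passes_gate(weak), passes_gate(strong))  # False True
```

Keeping the threshold as a parameter means an eval run can try several values without editing the endpoint code.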
Fallbacks and Why They Matter
Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.
In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.
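The same idea can be expressed as a generic wrapper: run a function and convert any failure into a payload with the usual keys. The sketch below is an assumption-laden illustration, not the tutorial's code. with_fallback() and fallback_response() are hypothetical helpers, and Python's built-in TimeoutError and ConnectionError stand in for the requests exceptions that call_llm() catches.

```python
from typing import Any, Callable, Dict


def fallback_response(error_type: str) -> Dict[str, Any]:
    """Same keys as a successful answer, so callers never branch on shape."""
    return {
        "answer": "The system is temporarily unavailable. Please try again.",
        "refusal": True,
        "confidence": "low",
        "sources": [],
        "error_type": error_type,
    }


def with_fallback(fn: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run fn and convert any failure into a well-formed fallback payload."""
    try:
        return fn()
    except (TimeoutError, ConnectionError):
        return fallback_response("network_error")
    except Exception:
        return fallback_response("unknown_error")


def slow_provider() -> Dict[str, Any]:
    raise TimeoutError("provider took too long")


print(with_fallback(slow_provider)["error_type"])  # network_error
```

The error_type field is what makes the logs useful later: it separates network trouble from everything else.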
FastAPI App: Creating the /answer Endpoint
The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.
Implementation of app.py
    import logging
    import time

    from fastapi import FastAPI
    from pydantic import BaseModel

    from rag import retrieve
    from llm import call_llm
    import prompts

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("rag_app")

    app = FastAPI(title="Production-Ready RAG")


    class QueryRequest(BaseModel):
        question: str


    # Plain `def` rather than `async def`: retrieve() and call_llm() make blocking
    # requests calls, so FastAPI runs this handler in its threadpool.
    @app.post("/answer")
    def get_answer(req: QueryRequest):
        start_time = time.time()
        question = (req.question or "").strip()

        if not question:
            return {
                "answer": "Please provide a non-empty question.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "latency_sec": round(time.time() - start_time, 2),
            }

        # 1) Retrieval
        results = retrieve(question, k=5)
        top_score = results[0]["score"] if results else 0.0
        logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

        # 2) Retrieval Gate (Guardrail)
        if top_score < 0.30:
            return {
                "answer": "I do not have documents to answer that question.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "latency_sec": round(time.time() - start_time, 2),
                "retrieval": {"top_score": top_score, "k": 5},
            }

        # 3) Augment
        context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
        user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

        # 4) Generation with Fallback
        response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

        # 5) Attach debug metadata
        response["latency_sec"] = round(time.time() - start_time, 2)
        response["retrieval"] = {"top_score": top_score, "k": 5}
        return response
Centralized Prompt – Template: prompts.py
A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.
Example prompts.py
    SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
    If the context does not contain the answer, respond with refusal=true.
    Return a valid JSON object with exactly these keys:
    - answer: string
    - refusal: boolean
    - confidence: "low" | "medium" | "high"
    - sources: array of strings (source filenames you used)
    Do not include any extra keys. Do not include markdown. Do not include commentary."""
How to Add Beginner-Friendly Evals
In AI systems, outputs are probabilistic. This makes testing harder than traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions.
Instead of “does it output exactly this string,” you test:
Should the app refuse when the retrieval is weak?
When it answers, does it include sources?
Is the behaviour stable across prompt tweaks and model changes?
Step 1: Create evals/eval_set.json
This should contain both positive and negative cases.
    [
      {
        "id": "in_scope_01",
        "question": "What is a retrieval gate and why is it important?",
        "expect_refusal": false,
        "notes": "Should explain gating and relate it to hallucination prevention."
      },
      {
        "id": "out_of_scope_01",
        "question": "What is the capital of France?",
        "expect_refusal": true,
        "notes": "If the knowledge base only includes our docs, the app should refuse."
      },
      {
        "id": "edge_01",
        "question": "",
        "expect_refusal": true,
        "notes": "Empty input should not call the LLM."
      }
    ]
Step 2: Create evals/run_evals.py
This runner calls your API endpoint (end-to-end) and checks expected behaviours.
    import json

    import requests

    API_URL = "http://127.0.0.1:8000/answer"


    def run():
        with open("evals/eval_set.json", "r", encoding="utf-8") as f:
            cases = json.load(f)

        passed = 0
        failed = 0

        for case in cases:
            resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
            resp.raise_for_status()
            out = resp.json()

            got_refusal = bool(out.get("refusal", False))
            expect_refusal = bool(case["expect_refusal"])
            ok = (got_refusal == expect_refusal)

            # Beginner-friendly: if it answers, sources should exist and be a list
            if not got_refusal:
                ok = ok and isinstance(out.get("sources"), list)

            if ok:
                passed += 1
                print(f"PASS {case['id']}")
            else:
                failed += 1
                print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
                print("Output:", json.dumps(out, indent=2))

        print(f"\nDone. Passed={passed} Failed={failed}")
        if failed:
            raise SystemExit(1)


    if __name__ == "__main__":
        run()
How to Use Evals in Practice
Run your server:
    uvicorn app:app --reload
In another terminal, run evals:
    python evals/run_evals.py
If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.
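Evals can also guide the retrieval-gate threshold. One simple approach, sketched here under stated assumptions, is to collect the top_score for each eval question together with a label saying whether the app should have answered, then compare candidate thresholds. The scores below are hypothetical placeholders, and accuracy_at() and best_threshold() are illustrative names.

```python
from typing import List, Tuple

# Hypothetical (top_score, should_answer) pairs; replace them with values
# logged from your own eval runs.
LABELLED: List[Tuple[float, bool]] = [
    (0.12, False),
    (0.22, False),
    (0.28, False),
    (0.35, True),
    (0.41, True),
    (0.58, True),
]


def accuracy_at(threshold: float, cases: List[Tuple[float, bool]]) -> float:
    """Fraction of cases where the gate decision matches the label."""
    correct = sum((score >= threshold) == should_answer for score, should_answer in cases)
    return correct / len(cases)


def best_threshold(cases: List[Tuple[float, bool]], candidates: List[float]) -> float:
    """Return the first candidate with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, cases))


candidates = [0.20, 0.25, 0.30, 0.35, 0.40]
for t in candidates:
    print(f"threshold={t:.2f} accuracy={accuracy_at(t, LABELLED):.2f}")
print("best:", best_threshold(LABELLED, candidates))  # best: 0.3
```

With only a handful of cases, several thresholds can tie, so treat the result as a starting point and grow the eval set over time.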
What to Improve Next: Realistic Upgrades
Building a reliable RAG app is iterative. Here are realistic next steps:
Semantic chunking: Break text based on meaning instead of character count.
Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher precision.
Metadata filtering: Filter results by category, date, or department to reduce false positives.
Better citations: Store chunk IDs and show exactly which chunk(s) the answer came from.
Observability: Add request IDs, structured logs, and traces so “what happened?” is answerable.
Async + background indexing: Move index building to a background job and keep the API responsive.
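As a taste of the observability upgrade, the sketch below emits one JSON log line per pipeline step, all tagged with the same request ID. It is a rough illustration using only the standard library; log_event() and the field names are assumptions, not an established schema.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")


def log_event(event: str, request_id: str, **fields) -> str:
    """Emit one JSON log line and return it (handy for tests)."""
    line = json.dumps({"event": event, "request_id": request_id, **fields}, sort_keys=True)
    logger.info(line)
    return line


# One ID per incoming request ties every step of that request together.
request_id = uuid.uuid4().hex
log_event("retrieval", request_id, top_score=0.42, num_results=5)
log_event("gate", request_id, passed=True)
```

Searching the logs for a single request_id then shows the retrieval score, the gate decision, and any fallback for that one question.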
Final Thoughts: Production-Ready Is a Set of Habits
Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.
Retrieval quality is measurable: Use similarity scores to gate your LLM.
Refusal is a feature: It is better to say “I do not know” than to lie.
Fallbacks are mandatory: Design for the moment the API goes down.
Evals prevent regressions: Never deploy a change without running your tests.
About Me
I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.
My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.