I am building the RAG pipeline, retrieval flow, and query layer

NCERTGPT

A RAG-based study assistant for Class 12 NCERT material, built around textbook extraction, chunking, embeddings, retrieval, and source-grounded answers.

Project Type

RAG-based textbook Q&A system

Stack

Python, RAG, vector database, embeddings, NLP

Pipeline

PDF extraction, chunking, vector retrieval, query layer

Timeline

Built in 2026

Case Study

Engineering Notes

Project Overview

NCERTGPT is an in-progress RAG-based Q&A system for Class 12 NCERT textbooks. It is meant for students who want to query textbook content directly, revise concepts faster, and avoid answers that drift away from the actual source.

Problem / Motivation

Students usually move between PDFs, notes, random search results, and AI chat windows. That workflow is slow and unreliable because the answer may sound confident while missing the textbook context. Confidence without retrieval is just acting.

Architecture / System Design

The planned flow starts with PDF ingestion and text extraction, then splits the text into manageable chunks. Those chunks are converted into vector embeddings and stored for semantic retrieval. A query layer searches the relevant chunks first, then passes the retrieved context into the answer-generation step.
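The flow above can be sketched end to end with only the standard library. This is a minimal illustration, not the project's implementation: the function names, the chunk size and overlap values, and the bag-of-words `embed` function are all assumptions. A real build would swap `embed` for an actual embedding model and store the vectors in a vector database.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks. The retrieved pairs are what the answer-generation step would receive as context.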

The system is being designed around retrieval quality, chunk boundaries, and prompt discipline. For education use, the model should answer from the selected context and make uncertainty clear instead of inventing a polished answer.
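One way to express that prompt discipline in code is sketched below. The function name `build_prompt`, the `min_score` threshold of 0.2, and the prompt wording are all hypothetical choices for illustration. The input is assumed to be a list of (score, text) pairs from the retrieval step.

```python
def build_prompt(question, retrieved, min_score=0.2):
    """Build a context-only prompt, or return None if nothing is relevant enough.

    retrieved: list of (score, chunk_text) pairs; min_score is an assumed cutoff.
    """
    kept = [(score, text) for score, text in retrieved if score >= min_score]
    if not kept:
        # No usable context: the caller should tell the student the answer
        # was not found in the textbook instead of calling the model.
        return None
    context = "\n\n".join(
        f"[{i}] {text}" for i, (_, text) in enumerate(kept, start=1)
    )
    return (
        "Answer using only the numbered textbook excerpts below.\n"
        "If the excerpts do not contain the answer, say so plainly.\n"
        "Cite the excerpt numbers you used.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Returning `None` when every chunk falls below the threshold keeps the uncertainty decision in the pipeline rather than leaving it to the model.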

Key Features

The project focuses on study usefulness rather than chatbot theatrics.

NCERT-focused question answering.
PDF extraction and chunking pipeline.
Vector search over textbook content.
Prompt flow that prioritizes source-grounded answers.
Natural-language query handling for revision workflows.

Technical Challenges

The main challenge is keeping retrieved chunks relevant enough for accurate answers. Textbook PDFs can produce noisy extraction, bad chunk boundaries, and context gaps. Once bad context enters the prompt, the model starts improvising to fill the gaps, and that is exactly what the system should avoid.
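A small cleanup pass before chunking can remove some of that extraction noise. The sketch below is a set of assumed heuristics, not the project's actual cleaner: it rejoins words hyphenated across line breaks, drops lines that contain only a page number, and joins wrapped lines while keeping paragraph breaks. The function name `clean_extracted_text` is hypothetical.

```python
import re


def clean_extracted_text(raw):
    """Heuristic cleanup for text extracted from a textbook PDF."""
    # Rejoin words split by a hyphen at a line break ("con-\nserved").
    # Caveat: this also merges genuine hyphenated compounds split the same way.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    lines = [line.strip() for line in text.splitlines()]
    # Drop lines that are only a short number, assumed to be page numbers.
    # Caveat: a bare numeric answer on its own line would be dropped too.
    lines = [line for line in lines if not re.fullmatch(r"\d{1,4}", line)]

    # Join wrapped lines inside a paragraph; blank lines end a paragraph.
    paragraphs, current = [], []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))

    # Collapse runs of whitespace left over from column layouts.
    return "\n\n".join(re.sub(r"\s+", " ", p) for p in paragraphs)
```

Keeping paragraph breaks matters here because they are natural candidates for chunk boundaries later in the pipeline.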

Solutions / Engineering Decisions

I am treating retrieval as the core system, not a side feature. Chunk sizing, metadata, and query routing matter more than making the chat screen look impressive. The model layer is useful only after the context layer is reliable.
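As a rough sketch of what chunk metadata and query routing could look like, the example below attaches subject, chapter, and page to each chunk and filters on them before any vector scoring. The `Chunk` fields and the `route` and `cite` helpers are assumptions made for illustration; the project's real schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A text chunk plus the metadata needed to route and cite it (assumed fields)."""
    text: str
    subject: str
    chapter: int
    page: int


def route(chunks, subject=None, chapter=None):
    """Narrow the search space by metadata before any similarity scoring."""
    return [
        c for c in chunks
        if (subject is None or c.subject == subject)
        and (chapter is None or c.chapter == chapter)
    ]


def cite(chunk):
    """Format a source reference to show next to an answer."""
    return f"{chunk.subject}, Chapter {chunk.chapter}, p. {chunk.page}"
```

Filtering first means a Physics question is never scored against Biology chunks, and the same metadata lets each answer point back to a chapter and page.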

Outcome / Final State

The current direction is a source-aware study assistant that can answer from textbook material and support focused revision. The project is still evolving, but the architecture is grounded in RAG fundamentals instead of vague AI claims.

AI, RAG, Vector DB, Embeddings, NLP, Python

Key Capabilities

Built around Class 12 NCERT textbook Q&A instead of generic chatbot responses.

Uses a pipeline that runs from PDF to extracted text, then chunks, then embeddings, then retrieval.

Focuses on RAG fundamentals, vector search, NLP query handling, and source-grounded answers.
