Designing a Socratic Voice Agent for AI/ML Upskilling

May 18th, 2026

Most AI tutoring tools answer questions directly. You ask “what is backpropagation?” and get an explanation. That’s useful, but it’s not how deep understanding forms. The Socratic method works differently — it responds to questions with questions, guiding the learner to construct their own understanding through reasoning.

I’m building a voice-enabled Socratic tutor for AI/ML concepts. The system doesn’t just retrieve information; it models a teaching strategy, adapts to the learner’s level across sessions, and delivers the experience through natural voice conversation. This post walks through the architecture, the key design decisions, and the technical challenges I’m working through.

Why multi-agent?

A single LLM with a good system prompt can do passable Socratic dialogue. I tried this first. The problem isn’t the individual interactions — it’s everything around them: knowing what the student already understands, deciding what to teach next, evaluating whether the student actually learned it, and remembering all of this across sessions.

These are fundamentally different cognitive tasks. Mixing them into one prompt creates a monolith that’s hard to evaluate, hard to improve, and hard to debug when it teaches something incorrectly. So the system is split into four specialized agents, each with a clear responsibility.

The four agents

[Figure: Socratic Voice Agent Architecture]

Tutor Agent — Conducts the Socratic dialogue. It receives the current topic and the student’s skill level from the curriculum agent, then asks probing questions, gives hints when the student is stuck, and guides them toward understanding without giving away the answer. This is the only agent the student interacts with directly.

Evaluator Agent — Analyzes the student’s responses to determine whether they’ve demonstrated genuine understanding or just pattern-matched a correct-sounding answer. This is harder than it sounds. “Backpropagation computes gradients using the chain rule” could be parroted from a textbook or could reflect real comprehension — the evaluator needs surrounding context (what questions were asked, how the student arrived at the answer) to tell the difference.

Curriculum Agent — Maintains the learning state. It tracks what the student has mastered, what they’re currently working on, and what should come next. It queries the cross-session memory store to make decisions that account for the student’s full history, not just the current conversation.

Knowledge Agent — Retrieves relevant domain content from a FAISS-indexed corpus of AI/ML material. When the tutor needs accurate technical details to ground its questions, or the evaluator needs reference material to check a student’s answer, the knowledge agent handles retrieval.
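To make the knowledge agent's retrieval step concrete, here is a minimal sketch. It uses a brute-force cosine-similarity search in plain Python as a stand-in for the FAISS index (an exact flat inner-product index over normalized vectors would produce the same ranking). The `Passage` type, the `top_k` helper, and the toy 3-dimensional embeddings are all hypothetical; real embeddings would come from the sentence-transformers model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """One chunk of the AI/ML corpus with a precomputed embedding (hypothetical shape)."""
    text: str
    embedding: tuple[float, ...]


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(x / norm for x in vec)


def top_k(query, corpus, k=2):
    """Return the k passages with the highest cosine similarity to the query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(p.embedding))), p)
        for p in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


# Toy embeddings standing in for real sentence-transformers output.
corpus = [
    Passage("Backpropagation applies the chain rule layer by layer.", (0.9, 0.1, 0.0)),
    Passage("Attention weights are a softmax over query-key scores.", (0.1, 0.9, 0.1)),
    Passage("Dropout randomly zeroes activations during training.", (0.0, 0.2, 0.9)),
]

hits = top_k(query=(1.0, 0.2, 0.0), corpus=corpus, k=2)
```

The tutor and the evaluator could both call the same helper: the tutor to ground a question, the evaluator to fetch reference material for checking an answer.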

Why LangGraph for orchestration

I chose LangGraph over a simple chain or a custom routing layer for a few specific reasons.

First, the conversation flow isn’t linear. A typical interaction might go: curriculum agent picks topic → tutor asks question → student responds → evaluator assesses → evaluator says “not quite” → tutor asks a follow-up → student responds → evaluator says “got it” → curriculum updates mastery → curriculum picks next topic. The branching and looping are natural to express as a graph with conditional edges.
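The loop above can be sketched as the kind of routing function a conditional edge would call. This is plain Python rather than LangGraph code, and the node names, the `verdict` values, and the `max_attempts` cap are assumptions made for illustration.

```python
def route_after_evaluation(state: dict, max_attempts: int = 3) -> str:
    """Pick the next node after the evaluator has judged a student response.

    Hypothetical state keys: "verdict" ("got_it" or "not_quite") and
    "attempts" (how many times the student has tried the current topic).
    """
    if state["verdict"] == "got_it":
        return "curriculum"   # record mastery, then pick the next topic
    if state["attempts"] >= max_attempts:
        return "curriculum"   # stop looping; let the curriculum agent re-plan
    return "tutor"            # ask a follow-up question on the same topic


def run_turns(verdicts):
    """Simulate the tutor/evaluator loop for a scripted list of verdicts."""
    path = ["curriculum", "tutor"]
    attempts = 0
    for verdict in verdicts:
        attempts += 1
        path.append("evaluator")
        next_node = route_after_evaluation({"verdict": verdict, "attempts": attempts})
        path.append(next_node)
        if next_node == "curriculum":
            break
    return path


# One "not quite" followed by a "got it", as in the interaction described above.
path = run_turns(["not_quite", "got_it"])
```

In LangGraph, a function of this shape would typically be registered on the evaluator node with `add_conditional_edges`, so the branching lives in one small, testable place.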

Second, typed state. LangGraph’s TypedDict state lets me define exactly what gets passed between agents — the current topic, the student’s assessed level, the conversation history, the evaluator’s verdict. This makes it possible to test agents in isolation by constructing specific state objects.
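As a minimal sketch of what that state might look like: the field names below are guesses based on the items listed above, not the project's actual schema. The `fake_tutor` function is a hypothetical stand-in for a tutor node, included only to show how a hand-built state object lets one agent be exercised on its own.

```python
from typing import Literal, Optional, TypedDict


class Turn(TypedDict):
    role: Literal["tutor", "student"]
    text: str


class TutorState(TypedDict):
    """Shared state passed between agents (illustrative field names)."""
    topic: str
    skill_level: Literal["beginner", "intermediate", "advanced"]
    history: list[Turn]
    verdict: Optional[Literal["got_it", "not_quite"]]


def fake_tutor(state: TutorState) -> dict:
    """Stand-in tutor node: returns only the keys it wants to update."""
    if state["verdict"] == "not_quite":
        question = f"What part of {state['topic']} feels least clear so far?"
    else:
        question = f"How would you explain {state['topic']} in your own words?"
    # Build a new list instead of mutating the incoming state.
    return {"history": state["history"] + [{"role": "tutor", "text": question}]}


# Build a specific state by hand to exercise the node in isolation.
state: TutorState = {
    "topic": "backpropagation",
    "skill_level": "beginner",
    "history": [{"role": "student", "text": "It uses the chain rule, I think."}],
    "verdict": "not_quite",
}
update = fake_tutor(state)
```

Because the node is a plain function of the state, a unit test can assert on `update` without running the rest of the graph or calling an LLM.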

Third, checkpointing. LangGraph supports state persistence between invocations, which maps directly to the cross-session memory requirement. A student can leave mid-session and resume without losing context.

Cross-session memory

This is the architectural piece I’m most interested in getting right. The naive approach is to dump the full conversation history into a vector store and retrieve relevant chunks. That works for simple recall (“what did we cover last time?”) but fails for curriculum decisions, which depend on what the student has mastered and how recently, not on which past passages resemble the current query.

The design I’m working with has two layers:

Session summaries. After each session, a summarizer extracts structured data: topics covered, mastery assessments, misconceptions identified, and open questions. These go into Postgres with timestamps.

Temporal weighting. When the curriculum agent retrieves a student’s history, recent evidence is weighted more heavily. If a student struggled with attention mechanisms in session 2 but demonstrated clear understanding in session 5, the curriculum agent should trust the more recent signal. This is implemented as a decay function over the timestamp rather than a simple “most recent wins” rule, because even the latest evidence loses value with age: a topic that hasn’t been revisited in weeks might need a refresher check.
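One way to sketch the two layers together is a structured session record plus an exponentially decayed average over mastery scores. The field names, the 14-day half-life, and the 21-day refresher threshold are illustrative assumptions, not values from the project.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class SessionSummary:
    """Structured output of the post-session summarizer (hypothetical fields)."""
    ended_at: datetime
    mastery: dict[str, float]            # topic -> score in [0, 1]
    misconceptions: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)


def decay_weight(age_days: float, half_life_days: float = 14.0) -> float:
    """Exponential decay: evidence loses half its weight every half-life."""
    return 0.5 ** (age_days / half_life_days)


def mastery_estimate(topic, sessions, now, half_life_days=14.0):
    """Decay-weighted average of a topic's scores; None if never assessed."""
    weighted_sum = total_weight = 0.0
    for s in sessions:
        if topic in s.mastery:
            age = (now - s.ended_at).total_seconds() / 86400
            w = decay_weight(age, half_life_days)
            weighted_sum += w * s.mastery[topic]
            total_weight += w
    return weighted_sum / total_weight if total_weight else None


def needs_refresher(topic, sessions, now, max_gap_days=21.0):
    """Flag topics whose most recent evidence is older than max_gap_days."""
    seen = [s.ended_at for s in sessions if topic in s.mastery]
    return bool(seen) and (now - max(seen)) > timedelta(days=max_gap_days)


now = datetime(2026, 5, 18)
sessions = [
    SessionSummary(now - timedelta(days=28), {"attention": 0.2}),   # struggled
    SessionSummary(now - timedelta(days=1), {"attention": 0.9}),    # understood
]
estimate = mastery_estimate("attention", sessions, now)
```

Here the estimate lands much closer to the recent 0.9 than to the plain average of 0.55. The separate `needs_refresher` check is there because a normalized weighted average cannot, on its own, signal that all of the evidence for a topic has gone stale.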

The voice layer

Text-based Socratic dialogue works, but voice changes the interaction meaningfully. Students tend to explain their reasoning more naturally when speaking than when typing, and the tutor can keep the conversation flowing without the student having to wait for a text response.

The planned stack:

STT (Speech-to-Text): Deepgram or AssemblyAI for real-time transcription
TTS (Text-to-Speech): ElevenLabs or OpenAI TTS for natural-sounding output
Transport: WebSocket-based streaming in both directions
Backend: FastAPI handling the WebSocket connections and routing to the LangGraph agent

The hard problem here is latency. A voice agent that takes 3+ seconds to respond after you stop speaking feels broken. The latency budget gets consumed by STT processing, LLM inference (potentially multiple agent hops), and TTS generation. I’m planning to address this in two ways: streaming TTS (start speaking before the full response is generated), and keeping the agent graph shallow for voice interactions, so the tutor agent handles most turns directly and routes to other agents only when it genuinely needs them.
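A minimal sketch of the streaming idea: buffer LLM tokens and hand each complete sentence to TTS as soon as it ends, rather than waiting for the full response. The token list is a made-up stand-in for a streaming LLM response, and the punctuation-based sentence split is a deliberate simplification.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment with no punctuation


# Hypothetical token stream from the tutor's LLM response.
tokens = ["What", " do", " you", " think", " a", " gradient", " measures", "?",
          " Try", " describing", " it", " for", " one", " weight", "."]

chunks = list(sentences_from_tokens(tokens))
```

With this shape, the first question can be sent to the TTS service while the model is still generating the second sentence. A production version would need a smarter splitter, since this one would also break on abbreviations and decimal numbers.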

Evaluation framework

Building an agent that seems like a good tutor is easy. Building one that is a good tutor requires evaluation beyond “did the student say the right thing.”

The evaluation framework I’m designing covers three dimensions:

Pedagogical quality. Does the tutor ask questions at the right difficulty level? Does it give hints before giving up and explaining? Does it probe for understanding rather than accepting surface-level answers? This will be evaluated with a rubric scored by an LLM judge against a set of reference conversations.

Factual accuracy. When the tutor makes claims about AI/ML concepts, are they correct? This is validated against the knowledge agent’s source material.

Session coherence. Does the curriculum agent make sensible decisions about topic progression? Does the cross-session memory produce better outcomes than a memoryless version? This requires comparing learning trajectories over multiple simulated sessions.
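As a rough illustration of the pedagogical-quality dimension, the sketch below validates and averages per-criterion rubric scores across conversations. The criterion names, the 1-5 scale, and the hard-coded scores are assumptions for illustration; in practice the scores would come from the LLM judge.

```python
from statistics import mean

# Hypothetical rubric criteria drawn from the pedagogical-quality questions above.
CRITERIA = ("difficulty_fit", "hints_before_explaining", "probes_understanding")


def validate(scores: dict) -> dict:
    """Check that a judge's output covers every criterion on a 1-5 scale."""
    for name in CRITERIA:
        if name not in scores:
            raise ValueError(f"missing criterion: {name}")
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} out of range: {scores[name]}")
    return scores


def aggregate(judged_conversations: list) -> dict:
    """Average each criterion across all judged conversations."""
    valid = [validate(s) for s in judged_conversations]
    return {name: mean(s[name] for s in valid) for name in CRITERIA}


# Made-up judge outputs for two tutoring conversations.
judged = [
    {"difficulty_fit": 4, "hints_before_explaining": 5, "probes_understanding": 3},
    {"difficulty_fit": 2, "hints_before_explaining": 4, "probes_understanding": 3},
]
report = aggregate(judged)
```

Keeping the criteria separate, instead of collapsing them into one number, makes it easier to see which teaching behavior regressed after a prompt change. Validating the judge's output first catches malformed responses before they skew the averages.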

Tech stack summary

Component	Technology
Orchestration	LangGraph (supervisor pattern)
LLM	Claude (Anthropic API)
Embeddings	HuggingFace sentence-transformers
Vector store	FAISS
Backend	FastAPI
Database	PostgreSQL (session memory, mastery tracking)
Voice STT	Deepgram / AssemblyAI
Voice TTS	ElevenLabs / OpenAI TTS
Frontend	Streamlit (MVP) → React (production)
Observability	Langfuse / LangSmith

Current status and what’s next

I’m currently in the architecture and early implementation phase. The immediate milestones:

Single tutor agent MVP — Socratic dialogue without voice, without the other agents. Get the core teaching loop working and evaluate it.
Evaluator integration — Close the feedback loop so the system can assess student responses and adapt.
Curriculum agent + persistence — Cross-session memory with temporal weighting.
Voice layer — WebSocket streaming with latency optimization.
Full multi-agent orchestration — All four agents running in the LangGraph supervisor graph.

I’ll write follow-up posts as each milestone ships, covering what worked, what didn’t, and the specific technical decisions that changed along the way.

This project is part of my AI/ML portfolio. The codebase is in a private repository — if you’re a hiring manager or recruiter interested in discussing the architecture, feel free to reach out.

Saylee Pradhan

Software engineer turned AI specialist, exploring the intersection of code quality, LLM evaluation, and intelligent system design.