
Building a Conversational AI Assessment Agent: What I Learned From Screenr

Tags: AI, TypeScript, Architecture, Backend

When I started building Screenr, I thought the hard part would be the LLM prompting. It wasn't. The hard part was making the system feel like a real conversation — one that remembers context, adapts mid-interview, and doesn't collapse when a candidate does something unexpected.

This is a technical walkthrough of how I built the assessment agent that sits at the core of Screenr. It's in production, it has paying clients, and it has broken in ways I didn't anticipate. I'll cover all of it.


What Screenr Does

Screenr is an agentic hiring platform. Companies post a role, candidates apply, and instead of a human reading 200 resumes cold, an AI agent conducts a structured back-and-forth assessment with each candidate. At the end, every candidate gets a composite score and a transcript the hiring manager can read.

The pipeline looks like this:

Resume Upload
      │
      ▼
Resume Parser ──────► Fraud Detection
                            │
                            ▼
                    Role-Specific Assessment Agent  ◄── (this post)
                            │
                            ▼
                    Composite Scorer
                            │
                            ▼
                    Ranked Candidate List

Each stage feeds the next. This post covers the role-specific assessment agent, which runs after resume parsing and fraud detection.


How the Conversation Works

The agent doesn't ask the same questions to every candidate. It reads the parsed resume first, then conducts a conversation that adapts based on what the candidate says.

Here's the flow for a single session:

Candidate joins session
         │
         ▼
Load: role prompt + resume summary + message history from Redis
         │
         ▼
Build full context → send to LLM
         │
         ▼
Stream response back to candidate
         │
         ▼
Append exchange to Redis session
         │
     [loop until assessment complete]
         │
         ▼
Finalize: write full transcript to PostgreSQL
         │
         ▼
Trigger scoring stage

No summarization mid-conversation. For an assessment the full context needs to be intact — dropping anything is a correctness risk.
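
The loop above can be sketched as a single turn handler. This is a minimal sketch rather than the production code: `TurnSession`, `CompleteFn`, and `handleTurn` are hypothetical names, the `Map` is an in-memory stand-in for Redis, streaming is left out, and the `maxTurns` stop condition is an assumption about how "assessment complete" might be detected.

```typescript
// Hypothetical types for illustration; the real Session shape is not shown in this post.
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage { role: ChatRole; content: string }
interface TurnSession {
  systemPrompt: string;
  maxTurns: number;
  messages: ChatMessage[];
}

// Stand-in for the LLM call: takes the full context, returns the reply text.
type CompleteFn = (context: ChatMessage[]) => Promise<string>;

// In-memory stand-in for the Redis session store.
const sessions = new Map<string, TurnSession>();

async function handleTurn(
  sessionId: string,
  candidateText: string,
  complete: CompleteFn
): Promise<{ reply: string; done: boolean }> {
  const session = sessions.get(sessionId);
  if (!session) throw new Error("Session not found");

  // 1. Record the candidate's message.
  session.messages.push({ role: "user", content: candidateText });

  // 2. Rebuild the full context: system prompt plus the entire history, no summarization.
  const context: ChatMessage[] = [
    { role: "system", content: session.systemPrompt },
    ...session.messages,
  ];

  // 3. Call the model and record its reply.
  const reply = await complete(context);
  session.messages.push({ role: "assistant", content: reply });

  // 4. Assumed stop condition: count candidate turns against maxTurns.
  const userTurns = session.messages.filter((m) => m.role === "user").length;
  return { reply, done: userTurns >= session.maxTurns };
}
```

When `done` comes back true, the caller would finalize the session and trigger scoring.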


Memory: Two Layers

I use Redis and PostgreSQL for different things.

Redis holds the live session. It's fast, it survives browser drops (candidate closes tab, comes back 10 minutes later), and it's easy to expire after the session closes.

// session store — Redis
const SESSION_TTL = 60 * 60 * 2; // 2 hours

async function getSession(sessionId: string): Promise<Session | null> {
  const raw = await redis.get(`session:${sessionId}`);
  return raw ? JSON.parse(raw) : null;
}

async function appendMessage(
  sessionId: string,
  role: "user" | "assistant",
  content: string
) {
  const session = await getSession(sessionId);
  if (!session) throw new Error("Session not found");

  session.messages.push({ role, content, timestamp: Date.now() });
  await redis.set(
    `session:${sessionId}`,
    JSON.stringify(session),
    "EX",
    SESSION_TTL
  );
}
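
One caveat with `appendMessage`: it reads the session, mutates it, and writes it back, so two overlapping appends for the same session could overwrite each other. A minimal sketch of one way to serialize them, assuming a single Node.js process handles a given session (`withSessionLock` is a hypothetical helper, not part of Screenr or of any Redis client):

```typescript
// Per-session promise chain: each task starts only after the previous one settles.
const tails = new Map<string, Promise<unknown>>();

function withSessionLock<T>(
  sessionId: string,
  task: () => Promise<T>
): Promise<T> {
  const previous = tails.get(sessionId) ?? Promise.resolve();
  // Run the task whether the previous one succeeded or failed.
  const current = previous.then(task, task);
  // Store a non-rejecting tail so one failure does not poison the chain.
  tails.set(sessionId, current.catch(() => undefined));
  return current;
}
```

Usage would look like `withSessionLock(sessionId, () => appendMessage(sessionId, "user", text))`. With several server instances this is not enough, and the append would need to be atomic on the Redis side instead.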

PostgreSQL stores the finalized transcript once the assessment is done. This is what the hiring manager reads. It's also what gets passed to the scoring stage.

async function finalizeSession(sessionId: string) {
  const session = await getSession(sessionId);
  if (!session) throw new Error("Session expired");

  await db.assessmentTranscript.create({
    data: {
      candidateId: session.candidateId,
      roleId: session.roleId,
      messages: session.messages,
      completedAt: new Date(),
    },
  });

  await redis.del(`session:${sessionId}`);
}

Building the Context on Each Turn

Every turn rebuilds the full context. There's no “memory module”: the system prompt and the entire message history are sent to the model again on each call.

async function buildMessages(session: Session): Promise<Message[]> {
  const systemPrompt = buildSystemPrompt(session.role, session.resumeSummary);

  return [
    { role: "system", content: systemPrompt },
    ...session.messages.map((m) => ({
      role: m.role,
      content: m.content,
    })),
  ];
}

function buildSystemPrompt(role: Role, resumeSummary: string): string {
  return `
You are conducting a structured technical assessment for the role of ${role.title}.

The candidate's background:
${resumeSummary}

Your job:
- Ask questions relevant to this specific role
- Adapt based on their answers — if they claim experience, probe it
- If an answer is vague, ask for specifics
- Do not move to the next topic until the current one is sufficiently explored
- After ${role.maxTurns} exchanges, close the assessment

Do not reveal scoring criteria. Be conversational, not interrogative.
`.trim();
}

The resume summary is pre-computed in the parsing stage and stored on the session — I don't re-parse on every turn.
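
To make the rebuilt context concrete, here is a small illustration of what the array looks like two exchanges into a session. It uses a simplified, synchronous variant of `buildMessages`; `Msg`, `buildContext`, and the sample messages are invented for the example.

```typescript
interface Msg { role: "system" | "user" | "assistant"; content: string }

// Simplified, synchronous variant of buildMessages for illustration.
function buildContext(systemPrompt: string, past: Msg[]): Msg[] {
  return [{ role: "system", content: systemPrompt }, ...past];
}

// Invented sample data, not taken from a real session.
const sampleHistory: Msg[] = [
  { role: "assistant", content: "Tell me about a service you owned end to end." },
  { role: "user", content: "I ran the billing API at my last job." },
  { role: "assistant", content: "What did its deploy pipeline look like?" },
  { role: "user", content: "GitHub Actions into ECS, with canary releases." },
];

const sampleContext = buildContext("(system prompt text)", sampleHistory);

// The system prompt is always first; the history follows in order.
console.log(sampleContext.map((m) => m.role).join(" > "));
// prints "system > assistant > user > assistant > user"
```

The context grows by two messages per exchange, which is the cost of not summarizing.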


The Unexpected Case

A candidate was halfway through an English-language assessment and switched to Hindi.

The LLM handled it gracefully — it responded in Hindi and continued the conversation naturally. From a user experience standpoint, that's actually the right behavior. But my downstream scoring pipeline expected English output. The extractor that pulled structured data (skills mentioned, experience claimed, red flags) out of the transcript broke silently.

I didn't catch it immediately. The candidate got a null score and the hiring manager saw a blank evaluation.

The fix:

async function detectLanguage(text: string): Promise<string> {
  // Using a lightweight library — don't want LLM overhead for this
  const { franc } = await import("franc");
  return franc(text); // returns ISO 639-3 code, e.g. 'hin', 'eng', or 'und' if undetermined
}

async function normalizeTranscript(
  messages: Message[]
): Promise<Message[]> {
  const normalized: Message[] = [];

  for (const message of messages) {
    if (message.role !== "user") {
      normalized.push(message);
      continue;
    }

    const lang = await detectLanguage(message.content);

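    // franc returns "und" when it cannot decide (e.g. very short answers like "Yes."),
    // so skip translation for those instead of flagging them as non-English.
    if (lang !== "eng" && lang !== "und") {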
      // Translate to English before scoring
      const translated = await translateToEnglish(message.content);
      normalized.push({
        ...message,
        content: translated,
        originalContent: message.content,
        detectedLanguage: lang,
      });
    } else {
      normalized.push(message);
    }
  }

  return normalized;
}

I also added a flag on the transcript record so the hiring manager can see when the original conversation was in another language.

await db.assessmentTranscript.create({
  data: {
    candidateId: session.candidateId,
    roleId: session.roleId,
    messages: normalizedMessages,
    containsNonEnglish: normalizedMessages.some((m) => m.detectedLanguage),
    completedAt: new Date(),
  },
});

What I'd Do Differently

Explicit session state machine. Right now the agent decides when the assessment is “done” based on the system prompt. That's fragile. I'd model it as an explicit state machine: intro → topic_1 → topic_2 → ... → closing → complete. Each state has its own prompt segment. The LLM focuses on the current state only.
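
A minimal sketch of what that state machine could look like. The state names follow the sequence above, but the types, the `advance` function, and the prompt segments are hypothetical; deciding when a state is "sufficiently explored" is left to the caller.

```typescript
// Hypothetical states; topic names would come from the role definition.
type Stage = "intro" | "topic" | "closing" | "complete";

interface AssessmentState {
  stage: Stage;
  topicIndex: number; // which topic is active while stage === "topic"
}

interface Plan {
  topics: string[]; // e.g. ["system design", "debugging"]
}

// Move to the next state once the caller decides the current one is done.
function advance(state: AssessmentState, plan: Plan): AssessmentState {
  switch (state.stage) {
    case "intro":
      return plan.topics.length > 0
        ? { stage: "topic", topicIndex: 0 }
        : { stage: "closing", topicIndex: 0 };
    case "topic":
      return state.topicIndex + 1 < plan.topics.length
        ? { stage: "topic", topicIndex: state.topicIndex + 1 }
        : { stage: "closing", topicIndex: state.topicIndex };
    case "closing":
      return { stage: "complete", topicIndex: state.topicIndex };
    case "complete":
      return state; // terminal state
  }
}

// Each state contributes only its own prompt segment.
function promptSegment(state: AssessmentState, plan: Plan): string {
  switch (state.stage) {
    case "intro":
      return "Greet the candidate and explain the format.";
    case "topic":
      return `Explore this topic only: ${plan.topics[state.topicIndex]}.`;
    case "closing":
      return "Thank the candidate and close the assessment.";
    case "complete":
      return "";
  }
}
```

With this shape, "done" is a property of the session state rather than something the model has to infer from a turn count in the prompt.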

Structured output from the start. I'm parsing unstructured text in the scoring stage. It works but it's brittle. Newer models handle structured output well — I'd push the extraction into the assessment turn itself so the agent outputs both the conversational response and a JSON side-channel.
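
A minimal sketch of the receiving side of such a side-channel, assuming the model is instructed to answer with a single JSON object on every turn. The `TurnOutput` envelope and its field names are hypothetical (they mirror the skills, experience, and red-flag data the extractor pulls today), and the validation is hand-rolled to keep the example dependency-free.

```typescript
// Hypothetical envelope the model is asked to emit on every turn.
interface TurnOutput {
  reply: string; // shown to the candidate
  extraction: {
    skillsMentioned: string[];
    experienceClaimed: string[];
    redFlags: string[];
  };
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Parse and validate the model's raw text. Returns null instead of throwing,
// so the caller can retry or fall back to treating the text as a plain reply.
function parseTurnOutput(raw: string): TurnOutput | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;

  const reply = obj["reply"];
  const extraction = obj["extraction"];
  if (typeof reply !== "string") return null;
  if (typeof extraction !== "object" || extraction === null) return null;

  const ex = extraction as Record<string, unknown>;
  const skills = ex["skillsMentioned"];
  const experience = ex["experienceClaimed"];
  const flags = ex["redFlags"];
  if (!isStringArray(skills) || !isStringArray(experience) || !isStringArray(flags)) {
    return null;
  }

  return {
    reply,
    extraction: { skillsMentioned: skills, experienceClaimed: experience, redFlags: flags },
  };
}
```

Only `reply` would be streamed to the candidate; `extraction` would be stored alongside the message for the scoring stage.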

Unhappy path testing. The Hindi case exposed a gap. I now have a small suite of “chaos candidates”: bots that switch languages, give one-word answers, or try to game the system by listing every technology they can think of. I run these against every prompt change before shipping.
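
A minimal sketch of how such a suite could be structured. Everything here is hypothetical: the `ChaosCandidate` shape, the scripts, and the `AgentTurn` function type (a stand-in for one call into the assessment agent). The only check shown is "the agent did not throw or return an empty reply"; a real suite would also assert on the scoring output.

```typescript
// A chaos candidate is a scripted bot: a label plus the messages it will send.
interface ChaosCandidate {
  label: string;
  script: string[];
}

// Stand-in for one agent turn: candidate text in, agent reply out.
type AgentTurn = (candidateText: string) => Promise<string>;

// Invented scripts matching the three behaviours described above.
const chaosCandidates: ChaosCandidate[] = [
  {
    label: "language-switcher",
    script: ["I have five years of backend experience.", "मैंने Node.js पर काम किया है।"],
  },
  { label: "one-word-answers", script: ["Yes.", "No.", "Maybe."] },
  {
    label: "keyword-stuffer",
    script: ["React, Vue, Angular, Go, Rust, Kubernetes, Kafka, Spark, Terraform."],
  },
];

interface ChaosResult { label: string; ok: boolean; error?: string }

// Run every script against the agent and report which ones failed.
async function runChaosSuite(
  agent: AgentTurn,
  candidates: ChaosCandidate[]
): Promise<ChaosResult[]> {
  const results: ChaosResult[] = [];
  for (const candidate of candidates) {
    try {
      for (const line of candidate.script) {
        const reply = await agent(line);
        if (reply.trim() === "") throw new Error(`empty reply to "${line}"`);
      }
      results.push({ label: candidate.label, ok: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ label: candidate.label, ok: false, error: message });
    }
  }
  return results;
}
```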


Stack

  • Runtime: TypeScript, Node.js, Encore (Go for infrastructure layer)
  • LLM: Anthropic Claude and OpenAI via their SDKs directly (no LangChain)
  • Session memory: Redis (Upstash)
  • Persistence: PostgreSQL (via Prisma)
  • Infrastructure: AWS, deployed on Encore's cloud

The system is live at app.screenr.co. The biggest lesson wasn't about prompting or memory architecture — it was that the edge cases in conversational systems are almost never the ones you plan for. Build the happy path, then immediately try to break it.