Evaluating GenAI: Building Your "Golden Dataset"
Evaluation is the single most critical component of a GenAI system. If you can’t measure it, you can’t improve it. But how do teams actually do it in practice?
Most teams fall into one of two traps:
- The Manual Slog: Collecting logs and manually checking answers against a knowledge base. (Unscalable).
- The Shotgun Approach: Using tools like RAGAS to randomly generate a test set from your documents. (Good for a baseline, but lacks depth).
This post proposes a Surgical Approach: A customizable workflow that uses Topic Modeling to build a “Golden Dataset” that actually tests your system’s weak points.
The Workflow: From Random to Surgical
Instead of randomly sampling chunks, we use structure to guide our generation.
Step 1: Know Your Data (Topic Modeling)
Gather your data (structured or unstructured text) and run a topic modeling tool like BERTopic (a minimal sketch follows the list below).
- Why? This gives you a map of your knowledge base. You can see exactly how much data you have for “Authentication” vs. “Pricing.”
- The Insight: This distribution allows you to identify scarce topics (edge cases) that a random sampler would likely miss.
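Here is a minimal sketch of this step, assuming your knowledge base is already split into text chunks. The load_chunks helper, the min_topic_size setting, and the Count threshold are illustrative placeholders, not prescriptions.

```python
# Minimal topic-modeling sketch with BERTopic.
# Assumptions: load_chunks() is a hypothetical helper returning a list of
# strings; min_topic_size and the scarcity cutoff are values to tune.
from bertopic import BERTopic

chunks = load_chunks()  # your knowledge base, split into text chunks

topic_model = BERTopic(min_topic_size=10)
topics, _ = topic_model.fit_transform(chunks)

# get_topic_info() returns a DataFrame with one row per topic:
# its id, how many chunks it covers (Count), and representative keywords.
info = topic_model.get_topic_info()
print(info[["Topic", "Count", "Name"]])

# Topics with a low Count are your scarce topics: the edge cases
# a purely random sampler would under-represent.
scarce_topics = info[(info["Topic"] != -1) & (info["Count"] < 20)]
```

Topic -1 is BERTopic’s outlier bucket, which is why it is excluded from the scarcity check above.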
Step 2: Targeted Generation
Based on these topics, we generate Question-Answer (QA) pairs. This allows for a surgical approach:
- RAGAS (The Shotgun): “Here is a random spread of questions from your data.”
- This Method (The Scalpel): “I know my ‘Authentication’ topic is weak, so I am going to force-generate 50 adversarial questions specifically for that topic.”
Tip: You can adjust how much context you pass to the LLM based on the typical document length in each topic to optimize performance. A sketch of this targeted-generation step is shown below.
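For illustration, here is one way the targeted generation could look. The OpenAI client, the model name, the prompt wording, and the chunks_for_topic helper are all assumptions; swap in whatever generator LLM and per-topic chunk lookup you already have.

```python
# Illustrative sketch of targeted QA generation for one weak topic.
# The client, model name, prompt, and chunks_for_topic() are assumptions.
import json
from openai import OpenAI

client = OpenAI()
TARGET_TOPIC = "Authentication"   # the topic you decided to stress-test
N_QUESTIONS = 50                  # force-generate this many pairs

def generate_qa(context: str) -> dict:
    """Ask the generator LLM for one adversarial QA pair grounded in `context`."""
    context = context[:8000]  # crude cap per the context tip above; tune per topic
    prompt = (
        f"You are building an evaluation set for the topic '{TARGET_TOPIC}'.\n"
        "Write one hard, adversarial question that can be answered ONLY from the "
        "context below, plus the correct answer.\n"
        "Return JSON with keys 'question' and 'answer'.\n\n"
        f"Context:\n{context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable generator model works here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# chunks_for_topic() is a hypothetical helper that returns the chunks
# BERTopic assigned to the target topic in Step 1.
qa_pairs = [generate_qa(c) for c in chunks_for_topic(TARGET_TOPIC)[:N_QUESTIONS]]
```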
Step 3: The “Fool’s Gold” Filter (Quality Control)
A generator LLM can hallucinate. If your test data is bad, your evaluation is worthless. We introduce a Critic Loop (sketched after the list below):
- Generate: The first LLM creates a QA pair from the context.
- Critique: A separate “Critic LLM” (e.g., Claude 3.5 Sonnet) reads only the context and attempts to answer the generated question.
- Validate: Compare the Critic’s answer with the Generator’s answer using simple metrics (like ROUGE/BLEU) or an LLM-as-a-Judge. If they don’t match, discard the pair.
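A minimal sketch of the critic loop, using the rouge-score package for the string-overlap check. The call_critic_llm wrapper, the candidate_pairs list, and the 0.5 ROUGE-L threshold are placeholders to adapt; you could just as well swap the final comparison for an LLM-as-a-Judge call.

```python
# Sketch of the "Fool's Gold" filter: a critic answers the question from the
# context alone, and we keep the pair only if the answers agree.
# call_critic_llm(), candidate_pairs, and the threshold are placeholders.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def critic_answer(question: str, context: str) -> str:
    """Hypothetical wrapper around your critic LLM (a separate, strong model)."""
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say 'CANNOT ANSWER'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_critic_llm(prompt)  # placeholder for your critic client

def keep_pair(qa: dict, context: str, threshold: float = 0.5) -> bool:
    """Discard pairs where the critic's answer doesn't match the generator's."""
    critic = critic_answer(qa["question"], context)
    if "CANNOT ANSWER" in critic:
        return False
    score = scorer.score(qa["answer"], critic)["rougeL"].fmeasure
    return score >= threshold

# candidate_pairs: list of (qa_dict, source_context) tuples from Step 2.
golden_dataset = [(qa, ctx) for qa, ctx in candidate_pairs if keep_pair(qa, ctx)]
```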
The Multi-Hop Dilemma
This method is fantastic for single-hop QA. However, we must be honest about its limits.
- The Challenge: Real users ask complex questions that bridge multiple documents (“How does X affect Y?”).
- The Reality: While we can try to find documents with overlapping topics, Knowledge Graphs are king for multi-hop generation. If your app relies heavily on complex reasoning, you might need a Graph-based approach.
- The Compromise: For most RAG applications, the “Topic Intersection” method (finding docs that score high in two different topics) is a powerful middle ground; a rough sketch follows.
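As a rough sketch of that topic-intersection idea: re-fit BERTopic with calculate_probabilities=True so every chunk gets a full topic-probability vector, then keep chunks that clear a threshold on two different topics. The chunks list comes from the Step 1 sketch; the topic ids and the 0.2 threshold are placeholders.

```python
# "Topic Intersection" sketch: find chunks that score high on two topics,
# then feed them to the Step 2 generator with a bridging, multi-hop prompt.
# TOPIC_A, TOPIC_B, and THRESHOLD are placeholders to tune.
import numpy as np
from bertopic import BERTopic

topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(chunks)  # probs: (n_chunks, n_topics)

TOPIC_A, TOPIC_B = 3, 7   # e.g. the ids of "Authentication" and "Pricing"
THRESHOLD = 0.2           # minimum probability to count as "about" a topic

probs = np.asarray(probs)
mask = (probs[:, TOPIC_A] >= THRESHOLD) & (probs[:, TOPIC_B] >= THRESHOLD)
bridge_chunks = [chunks[i] for i in np.where(mask)[0]]

print(f"{len(bridge_chunks)} chunks span both topics")
# Prompt idea: "Using the context, explain how <topic A concept> affects <topic B concept>."
```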
Why This Approach Wins
- Edge Case Coverage: Topic modeling ensures you test the “scarce” topics just as thoroughly as the popular ones.
- Surgical Control: You stop guessing and start testing specific behaviors.
- Higher Quality: The Critic Loop ensures your “Golden Dataset” isn’t filled with hallucinations.
- Flexibility: It bends to your rules. You don’t need to learn a complex framework’s proprietary DSL, just standard Python and LLMs.