In late 2023, I was asked to switch streams. My previous work had been squarely in infrastructure and data platform territory — ADX ingestion pipelines, Bicep IaC, security hardening. Now I was being handed the AI Copilot workstream, which was stuck at 40% query accuracy with a Public Preview deadline closing in fast.
I had never worked on an LLM pipeline before.
What the Pipeline Looked Like
The Copilot system took a user's natural language question and turned it into a KQL query to run against the manufacturing knowledge graph. The pipeline was RAG-based: it retrieved relevant instructions and examples from storage, assembled a prompt, called Azure OpenAI, got back a KQL query, and executed it.
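The four stages described above can be sketched end to end. Every helper name here is a hypothetical placeholder (the real storage, Azure OpenAI, and execution calls are stubbed out); the point is only the shape of the flow.

```python
# Minimal sketch of the RAG pipeline stages. Each helper is a stub
# standing in for the real storage, Azure OpenAI, and ADX calls.

def retrieve_context(question: str) -> str:
    return "INSTRUCTIONS + EXAMPLES"            # stub: fetched from storage

def assemble_prompt(question: str, context: str) -> str:
    return f"{context}\n\nQuestion: {question}"  # stub: prompt assembly

def call_azure_openai(prompt: str) -> str:
    return "Machines | take 10"                  # stub: the generated KQL

def execute_kql(kql: str) -> list:
    return [kql]                                 # stub: rows from the graph

def answer(question: str) -> list:
    context = retrieve_context(question)
    prompt = assemble_prompt(question, context)
    kql = call_azure_openai(prompt)
    return execute_kql(kql)
```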
At 40% accuracy, 6 out of 10 queries were either failing to execute or returning wrong results. The deadline was weeks away. I needed to find and fix the real failure modes — fast.
Diagnosing the Failures
I started by running the Copilot against a set of test questions and studying every failure carefully. Three patterns emerged:
Failure Mode 1: Missing Type Casts
Copilot was generating KQL with column references but consistently forgetting to cast them to the correct type. A query that should have been `where tostring(Status) == "Active"` was being generated as `where Status == "Active"` — and failing silently or returning wrong results.
Fix: Post-process the generated KQL and automatically inject typecasts for all columns based on their type definitions in the DTDL schema. The model didn't need to know the types — the system handled it.
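A minimal sketch of that post-processing step, assuming a simple column-to-DTDL-type mapping (the mapping, the regex approach, and the function names are my illustration, not the production code):

```python
import re

# Hypothetical mapping from DTDL primitive types to KQL cast functions.
CAST_FOR_TYPE = {
    "string": "tostring",
    "integer": "toint",
    "double": "todouble",
    "boolean": "tobool",
    "dateTime": "todatetime",
}

def inject_typecasts(kql: str, schema: dict) -> str:
    """Wrap bare column references in the cast matching their DTDL type.

    Columns already inside a call, e.g. tostring(Status), are left alone.
    """
    for column, dtdl_type in schema.items():
        cast = CAST_FOR_TYPE.get(dtdl_type)
        if cast is None:
            continue
        # Whole-word match, not preceded by "(" or a word char,
        # and not itself followed by "(" (i.e. not a function call).
        pattern = rf"(?<![\w(]){re.escape(column)}\b(?!\s*\()"
        kql = re.sub(pattern, f"{cast}({column})", kql)
    return kql

schema = {"Status": "string", "Count": "integer"}
fixed = inject_typecasts('Machines | where Status == "Active" and Count > 5', schema)
# fixed == 'Machines | where tostring(Status) == "Active" and toint(Count) > 5'
```

The deterministic rewrite means the prompt never has to teach the model the type system at all.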
Failure Mode 2: No Learning from Previous Errors
When a query failed validation, the next attempt started from scratch with the same prompt. The model had no idea what had gone wrong previously.
Fix: Propagate previous validation errors into the next prompt. The model now sees: "Your previous query failed with error X — here it is, please fix it." This turned single-shot attempts into a self-correcting loop.
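The loop looks roughly like this. `call_llm` and `validate_kql` are hypothetical stand-ins for the Azure OpenAI call and the KQL validator, injected here so the sketch is self-contained:

```python
MAX_ATTEMPTS = 3

def generate_query(question, base_prompt, call_llm, validate_kql):
    """Retry loop that feeds the previous failure back into the prompt.

    `validate_kql` returns None on success, or an error string on failure.
    """
    prompt = base_prompt.format(question=question)
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        if last_error is not None:
            # Self-correction: show the model its own failed query and error.
            prompt = (
                f"{base_prompt.format(question=question)}\n\n"
                f"Your previous query failed with this error:\n{last_error}\n"
                f"Previous query:\n{last_query}\n"
                f"Please fix it."
            )
        last_query = call_llm(prompt)
        last_error = validate_kql(last_query)
        if last_error is None:
            return last_query
    raise RuntimeError(f"No valid query after {MAX_ATTEMPTS} attempts: {last_error}")
```

The key design choice is that validation errors are treated as model input, not just telemetry.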
Failure Mode 3: Customer Examples Were in the Wrong Syntax
The pipeline used cipher-based KQL internally (more structured, easier for the LLM to handle) but customers registered examples in standard KQL. There was a mismatch — the model was getting confused by mixed syntax in its context.
Fix: Build an automatic KQL-to-cipher-text converter that runs server-side when a customer registers an example. Customers write standard KQL, the system transparently stores and uses the cipher equivalent. Zero migration effort for customers, clean inputs for the model.
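The registration path can be sketched as a thin wrapper that stores both forms. The actual KQL-to-cipher translation is the interesting part and is elided here: `convert` is an injected hypothetical stand-in, and `ExampleStore` is my name, not the real service's.

```python
from dataclasses import dataclass, field

@dataclass
class ExampleStore:
    """Registration hook: customers submit standard KQL; the service
    stores the converted cipher form the model actually consumes."""
    convert: callable            # hypothetical KQL -> cipher translator
    examples: list = field(default_factory=list)

    def register(self, question: str, kql: str) -> None:
        self.examples.append({
            "question": question,
            "kql": kql,                   # what the customer wrote
            "cipher": self.convert(kql),  # what the model sees
        })
```

Because the conversion happens server-side at registration time, customers never see the internal syntax and existing workflows need no migration.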
Other Improvements
Parallel Data Fetching
The prompt assembly step was fetching instructions, examples, and aliases from Cosmos DB sequentially. I switched to parallel calls, which cut prompt assembly time significantly and made the pipeline more responsive.
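With `asyncio.gather`, the three fetches overlap instead of queueing. Here the Cosmos DB round-trips are simulated with sleeps, and the helper names are illustrative:

```python
import asyncio

async def fetch(kind: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stands in for a Cosmos DB round-trip
    return f"{kind}-data"

async def assemble_prompt_parts() -> dict:
    # All three fetches run concurrently: total wait is roughly the
    # slowest single call, not the sum of all three.
    instructions, examples, aliases = await asyncio.gather(
        fetch("instructions", 0.05),
        fetch("examples", 0.05),
        fetch("aliases", 0.05),
    )
    return {"instructions": instructions, "examples": examples, "aliases": aliases}

parts = asyncio.run(assemble_prompt_parts())
```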
Inbuilt Jobs for Aliases and Instructions
Wrote startup jobs to pre-register out-of-the-box aliases, instructions, and examples — so every new customer deployment started with a solid baseline that the model could rely on, rather than a blank slate.
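The important property of such a job is idempotence: it can run on every startup without clobbering customer overrides. A minimal sketch, with made-up defaults and a plain dict standing in for the real store:

```python
# Illustrative out-of-the-box aliases; the real built-in set differs.
DEFAULT_ALIASES = {"mc": "MachineCenter", "wo": "WorkOrder"}

def seed_defaults(store: dict) -> int:
    """Register built-in aliases, skipping any the customer already
    defined. Returns how many were added."""
    added = 0
    for alias, target in DEFAULT_ALIASES.items():
        if alias not in store:  # idempotent: safe to run on every startup
            store[alias] = target
            added += 1
    return added

store = {"mc": "CustomMachine"}  # the customer overrode "mc" already
seed_defaults(store)             # adds only the missing "wo" default
```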
Results
What I Took Away
- Study the failures, not the architecture. I didn't spend time redesigning the whole pipeline — I ran it, looked at what was actually failing, and fixed those specific things. A lot of LLM "accuracy problems" are really just input-quality problems.
- Post-processing is underrated. You don't have to get the model to be perfect at everything. Some things are easier to fix deterministically after the model responds — like type casting — than to get right through prompt engineering.
- Give the model feedback. Propagating previous errors into the next prompt was one of the highest-impact changes. The model is trying to help you — tell it what went wrong.