In late 2023, I was asked to switch streams. My previous work had been squarely in infrastructure and data platform territory — ADX ingestion pipelines, Bicep IaC, security hardening. Now I was being handed the AI Copilot workstream, which was stuck at 40% query accuracy with a Public Preview deadline closing in fast.
I had never worked on an LLM pipeline before.
What the Pipeline Looked Like
The Copilot system took a user's natural language question and turned it into a KQL query to run against the manufacturing knowledge graph. The pipeline was RAG-based: it retrieved relevant instructions and examples from storage, assembled a prompt, called Azure OpenAI, got back a KQL query, and executed it.
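The four stages described above can be sketched end to end. Every helper name here is a hypothetical placeholder (the real storage, Azure OpenAI, and execution calls are stubbed out); the point is only the shape of the flow.

```python
# Minimal sketch of the RAG pipeline stages. Each helper is a stub
# standing in for the real storage, Azure OpenAI, and ADX calls.

def retrieve_context(question: str) -> str:
    return "INSTRUCTIONS + EXAMPLES"            # stub: fetched from storage

def assemble_prompt(question: str, context: str) -> str:
    return f"{context}\n\nQuestion: {question}"  # stub: prompt assembly

def call_azure_openai(prompt: str) -> str:
    return "Machines | take 10"                  # stub: the generated KQL

def execute_kql(kql: str) -> list:
    return [kql]                                 # stub: rows from the graph

def answer(question: str) -> list:
    context = retrieve_context(question)
    prompt = assemble_prompt(question, context)
    kql = call_azure_openai(prompt)
    return execute_kql(kql)
```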
At 40% accuracy, 6 out of 10 queries were either failing to execute or returning wrong results. The deadline was weeks away. I needed to find and fix the real failure modes — fast.
Diagnosing the Failures
I started by running the Copilot against a set of test questions and studying every failure carefully. Three patterns emerged:
Failure Mode 1: Missing Type Casts
Copilot was generating KQL with column references but consistently forgetting to cast them to the correct type. A query that should have been `where tostring(Status) == "Active"` was being generated as `where Status == "Active"` — and failing silently or returning wrong results.
Fix: Post-process the generated KQL and automatically inject typecasts for all columns based on their type definitions in the DTDL schema. The model didn't need to know the types — the system handled it.
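A minimal sketch of that post-processing step, assuming a simple column-to-DTDL-type mapping (the mapping, the regex approach, and the function names are my illustration, not the production code):

```python
import re

# Hypothetical mapping from DTDL primitive types to KQL cast functions.
CAST_FOR_TYPE = {
    "string": "tostring",
    "integer": "toint",
    "double": "todouble",
    "boolean": "tobool",
    "dateTime": "todatetime",
}

def inject_typecasts(kql: str, schema: dict) -> str:
    """Wrap bare column references in the cast matching their DTDL type.

    Columns already inside a call, e.g. tostring(Status), are left alone.
    """
    for column, dtdl_type in schema.items():
        cast = CAST_FOR_TYPE.get(dtdl_type)
        if cast is None:
            continue
        # Whole-word match, not preceded by "(" or a word char,
        # and not itself followed by "(" (i.e. not a function call).
        pattern = rf"(?<![\w(]){re.escape(column)}\b(?!\s*\()"
        kql = re.sub(pattern, f"{cast}({column})", kql)
    return kql

schema = {"Status": "string", "Count": "integer"}
fixed = inject_typecasts('Machines | where Status == "Active" and Count > 5', schema)
# fixed == 'Machines | where tostring(Status) == "Active" and toint(Count) > 5'
```

The deterministic rewrite means the prompt never has to teach the model the type system at all.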
Failure Mode 2: No Learning from Previous Errors
When a query failed validation, the next attempt started from scratch with the same prompt. The model had no idea what had gone wrong previously.
Fix: Propagate previous validation errors into the next prompt. The model now sees: "Your previous query failed with error X — here it is, please fix it." This turned single-shot attempts into a self-correcting loop.
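The loop looks roughly like this. `call_llm` and `validate_kql` are hypothetical stand-ins for the Azure OpenAI call and the KQL validator, injected here so the sketch is self-contained:

```python
MAX_ATTEMPTS = 3

def generate_query(question, base_prompt, call_llm, validate_kql):
    """Retry loop that feeds the previous failure back into the prompt.

    `validate_kql` returns None on success, or an error string on failure.
    """
    prompt = base_prompt.format(question=question)
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        if last_error is not None:
            # Self-correction: show the model its own failed query and error.
            prompt = (
                f"{base_prompt.format(question=question)}\n\n"
                f"Your previous query failed with this error:\n{last_error}\n"
                f"Previous query:\n{last_query}\n"
                f"Please fix it."
            )
        last_query = call_llm(prompt)
        last_error = validate_kql(last_query)
        if last_error is None:
            return last_query
    raise RuntimeError(f"No valid query after {MAX_ATTEMPTS} attempts: {last_error}")
```

The key design choice is that validation errors are treated as model input, not just telemetry.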
Failure Mode 3: Customer Examples Were in the Wrong Syntax
The pipeline used cipher-based KQL internally (more structured, easier for the LLM to handle) but customers registered examples in standard KQL. There was a mismatch — the model was getting confused by mixed syntax in its context.
Fix: Build an automatic KQL-to-cipher-text converter that runs server-side when a customer registers an example. Customers write standard KQL, the system transparently stores and uses the cipher equivalent. Zero migration effort for customers, clean inputs for the model.
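The registration path can be sketched as a thin wrapper that stores both forms. The actual KQL-to-cipher translation is the interesting part and is elided here: `convert` is an injected hypothetical stand-in, and `ExampleStore` is my name, not the real service's.

```python
from dataclasses import dataclass, field

@dataclass
class ExampleStore:
    """Registration hook: customers submit standard KQL; the service
    stores the converted cipher form the model actually consumes."""
    convert: callable            # hypothetical KQL -> cipher translator
    examples: list = field(default_factory=list)

    def register(self, question: str, kql: str) -> None:
        self.examples.append({
            "question": question,
            "kql": kql,                   # what the customer wrote
            "cipher": self.convert(kql),  # what the model sees
        })
```

Because the conversion happens server-side at registration time, customers never see the internal syntax and existing workflows need no migration.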
Other Improvements
Parallel Data Fetching
The prompt assembly step was fetching instructions, examples, and aliases from Cosmos DB sequentially. I switched to parallel calls, which cut prompt assembly time significantly and made the pipeline more responsive.
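With `asyncio.gather`, the three fetches overlap instead of queueing. Here the Cosmos DB round-trips are simulated with sleeps, and the helper names are illustrative:

```python
import asyncio

async def fetch(kind: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stands in for a Cosmos DB round-trip
    return f"{kind}-data"

async def assemble_prompt_parts() -> dict:
    # All three fetches run concurrently: total wait is roughly the
    # slowest single call, not the sum of all three.
    instructions, examples, aliases = await asyncio.gather(
        fetch("instructions", 0.05),
        fetch("examples", 0.05),
        fetch("aliases", 0.05),
    )
    return {"instructions": instructions, "examples": examples, "aliases": aliases}

parts = asyncio.run(assemble_prompt_parts())
```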
Inbuilt Jobs for Aliases and Instructions
Wrote startup jobs to pre-register out-of-the-box aliases, instructions, and examples — so every new customer deployment started with a solid baseline that the model could rely on, rather than a blank slate.
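The important property of such a job is idempotence: it can run on every startup without clobbering customer overrides. A minimal sketch, with made-up defaults and a plain dict standing in for the real store:

```python
# Illustrative out-of-the-box aliases; the real built-in set differs.
DEFAULT_ALIASES = {"mc": "MachineCenter", "wo": "WorkOrder"}

def seed_defaults(store: dict) -> int:
    """Register built-in aliases, skipping any the customer already
    defined. Returns how many were added."""
    added = 0
    for alias, target in DEFAULT_ALIASES.items():
        if alias not in store:  # idempotent: safe to run on every startup
            store[alias] = target
            added += 1
    return added

store = {"mc": "CustomMachine"}  # the customer overrode "mc" already
seed_defaults(store)             # adds only the missing "wo" default
```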
Results
What I Took Away
- Study the failures, not the architecture. I didn't spend time redesigning the whole pipeline — I ran it, looked at what was actually failing, and fixed those specific things. A lot of LLM "accuracy problems" are really just input-quality problems.
- Post-processing is underrated. You don't have to get the model to be perfect at everything. Some things are easier to fix deterministically after the model responds — like type casting — than to get right through prompt engineering.
- Give the model feedback. Propagating previous errors into the next prompt was one of the highest-impact changes. The model is trying to help you — tell it what went wrong.