2025-10-06

Diagnosing Latency: Making Our Agent 3× Faster

By Jason Toews, AI Developer, Orium and David Azoulay, Director, Agentic R&D, Orium

The vision was simple: take up to three items, surface their key differences, and deliver personalized pros and cons tailored to different kinds of shoppers. The compare agent was designed to be the smart shopping companion we always wanted: fast, insightful, and genuinely helpful.

And it worked. The assistant pulled detailed specs, generated five bullet points per product, and followed up with well-structured paragraphs for different demographics.

At first, we didn’t think we had a performance problem. Sure, it felt a little slow, but we chalked it up to dev mode quirks.

Then we introduced performance evaluators.

That’s when we saw the real number: 13 seconds from page load to response. Suddenly, what had seemed “fine” didn’t look so fine anymore. No crashes. No model failures. Just a smart feature hidden behind a sluggish experience.

We had built a product that worked technically, but didn’t feel like it was working, at least not for real-world users.

And in AI, perception is performance.




The Real Cost of Doing It All at Once

So what was going wrong? The agent was doing exactly what we asked it to: it processed multiple products at once, retrieved the relevant specs, and generated a complete output in a single LLM call.

But that meant generating 15 bullet points and 3 tailored paragraphs per session, each one nuanced and customer-aware. That’s a heavy lift. Add in full product data, extra context, and verbose message history, and the latency became inevitable. Unfortunately, it was also unacceptable.

Improving performance wasn’t just about chasing benchmark numbers. A slow response makes the whole experience feel poor, and studies consistently show that lag kills customer engagement and costs sales. So we made performance a product requirement, not just a backend concern.

Breaking Down the Bottleneck

As with so many things in the complex world of enterprise technology and AI, there was no silver bullet. Instead, we made a series of focused changes to improve responsiveness, both in how we structured data and how we orchestrated generation.

LangGraph-Powered Parallelism

One of the most impactful improvements came from how we used LangGraph to parallelize generation.

LangGraph’s Send API lets a graph fan out into multiple generation tasks that run at the same time, with each path receiving a different prompt and schema. In our case, we wanted to produce both bullet-point comparisons and paragraph-style summaries: two distinct outputs that could be generated independently.

Instead of executing these steps sequentially, we dispatched both with Send and ran them concurrently, which significantly reduced latency and simplified orchestration.

Here’s a simplified version of how we triggered both prompts in parallel:

const prompts = [
  process.env.COMPARE_PROMPT_KEY || "compare-prompt",
  process.env.COMPARE_SUMMARY_PROMPT_KEY || "compare-summary-prompt"
];

return prompts.map((prompt) => {
  return new Send("compareProducts", {
    ...state,
    prompt,
  });
});

This approach gave us:

  • Parallel execution of bullet points and paragraph summaries
  • More efficient use of model time
  • Clean, maintainable graph structure without duplicating logic
  • Faster end-to-end results for users

By leveraging LangGraph’s concurrency features, we were able to dramatically cut generation time while keeping our architecture simple and scalable.
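To make the concurrency win concrete, here is a minimal, dependency-free sketch of the same fan-out/gather pattern (names and the stand-in generate function are illustrative, not our production code):

```typescript
// Each prompt key becomes an independent generation task; all tasks start
// at once, so total latency is roughly the slowest task, not the sum.

type CompareState = { products: string[]; prompt: string };

// Stand-in for an LLM call; in the real graph this is the model invocation.
async function generate(task: CompareState): Promise<string> {
  return `${task.prompt}: compared ${task.products.length} products`;
}

async function compareInParallel(products: string[], prompts: string[]) {
  // Fan out: one task per prompt, all started immediately...
  const tasks = prompts.map((prompt) => generate({ products, prompt }));
  // ...then gathered together once every task resolves.
  return Promise.all(tasks);
}
```

This is the same shape the Send-based graph gives us, with LangGraph handling the orchestration, state merging, and error paths.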

Data Cleaning & Reduced Context

Before sending anything to the model, we cleaned and flattened the data into a lean, prompt-ready structure. We stripped out nested fields, removed irrelevant metadata, and included only what was strictly necessary.

We also minimized the context passed to the LLM. No full message histories. No extra baggage. Each generation task got only the input it needed, nothing more.

This resulted in:

  • Faster generation – Less input = less processing
  • Lower token usage – Smaller prompts and outputs
  • More focused responses – No distractions (just the essentials) and less hallucination
  • Easier prompt control – Cleaner input = better alignment
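As an illustration, a flattening step might look like the following sketch (the field names are hypothetical; the point is dropping nested metadata and pre-rendering values the model would otherwise have to interpret):

```typescript
// Keep only the attributes the prompts actually use; discard the rest.

type RawProduct = {
  id: string;
  meta?: Record<string, unknown>; // tracking info, timestamps, etc. — dropped
  attributes: { brand: string; price: { amount: number; currency: string } };
  variants?: unknown[]; // not needed for comparison — dropped
};

type LeanProduct = { id: string; brand: string; price: string };

function toLeanProduct(p: RawProduct): LeanProduct {
  return {
    id: p.id,
    brand: p.attributes.brand,
    // Flatten nested values into a single prompt-ready string.
    price: `${p.attributes.price.amount} ${p.attributes.price.currency}`,
  };
}
```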

Runtime Model Switching

To support experimentation and cost/quality tradeoffs, we enabled runtime model switching without redeploying.

This gave us the flexibility to test different LLMs under real conditions, optimize for different scenarios, and move fast without disrupting the pipeline.

With this setup, we could:

  • A/B test model performance on demand
  • Adapt dynamically to shifting latency or budget goals
  • Reduce deployment friction by decoupling model choice from code
  • Pick the right model for the right job in real time
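The mechanism can be as simple as resolving the model per request from configuration instead of baking it into the build. A sketch, with hypothetical environment variable names and default model ids:

```typescript
// Runtime model selection: switching models is a config change, not a redeploy.

type ModelConfig = { model: string; maxTokens: number };

function resolveModel(task: "bullets" | "summary"): ModelConfig {
  const overrides: Record<string, string | undefined> = {
    bullets: process.env.COMPARE_BULLETS_MODEL,
    summary: process.env.COMPARE_SUMMARY_MODEL,
  };
  return {
    // Fall back to a sensible default when no override is set.
    model: overrides[task] ?? "gpt-4o-mini",
    maxTokens: task === "bullets" ? 300 : 600,
  };
}
```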

Agility here meant speed, control, and confidence.

Model Configuration Tuning

We didn’t just pick the model; we tuned how we used it. For each generation step, we optimized the configuration:

  • Lowered max_tokens to reduce verbosity and control output length
  • Adjusted temperature and reasoning depth for more focused, consistent responses
  • Used high-tier models only where needed, and lighter ones where they sufficed

These adjustments gave us high-quality results with fewer tokens and lower latency, without compromising reliability, especially where it mattered most. For example, for mission-critical steps that required the lowest possible latency, we set service_tier to priority. This increased cost somewhat but significantly improved output latency, a tradeoff that makes sense in those instances.
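A per-step configuration table can capture these choices in one place. The values below are illustrative, not our production settings: a lighter, capped configuration for bullet points, and the priority service tier reserved for the latency-critical step.

```typescript
// Illustrative per-step generation settings (model ids and numbers are examples).

const generationConfig = {
  bullets: {
    model: "gpt-4o-mini", // lighter model suffices here
    max_tokens: 250,      // cap verbosity and output length
    temperature: 0.2,     // consistent, focused output
  },
  summary: {
    model: "gpt-4o",
    max_tokens: 500,
    temperature: 0.3,
    service_tier: "priority", // pay a bit more for lower latency
  },
} as const;
```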

Structured Output: Tailored JSON Responses

Rather than returning markdown or freeform text, the assistant generates structured JSON based on predefined schemas tailored to the task. For example, bullet-point responses include a list of products, each with a brand, SKU, and five differentiators, while paragraph summaries return a set of titled descriptions comparing items across user needs.

This approach gave us:

  • Perfect formatting, every time – No need to parse or correct
  • Fewer output tokens – More compact = cheaper and faster
  • Less formatting work for the LLM – Reduces latency
  • Simpler prompts, smaller context – Clean and easy to reason about

Using structured JSON helped make the assistant’s output more consistent, efficient, and production-ready.

const schema = {
  type: "object",
  properties: {
    summaries: {
      type: "array",
      minItems: 3,
      maxItems: 5,
      items: {
        type: "object",
        properties: {
          title: {
            type: "string",
          },
          description: {
            type: "string",
          },
        },
        required: ["title", "description"],
      },
    },
  },
};
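Wiring a schema like this into a request can follow the shape of OpenAI’s JSON-schema structured outputs. The sketch below builds the request object only (it makes no API call); the schema name and message content are illustrative:

```typescript
// Build a request whose response is constrained to match a JSON schema,
// so no parsing or cleanup pass is needed afterwards.

const summarySchema = {
  type: "object",
  properties: {
    summaries: {
      type: "array",
      minItems: 3,
      maxItems: 5,
      items: {
        type: "object",
        properties: {
          title: { type: "string" },
          description: { type: "string" },
        },
        required: ["title", "description"],
      },
    },
  },
  required: ["summaries"],
};

const request = {
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Compare these products..." }],
  response_format: {
    type: "json_schema",
    json_schema: { name: "compare_summaries", schema: summarySchema },
  },
};
```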

Prompt Simplification

We also reworked our prompts to be tighter and more specific. Early versions asked the model to analyze, explain, or describe, which led to bloated outputs and inconsistent results. We refined the prompts to ask the model to summarize or select instead.

This made the instructions clearer and the outputs more predictable.

Benefits included:

  • Faster execution – Less reasoning required
  • Fewer tokens – Simpler instructions, smaller outputs
  • More consistent responses – Better alignment with intent
  • Easier to evaluate – Less variability across runs

Prompt engineering turned out to be a high-leverage performance lever.
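A hypothetical before/after makes the shift concrete (these are illustrative strings, not our actual prompts): open-ended verbs invite unbounded reasoning, while bounded verbs constrain both the task and the output length.

```typescript
// "Analyze"/"explain" invite open-ended reasoning and long outputs.
const before =
  "Analyze these products and explain in detail how they differ, " +
  "describing the strengths and weaknesses of each.";

// "Summarize"/"select" bound the task: a fixed count, a clear criterion.
const after =
  "Summarize the 5 most important differences between these products " +
  "as short bullet points. Select only differences a shopper would notice.";
```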

Query Pruning

Finally, our GraphQL queries were also part of the problem. Initially, we fetched entire product records, including fields the assistant didn’t need. That meant heavier payloads, more processing, and more tokens.

We reviewed and rewrote those queries to return only what the assistant actually used. The result?

  • Smaller data payloads
  • Faster request handling
  • Shorter prompts
  • Lower LLM token usage

It was a simple change with a big impact. Our GraphQL queries were originally so large that we had to use POST requests for API calls, which meant the responses weren’t cached. After significantly trimming what we were querying, we were able to switch to GET requests, allowing responses to be cached and reducing latency even further.
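An illustrative before/after of the pruning (the product fields here are hypothetical): the trimmed query requests only what the assistant uses, and is small enough to travel in a cacheable GET URL.

```typescript
// Original shape: the full product record, including fields the
// assistant never reads.
const fullQuery = `
  query Product($sku: String!) {
    product(sku: $sku) {
      sku
      name
      brand
      description
      media { images { url alt } videos { url } }
      inventory { warehouse quantity }
      attributes { key value }
    }
  }
`;

// Pruned shape: only the fields the comparison prompts actually use.
const prunedQuery = `
  query Product($sku: String!) {
    product(sku: $sku) {
      sku
      brand
      attributes { key value }
    }
  }
`;

// A pruned query fits comfortably in a URL, enabling cacheable GETs.
const url = `/graphql?query=${encodeURIComponent(prunedQuery)}`;
```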

The Results: 3x Faster

These optimizations made a clear impact:

  • Total latency dropped by more than two-thirds
  • LLM generation time was reduced significantly
  • The assistant now feels responsive and reliable

The experience went from “wait and see” to “click and compare.” The content stayed just as rich, only now it arrives fast enough to matter.

What’s Next?

Performance isn’t something you finish—it’s something you continuously improve.

We’re exploring several other ways to continue improving performance:

  • Progressive rendering – Show bullet points first, then stream in full paragraphs
  • Client-side streaming – Populate results as they become available
  • Vectorized product info – Store product data in a vector database to reduce context size, increase performance, and potentially improve responses
  • Prioritized rendering – Display the most relevant product first, especially in staggered comparisons

Because in a fast-moving world, even the smartest answer needs to arrive on time.

Final Remarks

Speed doesn’t have to mean cutting corners. By untangling the process, slimming down inputs, and optimizing model behavior, we preserved the intelligence of the compare agent and made it feel fast.

Now when users open a comparison, it just works. Quickly. Thoughtfully. Reliably. And that’s exactly what smart assistants are supposed to do.
