In the previous post, we learned to trace what our agents actually did. Now we know what is happening—and what we’re seeing is a $4 API bill for a workflow that could have cost $0.40.
Multi-agent systems have a performance problem that doesn’t exist with single-agent calls: costs compound. A three-agent crew where each agent spends 2,000 tokens on context and produces 800-token outputs means the third agent may receive 5,000+ tokens as input before it writes a single word. Run this in production at any real volume and the bill grows fast. Add latency—each agent waits for the previous one—and you also have a slow system.
This post covers the practical techniques for fixing both problems.
Measuring Before Optimizing
Don’t optimize blind. CrewAI exposes token usage through usage_metrics after every crew run.
After a run, it reports token counts for the whole crew, including prompt_tokens and completion_tokens.
That prompt_tokens number is almost always the culprit. In most workflows, 80–90% of your token spend is prompt tokens—the context you’re feeding agents, not the content they produce. That’s where to focus.
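To check that ratio on your own runs, a small helper is enough. This is a minimal sketch: the SimpleNamespace object stands in for whatever you read from usage_metrics, the field name completion_tokens is assumed to sit alongside prompt_tokens, and the numbers are made up for illustration.

```python
from types import SimpleNamespace


def prompt_share(metrics):
    """Return the fraction of all tokens that were prompt (input) tokens."""
    total = metrics.prompt_tokens + metrics.completion_tokens
    return metrics.prompt_tokens / total if total else 0.0


# Stand-in for the metrics object read after a crew run (illustrative numbers).
example_metrics = SimpleNamespace(prompt_tokens=10_400, completion_tokens=1_600)

print(f"prompt share: {prompt_share(example_metrics):.0%}")
```

If the share comes out well below the 80–90% range, output length rather than context may be the better place to start.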
Model Tiering: Use the Right Tool for Each Job
The biggest lever you have is model selection per agent. Not every agent in your crew needs GPT-4o or Claude Opus. A lot of what agents do is mechanical: formatting output, extracting structured data, routing decisions. Smaller models handle this well and cost 10–20x less.
A practical tiering guide:
| Agent type | Recommended model tier | Why |
|---|---|---|
| Strategy, analysis, reasoning | Opus / GPT-4o | Needs deep inference |
| Data extraction, classification | Sonnet / GPT-4o-mini | Pattern-matching, not reasoning |
| Formatting, summarization | Haiku / GPT-3.5-turbo | Mechanical transformation |
| Tool-use-only agents | Haiku / GPT-3.5-turbo | Just dispatching calls |
You won’t always get this tiering right the first time—run usage_metrics before and after and let the numbers confirm the tradeoff.
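One way to keep tiering decisions in a single place is a lookup table plus a rough cost estimate. The sketch below copies the tiers from the table above. The dictionary keys, the helper names, and the per-million-token prices are all hypothetical placeholders, not real provider pricing.

```python
# Tier assignments copied from the table above; the keys are hypothetical labels.
TIER_BY_AGENT_TYPE = {
    "reasoning": "Opus / GPT-4o",
    "extraction": "Sonnet / GPT-4o-mini",
    "formatting": "Haiku / GPT-3.5-turbo",
    "tool_only": "Haiku / GPT-3.5-turbo",
}


def pick_tier(agent_type, default="reasoning"):
    """Return the tier for an agent type, falling back to the strongest tier."""
    return TIER_BY_AGENT_TYPE.get(agent_type, TIER_BY_AGENT_TYPE[default])


def run_cost(prompt_tokens, completion_tokens, in_price, out_price):
    """Cost in dollars, given prices per million tokens."""
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000


# Placeholder prices: the small model is assumed to be 10x cheaper.
large = run_cost(10_000, 2_000, in_price=5.0, out_price=15.0)
small = run_cost(10_000, 2_000, in_price=0.5, out_price=1.5)
print(f"large: ${large:.3f}  small: ${small:.3f}")
```

Falling back to the strongest tier for unknown agent types is a deliberate choice here: an unclassified agent costs more, but it doesn't silently produce shallow output.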
Parallel Task Execution
By default, CrewAI runs tasks sequentially. Agent A finishes, then B starts, then C. If A and B don’t depend on each other, that’s wasted latency.
Set async_execution=True on tasks that can run in parallel, and list them in the context of any task that depends on their output.
CrewAI’s sequential process respects async_execution—it kicks off async tasks together and waits for them before moving to tasks that depend on them via context. The synthesis_task here won’t start until both async tasks complete.
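The scheduling behavior described here can be illustrated without CrewAI at all. The following is a plain asyncio sketch of the same shape, not CrewAI's implementation: two independent tasks start together, and the dependent task waits for both. The task names and sleep durations are placeholders.

```python
import asyncio

events = []  # ordered log of what started and finished


async def run_task(name, seconds):
    """Stand-in for an agent task: log the start, wait, log the finish."""
    events.append(f"start {name}")
    await asyncio.sleep(seconds)
    events.append(f"finish {name}")
    return f"{name} output"


async def main():
    # Independent tasks run concurrently, like tasks with async_execution=True.
    await asyncio.gather(run_task("research", 0.05), run_task("pricing", 0.05))
    # The dependent task starts only after both of its inputs are ready.
    return await run_task("synthesis", 0.01)


result = asyncio.run(main())
print(events)
```

Total wall-clock time is roughly the slowest parallel task plus the synthesis step, rather than the sum of all three.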
A few things to get right:
Don’t async tasks that share state. If two agents both write to the same memory store or external database, parallel execution creates race conditions. Either sequence them or use locking.
Async doesn’t mean free. You’re still hitting rate limits. If you have five async tasks all using GPT-4o simultaneously, you may hit tokens-per-minute limits and get throttled. Check your rate limit tier before scaling parallelism.
The context field is your dependency graph. Only list tasks in context that an agent truly needs. Listing everything “just in case” stuffs the context window with irrelevant output.
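If you manage concurrency yourself, the usual way to stay under a rate limit is to cap how many calls are in flight at once. The sketch below uses a plain asyncio semaphore for this. It is a general technique, not a CrewAI feature, and MAX_CONCURRENT and call_model are hypothetical names.

```python
import asyncio

MAX_CONCURRENT = 2  # hypothetical cap chosen to stay under a rate limit
in_flight = 0
peak = 0


async def call_model(sem, i):
    """Stand-in for an LLM call; the semaphore caps how many run at once."""
    global in_flight, peak
    async with sem:
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
    return i


async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(call_model(sem, i) for i in range(5)))


results = asyncio.run(main())
print(f"peak concurrency: {peak}")
```

Five tasks are submitted, but no more than two ever run at the same time, which trades some latency for predictable throughput.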
Tool Result Caching
CrewAI caches tool call results by default—cache=True is the default on every Agent, so you usually don’t need to set it. What it does: if an agent calls the same tool with the same arguments a second time within a run, it returns the cached result instead of executing the tool again.
You’d only set cache=False when a tool returns live data that must be fresh on every call—a real-time price feed, a rate-limited scraper, or anything where a stale cached result would be wrong.
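To make the caching behavior concrete, here is a toy version of the idea: results are stored under a key built from the tool name and its arguments. This is only a sketch of the concept, not how CrewAI implements its cache, and fetch_population is a hypothetical tool returning placeholder data.

```python
call_count = 0
_cache = {}


def fetch_population(city):
    """Hypothetical tool: pretend this is an expensive lookup."""
    global call_count
    call_count += 1
    return {"city": city, "population": 1_000_000}  # placeholder data


def cached_call(tool, **kwargs):
    """Return the stored result when a tool is called again with the same arguments."""
    key = (tool.__name__, tuple(sorted(kwargs.items())))
    if key not in _cache:
        _cache[key] = tool(**kwargs)
    return _cache[key]


first = cached_call(fetch_population, city="Lisbon")
second = cached_call(fetch_population, city="Lisbon")  # served from the cache
third = cached_call(fetch_population, city="Porto")    # different arguments, real call
print(f"tool executed {call_count} times for 3 requests")
```

The same key logic shows why live data is a problem: the second Lisbon call never reaches the tool, so it can't see a changed value.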
LLM-level prompt caching (reusing KV cache tensors across identical prompt prefixes) runs at the provider level, and Anthropic, OpenAI, and Google all support it. It isn't a CrewAI setting, and it isn't always automatic: OpenAI applies it to sufficiently long prompts on its own, while Anthropic's API expects cacheable prefixes to be marked explicitly, so check how your provider integration handles this. In either case, cache hits depend on sending identical system prompts and context prefixes across calls. Anthropic discounts cached prompt reads by up to 90%.
We covered fine-grained tool caching with custom TTL and cache_function in Part 2.
Context Window Management
The output of each task becomes input for the next. If your tasks produce verbose outputs, your context window fills fast and your costs grow with every agent in the chain.
Use output_pydantic to force structured, compact outputs. Set it on a task to a Pydantic model class so the task's output is validated against that schema.
Instead of a 600-word prose analysis, downstream agents receive a compact, structured object. This alone can cut context tokens by 50–70% for data-heavy workflows.
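A quick way to see the difference is to compare the size of a prose answer with the same facts as a structured object. The sketch below uses a stdlib dataclass as a stand-in for a Pydantic model. The schema, the company name, and the sample text are all invented for illustration, and character count is only a rough proxy for tokens.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CompetitorSummary:
    """Hypothetical compact schema; in CrewAI a Pydantic model plays this role."""
    name: str
    price_usd: float
    strengths: list


# Invented sample of the verbose prose an agent might otherwise produce.
prose = (
    "After reviewing the available material in detail, it appears that Acme is "
    "positioned as a mid-market option, priced at roughly 49 dollars per month, "
    "and its main strengths seem to be a fast onboarding flow and a large "
    "integration catalog, both of which came up repeatedly in the sources."
)

structured = json.dumps(
    asdict(CompetitorSummary("Acme", 49.0, ["onboarding", "integrations"]))
)

savings = 1 - len(structured) / len(prose)
print(f"{len(prose)} chars of prose vs {len(structured)} structured ({savings:.0%} smaller)")
```

The exact savings depend on how verbose the original output was, so measure on real task outputs before relying on a figure.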
When you can’t use structured output (creative or open-ended tasks, for example), write explicit length constraints in expected_output, such as “A summary of no more than 150 words.”
Agents follow output format instructions more reliably than you’d expect, especially with temperature=0.0.
Trimming Agent Backstories
The backstory field is included in every prompt for that agent, for every LLM call it makes. A 200-token backstory (roughly 150 words) on a tool-calling agent that makes 8 calls per task adds 1,600 tokens of pure overhead.
Keep backstories short and role-specific.
The rule: backstory should contain information the agent actually needs to make decisions, not bio padding.
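The overhead is simple multiplication, which makes it easy to estimate before you trim anything. In this sketch the backstory is counted as 200 tokens per call, as the 1,600-token figure above implies. The 40-token trimmed version is a hypothetical target.

```python
def backstory_overhead(backstory_tokens, llm_calls):
    """Tokens spent re-sending the backstory on every LLM call in a task."""
    return backstory_tokens * llm_calls


verbose = backstory_overhead(200, 8)  # the example from the text
lean = backstory_overhead(40, 8)      # hypothetical trimmed backstory
saved = verbose - lean

print(f"verbose: {verbose} tokens, lean: {lean} tokens, saved per task: {saved}")
```

Multiply the saving by the number of tasks and runs per day to see whether the trim is worth the editing time.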
Measuring the Impact
Before any optimization, capture a baseline: the usage_metrics and wall-clock duration of a representative run.
Take the same measurement after each optimization. The numbers that matter most:
- Prompt tokens → impact of context trimming, output_pydantic, shorter backstories
- Duration → impact of async_execution and model tiering
- Completion tokens → usually less controllable, but structured output helps here too
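A baseline capture can be as small as a timer wrapped around one run. The sketch below is a minimal version under stated assumptions: capture_baseline and fake_run are hypothetical names, and fake_run stands in for kicking off your crew and reading its usage_metrics afterwards.

```python
import time
from types import SimpleNamespace


def capture_baseline(run_crew):
    """Time one run and collect its token counts into a plain dict."""
    start = time.perf_counter()
    metrics = run_crew()
    return {
        "duration_s": time.perf_counter() - start,
        "prompt_tokens": metrics.prompt_tokens,
        "completion_tokens": metrics.completion_tokens,
    }


def fake_run():
    # Placeholder for running the crew and returning its usage metrics.
    time.sleep(0.01)
    return SimpleNamespace(prompt_tokens=9_000, completion_tokens=3_000)


baseline = capture_baseline(fake_run)
print(baseline)
```

Storing the result as a plain dict makes it easy to dump to JSON and diff against the next run.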
A common baseline for a three-agent research workflow: ~12,000 tokens, ~45 seconds, $0.08/run. After applying model tiering, async execution, and structured outputs: ~4,000 tokens, ~18 seconds, $0.015/run.
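Expressed as percentage reductions, those figures work out as follows. The helper name reduction is just for illustration, and the inputs are the numbers quoted above.

```python
def reduction(before, after):
    """Fractional reduction going from before to after."""
    return (before - after) / before


before = {"tokens": 12_000, "seconds": 45, "cost_usd": 0.08}
after = {"tokens": 4_000, "seconds": 18, "cost_usd": 0.015}

for key in before:
    print(f"{key}: {reduction(before[key], after[key]):.0%} lower")
# tokens: 67% lower
# seconds: 60% lower
# cost_usd: 81% lower
```

Cost falls faster than tokens because model tiering lowers the price per token on top of the token savings.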
Common Pitfalls
Async tasks that share memory. If two agents both call memory.save() in parallel and your memory implementation isn’t thread-safe, you’ll get data corruption. Either sequence those tasks or give each agent its own memory namespace.
Caching stale tool results. Tool caching from Part 2 is time-bounded, but LLM caching is indefinite by default. If your workflow pulls live data and you cache LLM responses, an agent might produce a “current analysis” from a 3-day-old cache hit. Disable LLM response caching for agents that need fresh reasoning. Provider-side prompt caching is a separate mechanism: it reuses prompt processing, not answers, so it doesn’t cause this problem.
Wrong model for complex tasks. Model tiering works when you tier correctly. Sending a nuanced competitive analysis to GPT-3.5-turbo to save cost often results in shallow output that requires a re-run with GPT-4o anyway. Measure quality, not just cost.
Over-parallelizing against rate limits. More async tasks doesn’t automatically mean faster. If you’re on a low-tier API key, five concurrent tasks throttling each other is slower than three sequential ones. Test your rate limit headroom before scaling async.
What We Covered
The performance gap between a naive CrewAI crew and an optimized one is usually 3–5x on cost and 2x on latency—without sacrificing output quality. The techniques that move the needle most:
- Model tiering: largest impact on cost
- output_pydantic: largest impact on context bloat
- async_execution=True: largest impact on latency
- Lean backstories: easy win that most people skip
Use usage_metrics to confirm changes are actually working. Token counts don’t lie.
Next up: putting everything together for production deployment—rate limit handling, retries, cost controls, and monitoring CrewAI crews in a live environment.