Performance Optimization and Token Management in CrewAI

04 Jun 2026 · CrewAI · AI Agents · Performance · Optimization · Python AI/ML · Tutorial · 11 min read

In the previous post, we gained full visibility into what our agents do and why. Once I could see exactly what was happening, the next problem was obvious: a three-agent research workflow was spending $4 per run when it should have cost $0.40.

Multi-agent systems have a cost problem that doesn’t exist with single-agent calls: costs compound. In a three-agent crew where each agent processes 2,000 tokens of context and produces an 800-token output, the third agent starts with at least 3,600 tokens of input (its own 2,000 plus two upstream outputs) and can easily pass 5,000 once tool results and retries are added, all before it writes a single word. Multiply that by real traffic and the bill grows fast. Add sequential execution, where each agent waits for the previous one, and you also have a slow system.
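To make the compounding concrete, here is a small back-of-the-envelope sketch. It assumes the simplest case: every agent has 2,000 tokens of its own context and also receives the full 800-token output of each agent before it. The function name is mine, and real runs add tool results and retries on top of these numbers.

```python
def prompt_tokens_per_agent(own_context, output_size, n_agents):
    """Input tokens each agent sees when it also receives every upstream output."""
    return [own_context + i * output_size for i in range(n_agents)]


inputs = prompt_tokens_per_agent(own_context=2000, output_size=800, n_agents=3)
print(inputs)       # [2000, 2800, 3600]
print(sum(inputs))  # 8400 prompt tokens billed across the run
```

The total matters more than any single agent's input: every upstream output is paid for again by each agent that receives it.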

This post covers the practical techniques for fixing both.

Measure Before You Optimize

Don’t optimize blind. CrewAI exposes token usage through usage_metrics after every crew run.

from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Find market data for the given company",
    backstory="You specialize in financial market research.",
    llm="openai/gpt-4o"
)

summarizer = Agent(
    role="Report Writer",
    goal="Write a concise summary from research findings",
    backstory="You turn dense research into readable reports.",
    llm="openai/gpt-4o"
)

research_task = Task(
    description="Research {company} and gather key financial metrics.",
    expected_output="Bullet-point list of key metrics with sources.",
    agent=researcher
)

summary_task = Task(
    description="Write a 3-paragraph executive summary from the research.",
    expected_output="Three-paragraph executive summary.",
    agent=summarizer,
    context=[research_task]
)

crew = Crew(agents=[researcher, summarizer], tasks=[research_task, summary_task])
result = crew.kickoff(inputs={"company": "Stripe"})

print(crew.usage_metrics)

The output looks like this:

UsageMetrics(
    total_tokens=8432,
    prompt_tokens=6891,
    completion_tokens=1541,
    successful_requests=4
)

That prompt_tokens number is almost always the culprit. In most workflows, 80–90% of token spend is prompt tokens—the context you’re feeding agents, not the content they produce. That’s where to focus.
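A quick way to check that claim against your own runs is to compute the prompt share directly. The sketch below uses a small stand-in dataclass (not CrewAI's own class) populated with the numbers from the sample output above; the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UsageSnapshot:
    # Stand-in for the fields printed above; not CrewAI's own class.
    total_tokens: int
    prompt_tokens: int
    completion_tokens: int
    successful_requests: int


def prompt_share(usage):
    """Fraction of all tokens that were prompt (input) tokens."""
    return usage.prompt_tokens / usage.total_tokens


def tokens_per_request(usage):
    """Average tokens consumed per successful LLM request."""
    return usage.total_tokens / usage.successful_requests


usage = UsageSnapshot(8432, 6891, 1541, 4)
print(f"Prompt share: {prompt_share(usage):.1%}")              # Prompt share: 81.7%
print(f"Tokens per request: {tokens_per_request(usage):.0f}")  # Tokens per request: 2108
```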

Model Tiering: Use the Right Tool for Each Job

The single biggest cost lever is model selection per agent. Not every agent in your crew needs GPT-4o or Claude Opus. A lot of what agents do is mechanical: formatting output, extracting structured data, routing decisions. Smaller models handle this well and cost 10–20x less.

from crewai import Agent, LLM

# Expensive model only where it earns its cost
researcher = Agent(
    role="Senior Research Analyst",
    goal="Identify key competitive threats and market opportunities",
    backstory="You have deep expertise in market analysis and strategic research.",
    llm=LLM(model="openai/gpt-4o", temperature=0.3)
)

# Cheap model for structured extraction
data_extractor = Agent(
    role="Data Extraction Specialist",
    goal="Extract structured data from research findings",
    backstory="You convert unstructured text into structured JSON.",
    llm=LLM(model="openai/gpt-4o-mini", temperature=0.0)
)

# Cheap model for mechanical formatting
report_formatter = Agent(
    role="Report Formatter",
    goal="Format findings into the standard report template",
    backstory="You apply consistent formatting to research reports.",
    llm=LLM(model="openai/gpt-4o-mini", temperature=0.0)
)

A practical tiering guide:

Agent type	Recommended model tier	Why
Strategy, analysis, reasoning	Opus / GPT-4o	Needs deep inference
Data extraction, classification	Sonnet / GPT-4o-mini	Pattern-matching, not reasoning
Formatting, summarization	Haiku / GPT-4o-mini	Mechanical transformation
Tool-use-only agents	Haiku / GPT-4o-mini	Just dispatching calls

You won’t always tier this correctly the first time. Run usage_metrics before and after each change and let the numbers confirm the tradeoff. I’ve had cases where downgrading an extraction agent cut costs by 70% with no quality difference—and cases where it visibly degraded output. Measure; don’t assume.
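Before changing any agent, it can help to estimate what a tiering change should save. The sketch below is a rough calculator, not a billing tool: the two price entries are placeholders in USD per million tokens that you should replace with your provider's current rates, and the per-agent token counts are the illustrative figures from the introduction.

```python
# Placeholder prices in USD per 1M tokens: (input, output).
# Replace with your provider's current rates before trusting the result.
PRICES = {
    "large": (2.50, 10.00),
    "small": (0.15, 0.60),
}


def call_cost(tier, prompt_tokens, completion_tokens):
    """Cost in USD of one agent's token usage at the given tier."""
    price_in, price_out = PRICES[tier]
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000


def crew_cost(agents):
    """agents: list of (tier, prompt_tokens, completion_tokens) tuples."""
    return sum(call_cost(*agent) for agent in agents)


all_large = [("large", 2000, 800), ("large", 2800, 800), ("large", 3600, 800)]
tiered = [("large", 2000, 800), ("small", 2800, 800), ("small", 3600, 800)]

print(f"All large: ${crew_cost(all_large):.4f}")
print(f"Tiered:    ${crew_cost(tiered):.4f}")
```

With these placeholder prices the tiered crew costs about a third of the all-large one. The estimate says nothing about quality, which is why the before-and-after measurement still matters.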

Parallel Task Execution

By default, CrewAI runs tasks sequentially. Agent A finishes, then B starts, then C. If A and B don’t depend on each other, that’s wasted wall-clock time.

Set async_execution=True on tasks that can run in parallel:

from crewai import Task

# Assumes financial_analyst, news_analyst, and senior_analyst are Agent instances defined earlier.
# These two tasks have no dependency on each other
financial_research_task = Task(
    description="Research {company}'s financial metrics and recent earnings.",
    expected_output="Financial metrics summary with Q/Q trends.",
    agent=financial_analyst,
    async_execution=True   # runs concurrently
)

news_research_task = Task(
    description="Find recent news and press releases about {company}.",
    expected_output="Chronological list of significant news items.",
    agent=news_analyst,
    async_execution=True   # runs concurrently
)

# This task depends on both — it waits for them
synthesis_task = Task(
    description="Synthesize financial data and news into an investment brief.",
    expected_output="Two-page investment analysis brief.",
    agent=senior_analyst,
    context=[financial_research_task, news_research_task]  # explicit deps
)

CrewAI kicks off async tasks together and waits for them before moving to tasks that list them in context. The synthesis_task here won’t start until both async tasks complete.

Three things to get right:

Don’t async tasks that share state. If two agents both write to the same memory store or external database, parallel execution creates race conditions. Either sequence them or give each agent its own memory namespace.

Async doesn’t mean free. You’re still hitting rate limits. Five concurrent GPT-4o tasks may hit your tokens-per-minute ceiling and throttle each other. Test your rate limit headroom before scaling parallelism.

The context field is your dependency graph. Only list tasks that an agent truly needs. Listing everything “just in case” stuffs the context window with irrelevant output and increases costs.
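One way to reason about the dependency graph before running anything is to group tasks into "waves" that could run together. The sketch below is plain Python using the standard-library graphlib module; it illustrates the idea and is not how CrewAI schedules tasks internally. The task names and the execution_waves helper are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(deps):
    """Group tasks into waves; every task in a wave has all of its context ready.

    deps maps each task name to the set of task names it lists as context.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


deps = {
    "financial_research": set(),
    "news_research": set(),
    "synthesis": {"financial_research", "news_research"},
}

waves = execution_waves(deps)
print(waves)  # [['financial_research', 'news_research'], ['synthesis']]
```

If a task you expected in the first wave lands in a later one, its context list probably contains a dependency it does not need.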

Tool Result Caching

CrewAI caches tool call results by default—cache=True is the default on every Agent. If an agent calls the same tool with the same arguments twice within a run, it returns the cached result instead of executing again.

researcher = Agent(
    role="Research Analyst",
    goal="Find market data for the given company",
    backstory="You specialize in financial market research.",
    llm="openai/gpt-4o",
    cache=True   # default — caches tool call results within the run
)

You’d only set cache=False when a tool returns live data that must be fresh on every call—a real-time price feed, a rate-limited scraper, or anything where a stale result would be wrong.
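To see what caching by tool and arguments means in practice, here is a minimal standalone sketch. It is an illustration of the idea only, not CrewAI's implementation; the ToolCache class and the lookup_ticker tool are hypothetical.

```python
import json


class ToolCache:
    """Memoize tool results keyed by tool name plus arguments (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def call(self, tool_name, tool_fn, **kwargs):
        # Sorting keys makes the cache key independent of argument order.
        key = (tool_name, json.dumps(kwargs, sort_keys=True))
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = tool_fn(**kwargs)
        self._store[key] = result
        return result


def lookup_ticker(company):
    # Hypothetical tool; a real one would call an external API.
    return {"company": company, "ticker": company[:4].upper()}


cache = ToolCache()
first = cache.call("lookup_ticker", lookup_ticker, company="Stripe")
second = cache.call("lookup_ticker", lookup_ticker, company="Stripe")  # served from cache
third = cache.call("lookup_ticker", lookup_ticker, company="Adyen")    # new arguments, new call

print(cache.hits, cache.misses)  # 1 2
```

The same structure shows why live data is a problem: the second call never reaches the tool, so whatever the first call returned is what the agent sees.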

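LLM-level prompt caching (reusing the already-processed prompt prefix, i.e. KV cache state, across calls that share an identical prefix) happens at the provider level, not in CrewAI. Anthropic, OpenAI, and Google all support it, but the mechanics differ: OpenAI applies it automatically to sufficiently long prompts, while Anthropic’s API has you mark cacheable blocks with cache_control, so check your provider’s documentation for whether it is automatic or opt-in. In every case the discount only applies when the system prompt and leading context are identical across calls, so keep stable content first and variable content last. Anthropic discounts cached prompt reads by up to 90%.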
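Because prefix caching depends on the start of the prompt being identical, where you put variable content matters. The sketch below compares two hypothetical prompt layouts and measures how many leading characters they share between two calls. Characters are only a proxy for tokens, and the prompt text and function names are made up for illustration.

```python
import os

# Hypothetical static instructions, repeated to simulate a long system prompt.
STATIC_INSTRUCTIONS = "You are a research analyst. Follow the report template exactly. " * 20


def prompt_timestamp_first(timestamp, question):
    """Variable content at the top: the shared prefix ends almost immediately."""
    return f"Current time: {timestamp}\n{STATIC_INSTRUCTIONS}\n{question}"


def prompt_timestamp_last(timestamp, question):
    """Stable content at the top: the whole instruction block stays shared."""
    return f"{STATIC_INSTRUCTIONS}\nCurrent time: {timestamp}\n{question}"


def shared_prefix_chars(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


t1, t2 = "2026-06-04T10:00", "2026-06-04T10:05"
question = "Summarize the latest findings."

first = shared_prefix_chars(prompt_timestamp_first(t1, question),
                            prompt_timestamp_first(t2, question))
last = shared_prefix_chars(prompt_timestamp_last(t1, question),
                           prompt_timestamp_last(t2, question))

print(f"timestamp first: {first} shared chars")
print(f"timestamp last:  {last} shared chars")
```

With the timestamp first, only a few dozen characters match; with it last, the entire instruction block is reusable.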

We covered fine-grained tool caching with custom TTL and cache_function in Part 2.

Context Window Management

The output of each task becomes input for the next. If tasks produce verbose outputs, your context window fills fast and costs grow with every agent in the chain.

Use output_pydantic to force structured, compact outputs:

from typing import List

from crewai import Task
from pydantic import BaseModel

class CompanyMetrics(BaseModel):
    revenue_growth_yoy: float
    gross_margin: float
    key_risks: List[str]
    top_competitors: List[str]

# Assumes financial_analyst is an Agent defined earlier
financial_task = Task(
    description="Extract key financial metrics for {company} from the provided data.",
    expected_output="Structured financial metrics object.",
    agent=financial_analyst,
    output_pydantic=CompanyMetrics   # forces structured output
)

Instead of a 600-word prose analysis, downstream agents receive a compact, structured object. This alone can cut context tokens by 50–70% for data-heavy workflows.
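You can get a feel for the size difference without calling a model. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model and compares a short prose paragraph with the same facts serialized as compact JSON. All values are invented placeholders, and character counts are only a rough proxy for tokens.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CompanyMetricsSketch:
    # Stand-in for the CompanyMetrics Pydantic model; same fields.
    revenue_growth_yoy: float
    gross_margin: float
    key_risks: List[str]
    top_competitors: List[str]


# Invented placeholder content, used only to compare sizes.
prose = (
    "Over the past year the company has shown strong momentum, with revenue "
    "growing by roughly twenty-eight percent year over year. Gross margin has "
    "remained healthy at around sixty-one percent. The main risks we identified "
    "are regulatory pressure and customer concentration. Its closest competitors "
    "are Competitor A and Competitor B, both of which continue to invest heavily."
)

structured = CompanyMetricsSketch(
    revenue_growth_yoy=0.28,
    gross_margin=0.61,
    key_risks=["regulatory pressure", "customer concentration"],
    top_competitors=["Competitor A", "Competitor B"],
)

# Compact separators drop the spaces json.dumps adds by default.
compact = json.dumps(asdict(structured), separators=(",", ":"))

print(f"prose:      {len(prose)} chars")
print(f"structured: {len(compact)} chars")
```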

When you can’t use structured output—creative or open-ended tasks, for example—write explicit length constraints in expected_output:

summary_task = Task(
    description="Summarize the research findings for {company}.",
    expected_output="Exactly 3 bullet points, each under 20 words. No prose.",
    agent=summarizer
)

Agents follow output format instructions more reliably than you’d expect, especially with temperature=0.0.
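"More reliably than you’d expect" is still not "always", so a cheap check on the output is worth having. The sketch below validates the constraint from the task above (exactly 3 bullets, each under 20 words). The check_bullets helper is hypothetical and assumes bullets start with "-", "*", or "•".

```python
def check_bullets(text, expected_count=3, max_words=20):
    """Return a list of problems; an empty list means the output fits the format."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bullets = [ln for ln in lines if ln.startswith(("-", "*", "•"))]
    problems = []
    if len(bullets) != len(lines):
        problems.append("contains non-bullet lines")
    if len(bullets) != expected_count:
        problems.append(f"expected {expected_count} bullets, got {len(bullets)}")
    for i, bullet in enumerate(bullets, start=1):
        words = bullet.lstrip("-*• ").split()
        if len(words) >= max_words:  # "under 20 words" means at most 19
            problems.append(f"bullet {i} has {len(words)} words")
    return problems


good = "- Revenue grew year over year\n- Margins held steady\n- Competition is intensifying"
bad = "Here is the summary:\n- One\n- Two"

print(check_bullets(good))  # []
print(check_bullets(bad))   # ['contains non-bullet lines', 'expected 3 bullets, got 2']
```

A failed check is a reasonable trigger for a retry or for tightening the expected_output wording.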

Trimming Agent Backstories

The backstory field is included in every prompt for that agent, for every LLM call it makes. A 200-token backstory on a tool-calling agent that makes 8 LLM calls per task adds 1,600 tokens of pure overhead per task.

Keep backstories short and role-specific:

# 180 tokens of backstory — unnecessary for a routing agent
data_router = Agent(
    role="Data Router",
    goal="Route incoming data to the correct processing pipeline",
    backstory="""You are an expert data routing specialist with fifteen years of experience
    in enterprise data engineering. You have worked at Fortune 500 companies and have deep
    expertise in ETL pipelines, data warehousing, stream processing, and real-time analytics.
    You approach every routing decision methodically...""",
    llm="openai/gpt-4o-mini"
)

# 12 tokens — does the same job
data_router = Agent(
    role="Data Router",
    goal="Route incoming data to the correct processing pipeline",
    backstory="You route data to the right pipeline based on schema and content type.",
    llm="openai/gpt-4o-mini"
)

The rule: backstory should contain information the agent actually needs to make decisions, not bio padding. Most agents run fine with one sentence.
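The overhead is simple multiplication, which makes it easy to estimate for your own agents. A minimal sketch, assuming the backstory is sent once per LLM call (the function name is mine):

```python
def backstory_overhead(backstory_tokens, llm_calls_per_task, tasks=1):
    """Prompt tokens spent re-sending the backstory across all LLM calls."""
    return backstory_tokens * llm_calls_per_task * tasks


verbose = backstory_overhead(200, 8)  # long backstory, 8 calls
lean = backstory_overhead(12, 8)      # one-sentence backstory, 8 calls

print(verbose, lean, verbose - lean)  # 1600 96 1504
```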

Measuring the Impact

Before any optimization, capture a baseline:

import time

def run_with_metrics(crew, inputs):
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    result = crew.kickoff(inputs=inputs)
    elapsed = time.perf_counter() - start

    metrics = crew.usage_metrics
    print(f"Duration:          {elapsed:.1f}s")
    print(f"Total tokens:      {metrics.total_tokens}")
    print(f"Prompt tokens:     {metrics.prompt_tokens}")
    print(f"Completion tokens: {metrics.completion_tokens}")
    print(f"API calls:         {metrics.successful_requests}")

    return result

Run this before and after each change. The numbers that matter most:

Prompt tokens → impact of context trimming, output_pydantic, shorter backstories
Duration → impact of async_execution and model tiering
Completion tokens → usually less controllable, but structured output helps here too

A common baseline for a three-agent research workflow: ~12,000 tokens, ~45 seconds, $0.08/run. After applying model tiering, async execution, and structured outputs: ~4,000 tokens, ~18 seconds, $0.015/run. Expect improvements in that range for similar workflows, but verify against your own baseline.
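Turning those before-and-after figures into ratios makes changes easier to compare across runs. A small sketch using the numbers above (the improvement helper is hypothetical):

```python
def improvement(before, after):
    """Ratio of before to after for each metric; 2.0 means half the original."""
    return {key: before[key] / after[key] for key in before}


baseline = {"tokens": 12_000, "seconds": 45, "usd": 0.08}
optimized = {"tokens": 4_000, "seconds": 18, "usd": 0.015}

ratios = improvement(baseline, optimized)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
# tokens: 3.0x
# seconds: 2.5x
# usd: 5.3x
```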

Common Pitfalls

Async tasks that share memory. If two agents both call memory.save() in parallel and your memory implementation isn’t thread-safe, you’ll get data corruption. Either sequence those tasks or give each agent its own memory namespace.

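Caching stale tool results. CrewAI’s tool cache returns the stored result whenever a tool is called again with the same arguments, so a tool that reads live data can hand an agent an outdated value. The custom tool caching from Part 2 is time-bounded; use it, or set cache=False, for tools that must be fresh. Provider-level prompt caching is a different mechanism: it reuses the processed prompt prefix to cut cost and latency, but the model still generates a new response on every call, so it does not serve old answers. The stale-response risk comes from any response cache you add in front of the LLM yourself. If you use one, give it a TTL or include a timestamp in the cache key so a “current analysis” is never a 3-day-old response.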

Wrong model for complex tasks. Sending a nuanced competitive analysis to GPT-3.5-turbo to save cost often results in shallow output that needs a re-run with GPT-4o anyway. Net cost: higher. Measure quality, not just token counts.

Over-parallelizing against rate limits. More async tasks doesn’t automatically mean faster. If you’re on a low-tier API key, five concurrent tasks throttling each other is slower than three sequential ones. Test your rate limit headroom before scaling async.
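If you manage concurrency yourself, for example when launching several crews at once, a semaphore is a simple way to stay under a rate limit. The sketch below is plain asyncio with a fake LLM call standing in for real work; run_limited and fake_llm_call are hypothetical names, and nothing here is CrewAI API.

```python
import asyncio


async def run_limited(jobs, max_concurrent):
    """Run coroutine factories with at most max_concurrent in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    state = {"running": 0, "peak": 0}

    async def guarded(job):
        async with semaphore:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
            try:
                return await job()
            finally:
                state["running"] -= 1

    # gather preserves the order of the input jobs in its results.
    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, state["peak"]


async def fake_llm_call(name):
    # Stand-in for a real request; sleeps briefly instead of calling an API.
    await asyncio.sleep(0.01)
    return f"{name}: done"


jobs = [lambda n=n: fake_llm_call(n) for n in ("a", "b", "c", "d", "e")]
results, peak = asyncio.run(run_limited(jobs, max_concurrent=2))

print(results)
print(f"peak concurrency: {peak}")
```

Start with a low limit, watch for throttling errors, and raise it only while duration keeps improving.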

What We Covered

The performance gap between a naive crew and an optimized one is usually 3–5x on cost and 2x on latency—without sacrificing output quality. The techniques that move the needle most:

Model tiering — largest impact on cost
output_pydantic — largest impact on context bloat
async_execution=True — largest impact on latency
Lean backstories — easy win that most people skip

Use usage_metrics to confirm changes are actually working. Token counts don’t lie.

Next up: Deploying CrewAI workflows to production with rate limit handling, retries, cost controls, and monitoring in a live environment.

This is part 5 of the CrewAI series. Previous: Part 1: Getting Started, Part 2: Building Custom Tools, Part 3: Memory and State Management, Part 4: Debugging Workflows