Your AI Agent Burns Up to 72% of Its Context Window Reading Tools It Will Never Use. Five Fixes That Work.

Every MCP server you load dumps its entire JSON schema into the model's context at session start, regardless of which tools the session will actually call. On a typical developer setup, that overhead occupies 31% to 72% of the available context window before the first prompt. The model then operates through the entire session carrying that weight. Multiple independent benchmarks have now quantified the cost, and five tested patterns eliminate most of it.

Key Takeaways

Three MCP servers loaded together (GitHub, Slack, Sentry) consumed 143,000 of 200,000 available Claude tokens in tool definitions alone before any conversation started, per Apideck testing cited by Nevo Systems.
Scalekit's 75-run benchmark found the GitHub CLI completed tasks using 1,365 tokens where the GitHub MCP server used 44,026, a 32x difference; CLI reliability was 100% versus MCP's 72%.
The problem compounds before the context fills: model response quality declines measurably around the 40% usage threshold, so a session starting with 60,000 tokens of MCP overhead begins at 30% of a 200K window and crosses into the degradation zone within its first few turns.
Atlassian's mcp-compressor wrapper, combined with CLI substitution for documented tools, reduces startup token cost by up to 95%, per developer testing.
MCP retains genuine advantages for services without CLI equivalents and for team-shared credential management; the problem is its use as the default for everything.

The core issue is not that MCP is broken. It was designed around a reasonable assumption: the model needs to know what tools exist before deciding which to use, so schemas load at startup. That assumption scales poorly. When developers stack five servers covering 40 or more tools, the model begins every session loaded with schemas for creating GitHub gists, querying Slack message history, running Sentry diagnostics, and managing pull request webhooks, while the actual task is to update a single issue title. The model carries that dead weight through every inference step for the entire session.

The Token Count Before You Type Anything: What Each Server Actually Costs

A single moderately configured MCP server already imposes a meaningful startup cost. The Linear MCP server exposes 42 tool definitions and consumes approximately 12,800 tokens before any work begins, per Quandri's engineering analysis. A combined four-server stack reaches approximately 21,077 tokens. The GitHub MCP server alone, with 93 tools, burns roughly 55,000 tokens loading definitions into context, per Mario Giancini's analysis. That is about 43% of GPT-4o's 128K context window occupied by tool descriptions before a single question is asked.

Tyler Folkman measured his own setup and found 31% of his Claude context window consumed before any task: 23 plugins, 8 skills, and 5 MCP servers loaded by default, all injecting schemas at startup, per his Substack analysis. The 31% figure represents a comparatively lean configuration. Developer reports compiled by Giancini document setups with 66,000 or more tokens consumed before the first prompt. At the production extreme, Apideck found that loading GitHub, Slack, and Sentry together consumed 143,000 of 200,000 available Claude tokens, 72% of the entire working memory, occupied before the model could respond to anything.
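The percentages in this section are simple ratios of startup overhead to window size. As a minimal sketch, the helper below (context_share is a hypothetical name, and 128K is assumed to mean 128,000 tokens) recomputes two of the quoted figures so you can substitute numbers from your own setup.

```python
def context_share(overhead_tokens: int, window_tokens: int) -> float:
    """Return the fraction of the context window used by startup overhead."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return overhead_tokens / window_tokens


# Figures quoted in the article; replace them with your own measurements.
examples = {
    "GitHub + Slack + Sentry on a 200,000-token window": (143_000, 200_000),
    "GitHub MCP server on a 128,000-token window": (55_000, 128_000),
}

for label, (overhead, window) in examples.items():
    print(f"{label}: {context_share(overhead, window):.1%}")
# prints 71.5% for the first entry and 43.0% for the second
```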

The overhead is compounded by routing cost. Every tool call requires the model to rank available tools and select the right one. With 40 tools loaded, that ranking process draws on context and inference that would otherwise go toward the task itself. Near-duplicate tools, which appear when multiple servers expose overlapping functionality, push the cost higher by forcing the model to resolve ambiguity before acting. The schemas do not leave context once loaded; they remain for the entire session.

The Benchmarks: 32x Token Overhead, 9x Slower Initialization, 72% Reliability

Scalekit's public benchmark ran 75 operations comparing the GitHub MCP server against the GitHub CLI (gh) for representative agentic tasks. For "what language is this repository written in?", the CLI agent required 1,365 tokens; the MCP agent required 44,026. The 32x difference is driven almost entirely by the 43 tool schemas loaded regardless of which tool the task needs. At 10,000 monthly operations, this translates to approximately $3.20 in CLI token costs versus $55.20 via MCP, per Scalekit's analysis. CLI reliability across all 75 runs was 100%; MCP reliability was 72%, with failures attributed to initialization errors and mid-session crashes rather than protocol logic.
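The token multiplier can be checked directly from the two per-task figures. The sketch below uses only the Scalekit numbers quoted above; the function names are hypothetical, and it deliberately stops at token volume rather than dollars, because the per-token pricing behind the dollar estimates is not stated here.

```python
CLI_TOKENS_PER_TASK = 1_365    # Scalekit figure quoted above
MCP_TOKENS_PER_TASK = 44_026   # Scalekit figure quoted above
MONTHLY_OPERATIONS = 10_000    # volume used in the cost comparison


def token_multiplier(mcp_tokens: int, cli_tokens: int) -> float:
    """How many times more tokens the MCP path uses for the same task."""
    return mcp_tokens / cli_tokens


def monthly_tokens(tokens_per_task: int, operations: int) -> int:
    """Total tokens consumed at a given monthly operation count."""
    return tokens_per_task * operations


ratio = token_multiplier(MCP_TOKENS_PER_TASK, CLI_TOKENS_PER_TASK)
print(f"MCP uses {ratio:.1f}x the tokens of the CLI per task")
print(f"CLI monthly tokens: {monthly_tokens(CLI_TOKENS_PER_TASK, MONTHLY_OPERATIONS):,}")
print(f"MCP monthly tokens: {monthly_tokens(MCP_TOKENS_PER_TASK, MONTHLY_OPERATIONS):,}")
```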

Quandri's team measured call latency directly: MCP was 3x slower per call and 9.4x slower on first call, including initialization overhead. The Linear server's per-task token cost was approximately 65x the equivalent CLI command. Combining the token multiplier with the latency multiplier, workflows that call a tool dozens of times per session accumulate costs that are structurally different from what most developers estimated when they configured their MCP setup.

The "dumb zone" is the informal name for what happens as context approaches capacity. Measurable degradation in response quality begins around 40% usage, per Giancini's analysis, before the window fills. A session that starts with 60,000 tokens of MCP overhead is already at 30% of a 200K context before the first exchange, and typically reaches 40% within the first few turns. The model is not running out of space; it is operating in a noisier signal environment where recent task context competes for attention with static schema definitions for operations the session will never invoke.
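To see how quickly a session reaches the 40% mark, a rough estimate is enough. The sketch below assumes each turn adds a fixed number of tokens (5,000 here, an illustrative guess rather than a measured value); turns_until_threshold is a hypothetical helper, and it uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
def turns_until_threshold(
    window_tokens: int,
    overhead_tokens: int,
    tokens_per_turn: int,
    threshold_percent: int = 40,
) -> int:
    """Number of turns after which usage reaches or exceeds the threshold.

    Returns 0 when the startup overhead alone already meets the threshold.
    """
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    threshold_tokens = window_tokens * threshold_percent // 100
    budget = threshold_tokens - overhead_tokens
    if budget <= 0:
        return 0
    # Ceiling division without floats.
    return -(-budget // tokens_per_turn)


# 60,000 tokens of MCP overhead on a 200K window, assumed 5,000 tokens per turn.
print(turns_until_threshold(200_000, 60_000, 5_000))  # 4
# The same session with no MCP overhead at all.
print(turns_until_threshold(200_000, 0, 5_000))       # 16
```

Under that assumption the overhead-laden session has a quarter of the headroom of a clean one before it reaches the threshold.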

Why Perplexity's Decision Matters, and What It Actually Means

Perplexity's CTO Denis Yarats announced at the Ask 2026 conference that Perplexity is moving away from MCP internally for production systems, in favor of their proprietary Agent API optimized for latency and cost at scale, per Nevo Systems' coverage. The detail worth noting: Perplexity continues maintaining an MCP server for external developers. This is not an abandonment of the protocol as a product surface. It is a decision that production inference systems, where token cost and call latency are metered at scale, should not route through MCP's startup overhead when proprietary alternatives are available.

The broader trajectory is clarified by what Cloudflare and Anthropic have each published. Cloudflare's Code Mode approach achieves a 99.9% token reduction over standard MCP loading; Anthropic's code execution pattern with filesystem-based tool discovery achieves 98.7% reduction while preserving MCP compatibility. Neither represents an abandonment of the protocol. Both represent the same conclusion: the default loading behavior is the expensive part, not the underlying design.

Five Fixes That Work

The highest-return change is replacing MCP with the CLI for any tool that already has one. The GitHub CLI (gh), Linear's CLI, Jira's CLI, and Sentry's CLI are present in Claude's training data through man pages and developer documentation, which means the model knows their syntax without schema injection. Substituting CLI for MCP on these tools reduces the per-session token cost of their definitions to near zero; the only context consumed is the output of commands actually run. Quandri's analysis defines the residual justified cases for MCP: web-only SaaS platforms without CLIs, real-time bidirectional communication where the server must push state changes to the agent, and team-shared credential management where individual CLI authentication is impractical.
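As an illustration of the CLI route, the repository-language question from the benchmark can be answered with a single gh invocation. This is a minimal sketch, assuming gh is installed and authenticated; the two function names are hypothetical, and the shape of the JSON response (an object with a name key under primaryLanguage) is an assumption to verify against your gh version.

```python
import json
import shutil
import subprocess
from typing import List, Optional


def build_repo_language_command(repo: str) -> List[str]:
    """Build the gh command that asks for a repository's primary language."""
    return ["gh", "repo", "view", repo, "--json", "primaryLanguage"]


def primary_language(repo: str) -> Optional[str]:
    """Run gh and return the primary language name, or None if unset."""
    if shutil.which("gh") is None:
        raise RuntimeError("GitHub CLI (gh) is not installed or not on PATH")
    result = subprocess.run(
        build_repo_language_command(repo),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    data = json.loads(result.stdout)
    language = data.get("primaryLanguage")  # assumed shape: {"name": "..."}
    return language["name"] if language else None
```

Calling primary_language with an OWNER/REPO string adds only the command and its short JSON output to context; no tool schema is loaded beforehand.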

The second fix is progressive schema loading. The mcp-compressor tool from Atlassian wraps existing MCP servers and exposes a compact facade: brief tool names without full parameter schemas. The model selects a tool from that compact list; only then does the full schema for that specific tool load into context. This inverts the loading sequence from "all schemas upfront" to "schemas on demand," and reduces startup overhead by up to 95% on heavy configurations, per Folkman's testing. The change requires no modification to the underlying MCP server.
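The inversion from eager to on-demand loading can be sketched in a few lines. The class below is a hypothetical illustration of the pattern, not mcp-compressor's actual API; the two tool schemas are invented for the example, and serialized character count stands in as a crude proxy for tokens.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolFacade:
    """Expose compact tool names first; load full schemas only on selection."""

    full_schemas: Dict[str, Dict[str, Any]]
    loaded: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def compact_listing(self) -> List[str]:
        # What enters context at startup: one short line per tool.
        return [
            f"{name}: {schema['description']}"
            for name, schema in self.full_schemas.items()
        ]

    def load_schema(self, name: str) -> Dict[str, Any]:
        # What enters context after the model picks a tool.
        if name not in self.full_schemas:
            raise KeyError(f"unknown tool: {name}")
        self.loaded[name] = self.full_schemas[name]
        return self.loaded[name]


# Invented schemas for illustration only.
schemas = {
    "update_issue": {
        "description": "Update an issue",
        "parameters": {"id": {"type": "string"}, "title": {"type": "string"}},
    },
    "create_gist": {
        "description": "Create a gist",
        "parameters": {"files": {"type": "object"}, "public": {"type": "boolean"}},
    },
}

facade = ToolFacade(schemas)
eager_size = len(json.dumps(schemas))
startup_size = len(json.dumps(facade.compact_listing()))
facade.load_schema("update_issue")
print(f"eager: {eager_size} chars, facade startup: {startup_size} chars")
```

With only two small tools the saving is modest; the gap widens as the number of tools and the size of each parameter schema grow.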

The third fix is project-scoped server configuration. Rather than loading the same global MCP server set for every project, .mcp.json project files define which servers are active for a given repository. A backend service that uses a database integration does not need Figma, Notion, or Slack servers loaded. Claude Code's /mcp command disables individual servers mid-session and reclaims their token allocation immediately. Running third-party servers in Docker containers additionally enforces network egress limits, isolating their blast radius if a server behaves unexpectedly.
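A project-scoped file for a backend service might declare a single database server and nothing else. The sketch below generates such a file from Python; the top-level mcpServers key follows the .mcp.json layout Claude Code reads, while the server name, package name, and connection string are placeholders invented for this example.

```python
import json
from typing import Any, Dict


def build_project_mcp_config() -> Dict[str, Any]:
    """Return a project-scoped config with only the server this repo needs."""
    return {
        "mcpServers": {
            "project-db": {  # hypothetical server name
                "command": "npx",
                "args": ["-y", "example-db-mcp-server"],  # placeholder package
                "env": {"DATABASE_URL": "postgres://localhost:5432/app"},  # placeholder
            }
        }
    }


config_text = json.dumps(build_project_mcp_config(), indent=2)
print(config_text)
# To use it, save config_text as .mcp.json at the repository root.
```

Figma, Notion, and Slack servers simply do not appear in this file, so their schemas never load for this repository.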

The fourth fix is proactive compaction. Claude Code's /compact command summarizes conversation history and frees the context it occupied. The common practice of waiting until context nears capacity produces weaker summaries than compacting at 60% usage, per Claude Code context management guidance. Compacting at a natural task boundary, after completing a discrete subtask, gives the compaction a coherent scope and preserves the reasoning that matters while dropping the scaffolding that does not. The /context command provides a live breakdown of token allocation across system prompt, MCP tools, memory files, skills, and conversation history, making the overhead visible before it becomes a problem.

The fifth fix is task decomposition. A single long session accumulating context across many subtasks can be replaced by shorter sessions with defined endpoints. A brief handoff note (a .md file capturing decisions made, patterns established, and next steps) preserves continuity across session boundaries without requiring the prior conversation to remain loaded in context. This is particularly effective for large refactors, where architectural context is critical at the start of a task and irrelevant once edits become mechanical.
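A handoff note needs no special tooling. As one possible shape, the sketch below renders the three parts named above into a small Markdown document; the function name, section headings, and sample entries are all hypothetical choices, not a prescribed format.

```python
from typing import List


def render_handoff_note(
    title: str,
    decisions: List[str],
    patterns: List[str],
    next_steps: List[str],
) -> str:
    """Render a short Markdown handoff note for the next session."""

    def section(heading: str, items: List[str]) -> List[str]:
        bullets = [f"- {item}" for item in items]
        if not bullets:
            bullets = ["- (none)"]
        return [f"## {heading}", *bullets, ""]

    lines = [f"# {title}", ""]
    lines += section("Decisions made", decisions)
    lines += section("Patterns established", patterns)
    lines += section("Next steps", next_steps)
    return "\n".join(lines)


note = render_handoff_note(
    "Handoff: payment module refactor",
    decisions=["Keep the public interface unchanged"],
    patterns=["One adapter class per provider"],
    next_steps=["Apply the adapter pattern to the remaining providers"],
)
print(note)
```

The next session reads this file at startup instead of inheriting the full conversation history.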

Background: Why MCP's Default Behavior Became a Problem at Scale

MCP launched in November 2024 as a standardization effort for how AI agents connect to external tools. Over 13,000 public servers launched on GitHub within the first year. The protocol's design assumption was reasonable for the use case it was designed for: a small set of well-defined tools for a specific integration. What developers have actually done is aggregate general-purpose servers covering entire platforms, loading all of them simultaneously for all projects. The protocol did not anticipate that pattern, and its eager schema loading is not configurable at the protocol level without wrapping tools like mcp-compressor.

The practical summary: MCP works well as a targeted integration layer for specific, justified cases. It works poorly as a default "load everything" infrastructure layer for every development session. The five patterns described above do not require abandoning the protocol. They require being deliberate about when the protocol is the right choice, scoping which servers load for which projects, and loading schemas progressively rather than eagerly. The token budget recovered by those changes is available for the work the session is actually supposed to do.