
AI Tool Development Is Fundamentally Different from Building Conventional Services

I'm Building Tools That AI Agents Call

I'm building MCP tools that allow AI agents to execute on-chain transactions — reading chain state, bridging, swapping, placing exchange orders — all tools where the AI makes decisions and executes them directly.

At first, I approached it like building a conventional API. Design the endpoints, validate inputs, format responses. But the more I built, the more I realized the old approach wasn't working. Almost everything was different, from design philosophy to testing methodology.


1. "Who Is Calling" Changes Everything

Conventional APIs are called by humans or other services. Both are deterministic. When a frontend developer calls POST /orders, the server executes a fixed sequence: check balance → approve → sign → submit — in that exact order, because the developer hard-coded it.

With AI tools, the caller is an AI agent. It's non-deterministic. Even given the same goal, the AI may combine tools in a different order every time. If the balance is low, it might bridge first. If gas is expensive, it might route through a different chain. The AI trying combinations the developer never anticipated is the norm, not the exception.

I really felt this difference when I restructured things. Initially, I had venue.trade_polymarket with every step hard-coded inside it. It was Polymarket-specific, and adding another exchange meant building venue.trade_xxx from scratch.

I broke it down into composable blocks: chain.execute, defi.bridge, defi.approve. Once I did, the AI started freely combining these blocks to handle new protocols without any code changes.
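A minimal sketch of that shift (tool bodies are stubs, names mirror the ones above): instead of one monolithic venue tool, the action space is a flat registry of small blocks the agent can combine in any order.

```python
# Hypothetical sketch: composable building blocks with no hard-coded call order.
def chain_execute(tx: dict) -> dict:
    # Stub: submit a prepared transaction to the chain.
    return {"status": "submitted", "tx_hash": "0xstub"}

def defi_approve(token: str, spender: str, amount: float) -> dict:
    # Stub: grant a spending allowance.
    return {"status": "confirmed", "token": token, "spender": spender}

def defi_bridge(token: str, from_chain: str, to_chain: str, amount: float) -> dict:
    # Stub: move funds across chains.
    return {"status": "bridged", "amount": amount, "to_chain": to_chain}

# The registry IS the AI's action space: the sequencing lives in the agent,
# not in a venue-specific handler.
TOOLS = {
    "chain.execute": chain_execute,
    "defi.approve": defi_approve,
    "defi.bridge": defi_bridge,
}
```

Adding a new venue then means (at most) adding one block, not rebuilding a pipeline.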

The core shift is this: from "building a service" to "designing the AI's action space."

2. The Schema Is the Prompt

In conventional API development, parameter names don't matter much. Whether it's sc or source_chain, you just read the Swagger docs and map it in code. The developer calling it can read documentation.

With AI tools, this becomes an entirely different problem. The AI's only basis for selecting a tool and filling in parameters is the tool's name, description, and parameter descriptions. It doesn't reference a separate Swagger doc. The schema itself is the only documentation the AI reads.

That's why changing a parameter from amt to amount_usd can shift the AI's success rate. Writing "Cross-chain token bridge. Handles gas, approve, deposit automatically" in the description versus "Bridge tool" makes them completely different tools from the AI's perspective.

Changing a tool schema isn't code refactoring — it's prompt engineering. The thing you A/B test isn't code, it's the description text.
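To make this concrete, here is a sketch of what "schema as prompt" looks like in practice (the parameter names and description text are illustrative, not a real production schema):

```python
# Sketch: the schema text is the only "documentation" the model sees,
# so every name and description is written like a prompt.
bridge_tool = {
    "name": "defi.bridge",
    # Behavior-describing text, not just "Bridge tool".
    "description": (
        "Cross-chain token bridge. Handles gas, approve, and deposit "
        "automatically. Use when funds are on a different chain than needed."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            # "amount_usd" instead of "amt": the name itself disambiguates units.
            "amount_usd": {
                "type": "number",
                "description": "Amount to bridge, denominated in USD.",
            },
            "source_chain": {
                "type": "string",
                "description": "Chain the funds are on now, e.g. 'polygon'.",
            },
            "dest_chain": {
                "type": "string",
                "description": "Chain the funds should end up on.",
            },
        },
        "required": ["amount_usd", "source_chain", "dest_chain"],
    },
}
```

The A/B test target is the string content of `description` fields, with the code behind the tool held constant.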

3. Errors Are "Decision Data," Not "Failure Notifications"

In conventional services, error handling is relatively straightforward. HTTP 400 shows "Insufficient balance" in the frontend; 500 shows "Server error." The consumer of the error message is a human, and the human decides what to do next.

In AI tools, the consumer of errors is the AI. The AI reads the error message to decide which tool to call next. The quality of the error message directly determines the quality of the AI's judgment.

I ran into this firsthand. A venue tool threw an insufficient-balance error, but it was returned as an MCP protocol error (JSON-RPC error). All the AI received was "Tool call failed." The actual message — "not enough balance on Polygon" — was swallowed when it got wrapped in a protocol error.

What if the AI had seen "not enough balance on Polygon"? It would have decided to use defi.bridge to pull USDC from another chain. But seeing only "Tool call failed," it just gave up.

The fix was simple: return business errors via the isError flag in the tool result instead of as protocol errors. From that point on, the AI started reading error content and responding autonomously.
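A sketch of that fix (result shape follows the MCP tool-result convention of a `content` list plus an `isError` flag; the trade function is a stub):

```python
# Business failures go into the tool *result* with isError=True, instead of
# being raised as JSON-RPC protocol errors that collapse into "Tool call failed".
def tool_result(text: str, is_error: bool = False) -> dict:
    return {"content": [{"type": "text", "text": text}], "isError": is_error}

def trade(balance_usdc: float, cost_usdc: float) -> dict:
    if balance_usdc < cost_usdc:
        # The AI reads this text and can decide to call defi.bridge next.
        return tool_result(
            f"not enough balance on Polygon: have {balance_usdc} USDC, "
            f"need {cost_usdc} USDC. Funds can be bridged from another chain.",
            is_error=True,
        )
    return tool_result("order submitted")

result = trade(balance_usdc=4.2, cost_usdc=10.0)
```

The protocol-error channel stays reserved for genuine protocol failures (malformed requests, unknown tools); everything the AI should reason about travels in-band.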

4. Outputs Need to Carry "Intermediate Context"

In conventional services, a response just needs to contain the final result. { "status": "success", "tx_hash": "0x..." } — the caller confirms success and that's it. What steps were taken along the way is an internal implementation detail, and hiding it is the right call.

AI tools are different. For the AI to reason "approve succeeded, but order submission failed," intermediate steps must be visible in the output.

```json
{
  "status": "submitted",
  "steps": [
    { "action": "balance_check", "result": "USDC: 4.2" },
    { "action": "approve", "status": "confirmed" },
    { "action": "order_submit", "status": "submitted" }
  ]
}
```

Without this steps array, the AI only sees the final state and has no basis to infer what went wrong along the way. But packing in too much information wastes tokens and causes the AI to miss what matters. The design challenge is to include only what the AI needs, concisely.

5. The Criteria for Abstraction Level Are Different

Conventional services divide APIs based on human cognitive load. REST is resource-centric (/users, /orders), broken into units easy for people to understand.

AI tools are guided by AI reasoning efficiency. This is a subtle balance:

  • Too low-level and the AI has to take too many steps. Each step burns tokens, and the more steps, the higher the chance of mistakes.
  • Too high-level and the AI can't handle edge cases flexibly. If "bridge then approve then trade" is bundled into one, you can't handle "just bridge" as a scenario.

To find this balance, I split things into Layers 0–3:

| Layer | Exposed to AI? | Reason |
| --- | --- | --- |
| L0 (signing, nonce, gas) | No | Mechanical details the AI doesn't need to know; letting it touch them adds only risk |
| L1 (chain.execute) | Yes | General-purpose building blocks. When the AI encounters a new protocol, it composes from these |
| L2 (defi.bridge) | Primarily used | The right abstraction for the AI's typical decision units; most tasks are solved at this level |
| L3 (polymarket.trade) | Convenience | Shorthand for frequently used patterns; also saves tokens |

This is a design judgment that didn't exist before — defining the abstraction level based on "how much cognitive load does this place on the AI using it?"
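One way to sketch that judgment in code (layer assignments and tool names are illustrative): exposure becomes a filter over a layer map rather than an ad-hoc decision per tool.

```python
# Hypothetical layer map: 0 = internal mechanics, 1-3 = visible to the agent.
LAYER = {
    "sign_raw": 0, "manage_nonce": 0,      # L0: never exposed
    "chain.execute": 1,                     # L1: general-purpose block
    "defi.bridge": 2, "defi.approve": 2,    # L2: typical decision units
    "polymarket.trade": 3,                  # L3: convenience shorthand
}

def exposed_tools(layer_map: dict) -> list:
    # L0 is mechanical and risky for the AI to touch; L1+ forms the action space.
    return sorted(name for name, layer in layer_map.items() if layer >= 1)
```

Keeping the cutoff in one place also makes it cheap to experiment with exposing or hiding a layer and observing how the agent's behavior changes.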

6. Guardrails, Not Validation

Input validation in conventional services checks whether inputs are correct. if amount <= 0, reject. Wrong address format, throw an error. The responsibility for sending correct inputs lies with the caller; the server just verifies.

AI tools need a different layer — guardrails that check whether the AI's actions are safe.

Here's the distinction: the AI calls chain.execute with syntactically perfect parameters. The address format is valid, the amount is positive. Input validation passes. But the request is sending $10,000 to an unverified contract.

So inside chain.execute, it automatically runs chain.simulate to detect reverts in advance, asks the AI for confirmation if the amount exceeds a threshold, and evaluates contract risk with defi.analyze. This isn't input validation — it's behavioral safety validation.
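A sketch of that guardrail layer (the helper names `simulate` and `contract_risk` stand in for chain.simulate and defi.analyze; the threshold is illustrative):

```python
# Behavioral safety checks: the input is syntactically valid, so the checks
# look at what the action *does*, not how it is spelled.
CONFIRM_THRESHOLD_USD = 1_000.0

def simulate(tx: dict) -> bool:
    # Stand-in for chain.simulate: dry-run the tx, report whether it reverts.
    return tx.get("will_revert", False)

def contract_risk(address: str, verified: set) -> str:
    # Stand-in for defi.analyze: crude risk label from a verified-contract set.
    return "low" if address in verified else "high"

def guardrail_check(tx: dict, verified: set) -> str:
    if simulate(tx):
        return "block: simulation reverts"
    if contract_risk(tx["to"], verified) == "high":
        return "block: unverified contract"
    if tx["amount_usd"] > CONFIRM_THRESHOLD_USD:
        return "ask: amount exceeds confirmation threshold"
    return "allow"
```

The three outcomes matter: "block" stops the call outright, "ask" bounces the decision back to the AI (and ultimately the user), and only "allow" proceeds.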

7. Idempotency: From Best Practice to Survival Requirement

In conventional services, idempotency was "good practice." PUT is idempotent, POST may not be, and developers controlled duplicate calls in code.

In AI tools, idempotency is a survival condition. The AI can inadvertently call the same tool twice:

  • Response was long and got cut off → retry
  • Mistakenly treated success as an error → retried something that had already succeeded
  • Context got compressed and forgot the previous call → sent the same request again

In tools that move money, this is catastrophic. If defi.bridge runs twice, twice the intended amount moves.

In conventional code, the client (the calling code) was responsible for preventing duplicates. In AI tools, you can't delegate that responsibility to the AI. The tool itself must require an idempotency key, or detect "an identical bridge request was already executed within the last 30 seconds" and defend itself.
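A minimal sketch of that self-defense (the fingerprinting scheme and 30-second window are illustrative choices, not a prescribed design):

```python
import hashlib
import time

# The tool fingerprints each request and rejects an identical one
# seen within a short deduplication window.
DEDUP_WINDOW_S = 30.0
_recent = {}  # fingerprint -> timestamp of last execution

def _fingerprint(params: dict) -> str:
    # Order-independent hash of the request parameters.
    return hashlib.sha256(repr(sorted(params.items())).encode()).hexdigest()

def bridge(params: dict, now=None) -> str:
    now = time.monotonic() if now is None else now
    key = _fingerprint(params)
    last = _recent.get(key)
    if last is not None and now - last < DEDUP_WINDOW_S:
        # Idempotent reply: tell the AI the transfer already happened,
        # instead of silently moving the money twice.
        return "duplicate: identical bridge already executed"
    _recent[key] = now
    return "bridged"
```

The duplicate response is itself "decision data": the AI learns the first call succeeded and moves on, rather than retrying harder.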

8. State Management: The Context Window Is the State Store

In conventional services, the server manages state — sessions, DB, Redis cache. The client just sends stateless requests.

In AI tools, the tools themselves are stateless. But someone has to hold state — and that's the AI's context window.

Concretely: the first call has prepare_bridge return { approval_tx: "0x...", swap_tx: "0x..." }. The second call has the AI pull approval_tx and pass it to chain.execute(approval_tx). The third call uses swap_tx. This flow works because the AI holds previous responses in its context.

That's why tool output design matters. "Values needed for the next step" must be explicitly included in the output. In conventional development, you just stash it in a server session. In AI tools, the output itself is the state-passing mechanism. If something is left out of the output, the AI has no way to perform the next step.
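The three-call flow above can be sketched like this (tool bodies are stubs; the local `context` variable plays the role of the AI's context window):

```python
# Each tool is stateless; values needed later are put in the output
# explicitly so the agent can carry them forward.
def prepare_bridge(amount_usd: float) -> dict:
    # Both prepared transactions are returned: leaving one out of the
    # output would make the corresponding next step impossible.
    return {"approval_tx": "0xaaa...", "swap_tx": "0xsss..."}

def chain_execute(tx: str) -> dict:
    return {"status": "confirmed", "tx": tx}

# Call 1: prepare. The output is the only state-passing mechanism.
context = prepare_bridge(amount_usd=100.0)
# Call 2: the agent pulls approval_tx out of call 1's output.
step2 = chain_execute(context["approval_tx"])
# Call 3: same for swap_tx.
step3 = chain_execute(context["swap_tx"])
```

Nothing server-side remembers the prepared transactions; if `prepare_bridge` dropped `swap_tx` from its output, call 3 could never happen.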

9. Token Economics: A New Cost Dimension

The cost of conventional services is server compute + DB I/O + network. Whether an API response is 1KB or 100KB, the cost difference is negligible.

In AI tools, the cost is token count × model price. And tokens are consumed by both tool definitions (schemas) and outputs.

Do the math: if 10 tools have schemas averaging 500 tokens each, tool definitions alone burn 5,000 tokens in every request's context. Add accumulated output from each call, and a single task consumes 12,000+ tokens as "infrastructure cost."

This is a completely different cost structure from conventional development where response size wasn't a concern. From the design stage:

  • Minimize tool count. 10 general-purpose tools save more schema tokens than 30 specialized tools.
  • Output only what the AI needs. Human-friendly messages are token waste.
  • Layer 3 shorthand tools also serve token efficiency. Collapsing 7 steps into 1 saves the tokens from 6 intermediate outputs.
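The arithmetic above can be written as a back-of-the-envelope estimator (all figures illustrative; real token counts depend on the tokenizer and model):

```python
# Rough per-task context overhead: schema tokens are paid on every request,
# tool outputs accumulate with each call.
def context_overhead(n_tools: int, avg_schema_tokens: int,
                     n_calls: int, avg_output_tokens: int) -> int:
    schema_cost = n_tools * avg_schema_tokens   # fixed cost per request
    output_cost = n_calls * avg_output_tokens   # grows with each tool call
    return schema_cost + output_cost

# 10 tools x 500 tokens = 5,000 for definitions alone; add 7 calls
# averaging ~1,000 output tokens and a single task carries 12,000 tokens.
overhead = context_overhead(n_tools=10, avg_schema_tokens=500,
                            n_calls=7, avg_output_tokens=1000)
```

Running the same formula with 30 specialized tools instead of 10 general-purpose ones shows where the schema-token savings come from.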

10. Testing: Determinism vs. Non-Determinism

This is the most fundamental difference, in my view.

Conventional E2E tests are deterministic. Click login button → enter ID/PW → verify dashboard renders → assert(title === "Dashboard"). Same input yields the same result. Every run executes the same assertions in the same order.

AI tool E2E tests are non-deterministic. Give the AI a goal — "buy $5 of ETH" — and you can't know in advance which tools it will call or in what order. If the balance is sufficient, it might buy directly. If not, it might bridge first or pull funds from another chain. The path may vary each time, but there's only one thing to verify: "Did an ETH position end up being created?"

Summarizing the methodological differences:

| Type | Conventional Service | AI Tools |
| --- | --- | --- |
| Unit test | Function input → output | Tool input → output (same here) |
| Integration test | Fixed API call sequence | Whether tool combinations are valid (order-agnostic) |
| E2E | Reproduce UI flow | Give a goal → verify final state |
| Error test | Check error codes | Whether the AI responds correctly upon seeing an error |

The last row is the key. "Whether the AI responds correctly upon seeing an error" is a test category that didn't exist before. When signing fails, does the AI choose "re-issue credentials" instead of "retry"? The quality of the tool's error message is what determines whether the test passes.
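A goal-based E2E check can be sketched like this (the "agent" is a stub that nondeterministically takes one of the valid paths; only the final state is asserted, never the path):

```python
import random

def run_agent(goal: str, state: dict, seed: int) -> dict:
    # Stub agent: if Polygon balance is short, it either bridges $5 in
    # or pulls funds from another chain; the path varies by seed.
    rng = random.Random(seed)
    state = dict(state)
    if state["usdc_polygon"] < 5:
        if rng.random() < 0.5:
            state["usdc_polygon"] += 5                          # defi.bridge
        else:
            state["usdc_polygon"] += state.pop("usdc_arbitrum", 0)  # pull funds
    state["eth_position_usd"] = 5   # buy $5 of ETH
    state["usdc_polygon"] -= 5
    return state

# Path-agnostic assertion: did an ETH position end up being created?
for seed in range(5):
    final = run_agent("buy $5 of ETH", {"usdc_polygon": 2, "usdc_arbitrum": 10}, seed)
    assert final["eth_position_usd"] == 5
```

The test deliberately loops over seeds: any valid path must reach the same final state, which is exactly the property a non-deterministic caller forces you to test.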

11. Design Methodology: The Observe → Adjust Loop

Conventional service development is top-down. Requirements → API design → DB schema → implementation → tests. Given "the user creates an order," you get POST /orders → an orders table → the handler implementation.

AI tool development is outside-in. You give the AI a goal and observe. You watch where it gets stuck, which tools are missing, which error messages are unhelpful — then modify the tools and observe again.

In practice, I ran this loop repeatedly:

  1. AI couldn't execute trades → there was a bug in the signing logic
  2. AI couldn't see errors → the error delivery mechanism was wrong
  3. AI couldn't remember previous results → the output structure was missing the values needed for the next step
  4. AI was spending time on low-level work → an abstraction layer was needed

I didn't write design docs first and then implement. I could only discover the right abstraction level by watching the AI actually use the tools. It's not "design → implement" — it's the repeated loop of "observe → adjust."

12. Extensibility: Infinite Expansion Through Composition

Adding a new exchange to a conventional service means implementing a dedicated adapter, building a dedicated API, writing dedicated tests. n exchanges = n implementations.

AI tools work differently. With just 4 L1 tools and 4 L2 tools built well, the AI composes them to handle virtually unlimited venues. The AI figures out on its own that "this DEX is on chain 137 and requires approve + swap calls," then composes existing tools to execute it. No new code needed.

The system expands not by adding more tools, but by increasing the combinatorial possibilities. It's the AI-era version of the Unix philosophy — "connect small tools with pipes."


In One Sentence

Conventional service development: hard-code human intent into code, then test whether the code is correct. AI tool development: design the AI's action space, then test whether the AI can make correct judgments within that space.

It's a shift from "control" to "environment design."

© 2025 Forrest Kim. All rights reserved.

contact: humblefirm@gmail.com