I have a task management system. I built it myself - it's called Task Board. And one evening in early 2026, I had what felt like a brilliant idea: what if Claude Code could automatically pick up bugs from Task Board, investigate the codebase, figure out what's wrong, and either fix it or post its findings back?
Six weeks later, I can report that this idea is approximately 60% brilliant and 40% "what was I thinking".
The Vision
The concept is straightforward. A bug gets logged in Task Board. An AI agent picks it up automatically. It reads the bug description, looks at the relevant code, checks the error logs in Application Insights, queries the MongoDB database if needed, and either produces a fix or posts back a detailed triage report: "Here's what I found, here's what I think the problem is, here's what I need from a human to proceed."
This isn't science fiction. All the individual pieces exist. Claude Code can read and modify files. MCP (Model Context Protocol) servers can connect it to external systems like Task Board and MongoDB. The agent architecture supports multi-turn interactions. In theory, you just wire it all together.
In practice, "just wire it all together" is the software development equivalent of "just climb Everest".
The Architecture
I designed a full state machine for the triage flow. The agent starts in an INVESTIGATING state, reads the bug details, explores the relevant code, and transitions through states: NEEDS_INFO (when it needs to ask a human something), DIAGNOSING (when it's formed a hypothesis), FIXING (when it's attempting a code change), and COMPLETE (when it's done).
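To make the flow concrete, here's a minimal sketch of those states and the transitions between them. It's in TypeScript, and the names are illustrative rather than lifted from my actual code:

```typescript
// The triage states described above; transitions are the only moves the agent may make.
type TriageState =
  | "INVESTIGATING"
  | "NEEDS_INFO"
  | "DIAGNOSING"
  | "FIXING"
  | "COMPLETE";

const allowedTransitions: Record<TriageState, TriageState[]> = {
  INVESTIGATING: ["NEEDS_INFO", "DIAGNOSING"],
  NEEDS_INFO: ["INVESTIGATING", "DIAGNOSING"],      // resumes once a human answers
  DIAGNOSING: ["NEEDS_INFO", "FIXING", "COMPLETE"], // triage-only runs stop here
  FIXING: ["NEEDS_INFO", "COMPLETE"],
  COMPLETE: [],
};

function transition(from: TriageState, to: TriageState): TriageState {
  if (!allowedTransitions[from].includes(to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```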
Each state has defined inputs and outputs. The agent can pause when it needs human input and resume when it gets it. It generates MCP configuration dynamically based on the project it's investigating. It posts updates back to Task Board as it works.
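Dynamic MCP configuration mostly means writing a per-project config file before launching the agent. Here's a sketch of the idea, assuming a project-scoped .mcp.json with an mcpServers map; the server commands and environment variables are placeholders, not my real setup:

```typescript
import { writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical project descriptor; the field names are illustrative.
interface Project {
  rootDir: string;
  taskBoardUrl: string;
  mongoUri?: string;
}

// Write an MCP config tailored to the project the agent is about to investigate.
function writeMcpConfig(project: Project): void {
  const config = {
    mcpServers: {
      "task-board": {
        command: "node",
        args: ["./mcp/task-board-server.js"],
        env: { TASK_BOARD_URL: project.taskBoardUrl },
      },
      // Only wire up MongoDB when the project actually has a database to query.
      ...(project.mongoUri
        ? {
            mongodb: {
              command: "node",
              args: ["./mcp/mongo-server.js"],
              env: { MONGODB_URI: project.mongoUri },
            },
          }
        : {}),
    },
  };
  writeFileSync(join(project.rootDir, ".mcp.json"), JSON.stringify(config, null, 2));
}
```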
On paper, this is beautiful. In reality, it exposed every limitation of the current AI tooling ecosystem.
What Works
Code exploration is excellent. Give Claude Code a bug description and point it at a codebase, and it's genuinely good at finding the relevant files, understanding the code structure, and forming hypotheses about what might be wrong. It'll trace through call chains, identify potential null reference issues, spot off-by-one errors, and flag race conditions. Not perfectly, but well enough to be useful.
Structured reporting is excellent. When the agent produces a triage report - "I investigated the payment processing failure, found that the Stripe webhook handler doesn't account for duplicate events, and here's the specific file and line number" - the quality is high. It does in minutes what would take a developer an hour, and the structured output is actually more consistent than what most developers write in bug tickets.
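Part of why the output is consistent is that I ask for it in a fixed shape. Something along these lines - the field names and the filled-in values below are illustrative, not my exact schema:

```typescript
// The kind of structured triage output that beats a free-form bug comment.
interface TriageReport {
  bugId: string;
  summary: string;               // one-paragraph description of the failure
  suspectedCause: string;        // the hypothesis, stated plainly
  evidence: string[];            // log excerpts, stack traces, query results
  locations: { file: string; line?: number; reason: string }[];
  confidence: "low" | "medium" | "high";
  proposedFix?: string;          // only present when the agent got that far
  questionsForHuman: string[];   // empty when no human input is needed
}

// Illustrative instance only - the file path and line number are made up.
const example: TriageReport = {
  bugId: "TB-412",
  summary: "Payment processing fails intermittently on webhook delivery.",
  suspectedCause: "The Stripe webhook handler does not deduplicate repeated events.",
  evidence: ["Two deliveries of the same Stripe event id, seconds apart, in the logs."],
  locations: [{ file: "src/payments/webhooks.ts", line: 87, reason: "handler entry point" }],
  confidence: "high",
  questionsForHuman: [],
};
```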
Simple fixes work. For straightforward bugs - missing null checks, incorrect string comparisons, obvious logic errors - the agent can produce working fixes. Not always on the first attempt, but the iterative process of "try a fix, run the tests, adjust" works when the problem is well-bounded.
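The loop itself is mechanically simple. Here's a sketch of its shape, assuming the agent call is injected and the project's tests run with npm test - both assumptions, not details from my setup:

```typescript
import { execSync } from "node:child_process";

type ProposeFix = (bugId: string, previousFailure?: string) => Promise<void>;

// Drive the agent through "try a fix, run the tests, adjust" a bounded number of times.
async function fixLoop(bugId: string, proposeFix: ProposeFix, maxAttempts = 3): Promise<boolean> {
  let lastFailure: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await proposeFix(bugId, lastFailure); // the agent edits the working tree
    try {
      execSync("npm test", { stdio: "pipe" }); // assumes the project's tests run via npm
      return true; // tests pass: hand the diff to a human for review
    } catch (err) {
      // Feed the failing test output back into the next attempt.
      lastFailure = err instanceof Error && "stdout" in err ? String((err as any).stdout) : String(err);
    }
  }
  return false; // give up and escalate with the last failure attached
}
```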
What Doesn't Work (Yet)
Claude Code is interactive. This is the fundamental challenge. It's designed for conversation, not fire-and-forget. When it hits something ambiguous, it wants to ask. "Should I fix this in the controller or the service layer?" "There are two possible causes - which should I investigate first?" In an automated pipeline, there's nobody to ask.
I worked around this by defining decision heuristics - rules like "if there are two possible causes, investigate both and present findings for both" - but this only covers common scenarios. Edge cases still require human intervention, which defeats the purpose of automation.
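Concretely, the heuristics are standing rules prepended to the agent's instructions, so that ambiguity resolves to a default instead of a question. A sketch - the first rule paraphrases the one above; the others are invented examples of the same flavour:

```typescript
// Standing rules injected into the prompt so ambiguity has a default answer.
const decisionHeuristics: string[] = [
  "If there are two possible causes, investigate both and present findings for both.",
  "If a change could live in the controller or the service layer, pick one, say why, and note the alternative.",
  "If no rule applies and required information is missing, transition to NEEDS_INFO instead of guessing.",
];

function buildTriagePrompt(bugDescription: string): string {
  return [
    "You are triaging a bug. Do not ask the user questions; apply these rules instead:",
    ...decisionHeuristics.map((rule, i) => `${i + 1}. ${rule}`),
    "",
    "Bug report:",
    bugDescription,
  ].join("\n");
}
```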
Context windows are a real constraint. Large codebases exceed the token limits. The agent can't hold the entire codebase in memory, so it has to make decisions about what to read. Usually it makes good decisions, but sometimes it misses relevant context in a file it didn't think to check.
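To give a sense of the constraint: everything the agent reads has to fit in a token budget, so something - the agent's own judgment, or a pre-filter like the hypothetical one below - decides which files get read at all. This isn't how Claude Code works internally; it's just an illustration of the packing problem:

```typescript
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join, extname } from "node:path";

// Very rough estimate: ~4 characters per token for typical source code.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Score files by keyword overlap with the bug description, then pack them into a budget.
function selectFiles(rootDir: string, bugDescription: string, budget = 100_000): string[] {
  const keywords = bugDescription.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const candidates: { path: string; score: number; tokens: number }[] = [];

  const walk = (dir: string): void => {
    for (const entry of readdirSync(dir)) {
      const path = join(dir, entry);
      if (statSync(path).isDirectory()) {
        if (entry !== "node_modules" && entry !== ".git") walk(path);
      } else if ([".ts", ".js", ".cs"].includes(extname(path))) { // adjust to the project's languages
        const text = readFileSync(path, "utf8").toLowerCase();
        const score = keywords.filter((k) => text.includes(k)).length;
        if (score > 0) candidates.push({ path, score, tokens: estimateTokens(text) });
      }
    }
  };
  walk(rootDir);

  // Highest-scoring files first, skipping anything that would blow the budget.
  candidates.sort((a, b) => b.score - a.score);
  const selected: string[] = [];
  let used = 0;
  for (const c of candidates) {
    if (used + c.tokens <= budget) {
      selected.push(c.path);
      used += c.tokens;
    }
  }
  return selected;
}
```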
MCP tooling is immature. Connecting Claude Code to external systems via MCP works, but the ecosystem is young. Error handling is basic. Authentication flows are clunky. Building custom MCP servers for Task Board and MongoDB was more work than I expected, and the debugging experience when something goes wrong in the MCP layer is painful.
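For context, the MCP server itself isn't much code - the pain is everything around it. Here's roughly the shape of the Task Board one, sketched against the MCP TypeScript SDK's quickstart-style API; the tool name, endpoint, and environment variable are placeholders rather than my real implementation:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "task-board", version: "0.1.0" });

// Expose a single tool the agent can call to post triage updates back to Task Board.
server.tool(
  "post_triage_update",
  {
    bugId: z.string(),
    status: z.enum(["INVESTIGATING", "NEEDS_INFO", "DIAGNOSING", "FIXING", "COMPLETE"]),
    report: z.string(),
  },
  async ({ bugId, status, report }) => {
    // Placeholder endpoint; the real Task Board API is not shown in this post.
    const res = await fetch(`${process.env.TASK_BOARD_URL}/api/bugs/${bugId}/updates`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ status, report }),
    });
    return {
      content: [{ type: "text" as const, text: `Task Board responded ${res.status}` }],
    };
  },
);

// Claude Code launches this over stdio, per the generated .mcp.json entry.
await server.connect(new StdioServerTransport());
```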
The Crucial Insight
Here's what I learned: the value isn't in fully autonomous bug fixing. It's in autonomous triage.
Getting an AI to reliably fix arbitrary bugs in a production codebase is still beyond what current tools can do consistently. But getting an AI to investigate a bug, understand the context, narrow down the cause, and present a structured report to a developer? That works today. And it saves a significant amount of developer time - the investigation phase of bug fixing is often the most time-consuming part.
I've settled on a "triage agent" model rather than a "fix agent" model. The AI does the detective work. A human reviews the findings and decides on the fix. The AI can then implement the fix under human supervision. It's less glamorous than full autonomy, but it actually works.
Where This Is Heading
I genuinely believe that within a year or two, autonomous AI agents will handle a significant portion of routine bug fixing. The technology is improving fast. Context windows are getting larger. Tool integration is getting more robust. The interaction model is evolving from conversational to agentic.
But right now, in February 2026, the honest assessment is: AI agents are an incredibly powerful augmentation tool for bug investigation and triage, a moderately useful tool for simple fixes, and not yet reliable enough for complex fixes in production codebases without human oversight.
That's still a massive step forward from where we were six months ago. And the pace of improvement suggests that the gap between "triage agent" and "fix agent" is closing fast.
I'll keep building. I'll keep testing the boundaries. And I'll keep being honest about what works and what doesn't. Because that's how you actually make progress - not by pretending the technology is further along than it is, but by using it where it genuinely helps and pushing it forward where it doesn't.