Debugging in Production

By Alex Peng (@aJinTonic)
Debugging in production is a skill most developers learn under pressure, during an incident. The better approach is to think about it before something goes wrong, both in terms of the observability you put in place and the systematic process you follow when you need it.
The Difference
Local debugging: you can reproduce the issue, inspect state directly, add log statements, set breakpoints, and iterate in seconds.
Production debugging: you usually can't reproduce the issue locally. You have limited visibility into state. You can't modify running code. You have to reason from evidence (logs, metrics, traces) to find the root cause.
The skill is asking the right questions of the evidence you have.
Before the Incident: Observability
You can't debug what you can't observe. Good production observability has three pillars:
Logs: structured, not just strings.
```javascript
// ❌ Hard to query, hard to correlate
console.log(`Failed to process order ${orderId} for user ${userId}`)

// ✓ Queryable, correlatable
logger.error('order.processing.failed', {
  orderId,
  userId,
  error: err.message,
  stack: err.stack,
  requestId: req.id,
})
```
Structured logs can be queried by field. You can find all errors for a specific user, all failures of a specific type, etc.
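To make that concrete, here's a minimal in-memory sketch (hypothetical records and a hand-rolled query helper; a real system would run this kind of filter against a log backend like Elasticsearch or Loki, not an array):

```javascript
// Hypothetical structured log records, as emitted by the logger above.
const logs = [
  { level: 'error', event: 'order.processing.failed', userId: 'u1', orderId: 'o9' },
  { level: 'info',  event: 'order.created',           userId: 'u2', orderId: 'o10' },
  { level: 'error', event: 'order.processing.failed', userId: 'u1', orderId: 'o11' },
];

// "All failures of this type for this user" becomes a simple filter by field.
function query(records, criteria) {
  return records.filter((r) =>
    Object.entries(criteria).every(([key, value]) => r[key] === value)
  );
}

const failures = query(logs, { event: 'order.processing.failed', userId: 'u1' });
console.log(failures.length); // 2
```

With unstructured string logs, the same question requires fragile regex matching over message text.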
Metrics: counters, gauges, histograms, things you can graph and alert on. Track error rates, p95/p99 latency, queue depths, cache hit rates. Metrics tell you that something is wrong.
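A hand-rolled sketch of the three metric types, just to show what each one records (illustrative only; in practice you'd use a metrics client such as prom-client or StatsD rather than these hypothetical classes):

```javascript
class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; }   // counters only go up (e.g. error count)
}

class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; }        // gauges go up and down (e.g. queue depth)
}

class Histogram {
  constructor(buckets) {
    this.buckets = buckets;                     // cumulative upper bounds
    this.counts = buckets.map(() => 0);
  }
  observe(v) {                                  // count observations per bucket
    this.buckets.forEach((limit, i) => { if (v <= limit) this.counts[i] += 1; });
  }
}

const errors = new Counter();
const queueDepth = new Gauge();
const latencyMs = new Histogram([50, 100, 500, 1000]);

errors.inc();
queueDepth.set(42);
[12, 80, 700].forEach((ms) => latencyMs.observe(ms));
console.log(errors.value, queueDepth.value, latencyMs.counts); // 1 42 [ 1, 2, 2, 3 ]
```

Histogram buckets are what make p95/p99 latency computable later: you alert on the distribution, not on individual requests.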
Traces: a trace follows a request through multiple services. Traces tell you where something is slow or failing. If you're using a distributed system or microservices, traces are non-negotiable.
When an Incident Happens
1. Understand the Scope
Before you start debugging, understand the blast radius:
- How many users are affected? All of them, or a subset?
- Is it a hard error or a slowdown?
- When did it start?
- Did anything change around that time (deploy, config change, traffic spike)?
The answers shape your hypothesis. A bug affecting only users with a specific account type is a different investigation than a bug affecting everyone.
2. Look at Your Metrics First
Don't start by reading logs; they're high volume and hard to scan. Start with your metrics dashboard:
- Error rate by endpoint: which endpoint is throwing errors?
- Latency: is something slow upstream?
- Downstream service health: did a dependency start failing?
- Infrastructure: is a database at capacity? Is a service running out of memory?
Metrics narrow the scope. Once you know where the problem is, you can look at logs for why.
3. Correlate With Logs
Once you have a hypothesis about where the problem is, use logs to confirm it. Good questions to answer:
- What's the error message and stack trace?
- Is there a pattern in which requests fail? (user type, request parameters, geographic region)
- What's the request context around the failure? (request ID, user ID, feature flags)
Use your logging infrastructure's query language to filter:

```
level:error AND service:order-service AND timestamp:[now-1h TO now]
```
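Once you have one failing request's ID, the same idea works the other way: pull every record that shares that request ID to reconstruct the request's timeline. A sketch over hypothetical in-memory records (a real backend would do this server-side):

```javascript
// Hypothetical log records from two interleaved requests.
const logs = [
  { requestId: 'req-1', event: 'order.received' },
  { requestId: 'req-2', event: 'order.received' },
  { requestId: 'req-1', event: 'discount.lookup', discount: null },
  { requestId: 'req-1', event: 'order.processing.failed', error: 'TypeError' },
];

// Filter by request ID to see only the failing request's story, in order.
const timeline = logs.filter((l) => l.requestId === 'req-1');
console.log(timeline.map((l) => l.event));
// [ 'order.received', 'discount.lookup', 'order.processing.failed' ]
```

This is why attaching `requestId` to every log line (as in the logging example earlier) pays off during an incident.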
4. Form and Test Hypotheses
Debugging is hypothesis-driven. Based on what you see, form a specific hypothesis: "the error happens when discount is NULL because we're not handling that case in the pricing calculation." Then test it. Can you verify it from the logs? Can you reproduce it in staging?
Avoid the trap of making changes blindly. Every change you make in production during an incident adds noise. If the issue resolves, you don't know why. Change one thing at a time and observe the effect.
5. Mitigate, Then Fix
Often the right first step is not the fix but mitigation, reducing the impact while you investigate the root cause:
- Roll back to the last known good deploy
- Disable the feature flag that enabled the new code
- Route traffic away from the affected service
- Scale up if it's a capacity issue
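The feature-flag option only works if the flag was wired in as a kill switch up front. A minimal hypothetical sketch of the pattern (real systems use a flag service such as LaunchDarkly or Unleash rather than an in-process map):

```javascript
// Hypothetical in-process flag store; a real one is updated without a deploy.
const flags = new Map([['new-pricing-engine', true]]);

function isEnabled(flag) {
  return flags.get(flag) === true;
}

const newPricingEngine = (order) => order.total * 0.9; // hypothetical new code path
const legacyPricing = (order) => order.total;          // known-good fallback

function computePrice(order) {
  return isEnabled('new-pricing-engine')
    ? newPricingEngine(order)
    : legacyPricing(order);
}

flags.set('new-pricing-engine', false);     // mitigation: flip the flag off
console.log(computePrice({ total: 100 }));  // 100 (back on the legacy path)
```

Flipping the flag routes traffic back to the old path instantly, without a rollback deploy.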
Mitigation buys you time to do a proper root cause analysis without the clock ticking. Don't merge a hasty fix if you're not confident in it; that's how you introduce a second bug.
After the Incident
A post-mortem (or "learning review") is worth writing even for small incidents. Capture:
- What happened: timeline of the incident
- Root cause: what actually caused the problem
- What we did: how we detected and resolved it
- What we learned: what observability gaps, process gaps, or code issues contributed
- Follow-up action items: specific things to improve
The goal isn't blame; it's to make the next incident shorter and, ideally, to prevent the same class of bug from happening again.
The Meta-Skill
Production debugging is fundamentally about reasoning under uncertainty with incomplete information. The best debuggers I've worked with share a few traits: they state their hypotheses explicitly before acting, they're comfortable saying "I don't know" and going to get more data, and they don't confuse correlation with causation.
The good news: it's a skill you can build deliberately. Review past incidents. Practice reading traces. Get familiar with your observability tooling before you need it urgently. The investigation that would take an hour under pressure takes ten minutes when you know the tools.