MTTR: What it means and how to improve it
Last updated: March 03, 2026
It's 2am. Your phone buzzes with an alert. Your site is down.
You fumble for your laptop, try to remember where the logs are, SSH into the server, and eventually figure out the problem. By the time you've fixed it, it's 4am.
The next morning, someone asks: "How long did it take to resolve?"
That's your MTTR.
Table of contents
- What is MTTR?
- The four types of MTTR
- How to calculate MTTR
- What's a good MTTR?
- MTTR vs MTBF
- How to improve your MTTR
What is MTTR?
MTTR stands for Mean Time to Resolution (or Recovery, or Repair - more on that in a second). It measures the average time it takes your team to resolve incidents.
The idea is simple: the faster you can get your service back online, the less pain for your customers and your business.
If you had three incidents last month that took 30 minutes, 45 minutes, and 15 minutes to resolve, your MTTR would be (30 + 45 + 15) / 3 = 30 minutes.
The four types of MTTR
Here's where things get confusing. MTTR can actually mean four different things:
Mean Time to Resolution - The total time from incident start to "everything's back to normal". This includes detection, diagnosis, fixing, and verification. This is the most common definition for software teams.
Mean Time to Recovery - How long until service is restored. You might not know why it broke, but users can use the product again.
Mean Time to Repair - Just the time spent actively fixing. Doesn't include waiting around or diagnosis. More common in hardware/manufacturing contexts.
Mean Time to Respond - How quickly someone acknowledges the incident after an alert fires. Sometimes called MTTA (Mean Time to Acknowledge).
For most of us running web services, Mean Time to Resolution is what we care about - it captures the full customer impact.
The key thing is to pick one definition and stick with it. Otherwise your metrics become meaningless.
How to calculate MTTR
The formula is straightforward:
MTTR = Total downtime / Number of incidents
So if you had 4 incidents last month with a combined downtime of 2 hours:
MTTR = 120 minutes / 4 incidents = 30 minutes
That's it. Nothing fancy.
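If you log incident start and end times, the calculation is a few lines of code. Here's a minimal sketch in Python (the incident data is made up to match the example above):

```python
from datetime import datetime

# Hypothetical incident log: (started, resolved) timestamps
incidents = [
    (datetime(2026, 2, 3, 2, 10), datetime(2026, 2, 3, 2, 40)),    # 30 min
    (datetime(2026, 2, 10, 14, 0), datetime(2026, 2, 10, 14, 45)), # 45 min
    (datetime(2026, 2, 21, 9, 5), datetime(2026, 2, 21, 9, 30)),   # 25 min
    (datetime(2026, 2, 28, 23, 30), datetime(2026, 2, 28, 23, 50)),# 20 min
]

# MTTR = total resolution time / number of incidents
durations = [(resolved - started).total_seconds() / 60
             for started, resolved in incidents]
mttr_minutes = sum(durations) / len(durations)
print(f"MTTR: {mttr_minutes:.0f} minutes")  # 120 minutes / 4 incidents = 30
```

The important part is recording honest timestamps, not the arithmetic.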
One thing to watch out for: make sure you're measuring from when the incident actually started, not when you first noticed it. If your site was down for an hour before anyone noticed, that hour counts.
This is why monitoring your uptime matters - it reduces the gap between "incident started" and "someone's working on it".
What's a good MTTR?
It depends on what you're building, but here are some rough benchmarks from DORA's research:
| Performance | MTTR |
|---|---|
| Elite | Less than 1 hour |
| High | Less than 1 day |
| Medium | Less than 1 week |
| Low | More than 1 week |
If you're running a SaaS product, you probably want to aim for under an hour. For critical infrastructure like payments, under 15 minutes.
But here's the thing: don't obsess over the number.
I've seen teams game their MTTR by closing incidents early, or not counting certain outages as "real" incidents. That's worse than having a high MTTR, because now you're lying to yourself.
Focus on actually getting better at resolving incidents, and the number will follow.
MTTR vs MTBF
You'll often see MTTR mentioned alongside MTBF (Mean Time Between Failures).
MTTR tells you how fast you fix things.
MTBF tells you how often things break.
Both matter. A 5-minute MTTR is great, but if you're having incidents every day, something's fundamentally broken.
Ideally you want:
- High MTBF (things rarely break)
- Low MTTR (when they do break, you fix them fast)
If you're having lots of incidents, work on reliability first. If incidents are rare but take forever to resolve, work on your incident response.
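The two metrics combine into availability: the fraction of time your service is up is MTBF / (MTBF + MTTR). A quick sketch with made-up numbers:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical: a failure every 30 days on average, 30 minutes to resolve
print(f"{availability(mtbf_hours=720, mttr_hours=0.5):.4%}")
```

Notice that halving either number improves availability, which is why the advice above splits into "break less often" and "fix faster".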
How to improve your MTTR
Here are practical things you can do today:
Detect incidents faster
You can't fix what you don't know about. The time between "something broke" and "someone's looking at it" is often the biggest chunk of your MTTR.
Set up uptime monitoring that checks your site frequently (every 30 seconds is ideal) from multiple locations. When something goes wrong, you want to know within minutes, not hours.
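As a toy sketch of the idea (function names and thresholds are mine, not from any monitoring product): probe the site on an interval, and only alert after a few consecutive failures so a single blip doesn't page anyone.

```python
import urllib.request
from typing import Iterable, Iterator

def is_up(url: str, timeout: float = 5.0) -> bool:
    """One probe: does the URL answer with a non-error status in time?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False

def alert_indices(checks: Iterable[bool], failures_to_alert: int = 3) -> Iterator[int]:
    """Given a stream of check results (True = up), yield the position where an
    alert should fire: the Nth consecutive failure. One healthy check resets
    the counter, so transient flakiness doesn't wake anyone up."""
    consecutive = 0
    for i, up in enumerate(checks):
        if up:
            consecutive = 0
        else:
            consecutive += 1
            if consecutive == failures_to_alert:
                yield i

# A real monitor would run is_up() every 30 seconds from multiple regions and
# feed the results into alert_indices(); separating the probe from the alert
# decision keeps the logic testable.
```

With 30-second checks and a threshold of 3, you'd know about an outage within about 90 seconds, which is what closes the gap between "incident started" and "someone's working on it".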
Write runbooks
At 3am, you won't remember exactly how to restart that one service, or where the logs are, or who to escalate to.
Write it down. Keep a document for each service with:
- What does this service do?
- How do I check if it's working?
- Common problems and how to fix them
- Who to escalate to if you're stuck
I've written more about this in writing your first runbooks.
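A runbook doesn't need to be fancy. Something like this, one file per service, is enough to start (the service name, commands, and URLs here are invented placeholders):

```markdown
# Runbook: payments-api

## What does this service do?
Handles checkout and charge retries. Downstream of the orders service.

## How do I check if it's working?
- Dashboard: <link to your dashboard>
- Quick check: `curl https://example.com/healthz` should return 200

## Common problems
- **Queue backed up**: restart the worker: `systemctl restart payments-worker`
- **DB connections exhausted**: check for long-running queries before restarting

## Escalation
If you're stuck after 15 minutes, page the on-call lead.
```

The test of a good runbook: could someone who didn't build the service follow it at 3am?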
Reduce noise
If your team is drowning in alerts, the important ones get lost. When everything's urgent, nothing is.
Review your alerts regularly. If an alert fires and there's no action to take, delete it or change its threshold. I wrote a whole article on saving your team from alert fatigue if you want to dig deeper.
Use status pages
During an incident, your support team gets flooded with "is it down?" messages. This takes time away from actually fixing the problem.
A status page lets customers check the status themselves. Less noise for your team, faster resolution.
Run postmortems
After every significant incident, figure out what happened and why. Not to blame anyone, but to learn.
Ask:
- What broke?
- How did we find out?
- What slowed us down?
- How do we prevent this from happening again?
The goal is to get a bit better each time. Over months, these small improvements compound.
Keep it simple
This one's harder to act on, but worth mentioning: complex systems have more ways to fail, and take longer to debug.
If you're a small team and your architecture looks like a distributed systems textbook, maybe it's time to simplify. I wrote about this more in monitoring your web application as a small team.
MTTR isn't the only metric that matters, but it's a good one to track. It keeps you honest about how your team handles incidents.
Start measuring it, pick one thing to improve, and iterate from there.
