
Recap: Where We Left Off
In Level 1: Surviving Chaos, we built a foundation: version control for infrastructure, automated deployments, basic monitoring, and backup & disaster recovery. In Level 2: Centralized Infrastructure, we unified observability, CI/CD, and cost management into shared platforms that every team uses.
By now, your team has:
- Repeatable, version-controlled infrastructure
- A centralized observability stack (Prometheus + Grafana + Loki)
- Standardized CI/CD pipelines
- Basic cost allocation by service
You’ve moved from chaos to control. But control alone doesn’t tell you if you’re improving. That’s what Level 3 is about.
Welcome to Level 3: Measured Infrastructure — where we define SLIs, set SLOs, implement error budgets, and build a data-driven reliability culture. This is the level where you transition from reactive operations to proactive reliability engineering.
Why “Measured” Is a Prerequisite for Automation
Here’s a truth that surprises many SMB teams: you can’t automate what you can’t measure. Level 4 (Automated) and Level 5 (Platform) depend on having solid metrics to trigger automation decisions. If you don’t know your baseline latency, you can’t auto-scale based on it. If you don’t have error budgets, you can’t automate deployment gating.
Level 3 is where you build the data foundation that makes all future automation possible.
The Three Pillars of Measured Infrastructure
Pillar 1: Service Level Indicators (SLIs)
SLIs are the quantified metrics that reflect the reliability of your service. For most SMBs, three of them come straight from the Four Golden Signals we covered in our observability guide (latency, errors, and traffic), with availability and freshness added:
| SLI | What It Measures | Collection Method |
|---|---|---|
| Request Latency | Time to serve a request (p50, p95, p99) | Prometheus histograms |
| Error Rate | Percentage of requests returning errors | Prometheus counters |
| Throughput | Requests per second | Prometheus counters |
| Availability | Percentage of time service is reachable | Blackbox exporter |
| Freshness | Age of last successful data sync/update | Custom Prometheus gauge |
Don’t define more than 5 SLIs per service. If you have more, you’re measuring things you won’t act on — and that’s just data hoarding, not reliability engineering.
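To make the table concrete, here is a minimal sketch of how the latency, error rate, and throughput SLIs could be derived from raw request records. In practice Prometheus does this for you; the `RequestSample` type and `compute_slis` function are hypothetical names used only for illustration, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(samples, window_seconds):
    """Summarize one observation window into the SLIs from the table."""
    durations = [s.duration_s for s in samples]
    errors = sum(1 for s in samples if s.status >= 500)
    return {
        "latency_p50": percentile(durations, 50),
        "latency_p95": percentile(durations, 95),
        "latency_p99": percentile(durations, 99),
        "error_rate": errors / len(samples),
        "throughput_rps": len(samples) / window_seconds,
    }
```

Only 5xx responses count as errors here, which matches the recording rules later in this guide; whether 4xx responses should count is a decision each service has to make for itself.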
Pillar 2: Service Level Objectives (SLOs)
An SLO is the target you set for each SLI. The magic of SLOs is that they force you to decide how reliable your service actually needs to be — and give you permission to not achieve perfection.
# service-slos.yml: SLO definitions for your services
services:
  api-gateway:
    slo_latency_p99: "200ms"    # 99% of requests under 200ms
    slo_success_rate: "99.9%"   # 99.9% of requests are successful
    slo_availability: "99.95%"  # about 21.6 minutes of downtime per 30 days
  user-service:
    slo_latency_p99: "500ms"    # user-facing but less critical
    slo_success_rate: "99.5%"
    slo_availability: "99.9%"
  batch-processor:
    slo_freshness: "1h"         # data is never more than 1 hour stale
    slo_success_rate: "99%"
Key insight: SLOs for internal services can (and should) be looser than customer-facing ones. Not everything needs five nines. When we work with SMBs through our consulting services, we often find teams over-investing in reliability for internal tools that nobody depends on for revenue.
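As a rough illustration of how targets like these might be checked against measured values, the sketch below parses the string targets and compares them with SLI readings. The `parse_target` and `evaluate` helpers are hypothetical, and it assumes durations are measured in seconds and ratios as values between 0 and 1.

```python
def parse_target(raw):
    """Convert '200ms', '1h', or '99.9%' into seconds or a 0-1 ratio."""
    if raw.endswith("%"):
        return float(raw[:-1]) / 100
    # Check "ms" before "s" so that "200ms" is not read as seconds.
    for suffix, seconds in (("ms", 0.001), ("s", 1.0), ("m", 60.0), ("h", 3600.0)):
        if raw.endswith(suffix):
            return float(raw[: -len(suffix)]) * seconds
    raise ValueError(f"unrecognized SLO target: {raw!r}")


def evaluate(slos, measured):
    """Return {slo_name: met?}. Durations are upper bounds, percentages are lower bounds."""
    results = {}
    for name, raw in slos.items():
        target = parse_target(raw)
        value = measured[name]
        results[name] = value >= target if raw.endswith("%") else value <= target
    return results


if __name__ == "__main__":
    slos = {"slo_latency_p99": "200ms", "slo_availability": "99.95%"}
    measured = {"slo_latency_p99": 0.180, "slo_availability": 0.9991}
    print(evaluate(slos, measured))
```

In this example the latency target is met (180ms against a 200ms ceiling) while availability falls short (99.91% against 99.95%).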
Pillar 3: Error Budgets
An error budget is the amount of unreliability your SLO allows. If your SLO is 99.9% uptime, your error budget is 0.1% — about 43 minutes per month. You can spend this budget however you want: on deployments, on experiments, on maintenance windows.
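The arithmetic behind that figure is simple enough to sketch. The helper names below are hypothetical, and a 30-day month is assumed:

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo, period_days=30):
    """Minutes of full downtime an availability SLO tolerates over the period."""
    return (1 - slo) * period_days * MINUTES_PER_DAY


def error_budget_requests(slo, total_requests):
    """Failed requests a success-rate SLO tolerates for a given request volume."""
    return (1 - slo) * total_requests


if __name__ == "__main__":
    for slo in (0.99, 0.995, 0.999, 0.9995):
        print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min per 30 days")
```

A 99.9% SLO comes out at roughly 43.2 minutes per 30 days, and tightening it to 99.95% halves that to roughly 21.6 minutes. Each extra "nine" shrinks the budget by a factor of ten, which is why the target should be chosen deliberately.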
When the error budget is available, you can deploy faster and take more risks. When it’s exhausted, you stop shipping features and focus on reliability.
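# error-budget.py: simple request-based error budget tracker
import time


class ErrorBudget:
    def __init__(self, slo_percentage, period_seconds):
        self.allowed_error_ratio = 1 - slo_percentage  # e.g., 0.001 for an SLO of 0.999 (99.9%)
        self.period_seconds = period_seconds           # SLO window, e.g., 30 * 24 * 3600
        self.events = []                               # (timestamp, total_requests, failed_requests)

    def record_success(self, count=1):
        self.events.append((time.time(), count, 0))

    def record_failure(self, count=1):
        self.events.append((time.time(), count, count))

    def _error_ratio(self, window_seconds):
        """Failed / total requests seen in the last window_seconds."""
        cutoff = time.time() - window_seconds
        recent = [e for e in self.events if e[0] >= cutoff]
        total = sum(e[1] for e in recent)
        return sum(e[2] for e in recent) / total if total else 0.0

    def budget_remaining(self):
        """Fraction of the budget left over the SLO period (1.0 = untouched, <= 0 = exhausted)."""
        return 1 - self._error_ratio(self.period_seconds) / self.allowed_error_ratio

    def is_budget_exhausted(self):
        return self.budget_remaining() <= 0

    def burn_rate(self, window_minutes=60):
        """Observed error ratio in the window divided by the allowed ratio.
        1.0 means the budget lasts exactly one SLO period; higher burns it sooner."""
        return self._error_ratio(window_minutes * 60) / self.allowed_error_ratio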
Setting Up Your Measurement Infrastructure
Here’s how to implement this with the tools you already have from Level 2:
Step 1: Instrument Your Services
Add Prometheus client libraries to your applications. Most languages have mature support:
# Python example with prometheus_client
import time

from flask import Flask, Response, g, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint', 'status'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

@app.route('/metrics')
def metrics():
    # Expose every metric in the default registry in the Prometheus text format
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

@app.before_request
def before_request():
    # Store the start time on Flask's per-request context object
    g.start_time = time.perf_counter()

@app.after_request
def after_request(response):
    latency = time.perf_counter() - g.start_time
    # Label by route template (e.g. /users/<id>) rather than raw path to limit label cardinality
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    labels = dict(method=request.method, endpoint=endpoint, status=str(response.status_code))
    REQUEST_LATENCY.labels(**labels).observe(latency)
    REQUEST_COUNT.labels(**labels).inc()
    return response
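The bucket boundaries matter because Prometheus never sees individual request durations, only how many requests fell at or below each boundary. A p99 read from a histogram is therefore an estimate. The sketch below imitates, in plain Python, the linear interpolation that `histogram_quantile` applies to cumulative buckets; it is a simplified illustration under that assumption, not the Prometheus implementation, and the bucket counts in the example are made up.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0-1) from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    ending with (float("inf"), total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if bound == float("inf"):
                # Beyond the last finite boundary nothing can be interpolated.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")


if __name__ == "__main__":
    observed = [(0.1, 50), (0.25, 90), (0.5, 99), (float("inf"), 100)]
    print(histogram_quantile(0.95, observed))
```

The practical consequence is that a bucket boundary should sit at or near each latency SLO threshold. With a 200ms target and boundaries only at 0.1 and 0.25, the estimate around 200ms depends entirely on the even-spread assumption.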
Step 2: Configure Prometheus SLO Recording Rules
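# prometheus-slo-rules.yml: SLO monitoring rules
groups:
  - name: slo
    rules:
      # Short-window availability, useful for dashboards and burn-rate alerts
      - record: job:slo_availability:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Availability over the full 30-day SLO window (a long range; can be costly to evaluate)
      - record: job:slo_availability:ratio_rate30d
        expr: |
          sum by (job) (rate(http_requests_total{status!~"5.."}[30d]))
          /
          sum by (job) (rate(http_requests_total[30d]))
      # Fraction of the error budget left: 1 = untouched, 0 = exhausted
      - record: job:slo_error_budget_remaining:ratio
        expr: |
          1 - (
            (1 - job:slo_availability:ratio_rate30d)
            /
            (1 - 0.999)  # 99.9% SLO target
          )
      - alert: ErrorBudgetExhausted
        expr: job:slo_error_budget_remaining:ratio <= 0
        for: 5m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Error budget exhausted for job {{ $labels.job }}"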
Step 3: Visualize Your SLOs in Grafana
Create a single "SLO Dashboard" that shows:
- Burn-down chart — how much error budget remains over time
- Burn rate alerts — how fast you're consuming the budget (a spike in burn rate means something is breaking)
- SLO attainment — are you meeting your targets for the current window?
- Multi-window, multi-burn-rate alerts — Google SRE's recommended approach for early warning
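The multi-window, multi-burn-rate idea can be sketched in a few lines. An alert fires only when both a long window and a short window exceed the same burn-rate threshold, so a brief spike that has already ended does not page anyone. The thresholds below are the ones commonly quoted from the Google SRE Workbook for a 30-day SLO window; treat them as a starting point rather than fixed values. The `BurnRateRule` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float  # burn rate that must be exceeded in both windows
    severity: str


# threshold = budget fraction spent * (SLO period / long window), for a 30-day period
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),   # 2% of the budget in 1 hour
    BurnRateRule("6h", "30m", 6.0, "page"),   # 5% of the budget in 6 hours
    BurnRateRule("3d", "6h", 1.0, "ticket"),  # 10% of the budget in 3 days
]


def burn_rate(error_ratio, slo):
    """1.0 means errors arrive exactly fast enough to use the whole budget in one period."""
    return error_ratio / (1 - slo)


def firing_alerts(error_ratios, slo, rules=RULES):
    """error_ratios maps a window name (e.g. "1h") to the error ratio observed over it."""
    fired = []
    for rule in rules:
        long_burn = burn_rate(error_ratios[rule.long_window], slo)
        short_burn = burn_rate(error_ratios[rule.short_window], slo)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule)
    return fired
```

In Prometheus the same logic would be expressed as alert rules over recorded error ratios for each window; the Python version is only meant to show the decision being made.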
Building a Data-Driven Reliability Culture
Metrics alone don't create reliability. You need a culture that uses them:
The Weekly SLO Review
Spend 30 minutes every Monday reviewing error budgets for each service. If a service is burning through budget too fast, it becomes the team's priority for the week. This meeting should be the highest-signal 30 minutes of your week — no dashboard scrolling, just decisions.
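One way to keep that meeting focused on decisions is to sort services by how soon their budget will run out at the current burn rate. The sketch below assumes you already have, per service, the remaining budget as a 0-1 fraction and a burn rate where 1.0 means exactly one budget per period. The field names and sample numbers are hypothetical.

```python
def days_until_exhausted(budget_remaining, burn_rate, period_days=30):
    """Project how many days the remaining budget lasts if the burn rate stays constant."""
    if burn_rate <= 0:
        return float("inf")
    return max(budget_remaining, 0.0) * period_days / burn_rate


def review_order(services):
    """Most urgent service first."""
    return sorted(
        services,
        key=lambda s: days_until_exhausted(s["budget_remaining"], s["burn_rate"]),
    )


if __name__ == "__main__":
    services = [
        {"name": "api-gateway", "budget_remaining": 0.5, "burn_rate": 2.0},
        {"name": "user-service", "budget_remaining": 0.9, "burn_rate": 0.5},
        {"name": "batch-processor", "budget_remaining": 0.1, "burn_rate": 1.5},
    ]
    for svc in review_order(services):
        days = days_until_exhausted(svc["budget_remaining"], svc["burn_rate"])
        print(f"{svc['name']}: {days:.1f} days of budget left at current burn")
```

A constant burn rate is a simplification, so the projection is best read as a way to rank services, not as a forecast.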
Deployment Gating Based on Error Budget
Automate deployment decisions based on budget health. If the API gateway has already consumed 80% of its monthly error budget in the first week, don't deploy more changes — focus on reliability first.
# deploy-gate.yml: Example deployment gate check
deploy_enabled: true
checks:
  - service: api-gateway
    check: error_budget_remaining > 0.2  # Must have at least 20% budget left
    action: block_deploy
  - service: user-service
    check: error_budget_remaining > 0.1
    action: warn_only
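The gate file above is an illustrative format, not the configuration of any particular tool, so something in your pipeline has to evaluate it. A minimal sketch of that evaluation step might look like the following, with the checks expressed as Python data. The `min_budget_remaining` key and the `evaluate_gates` function are hypothetical, and the budget values would come from your metrics backend.

```python
GATES = [
    {"service": "api-gateway", "min_budget_remaining": 0.2, "action": "block_deploy"},
    {"service": "user-service", "min_budget_remaining": 0.1, "action": "warn_only"},
]


def evaluate_gates(budget_remaining, gates=GATES):
    """Return (deploy_allowed, messages) given {service: remaining budget as a 0-1 fraction}."""
    allowed = True
    messages = []
    for gate in gates:
        remaining = budget_remaining[gate["service"]]
        if remaining > gate["min_budget_remaining"]:
            continue  # this gate passes
        messages.append(
            f"{gate['service']}: {remaining:.0%} of error budget left, "
            f"needs more than {gate['min_budget_remaining']:.0%} ({gate['action']})"
        )
        if gate["action"] == "block_deploy":
            allowed = False
    return allowed, messages


if __name__ == "__main__":
    ok, notes = evaluate_gates({"api-gateway": 0.15, "user-service": 0.5})
    print("deploy allowed" if ok else "deploy blocked")
    for note in notes:
        print(" -", note)
```

A CI job could run a check like this before the deploy step and exit non-zero when the result is blocked, while `warn_only` gates just add a message to the pipeline log.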
Postmortems with SLO Data
Every incident postmortem should reference the SLO impact. How much error budget did we consume? How close did we come to exhausting it? This shifts the conversation from "who caused this?" to "what can we measure to prevent it?"
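The "how much budget did we consume?" question has a direct numerical answer. The sketch below shows one way to compute it for both request-based and time-based SLOs; the function names and the example figures are hypothetical.

```python
def budget_consumed_by_requests(failed_requests, total_requests_in_period, slo):
    """Share of the period's error budget used by an incident, counted in failed requests."""
    allowed_failures = (1 - slo) * total_requests_in_period
    return failed_requests / allowed_failures


def budget_consumed_by_downtime(downtime_minutes, slo, period_days=30):
    """Share of the period's error budget used by an outage, counted in minutes."""
    allowed_minutes = (1 - slo) * period_days * 24 * 60
    return downtime_minutes / allowed_minutes


if __name__ == "__main__":
    share = budget_consumed_by_requests(2_500, 10_000_000, slo=0.999)
    print(f"Incident consumed {share:.0%} of the monthly error budget")
```

With 10 million requests a month and a 99.9% SLO, the budget is 10,000 failed requests, so an incident that fails 2,500 of them uses a quarter of the month's budget. Putting that figure at the top of the postmortem makes the severity discussion much shorter.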
Measuring Level 3 Success
You've completed Level 3 when:
- Every service has defined SLIs measured in Prometheus
- Every team knows their SLO targets and error budgets
- Deployments are gated by error budget health
- Incident postmortems include SLO impact analysis
- You can answer "how reliable were we last month?" with one number
- When asked "should we deploy on Friday?" you check the error budget, not a calendar
What's Next: Level 4 — Automated
With your measurement foundation in place, you're ready for Level 4: Automated Infrastructure. Once you know your SLIs, SLOs, and error budgets, you can start automating:
- Auto-scaling based on latency SLOs, not CPU metrics
- Auto-remediation triggered by error budget burn rate
- Automated deployment rollback when error budget is consumed too fast
- Self-healing infrastructure that responds to measurement signals
But first — get Level 3 right. A measured foundation makes everything else easier. Skip it, and your automation will be based on guesswork.
Need help defining your SLIs and SLOs? That's exactly the kind of work we do at DevOps & SRE Hub. We help SMBs build measurement infrastructure that doesn't overcomplicate things — just the data you need to make good reliability decisions.