The SMB Infrastructure Maturity Model: Level 3 — Measured Infrastructure

Recap: Where We Left Off

In Level 1: Surviving Chaos, we built a foundation: version control for infrastructure, automated deployments, basic monitoring, and backup & disaster recovery. In Level 2: Centralized Infrastructure, we unified observability, CI/CD, and cost management into shared platforms that every team uses.

By now, your team has:

Repeatable, version-controlled infrastructure
A centralized observability stack (Prometheus + Grafana + Loki)
Standardized CI/CD pipelines
Basic cost allocation by service

You’ve moved from chaos to control. But control alone doesn’t tell you if you’re improving. That’s what Level 3 is about.

Welcome to Level 3: Measured Infrastructure — where we define SLIs, set SLOs, implement error budgets, and build a data-driven reliability culture. This is the level where you transition from reactive operations to proactive reliability engineering.

Why “Measured” Is a Prerequisite for Automation

Here’s a truth that surprises many SMB teams: you can’t automate what you can’t measure. Level 4 (Automated) and Level 5 (Platform) depend on having solid metrics to trigger automation decisions. If you don’t know your baseline latency, you can’t auto-scale based on it. If you don’t have error budgets, you can’t automate deployment gating.

Level 3 is where you build the data foundation that makes all future automation possible.

The Three Pillars of Measured Infrastructure

Pillar 1: Service Level Indicators (SLIs)

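SLIs are the quantified metrics that reflect the reliability of your service. For most SMBs, a good starting set builds on the Four Golden Signals we covered in our observability guide (latency, traffic, errors, and saturation), swapping saturation for two measures users feel directly: availability and freshness.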

SLI	What It Measures	Collection Method
Request Latency	Time to serve a request (p50, p95, p99)	Prometheus histograms
Error Rate	Percentage of requests returning errors	Prometheus counters
Throughput	Requests per second	Prometheus counters
Availability	Percentage of time service is reachable	Blackbox exporter
Freshness	Age of last successful data sync/update	Custom Prometheus gauge

Don’t define more than 5 SLIs per service. If you have more, you’re measuring things you won’t act on — and that’s just data hoarding, not reliability engineering.
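
To make the table concrete, here is a minimal sketch of how latency, error rate, and throughput SLIs fall out of raw request data. In practice Prometheus derives these from histograms and counters; the RequestSample shape, the nearest-rank percentile method, and the rule that only 5xx responses count as errors are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical shape for this sketch)."""
    duration_seconds: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def compute_slis(samples, window_seconds):
    """Return latency, error-rate, and throughput SLIs for one measurement window."""
    durations = [s.duration_seconds for s in samples]
    errors = sum(1 for s in samples if s.status >= 500)  # assumption: only 5xx counts as an error
    return {
        "latency_p50": percentile(durations, 50),
        "latency_p99": percentile(durations, 99),
        "error_rate": errors / len(samples),
        "throughput_rps": len(samples) / window_seconds,
    }


samples = [
    RequestSample(0.120, 200),
    RequestSample(0.340, 200),
    RequestSample(1.900, 503),
    RequestSample(0.095, 200),
]
print(compute_slis(samples, window_seconds=60))
```

Each returned key corresponds to one row of the table, which is a useful sanity check: if you can't say which decision a number would change, it probably doesn't belong in your SLI set.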

Pillar 2: Service Level Objectives (SLOs)

An SLO is the target you set for each SLI. The magic of SLOs is that they force you to decide how reliable your service actually needs to be — and give you permission to not achieve perfection.

# service-slos.yml — SLO definitions for your services
services:
  api-gateway:
    slo_latency_p99: "200ms"    # 99% of requests under 200ms
    slo_success_rate: "99.9%"   # 99.9% of requests are successful
    slo_availability: "99.95%"  # about 21.6 minutes of downtime per 30 days

  user-service:
    slo_latency_p99: "500ms"    # user-facing but less critical
    slo_success_rate: "99.5%"
    slo_availability: "99.9%"

  batch-processor:
    slo_freshness: "1h"         # data is never more than 1 hour stale
    slo_success_rate: "99%"

Key insight: SLOs for internal services can (and should) be looser than customer-facing ones. Not everything needs five nines. When we work with SMBs through our consulting services, we often find teams over-investing in reliability for internal tools that nobody depends on for revenue.
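
An SLO only helps if you can check it mechanically. The sketch below compares measured SLI values against targets and reports which ones are met. The targets mirror the api-gateway numbers above, rewritten as plain ratios and seconds; the dictionary layout, the "max"/"min" convention, and the measured values are all made up for this example.

```python
def check_slos(measured, targets):
    """Compare measured SLIs with SLO targets.

    Each target is (direction, threshold): "max" means the measured value
    must stay at or below the threshold, "min" means at or above it.
    """
    results = {}
    for name, (direction, threshold) in targets.items():
        value = measured[name]
        results[name] = value <= threshold if direction == "max" else value >= threshold
    return results


# Targets mirroring the api-gateway entry, expressed as plain numbers.
API_GATEWAY_TARGETS = {
    "latency_p99_seconds": ("max", 0.200),
    "success_ratio": ("min", 0.999),
    "availability_ratio": ("min", 0.9995),
}

# Hypothetical measurements for the current window.
measured = {
    "latency_p99_seconds": 0.180,
    "success_ratio": 0.9985,
    "availability_ratio": 0.9997,
}

print(check_slos(measured, API_GATEWAY_TARGETS))
# {'latency_p99_seconds': True, 'success_ratio': False, 'availability_ratio': True}
```

Here the service meets its latency and availability targets but misses the success-ratio target, which is exactly the kind of specific, arguable fact a weekly review needs.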

Pillar 3: Error Budgets

An error budget is the amount of unreliability your SLO allows. If your SLO is 99.9% uptime, your error budget is 0.1% — about 43 minutes per month. You can spend this budget however you want: on deployments, on experiments, on maintenance windows.

When the error budget is available, you can deploy faster and take more risks. When it’s exhausted, you stop shipping features and focus on reliability.
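
The 43-minute figure is simple arithmetic, and it is worth running for your own targets before committing to them. This small sketch assumes a 30-day window and treats the budget as minutes of full downtime; the function name is ours.

```python
def error_budget_minutes(slo_ratio, window_days=30):
    """Minutes of full downtime an availability SLO permits in the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_ratio) * window_minutes


for slo in (0.99, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min per 30 days")
# 99.00% -> 432.0 min per 30 days
# 99.90% -> 43.2 min per 30 days
# 99.95% -> 21.6 min per 30 days
# 99.99% -> 4.3 min per 30 days
```

Each extra nine cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% is usually far more expensive than it looks on paper.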

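# error-budget.py: simple error budget tracker
import time


class ErrorBudget:
    def __init__(self, slo_percentage, period_seconds):
        self.total_budget = 1 - slo_percentage  # allowed failure ratio, e.g., 0.001 for 99.9%
        self.total_period = period_seconds      # length of the SLO window in seconds
        self.events = []                        # (timestamp, succeeded) for each request

    def record_success(self, count=1):
        self._record(True, count)

    def record_failure(self, count=1):
        self._record(False, count)

    def _record(self, succeeded, count):
        now = time.time()
        self.events.extend((now, succeeded) for _ in range(count))

    def _failure_ratio(self, window_seconds):
        """Share of requests that failed within the last window_seconds"""
        cutoff = time.time() - window_seconds
        recent = [ok for ts, ok in self.events if ts >= cutoff]
        return recent.count(False) / len(recent) if recent else 0.0

    @property
    def budget_remaining(self):
        """Fraction of the budget left for the SLO period (1.0 = untouched, <= 0 = exhausted)"""
        return 1 - self._failure_ratio(self.total_period) / self.total_budget

    def is_budget_exhausted(self):
        return self.budget_remaining <= 0

    def burn_rate(self, window_minutes=60):
        """Calculate how fast we're burning through the budget (1.0 = exactly on budget)"""
        return self._failure_ratio(window_minutes * 60) / self.total_budget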

Setting Up Your Measurement Infrastructure

Here’s how to implement this with the tools you already have from Level 2:

Step 1: Instrument Your Services

Add Prometheus client libraries to your applications. Most languages have mature support:

# Python example with prometheus_client
from prometheus_client import (
    CONTENT_TYPE_LATEST, REGISTRY, Counter, Histogram, generate_latest
)
from flask import Flask, Response, g, request
import time

app = Flask(__name__)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint', 'status'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

@app.route('/metrics')
def metrics():
    # Serve the metrics in the exposition format Prometheus scrapes
    return Response(generate_latest(REGISTRY), content_type=CONTENT_TYPE_LATEST)

@app.before_request
def before_request():
    # perf_counter is monotonic, so clock adjustments can't skew the latency
    g.start_time = time.perf_counter()

@app.after_request
def after_request(response):
    latency = time.perf_counter() - g.start_time
    # Label by route template (e.g. /users/<id>) rather than the raw path,
    # so per-ID URLs don't create an unbounded number of time series
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    labels = {
        'method': request.method,
        'endpoint': endpoint,
        'status': response.status_code,
    }
    REQUEST_LATENCY.labels(**labels).observe(latency)
    REQUEST_COUNT.labels(**labels).inc()
    return response

Step 2: Configure Prometheus SLO Recording Rules

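# prometheus-slo-rules.yml: SLO monitoring rules
groups:
  - name: slo
    rules:
      - record: job:slo_availability:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # 30-day availability: the window the error budget is measured over.
      # Needs at least 30 days of retention and is expensive to evaluate.
      - record: job:slo_availability:ratio_rate30d
        expr: |
          sum by (job) (rate(http_requests_total{status!~"5.."}[30d]))
          /
          sum by (job) (rate(http_requests_total[30d]))

      - record: job:slo_error_budget_remaining:ratio
        expr: |
          1 - (1 - job:slo_availability:ratio_rate30d)
          /
          (1 - 0.999)  # 99.9% SLO target

      - alert: ErrorBudgetExhausted
        expr: job:slo_error_budget_remaining:ratio <= 0
        for: 5m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Error budget exhausted for job {{ $labels.job }}"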

Step 3: Visualize Your SLOs in Grafana

Create a single "SLO Dashboard" that shows:

Burn-down chart — how much error budget remains over time
Burn rate alerts — how fast you're consuming the budget (a spike in burn rate means something is breaking)
SLO attainment — are you meeting your targets for the current window?
Multi-window, multi-burn-rate alerts — Google SRE's recommended approach for early warning
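
The multi-window, multi-burn-rate idea is easier to see in code than in a dashboard description. The sketch below fires an alert only when both a long window and a short window exceed the same burn-rate threshold, so you catch sustained problems without paging on a brief blip. The thresholds are the ones commonly cited from the Google SRE Workbook for a 30-day SLO; the class, function names, and input format are assumptions for this example, and in production this logic would normally live in Prometheus alert rules rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float  # burn rate; 1.0 spends the budget exactly over the SLO period
    severity: str


# Thresholds as commonly cited from the Google SRE Workbook (30-day SLO).
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio, slo_ratio):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo_ratio)


def evaluate(error_ratios, slo_ratio, rules=RULES):
    """Return (severity, long_window) for every rule whose long AND short windows both exceed its threshold."""
    fired = []
    for rule in rules:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_ratio)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_ratio)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append((rule.severity, rule.long_window))
    return fired


# Hypothetical error ratios per window for a service with a 99.9% SLO.
observed = {"5m": 0.02, "30m": 0.02, "1h": 0.016, "6h": 0.004, "3d": 0.0005}
print(evaluate(observed, slo_ratio=0.999))
```

With these numbers only the fast-burn rule fires: the last hour is burning budget roughly 16 times faster than sustainable, while the 6-hour and 3-day windows are still under their thresholds.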

Building a Data-Driven Reliability Culture

Metrics alone don't create reliability. You need a culture that uses them:

The Weekly SLO Review

Spend 30 minutes every Monday reviewing error budgets for each service. If a service is burning through budget too fast, it becomes the team's priority for the week. This meeting should be the highest-signal 30 minutes of your week — no dashboard scrolling, just decisions.

Deployment Gating Based on Error Budget

Automate deployment decisions based on budget health. If the API gateway has already consumed 80% of its monthly error budget in the first week, don't deploy more changes — focus on reliability first.

# deploy-gate.yml — Example deployment gate check
deploy_enabled: true
checks:
  - service: api-gateway
    check: error_budget_remaining > 0.2  # Must have at least 20% budget left
    action: block_deploy
  - service: user-service
    check: error_budget_remaining > 0.1
    action: warn_only

Postmortems with SLO Data

Every incident postmortem should reference the SLO impact. How much error budget did we consume? How close did we come to exhausting it? This shifts the conversation from "who caused this?" to "what can we measure to prevent it?"
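
To put a number in the postmortem, you can express an outage as a share of the error budget. This sketch assumes an availability SLO over a 30-day window and a full outage for the stated duration; the function name and the 12-minute incident are made up for illustration.

```python
def incident_budget_impact(downtime_minutes, slo_ratio, window_days=30):
    """Fraction of the window's error budget consumed by one full outage."""
    budget_minutes = (1 - slo_ratio) * window_days * 24 * 60
    return downtime_minutes / budget_minutes


impact = incident_budget_impact(downtime_minutes=12, slo_ratio=0.999)
print(f"Incident consumed {impact:.0%} of the monthly error budget")
# Incident consumed 28% of the monthly error budget
```

A partial outage can be handled the same way by scaling the downtime by the share of requests that failed.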

Measuring Level 3 Success

You've completed Level 3 when:

Every service has defined SLIs measured in Prometheus
Every team knows their SLO targets and error budgets
Deployments are gated by error budget health
Incident postmortems include SLO impact analysis
You can answer "how reliable were we last month?" with one number
When asked "should we deploy on Friday?" you check the error budget, not a calendar

What's Next: Level 4 — Automated

With your measurement foundation in place, you're ready for Level 4: Automated Infrastructure. Once you know your SLIs, SLOs, and error budgets, you can start automating:

Auto-scaling based on latency SLOs, not CPU metrics
Auto-remediation triggered by error budget burn rate
Automated deployment rollback when error budget is consumed too fast
Self-healing infrastructure that responds to measurement signals

But first — get Level 3 right. A measured foundation makes everything else easier. Skip it, and your automation will be based on guesswork.

Need help defining your SLIs and SLOs? That's exactly the kind of work we do at DevOps & SRE Hub. We help SMBs build measurement infrastructure that doesn't overcomplicate things — just the data you need to make good reliability decisions.
