
Recap: Where We Left Off
In Level 1: Surviving Chaos, we tackled the fundamentals: version control for infrastructure, automated deployments, basic monitoring, and backup & disaster recovery. If you’ve implemented those steps, you’ve moved from panic-driven operations to a repeatable foundation.
Now it’s time for Level 2: Centralized Infrastructure Management.
At Level 1, each team or service likely manages its own infrastructure independently. The DevOps team has their Terraform. The backend team has their Docker Compose files. The data team has their own scripts. This works for a while — until you realize:
- No one knows what’s actually running in production
- There are three different monitoring dashboards, all showing different things
- Each team has its own CI/CD setup with different standards
- Onboarding a new service requires weeks of tribal knowledge transfer
- Cost allocation is impossible — you can’t tell which service costs what
Level 2 solves all of this by centralizing your infrastructure tooling, observability, and governance — without creating a bottleneck that slows teams down.
What “Centralized” Means (and Doesn’t Mean)
Let’s clear up a common misconception: centralization doesn’t mean one team controls everything and everyone else submits tickets. That’s the opposite of DevOps.
Centralized infrastructure means:
- Shared tooling and platforms that every team uses
- Standardized patterns for deploying, monitoring, and scaling services
- Single source of truth for infrastructure state and costs
- Self-service capabilities so teams can deploy independently
It does NOT mean:
- A single ops team as a bottleneck
- One-size-fits-all tooling that doesn’t fit anyone
- Removing team autonomy and ownership
The Centralization Stack for SMBs
1. Centralized Observability (Single Pane of Glass)
Every team should work from the same dashboards, logs, and alerts. This is the highest-impact first step: responders stop reconciling conflicting views during an incident, which shortens mean time to recovery (MTTR) and ends the “whose dashboard is right?” problem.
# docker-compose.observability.yml - Centralized observability stack
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana
    depends_on: [prometheus]
    ports: ["3000:3000"]
    environment:
      # Anonymous access skips the login page; keep it only on a trusted internal network
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - grafana_data:/var/lib/grafana
  loki:
    image: grafana/loki
    ports: ["3100:3100"]
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
volumes:
  prometheus_data:
  grafana_data:
  loki_data:
2. Centralized Infrastructure Registry
Maintain a single inventory of every service, its dependencies, owners, and cost allocation. For SMBs, this doesn’t need to be fancy:
# infrastructure.yml - Centralized service registry
services:
  api-gateway:
    owner: "platform-team"
    repository: "github.com/org/api-gateway"
    infrastructure: "terraform/environments/prod/api-gateway"
    monitoring: "grafana/dashboards/api-gateway.json"
    alerts: "pagerduty/api-gateway"
    cloud_resources: ["ecs:api-gateway-prod", "rds:api-gateway-db"]
    cost_center: "platform"
    criticality: "tier-1"
  user-service:
    owner: "backend-team"
    repository: "github.com/org/user-service"
    infrastructure: "terraform/environments/prod/user-service"
    monitoring: "grafana/dashboards/user-service.json"
    alerts: "pagerduty/user-service"
    cloud_resources: ["ecs:user-service-prod", "rds:user-service-db", "elasticache:user-sessions"]
    cost_center: "product"
    criticality: "tier-1"
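A registry is only useful if every entry is complete, so it can help to check it automatically, for example in CI. The following is a minimal sketch, not a definitive implementation: the function name `validate_registry`, the `reports` entry, and the choice to treat all eight fields as mandatory are assumptions made for illustration. The dictionary stands in for the parsed contents of infrastructure.yml.

```python
"""Minimal sketch: check that every registry entry carries the expected fields."""

# Fields used by the example registry above; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_FIELDS = (
    "owner", "repository", "infrastructure", "monitoring",
    "alerts", "cloud_resources", "cost_center", "criticality",
)


def validate_registry(registry: dict) -> list[str]:
    """Return human-readable problems; an empty list means the registry passes."""
    problems = []
    for name, entry in registry.get("services", {}).items():
        for field in REQUIRED_FIELDS:
            # A field counts as missing if it is absent or empty.
            if not entry.get(field):
                problems.append(f"{name}: missing or empty '{field}'")
    return problems


# Stand-in for the parsed YAML file. "reports" is a hypothetical,
# deliberately incomplete entry.
registry = {
    "services": {
        "api-gateway": {
            "owner": "platform-team",
            "repository": "github.com/org/api-gateway",
            "infrastructure": "terraform/environments/prod/api-gateway",
            "monitoring": "grafana/dashboards/api-gateway.json",
            "alerts": "pagerduty/api-gateway",
            "cloud_resources": ["ecs:api-gateway-prod", "rds:api-gateway-db"],
            "cost_center": "platform",
            "criticality": "tier-1",
        },
        "reports": {"owner": "data-team"},
    }
}

problems = validate_registry(registry)
for problem in problems:
    print(problem)
```

Failing the build when the list is non-empty keeps the registry from drifting as new services are added.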
3. Centralized CI/CD Platform
Instead of each team reinventing their CI/CD, create a shared set of reusable workflows. In GitHub Actions, this means composite actions and reusable workflows:
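# .github/workflows/deploy-template.yml - Reusable deploy workflow
on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      dockerfile:
        default: Dockerfile
        type: string
    secrets:
      deploy_key:
        required: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        run: |
          docker build -f ${{ inputs.dockerfile }} -t app:latest .
          # Assumes a Node.js image; swap in the service's own test command
          docker run --rm app:latest npm test
      - name: Deploy
        env:
          DEPLOY_KEY: ${{ secrets.deploy_key }}
        run: |
          # Write the SSH key with owner-only permissions so ssh accepts it
          (umask 077 && printf '%s\n' "$DEPLOY_KEY" > deploy_key)
          # "host" is a placeholder; the remote compose file must reference images already published to a registry
          ssh -i deploy_key -o StrictHostKeyChecking=accept-new deploy@host "docker compose pull && docker compose up -d"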
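Then each team calls it from a short workflow of its own. The local `./` path works when the template lives in the same repository; from another repository, reference it as `owner/repo/.github/workflows/deploy-template.yml@ref`: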
# .github/workflows/user-service.yml - Team-specific config
name: Deploy User Service
on:
  push:
    branches: [main]
jobs:
  deploy:
    uses: ./.github/workflows/deploy-template.yml
    with:
      environment: production
    secrets:
      deploy_key: ${{ secrets.DEPLOY_KEY }}
4. Centralized Cost Management
You can’t optimize what you can’t measure. Set up cost allocation tagging across all cloud resources:
# Tagging standard for all cloud resources
Required Tags:
- service: (name from registry)
- environment: (prod/staging/dev)
- team: (owning team name)
- cost-center: (product/platform/data/infra)
- terraform: (true/false)
- created-by: (tool/username)
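A tagging standard tends to erode unless something checks it. The sketch below shows one way to test a resource's tags against the list above. It is illustrative only: the function name `check_tags` and the sample resource are hypothetical, the allowed values are copied from the parenthesized hints in the standard, and fetching real tags from a cloud provider is left out.

```python
"""Minimal sketch: check one resource's tags against the tagging standard."""

REQUIRED_TAGS = ("service", "environment", "team", "cost-center", "terraform", "created-by")

# Tags whose values are restricted to a fixed set in the standard above.
ALLOWED_VALUES = {
    "environment": {"prod", "staging", "dev"},
    "cost-center": {"product", "platform", "data", "infra"},
    "terraform": {"true", "false"},
}


def check_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of violations; an empty list means the resource complies."""
    violations = [f"missing tag '{key}'" for key in REQUIRED_TAGS if key not in tags]
    for key, allowed in ALLOWED_VALUES.items():
        value = tags.get(key)
        if value is not None and value not in allowed:
            violations.append(f"tag '{key}' has unexpected value '{value}'")
    return violations


# Hypothetical resource that predates the standard.
legacy_tags = {
    "service": "user-service",
    "environment": "production",  # the standard expects "prod"
    "team": "backend-team",
}

violations = check_tags(legacy_tags)
for violation in violations:
    print(violation)
```

Running a check like this over an inventory of existing resources gives a concrete backlog for retroactive tagging.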
Implementation Roadmap
Week 1–2: Centralize Observability
Deploy Prometheus + Grafana + Loki. Migrate all teams to the same stack. Create standard dashboard templates for services.
Week 3–4: Build the Service Registry
Create an infrastructure YAML file (or use Backstage if you have more resources). Map every service and its dependencies.
Week 5–6: Standardize CI/CD
Extract your most common pipeline into a reusable template. Migrate teams one at a time — don’t try to do all at once.
Week 7–8: Implement Cost Allocation
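Apply the tagging standard to existing resources as well as new ones. Then build AWS Cost Explorer or Google Cloud cost management dashboards grouped by team and service. On AWS, tags must first be activated as cost allocation tags in the Billing console before they appear in Cost Explorer.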
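Once resources are tagged, per-team and per-service cost is a simple group-by over billing line items. The sketch below illustrates the idea on made-up data: the `cost_by_tag` function, the line-item shape, and the dollar amounts are all hypothetical, and a real setup would read from a billing export rather than a hard-coded list.

```python
"""Minimal sketch: roll up billing line items by a tag such as team or service."""
from collections import defaultdict


def cost_by_tag(line_items: list[dict], tag: str) -> dict[str, float]:
    """Sum cost per value of `tag`; items without the tag land in 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        key = item["tags"].get(tag, "untagged")
        totals[key] += item["cost_usd"]
    return dict(totals)


# Hypothetical monthly line items; the amounts are invented for illustration.
line_items = [
    {"resource": "ecs:user-service-prod", "cost_usd": 120.0,
     "tags": {"service": "user-service", "team": "backend-team"}},
    {"resource": "rds:user-service-db", "cost_usd": 45.5,
     "tags": {"service": "user-service", "team": "backend-team"}},
    {"resource": "ecs:api-gateway-prod", "cost_usd": 80.0,
     "tags": {"service": "api-gateway", "team": "platform-team"}},
    {"resource": "ec2:unknown-instance", "cost_usd": 30.0, "tags": {}},
]

by_team = cost_by_tag(line_items, "team")
by_service = cost_by_tag(line_items, "service")
print(by_team)
print(by_service)
```

The size of the "untagged" bucket is a useful progress metric in its own right: it shows how much spend still cannot be attributed to any team.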
Measuring Level 2 Success
You’ve graduated from Level 2 when:
- Any engineer can look at one dashboard to understand the health of all services
- Onboarding a new service takes less than a day (not weeks)
- You can tell exactly how much each service costs per month
- Teams deploy independently using shared, battle-tested pipelines
- The CEO can ask “how’s production?” and get a one-click answer
What’s Next: Level 3 — Measured
Once your infrastructure is centralized and standardized, you can start measuring everything that matters: SLIs, SLOs, error budgets, and business impact metrics. That’s what Level 3 covers — and it’s where you transform from “keeping the lights on” to proactive reliability engineering.
Stay tuned for the next installment, or get a head start with our infrastructure assessment — we’ll tell you exactly which level you’re at and what to prioritize next.