SRE Practices Review

Free SRE Practices Review (SLOs, Error Budgets, On-Call, Toil, Incident Command)

A senior-SRE-verified review of your SLO and SLI quality, error-budget policy, multi-window multi-burn-rate alerting, on-call health, toil, incident command, post-mortem culture, and DORA delivery metrics. The review is benchmarked against the Google SRE Book and Workbook and the OpenSLO specification, and combines 90 days of anonymised signal from your on-call, incident, and SLO tools with structured interviews of engineers and incident commanders.

  • Covers SLOs and SLIs, error-budget policy, multi-window multi-burn-rate alerting, on-call health, toil measurement, incident command, post-mortem culture, and DORA delivery metrics
  • Pulls 90 days of anonymised signal from PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, Datadog On-Call, Sloth, Pyrra, and Nobl9 - benchmarked against the Google SRE Workbook, OpenSLO, and DORA
  • A senior SRE verifies every finding - a typical first review yields a cleanup backlog that cuts pager volume 30-50%, an SLO programme designed to stay sustainable past the first quarter, and 8-15 top automation opportunities
  • Read-only access only
  • No production telemetry exfiltrated
  • Senior SRE-verified
  • Live findings walkthrough included

Supported Platforms

PagerDuty
Opsgenie
Datadog
Grafana
Any stack

What We Review Across Your SRE Programme

Six areas - SLOs, error budgets, on-call, toil, incident command, and post-mortems - benchmarked against the Google SRE Book and Workbook, the OpenSLO specification, and DORA delivery metrics.

SLO & SLI Quality (Google SRE Workbook, OpenSLO)

Reviews whether SLIs are meaningful (latency, availability, quality, freshness, throughput), SLOs are customer-aligned and achievable, and the programme avoids vanity SLOs. Audits multi-window multi-burn-rate alerts per the SRE Workbook, error-budget calculation accuracy, and tooling fit (Sloth, Pyrra, Nobl9, OpenSLO, Datadog SLOs, Grafana SLO, Honeycomb SLOs).
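As a rough illustration of what a multi-window multi-burn-rate check looks like, here is a minimal Python sketch. The thresholds (14.4 over 1h/5m, 6 over 6h/30m, 1 over 3d/6h) are the ones commonly quoted from the Google SRE Workbook for a 30-day SLO window; the class and function names and the sample error ratios are hypothetical, and your own tooling (Sloth, Pyrra, Nobl9, or native SLO features) would express the same idea in its own rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str    # window that proves the burn is significant
    short_window: str   # window that proves the burn is still happening
    threshold: float    # burn-rate multiple that triggers the rule
    severity: str       # "page" or "ticket"


# Thresholds as commonly quoted from the SRE Workbook for a 30-day SLO window.
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo_target)


def firing_rules(error_ratios: dict, slo_target: float) -> list:
    """A rule fires only when BOTH its long and short windows exceed the threshold."""
    return [
        rule
        for rule in RULES
        if burn_rate(error_ratios[rule.long_window], slo_target) > rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo_target) > rule.threshold
    ]


# Hypothetical observed error ratios per window for a 99.9% SLO.
observed = {"5m": 0.02, "30m": 0.01, "1h": 0.016, "6h": 0.004, "3d": 0.0005}
for rule in firing_rules(observed, slo_target=0.999):
    print(f"{rule.severity}: burn rate above {rule.threshold} over {rule.long_window}")
```

Requiring both windows is what keeps the alert from paging on a brief spike that has already recovered, while still catching a fast burn early.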

Error-Budget Policy & Reliability Investment

Evaluates whether error budgets are published, agreed with product, used to gate feature vs reliability work, and trigger pre-defined consequences (release freeze, scope cut, paging change). Reviews engineering investment split across features, reliability, toil, and tech debt - benchmarked against DORA (deployment frequency, lead time, change failure rate, MTTR).
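To make the error-budget arithmetic concrete, the following Python sketch shows how a budget and its remaining fraction can be computed, and how a policy might map that to a pre-defined consequence. The function names and the 25% threshold are hypothetical placeholders; the actual thresholds and consequences are whatever you agree with product.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Hypothetical thresholds; a real policy is agreed with product in advance."""
    if remaining <= 0.0:
        return "release freeze"
    if remaining < 0.25:
        return "prioritise reliability work"
    return "ship features as normal"


# A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# Hypothetical month: 1,000,000 requests, 600 of them failed.
remaining = budget_remaining(0.999, good_events=999_400, total_events=1_000_000)
print(policy_action(remaining))
```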

On-Call Health, Alert Quality & Pager Load

Analyses 90 days of pager data from PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, or Datadog On-Call - signal-to-noise per alert, MTTA / MTTR distributions, alert ownership, runbook coverage, off-hours pages, and burnout indicators. The result is a cleanup backlog that typically reduces page volume by 30-50%.
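The kind of summary this analysis produces can be sketched in a few lines of Python. The row layout, the sample pages, and the 09:00-18:00 weekday definition of working hours are all assumptions made for illustration; a real export from your paging tool has its own schema and your own on-call hours.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical export rows: (alert_name, triggered_at, acknowledged_at, was_actionable)
pages = [
    ("disk-usage-high", datetime(2024, 3, 4, 2, 15), datetime(2024, 3, 4, 2, 27), False),
    ("disk-usage-high", datetime(2024, 3, 5, 14, 0), datetime(2024, 3, 5, 14, 3), False),
    ("checkout-5xx", datetime(2024, 3, 6, 10, 30), datetime(2024, 3, 6, 10, 34), True),
    ("checkout-5xx", datetime(2024, 3, 9, 23, 10), datetime(2024, 3, 9, 23, 18), True),
]


def is_off_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Weekends, or weekdays outside the assumed 09:00-18:00 working day."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def summarise(rows: list) -> dict:
    """Per-alert actionable ratio, median time-to-acknowledge, off-hours share.

    Assumes at least one row; an empty export would need separate handling.
    """
    per_alert = defaultdict(lambda: {"pages": 0, "actionable": 0})
    ack_minutes = []
    off_hours = 0
    for name, triggered, acked, actionable in rows:
        per_alert[name]["pages"] += 1
        per_alert[name]["actionable"] += int(actionable)
        ack_minutes.append((acked - triggered).total_seconds() / 60)
        off_hours += int(is_off_hours(triggered))
    return {
        "actionable_ratio": {
            name: stats["actionable"] / stats["pages"] for name, stats in per_alert.items()
        },
        "median_mtta_minutes": median(ack_minutes),
        "off_hours_share": off_hours / len(rows),
    }


print(summarise(pages))
```

Alerts with a low actionable ratio are the natural first candidates for the cleanup backlog.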

Toil Measurement & Automation Backlog

Identifies and quantifies toil - manual, repetitive, automatable operational work - using engineer time-tracking, interview signal, and ticket analysis. Produces a prioritised automation backlog (GitOps, ChatOps, runbook automation with Rundeck / Ansible / Systems Manager, self-service via Backstage, Port, Cortex) with quantified hours-saved per change.
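One simple way to rank an automation backlog by hours saved is to order items by how quickly the automation effort pays for itself. The Python sketch below assumes that ranking rule; the task names, durations, and effort estimates are invented for illustration and are not drawn from any real review.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    minutes_per_run: float          # manual time spent each time the task is done
    runs_per_month: float           # how often the task recurs (assumed > 0)
    automation_effort_hours: float  # one-off estimate to automate it

    @property
    def hours_saved_per_month(self) -> float:
        return self.minutes_per_run * self.runs_per_month / 60

    @property
    def payback_months(self) -> float:
        """Months until the automation effort is repaid by time saved."""
        return self.automation_effort_hours / self.hours_saved_per_month


def prioritise(items: list) -> list:
    """Shortest payback first."""
    return sorted(items, key=lambda item: item.payback_months)


# Hypothetical toil inventory.
inventory = [
    ToilItem("manual certificate rotation", 45, 8, 12),
    ToilItem("user access requests", 10, 60, 40),
    ToilItem("log disk cleanup", 15, 20, 5),
]

for item in prioritise(inventory):
    print(f"{item.name}: {item.hours_saved_per_month:.1f} h/month saved, "
          f"payback {item.payback_months:.1f} months")
```

Payback time is only one possible ordering; a team might equally weight off-hours interruptions or incident risk more heavily than raw hours.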

Incident Command, Severity & Communication

Reviews incident severity classification (SEV-1 → SEV-5), incident commander role and rotation, status-page and customer communication (Statuspage, Better Stack, Incident.io, FireHydrant), executive escalation, and time-to-acknowledge / time-to-mitigate distributions - mapped to the Google IMAG and ICS-derived incident command standards.

Post-Mortem Culture, Learning & Action Tracking

Evaluates blameless post-mortem facilitation, root-cause analysis depth (Five Whys, Causal Analysis based on Systems Theory), action-item ownership, learning dissemination, and incident review cadence. Audits whether high-quality reviews actually produce reliability improvements - or end up as documents nobody reads.

How It Works

1. Register & Scoping Call

Join a 30-minute scoping call where senior SREs learn your services, team structure, on-call model, current SLO programme, and primary pain points. We agree which platforms to pull anonymised signal from and which engineers, EMs, and incident commanders to interview.

2. Read-Only Signal Collection & Interviews

We pull 90 days of anonymised pager / incident / SLO data via read-only access to PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, Datadog On-Call, Sloth, Pyrra, Nobl9, and your observability platform - plus 30-minute interviews with on-call engineers, EMs, incident commanders, and product partners.

3. Senior SRE Verification & Maturity Scoring

A senior SRE with experience running on-call for high-traffic SaaS scores each discipline on a four-level scale (Reactive → Managed → Proactive → Optimising), validates findings against your architecture and team capacity, removes false positives, and rewrites recommendations into a prioritised ticket-ready backlog with quantified MTTA / MTTR / page-volume / hours-saved impact.

4. Receive Report & Live Debrief

Get your SRE Maturity Score per discipline, alert-cleanup backlog, SLO rollout plan with multi-burn-rate alerts, error-budget policy template, toil-reduction automation backlog, on-call health report, incident command playbook, and post-mortem template - within 3-5 business days, plus a 45-minute live walkthrough.

What You Get

Your report will include the following deliverables.

SRE Maturity Score (Reactive → Managed → Proactive → Optimising) per discipline
SLO and SLI quality assessment with per-service rollout plan and tool-fit recommendation (Sloth, Pyrra, Nobl9, OpenSLO, native)
Multi-window multi-burn-rate alert template aligned to the Google SRE Workbook
Error-budget policy template with pre-defined consequences and product-agreement language
On-call health report - MTTA, MTTR, page-volume distribution, off-hours load, burnout indicators
Alert cleanup backlog with quantified page-volume reduction (typical 30-50%)
Toil inventory and prioritised automation backlog with quantified hours-saved per change
Incident command playbook - severity classification, IC role, status-page communication, executive escalation
Post-mortem template, facilitation guide, and action-item tracking process
DORA benchmarking - deployment frequency, lead time for changes, change failure rate, MTTR
Prioritised SRE improvement roadmap and 45-minute live findings walkthrough

Stop fighting fires. Build the SRE programme that prevents them.

Get a senior-SRE-verified maturity report with alert-cleanup backlog, SLO rollout plan, error-budget policy template, toil-reduction backlog, and incident command playbook - read-only access only, no production telemetry exfiltrated, completely free.

Get My SRE Maturity Report

How We Handle Your Pager Data & Interviews

An SRE review must protect both your operational data and the engineers we interview. Here is exactly what we read - and what never leaves your environment.

Read-Only Viewer Access, Time-Limited

We use a read-only API key (PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Datadog, Grafana Cloud, Honeycomb, Sloth, Pyrra, Nobl9) or viewer role scoped strictly to incident, alert, on-call schedule, SLO, and runbook APIs - time-limited to the review window. We can never silence alerts, modify on-call schedules, change SLO definitions, or close incidents.

Anonymised Pager Data & Confidential Interviews

Pager and incident data is anonymised at ingestion - engineer names, customer identifiers, and free-text incident details are stripped or hashed before analysis. Interviews with on-call engineers, EMs, and incident commanders are confidential by default - quotes are anonymised in the final report and never attributed without explicit consent.
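For readers who want a picture of what "stripped or hashed at ingestion" can mean in practice, here is a minimal Python sketch using a keyed HMAC-SHA256 from the standard library. It is an illustration under stated assumptions, not a description of the exact pipeline used in the review: the field names, the record layout, and the choice to truncate the digest to 16 characters are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical field names; a real export schema will differ per tool.
HASHED_FIELDS = {"engineer", "customer_id"}   # replaced by a stable pseudonym
DROPPED_FIELDS = {"description", "notes"}     # free text, removed entirely


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Keyed hash: the same input maps to the same token, but only under this key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymise(record: dict, secret_key: bytes) -> dict:
    """Return a copy with free text dropped and identifying fields hashed."""
    clean = {}
    for field, value in record.items():
        if field in DROPPED_FIELDS:
            continue
        if field in HASHED_FIELDS:
            clean[field] = pseudonymise(str(value), secret_key)
        else:
            clean[field] = value
    return clean


# Hypothetical incident record.
raw = {
    "alert": "checkout-5xx",
    "engineer": "jane.doe",
    "customer_id": "acme-4411",
    "description": "Jane restarted the payments pod for Acme",
    "ack_seconds": 240,
}
print(anonymise(raw, secret_key=b"per-review-secret"))
```

Using a keyed hash rather than a plain one means per-engineer page counts can still be computed, while the tokens cannot be reversed by simply hashing a list of known names without the key.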

Auto-Revoked & Destroyed After Review

As soon as your SRE Maturity Report is delivered, every API key is revoked, the analysis sandbox is destroyed, and your pager / incident / SLO export is deleted. Only aggregate, anonymised findings are retained for QA - never engineer names, customer identifiers, or incident specifics.

Frequently Asked Questions

The most common questions we hear from teams running this assessment.

What access and data do you actually need? Will any of it leave our environment?

Read-only viewer access to your incident-management platform (PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, Datadog On-Call), SLO tooling (Sloth, Pyrra, Nobl9, native), and observability platform - scoped to alert, incident, on-call, SLO, and runbook APIs only. Pager and incident data is anonymised at ingestion; engineer names, customer IDs, and free-text incident details are stripped or hashed before analysis. We never copy production telemetry off your environment, and we provide the exact scopes in advance for your security team to review.

How is this different from running our own SRE retro?

Internal retros are excellent for team-level reflection but rarely produce hard maturity scoring against external benchmarks, quantified MTTA / MTTR / page-volume baselines, or comparison to dozens of similar teams. A senior SRE who has run on-call for high-traffic SaaS combines anonymised pager-data analysis, structured engineer interviews, and benchmarking against the Google SRE Workbook, OpenSLO, and DORA, then turns the results into a prioritised backlog with quantified impact per change rather than a list of opinions.

Can you help us actually build an SLO programme - not just audit one?

Yes. The deliverables include a per-service SLO rollout plan mapping every customer-facing user journey to the SLI it should have (availability, latency, quality, freshness), the SLO target and rationale, multi-window multi-burn-rate alert templates per the Google SRE Workbook, an error-budget policy template with pre-defined consequences, and a tool-fit recommendation across Sloth, Pyrra, Nobl9, OpenSLO, Datadog SLOs, Grafana SLO, or Honeycomb SLOs based on your stack.

Do you measure on-call burnout objectively or just ask engineers?

Both. The objective signal comes from 90 days of pager data - page volume per engineer, off-hours and weekend page distribution, time-to-acknowledge spread, chronically-acked alerts, and shift-handoff quality. The subjective signal comes from confidential 30-minute interviews with on-call engineers, asking the questions that pager data cannot answer (psychological safety during incidents, quality of runbook coverage, IC support, sleep impact). The two perspectives almost always reveal different gaps.

Do you cover incident command and post-mortem culture?

Yes. Incident command coverage includes severity classification (SEV-1 → SEV-5), incident commander role and rotation, status-page communication (Statuspage, Better Stack), war-room tooling, executive escalation paths, and ICS / Google IMAG alignment. Post-mortem coverage includes blameless facilitation, root-cause analysis depth (Five Whys, CAST), action-item ownership and follow-through, and learning dissemination across teams - with templates and facilitator guides included in the deliverables.

How do you measure toil and produce an automation backlog?

We define toil per the Google SRE Book - manual, repetitive, automatable, tactical, no-enduring-value, scaling with service growth - then quantify it through ticket analysis, time-tracking signal, and engineer interviews. The automation backlog prioritises top-toil categories and recommends concrete tooling (GitOps, ChatOps, Rundeck, StackStorm, Ansible, AWS Systems Manager, Azure Automation, internal developer platforms with Backstage, Port, or Cortex) with quantified hours-saved per change.

Will you tell us where we should NOT have SLOs?

Yes - and this is one of the most useful parts of the review. Vanity SLOs and SLOs on internal-only services that nobody acts on are a major source of programme decay. The rollout plan explicitly identifies which services should not have SLOs, which should have only error-rate or freshness SLIs, and which should have full multi-burn-rate alerting - so the programme stays sustainable past the first quarter.

How long until we receive the report?

Typical turnaround is 3-5 business days from the moment read-only access is granted and engineer interviews are complete, plus a 45-minute live findings walkthrough at a time that suits your SRE, platform, and engineering leads. Larger teams with multiple on-call rotations can take a little longer; we confirm the timeline as soon as we see the scope.

Register for Your Free SRE Practices Review

Fill out the form below and our team will get back to you within 2 business days.

Your SRE Programme Today

These five answers help us scope the review, choose which signals to pull, and tailor the maturity report and SRE roadmap to your stack, team size, and primary driver.

Your data is protected under our Non-Disclosure Agreement. By registering, you and OpsHero are bound by our NDA - guaranteeing your data is used solely to generate this report, runs in an isolated sandbox, and is permanently deleted once complete. We retain none of your raw data.

By clicking "Register for Free Review" you agree to our Non-Disclosure Agreement and confirm your data may be processed solely for report generation.