SRE Practices Review
Free SRE Practices Review (SLOs, Error Budgets, On-Call, Toil, Incident Command)
A senior-SRE-verified review of your SLO and SLI quality, error-budget policy, multi-window multi-burn-rate alerting, on-call health, toil, incident command, post-mortem culture, and DORA delivery metrics. The review is benchmarked against the Google SRE Book and Workbook and the OpenSLO specification, and combines 90 days of anonymised signal from your on-call, incident, and SLO tools with structured interviews of engineers and incident commanders.
- Covers SLOs and SLIs, error-budget policy, multi-window multi-burn-rate alerting, on-call health, toil measurement, incident command, post-mortem culture, and DORA delivery metrics
- Pulls 90 days of anonymised signal from PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, Datadog On-Call, Sloth, Pyrra, and Nobl9 - benchmarked against the Google SRE Workbook, OpenSLO, and DORA
- A senior SRE verifies every finding - a typical first review cuts pager volume by 30-50%, lays out an SLO programme that survives past the first quarter, and identifies the top 8-15 automation opportunities
- Read-only access only
- No production telemetry exfiltrated
- Senior SRE-verified
- Live findings walkthrough included
What We Review Across Your SRE Programme
Six areas - SLOs, error budgets, on-call, toil, incident command, and post-mortems - benchmarked against the Google SRE Book and Workbook, the OpenSLO specification, and DORA delivery metrics.
SLO & SLI Quality (Google SRE Workbook, OpenSLO)
Reviews whether SLIs are meaningful (latency, availability, quality, freshness, throughput), SLOs are customer-aligned and achievable, and the programme avoids vanity SLOs. Audits multi-window multi-burn-rate alerts per the SRE Workbook, error-budget calculation accuracy, and tooling fit (Sloth, Pyrra, Nobl9, OpenSLO, Datadog SLOs, Grafana SLO, Honeycomb SLOs).
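To make the multi-window multi-burn-rate idea concrete, here is a minimal Python sketch. It assumes a 30-day SLO window and uses the burn-rate thresholds recommended in the Google SRE Workbook (14.4 over 1h/5m and 6 over 6h/30m for pages, 1 over 3d/6h for tickets). The names `BurnRateRule`, `burn_rate`, and `firing_rules` are illustrative only and do not come from any of the tools listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float  # burn-rate multiple that must be exceeded
    severity: str


# Thresholds as recommended in the Google SRE Workbook for a 30-day SLO.
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_rules(error_ratios: dict[str, float], slo_target: float) -> list[BurnRateRule]:
    """A rule fires only when BOTH its long and short window exceed the threshold."""
    return [
        rule
        for rule in RULES
        if burn_rate(error_ratios[rule.long_window], slo_target) > rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo_target) > rule.threshold
    ]


# Hypothetical error ratios per window for a 99.9% availability SLO.
observed = {"5m": 0.03, "30m": 0.001, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
for rule in firing_rules(observed, slo_target=0.999):
    print(rule.severity, rule.long_window, rule.short_window)
```

Requiring the short window to agree with the long window is what lets the alert reset quickly once the problem stops, instead of staying red until the long window drains.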
Error-Budget Policy & Reliability Investment
Evaluates whether error budgets are published, agreed with product, used to gate feature vs reliability work, and tied to pre-defined consequences (release freeze, scope cut, paging change). Reviews engineering investment split across features, reliability, toil, and tech debt - benchmarked against DORA (deployment frequency, lead time for changes, change failure rate, MTTR).
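As a rough illustration of how an error-budget policy can map budget consumption to pre-defined consequences, consider the Python sketch below. The 25% threshold and the function names are hypothetical; a real policy uses whatever thresholds and consequences engineering and product have agreed on.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("SLO target and traffic leave no error budget to measure")
    return 1.0 - bad_events / allowed_bad


def policy_action(remaining: float) -> str:
    """Map remaining budget to a consequence. Thresholds are illustrative assumptions."""
    if remaining <= 0.0:
        return "release freeze"
    if remaining < 0.25:
        return "scope cut"
    return "no action"


# Hypothetical month: 1,000,000 requests against a 99.9% SLO allows ~1,000 failures.
remaining = error_budget_remaining(slo_target=0.999, total_events=1_000_000, bad_events=800)
print(policy_action(remaining))
```

The point of writing the policy down this mechanically is that the consequence is decided before the budget is spent, not negotiated during an incident.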
On-Call Health, Alert Quality & Pager Load
Analyses 90 days of pager data from PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, or Datadog On-Call - signal-to-noise per alert, MTTA / MTTR distributions, alert ownership, runbook coverage, off-hours pages, and burnout indicators. The output is a cleanup backlog that typically reduces page volume by 30-50%.
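The per-alert signal-to-noise analysis can be sketched in a few lines of Python. This is a simplified illustration, not the actual analysis pipeline: the `Page` record, its `actionable` flag (whether the responder had to do anything), and the 50% cut-off are all assumptions, and real pager exports need mapping into this shape first.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Page:
    alert_name: str
    seconds_to_ack: float
    actionable: bool  # assumed label: did the responder need to take action?


def actionable_ratio(pages: list[Page]) -> dict[str, float]:
    """Share of pages per alert that required human action."""
    totals: dict[str, int] = defaultdict(int)
    acted: dict[str, int] = defaultdict(int)
    for page in pages:
        totals[page.alert_name] += 1
        acted[page.alert_name] += int(page.actionable)
    return {name: acted[name] / totals[name] for name in totals}


def noisy_alerts(pages: list[Page], min_ratio: float = 0.5) -> list[str]:
    """Alerts whose actionable share falls below the (assumed) cut-off."""
    return sorted(name for name, ratio in actionable_ratio(pages).items() if ratio < min_ratio)


def median_time_to_ack(pages: list[Page]) -> float:
    """Median is less distorted by a few very slow acknowledgements than the mean."""
    return median(page.seconds_to_ack for page in pages)


pages = [
    Page("disk_full", 60, True),
    Page("cpu_high", 120, False),
    Page("cpu_high", 300, False),
    Page("cpu_high", 30, True),
]
print(noisy_alerts(pages), median_time_to_ack(pages))
```

Alerts that land in the noisy list are candidates for tuning, downgrading to a ticket, or deletion.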
Toil Measurement & Automation Backlog
Identifies and quantifies toil - manual, repetitive, automatable operational work - using engineer time-tracking, interview signal, and ticket analysis. Produces a prioritised automation backlog (GitOps, ChatOps, runbook automation with Rundeck / Ansible / Systems Manager, self-service via Backstage, Port, Cortex) with quantified hours-saved per change.
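One simple way to turn quantified toil into a prioritised backlog is to rank items by how quickly the automation pays for itself. The Python sketch below assumes each toil item has an estimated frequency, duration, and automation effort; the `ToilItem` fields and the payback-period ranking are illustrative choices, and the example figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilItem:
    name: str
    occurrences_per_month: int
    minutes_per_occurrence: float
    automation_effort_hours: float  # estimated one-off cost to automate

    @property
    def hours_saved_per_month(self) -> float:
        return self.occurrences_per_month * self.minutes_per_occurrence / 60

    @property
    def payback_months(self) -> float:
        """Months until the automation effort is recovered; infinite if nothing is saved."""
        saved = self.hours_saved_per_month
        return self.automation_effort_hours / saved if saved > 0 else float("inf")


def prioritise(items: list[ToilItem]) -> list[ToilItem]:
    """Shortest payback first."""
    return sorted(items, key=lambda item: item.payback_months)


backlog = prioritise([
    ToilItem("certificate rotation", 4, 90, 12),
    ToilItem("manual rollback", 20, 30, 40),
    ToilItem("access requests", 60, 10, 5),
])
for item in backlog:
    print(item.name, item.hours_saved_per_month, item.payback_months)
```

Ranking by payback rather than raw hours saved keeps cheap, high-frequency fixes at the top of the list.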
Incident Command, Severity & Communication
Reviews incident severity classification (SEV-1 → SEV-5), incident commander role and rotation, status-page and customer communication (Statuspage, Better Stack, Incident.io, FireHydrant), executive escalation, and time-to-acknowledge / time-to-mitigate distributions - mapped to the Google IMAG and ICS-derived incident command standards.
Post-Mortem Culture, Learning & Action Tracking
Evaluates blameless post-mortem facilitation, root-cause analysis depth (Five Whys, Causal Analysis based on Systems Theory), action-item ownership, learning dissemination, and incident review cadence. Audits whether high-quality reviews actually produce reliability improvements - or end up as documents nobody reads.
How It Works
Register & Scoping Call
Join a 30-minute scoping call where senior SREs learn your services, team structure, on-call model, current SLO programme, and primary pain points. We agree which platforms to pull anonymised signal from and which engineers, EMs, and incident commanders to interview.
Read-Only Signal Collection & Interviews
We pull 90 days of anonymised pager / incident / SLO data via read-only access to PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, Datadog On-Call, Sloth, Pyrra, Nobl9, and your observability platform - plus 30-minute interviews with on-call engineers, EMs, incident commanders, and product partners.
Senior SRE Verification & Maturity Scoring
A senior SRE who has run on-call for high-traffic SaaS scores each discipline Reactive → Managed → Proactive → Optimising, validates findings against your architecture and team capacity, removes false positives, and rewrites recommendations into a prioritised ticket-ready backlog with quantified MTTA / MTTR / page-volume / hours-saved impact.
Receive Report & Live Debrief
Get your SRE Maturity Score per discipline, alert-cleanup backlog, SLO rollout plan with multi-burn-rate alerts, error-budget policy template, toil-reduction automation backlog, on-call health report, incident command playbook, and post-mortem template - within 3-5 business days, plus a 45-minute live walkthrough.
What You Get
Your report will include the following deliverables.
Stop fighting fires. Build the SRE programme that prevents them.
Get a senior-SRE-verified maturity report with alert-cleanup backlog, SLO rollout plan, error-budget policy template, toil-reduction backlog, and incident command playbook - read-only access only, no production telemetry exfiltrated, completely free.
Get My SRE Maturity Report
How We Handle Your Pager Data & Interviews
An SRE review must protect both your operational data and the engineers we interview. Here is exactly what we read - and what never leaves your environment.
Read-Only Viewer Access, Time-Limited
We use a read-only API key (PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Datadog, Grafana Cloud, Honeycomb, Sloth, Pyrra, Nobl9) or viewer role scoped strictly to incident, alert, on-call schedule, SLO, and runbook APIs - time-limited to the review window. With this access we cannot silence alerts, modify on-call schedules, change SLO definitions, or close incidents.
Anonymised Pager Data & Confidential Interviews
Pager and incident data is anonymised at ingestion - engineer names, customer identifiers, and free-text incident details are stripped or hashed before analysis. Interviews with on-call engineers, EMs, and incident commanders are confidential by default - quotes are anonymised in the final report and never attributed without explicit consent.
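For readers who want to picture what "stripped or hashed at ingestion" can look like, here is a minimal Python sketch using a keyed hash (HMAC-SHA256) from the standard library. The field names, the 16-character truncation, and the choice of HMAC are assumptions made for illustration and are not a description of the actual ingestion pipeline.

```python
import hashlib
import hmac

# Hypothetical field names; real exports differ per platform.
HASH_FIELDS = {"responder", "customer_id"}
STRIP_FIELDS = {"description", "notes"}


def pseudonymise(value: str, key: bytes) -> str:
    """Keyed hash: the same input maps to the same token, but only under the same key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymise_incident(record: dict, key: bytes) -> dict:
    """Drop free-text fields, hash identifying fields, pass everything else through."""
    cleaned = {}
    for field, value in record.items():
        if field in STRIP_FIELDS:
            continue
        if field in HASH_FIELDS:
            cleaned[field] = pseudonymise(str(value), key)
        else:
            cleaned[field] = value
    return cleaned


record = {
    "responder": "Alice Example",
    "customer_id": "cust-42",
    "description": "Database down for a named customer",
    "severity": "SEV-2",
}
print(anonymise_incident(record, key=b"per-review-secret"))
```

A keyed hash keeps per-engineer statistics possible (the same person always maps to the same token) while preventing anyone without the key from confirming a guessed name by hashing it.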
Auto-Revoked & Destroyed After Review
As soon as your SRE Maturity Report is delivered, every API key is revoked, the analysis sandbox is destroyed, and your pager / incident / SLO export is deleted. Only aggregate, anonymised findings are retained for QA - never engineer names, customer identifiers, or incident specifics.
Frequently Asked Questions
The most common questions we hear from teams running this assessment.
What access and data do you actually need? Will any of it leave our environment?
Read-only viewer access to your incident-management platform (PagerDuty, Opsgenie, Incident.io, FireHydrant, Rootly, Grafana OnCall, Datadog On-Call), SLO tooling (Sloth, Pyrra, Nobl9, native), and observability platform - scoped to alert, incident, on-call, SLO, and runbook APIs only. Pager and incident data is anonymised at ingestion; engineer names, customer IDs, and free-text incident details are stripped or hashed before analysis. We never copy production telemetry off your environment, and we provide the exact scopes in advance for your security team to review.
How is this different from running our own SRE retro?
Internal retros are excellent for team-level reflection but rarely produce hard maturity scoring against external benchmarks, quantified MTTA / MTTR / page-volume baselines, or comparison to dozens of similar teams. A senior SRE who has run on-call for high-traffic SaaS combines anonymised pager-data analysis, structured engineer interviews, and benchmarking against the Google SRE Workbook, OpenSLO, and DORA, and turns the result into a prioritised backlog with quantified impact per change rather than a list of opinions.
Can you help us actually build an SLO programme - not just audit one?
Yes. The deliverables include a per-service SLO rollout plan mapping every customer-facing user journey to the SLI it should have (availability, latency, quality, freshness), the SLO target and rationale, multi-window multi-burn-rate alert templates per the Google SRE Workbook, an error-budget policy template with pre-defined consequences, and a tool-fit recommendation across Sloth, Pyrra, Nobl9, OpenSLO, Datadog SLOs, Grafana SLO, or Honeycomb SLOs based on your stack.
Do you measure on-call burnout objectively or just ask engineers?
Both. The objective signal comes from 90 days of pager data - page volume per engineer, off-hours and weekend page distribution, time-to-acknowledge spread, chronically-acked alerts, and shift-handoff quality. The subjective signal comes from confidential 30-minute interviews with on-call engineers, asking the questions that pager data cannot answer (psychological safety during incidents, quality of runbook coverage, IC support, sleep impact). The two perspectives almost always reveal different gaps.
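As an example of the objective side, the off-hours page share per engineer can be computed as in the Python sketch below. The definition of off-hours used here (weekends, or outside 09:00-18:00 in the engineer's local time) is an assumption, as are the function names; timestamps are taken to be already converted to local time and engineer identifiers to be pseudonymised.

```python
from collections import Counter
from datetime import datetime


def is_off_hours(ts: datetime) -> bool:
    """Assumed definition: Saturday/Sunday, or before 09:00 / from 18:00 local time."""
    return ts.weekday() >= 5 or ts.hour < 9 or ts.hour >= 18


def off_hours_share(pages: list[tuple[str, datetime]]) -> dict[str, float]:
    """Fraction of each engineer's pages that arrived off-hours."""
    total: Counter = Counter()
    off: Counter = Counter()
    for engineer, ts in pages:
        total[engineer] += 1
        if is_off_hours(ts):
            off[engineer] += 1
    return {engineer: off[engineer] / total[engineer] for engineer in total}


pages = [
    ("eng-a", datetime(2024, 1, 8, 10, 0)),   # Monday, working hours
    ("eng-a", datetime(2024, 1, 8, 3, 0)),    # Monday, night
    ("eng-b", datetime(2024, 1, 13, 12, 0)),  # Saturday
]
print(off_hours_share(pages))
```

A high share for one engineer while others sit near zero usually points at rotation design rather than alert volume.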
Do you cover incident command and post-mortem culture?
Yes. Incident command coverage includes severity classification (SEV-1 → SEV-5), incident commander role and rotation, status-page communication (Statuspage, Better Stack), war-room tooling, executive escalation paths, and ICS / Google IMAG alignment. Post-mortem coverage includes blameless facilitation, root-cause analysis depth (Five Whys, CAST), action-item ownership and follow-through, and learning dissemination across teams - with templates and facilitator guides included in the deliverables.
How do you measure toil and produce an automation backlog?
We define toil per the Google SRE Book - manual, repetitive, automatable, tactical, devoid of enduring value, and scaling linearly with service growth - then quantify it through ticket analysis, time-tracking signal, and engineer interviews. The automation backlog prioritises top-toil categories and recommends concrete tooling (GitOps, ChatOps, Rundeck, StackStorm, Ansible, AWS Systems Manager, Azure Automation, internal developer platforms with Backstage, Port, or Cortex) with quantified hours-saved per change.
Will you tell us where we should NOT have SLOs?
Yes - and this is one of the most useful parts of the review. Vanity SLOs and SLOs on internal-only services that nobody acts on are a major source of programme decay. The rollout plan explicitly identifies which services should not have SLOs, which should have only error-rate or freshness SLIs, and which should have full multi-burn-rate alerting - so the programme stays sustainable past the first quarter.
How long until we receive the report?
Typical turnaround is 3-5 business days from the moment read-only access is granted and engineer interviews are complete, plus a 45-minute live findings walkthrough at a time that suits your SRE, platform, and engineering leads. Larger teams with multiple on-call rotations can take a little longer; we confirm the timeline as soon as we see the scope.
Register for Your Free SRE Practices Review
Fill out the form below and our team will get back to you within 2 business days.
You Might Also Be Interested In
Infrastructure as Code Review
Free Terraform, OpenTofu, Pulumi, and CloudFormation review - code quality, security misconfigurations, state hygiene, drift detection, and CI/CD pipeline gates - verified by a senior platform engineer and aligned with CIS Benchmarks, SOC 2, and ISO 27001.
DevOps DORA Checklist
See where your delivery performance stands against Elite, High, Medium, and Low performers - automatically scored, expert-verified.
Pipeline Inspector
Find every weak link in your CI/CD - automated scanning across GitHub Actions, GitLab, Jenkins, Bitbucket, and Azure DevOps, verified by a senior platform engineer.