Evidence-Driven Operations

Absence of signal is not evidence of health. What operational evidence actually looks like — structured run records, inline probes, and verification that outlasts the engineer who ran the job.

Ask an engineering team if their infrastructure is working correctly and most of them will say yes. Ask them how they know, and the answers are usually a mix of: nothing has broken recently, the monitoring dashboards look green, and the last deployment went fine.

None of that is evidence. It is absence of signal. Those are different things, and confusing them is one of the more common sources of operational surprise.


The assumption problem

Infrastructure operations run on assumptions more often than most teams realise. The assumption that the DR configuration is still current. The assumption that the backup job completed successfully last night. The assumption that the replica is actually replicating at an acceptable lag. The assumption that the firewall rules reflect what was deployed six weeks ago, before someone made a manual change that nobody documented.

Assumptions accumulate quietly. Each one is individually reasonable. Collectively, they produce environments that are technically running but whose actual state nobody can describe with confidence.

This is not a failure of diligence. It is a structural problem. When verification is manual, it gets skipped under pressure. When operational evidence is not captured automatically, it does not exist.


What evidence actually looks like

Operational evidence is a structured record of what ran, against what, with what result. It is produced automatically as part of execution, not written up afterwards by the operator who ran the job.

A useful run record contains the inputs the operation received, the state of the environment before the run, the tool output (appropriately redacted), the validation probe results, and the outcome measured against the expected state. That record exists whether the operation succeeded or failed, and it is legible to someone who was not present when the run happened.
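As a concrete sketch of those fields — the field names here are illustrative, not a real platform schema:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class RunRecord:
    """Structured record of one operation, produced during execution."""
    operation: str
    inputs: dict                 # parameters the operation received
    precondition_state: dict     # environment state captured before the run
    tool_output: str             # output from the underlying tool, redacted
    probe_results: list = field(default_factory=list)  # one entry per probe
    outcome: str = "unknown"     # pass / fail against the expected state
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = RunRecord(
    operation="db-promote",
    inputs={"cluster": "pg-main", "target_replica": "replica-2"},
    precondition_state={"replication_lag_s": 4},
    tool_output="[redacted] promotion completed",
)
record.probe_results.append({"name": "post-promote-write-check",
                             "result": "pass"})
record.outcome = "pass"

# The record serialises cleanly, so it can be stored and read weeks later
print(json.dumps(asdict(record), indent=2))
```

The point of the structure is that nothing depends on the operator writing anything up afterwards: every field is populated during execution.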

This is different from a log file. Log files capture what a tool emitted. A run record captures what an operator intended, what the platform did, and what the result was. The distinction matters when you are trying to understand a failure three weeks after the fact, or when an external reviewer needs to verify that a specific procedure was followed correctly.


Probes and validation steps

The most useful verification happens during execution, not after it. Probes run as part of an operation, check a specific condition, and write the result into the run record.

For a database promotion, a probe might check replication lag before the switchover, confirm that the replica accepted write traffic after promotion, and verify that the application reconnected successfully. Each of those checks produces a structured result — pass, fail, or a measured value — that is part of the operation’s record.

probes:
  - name: replication-lag-check
    type: psql
    query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int"
    expect: "< 30"
    on_fail: abort

  - name: post-promote-write-check
    type: psql
    query: "SELECT pg_is_in_recovery()"
    expect: "f"
    on_fail: alert

These are not tests in the software development sense. They are operational verification steps. The distinction is that they run against real infrastructure, they produce evidence that goes into a run record, and they can halt an operation if a critical check fails. They are part of the operational design, not a post-hoc quality layer.
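The halt-or-continue behaviour sketched above can be made concrete. This is a minimal illustration of the `on_fail: abort` versus `on_fail: alert` semantics, with stand-in callables in place of real psql queries — the names and structure are assumptions, not the platform's actual API:

```python
class ProbeFailure(Exception):
    """Raised when a critical probe fails, halting the operation."""

def run_probes(probes, run_record):
    """Run each probe, append its result to the record, honour on_fail."""
    for probe in probes:
        measured = probe["check"]()          # would run against real infrastructure
        passed = probe["expect"](measured)
        run_record.append({"name": probe["name"],
                           "measured": measured,
                           "result": "pass" if passed else "fail"})
        if not passed:
            if probe["on_fail"] == "abort":
                raise ProbeFailure(f"{probe['name']} failed: {measured!r}")
            # "alert" policy: the failure is recorded, the operation continues

record = []
probes = [
    {"name": "replication-lag-check",
     "check": lambda: 12,                    # stand-in for the lag query
     "expect": lambda lag: lag < 30,
     "on_fail": "abort"},
    {"name": "post-promote-write-check",
     "check": lambda: "f",                   # stand-in for pg_is_in_recovery()
     "expect": lambda value: value == "f",
     "on_fail": "alert"},
]
run_probes(probes, record)
```

Note that the result is written into the record before the abort decision is taken, so a halted operation still leaves evidence of exactly which check stopped it.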


Why this changes incident response

When evidence exists, incident response is a different kind of work. Instead of starting from “something is wrong, let’s figure out what state the environment is in,” the engineer starts from the run records: what ran recently, what the environment state was at each point, where the probes passed or failed.

That is a much smaller problem space. The investigation starts with evidence, not with assumptions about what might have happened.

I’ve seen incidents where the entire diagnostic effort was spent reconstructing what had run over the previous 48 hours, because the operations had produced no structured output. The fix, when found, was straightforward. The time to find it was not. That time comes directly from the absence of operational evidence.


The cultural shift

Evidence-driven operations requires a change in what counts as “done.” An operation is not done when the command returns zero. It is done when the run record exists, the probes have passed, and the outcome is documented in a form that someone else can verify.

That standard feels heavier at first. It takes longer to run an operation that captures evidence than one that does not. But the investment is not in the individual run — it is in the aggregate operational model. An environment where every significant operation produces a run record is an environment that can be audited, investigated, and understood.

HybridOps structures this as a first principle: every module execution produces a run record, probes run as part of the operation rather than separately, and the output is normalised so it can be compared across runs. The evidence is not bolted on — it is part of how the platform executes.
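Normalised records make cross-run comparison a simple structural diff. A hedged sketch of what that comparison buys you (the function and record shapes are illustrative, not HybridOps code):

```python
def diff_runs(previous, current):
    """Return the probe results that changed between two normalised run records."""
    prev = {p["name"]: p["result"] for p in previous["probe_results"]}
    curr = {p["name"]: p["result"] for p in current["probe_results"]}
    return {name: (prev.get(name), result)
            for name, result in curr.items()
            if prev.get(name) != result}

run_a = {"probe_results": [
    {"name": "replication-lag-check", "result": "pass"},
    {"name": "post-promote-write-check", "result": "pass"}]}
run_b = {"probe_results": [
    {"name": "replication-lag-check", "result": "fail"},
    {"name": "post-promote-write-check", "result": "pass"}]}

# Only the probe whose result changed shows up
print(diff_runs(run_a, run_b))  # → {'replication-lag-check': ('pass', 'fail')}
```

Because the records share one shape, "what changed since the last good run" is a mechanical question rather than an archaeological one.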


What this requires

Building an evidence-driven operational model requires three things.

First, a definition of what a run record contains. This needs to be agreed and consistent, not left to individual operator judgement.

Second, probes that are part of the operational design — not afterthoughts. The verification steps for an operation should be designed at the same time as the operation itself.

Third, a norm that the record is what defines completion. The dashboard being green is not the same as the operation having produced evidence of correctness.

Reliable infrastructure operations are not distinguished by the absence of incidents. They are distinguished by the quality of the evidence available when incidents happen.