Evidence-Driven Operations

Absence of signal is not evidence of health. What operational evidence actually looks like — structured run records, inline probes, and verification that outlasts the engineer who ran the job.

Ask an engineering team if their infrastructure is working correctly and most of them will say yes. Ask them how they know, and the answers are usually a mix of: nothing has broken recently, the monitoring dashboards look green, and the last deployment went fine.

None of that is evidence. It is absence of signal. Those are different things, and confusing them is one of the more common sources of operational surprise.


The assumption problem

Infrastructure operations run on assumptions more often than most teams realise. The assumption that the DR configuration is still current. The assumption that the backup job completed successfully last night. The assumption that the replica is actually replicating at an acceptable lag. The assumption that the firewall rules reflect what was deployed six weeks ago, before someone made a manual change that nobody documented.

Assumptions accumulate quietly. Each one is individually reasonable. Collectively, they produce environments that are technically running but whose actual state nobody can describe with confidence.

This is not a failure of diligence. It is a structural problem. When verification is manual, it gets skipped under pressure. When operational evidence is not captured automatically, it does not exist.


What evidence actually looks like

Operational evidence is a structured record of what ran, against what, with what result. It is produced automatically as part of execution, not written up afterwards by the operator who ran the job.

A useful run record contains the inputs the operation received, the state of the environment before the run, the tool output (appropriately redacted), the validation probe results, and the outcome measured against the expected state. That record exists whether the operation succeeded or failed, and it is legible to someone who was not present when the run happened.
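As a concrete sketch of those fields — the field names here are illustrative, not a real platform schema:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class RunRecord:
    """Structured record of one operation, produced during execution."""
    operation: str
    inputs: dict                 # parameters the operation received
    precondition_state: dict     # environment state captured before the run
    tool_output: str             # output from the underlying tool, redacted
    probe_results: list = field(default_factory=list)  # one entry per probe
    outcome: str = "unknown"     # pass / fail against the expected state
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = RunRecord(
    operation="db-promote",
    inputs={"cluster": "pg-main", "target_replica": "replica-2"},
    precondition_state={"replication_lag_s": 4},
    tool_output="[redacted] promotion completed",
)
record.probe_results.append({"name": "post-promote-write-check",
                             "result": "pass"})
record.outcome = "pass"

# The record serialises cleanly, so it can be stored and read weeks later
print(json.dumps(asdict(record), indent=2))
```

The point of the structure is that nothing depends on the operator writing anything up afterwards: every field is populated during execution.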

This is different from a log file. Log files capture what a tool emitted. A run record captures what an operator intended, what the platform did, and what the result was. The distinction matters when you are trying to understand a failure three weeks after the fact, or when an external reviewer needs to verify that a specific procedure was followed correctly.


Probes and validation steps

The most useful verification happens during execution, not after it. Probes run as part of an operation, check a specific condition, and write the result into the run record.

For a database promotion, a probe might check replication lag before the switchover, confirm that the replica accepted write traffic after promotion, and verify that the application reconnected successfully. Each of those checks produces a structured result — pass, fail, or a measured value — that is part of the operation’s record.

probes:
  - name: replication-lag-check
    type: psql
    query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int"
    expect: "< 30"
    on_fail: abort

  - name: post-promote-write-check
    type: psql
    query: "SELECT pg_is_in_recovery()"
    expect: "f"
    on_fail: alert

These are not tests in the software development sense. They are operational verification steps. The distinction is that they run against real infrastructure, they produce evidence that goes into a run record, and they can halt an operation if a critical check fails. They are part of the operational design, not a post-hoc quality layer.
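The halt-or-continue behaviour sketched above can be made concrete. This is a minimal illustration of the `on_fail: abort` versus `on_fail: alert` semantics, with stand-in callables in place of real psql queries — the names and structure are assumptions, not the platform's actual API:

```python
class ProbeFailure(Exception):
    """Raised when a critical probe fails, halting the operation."""

def run_probes(probes, run_record):
    """Run each probe, append its result to the record, honour on_fail."""
    for probe in probes:
        measured = probe["check"]()          # would run against real infrastructure
        passed = probe["expect"](measured)
        run_record.append({"name": probe["name"],
                           "measured": measured,
                           "result": "pass" if passed else "fail"})
        if not passed:
            if probe["on_fail"] == "abort":
                raise ProbeFailure(f"{probe['name']} failed: {measured!r}")
            # "alert" policy: the failure is recorded, the operation continues

record = []
probes = [
    {"name": "replication-lag-check",
     "check": lambda: 12,                    # stand-in for the lag query
     "expect": lambda lag: lag < 30,
     "on_fail": "abort"},
    {"name": "post-promote-write-check",
     "check": lambda: "f",                   # stand-in for pg_is_in_recovery()
     "expect": lambda value: value == "f",
     "on_fail": "alert"},
]
run_probes(probes, record)
```

Note that the result is written into the record before the abort decision is taken, so a halted operation still leaves evidence of exactly which check stopped it.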


Why this changes incident response

When evidence exists, incident response is a different kind of work. Instead of starting from “something is wrong, let’s figure out what state the environment is in,” the engineer starts from the run records: what ran recently, what the environment state was at each point, where the probes passed or failed.

That is a much smaller problem space. The investigation starts with evidence, not with assumptions about what might have happened.

I’ve seen incidents where the entire diagnostic effort was spent reconstructing what had run over the previous 48 hours, because the operations had produced no structured output. The fix, when found, was straightforward. The time to find it was not. That time comes directly from the absence of operational evidence.


The cultural shift

Evidence-driven operations requires a change in what counts as “done.” An operation is not done when the command returns zero. It is done when the run record exists, the probes have passed, and the outcome is documented in a form that someone else can verify.

That standard feels heavier at first. It takes longer to run an operation that captures evidence than one that does not. But the investment is not in the individual run — it is in the aggregate operational model. An environment where every significant operation produces a run record is an environment that can be audited, investigated, and understood.

HybridOps structures this as a first principle: every module execution produces a run record, probes run as part of the operation rather than separately, and the output is normalised so it can be compared across runs. The evidence is not bolted on — it is part of how the platform executes.
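Normalised records make cross-run comparison a simple structural diff. A hedged sketch of what that comparison buys you (the function and record shapes are illustrative, not HybridOps code):

```python
def diff_runs(previous, current):
    """Return the probe results that changed between two normalised run records."""
    prev = {p["name"]: p["result"] for p in previous["probe_results"]}
    curr = {p["name"]: p["result"] for p in current["probe_results"]}
    return {name: (prev.get(name), result)
            for name, result in curr.items()
            if prev.get(name) != result}

run_a = {"probe_results": [
    {"name": "replication-lag-check", "result": "pass"},
    {"name": "post-promote-write-check", "result": "pass"}]}
run_b = {"probe_results": [
    {"name": "replication-lag-check", "result": "fail"},
    {"name": "post-promote-write-check", "result": "pass"}]}

# Only the probe whose result changed shows up
print(diff_runs(run_a, run_b))  # → {'replication-lag-check': ('pass', 'fail')}
```

Because the records share one shape, "what changed since the last good run" is a mechanical question rather than an archaeological one.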


What this requires

Building an evidence-driven operational model requires three things.

First, a definition of what a run record contains. This needs to be agreed and consistent, not left to individual operator judgement.

Second, probes that are part of the operational design — not afterthoughts. The verification steps for an operation should be designed at the same time as the operation itself.

Third, a norm that the record is what defines completion. The dashboard being green is not the same as the operation having produced evidence of correctness.

Reliable infrastructure operations are not distinguished by the absence of incidents. They are distinguished by the quality of the evidence available when incidents happen.