Disaster recovery is usually treated as a technical problem. You set up replication, you configure a standby, you test the failover, and you document the procedure. If the primary goes down, the standby comes up. That is the model.
The technical side of that model is largely solved. Patroni handles PostgreSQL HA. Cloud SQL supports external replicas. VPN failover can be automated. The tooling exists. The harder problem is the one that comes after the technical setup works: who decides when to actually use it, based on what information, and with what consequences?
That is a governance problem, not a technical one.
Why automatic failover is not always the right answer
Automatic failover is appealing because it removes the need for human judgment in the moment. The monitoring detects a failure, the decision is made, the standby is promoted. Fast, clean, no 3am phone call required.
But automatic failover makes a specific assumption: that the detected condition always warrants immediate recovery action. In practice, that assumption breaks down in interesting ways.
A transient network partition might look like a primary failure. Promoting a replica that is actually still replicating to a live primary creates a split-brain scenario that is significantly worse than the original blip.
A brief spike in replication lag might trigger failover logic that moves database traffic to a cloud replica — incurring egress costs and latency penalties — for a condition that resolves itself in four minutes.
An application bug that crashes one service might trigger cascading alerts that make the monitoring surface look like a site-wide outage when it is not.
In each of these cases, an automated system that fires immediately is not providing resilience. It is making decisions based on incomplete information, and those decisions have real operational and financial consequences.
Recovery as a decision
A more useful model is to treat recovery as a governed decision rather than an automatic trigger.
The signals that inform that decision — health checks, replication lag, cost posture, service availability, cross-region connectivity — are still captured automatically and continuously. The difference is that they feed a decision surface rather than directly triggering actions.
```yaml
recovery_triggers:
  evaluate_when:
    - probe: primary-health
      status: failing
      duration_seconds: 120
    - probe: replica-lag
      threshold_seconds: 30
      status: exceeded
  cost_gate:
    max_monthly_egress_usd: 400
    action_if_exceeded: alert_and_hold
  decision_mode: governed   # not: automatic
  approval_required: true
  notify: ["oncall-lead", "platform-team"]
```
Under this model, the platform evaluates the signals, determines whether recovery conditions are met, and then routes the decision through a governance layer before executing. An operator confirms the action. The approval, the signal state at the time, and the recovery action are all written into an evidence envelope.
That envelope is what makes the recovery auditable. Not just “we failed over at 02:14” but: here are the signals that triggered evaluation, here is what the environment looked like, here is who approved the action, here is the outcome.
Cost posture as a recovery signal
One of the more underappreciated aspects of cloud-based DR is the cost model. Running active DR into a cloud target — Cloud SQL external replica, GCP compute, cross-region networking — has ongoing costs. Failing over into that target for an extended period has larger costs.
Most DR designs acknowledge this at the architecture stage and then ignore it at runtime. The failover happens, traffic moves to the cloud target, and nobody thinks about cost posture until the monthly bill arrives.
Including cost posture as a first-class signal in the recovery decision changes that. The platform knows the current cost position. If a failover would push egress past a defined threshold, it flags that condition before executing — not as a blocker, but as information that the operator should have before making the call.
This is not about being cheap with infrastructure. It is about making recovery decisions with complete information. An operator who knows that a failover will cost an additional $800 this month is in a better position to weigh the trade-off than one who is purely reacting to a health check alert.
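A cost gate along these lines is simple to sketch. Here `projected_egress_usd` is assumed to come from billing or metering data, and the default threshold mirrors the `cost_gate` block above:

```python
def cost_gate(projected_egress_usd: float,
              max_monthly_egress_usd: float = 400.0) -> dict:
    """Surface cost posture to the operator; informational, not a hard block."""
    exceeded = projected_egress_usd > max_monthly_egress_usd
    return {
        # alert_and_hold flags the condition for the operator before executing.
        "action": "alert_and_hold" if exceeded else "proceed",
        "headroom_usd": round(max_monthly_egress_usd - projected_egress_usd, 2),
    }

assert cost_gate(310.0)["action"] == "proceed"
assert cost_gate(950.0) == {"action": "alert_and_hold", "headroom_usd": -550.0}
```

Note that an exceeded gate does not veto the failover; it attaches the cost position to the decision so the operator weighs it explicitly.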
What governed recovery actually looks like
The practical components of a governed recovery model are not complex. They are mostly decisions that need to be made explicitly and encoded somewhere.
What signals trigger evaluation? Define the health checks, replication lag thresholds, and connectivity probes that indicate a recovery condition might exist. These should be specific and measurable, not “the primary looks unhealthy.”
What is the decision mode? Some conditions warrant immediate automatic action — a primary that has been unreachable for twenty minutes is a genuine failover scenario. Others warrant evaluation and a human call. The model should distinguish between them.
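One way to encode that distinction is a simple routing function; the cutoff below matches the twenty-minute example, and is illustrative rather than prescriptive:

```python
AUTOMATIC_CUTOFF_SECONDS = 20 * 60  # sustained unreachability: genuine failover

def decision_mode(primary_unreachable_seconds: int, trigger_met: bool) -> str:
    if primary_unreachable_seconds >= AUTOMATIC_CUTOFF_SECONDS:
        return "automatic"   # act without waiting for a human
    if trigger_met:
        return "governed"    # evaluate, notify, and wait for approval
    return "none"

assert decision_mode(25 * 60, True) == "automatic"
assert decision_mode(5 * 60, True) == "governed"
assert decision_mode(0, False) == "none"
```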
What does the evidence envelope contain? Every recovery action, whether automatic or governed, should produce a structured record. The signal state, the decision path, the approver, the recovery steps executed, and the post-recovery verification results.
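Concretely, an envelope covering those five elements might be recorded along these lines (field names are assumptions, not a fixed schema):

```python
from datetime import datetime, timezone

envelope = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "signal_state": {"primary-health": "failing", "replica-lag-seconds": 42},
    "decision_path": "governed",          # or "automatic"
    "approver": "oncall-lead",
    "recovery_steps": ["promote-standby", "repoint-traffic"],
    "verification": {"standby-writable": True, "replication-reestablished": True},
}

# Every recovery action appends one such record to an audit store.
assert {"signal_state", "decision_path", "approver",
        "recovery_steps", "verification"} <= envelope.keys()
```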
What is the failback path? Recovery into a cloud target is temporary in most DR architectures. The failback path — restoring data to the primary, reestablishing replication, cutting traffic back — needs to be as well-defined as the failover path.
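Encoding the failback path as an explicit, verified sequence keeps it as well-defined as the failover path; the step names and checks below are illustrative:

```python
FAILBACK_STEPS = [
    ("restore-data-to-primary", "row counts match cloud target"),
    ("reestablish-replication", "replica lag below threshold"),
    ("cut-traffic-back", "error rate stable on primary"),
]

def run_failback(execute, verify) -> list[str]:
    """Run each step in order, halting if its verification fails."""
    completed = []
    for step, check in FAILBACK_STEPS:
        execute(step)                 # call into the orchestration layer
        if not verify(step, check):   # post-step verification gate
            raise RuntimeError(f"failback halted at {step!r}: {check!r} not satisfied")
        completed.append(step)        # recorded for the evidence envelope
    return completed

done = run_failback(lambda step: None, lambda step, check: True)
assert done == ["restore-data-to-primary", "reestablish-replication",
                "cut-traffic-back"]
```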
HybridOps structures DR orchestration around these questions. The decision service evaluates signals against a defined contract. Governed modes route through an approval layer before executing. Every action produces an evidence envelope.
The argument for deliberate recovery
Automatic systems are valuable. They respond faster than humans, they do not panic, and they do not miss alerts at 3am.
But infrastructure recovery is not purely a speed problem. A recovery action that is fast but wrong — that fires on a transient signal, incurs unnecessary cost, or creates a worse condition than the one it was responding to — is not a success.
Deliberate recovery is slower in the moment and more reliable over time. It produces evidence. It distributes decision-making appropriately rather than encoding all judgment into a system that has no access to business context.
The goal is not to remove automation from DR. It is to put it in the right places.