When an Automated Job Fails, Don't Lose the Work

Scene: the job that died

A scheduled job in our pipeline failed partway through, the kind of failure that normally means the work is just gone. Automation failure recovery is what saved us.

The job stopped. Logs showed a timeout. The dashboard marked it as failed. That normally means rerun, rebuild, explain to stakeholders. This time it was different.

Instead of disappearing, the failed item dropped into a separate list where failed work waits with its inputs kept, so it can be tried again. The system recorded the exact inputs, the step that failed, a timestamp and the error text. Nothing was deleted. Nothing was silently swallowed.

What we tried and what happened

We had already built one small mechanism: any item that errored out was written to a separate list with its inputs kept intact. We call that the holding list. When the job failed this time, the item landed there.

The next morning we replayed the held item. It completed cleanly. The original failure had been a temporary timeout, not a real problem with the work.

That replay was the turning point. We watched the job finish. The same inputs, the same steps. Success.

Before that moment our posture had been to treat failure as final. We assumed a failure meant the work was lost. We assumed there had been a bug, or a corrupt file, or some unrecoverable state. So we coded for alerting, rebuilds and blame. That wastes time. And work gets discarded.

Automation failure recovery as a decision

After the replay we changed a rule. No automated job is allowed to fail silently. Every failure has to land somewhere a person or a retry can find it. That became our operational law.

Three concrete things showed us this mattered.

First: the failed item that landed in the holding list finished when replayed the next day. The problem had been a transient timeout. Recovery proven live.

Second: when we scanned the holding list over two weeks we found most entries healed themselves when retried. Patterns emerged quickly. Network hiccups. A downstream API rate-limit timeout. Brief infrastructure blips. These were temporary.

Third: we had also built the write-to-list step so the inputs were preserved exactly. That let us verify that the replay was running the same work, not a patched version. That was crucial for trust; we could attribute the difference to environmental conditions rather than silent changes to the task.

How the holding step behaved

Concrete detail: the holding list stores the original inputs as a serialized record, the step identifier, the error string, and a retry count. The retry count caps how many times an item is replayed, which blocks infinite loops, and it is how the system knows whether to try again. This is a taste, not a full build plan.
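
To make the shape of a held record concrete, here is a minimal Python sketch. It is an illustration, not the pipeline's actual code: the names HeldItem, hold and run_step are hypothetical, the holding list is a plain in-memory list, and JSON is assumed as the serialization format. The timestamp field follows the earlier description of what the system recorded.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class HeldItem:
    """One entry in the holding list (field names are illustrative)."""
    inputs_json: str      # original inputs, serialized once and never modified
    step_id: str          # identifier of the step that failed
    error: str            # error text from the failure
    failed_at: str        # ISO-8601 timestamp of the failure
    retry_count: int = 0  # number of replays attempted so far


def hold(holding_list: list, inputs: dict, step_id: str, exc: Exception) -> HeldItem:
    """Record a failed item instead of dropping it."""
    item = HeldItem(
        inputs_json=json.dumps(inputs, sort_keys=True),
        step_id=step_id,
        error=f"{type(exc).__name__}: {exc}",
        failed_at=datetime.now(timezone.utc).isoformat(),
    )
    holding_list.append(item)
    return item


def run_step(holding_list: list, step_id: str, step_fn, inputs: dict):
    """Run one step; on any exception, hold the item and return None."""
    try:
        return step_fn(inputs)
    except Exception as exc:
        hold(holding_list, inputs, step_id, exc)
        return None
```

Serializing the inputs at the moment of failure is the design choice that matters here: a later replay reads the stored record, so it cannot accidentally pick up inputs that changed in the meantime.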

Operationally the holding list does two jobs. One, it acts as a buffer so the item is not lost. Two, it creates an auditable trail so someone can glance at it and decide: retry, escalate, or discard. That person is a human operator who looks only at the entries that show permanent failures.

A rule that fell out

From those moments a simple rule emerged. Every failure must be visible. If a system swallows a failure, real work disappears. If a system holds a failure, most of it comes back on replay.

We wrote that rule into our operations: hold anything that errors out; record inputs; let humans triage only the permanently broken entries; allow automatic replay for transient cases. It sounds obvious. But we had to see it several times before we trusted it.
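
The rule can be sketched as a bounded replay loop. This is a hedged illustration under stated assumptions: entries are plain dicts with inputs_json, retry_count and error keys, MAX_RETRIES is an invented cap (the real limit is not given in this article), and the three status strings are hypothetical.

```python
import json

MAX_RETRIES = 3  # assumed cap; the real limit is not stated in the article


def replay(entry: dict, step_fn) -> str:
    """Replay one held entry with its original inputs.

    Returns "done", "held" (eligible for another replay) or "needs_human".
    """
    if entry["retry_count"] >= MAX_RETRIES:
        return "needs_human"  # cap reached: do not run the step again
    entry["retry_count"] += 1  # record the attempt before running it
    try:
        step_fn(json.loads(entry["inputs_json"]))
    except Exception as exc:
        entry["error"] = f"{type(exc).__name__}: {exc}"
        return "needs_human" if entry["retry_count"] >= MAX_RETRIES else "held"
    return "done"


def replay_all(holding_list: list, step_fn) -> list:
    """Replay every entry; return the entries that still need attention."""
    return [e for e in holding_list if replay(e, step_fn) != "done"]
```

Counting the attempt before running the step means a crash in the middle of a replay still shows up in the count, so an entry cannot loop forever without being noticed.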

When replay is wrong

Not every failure is transient. We found entries that would have failed forever if retried. Bad input. Invalid files. Misconfigured parameters provided by a client. These needed a human to glance at the holding list and throw the truly broken ones away.

We learned to put a short note with each held item describing what likely failed. That saved time. A human could delete or correct the ones that were genuinely garbage without trying endless retries.

One concrete counter-example: a batch from a partner repeatedly failed because a date field was misformatted. We retried it twice and it failed both times. It stayed on the holding list until someone inspected the inputs, fixed the date format, and resubmitted. Without the hold it would have been logged as ‘failed’ and the work lost.
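
One way to produce that short note is to classify the exception at the moment the item is held. The sketch below is an assumption-heavy illustration: the mapping of exception types to "transient" or "bad input" is invented for the example, triage_note is a hypothetical helper, and the date string only imitates the partner-batch case.

```python
from datetime import datetime

TRANSIENT = (TimeoutError, ConnectionError)  # assumed: environment problems


def triage_note(exc: Exception) -> tuple:
    """Return (auto_retry, note) for a failure. The rules are illustrative."""
    if isinstance(exc, TRANSIENT):
        return True, f"Likely transient ({type(exc).__name__}); safe to replay."
    if isinstance(exc, (ValueError, KeyError)):
        return False, f"Likely bad input ({exc}); inspect before resubmitting."
    return False, f"Unclassified ({type(exc).__name__}: {exc}); needs a human look."


# A misformatted date field, similar in spirit to the partner batch above.
try:
    datetime.strptime("31/12/2024", "%Y-%m-%d")
except ValueError as exc:
    auto_retry, note = triage_note(exc)
```

Anything the rules do not recognize defaults to a human look rather than an automatic retry, which keeps unknown failures visible instead of burning through the retry count.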

Why this worked

First, most failures were environmental. We were hitting timeouts and temporary network errors. Those do not mean the work is wrong. They mean the environment is flaky. Holding and replaying gives the environment time to recover.

Second, preserving inputs lets us test whether the fault is with the data. When we replayed the item that had timed out, we confirmed the inputs were still valid. That distinction guided our next actions: investigate infra when many items time out, investigate inputs when the same item fails repeatedly.
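
That infra-versus-inputs distinction can be expressed as a small check over the holding list. This is only a sketch: the entry keys (item_id, error, retry_count), the substring match on "timeout" and both thresholds are assumptions made for the example, not measured values from our records.

```python
def suggest_investigation(entries: list, many: int = 5, repeated: int = 2) -> list:
    """Suggest where to look first, based on what the holding list contains.

    entries: dicts with 'item_id', 'error' and 'retry_count' keys (assumed shape).
    many: how many distinct timed-out items point at the environment (assumed).
    repeated: how many failed replays point at the item's own inputs (assumed).
    """
    suggestions = []

    # Many different items timing out suggests the environment, not the work.
    timed_out = {e["item_id"] for e in entries if "timeout" in e["error"].lower()}
    if len(timed_out) >= many:
        suggestions.append(f"infra: {len(timed_out)} distinct items timed out")

    # The same item failing replay after replay suggests its inputs.
    for e in entries:
        if e["retry_count"] >= repeated:
            suggestions.append(
                f"inputs: item {e['item_id']} failed {e['retry_count']} replays"
            )
    return suggestions
```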

Third, a visible holding list changes human behavior. Operators stop reflexively discarding work. They can triage quickly. That reduces waste.

We did not invent a magic fix. We changed how failures are treated. The system became resilient because we stopped assuming failure equals finality.

Questions we got

How often did replays fix the problem?

In our two-week scan, most held items succeeded when retried. We do not have an exact percentage to report here, but the pattern was clear: a majority. The concrete incident where a scheduled job completed after a morning replay is the clearest proof.

Won’t the holding list just become a junk pile?

It can, if nobody looks. That’s why entries include a short human-readable note and a retry-count. We also require a person to glance periodically and discard truly broken inputs. The holding list is an operating queue, not a long-term dump.

Can automatic retry cause harm?

Yes, if retries hide systemic issues or multiply side effects. We limit retries with a count and we record attempts. When the same item fails repeatedly a human intervenes. That mix of automated retry plus human triage prevented repeat damage in our pipeline.

The difference, plainly: lost means gone. Held means recoverable. That is what automation failure recovery bought us.

Sources: our own content pipeline experiments and operational records from the team that ran the job and replayed the held item.

How we know

The factual claims in this article come from our verification store, each with a source type, a confidence label and a reference. The method is documented on How we know.

– We made it a rule: no automated job is allowed to fail silently. Every failure has to land somewhere a person or a retry can find it. | source: first-hand experience | conf: strong | ref: ebizapple first-hand — no silent failure
– When we replayed the held item the next morning it completed cleanly. The original failure had been a temporary timeout, not a real problem with the work. | source: product testing | conf: strong | ref: ebizapple first-hand — recovery proven live
– The turning point was realising most of our failures were temporary. Treat them as permanent and you lose real work; hold them and replay, and most heal themselves. | source: first-hand experience | conf: strong | ref: ebizapple first-hand — decision_rule
– We had built a dead-letter step: anything that errors out is written to a separate list with its inputs kept intact, so it can be retried later instead of silently lost. | source: product testing | conf: strong | ref: ebizapple first-hand — the mechanism
– Not every failure should be retried. A few were genuinely bad input that would fail forever, so the holding list also needed a person to glance at it and throw the truly broken ones away. | source: first-hand experience | conf: observed | ref: ebizapple first-hand — counter-case
– A scheduled job in our pipeline failed partway through, the kind of failure that normally means the work is just gone. This time the failed item dropped into a holding list instead of vanishing. | source: product testing | conf: strong | ref: ebizapple first-hand — the incident
