When an Automated Job Fails, Don’t Lose the Work

A scheduled job failed. The work didn’t disappear.

One afternoon a scheduled job in our content pipeline stopped partway through. That kind of half-finish normally means lost work: the item vanishes into thin air, and someone has to notice later and recreate it. This time the item didn’t vanish. It dropped into a separate list where failed work waits with its inputs kept, so it can be tried again.

What we did and what happened

We had built a simple safety step: any task that errored would be written to that list with its original inputs preserved. The next morning we replayed the held item.

The replay finished cleanly. The original error had been a temporary timeout. The work itself was fine.

The experiment, in plain terms

We stopped treating errors as final and started treating them like mail that can be held for a retry. Practically: when a job failed, we captured the input and put it into the separate list instead of letting the pipeline drop it. No magic. Just capture and hold.

Then we waited. We replayed the held item later. The result: success.
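
As an illustration only, here is a minimal Python sketch of capture and hold followed by a replay. The names HoldingList, HeldItem, and publish are invented for the example, and an in-memory list stands in for whatever storage a real pipeline would use.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class HeldItem:
    """A failed task kept with its original inputs (hypothetical structure)."""
    task_name: str
    inputs: dict[str, Any]
    error: str


@dataclass
class HoldingList:
    """In-memory stand-in for the separate list where failed work waits."""
    items: list[HeldItem] = field(default_factory=list)

    def run(self, task_name: str, task: Callable[..., Any], **inputs: Any) -> Any:
        """Run a task; on any error, hold the inputs instead of dropping them."""
        try:
            return task(**inputs)
        except Exception as exc:
            self.items.append(HeldItem(task_name, dict(inputs), repr(exc)))
            return None

    def replay(self, tasks: dict[str, Callable[..., Any]]) -> list[Any]:
        """Retry each held item with its preserved inputs; keep any that fail again."""
        results, still_held = [], []
        for item in self.items:
            try:
                results.append(tasks[item.task_name](**item.inputs))
            except Exception as exc:
                still_held.append(HeldItem(item.task_name, item.inputs, repr(exc)))
        self.items = still_held
        return results


# Usage: a hypothetical task that times out on its first call only.
calls = {"count": 0}


def publish(title: str) -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("upstream timed out")
    return f"published: {title}"


held = HoldingList()
first = held.run("publish", publish, title="Weekly digest")  # fails, gets held
replayed = held.replay({"publish": publish})                 # same inputs, later attempt
```

The point of the design is that the inputs travel with the failure record, so the replay runs on exactly the data the original attempt saw.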

The moment it clicked

We noticed a pattern. Over a week we replayed several held items. Most completed on the retry. The turning point came when a pile of failures that had looked fatal recovered as soon as we tried them again. At that moment we stopped assuming an error meant permanent loss and started assuming it might be temporary. That one shift changed how we treat every failure.

The rule that fell out

No automated job is allowed to fail silently. Every failure has to land somewhere a person or a retry can find it.

Three concrete things we learned

1) Failures are often temporary. When we replayed held items the next morning, they frequently finished with no change to the work itself. Our tests bore this out: a timeout that broke one run cleared on a later attempt.

2) The holding step matters. We built a dead-letter step so anything that errors is written to a separate list with its inputs kept intact. That simple mechanism stopped whole jobs from vanishing and made retries possible.

3) Humans still matter. Not every failure should be retried. Some items in the holding list were genuinely bad input and would have failed forever. We needed a person to glance at the list and remove the truly broken cases rather than repeatedly retrying them.

A fair counter-example

We had a few items that never healed on replay. Those were not transient infrastructure glitches. They were bad input: malformed text, missing required fields, or requests that violated downstream rules. Holding and replaying those just repeated failure. The holding list exposed them quickly so a human could intervene and discard or fix them.
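
To make the split between temporary failures and bad input concrete, here is a hedged Python sketch of a triage check. The exception types, the REQUIRED_FIELDS tuple, and the triage function are assumptions made for the example; a real pipeline would substitute its own error types and validation rules.

```python
from enum import Enum


class Verdict(Enum):
    RETRY = "retry"    # looks temporary; worth another attempt
    REVIEW = "review"  # needs a person to fix or discard


# Assumption: these built-in exceptions stand in for temporary failures such
# as timeouts and outages. Rate limits would need a pipeline-specific
# exception type, which is not shown here.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)

# Assumption: hypothetical required fields for a content item.
REQUIRED_FIELDS = ("title", "body")


def triage(inputs: dict, error: Exception) -> Verdict:
    """Decide whether a held item should be retried or shown to a human."""
    missing = [name for name in REQUIRED_FIELDS if not inputs.get(name)]
    if missing:
        return Verdict.REVIEW  # bad input would fail forever; do not replay it
    if isinstance(error, TRANSIENT_ERRORS):
        return Verdict.RETRY
    return Verdict.REVIEW  # unknown errors go to a person by default
```

Checking the input before the error type matters here: an item with missing fields that also happened to time out would still fail on every replay.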

How we proved this worked in the wild

One held item came from the scheduled job that originally failed. We replayed it the next day and it completed. Because we kept the inputs intact, the replay used the same data and showed that the failure was a timeout, not malformed work. That moment, the recovery proven live, is why we adopted the rule.

Practical rule, plain language

If an automatic process breaks, keep the work safe where someone or something can find and retry it. Don’t let jobs disappear. And make sure someone scans the held list to toss the genuinely broken items so retries aren’t a waste.

How we know this applies beyond our pipeline

We saw the same pattern in multiple runs and multiple error types. Temporary outages, rate limits, and intermittent timeouts healed on retry; malformed inputs did not. The combination of a holding step plus a human triage pass reduced lost work dramatically in our environment.

FAQ

Why not just retry every failure automatically?

Automatic retries help with transient errors, but they can waste cycles on permanently broken items. In our setup, retries from the held list follow two rules: automated retries for recent transient errors, and a human glance for persistent or suspicious failures. That balance stopped noisy retries and caught real problems fast.
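
One way such rules could be written down is sketched below in Python. The 24-hour window, the three-attempt cap, and the names HeldRecord and next_step are illustrative assumptions, not values taken from the pipeline described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class HeldRecord:
    """Bookkeeping for one held item (hypothetical fields)."""
    first_failed_at: datetime
    attempts: int
    transient: bool  # whether the last error looked temporary


# Assumed thresholds; tune them to the pipeline in question.
MAX_AUTO_ATTEMPTS = 3
AUTO_RETRY_WINDOW = timedelta(hours=24)


def next_step(record: HeldRecord, now: datetime) -> str:
    """Return "auto_retry" for recent transient errors, else "human_review"."""
    recent = now - record.first_failed_at <= AUTO_RETRY_WINDOW
    if record.transient and recent and record.attempts < MAX_AUTO_ATTEMPTS:
        return "auto_retry"
    # Persistent, old, or suspicious failures wait for a person.
    return "human_review"
```

Capping attempts and age keeps an item that never heals from being retried indefinitely; once it crosses either limit it is routed to a person.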

Won’t the holding list grow forever?

It can, if you do nothing. We added a simple process: a human reviews the list daily and either triggers retries for likely transient cases or removes items that are clearly bad input. That keeps the list manageable and prevents endless retries.

What if the held input contains sensitive data?

Treat held inputs like any stored work. Encrypt them, limit who can see them, and log access. In our pipeline the inputs were stored with the same protections as active jobs.

Failure that is lost is invisible grief. Failure that is held is recoverable work.

Sources: documented incidents and tests from our content pipeline where failures were captured to a separate holding list and later replayed successfully; observed rule adoption after repeated recoveries and a small set of non-recoverable bad-input cases.
