13 KiB
Workflow Runner Refactor — Design
Date: 2026-05-19
Branch base: dev
Status: Approved design, pending implementation plan
Problem
Workflow execution and the Django signal layer have repeatedly produced fragile, hard-to-fix bugs (see the revert/refix history around password removal: #12803, #12814, #12716, and the filename race #12386). Three structural causes:
-
run_workflowsis dual-mode. A single function handles both consumption (mutating aDocumentMetadataOverrides) and post-save (mutating a realDocument), branching on ause_overridesflag. The branching is concentrated in two places — the action dispatch insiderun_workflows(handlers.py:931-1001) andbuild_workflow_action_context(actions.py:33-83), each with two full code paths. Theapply_*helpers inworkflows/mutations.pyare already split by target type (apply_assignment_to_documentvsapply_assignment_to_overrides, etc.); the refactor unifies their callers, not the helpers themselves. -
File location is an implicit, timing-dependent side channel. The
DOCUMENT_ADDEDworkflow fires fromrun_workflows_added, which runs while the consumer is still inside its transaction — before the consumed file is copied todocument.source_path(document_consumption_finishedis sent atconsumer.py:658; the file copy happens after, atconsumer.py:670+). The staged path is therefore threaded through asoriginal_file/caller_supplied_original_fileparameters. Actions that read the file (password removal, email attachments) depend on this plumbing being correct. -
The workflow run races the filename rename.
update_filename_and_move_filesis a rawpost_savereceiver onDocument. When a workflow persists its changes viadocument.save(update_fields=[...]), that save firespost_saveand runs the rename while the workflow is still executing. Under concurrent Celery/UI updates the interleavedrefresh_from_db()calls corrupt state. The comment athandlers.py:980-984— deliberately excludingfilename/archive_filenamefrom the workflow save — is a load-bearing workaround for exactly this.
Note: run_workflows_added / run_workflows_updated are connected to the
custom signals document_consumption_finished / document_updated, fired
explicitly by paperless code in a handful of known sites — not to raw Django
post_save. Only update_filename_and_move_files is a raw post_save receiver.
This refactor does not change where workflows are triggered from.
Scope
In scope:
- Refactor
run_workflowsand its action helpers around an execution-context abstraction. - Delete the
original_fileside-channel plumbing. - Make the workflow-execution → persist → rename sequence explicit and deterministic.
Out of scope:
- Changing where/when workflows are triggered (custom signal call sites unchanged).
- Reworking the matching logic (
matching.document_matches_workflow). - Any change to workflow models, serializers, or the REST API.
Design
1. WorkflowRunContext protocol
New module documents/workflows/context.py defining a typing.Protocol:
WorkflowRunContext (Protocol)
source_file: Path # where the file actually is, now
build_placeholder_context() -> dict
apply_assignment(action) -> None
apply_removal(action) -> None
persist() -> None # commit accumulated mutations
record_run(workflow, trigger_type) -> None
Two concrete implementations (which need not import the Protocol — structural typing):
ConsumptionContext— wrapsConsumableDocument+DocumentMetadataOverrides.source_filereturns the staged file path. Mutations land on the overrides.persist()is a no-op (the overrides object is returned to the caller).PersistedContext— wraps a realDocument. Mutations land on the in-memoryDocument.persist()performs a single save.
Context selection — run_workflows picks the context from the call shape:
- CONSUMPTION trigger (
ConsumableDocument+ non-Noneoverrides) →ConsumptionContext. - DOCUMENT_ADDED / DOCUMENT_UPDATED / SCHEDULED (a real
Document,overrides=None) →PersistedContext.
source_file for PersistedContext. It cannot unconditionally return
document.source_path: for the DOCUMENT_ADDED trigger the file has not yet
been moved there. The staged path is therefore passed into the PersistedContext
at construction time by run_workflows_added (which still receives it from the
document_consumption_finished signal). source_file returns that staged path
when supplied, otherwise document.source_path. This relocates the staged-path
information from a chain of function parameters into a single piece of
construction state — the original_file / caller_supplied_original_file
parameter plumbing through run_workflows and the action helpers is what gets
deleted, not the staged path itself.
WorkflowRunContext is a plain Protocol, not @runtime_checkable — the runner
constructs the context itself, so no isinstance check is needed. Genuinely
shared logic goes into module-level helper functions, not a base class.
2. run_workflows becomes branch-free
run_workflows keeps its current public signature so all call sites are
unchanged. Its body:
- Construct the appropriate context once, from the argument types.
- Run a single flat match-and-dispatch loop over matching workflows/actions, delegating every action to context methods.
No use_overrides flag anywhere. The branching currently scattered across
run_workflows, build_workflow_action_context, and the apply_* helpers
collapses into the two context classes.
3. File staging via source_file
source_file is a property of the context, fixed at construction. The
original_file and caller_supplied_original_file parameters threaded through
run_workflows and the execute_* helpers are deleted; each context resolves
the path itself (see "Context selection" above).
Deferred password removal. execute_password_removal_action, when given a
ConsumableDocument, currently installs a one-shot handler on
document_consumption_finished that picks up original_file from kwargs
later (actions.py:295-308). This deferred hook lives outside the context
abstraction. The refactor must explicitly decide its fate: either keep it as-is
(the context still constructs correctly around it) or fold the deferral into
ConsumptionContext. This is called out as an open implementation decision, not
silently absorbed.
4. Explicit workflow → persist → rename sequencing
What must be deferred is the file rename, not the DB save. run_workflows
keeps its per-workflow document.refresh_from_db() at the top of each iteration
— that is deliberate concurrency protection against bulk_update_documents
running simultaneously. Deferring all saves to a single final persist() would
let one workflow's refresh wipe a prior workflow's in-memory changes. So:
run_workflowsrefreshes and applies actions per workflow, andPersistedContext.persist()saves after each matching workflow, as today.- The save deliberately continues to exclude
filename/archive_filenamefromupdate_fields. This is not duct tape: it guards a cross-process hazard — another Celery task may have moved the file and writtenfilenameto the DB, and a stale in-memoryfilenamein our save would revert it. TheContextVarguard (below) only addresses intra-process ordering, so this exclusion stays. - The rename is suppressed for the whole run and invoked exactly once, afterward, against final committed state.
The actual race being fixed: apply_assignment_to_document assigns tags via
document.add_nested_tags(...), which fires m2m_changed on
Document.tags.through before the workflow's document.save(). The
m2m_changed receiver update_filename_and_move_files then calls
refresh_from_db(), wiping the workflow's in-memory correspondent/type, and
moves the file to a path computed from stale metadata. The guard prevents this.
To stop the rename from firing mid-workflow, a ContextVar guard is
introduced (e.g. documents/workflows/context.py module-level
_workflow_in_progress: ContextVar[bool]). update_filename_and_move_files
checks the guard and early-returns when set. run_workflows wraps its entire
persisted-path execution — not just the persist() call — in a context manager
that sets the guard via set()/reset(token). Token-based reset is
reentrancy-safe for nested saves or nested workflow runs.
The guard must span the whole execution, not just persist(), because
update_filename_and_move_files is also registered to m2m_changed on
Document.tags.through and to post_save on CustomFieldInstance
(handlers.py:431-432). A workflow action that assigns tags or custom fields
would otherwise trigger a rename mid-workflow through those signals.
After execution completes, run_workflows calls persist() once and then
explicitly invokes the move logic once. The ContextVar is set/reset in the
same thread that runs these receivers synchronously, so they always observe the
value. (Celery prefork workers run each task in its own process; greenlet
pools are also contextvars-aware — non-issues, noted for completeness.)
The move body of update_filename_and_move_files is extracted into a plain
callable that the runner invokes directly. The function is already invoked
directly (as a plain call, bypassing the decorator) for version documents at
handlers.py:664-667, so this extraction has precedent. The thin post_save
receiver remains as a guard-checking wrapper.
The two post_save receivers on Document are update_filename_and_move_files
(handlers.py:433) and update_llm_suggestions_cache (handlers.py:740). The
ContextVar guard suppresses only the former — update_llm_suggestions_cache
keeps running normally, as do document_consumption_finished receivers such as
add_or_update_document_in_llm_index (which is not a post_save receiver).
This is why the guard is preferred over persisting with .update(), which would
silently suppress all post_save receivers including
update_llm_suggestions_cache.
WorkflowRun.objects.create(...) is created per matching workflow as today
(handlers.py:998-1002); it is a separate model and is not deferred.
The comment at handlers.py:980-984 is updated to describe the new flow
(per-workflow save under the guard; single explicit rename afterward) but the
filename / archive_filename exclusion it documents is kept — see point 2
above.
Testing
- Runner loop — exercised against a fake context implementing the
WorkflowRunContextsurface that recordsapply_assignment/apply_removal/persistcalls. No DB document, no staged files, no signals. - Concrete contexts —
ConsumptionContextandPersistedContexteach get focused tests: given an action, assert the mutation lands on the overrides vs. the document, and thatsource_fileresolves to the staged vs. final path. - ContextVar guard — assert
update_filename_and_move_filesearly-returns while the guard is set, and that the rename runs exactly once afterpersist(). - Regression: the racy case — a workflow that reassigns metadata while the document is subject to a filename template; assert final DB filename and file location are consistent (the #12386 scenario).
- Regression safety net — the existing
test_workflows.pysuite (~100 tests; ~19document_consumption_finished.sendsites plus many directrun_workflows(...)calls for theDOCUMENT_UPDATEDpath) must stay green unchanged. A test that needs editing signals a behavior change to flag explicitly, not a silent refactor outcome.
Per project conventions: tests grouped under classes, fixtures and test signatures fully type-annotated.
Implementation sequence
Each step is independently reviewable and keeps the test suite green:
- Introduce the
Protocol+ the two contexts;run_workflowsdelegates to them. Pure refactor, no behavior change. - Move the staged path into
PersistedContextconstruction (passed byrun_workflows_added); delete theoriginal_file/caller_supplied_original_fileparameter plumbing throughrun_workflowsand theexecute_*helpers. - Extract the move body from
update_filename_and_move_filesinto a callable; add theContextVarguard;run_workflowsinvokes the move once after the run completes. Thefilename/archive_filenameexclusion in the per-workflow save is kept; only the comment athandlers.py:980-984is updated to describe the new flow.
Pain points addressed
- Dual-mode → eliminated by the
Protocol+ two contexts; nouse_overrides. - File staging →
source_fileis a context property; side-channel args deleted. - Rename race → per-workflow save under a
ContextVarguard that suppresses the mid-workflow rename; a single explicit rename runs once at the end against final state.