Error Handling Patterns¶
YAML Workflow provides robust mechanisms for handling errors gracefully within your workflows. This guide covers common patterns and best practices.
Core Concepts¶
TaskExecutionError
: The base exception raised when a task fails during execution. It often wraps the original exception.on_error
block: A step-level configuration to define behavior when that specific step fails.- Retry Mechanism: Automatically retry failed steps based on configuration.
- Error Context: Information about the error (step name, original exception, etc.) is available in templates within the
on_error
block and potentially passed to error handling steps. - State Management: The engine tracks the status of each step (completed, failed, skipped) and the overall workflow status.
Step-Level Error Handling (on_error
)¶
For a runnable example demonstrating the concepts below (retry, continue, next, error messages), see docs/examples/error_handling/workflow.yaml
.
You can define how a step should react to its own failure using the on_error
block within the step definition.
steps:
- name: risky_api_call
task: api_call
inputs:
url: "{{ env.API_URL }}/data"
on_error:
action: retry # or fail, continue, next
retry: 3
delay: 10 # seconds
message: "API call failed for {{ args.item }}: {{ error }}" # Optional custom message
next: handle_api_failure # Optional target step for 'next' action
Actions:¶
fail
(Default): The step is marked as failed, and the entire workflow halts immediately, raising aWorkflowError
.retry
: Attempts to re-run the failed step.retry
(integer): Maximum number of retry attempts (defaults to global setting or 3).delay
(integer/float): Seconds to wait before the next retry attempt (defaults to 0).- If retries are exhausted, the behavior defaults to
fail
unlessnext
orcontinue
is also specified. continue
: Marks the step as failed but allows the workflow to proceed to the next step in the sequence. The failed step's output will not be available in thesteps
namespace.next
: Marks the step as failed and jumps execution to a different step specified by thenext
key. This allows for dedicated error handling flows.
Custom Error Message (message
)¶
- You can provide a custom error message template using
message
for logging and state tracking. - This template can use Jinja2 syntax and access standard context variables (
args
,env
,steps
). - It can also access the core error details using the
error
variable: {{ error }}
: Provides the primary error information, often the string representation of the original exception that caused the failure (e.g., "FileNotFoundError: [Errno 2] No such file or directory...").- The resolved
message
is stored in the workflow state for the failed step (e.g., accessible later viasteps.FAILED_STEP_NAME.error_message
) and is used in logs. Use this pattern to access formatted error messages in subsequent error-handling steps.
Retry Strategies¶
- Simple Retry: Use
action: retry
withretry
anddelay
for transient issues (e.g., network timeouts, temporary API unavailability). Best for idempotent operations where re-running is safe. - Retry with Backoff: While not built-in directly as an exponential backoff flag, you could potentially implement custom retry logic within a Python task if needed, though standard linear delay is often sufficient.
- Global Retries: A global default retry count can potentially be set in workflow
settings
(implementation may vary).
Error Handling Actions: When to Use What¶
fail
(Default): Use when any failure in the step is critical and the entire workflow should stop immediately.retry
: Use for transient errors where re-running the step might succeed (e.g., network issues, temporary service unavailability). Ensure the step is safe to re-run (idempotent).continue
: Use when a step's failure is non-critical, and the workflow can proceed meaningfully without its results. Subsequent steps should not depend directly on the output of the failed step.next
: Use when a failure requires specific cleanup, compensation, or notification actions before potentially stopping or continuing the workflow. Redirects execution to a dedicated error-handling sequence.
Error Handling Flows (action: next
)¶
Use action: next
to redirect the workflow to a dedicated error handling path.
flows:
default: main_flow
definitions:
- main_flow: [step_a, risky_step, step_c]
- error_flow: [log_error, notify_admin] # Defines the sequence for the error path
steps:
- name: step_a
# ...
- name: risky_step
task: some_task
# ...
on_error:
action: next
next: log_error # Jump to the start of the error_flow sequence
message: "Risky step failed for {{ args.id }}: {{ error }}" # Custom message stored in state
- name: step_c
# ... only runs if risky_step succeeds ...
- name: log_error
task: write_file
inputs:
path: "output/error_log.txt"
content: |
Workflow Failure Report
-----------------------
Timestamp: {{ workflow.timestamp }}
Failed Step: {{ steps.risky_step.name }} {# Assumes 'risky_step' is the one that failed #}
Reason: {{ steps.risky_step.error_message }} {# Access the formatted message defined above #}
Original Error: {{ steps.risky_step.error }} {# Access raw error if needed #}
Context:
Args: {{ args | tojson }}
# Add other relevant context if available
# Note: This step runs *after* the risky_step failed and was redirected here.
# It accesses the state (name, error_message, error) of the failed step.
- name: notify_admin
task: slack_notify # Example custom task
inputs:
channel: "#alerts"
message: "Workflow {{ workflow.name }} failed at step {{ steps.risky_step.name }}. Reason: {{ steps.risky_step.error_message }}. See error_log.txt for details."
Error Propagation¶
- When a step fails and the action is
fail
(or retries are exhausted with nocontinue
/next
), the engine wraps the original exception (or theTaskExecutionError
created byhandle_task_error
) within aWorkflowError
and halts execution. - The
WorkflowError
contains information about the step that failed and often includes the original exception as its cause (__cause__
or anoriginal_error
attribute), allowing programmatic inspection if the engine is used as a library. - If
action: continue
is used, the error is logged and recorded in the state, but execution proceeds. - If
action: next
is used, the error context is preserved in the failed step's state, and the workflow jumps to the specified step.
Centralized Error Handling in Custom Tasks¶
As detailed in the Task Development Guide, custom Python tasks should use the handle_task_error
utility. This ensures:
- Consistent logging format.
- Wrapping of arbitrary exceptions into TaskExecutionError
.
- Inclusion of relevant context (step name, task type, task config) in the raised exception.