Recovery
Recovery allows you to continue the execution of a process from a specific point. The Executive automatically bypasses completed steps, treating them as successful so that the process can resume immediately from the chosen recovery point. This is particularly useful for recovering from failed runs without re-executing the entire process.
Concept
When you start an operation with recovery nodes, the Executive fast-forwards the state of the behavior tree as if all nodes preceding the recovery nodes had already executed successfully. Execution then resumes from the specified recovery nodes.
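The fast-forward behavior can be pictured with a toy model. The sketch below is not the Executive's implementation: the function name `fast_forward` and the flat list of step names are assumptions made for illustration, and it only covers a plain sequence. Every step before the recovery node is marked as succeeded without being run.

```python
def fast_forward(steps: list[str], recovery_node: str) -> list[tuple[str, str]]:
    """Return (step, outcome) pairs for a sequence resumed at recovery_node.

    Hypothetical helper: a toy model of a sequence node, not Executive code.
    """
    if recovery_node not in steps:
        raise ValueError(f"unknown recovery node: {recovery_node!r}")
    first = steps.index(recovery_node)  # first step that really executes
    trace = []
    for i, step in enumerate(steps):
        if i < first:
            # Bypassed: the step is not run, but counts as successful.
            trace.append((step, "treated as succeeded"))
        else:
            trace.append((step, "executed"))
    return trace


trace = fast_forward(["A", "B", "C", "D"], recovery_node="C")
for step, outcome in trace:
    print(f"{step}: {outcome}")
```

Steps A and B are reported as succeeded without running, and execution effectively begins at C.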
Usage
To use recovery, you specify the recovery_nodes field in the StartOperationRequest.
Specifying recovery nodes
The recovery_nodes field takes a list of BehaviorTree.NodeIdentifier.
Each execution branch supports only one recovery node.
Specifying multiple recovery nodes is only possible when recovering inside a parallel node.
If a branch of a parallel node has no recovery node specified, that branch is started from the beginning.
Recovery is incompatible with specifying a start node via start_tree_id/start_node_id.
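The rules above can be checked before sending a request. The following is a minimal sketch under stated assumptions: node IDs are plain strings rather than BehaviorTree.NodeIdentifier messages, and `branch_of` is a hypothetical mapping from a node ID to the execution branch that contains it. The Executive performs its own validation; this only illustrates the constraints.

```python
from typing import Optional


def check_recovery_request(
    recovery_nodes: list[str],
    branch_of: dict[str, str],
    start_node_id: Optional[int] = None,
) -> list[str]:
    """Return a list of rule violations (empty if the request looks valid).

    Hypothetical helper; names and types are illustrative only.
    """
    errors = []
    # Rule: recovery cannot be combined with an explicit start node.
    if recovery_nodes and start_node_id is not None:
        errors.append("recovery_nodes cannot be combined with a start node")
    # Rule: each execution branch supports only one recovery node.
    seen: dict[str, str] = {}
    for node in recovery_nodes:
        branch = branch_of[node]
        if branch in seen:
            errors.append(
                f"branch {branch!r} has more than one recovery node: "
                f"{seen[branch]!r} and {node!r}"
            )
        else:
            seen[branch] = node
    return errors


# Two recovery nodes in different branches of a parallel node: accepted.
print(check_recovery_request(["grasp", "scan"], {"grasp": "left", "scan": "right"}))
# Two recovery nodes in the same branch: rejected.
print(check_recovery_request(["grasp", "scan"], {"grasp": "left", "scan": "left"}))
```

In this model, a branch without an entry in `recovery_nodes` needs no special handling, which mirrors the rule that such a branch starts from the beginning.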
Recovering from a failed run
In a typical recovery workflow you first determine the nature of the failure and what state to recover from. Next, you restore or create a state that allows continuing the process at the point you want it to continue. Finally, you restart the operation from the desired recovery point.
- Identify the failure point: Determine which node failed, why it failed (use the ExtendedStatus report), and where you want to resume execution. It is not required to recover from the failed node.
- Save Blackboard State: Use the Blackboard Snapshots functionality to save the current state of the blackboard.
- Perform recovery operations: This can be any number of steps depending on the failure, e.g., resetting hardware, having an operator clear obstructions, or place parts in specific locations. This might involve starting different processes on the workcell, for example to open the gripper or move the robot back to home.
- Reload the failed process: This will be the starting point of the recovery. The following steps restore the solution to a state from which the process can resume.
- Restore Blackboard State: Restore the previously saved snapshot to restore the blackboard to a valid state. This ensures that the resuming nodes have the necessary data (e.g., return values from previous skills) to execute correctly.
- Adapt the belief world: Use the world service to adapt the belief world. For example, if the robot dropped an object during a motion, it might still believe the object to be in the gripper.
- Start Operation with Recovery: Call StartOperation with the identified recovery_nodes.
Blackboard and belief world restoration are only required when a recovery point depends on pre-existing data. For instance, if a robot resumes using a pose stored on the blackboard, that data must be restored. Conversely, if the recovery point is set to the perception skill itself, the system will regenerate the pose data automatically, allowing you to skip the restoration step for a more robust 'clean start'.
As a concrete case, consider a skill that moves the robot to a pose detected by a perception skill. If you recover from the move skill, save and restore the blackboard; otherwise the pose data will not be available. If you instead recover starting with the perception skill for more robustness, the data is generated again and does not need to be restored.
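The order of the workflow steps can be sketched as a small driver function. Everything here is an assumption for illustration: `FakeExecutive` is a stand-in that only records calls, its method names merely mirror the operations named in this document, and the real service signatures may differ. The `restore_blackboard` flag captures the decision described above about whether the recovery point depends on pre-existing data.

```python
class FakeExecutive:
    """Hypothetical stand-in that records calls instead of contacting a service."""

    def __init__(self):
        self.calls = []

    def create_blackboard_snapshot(self):
        self.calls.append("create_blackboard_snapshot")
        return "snapshot-handle"  # placeholder handle

    def delete_operation(self):
        self.calls.append("delete_operation")

    def create_operation(self, process_id):
        self.calls.append(f"create_operation({process_id})")

    def load_blackboard_snapshot(self, handle):
        self.calls.append(f"load_blackboard_snapshot({handle})")

    def start_operation(self, recovery_nodes):
        self.calls.append(f"start_operation(recovery_nodes={recovery_nodes})")


def recover(client, process_id, recovery_nodes, restore_blackboard,
            perform_recovery=lambda: None, adapt_belief_world=lambda: None):
    """Run the recovery steps in the order described above."""
    handle = None
    if restore_blackboard:
        # Save the blackboard before the failed operation is removed.
        handle = client.create_blackboard_snapshot()
    perform_recovery()                   # e.g. reset hardware, clear obstructions
    client.delete_operation()            # drop the failed operation
    client.create_operation(process_id)  # reload the failed process
    if restore_blackboard:
        client.load_blackboard_snapshot(handle)
    adapt_belief_world()                 # e.g. correct an object that was dropped
    client.start_operation(recovery_nodes)


client = FakeExecutive()
recover(client, "my.process.id", ["move_node"], restore_blackboard=True)
print("\n".join(client.calls))
```

Passing `restore_blackboard=False` corresponds to recovering at a node that regenerates its own inputs, such as the perception skill in the example above.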
Examples
These examples are intentionally kept small so that they are easy to follow. In practice, processes this small might not require a dedicated recovery procedure. However, the same ideas apply to large-scale processes: imagine every single skill in these examples being a complex sub-process.
Recover into a sequence
Consider a process SimpleInspection (with id ai.intrinsic.simple_inspection). It is a simple sequence of the following skill executions: Move To Inspection Pose, Inspect Object, Move Home.
During execution the process fails at the Inspect Object step. Upon investigation an operator checks the camera images and finds that the camera lid is closed. The issue is fixed and now the recovery begins.
- Load the SimpleInspection process: Delete the current (failed) operation and call CreateOperation with ai.intrinsic.simple_inspection. Alternatively, in this simple case the failed process can also just be reset by calling ResetOperation.
- Determine Recovery Point: As there were no changes to the workcell besides fixing the camera issue, one can start from the Inspect Object node that previously failed.
- Call StartOperation with the node ID of the Inspect Object node: This starts the process continuing with Inspect Object. The previous step(s) will not have to be executed again.
Recover with an intermediate process retaining blackboard values
We will build a new process LoopedInspection (id: ai.intrinsic.looped_inspection) by wrapping the sequence from the previous process in a loop node Main Loop with 5 iterations. Its loop counter key is set to main_loop_counter.
The process runs for two iterations and fails in the third, again at Inspect Object. The camera reports an error and the recovery procedure starts. The operator wants to investigate the camera.
- Save the current blackboard by calling CreateBlackboardSnapshot: This saves the current state of the blackboard to a handle that we store. Note that by setting the main_loop_counter key, the loop counter is a value on the blackboard.
- Delete the failed operation and run another process: Running a different process than the failed one can help in recovery. In this simple case that might just move the robot back to the home position, making the camera easily accessible for the operator.
- The operator determines that a cable has become loose and fixes the issue: This means that the process can safely be restarted.
- Delete the recovery process and load LoopedInspection back: Call DeleteOperation and then CreateOperation with ai.intrinsic.looped_inspection.
- Restore the blackboard snapshot: Call LoadBlackboardSnapshot with the handle from step 1 and the current operation. Note that this contains the loop counter from the failed process.
- Determine Recovery Point: The robot was moved back to its home position. Thus, although the failure happened at Inspect Object, we will restart in the main sequence at Move To Inspection Pose.
- Call StartOperation with the node ID of Move To Inspection Pose: This will start the process beginning at the given node, but inside the loop. As the main_loop_counter was restored after creating the operation, the process now starts at the beginning of the sequence, but already in the third iteration. The loop will execute three times, from the third to the fifth iteration, and then finish.
- Delete the snapshot: This snapshot is not needed any more as it was taken for the recovery and thus must be deleted.
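To see why the restored counter leads to exactly three more iterations, the loop can be modeled in a few lines. This is a toy model, not Executive code: the dictionary stands in for the blackboard, `resume_loop` is a hypothetical name, and the counter is assumed to be zero-based and incremented at the end of each iteration, which matches the values used in these examples (a stored value of 2 during the third iteration).

```python
STEPS = ["Move To Inspection Pose", "Inspect Object", "Move Home"]


def resume_loop(blackboard, recovery_node, total_iterations=5,
                counter_key="main_loop_counter"):
    """Toy model of resuming a loop; returns executed (iteration, step) pairs."""
    executed = []
    first_step = STEPS.index(recovery_node)
    counter = blackboard[counter_key]  # zero-based: 2 means third iteration
    while counter < total_iterations:
        for step in STEPS[first_step:]:
            executed.append((counter + 1, step))  # report 1-based iteration
        first_step = 0  # later iterations run the full sequence
        counter += 1
        blackboard[counter_key] = counter
    return executed


restored = {"main_loop_counter": 2}  # value taken from the snapshot
executed = resume_loop(restored, "Move To Inspection Pose")
iterations = sorted({iteration for iteration, _ in executed})
print(iterations)  # [3, 4, 5]
```

Without the restored snapshot, the counter would start from its initial value and the loop would run all five iterations again.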
Recover into a specific loop iteration
The same LoopedInspection process is run again. It runs for two iterations as before, but this time fails at the Move Home step with a hardware error from the robot.
The operator is called in and determines that the E-Stop was pressed, possibly by accident. The recovery procedure starts similarly as before.
- Save the current blackboard by calling CreateBlackboardSnapshot: This saves the current state of the blackboard to a handle that we store. Note that by setting the main_loop_counter key, the loop counter is a value on the blackboard.
- Delete the failed operation and run a recovery process: In this case the operator just releases the E-Stop and moves the robot around to some test poses to verify everything is operational again. Finally, move the robot back to its home position.
- Delete the recovery process and load LoopedInspection back: Call DeleteOperation and then CreateOperation with ai.intrinsic.looped_inspection.
- Restore the blackboard snapshot: Call LoadBlackboardSnapshot with the handle from step 1 and the current operation. Note that this contains the loop counter from the failed process.
- Determine Recovery Point: The robot failed at Move Home. This was after the Inspect Object step had been executed. Thus the objective of that loop iteration has already been achieved and we will restart in the main sequence at Move To Inspection Pose.
- Update the loop counter: As the previous inspection step has already been completed, it is not necessary to run that iteration again. Thus retrieve the stored value of main_loop_counter from the blackboard (in this example 2) and update the value of main_loop_counter to increase it by one (i.e., set it to 3), see writing blackboard.
- Call StartOperation with the node ID of Move To Inspection Pose: This will start the process beginning at the given node inside the loop. The process now starts at the beginning of the sequence, but already in the fourth iteration. The loop will execute two times now.
- Delete the snapshot: This snapshot is not needed any more as it was taken for the recovery and thus must be deleted.
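The counter arithmetic from the loop counter update can be written out explicitly. In this sketch a plain dictionary stands in for the restored blackboard and `skip_completed_iteration` is a hypothetical helper; on a real system the value is read and written through the blackboard API, which is not shown here. The counter is assumed to be zero-based, as the example values suggest.

```python
TOTAL_ITERATIONS = 5  # Main Loop is configured with 5 iterations


def skip_completed_iteration(blackboard, counter_key="main_loop_counter"):
    """Return a copy of the blackboard with the loop counter advanced by one."""
    updated = dict(blackboard)  # leave the restored values untouched
    updated[counter_key] = blackboard[counter_key] + 1
    return updated


restored = {"main_loop_counter": 2}  # third iteration was in progress
patched = skip_completed_iteration(restored)
remaining = TOTAL_ITERATIONS - patched["main_loop_counter"]
print(patched["main_loop_counter"], remaining)  # 3 2
```

The process therefore resumes in the fourth iteration and runs the loop two more times.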