Recovery

Recovery allows you to continue the execution of a process from a specific point. The Executive automatically bypasses completed steps, treating them as successful so that the process can resume immediately from the chosen recovery point. This is particularly useful for recovering from failed runs without re-executing the entire process.

Concept

When you start an operation with recovery nodes, the Executive fast-forwards the state of the behavior tree as if all nodes preceding the recovery nodes had already executed successfully. Execution then resumes from the specified recovery nodes.
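The fast-forward idea can be illustrated with a toy model. This is a minimal sketch, not the Executive's implementation: the `Sequence` class, the `State` enum, and the step names are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"


@dataclass
class Sequence:
    """Toy stand-in for a sequence node; NOT the Executive's data model."""

    children: list[str]
    states: dict[str, State] = field(default_factory=dict)

    def fast_forward(self, recovery_node: str) -> list[str]:
        """Mark every child before recovery_node as succeeded.

        Returns the children that still have to run, in order.
        """
        if recovery_node not in self.children:
            raise ValueError(f"unknown recovery node: {recovery_node!r}")
        index = self.children.index(recovery_node)
        for name in self.children:
            self.states[name] = State.PENDING
        for name in self.children[:index]:
            # Preceding steps are treated as if they had already succeeded.
            self.states[name] = State.SUCCEEDED
        return self.children[index:]


seq = Sequence(children=["step_a", "step_b", "step_c"])
remaining = seq.fast_forward("step_b")
print(remaining)  # ['step_b', 'step_c']
```

Only the steps from the recovery node onward remain to be executed; `step_a` is recorded as successful without running.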

Usage

To use recovery, you specify the recovery_nodes field in the StartOperationRequest.

Specifying recovery nodes

The recovery_nodes field takes a list of BehaviorTree.NodeIdentifier. Each execution branch supports only one recovery node. Specifying multiple recovery nodes is only possible when recovering inside a parallel node. If a branch of a parallel node has no recovery node specified, that branch is started from the beginning.

info

Recovery is incompatible with specifying a start node via start_tree_id/start_node_id.
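The rules above can be written down as a small client-side check. This is only a sketch under assumptions: the `NodeIdentifier` dataclass stands in for BehaviorTree.NodeIdentifier with assumed field names, and `branch_of` is a hypothetical lookup from a node to the execution branch that contains it. The Executive performs its own validation, and its error behavior may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeIdentifier:
    """Stand-in for BehaviorTree.NodeIdentifier; field names are assumed."""

    tree_id: str
    node_id: int


def check_recovery_request(
    recovery_nodes: list[NodeIdentifier],
    branch_of: dict[NodeIdentifier, str],
    start_node: Optional[NodeIdentifier] = None,
) -> None:
    """Raise ValueError if the request breaks one of the documented rules.

    branch_of is a hypothetical lookup from a node to its execution branch
    (for example, one entry per child of a parallel node).
    """
    if recovery_nodes and start_node is not None:
        raise ValueError("recovery_nodes cannot be combined with a start node")

    chosen: dict[str, NodeIdentifier] = {}
    for node in recovery_nodes:
        branch = branch_of[node]
        if branch in chosen:
            raise ValueError(
                f"branch {branch!r} already has recovery node {chosen[branch]}"
            )
        chosen[branch] = node


# Two recovery nodes are fine when they sit in different parallel branches.
left = NodeIdentifier("tree", 4)
right = NodeIdentifier("tree", 7)
branches = {left: "parallel/0", right: "parallel/1"}
check_recovery_request([left, right], branches)
```

A branch that does not appear in `chosen` simply has no recovery node and would be started from the beginning.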

Recovering from a failed run

In a typical recovery workflow you first determine the nature of the failure and what state to recover from. Next, you restore or create a state that allows continuing the process at the point you want it to continue. Finally, you restart the operation from the desired recovery point.

  1. Identify the failure point: Determine which node failed, why it failed (use the ExtendedStatus report), and where you want to resume execution. It is not required to recover from the failed node.
  2. Save Blackboard State: Use the Blackboard Snapshots functionality to save the current state of the blackboard.
  3. Perform recovery operations: This can be any number of steps depending on the failure, e.g., resetting hardware, having an operator clear obstructions, or placing parts in specific locations. This might involve starting different processes on the workcell, for example to open the gripper or move the robot back to home.
  4. Reload the failed process: This will be the starting point of the recovery. The following steps restore the solution to a state from which the process can be started.
  5. Restore Blackboard State: Restore the previously saved snapshot to restore the blackboard to a valid state. This ensures that the resuming nodes have the necessary data (e.g., return values from previous skills) to execute correctly.
  6. Adapt the belief world: Use the world service to adapt the belief world. For example, if the robot dropped an object during a motion, it might still believe the object to be in the gripper.
  7. Start Operation with Recovery: Call StartOperation with the identified recovery_nodes.

note

Blackboard and belief world restoration are only required when a recovery point depends on pre-existing data. For instance, if a robot resumes using a pose stored on the blackboard, that data must be restored. Conversely, if the recovery point is set to the perception skill itself, the system will regenerate the pose data automatically, allowing you to skip the restoration step for a more robust 'clean start'.

For example, consider a skill that moves the robot to the pose detected by a perception skill. If you recover from the move skill, save and restore the blackboard; otherwise the pose data will not be available. If you instead recover from the perception skill for more robustness, the pose data is generated again and does not need to be restored.
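The workflow above can be sketched end to end against an in-memory stand-in. Everything here is an assumption made for illustration: `FakeExecutive` is hypothetical, its method names only mirror the RPC names used on this page, and the signatures are invented. In particular, the sketch assumes that deleting an operation discards its blackboard, which is why the snapshot is restored after reloading.

```python
import copy
import itertools


class FakeExecutive:
    """Hypothetical in-memory stand-in; not the real Executive client."""

    def __init__(self):
        self.blackboard = {}
        self.snapshots = {}
        self.operation = None
        self.log = []
        self._ids = itertools.count(1)

    def create_blackboard_snapshot(self):
        handle = f"snapshot-{next(self._ids)}"
        self.snapshots[handle] = copy.deepcopy(self.blackboard)
        self.log.append("snapshot")
        return handle

    def delete_operation(self):
        # Assumption for this sketch: the blackboard goes away with the operation.
        self.operation = None
        self.blackboard = {}
        self.log.append("delete")

    def create_operation(self, process_id):
        self.operation = process_id
        self.log.append(f"create:{process_id}")

    def load_blackboard_snapshot(self, handle):
        self.blackboard = copy.deepcopy(self.snapshots[handle])
        self.log.append("restore")

    def start_operation(self, recovery_nodes=()):
        self.log.append(f"start:{list(recovery_nodes)}")


def recover(executive, process_id, recovery_nodes, fix_workcell, adapt_world):
    """Run steps 2 to 7 of the workflow; step 1 (diagnosis) happens beforehand."""
    handle = executive.create_blackboard_snapshot()  # 2. save blackboard
    fix_workcell()                                   # 3. recovery operations
    executive.delete_operation()                     # 4. reload the process
    executive.create_operation(process_id)
    executive.load_blackboard_snapshot(handle)       # 5. restore blackboard
    adapt_world()                                    # 6. adapt the belief world
    executive.start_operation(recovery_nodes)        # 7. start with recovery


executive = FakeExecutive()
executive.create_operation("my_process")
executive.blackboard["detected_pose"] = (0.1, 0.2, 0.3)
recover(
    executive,
    "my_process",
    ["move_to_pose"],
    fix_workcell=lambda: None,
    adapt_world=lambda: None,
)
print(executive.log[-1])  # start:['move_to_pose']
```

If the recovery point regenerates the data it needs, the snapshot and restore calls can be left out, as the note above describes.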

Examples

note

These examples are intentionally kept small so that they are easy to follow. In practice, processes this small might not require a dedicated recovery procedure. However, the same ideas apply to large-scale processes: imagine every single skill in these examples being a complex sub-process.

Recover into a sequence

Consider a process SimpleInspection (with id ai.intrinsic.simple_inspection). It is a simple sequence of the following skill executions: Move To Inspection Pose, Inspect Object, Move Home.

During execution the process fails at the Inspect Object step. Upon investigation an operator checks the camera images and finds that the camera lid is closed. The issue is fixed and now the recovery begins.

  1. Load the SimpleInspection process: Delete the current (failed) operation and call CreateOperation with ai.intrinsic.simple_inspection. Alternatively, in this simple case the failed process can be reset by calling ResetOperation.
  2. Determine Recovery Point: Because nothing in the workcell changed besides fixing the camera issue, you can start from the Inspect Object node that previously failed.
  3. Call StartOperation with the node ID of the Inspect Object node: The process continues with Inspect Object. The preceding steps are not executed again.

Recover with an intermediate process retaining blackboard values

We will build a new process LoopedInspection (id: ai.intrinsic.looped_inspection) by wrapping the sequence from the previous process in a loop node Main Loop with 5 iterations. Its loop counter key is set to main_loop_counter.

The process runs for two iterations and fails in the third, again at Inspect Object. The camera reports an error and the recovery procedure starts. The operator wants to investigate the camera.

  1. Save the current blackboard by calling CreateBlackboardSnapshot: This saves the current state of the blackboard under a handle that we store. Because the loop counter key is set to main_loop_counter, the loop counter is a value on the blackboard.
  2. Delete the failed operation and run another process: Running a process other than the failed one can help with recovery. In this simple case it might just move the robot back to the home position so that the camera is easily accessible for the operator.
  3. The operator determines that a cable has become loose and fixes the issue: The process can now safely be restarted.
  4. Delete the recovery process and load LoopedInspection again: Call DeleteOperation and then CreateOperation with ai.intrinsic.looped_inspection.
  5. Restore the blackboard snapshot: Call LoadBlackboardSnapshot with the handle from step 1 and the current operation. The snapshot contains the loop counter from the failed process.
  6. Determine Recovery Point: The robot was moved back to its home position. Thus, although the failure happened at Inspect Object, we restart in the main sequence at Move To Inspection Pose.
  7. Call StartOperation with the node ID of Move To Inspection Pose: This starts the process at the given node inside the loop. Because main_loop_counter was restored after creating the operation, the process starts at the beginning of the sequence but already in the third iteration. The loop executes three times, from the third to the fifth iteration, and then finishes.
  8. Delete the snapshot: The snapshot was taken only for this recovery and is no longer needed, so delete it.
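To see why restoring main_loop_counter makes the process resume in the third iteration, consider this toy loop. It is a sketch only: `run_loop` is a hypothetical model, not how the Executive evaluates loop nodes, and it assumes a zero-based counter so that the third iteration has the value 2.

```python
def run_loop(steps, max_iterations, blackboard, counter_key, recovery_node=None):
    """Return the (iteration, step) pairs that execute.

    Iterations are reported 1-based for readability; the counter stored on
    the blackboard is assumed to be zero-based.
    """
    trace = []
    start_index = steps.index(recovery_node) if recovery_node else 0
    counter = blackboard.get(counter_key, 0)
    first_pass = True
    while counter < max_iterations:
        blackboard[counter_key] = counter
        # Only the first resumed iteration skips the steps before the
        # recovery node; later iterations run the full sequence.
        begin = start_index if first_pass else 0
        for step in steps[begin:]:
            trace.append((counter + 1, step))
        first_pass = False
        counter += 1
    return trace


steps = ["Move To Inspection Pose", "Inspect Object", "Move Home"]
restored = {"main_loop_counter": 2}  # value restored from the snapshot
trace = run_loop(
    steps, 5, restored, "main_loop_counter",
    recovery_node="Move To Inspection Pose",
)
print(sorted({iteration for iteration, _ in trace}))  # [3, 4, 5]
```

Without the restored counter the same call would start at iteration 1 and repeat the two inspections that had already succeeded.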

Recover into a specific loop iteration

The same LoopedInspection process is run again. It completes two iterations as before, but this time fails in the third iteration at the Move Home step with a hardware error from the robot.

The operator is called in and determines that the E-Stop was pressed, possibly by accident. The recovery procedure starts similarly as before.

  1. Save the current blackboard by calling CreateBlackboardSnapshot: This saves the current state of the blackboard under a handle that we store. Because the loop counter key is set to main_loop_counter, the loop counter is a value on the blackboard.
  2. Delete the failed operation and run a recovery process: In this case the operator just releases the E-Stop and moves the robot to some test poses to verify that everything is operational again. Finally, the operator moves the robot back to its home position.
  3. Delete the recovery process and load LoopedInspection again: Call DeleteOperation and then CreateOperation with ai.intrinsic.looped_inspection.
  4. Restore the blackboard snapshot: Call LoadBlackboardSnapshot with the handle from step 1 and the current operation. The snapshot contains the loop counter from the failed process.
  5. Determine Recovery Point: The robot failed at Move Home, after the Inspect Object step had been executed. The objective of that loop iteration has therefore already been achieved, and we restart in the main sequence at Move To Inspection Pose.
  6. Update the loop counter: Because the previous inspection step has already been completed, it is not necessary to run that iteration again. Retrieve the stored value of main_loop_counter from the blackboard (in this example 2) and increase it by one (i.e., set it to 3); see writing blackboard.
  7. Call StartOperation with the node ID of Move To Inspection Pose: This starts the process at the given node inside the loop. The process now starts at the beginning of the sequence but already in the fourth iteration. The loop executes two more times.
  8. Delete the snapshot: The snapshot was taken only for this recovery and is no longer needed, so delete it.
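The counter adjustment in this example comes down to a small piece of arithmetic. The sketch below uses hypothetical helper names and a plain dict in place of the blackboard; it assumes the zero-based counter implied by the stored value 2 in the third iteration.

```python
def counter_for_resume(saved_counter: int, iteration_goal_achieved: bool) -> int:
    """Return the loop counter to write back before restarting.

    If the failed iteration already achieved its objective, skip it by
    advancing the counter; otherwise repeat that iteration.
    """
    return saved_counter + 1 if iteration_goal_achieved else saved_counter


def iterations_left(counter: int, max_iterations: int) -> int:
    """Number of iterations that still run for a zero-based counter."""
    return max(0, max_iterations - counter)


restored_blackboard = {"main_loop_counter": 2}  # value from the snapshot
restored_blackboard["main_loop_counter"] = counter_for_resume(
    restored_blackboard["main_loop_counter"], iteration_goal_achieved=True
)
print(restored_blackboard["main_loop_counter"])  # 3
print(iterations_left(restored_blackboard["main_loop_counter"], 5))  # 2
```

Leaving the counter at 2, as in the previous example, would instead give three remaining iterations and repeat the inspection that had already succeeded.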