
Safety Manager Design

Joshua Williams edited this page Nov 11, 2021 · 3 revisions

Safety Manager

The safety manager is, admittedly, going to be a lot of tedious work that hopefully never pays off. However, in the event that it is needed, this could potentially save lives.

Responsibilities

The safety manager is responsible for handling technical failures of the system and cases where a human needs to take over operation, not for ordinary operation! This means that a lot of things that could be classified as "safety" are not in fact the job of the safety manager. For example, although safety-related, the following are NOT responsibilities of this node:

  • Keeping our speed under the speed limit
  • Stopping at stop signs
  • Steering around an obstacle (exception: if the software fails to find a path and a human driver needs to take over, it is the safety manager that steps in)

In general, anything that occurs on a regular basis or is already built into the rest of the software does not need to involve the safety manager. The responsibilities of this node, then, are as follows:

  • Monitor system vitals
  • Monitor heartbeats
  • Monitor safety events
  • Assign recovery strategies
  • Trigger alarm
  • Manage Echo node

Monitor System Vitals

The safety manager will monitor certain system-wide elements for anomalous operation. Some examples that we need to monitor are:

  • CPU, GPU, RAM, and disk usage - These should fall within some acceptable range
  • CAN bus - This acts as a "heartbeat" of sorts for the entire system
  • Temperature - An overheated processor counts as a safety issue for sure!
  • Power - This is not the battery level (which should be monitored by whichever node decodes it), but rather voltage to the onboard computer. If this drops, the safety manager should handle it in whatever way is appropriate
  • Issues with ROS itself - this will take some research to properly implement
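As a rough sketch of the "acceptable range" idea above, each vital can be compared against a table of limits. The vital names and the numeric bounds here are placeholders for illustration, not measured values, and how the readings are actually collected (OS counters, CAN decoding, etc.) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Acceptable closed range for one vital (bounds are illustrative)."""
    low: float
    high: float


# Hypothetical limits; real values would come from measurement on the vehicle.
LIMITS = {
    "cpu_percent": Limit(0.0, 90.0),
    "ram_percent": Limit(0.0, 85.0),
    "disk_percent": Limit(0.0, 95.0),
    "cpu_temp_c": Limit(0.0, 85.0),
    "supply_voltage": Limit(11.0, 14.5),
}


def out_of_range(readings, limits=LIMITS):
    """Return the names of vitals that are missing or outside their range."""
    problems = []
    for name, limit in limits.items():
        value = readings.get(name)
        # A missing reading is treated as a failure: silence is not "healthy".
        if value is None or not (limit.low <= value <= limit.high):
            problems.append(name)
    return problems
```

Treating a missing reading as a problem is a design choice in this sketch: if the sensor that reports voltage stops reporting, that is itself worth raising.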

Monitor Heartbeats

If it is possible for a node to fail in such a way that the failure will not be noticed by other nodes and raised as a safety issue, it may need a heartbeat. This may not be necessary for any nodes, or it may be necessary for several, depending on how we design the rest of the system. For any such nodes, the safety manager should listen for the heartbeat.
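One way the listening side could work is a small bookkeeping class that remembers when each node was last heard from. This is only a sketch: the class name, the timeout value, and the node names are made up here, and in the real node the beat() call would be driven by a ROS subscription callback.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def expect(self, node):
        """Register a node; its timeout window starts now."""
        self.last_seen[node] = self.clock()

    def beat(self, node):
        """Record a heartbeat from a node."""
        self.last_seen[node] = self.clock()

    def silent_nodes(self):
        """Return the sorted names of nodes whose last beat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
```

A monotonic clock is used by default so that wall-clock adjustments cannot fake or hide a missed heartbeat.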

Monitor Safety Events

This node is responsible for hosting a "safety event" service through which all nodes can raise alarms when things go wrong.

These messages are defined in navigator/src/msg/voltron_msgs/srv/SafetyEvent.srv and will have the following properties:

  • Event id - used to easily look up the event type
  • Description - a human-readable message to accompany the event
  • Sequence number - used to keep order-sensitive messages in order
  • Status - one of the following
    • Resolved - the event has been resolved
    • Working - the node raising the event has resources at its disposal to resolve the event
      • Example - the CAN bus has failed, but we are going to use the next 100ms to try restarting it at the OS level
    • Unresolved - the node can do no more on its own to resolve the event
  • Additional Data - a JSON string that contains additional helpful data about the event. This is for event-specific data. If a single field is common across multiple event types, consider adding it directly to the message itself.
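To make the fields above concrete, here is a plain-Python mirror of the event. It is not the actual SafetyEvent.srv definition: the field names, the numeric status values, and the current_status helper are assumptions for illustration. The helper shows one possible use of the sequence number, where the newest message for each event id wins even if messages arrive out of order.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Numeric values are placeholders, not taken from the .srv file.
    RESOLVED = 0
    WORKING = 1
    UNRESOLVED = 2


@dataclass(frozen=True)
class SafetyEvent:
    event_id: int
    description: str
    sequence_number: int
    status: Status
    additional_data: str = "{}"  # JSON string, as described above

    def data(self):
        """Decode the event-specific JSON payload."""
        return json.loads(self.additional_data)


def current_status(events):
    """Map each event id to the status carried by its highest sequence number."""
    newest = {}
    for ev in events:
        seen = newest.get(ev.event_id)
        if seen is None or ev.sequence_number > seen.sequence_number:
            newest[ev.event_id] = ev
    return {eid: ev.status for eid, ev in newest.items()}
```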

Assign Recovery Strategies

Each node will need to implement a set of "recovery strategies" that will allow operation under a particular anomalous circumstance. Each node hosts a "control service", which allows the safety manager to dispatch commands (and also to release the node when operation returns to normal).

These messages will be defined in navigator/src/msg/voltron_msgs/srv/SafetyCommand.srv and will have the following properties:

  • Strategy - The id of the recovery strategy being assigned. 0 is reserved for normal operation, and 255 for graceful termination
  • Sequence number - used to keep order-sensitive messages in order
  • Additional data - a JSON string that contains additional helpful data about the command. This is for strategy-specific data. If a single field is common across multiple strategies, consider adding it directly to the message itself.
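On the receiving side, a node's control service might look roughly like the following. Only the reserved ids 0 and 255 come from the design above; the class names, the accept/reject return value, and the rule that stale sequence numbers are ignored are assumptions of this sketch.

```python
from dataclasses import dataclass

NORMAL_OPERATION = 0        # reserved by the design above
GRACEFUL_TERMINATION = 255  # reserved by the design above


@dataclass(frozen=True)
class SafetyCommand:
    strategy: int
    sequence_number: int
    additional_data: str = "{}"  # JSON string


class ControlService:
    """Node-side handler: tracks the active strategy and ignores stale commands."""

    def __init__(self, supported):
        # Every node is assumed to support the two reserved strategies.
        self.supported = set(supported) | {NORMAL_OPERATION, GRACEFUL_TERMINATION}
        self.active = NORMAL_OPERATION
        self.last_sequence = -1

    def handle(self, cmd):
        """Apply a command; return True if it was accepted."""
        if cmd.sequence_number <= self.last_sequence:
            return False  # stale or duplicate command
        if cmd.strategy not in self.supported:
            return False  # this node does not implement the strategy
        self.last_sequence = cmd.sequence_number
        self.active = cmd.strategy
        return True
```

Sending strategy 0 with a fresh sequence number is how the safety manager would release the node back to normal operation in this sketch.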

Trigger Alarm

The safety manager should control the "human intervention" alarm that alerts a human driver when they need to take the wheel. I'm not sure how to implement this yet, but it needs to be a sort of dead man's switch in the hardware (if, say, the Jetson is unplugged, the alarm needs to trigger). To that end, the safety manager needs to continually send an "all clear" signal to keep the alarm off, so that the alarm will trigger if the safety manager itself is disrupted.
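The real switch has to live in hardware, but the logic it needs is simple enough to model in a few lines. In this sketch (the class name and window length are made up), the alarm is on by default and is only held off while "all clear" signals keep arriving:

```python
import time


class AlarmWatchdog:
    """Alarm sounds unless an "all clear" arrived within the last window_s seconds."""

    def __init__(self, window_s, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self.last_all_clear = None  # no signal yet, so the alarm starts active

    def all_clear(self):
        """Called each time the safety manager reports that everything is fine."""
        self.last_all_clear = self.clock()

    def alarm_active(self):
        if self.last_all_clear is None:
            return True
        return self.clock() - self.last_all_clear > self.window_s
```

The important property is that doing nothing (a crash, a hang, a pulled cable) leads to the alarm, never to silence.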

Manage Echo Node

The echo node is a second node that monitors a heartbeat from the safety manager, and also sends a heartbeat to the safety manager. This provides a way to ensure that ROS messages are being passed correctly (and to measure any delays in message delivery), and also provides a backup should the safety manager fail (the echo node can restart it).

Implementation Notes

The following are a few notes that may be useful for the implementation of this node:

Splitting

This node has many responsibilities, and is something of a monolith. However, splitting it up into multiple nodes adds more points of failure (in their communication). So, this part of the code is probably best left as a single node.

Testing

Testing this may be tedious, but it is obviously important. We need to trigger each type of event artificially and ensure that the car responds appropriately.

Event Strategies

The majority of the work (and maintenance) on this node will be building up a huge table of all the possible events and what to do in each case. This should NOT be a nested if statement or a switch statement - such constructs will quickly grow to be unusable. Rather, we need to develop a way to store this in some easy-to-read format.

One possibility would be to create a "strategy" interface that we subclass once for each event, allowing us to separate code for each event into its own file. We can then store these objects in a lookup table by event id, allowing them to be quickly loaded when needed.
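A minimal sketch of that idea, using Python for brevity, might look like this. The registry, the decorator, the event id, and the node name are all hypothetical; the point is only that adding a new event means adding a new class, not editing a growing conditional.

```python
from abc import ABC, abstractmethod


class Strategy(ABC):
    """One subclass per event type; each could live in its own file."""
    event_id: int

    @abstractmethod
    def respond(self, event_data):
        """Return a list of (node_name, strategy_id) assignments."""


REGISTRY = {}  # lookup table: event id -> strategy instance


def register(cls):
    """Class decorator that adds a strategy instance to the lookup table."""
    if cls.event_id in REGISTRY:
        raise ValueError(f"duplicate strategy for event {cls.event_id}")
    REGISTRY[cls.event_id] = cls()
    return cls


@register
class CanBusFailure(Strategy):
    event_id = 1  # hypothetical id

    def respond(self, event_data):
        # Hypothetical response: ask the planner to terminate gracefully.
        return [("motion_planner", 255)]


def dispatch(event_id, event_data):
    """Look up the strategy for an event; None means no strategy is known."""
    strategy = REGISTRY.get(event_id)
    if strategy is None:
        return None  # caller decides what to do, e.g. trigger the alarm
    return strategy.respond(event_data)
```

Rejecting duplicate ids at registration time catches copy-paste mistakes when the table grows large.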
