Sympathy
From CSL Wiki
Sympathy is a failure detection and debugging system for WSN.
Functional Overview
Also refer to the Sensys slides
Sympathy is a system to aid in debugging sensor networks. Sympathy is designed for data-gathering sensor networks where a centralized sink is collecting data from nodes. Nodes transmit various metrics to the centralized sink at regular intervals, and the sink analyzes these metrics to identify potential failures.
While using the code here for other applications may be difficult because this is a first revision, we hope the more important lessons that people can take away are:
1) metrics we chose
2) simple analysis (using networking concepts of conservation of flow) of metrics to identify network failures
3) failure localization (based on reported node topology) to minimize the number of failures a user is notified of at any one time
There is Sympathy code on:
1) the nodes in the network, which collects these metrics and transmits them to the sink
2) the sink, which receives metrics and analyzes them for failures
Metrics
The sink collects metrics in two ways: from regular transmission of metrics by nodes, and by snooping the channel. Snooping the channel is optional, and is made possible by a routing-specific plug-in at the sink which understands the routing header and can translate packets into a known data structure for the sink. Nodes send metrics to the sink once every metric period (defined in Sympathy.h).
There are three types of metrics that nodes transmit to the sink: connectivity, flow/node, and generic. Connectivity (routing) metrics are sent in one packet, and the other two metric types are sent in the remaining packets (usually just one, depending on the number of components that have been instrumented).
1) Connectivity metrics (transmitted over the air using the datastructure: Smetrics_t defined in Sympathy.h) consist of:
* a node's routing information: the final destination, the routing parent (i.e. the next hop), and the path quality to that parent (using the [http://cvs.cens.ucla.edu/lxr/source/tos-contrib/sympathy/tos/interfaces/GetNextHop.nc GetNextHop] interface to transmit to Sympathy)
* a node's neighbor table: a list of neighbors, with their incoming and outgoing link qualities (using the RadioDebugNBI interface to transmit to Sympathy)
2) Flow/node metrics (transmitted over the air using the datastructure: Sympathy_comp_stats_t defined in Sympathy.h) consist of node metrics and flow metrics
Flow metrics are critical for Sympathy's failure analysis; they are essentially information that helps track the flow of packets from a source to a destination. Flow metrics are collected from every component on a node that is the source or destination of a distinct flow of packets in the network. Node metrics are the uptime (time the node has been awake) and the number of good packets and CRC-error packets the node overall has received.
3) Generic metrics (transmitted over the air using the datastructure: generic_comp_stats_t defined in Sympathy.h). This data structure allows the user to transmit any user-specified metrics. However, in order for the metrics to be properly printed out at the sink, the user must also write a module that will identify these generic metrics and translate them to ascii.
NODES DO NOT RESET METRIC COUNTERS. Instead, the sink calculates the difference in the current value and the previously recorded value for each counter, and aggregates that. This way, it can handle counter roll-over, and recover from lost metrics packets.
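The delta computation can be sketched in C as follows. This is a minimal illustration, not Sympathy's actual sink code: it assumes 16-bit counters, and the names counter_track_t, counter_delta and counter_update are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-counter state kept at the sink (not a Sympathy type). */
typedef struct {
    uint16_t last_reported; /* raw value from the most recent metrics packet */
    uint32_t total;         /* aggregated count across all reports */
} counter_track_t;

/* Difference between two raw readings of a free-running 16-bit counter.
 * Unsigned arithmetic wraps modulo 2^16, so one roll-over is absorbed. */
static uint16_t counter_delta(uint16_t previous, uint16_t current)
{
    return (uint16_t)(current - previous);
}

/* Fold a newly reported raw value into the running total. */
static void counter_update(counter_track_t *c, uint16_t reported)
{
    c->total += counter_delta(c->last_reported, reported);
    c->last_reported = reported;
}
```

Because the node never clears its counter, a lost metrics packet only delays the accounting: the next report that does arrive yields a delta covering both periods.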
Failure Analysis
Sympathy expects all live network nodes to generate traffic of some kind, whether routing updates, time synchronization beacons, or data periodically transmitted to the sink. We call this traffic monitored traffic to distinguish it from Sympathy's own metrics traffic (statistics packets generated by nodes and transmitted to the sink). Sympathy detects a failure and triggers localization when a node generates less monitored traffic than expected.
Based on the metrics that the sink collects, it then runs a series of simple tests on the metrics to determine what the failure could be (in other words, to find the root cause). To identify failures, Sympathy uses a simple expert system based on this decision tree, and identifies one of these root causes: node crashed, node rebooted, no neighbors, no route to sink, bad path to node, bad node transmit, bad path to sink. The root causes were chosen based on identifying each possible location where packets could have been lost.
For example: bad node transmit - fewer packets than expected are received because the node is not transmitting them.
Sympathy triggers failure analysis upon any event (an event can be the reception of a packet, or the timeout of a timer). It uses information that has been collected during the last epoch which is defined as some multiple of the metric period. The epoch is a sliding window based on the current metric period.
Sympathy tracks the flow of packets for each component on a node separately, in addition to identifying a failure for the node overall. For example, if a node has crashed, then of course every component on that node will have a root cause, but they are all attributed to the node crashing. Alternatively, the node may be alive, but one component may have stopped transmitting packets for some reason. In that case, the individual component on the node will be identified as having failed.
If Sympathy is running on the sink, then the user can cat the /dev/sympathy/metrics device (see EmStar documentation for more information on devices) to view all failure analysis information and metrics collected for a node.
Failure Localization
If a failure is detected on a node, Sympathy attempts to localize the failure. Sympathy's algorithms assign each detected failure a localized source, an actionable description of the most likely cause of the failure. We aim to choose the simplest localized source that explains the failure. After experimenting with larger sets of more specific sources, we decided that a small set of general sources is better: users must take the same actions for general and specific sources, such as going out into the field and moving a node, yet more specific sources are more likely to be wrong. The more specific source identification, and any information used to calculate it, is still available as part of Sympathy's output, if desired. There are three localized sources for a node's failure to transmit enough monitored traffic:
* Self. The node's failure has been localized to the node itself. The node may have crashed or rebooted, there may be another local bug preventing data transmission, or there may be a connectivity issue (the node does not have a route to the sink). Remedial action will probably involve moving or interacting with the node itself (e.g. changing its batteries).
* Path. The node's failure is due to a failure along the path from the node to the sink, such as a different node's failure or excessive collisions along the path. Sympathy identifies a node along the path and the problem potentially causing the packet loss in order to focus the user's search. Remedial action will probably involve moving or fixing a node or the network in its vicinity.
* Sink. Often when the whole network appears to be failing, the simplest explanation is a failure at the sink, such as bad sink placement or changes in the environment since deployment. Clues such as no node being able to hear the sink but hearing other nodes point Sympathy to issues localized at the sink. Remedial action will probably involve changing the sink placement or examining sink metrics for bugs or other connectivity issues.
Sympathy then analyzes the resulting root causes and assigns each failure a localized source: either Self (the node itself is broken), Path (the path is broken), or Sink (the sink is broken). Failures localized to Self are called primary failures, since they cannot be traced to any other cause in the network; all other failures are secondary. Sympathy logs all failures, but highlights primary failures as potential causes for all other failures, encouraging the user to prioritize fixing those failures.
The algorithm is relatively simple. If Sympathy recorded a No Neighbors or No Sink Route failure for the sink, the sink is broken; this trumps all other failure conditions. The sink failure is assigned a localized source of Self, and every other failure is assigned a localized source of Sink. Otherwise, for communication root causes (Bad Path To Node, Bad Node Transmit, and Bad Path To Sink: steps 5-7 in the decision tests), Sympathy looks along the path to the sink for another failure that might take precedence. Specifically, given a failure at node N with such a root cause, Sympathy tracks the path from N to the sink via routing table metrics. If a node O on this path (but closer to the sink) has any root cause recorded, including Congestion, then N's failure is assigned a localized source of Path with O as the primary cause. This search applies to the sink as well: if the sink has Congestion recorded, then N's failure is assigned a localized source of Path with the sink as the primary cause. (While we experimented with allowing upstream failures to explain No Neighbors and other basic-health and connectivity root causes, this mischaracterized too many failures as secondary.) If neither of these exceptions applies, then the failure is primary, and its localized source is Self.
Thus, if a node N is experiencing a failure with root cause Bad Node Transmit, but a node upstream has Crashed, then N's failure will be reported as secondary; but if N actually Rebooted, its failure is primary, despite the upstream Node Crash.
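The localization rules above can be sketched in C. This is an illustrative reading of the prose, not the code in the repository: all type and function names are hypothetical, nodes are identified by array index, and when several nodes on the path have a recorded root cause the sketch blames the first one met while walking toward the sink (the text does not say which one Sympathy picks).

```c
#include <stdbool.h>

#define NO_NODE (-1)

/* Root causes named in the text, plus Congestion (hypothetical enum). */
typedef enum {
    RC_NONE, RC_NODE_CRASHED, RC_NODE_REBOOTED, RC_NO_NEIGHBORS,
    RC_NO_ROUTE_TO_SINK, RC_BAD_PATH_TO_NODE, RC_BAD_NODE_TRANSMIT,
    RC_BAD_PATH_TO_SINK, RC_CONGESTION
} root_cause_t;

typedef enum { LOC_NONE, LOC_SELF, LOC_PATH, LOC_SINK } localized_source_t;

typedef struct {
    root_cause_t cause; /* result of failure analysis, RC_NONE if healthy */
    int next_hop;       /* routing parent from routing table metrics */
} node_state_t;

typedef struct {
    localized_source_t source;
    int primary;        /* node blamed when source == LOC_PATH, else NO_NODE */
} localization_t;

static bool is_communication_cause(root_cause_t c)
{
    return c == RC_BAD_PATH_TO_NODE || c == RC_BAD_NODE_TRANSMIT ||
           c == RC_BAD_PATH_TO_SINK;
}

/* Assign a localized source to the failure recorded at node n. */
static localization_t localize(const node_state_t *nodes, int count,
                               int sink, int n)
{
    localization_t out = { LOC_NONE, NO_NODE };
    if (nodes[n].cause == RC_NONE) {
        return out; /* nothing to localize */
    }

    /* A disconnected sink trumps every other failure condition. */
    root_cause_t sc = nodes[sink].cause;
    if (sc == RC_NO_NEIGHBORS || sc == RC_NO_ROUTE_TO_SINK) {
        out.source = (n == sink) ? LOC_SELF : LOC_SINK;
        return out;
    }

    /* Communication root causes may be explained by an upstream failure.
     * The step limit guards against routing loops in reported metrics. */
    if (is_communication_cause(nodes[n].cause)) {
        int hop = nodes[n].next_hop;
        for (int steps = 0; hop >= 0 && hop < count && steps < count; steps++) {
            if (nodes[hop].cause != RC_NONE) {
                out.source = LOC_PATH;
                out.primary = hop;
                return out;
            }
            if (hop == sink) {
                break;
            }
            hop = nodes[hop].next_hop;
        }
    }

    out.source = LOC_SELF; /* primary failure */
    return out;
}
```

With a chain sink <- 1 <- 2 in which node 1 has Crashed, a Bad Node Transmit failure at node 2 comes back as Path with node 1 as the primary cause, while a Rebooted root cause at node 2 comes back as Self, matching the example in the text.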
Tracking Flow
Sympathy identifies failures on nodes by identifying components on the sink that have not received sufficient data from a component on a node.
For example, a querying application (or component) could run on the sink, and expect 1 packet every minute from the corresponding query component on each node. Every time the sink transmits or receives a packet from the query component on a node, it notifies the sympathy code on the sink. Every time the query component on the node transmits or receives a packet, it logs this in its "flow counters", and transmits this info (along with all the other metrics) to the sink once every metric period. The sympathy code on the sink then collects all this information, and if a component on the sink complains about receiving insufficient data from a node, the sympathy code looks at all the flow metrics from the sink and the node to determine where the packet was lost.
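As a rough illustration of how flow counters from both ends can show where a packet was lost, the C sketch below compares per-epoch counts for one component. It covers only the three communication root causes, in the order they are listed in this document; the struct, the function name and the exact comparisons are assumptions, and the real decision tree also tests for crashes, reboots, missing neighbors and missing routes.

```c
#include <stdint.h>

/* Per-epoch packet counts for one component (hypothetical layout). */
typedef struct {
    uint32_t sink_tx; /* packets the sink component sent to the node */
    uint32_t node_rx; /* packets the node component reports receiving */
    uint32_t node_tx; /* packets the node component reports transmitting */
    uint32_t sink_rx; /* packets the sink component received from the node */
} flow_epoch_t;

typedef enum {
    FLOW_OK,
    FLOW_BAD_PATH_TO_NODE,  /* sink sent more than the node received */
    FLOW_BAD_NODE_TRANSMIT, /* node transmitted fewer packets than expected */
    FLOW_BAD_PATH_TO_SINK   /* node transmitted enough, sink saw too few */
} flow_verdict_t;

/* expected: packets the sink component expects from the node per epoch. */
static flow_verdict_t classify_flow(const flow_epoch_t *f, uint32_t expected)
{
    if (f->sink_rx >= expected) {
        return FLOW_OK;
    }
    if (f->node_rx < f->sink_tx) {
        return FLOW_BAD_PATH_TO_NODE;
    }
    if (f->node_tx < expected) {
        return FLOW_BAD_NODE_TRANSMIT;
    }
    return FLOW_BAD_PATH_TO_SINK;
}
```

For the query example above, with one packet expected per minute over a ten-minute epoch, a node that reports 10 transmissions while the sink logs only 4 receptions would be classified as FLOW_BAD_PATH_TO_SINK.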
Code Location
Click for an index for the graphical representation or here to go directly to the graph of the application we deployed in Bangladesh
All code is in the emstar repository.
1) TinyOS code is located in: tos-contrib/sympathy/tos
Main API file is: Sympathy.h
2) Emstar code (which runs on the sympathy-sink) is located in: emstar/devel/sympathy_devel/
Main API file is include/sympathy_dev.h
Main header file is: include/sympathy.h
Instrumenting code on the nodes (i.e. NOT the sink)
Most metrics are transmitted in the form of counters. Counters on components SHOULD NOT be cleared. Sympathy on the sink handles all counter resets, whether they are due to a reboot or to roll-over.
Sympathy code on the node communicates with the networking stack using these interfaces: (see example in tos-contrib/devel/multihop directory). Our implementation code length is included in parentheses.
Connectivity Metrics
A. RadioDebugNBI - to provide neighbor list
- command uint8_t RadioDebugNBI.getNeighborList (uint8_t *buf)
Sample code (adapted from LinkEstimatorM.nc):
 Sneighbor_t* n;
 /* offset counts neighbor entries written so far, not bytes */
 for (i = 0; i < LNKEST_MAX_NEIGHBORS; i++) {
   l = &(gStats[i]);
   if (l->addr != TOS_BCAST_ADDR) {
     /* index buf as an array of Sneighbor_t so entries do not overlap */
     n = &((Sneighbor_t *)buf)[offset];
     n->address = l->addr;
     n->ingress = l->ingress;
     n->egress = l->egress;
     n->avg_rssi = l->avg_rssi;
     offset++;
   }
 }
 return (offset);
B. GetNextHop - to get routing table info
- command uint8_t GetNextHop.getNexthop(Snext_hop_t *next_hop, uint8_t max_sinks)
Sample code from (PathBuilderM.nc)
 for (i = 0; i < PTHEST_MAX_SINKS; i++) {
   if (sink_index_valid(i) && (num_sinks < max_sinks)) {
     next_hop[num_sinks].sink = gSink[i].sink;
     next_hop[num_sinks].next_hop = gSink[i].best_next_hop;
     next_hop[num_sinks].quality = gSink[i].quality;
     num_sinks++;
   }
 }
 return (num_sinks);
Node Metrics
A. SGetPacketMetrics (example) - any time a packet is received/routed by the node at the routing layer, signal these interfaces:
- SGetPacketMetrics.receivedPacket(Saddr_t node, uint16_t strength); (1 line)
- SGetPacketMetrics.routedPacket(Saddr_t node, uint16_t strength); (1 line)
- SGetPacketMetrics.packetTxFailed(Saddr_t node); (4 lines)
Example, in SendMsg.sendDone():
 if (msg->ack == 0 && msg->addr != TOS_BCAST_ADDR && msg->addr != TOS_UART_ADDR) {
   signal SGetPacketMetrics.packetTxFailed(msg->addr);
 }
Flow Metrics
A. ProvideCompMetrics (example)
- event result_t ProvideCompMetrics.exposeGenericStats(uint8_t *data, uint8_t *len)
Sample code from ESS_mDTNM.nc
 mDTN_Sym_t* d = (mDTN_Sym_t*) data;
 d->my_stored_pkt_count = my_stored_pkt_count;
 d->others_stored_pkt_count = others_stored_pkt_count;
 *len = sizeof(mDTN_Sym_t);
- event result_t ProvideCompMetrics.exposeSymStats(Sympathy_comp_stats_t *data, uint8_t *len)
Sample code from TimeSynchM.nc
 memset(data, 0, sizeof(Sympathy_comp_stats_t));
 data->num_pkts_rx = tsPacketsRcvd;
 *len = sizeof(Sympathy_comp_stats_t);
Transporting metrics from node to sink
A. SComm - The routing layer must export this interface so Sympathy can send/receive packets through the routing layer. Sympathy_SRouteM.nc is our implementation of glue code that provides the SComm interface and uses the ESS routing layer interface (NodeI).
Adding Sympathy to the highest-level wiring file (example):
A. Add Sympathy to components list:
#ifdef USE_SYMPATHY
SympathyC,
Sympathy_SRouteM, // adapts NodeI interface for Symp
GenericCommSympathy as Comm,
QueuedSendSympathy as QueuedSend,
#else
GenericComm as Comm,
QueuedSend,
#endif
B. Wire interfaces for SympathyC ( ~20 lines)
#ifdef USE_SYMPATHY
SympathyC.SGetPacketMetrics -> BeaconTOSM_1.SGetPacketMetrics;
SympathyC.SGetPacketMetrics -> BeaconTOSM_2.SGetPacketMetrics;
SympathyC.SGetPacketMetrics -> BeaconTOSM_3.SGetPacketMetrics;
SympathyC.SGetPacketMetrics -> TreeDispatchM.SGetPacketMetrics;
SympathyC.SComm -> Sympathy_SRouteM.SComm[MULTIHOP_SYMPATHY];
Sympathy_SRouteM.NodeI -> TreeDispatchM.NodeI;
SympathyC.SGetPacketMetrics -> Comm;
SympathyC.RadioDebugNBI -> LinkEstimatorM.RadioDebugNBI;
SympathyC.GetNextHop -> Path;
#endif
Sink Code
Devices Sympathy exports
Device files are generally located under SSTATUS_BASE:
 /* Defined in tos-contrib/sympathy/tos/lib/Sympathy.h */
 #define SSTATUS_BASE /dev/sympathy/
If the network is run in simulation, then they will be located in: /dev/sim/group<id>/node<SINK-ID>/sympathy.
1) Metrics device (SSTATUS_BASE/metrics) Sympathy provides for each node:
- root cause for any failure on that node
- neighbor list
- route to sink
- statistics on the node (#pkts node has routed, #sympathy packets it has transmitted, time it has been awake)
- statistics on every instrumented component:
* Sympathy specified statistics: #pkts node has tx, rx, had to re-tx
* Component specific statistics (translated by a component on the sink, see below, and not understood by Sympathy).
2) summary device (SSTATUS_BASE/summary) Will provide a summary of the status of all failures.
3) command device (SSTATUS_BASE/cmd) Provides a command interface - at a shell a user can:
- ping any node in the network:
> echo "ping=<node-id>" > SSTATUS_BASE/cmd
- change the period at which nodes flood metrics:
> echo "auto_period=<period-in-seconds>" > SSTATUS_BASE/cmd
4) battery device (SSTATUS_BASE/battery) Accepts updates on battery voltage status of nodes, and prints the current snapshot of all nodes.
5) component communication device (/dev/link/sympathy) Handles communication between components on the sink and sympathy - described more below.
Communication Between Sympathy and other code on the sink
The /dev/link/sympathy device is a link device which handles several types of communication between the sympathy-sink and other components on the sink. The structure of the link_pkt_t.data field depends on the link_pkt_t.type field (see e.g. sympathy_events.c:update_node()); the interface is defined in sympathy_dev.h.
1) Components on the sink send updates to Sympathy (e.g. if they have received a packet, or expect to receive more packets, and want Sympathy to know about this), with link_pkt_t.type == SSINK_UPDATE, and link_pkt_t.data points to a sympathy_status_info_t (defined in sympathy_dev.h).
2) When the sink wants a component to translate a generic_stats packet it has received from a node from binary to ascii, link_pkt_t.type == SCOMP_STATS, and link_pkt_t.data points to the component-specific binary provided by the node. (See sympathy_print_status.c:status_receive())
3) When components translate a packet from binary to ascii - link_pkt_t.type == SCOMP_ASCII_STATS, and link_pkt_t.data points to a buf_t. (See sympathy_print_status.c:send_stats_pkt())
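The type-dependent payload can be pictured as a tag that selects how the data field is interpreted. The C sketch below is illustrative only: demo_link_pkt_t is a hypothetical stand-in (the real link_pkt_t is defined by EmStar and laid out differently), the numeric values given to the three type constants are placeholders, and payload_kind is not a Sympathy function.

```c
#include <stddef.h>

/* Message types named in the text; the numeric values are placeholders. */
enum { SSINK_UPDATE = 1, SCOMP_STATS = 2, SCOMP_ASCII_STATS = 3 };

/* Hypothetical stand-in for a link packet: a type tag plus opaque data. */
typedef struct {
    int type;
    const void *data;
    size_t len;
} demo_link_pkt_t;

typedef enum {
    PAYLOAD_UNKNOWN,
    PAYLOAD_STATUS_INFO,      /* sympathy_status_info_t from a sink component */
    PAYLOAD_COMPONENT_BINARY, /* component-specific binary from a node */
    PAYLOAD_ASCII_BUFFER      /* buf_t holding the translated text */
} demo_payload_kind_t;

/* Decide how the data field should be interpreted, based on the type. */
static demo_payload_kind_t payload_kind(const demo_link_pkt_t *pkt)
{
    switch (pkt->type) {
    case SSINK_UPDATE:
        return PAYLOAD_STATUS_INFO;
    case SCOMP_STATS:
        return PAYLOAD_COMPONENT_BINARY;
    case SCOMP_ASCII_STATS:
        return PAYLOAD_ASCII_BUFFER;
    default:
        return PAYLOAD_UNKNOWN;
    }
}
```

A receiver would cast the data pointer only after this check, so that an unexpected type is ignored instead of being misread as one of the three known structures.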
Routing-Specific Code
While this is optional, we have implemented a snooping plug-in so that the sink can snoop packets on the channel and translate these for Sympathy. The plug-in for ESS multihop is in sympathy_multihop.c
This code snoops the channel, and when a node broadcasts a neighbor-list or routing packet as part of the routing protocol, this code translates the packet into a known format (defined in sympathy_dev.h) and sends it over /dev/link/sympathy to the sympathy sink code.
Important Data Structures
The main data-structure that holds most of the information for each node is sympathy_node_info_t defined in sympathy.h.
Information for each component on a node is stored in sympathy_node_app_info_t defined in the same file above. This data structure holds the flow metrics collected from each component on a node, and for each corresponding component on the sink.
Handling Counters
Nodes do not reset packet counters. Instead, the sink aggregates the difference from the current reported value and the previously recorded value. In this way, the sink is robust to loss of metrics packets and counter roll-over. In addition, the sink distinguishes counter rollover from a mote rebooting by identifying when all counters seem to "rollover" as a mote reboot.
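One way to read the reboot heuristic is sketched below in C. The rule used here, "every counter in the new report is lower than its previous value", is an assumption about what "all counters seem to roll over" means; the 16-bit counter width and all names are likewise hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when every counter in the new report is lower than its
 * previously recorded value, which this sketch treats as a reboot.
 * Some counters going backwards while others advance is treated as
 * ordinary roll-over of those counters. */
static bool looks_like_reboot(const uint16_t *prev, const uint16_t *cur,
                              size_t n)
{
    if (n == 0) {
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        if (cur[i] >= prev[i]) {
            return false;
        }
    }
    return true;
}

/* Adds each counter's increase to totals[] and stores cur[] as the new
 * baseline in prev[]. After a reboot the counters restarted from zero,
 * so the raw value is itself the increase. Returns true if a reboot
 * was inferred. */
static bool aggregate_report(uint16_t *prev, const uint16_t *cur,
                             uint32_t *totals, size_t n)
{
    bool rebooted = looks_like_reboot(prev, cur, n);
    for (size_t i = 0; i < n; i++) {
        uint16_t delta = rebooted ? cur[i] : (uint16_t)(cur[i] - prev[i]);
        totals[i] += delta;
        prev[i] = cur[i];
    }
    return rebooted;
}
```

A counter that was still zero before the reboot cannot decrease, so this simple rule would miss that case; a real implementation would need an additional signal such as the reported uptime.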
Translating "Generic" stats (example: sympathy_print_stats.c)
Generic stats are component-specific statistics transmitted to the sink (so Sympathy has no knowledge of their format). Once they arrive at the sink, a component on the sink must translate these statistics. So, Sympathy will send these stats on SYMPATHY_STATS_DEVICE, with a link_pkt_t.type = SCOMP_STATS.
Once the component receives the packet (see the example in sympathy_print_stats.c, which prints the generic stats coming from mDTN), it checks the type of the link_pkt_t struct to find the specific component the struct is for (i.e. SCOMP_STATS1-4). If the packet is destined for that component, the component looks at the stats, translates them into ascii, and sends a packet out with dst.id=LINK_BROADCAST, sympathy_update_t.type = SCOMP_ASCII_STATS, and src set to the source node indicated in the packet sent by sympathy_sink.
Code Assumptions
If the TOSH_DATA_LENGTH field is not big enough, then the neighbor list does not get sent by the nodes. The data field in the routing packet usually needs 4B * avg_num_neighbors + 8B (for route + pkt-hdr); for example, 10 neighbors need 4 * 10 + 8 = 48 bytes. We decided not to send a partial neighbor list because that is more misleading; it seemed better to just receive no neighbors.
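The size rule can be written as a small check in C. The function and macro names are hypothetical; the two constants are the per-neighbor and route-plus-header sizes quoted in this section.

```c
#include <stdbool.h>
#include <stddef.h>

#define BYTES_PER_NEIGHBOR     4 /* per-neighbor size quoted in the text */
#define ROUTE_AND_HEADER_BYTES 8 /* route + packet header, per the text */

/* Bytes needed in the data field to carry a full neighbor list. */
static size_t metrics_payload_bytes(size_t num_neighbors)
{
    return BYTES_PER_NEIGHBOR * num_neighbors + ROUTE_AND_HEADER_BYTES;
}

/* True if the whole list fits; otherwise no neighbor list is sent at all. */
static bool neighbor_list_fits(size_t num_neighbors, size_t data_length)
{
    return metrics_payload_bytes(num_neighbors) <= data_length;
}
```

For instance, with a data field of 29 bytes (an example value, not taken from this page) a 5-neighbor list fits in 28 bytes, while a 10-neighbor list does not.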
Running Sympathy
To run an emstar simulation with sympathy try this run file. Ensure that you uncomment the binaries required in this run file:
* sympathy_devel in emstar/devel/BUILD
* Compile mDtnDseTs with USE_SYMPATHY enabled in the Makefile
See the application in emstar/tos-contrib/dse/apps/mDtnDseTs (link to lxr) for an example of using Sympathy with an application.
At a shell (after the simulation has run for 1 or 2 metric periods as defined in Sympathy.h):
- cat /dev/sympathy/metrics: to get all information on each node in the network
(includes failure diagnosis)
- ping any node in the network (not well tested):
> echo "ping=<node-id>" > SSTATUS_BASE/cmd
- change the period at which nodes flood metrics:
> echo "auto_period=<period-in-seconds>" > SSTATUS_BASE/cmd
Using Emview
To use emview with Sympathy, launch emview
> obj.i686-linux/emview/emview -G <group-id>
and you should see a visualization of many metrics that Sympathy collects. Specifically, you will see the nodes' neighbor lists, routing tables, and failure status (as diagnosed by Sympathy).
Checking if Sympathy is working
0) Sympathy contains self-checking mechanisms, so the sink will complain if it is not getting sympathy metrics from certain nodes.
1) Run your network in simulation, and run emview to get a visualization (to run emview: obj.i686-linux/emview/emview -G <group-id>). Check the sympathy metrics provided in the metrics device and make sure they match what you see in emview.
2) Run with motes, and launch emview to get a visualization. Check:
* all your motes appear in sympathy metrics file, and in emview
* turn a mote off, and make sure this failure is reported in the no_data
device file.
Using code in our repo
We use various pre-processor symbols to include or exclude code based on personal preferences:
SYMPATHY_APP - includes some experimental code to collect specific events from each node and transmit these events back to the base station. We abandoned the event collection at the nodes because it did not provide useful information.
SYMPATHY_FAULT - includes fault-injection code. We used this when evaluating Sympathy. This code will inject various faults (dropping packets, simulating a crashed node) upon user input from a command line.
PLATFORM_PC - this flag is specified by our build system when we are compiling code for simulation on a PC. It includes debug devices and various other debugging mechanisms that are not required when the code runs on a mote, but are useful when it is simulated on a PC in EmStar.
Fault Injection To Test Sympathy (As of 7/2006)
0) Check out a new top of tree
1) In devel/multihop/BUILD:
Uncomment the mDtnDseTs10 and mDTNTS10 lines.
2) In tos-contrib/dse/apps/mDtnDseTs/Makefile
a. Comment out the line: #CFLAGS += -DLNKEST_BEACON_PER=100000
b. Uncomment the line: CFLAGS += -DSYMPATHY_FAULT
3) In tos-contrib/sympathy/tos/lib/Sympathy.h:
Set the metrics period to 180 seconds (was 1200). Only the first line containing METRIC_PERIOD counts:
#define METRICS_PERIOD_MSEC 180000 /* Every 3 minutes */
4) Correctly set $EMSTAR_HOME in sars.pl
5) Run sars.pl to set up an experiment (see usage for explanation of script).
This script will set up the test, check for correct output, inject traffic, create congestion situations, and put all output into a unique test directory.
Here's a sars.log output from sars.pl. The final line, with 3 in the Correct column, indicates correct failure detection, root cause, and localization:
 $ less sars.log
 #h Run-Iter Timepassed TestIter Node-id Component Type Failure Correct
 # (/home/nithya/.sympathy_3/group1.600traffic30.1020die4.epoch3.iter1a.sim) Fri Jul 14 0 1:14:05 PDT 2006
 1 600 1 30 Inject Command traffic,30 0
 1 900 1 Inject Command monitor 0
 1 1020 1 4 Inject Command die,4 0
 1 1244 6 4 ** Root Failure SRC_BAD_PATH_TO_SINK 1
 1 1244 6 8 ** Failure SRC_BAD_PATH_TO_SINK 0
 1 1244 6 9 ** Failure SRC_BAD_PATH_TO_SINK 0
 1 1624 9 4 ** Root Failure SRC_NODE_FAILED 3
Dependence on Emstar
The Sympathy code running on the node has no dependence on EmStar and runs completely in TinyOS.
The Sympathy code running on the base station is currently implemented in EmStar, but could easily be implemented outside of EmStar. The primary dependence is on the use of device files for:
1) communication between processes
To get around the IPC dependence, Sympathy could be implemented as a monolithic process (in general not a great idea, but not a big deal here because there are no serious race conditions that affect Sympathy's functionality). Or, if the user wants to keep the process structure as implemented in EmStar, any other inter-process communication mechanism could be used.
Sympathy has to be able to acquire packets from the channel (preferably from the routing layer). If possible, it should also be able to communicate with other applications running on the base station in order to collect statistics from them (e.g. the number of packets an application has sent to or received from each node), and it needs a way to continually receive updated information on these metrics.
2) exporting/importing information to/from the user
Sympathy needs a way to give the user all of the information it has collected on the nodes. Furthermore, ideally it could take input from the user and send these commands to the network (e.g. dynamically changing the metrics period on the nodes, or pinging a specific node).
Publications
N. Ramanathan, K. Chang, R. Kapur, L. Girod, E. Kohler, D. Estrin, "Sympathy for the Sensor Network Debugger", To appear in Proc. of Sensys, 2005. [pdf] [ppt]
N. Ramanathan, K. Chang, R. Kapur, L. Girod, E. Kohler, D. Estrin, "A Debugging System for Sensor Networks", CENS Technical Report #0047, 2005. [pdf]
N. Ramanathan, E. Kohler, D. Estrin, "Towards a Debugging System for Sensor Networks", International Journal for Network Management, 2005. [pdf]
N. Ramanathan, E. Kohler, L. Girod, D. Estrin. "Sympathy: A Debugging System for Sensor Networks". (Short Paper) 1st IEEE Workshop on Embedded Networked Sensors, (EmNets-I), Florida, November 2004.
