Categories:
1)

German Goldszmidt and Yechiam Yemini. Distributed Management by Delegating Mobile Agents. In The 15th International Conference on Distributed Computing Systems, Vancouver, British Columbia, June 1995. Many papers on the concept of management by delegation. Goal is to move primitive responsibilities of network management from a centralized server to distributed nodes. This is done using mobile code, i.e. downloading scripts to nodes that can perform management tasks or even dynamic tasking. These scripts are executed under local or remote control. These scripts, or "delegation agents", empower individual nodes to take action based on observed behavior instead of consuming network bandwidth to convey metrics and commands. The node is then able to monitor its own behavior, detect any problems, diagnose these issues and even repair them. ps
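A minimal Python sketch of the delegation idea in the entry above: a script is handed to a node, runs against locally observed metrics, and only a problem report would cross the network. The names (`Node`, `delegate`, `queue_watch`) and the 90% threshold are hypothetical illustrations, not taken from the paper.

```python
from typing import Callable, Dict, List, Optional

# A delegated script takes a node's local metrics and returns a report
# string when something is wrong, or None when there is nothing to say.
Script = Callable[[Dict[str, float]], Optional[str]]


class Node:
    """Hypothetical managed node that runs delegated scripts locally."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.scripts: List[Script] = []
        self.outbox: List[str] = []  # messages that would cross the network

    def delegate(self, script: Script) -> None:
        # The manager "downloads" a script to the node.
        self.scripts.append(script)

    def observe(self, metrics: Dict[str, float]) -> None:
        # Every script sees the raw metrics locally; only reports are sent.
        for script in self.scripts:
            report = script(metrics)
            if report is not None:
                self.outbox.append(f"{self.name}: {report}")


def queue_watch(metrics: Dict[str, float]) -> Optional[str]:
    # Illustrative policy: complain only when the send queue is nearly full.
    if metrics.get("queue_fill", 0.0) > 0.9:
        return "send queue above 90%"
    return None


node = Node("n7")
node.delegate(queue_watch)
for fill in (0.2, 0.5, 0.95, 0.3):
    node.observe({"queue_fill": fill})
print(node.outbox)
```

Four observations are made locally, but only one message ends up in the outbox, which is the bandwidth argument made in the notes.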

G. Tolle, D. Culler "Design of an Application-Cooperative Management System for Wireless Sensor Networks". Goal is to offer a way to record events to permanent local storage for post-mortem analysis and also allow real-time access to human managers. Goal is to provide the human user with quick knowledge of whether the network is functioning or not. Does not inject extra network traffic, other than in response to a user request for extra information. In this way, it's similar to Zhao's work on providing infrastructure. Introduces the idea of having SNMS dynamically loaded upon reboot/watchdog timer. Separates "policy from mechanisms", where policy is decided by the user and mechanism is provided by SNMS; clearly complementary to Sympathy. Sympathy instead focuses on what information should be collected and what is most indicative of system health.
link

J. Dunagan, N. Harvey, M. Jones, et al. "FUSE: Lightweight Guaranteed Distributed Failure Notification", OSDI, 2004. Lightweight failure notification service for distributed systems. Systems are guaranteed that failure notifications reach members within a bounded period of time. Responsibility for deciding that a failure has occurred is shared between the FUSE service and the distributed app. Different from a failure detection service: this is just responsible for notifying of failures. Groups are formed, and if one member of a group detects a failure then it sends this notification to the other members of the group. If a member of a group thinks a link has gone down, then it will stop responding to pings; in this case, the sender of the pings basically gets an implicit failure notification and then proceeds to notify its group members of the failure. It doesn't know why/what the failure is, only that there was a failure on the link. Based on the idea of membership; however, membership does not allow application components to have failed with respect to one action but not with respect to another. E.g. streaming content: even though the service may not be able to meet a certain threshold, another client may be fine w/ that level of service.
METRICS used to evaluate system: implementation is lightweight, latency of group creation, latency of failure notification, robust to false positives.
They divide failure detection into 3 groups:
- unreliable failure detectors: e.g. can use pings to determine which components are "up" or "down". FUSE is based on this, but also provides a mechanism for distributed agreement.
- weakly consistent membership services
- strongly consistent membership services
Handles transitive failures.
Questions:
1) How does it detect failures? Possibly different from Sympathy: they do not talk about how to identify a failure, just what to do once a failure has been identified on one node. Sympathy, however, talks about how to identify a failure on either a single node or a distributed failure.
2) Seems prone to lots of failure cascading, due to false-positive detections which will then propagate along the group?
link
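The group behavior described in the FUSE entry can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the FUSE API: `FuseSketch`, `create_group`, `signal_failure` and `ping_timeout` are hypothetical names, there is no real network, and the bounded-time delivery guarantee is not modeled. It also assumes a group is torn down once it has been notified, so a second report does nothing.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str], None]  # invoked with the id of the failed group


class FuseSketch:
    """In-memory stand-in for a group failure notification service."""

    def __init__(self) -> None:
        # group id -> {member name -> that member's failure handler}
        self.groups: Dict[str, Dict[str, Handler]] = {}

    def create_group(self, group_id: str, handlers: Dict[str, Handler]) -> None:
        self.groups[group_id] = dict(handlers)

    def signal_failure(self, group_id: str, reporter: str) -> None:
        # Any member may report a failure; the reason is not carried along.
        members = self.groups.pop(group_id, None)
        if members is None:
            return  # group already failed (or never existed): nothing to do
        for name, handler in members.items():
            if name != reporter:
                handler(group_id)

    def ping_timeout(self, group_id: str, pinger: str) -> None:
        # A missed ping is treated as an implicit failure notification,
        # which the pinger then passes on to the rest of the group.
        self.signal_failure(group_id, reporter=pinger)


notified: List[Tuple[str, str]] = []


def make_handler(member: str) -> Handler:
    return lambda group_id: notified.append((member, group_id))


fuse = FuseSketch()
fuse.create_group("g1", {m: make_handler(m) for m in ("a", "b", "c")})
fuse.ping_timeout("g1", pinger="a")      # "a" stops hearing from a peer
fuse.signal_failure("g1", reporter="c")  # no effect: the group is gone
print(notified)
```

Note that the handlers learn only that group "g1" failed, not which link or node caused it, which matches the observation in the notes.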

Chandy, Lamport "Distributed Snapshots: Determining Global States of Distributed Systems". Problem of trying to capture global state when one can only capture portions at a time and must somehow piece them together. Furthermore, should not disturb the process being observed, and the picture that is created must be meaningful. Problem is to define "meaningful" and how to create this picture. link
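The notes above state only the problem. As a hedged illustration, here is a single-threaded Python simulation of the marker-based approach commonly associated with this paper: each process records its own state, sends a marker on every outgoing channel, and records messages arriving on a channel until that channel's marker shows up. It assumes FIFO channels and two processes that move "money" around; the `Process` class and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set, Tuple

Message = Tuple[str, int]  # ("transfer", amount) or ("marker", 0)


class Process:
    def __init__(self, name: str, balance: int) -> None:
        self.name = name
        self.balance = balance
        self.inbox: Dict[str, Deque[Message]] = {}  # sender -> FIFO channel
        self.peers: Dict[str, "Process"] = {}       # outgoing neighbours
        self.snap_balance: Optional[int] = None     # recorded local state
        self.snap_channels: Dict[str, List[int]] = {}  # recorded in-flight msgs
        self.recording: Set[str] = set()            # channels still recorded

    def connect(self, other: "Process") -> None:
        self.peers[other.name] = other
        other.inbox[self.name] = deque()

    def send(self, to: str, amount: int) -> None:
        self.balance -= amount
        self.peers[to].inbox[self.name].append(("transfer", amount))

    def start_snapshot(self) -> None:
        self._record(skip=None)

    def _record(self, skip: Optional[str]) -> None:
        # Record local state, start recording incoming channels (except the
        # one the first marker arrived on), then send markers downstream.
        self.snap_balance = self.balance
        for sender in self.inbox:
            self.snap_channels[sender] = []
            if sender != skip:
                self.recording.add(sender)
        for peer in self.peers.values():
            peer.inbox[self.name].append(("marker", 0))

    def deliver(self, sender: str) -> None:
        kind, amount = self.inbox[sender].popleft()
        if kind == "marker":
            if self.snap_balance is None:
                self._record(skip=sender)
            else:
                self.recording.discard(sender)  # channel state is complete
        else:
            self.balance += amount
            if sender in self.recording:
                self.snap_channels[sender].append(amount)


p, q = Process("p", 100), Process("q", 50)
p.connect(q)
q.connect(p)
p.send("q", 10)
q.send("p", 5)
p.start_snapshot()
q.deliver("p")  # the transfer of 10
q.deliver("p")  # p's marker: q records its state
p.deliver("q")  # the transfer of 5, recorded as in flight
p.deliver("q")  # q's marker: snapshot complete
in_flight = sum(sum(c) for proc in (p, q) for c in proc.snap_channels.values())
print(p.snap_balance, q.snap_balance, in_flight)
```

The recorded pieces (90, 55, and 5 in flight) add up to the 150 units in the system, even though no single instant was frozen; that is one concrete reading of a "meaningful" picture.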

Matthias Grossglauser, Jennifer Rexford, "Passive Traffic Measurement for IP Operations"
Motivated by the need to monitor/predict traffic flow across ASes. Difficult because packet switching does not maintain state at routers, and other ASes are black boxes. Specifies that problems are easier to debug (as does DeBox) only if specific information is provided by the hardware. Discusses models for traffic, used to predict how changes in configuration would impact traffic flow. Has 3 types of models, each useful for different predictions. Distinguishes between packet and flow monitoring (packet monitoring has more fine-grained timing information available), and between 3 types of network/traffic models (tomography, etc).

Active probes in the network can identify areas of high delay/loss and routing anomalies (forwarding loops). Also helps identify which traffic is causing congestion, e.g. whether additional packets come from a DoS attack, a routing change, etc. Finally, can assist in evaluating possible changes to configuration. BUT, must have accurate info about traffic in the network to do any of this! ps

Gil's Debugging Work

Active Measurement
Performs RTT and loss measurements. Philosophy is that dense/simpler measurement is better than complex/sparse measurement.
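As a small illustration of the "dense but simple" measurement style, the sketch below reduces a series of probe results to a loss rate and basic RTT statistics. The function name `summarize_probes` and the convention that `None` marks a lost probe are assumptions made for this example, not details of the tool in the notes.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_probes(rtts_ms: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one probe series; None entries stand for lost probes."""
    answered = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(answered)
    return {
        "loss_rate": lost / len(rtts_ms) if rtts_ms else None,
        "min_rtt_ms": min(answered) if answered else None,
        "mean_rtt_ms": mean(answered) if answered else None,
    }


summary = summarize_probes([10.0, None, 14.0, 12.0])
print(summary)
```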

DeBox pdf

Robert Wahbe, Steven Lucco, Thomas E. Anderson, Susan L. Graham:
Efficient Software-Based Fault Isolation. SOSP 1993: 203-216

Provides fault isolation in software instead of hardware (separate address spaces) by having separate fault domains within one address space. Untrusted software modules are forced to stay within their own fault domain and cannot modify code outside of it. Tradeoff between context switching (hw fault isolation) and slightly higher execution time for untrusted modules. pdf
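One way to picture a software fault domain is address sandboxing: before an untrusted store, the upper bits of the target address are overwritten with the domain's segment identifier, so the store can only land inside that segment. The Python sketch below models this on a byte array. The 16-bit segment size and the names `sandbox` and `Memory` are hypothetical choices for illustration; the real technique rewrites machine code.

```python
SEGMENT_BITS = 16                      # assumed: each fault domain is 64 KiB
OFFSET_MASK = (1 << SEGMENT_BITS) - 1  # keeps only the offset within a segment


def sandbox(addr: int, segment_id: int) -> int:
    """Force addr into the given segment by replacing its upper bits."""
    return (segment_id << SEGMENT_BITS) | (addr & OFFSET_MASK)


class Memory:
    """One shared address space holding several fault domains."""

    def __init__(self, segments: int) -> None:
        self.cells = bytearray(segments << SEGMENT_BITS)

    def store_untrusted(self, addr: int, value: int, segment_id: int) -> None:
        # No check and no trap: a stray address is silently redirected
        # into the caller's own segment.
        self.cells[sandbox(addr, segment_id)] = value


mem = Memory(segments=4)
mem.store_untrusted(0x31234, 0xAB, segment_id=2)  # tries to write segment 3
print(hex(sandbox(0x31234, 2)), mem.cells[0x21234], mem.cells[0x31234])
```

The extra mask-and-or on every store is the "slightly higher execution time" side of the tradeoff; crossing between domains no longer needs a hardware context switch.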

*** C. McDowell, D. Helmbold. "Debugging Concurrent Programs".
Summary of debugging techniques. Good motivation for concurrent distributed debugging. Discusses the probe effect and other issues (nonrepeatability, increased complexity, lack of a synchronized global clock). E.g. with concurrent programming, different executions with the same data often result in different output, and even if you *can* observe the occurrence, with no synchronized global clock it's hard to interpret the results. Discusses 3 techniques: gdb, event monitoring, static analysis. Discusses how to organize/minimize large amounts of debug data.

Distinguishes between Monitoring "process of gathering information about a program's execution" & Debugging "process of locating, analyzing, and correcting suspected faults" - KEY is suspected faults!

Divides debuggers up into breakpoint (e.g. gdb) or event based. Systems that interpret events: pushing onto a database or representing as Prolog in order to query, or filtering based on user criteria.
** DISDEB [Lazzerini 1986] analyzes events and based on fulfillment of predicate can trigger extra debugging.
** Event Description Language (EDL)[Bates & Wileden 1983] - allows to define abstract events based on lower level events
Has various other descriptions of languages used to define event orderings (temporal logic), and actuation upon an event. pdf
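The two event-based ideas noted above (abstract events built from lower-level events, and a predicate over events that triggers extra debugging) can be sketched as follows. This is not DISDEB or EDL syntax; the `Event` type, the `derive` function and the "retransmit_storm" event are hypothetical, chosen only to show the shape of the technique.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    time: int
    node: str
    kind: str


def derive(events: List[Event], kind: str, count: int, window: int, name: str) -> List[Event]:
    """Emit an abstract event `name` whenever a node produces `count`
    events of `kind` less than `window` time units apart."""
    out: List[Event] = []
    recent: Dict[str, List[int]] = {}  # node -> times of recent matching events
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind != kind:
            continue
        times = [t for t in recent.get(ev.node, []) if ev.time - t < window]
        times.append(ev.time)
        recent[ev.node] = times
        if len(times) >= count:
            out.append(Event(ev.time, ev.node, name))
            recent[ev.node] = []  # start counting afresh after a match
    return out


trace = [
    Event(1, "a", "retransmit"),
    Event(2, "a", "retransmit"),
    Event(3, "b", "send"),
    Event(4, "a", "retransmit"),
    Event(20, "a", "retransmit"),
]
storms = derive(trace, kind="retransmit", count=3, window=5, name="retransmit_storm")

# Predicate fulfilled: switch on extra (verbose) debugging for those nodes only.
verbose_nodes = {ev.node for ev in storms}
print(storms, verbose_nodes)
```

Filtering on the derived events rather than the raw trace is also one way to cut down the volume of debug data a user has to read.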

"Understanding Fault-Tolerant Distributed Systems"
Good summary of fault tolerance in dist systems, and approaches
pdf


Differentiation from Parallel Programming
--------------------------------------
** Fault tolerance - increases reliability of a computer system. We are interested in debuggability, not fault tolerance.
I. WSNs don't need "fully user-transparent fault tolerance" (term from [3]); instead, they need fault identification & information to debug the fault.
A. Distributed, MP systems have long-running applications that cannot tolerate reboots
- fault tolerance => avoid down time/reboots; recover from fault and continue
- e.g. checkpoints, rollback and continue [3]
- faults often caused by a temporary bad interaction between elements
- Failures are often synchronization/lock/data access. NOT the case for WSN.
- Detects node failures using "are you alive" msgs, recovers using checkpoints [3]
*** B. Parallel programming is based on the paradigm of running applications, reading/modifying data and maintaining state, which would require checkpointing and recovery of previous state. Instead, WSNs are just recording, communicating and computing
- fault tolerance often less important than fault identification
- don't necessarily want to mask a fault.
- want to know what the fault is to fix the issue (e.g. dead node, routing loop); often permanent fixtures, not temporary bad interactions between elements
- checkpoints don't work: minimal storage; WSNs don't need transparent recovery and it doesn't make sense in this context
- want to predict a failure before it happens? not just recover from it
- Need fault isolation? consistency? atomicity? durability? [5] To achieve fault tolerance, systems incorporate redundant processing elements.
. DB systems require Atomicity, Consistency, Isolation, Durability; parallel prog systems only require atomicity and consistency [3]

"Lessons from a Sensor Network Expedition" R. Szewczyk, J. Polastre, A. Mainwaring, D. Culler, EWSN 04 [GDI]. Introduces the idea of using sensor readings (e.g. wildly varying values from a sensor) or traditional network/QoS metrics (e.g. packet loss) to gain insight into the system. pdf

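A rough sketch of the idea in the entry above: treat erratic sensor values and ordinary network metrics as health indicators for a node. The function name `health_flags` and both thresholds are made up for illustration; the paper does not prescribe these values.

```python
from statistics import pstdev
from typing import List


def health_flags(readings: List[float], sent: int, received: int,
                 max_std: float = 5.0, max_loss: float = 0.5) -> List[str]:
    """Return the names of the health checks a node fails."""
    flags: List[str] = []
    # Wildly varying sensor values hint at a failing sensor or node.
    if len(readings) >= 2 and pstdev(readings) > max_std:
        flags.append("erratic_sensor")
    # Traditional network metric: fraction of packets that never arrived.
    if sent > 0 and (sent - received) / sent > max_loss:
        flags.append("high_packet_loss")
    return flags


print(health_flags([20.1, 20.3, 19.9, 20.0], sent=100, received=95))
print(health_flags([20.0, 85.0, -40.0, 60.0], sent=100, received=30))
```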
Traditional Network Management
Kurose & Ross
ISO defined a network mgmt model with 5 areas:
- Performance management (quantify, measure, analyze, report, control performance). Highly utilizes SNMP.
- Fault management - log, detect, respond to fault conditions in the network. Close to performance management. "immediate handling of transient network failures (e.g. link, host, router hw or sw outages)" as opposed to perf mgmt, which looks longer term. Uses SNMP.
- Configuration management
- Accounting management (specify/control user access)
- Security management
[K&R] state that when defining a nw mgmt protocol framework, one must answer these 3 questions:
- what (from a semantic viewpoint) is being monitored? and what form of control can be exercised by the network administrator?
- what is the specific form of the information that will be reported and/or exchanged?
- what is the communication protocol for exchanging this information?
However, the irrelevance of these questions to our current arch is one indicator of the diff between nw mgmt for WSN and the Internet.
The Internet-Standard Management Framework includes:
- definition of network management objects (MIB objects)
- data definition language (SMI): rules for writing/managing info
- protocol (SNMP) for conveying info
- security/administration capabilities
However, these are not priorities for us. For WSN, we need to begin by defining what nw mgmt objects we will be observing/recording before we can think about the other 3 components.

Examples of failures in Sensor Networks

"Wireless Sensor Networks for Habitat Monitoring (2002) " Alan Mainwaring, Joseph Polastre, Robert Szewczyk, David Culler, John Anderson ACM International Workshop on Wireless Sensor Networks and Applications (WSNA'02)
Discusses a deployment where nodes enter failure modes that result in random data being broadcast and impact link-connectivity.

"Lessons from a Sensor Network Expedition" R. Szewczyk, J. Polastre, A. Mainwaring, D. Culler, EWSN 04 [GDI]

"Root cause localization in large scale systems", Emre Kiciman, Lakshminarayanan Subramanian
Discusses root-cause localization - "the process of identifying the source of problems in a system using purely external observations" pdf

Terms to think about:
- Distributed Agreement [FUSE]
- Fault detection vs recovery vs tolerance vs notification
  * notification[FUSE]
  * transparent fault tolerance - parallel processing
- Handling transitive failures (A => B, B => C, but A !=> C) [FUSE]
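The intransitive case in the last item (A reaches B, B reaches C, but A does not reach C) can be checked mechanically once pairwise reachability is known. The sketch below is an illustration only; `intransitive_triples` and the pair-set representation are assumptions, not something FUSE defines.

```python
from itertools import permutations
from typing import List, Set, Tuple


def intransitive_triples(nodes: List[str],
                         can_reach: Set[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """List (a, b, c) where a reaches b and b reaches c, but a cannot reach c."""
    return [
        (a, b, c)
        for a, b, c in permutations(nodes, 3)
        if (a, b) in can_reach and (b, c) in can_reach and (a, c) not in can_reach
    ]


observed = {("A", "B"), ("B", "C")}  # directed "can talk to" observations
print(intransitive_triples(["A", "B", "C"], observed))
```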