Troubleshooting a failed node

From CSL Wiki

When a node fails for some reason, try these remedies!

Hardware

A node that is not functioning might be suffering from a hardware problem. Check for the following conditions:

  • If you can log in, check the output of 'df' and make sure that the flash is not full. If it is full, delete files to free up space. Note that a full flash may lead to a corrupted filesystem, so don't fill up the internal flash!
  • If you can't log in or are having strange problems, open the box and make sure all the connectors are in place and secure. Make sure the 802.11 card and sampling cards are properly seated, and check the power connector. Note that opening the box is a little tricky because the connectors don't quite fit in the box. Carefully lift the side of the aluminum plate closest to you when facing the box (there is an Ethernet connector and a power connector that are very close to the edge in the front), then pull it towards you to free the audio connector in the rear.
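
The flash check in the first item above can be partly automated. The following is only a sketch: it assumes the node's 'df' accepts the POSIX -P flag and that awk is available, and the function names and the 90% threshold are made up for illustration.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "<percent> <mountpoint>" per filesystem.
flash_usage() {
    # Skip the header; column 5 is the capacity (e.g. 87%), column 6 the mount point.
    awk 'NR > 1 { sub(/%/, "", $5); print $5, $6 }'
}

# Warn about every filesystem at or above the threshold (default 90, an arbitrary choice).
warn_if_full() {
    threshold="${1:-90}"
    flash_usage | while read -r pct mount; do
        # Ignore entries whose capacity is not a plain number (e.g. "-").
        case "$pct" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$pct" -ge "$threshold" ]; then
            echo "WARNING: $mount is ${pct}% full"
        fi
    done
}
```

On a node this would be run as 'df -P | warn_if_full 90'. No output means nothing has reached the threshold.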

Software

If the system is booting up but is not completely functional, you can diagnose the software.

  • Check for crashed processes in /dev/emrun/last_msg. Any crashes may be an indicator of something wrong, especially if they are recurring (check the crash counter). Check the logs of those crashed processes in /dev/emlog/<processname>/all
  • Check /dev/emrun/status for any processes not yet started or marked 'looping'. Sometimes parts of the system won't start if parts they depend on are crashing repeatedly or are not starting properly. Look in the logs for the processes that are looping to see if there is anything suspicious.
  • If emrun isn't running at all, try starting it in the foreground and see what errors occur. See the startup script for the particular flavor you are running (check .delayed-boot.sh for which startup script to use, and trace through the scripts to the emrun command line).
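
The first two checks above can be gathered into one small report. This is a rough sketch, not a tool that ships with the system: the function names are invented, and because the layout of the status file is not described here, it only does a plain text match on the word 'looping'.

```shell
#!/bin/sh
# Directory named in the text above; overridable so the sketch can be tried elsewhere.
EMRUN_DIR="${EMRUN_DIR:-/dev/emrun}"

# Print the status lines that mention 'looping' (plain substring match).
looping_lines() {
    grep 'looping' "$EMRUN_DIR/status"
}

# Print the last crash message, then any processes marked looping.
emrun_report() {
    echo "== last_msg =="
    cat "$EMRUN_DIR/last_msg"
    echo "== looping =="
    looping_lines || echo "(none)"
}
```

Running 'emrun_report' on a node prints both sections. For each process it lists, look at /dev/emlog/<processname>/all as described above.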

SSH problem!

If the node takes a LONG time to ssh to (e.g. 30 seconds), it's because it had previously been assigned an IP address using DHCP. The best solution for this is to nuke /etc/resolv.conf and touch a new one, i.e.:

rm /etc/resolv.conf
touch /etc/resolv.conf

and ssh should be working again.
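
If you would rather keep a copy of the old file before emptying it, a variant along these lines may help. It is only a sketch: the function name and the .bak suffix are this example's own choices, and unlike rm followed by touch it truncates the file in place, so a symlinked resolv.conf would be written through rather than replaced.

```shell
#!/bin/sh
# Empty a resolver config file, keeping a copy of the old contents next to it.
reset_resolv() {
    conf="${1:-/etc/resolv.conf}"
    if [ -e "$conf" ]; then
        # Stop if the backup cannot be written, so nothing is lost.
        cp "$conf" "$conf.bak" || return 1
    fi
    # Truncate (or create) the file so that it is empty.
    : > "$conf"
}
```

Calling 'reset_resolv' with no argument acts on /etc/resolv.conf. Pass another path to try it on a scratch file first.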
