Peru
From CSL Wiki
Quickstart - Measuring Link Quality
- Start a ping
- Look at the web page! It has more useful information than the ping alone.
Pinging
- Use broadcast pings only to find neighbors!
- The components are already running a once-a-second ping
- Additional unicast pings you send will be added into the measurements
- You can use unicast pings with a large packet to determine link quality
ping -s 1400 10.0.0.112
Linkinfo Web Page
- Connect to the web page using your browser. For example for node 103, go to:
http://10.0.0.103:8080/
- The linkinfo table is up at the top, but click the link below the table to just see the linkinfo
- It takes 120-180 seconds for the linkinfo to get a stable reading
- If you also send large ping packets, it will go a little faster
- The columns are as follows:
- Neighbor: the neighboring node
- Status/Gen: status is the assumed status of the node / gen is whether we are generating probe traffic to this node
- This value is meant as a quick idea of what is going on. It is more important to note whether it is changing and how stable it is
- You want it to be ACTIVE ... if it jumps between STARTUP and ACTIVE or ASYMM and ACTIVE, then the link is really poor
- The 'Gen' indicates we are currently sending probes to this node (Y means yes, N means no)
- Last Heard: how long ago we heard any information from this node about us
- If this value is high and repeatedly goes high, then that means the link is flaky since the two nodes can not continuously successfully communicate
- RSSI (dBm): the signal strength of each received packet from this node. This value is shown as a running average and running standard deviation
- Silence (dBm): the ambient noise level before a packet was received. This value is shown as a running average and running standard deviation
- The silence value is important to watch during deployment. The higher it is, the worse the surrounding 'noise' is and the more difficult it will be to receive packets.
- SNR: the rssi - silence
- We have only recently started displaying this information so we do not have a good 'feel' for what the RSSI, Silence, and SNR values should be for a good site.
- When setting up a site, pay attention to the rssi, silence value, and snr and how it changes. It is good to keep track of this mentally so you can gain an understanding of what good values are.
- Rate In/Out: The data rate to and from this node. 10 -> 1Mbps, 20 -> 2 Mbps, 55 -> 5.5 Mbps, 110 -> 11 Mbps
- The data rate can tell you a lot about whether a link is flaky or not.
- If the link is at 1 or 2 Mbps, there is enough packet loss to force the rate down. This does not necessarily mean the link is unusable, but you should proceed with caution or do your best to adjust the antennas.
- Conn In/Out: The packet delivery ratios as a percentage for sent and received packets
- Combining the packet delivery ratios and the data rate can also tell you a lot about a link
- If you see a connectivity of less than 70% in both directions, the link is going to perform very poorly.
- If the connectivity is high but the data rate is at 1 or 2 Mbps, there could potentially be significant loss on the link, especially with more traffic
- It is difficult to tell in this situation so the best thing to do is try to send some more pings by hand, or readjust the antennae
The CDCC and you
Deployment checklist
- Internal CDCC cable connections
- Verify all of these are properly plugged in
- Both ends of serial cable
- Both ends of Ethernet cable
- Power connector
- Pigtail connector to wifi card
- Wifi card is connected to the stargate and its screws are holding it in place
- CF card is present
- Daughter card on stargate is not loose
Rebooting / Powering off the CDCC
Open box
- Press and hold the black sideways facing button on the stargate
- The three lights will all turn solid and you can let go of the button
- Once the three lights turn off, wait 5 seconds and you can safely unplug the power
- The lights will come on shortly and eventually start flashing: this indicates the stargate is booting up again
- If you have not unplugged the power yet, you can still safely do so
- You may have to press the white reset button to get things to boot properly if the lights do not come back on
Through the web page
- The main CDCC web page has a shutdown/reboot button
- Click the button to properly shutdown/reboot the node
- Wait until the page refreshes. It should display a note which tells you to unplug the power in 5 seconds
Through a console
- Through the serial console, ssh, or rbsh do the following
- Type
killall duiker
- Wait a minute until duiker can properly clean up. You can run the following a few times to see if it is still running
ps awx | grep duiker
- Once duiker has stopped, type
killall emrun
shutdown -r now
Understanding the CDCC web page
Main Page
- The main page shows you all the status information about the software on the CDCC
- There are links to access the duiker page and the linkinfo page
- There is a button to properly reboot the CDCC
Linkinfo display
- Please see the Linkinfo Page section below
Diskinfo display
Data directory /opt/data/ contains 0 files, which are lined for deletion
Filemover directory /opt/filemover/ contains 0 files, which are lined for deletion
Xfer directory /opt/xfer/ contains 0 files, which are lined for deletion
Total diskspace = 0.97GB
Free diskspace threshold = 25.00% which is 0.24GB
Free diskspace delete threshold = 5.00% which is 0.05GB
Free diskspace = 44.43% which is 0.43GB
- The most important lines are the 'Total diskspace' and the last 'Free Diskspace' line
- These tell you how much space the CF card has and how much of it is left
- Other useful information is the number of files in each of the directories (see the section about the CF card structure below) and the stop thresholds
- The 'Free diskspace threshold' is the point at which this node will stop accepting data from neighboring nodes
- This tries to ensure that the locally generated data has priority over the data from other nodes
- The 'Free diskspace delete threshold' is the point at which data on the node will be deleted to maintain the threshold
- This makes sure that the CF card will never become full and there is always room for the newest locally generated data
- The data is sorted by date created and the oldest is deleted. No preference is given to locally or remotely generated data: the oldest goes first
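The two thresholds can be checked mechanically. The following is a minimal sketch, assuming the status text has one entry per line in the same wording as the Diskinfo display above. The helper name disk_state and the strict "less than" comparisons are assumptions for illustration, not taken from the diskmanager code.

```shell
# disk_state: read diskmanager-style status text on stdin and report which
# threshold, if any, the free space has dropped below.
# Hypothetical helper; the comparisons are an assumption about the real logic.
disk_state() {
    LC_ALL=C awk '
        /^Free diskspace threshold/        { accept_at = $5 + 0 }  # e.g. 25.00%
        /^Free diskspace delete threshold/ { delete_at = $6 + 0 }  # e.g. 5.00%
        /^Free diskspace =/                { free = $4 + 0 }
        END {
            if (free < delete_at)      print "deleting oldest data"
            else if (free < accept_at) print "not accepting neighbor data"
            else                       print "ok"
        }
    '
}

# Example with 44.43% free, which is above both thresholds:
printf '%s\n' \
    'Free diskspace threshold = 25.00% which is 0.24GB' \
    'Free diskspace delete threshold = 5.00% which is 0.05GB' \
    'Free diskspace = 44.43% which is 0.43GB' | disk_state
```

On a node, the same function could be fed from the console with cat /dev/diskmanager/status | disk_state.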
Timeinfo display
wlan0 - 10.0.0.7:6945
Mode: DISK - 1197353978.186051 - Tue Dec 11 06:19:38 2007
Next disk write: 36.67
Next time recheck: 581.67
- This displays information about the timekeeper system. For the most part you can ignore this
- The timekeeper system attempts to make sure the stargate system time is always current
- It does this by trying to get the time from a Q330, the time from neighbors, or the most recent time saved to disk (done once a minute)
- The displayed information shows this (DUIKER for q330 time, UDP for network time, DISK for disk based time)
- This is important because stargates forget what time it is when they reboot
- Having current time on the stargate, even if it is only accurate to within a few minutes is much more useful than having no time.
- NOTE: This module may slow down the startup of the software system
- This is because the timekeeper attempts to find the most current time before letting the rest of the software start up
SinkTree display
Node 7: The sink is 0, there is no next hop
node sink h tett sett fdr rdr lett rate time stat next
- This shows the currently possible and currently selected paths to the sink nodes
- The most important columns are node, sink, h (hops), tett, time, stat, next
- node: The next hop node
- sink: The sink this node is sending to
- h: The number of hops away this node is from the sink
- tett: Total path ETT. The ETT is the metric for each link. Add them all up and you get the total ETT. The lower the ETT, the better!
- time: The last time we heard anything from this node
- stat: Status of this node. Active means it can be considered as the next hop
- next: The path data through this node will take
- The best next hop is chosen by the lowest ETT.
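The selection rule can be sketched in a few lines. This is a minimal illustration, assuming a simplified three-column input (node, total ETT, status) rather than the real SinkTree output, and assuming the status string is spelled ACTIVE. The helper name best_next_hop is hypothetical.

```shell
# best_next_hop: read "node tett stat" rows on stdin and print the node with
# the lowest total ETT among rows whose status is ACTIVE.
# The input layout is a simplified assumption, not the real sinkstatus format.
best_next_hop() {
    LC_ALL=C awk '
        $3 == "ACTIVE" && (best == "" || $2 + 0 < min) { best = $1; min = $2 + 0 }
        END { if (best != "") print best; else print "none" }
    '
}

# Node 140 has the lowest ETT but is not ACTIVE, so it is skipped.
printf '103 4.20 ACTIVE\n112 2.75 ACTIVE\n140 1.10 DEAD\n' | best_next_hop
```

With the sample rows above, this prints 112.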
Recent Transfers display
-- Outputdir: ClientTimeout 45000, ServerTimeout 40000 ---- Sending ---- ---- Receiving ----
- This will show live incoming and outgoing transfers as well as recently completed/failed transfers in the last few minutes
- You may occasionally see a file go to 100% and then fail. This means that the transfer finished, but the final 'goodbye packets' got lost. The code should be able to resume the transfer properly and finish. If it does not, let Martin know.
Recent sysmanlog display
- This is mostly for debugging
- This shows the last 1.5 hours of logs which are being bundled and eventually sent to the main raid
Duiker Page
- The duiker page shows the current status output of duiker
- You can set the serial number and the site location
- You can also initiate unlock/lock/center commands from here
- If you do not see a bunch of status information below the unlock command box, that means DUIKER IS NOT RUNNING!
Linkinfo Page
- Use linkinfo to determine if the chosen location is good enough for a deployment
- You can ping by hand, but always come back and look at the linkinfo. It shows information that is much richer than a simple ping
How it works
- The linkinfo module is aware of all the neighbors because it is constantly collecting information from any data traffic in the network
- Once it is aware of a neighbor, it attempts to send probe packets to the neighbor
- The probe packets are sent once a second
- Every ten seconds, the number of successfully sent probes divided by the number of sent probes is worked into an EWMA (exponentially weighted moving average) which tracks the packet delivery ratio
- If the node happens to be generating other traffic on that particular link at a rate greater than 10 packets every 10 seconds, the node will stop generating probe traffic
- Using the probe packets and any other network traffic, the node is also able to determine what the most used data rate is
- 802.11b can send packets at various data rates: 1Mbps, 2Mbps, 5.5Mbps, 11 Mbps
- The higher the data rate, the faster the data can be sent between the two nodes
- The drawback to a higher data rate is that with a poor link, it has a lower probability to get through
- The 802.11b card will try to pick a rate based on whether it is successfully sending packets at the various rates
- The packet delivery ratios along with the data rate are used to compute the ETT (estimated transmission time) for the given link.
- The ETT is used to determine the best links to use to get to the sink.
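The smoothing step described above can be sketched as follows. This is a minimal illustration of an EWMA over 10-second windows, not the linkinfo module's code. The smoothing weight ALPHA, seeding with the first window, and the helper name ewma_pdr are all assumptions.

```shell
# ewma_pdr ALPHA: read "delivered sent" pairs (one 10-second window per line)
# on stdin and print the smoothed packet delivery ratio as a percentage.
# ALPHA is a hypothetical smoothing weight; the real module's value is unknown.
ewma_pdr() {
    LC_ALL=C awk -v alpha="$1" '
        $2 > 0 {
            sample = $1 / $2                         # delivery ratio of this window
            if (!seen) { ewma = sample; seen = 1 }   # seed with the first window
            else       { ewma = alpha * sample + (1 - alpha) * ewma }
        }
        END { if (seen) printf "%.1f\n", ewma * 100 }
    '
}

# Three windows of 10 probes each: 10, then 8, then 5 delivered.
printf '10 10\n8 10\n5 10\n' | ewma_pdr 0.5
```

With a weight of 0.5 the three windows give 100%, then 90%, then 70.0%. The average trails the raw samples, which is consistent with the reading needing a couple of minutes to settle.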
The output
- The columns are as follows:
- Neighbor: the neighboring node
- Status/Gen: status is the assumed status of the node / gen is whether we are generating probe traffic to this node
- The status can show a couple of different states.
- It attempts to determine the state from the delivery ratio as well as the time we last heard from the node and the time since the node has reported something recent about us
- This value is meant as a quick idea of what is going on. It is more important to note whether it is changing and how stable it is. The states are:
- UNKNOWN: we know there might be a node there but we have not heard anything telling us it is a CDCC
- STARTUP: there is a node there but we are still trying to collect information about it
- ACTIVE: All signs point to 'Yes': this node is active and we are determining the link quality
- ASYMM: We know it is a CDCC and we can hear it, but it does not look like it knows we exist
- DEAD: Dead. Having lots of trouble sending or receiving packets to this node
- If you see a node jumping between states such as STARTUP and ACTIVE and DEAD... that means the link is probably pretty flaky. Try adjusting the antenna
- The 'Gen' indicates we are currently sending probes to this node (Y means yes, N means no)
- Last Heard: how long ago we heard any information from this node about us
- Each node is aware of when it last heard some information about another node
- They report to each other the last time they heard information from each other
- If this value is high and repeatedly goes high, then that means the link is flaky since the two nodes can not continuously successfully communicate
- RSSI (dBm): the signal strength of each received packet from this node. this value is a running average
- Silence (dBm): the ambient noise level before a packet was received. this value is a running average
- The silence value is important to watch during deployment. The higher it is, the worse the surrounding 'noise' is and the more difficult it will be to receive packets.
- SNR: the rssi - silence
- We have only recently started displaying this information so we do not have a good 'feel' for what the RSSI, Silence, and SNR values should be for a good site.
- In our past deployments we have shown that you can have a good link even if the SNR is poor, however it is always best to try and setup sites with good SNR.
- When setting up a site, pay attention to the rssi, silence value, and snr and how it changes. It is good to keep track of this mentally so you can gain an understanding of what good values are.
- Rate In/Out: The data rate to and from this node. 10 -> 1Mbps, 20 -> 2 Mbps, 55 -> 5.5 Mbps, 110 -> 11 Mbps
- The data rate can tell you a lot about whether a link is flaky or not.
- If the link is at 1 or 2 Mbps, there is enough packet loss to force the rate down. This does not necessarily mean the link is unusable, but you should proceed with caution or do your best to adjust the antennas.
- Conn In/Out: The packet delivery ratios as a percentage for sent and received packets
- Combining the packet delivery ratios and the data rate can also tell you a lot about a link
- If you see a connectivity of less than 70% in both directions, the link is going to perform very poorly.
- If the connectivity is high but the data rate is at 1 or 2 Mbps, there could potentially be significant loss on the link, especially with more traffic
- It is difficult to tell in this situation so the best thing to do is try to send some more pings by hand, or readjust the antennae
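Two of the columns above are simple conversions, which can be sketched as shell helpers. The rate codes come from the Rate In/Out description and the SNR is the RSSI minus the silence value. The helper names and the sample dBm readings are hypothetical and are not meant as reference values for a good site.

```shell
# rate_code_to_mbps CODE: translate a linkinfo rate code into Mbps.
rate_code_to_mbps() {
    case "$1" in
        10)  echo 1 ;;
        20)  echo 2 ;;
        55)  echo 5.5 ;;
        110) echo 11 ;;
        *)   echo unknown; return 1 ;;   # not one of the four 802.11b rates
    esac
}

# snr_db RSSI SILENCE: both arguments in dBm, prints their difference.
snr_db() {
    LC_ALL=C awk -v rssi="$1" -v silence="$2" 'BEGIN { printf "%.1f\n", rssi - silence }'
}

rate_code_to_mbps 55     # prints 5.5
snr_db -60 -95           # illustrative readings only, prints 35.0
```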
Important notes!!!
- It takes at least 120-180 seconds for the linkinfo to get a stable reading!
- You are more than welcome (encouraged in fact) to go and send additional pings between nodes (linkinfo will take these into consideration as readings as long as they are large packets!)
- BUT... Please do not only use the pings to make a decision about a link! Use the data rate and the SNR as well. Good links are crucial for this system to work!
- If you are going to send pings yourself, use the ping command with the -s 1400 flag added. Use broadcast pings only to determine what neighbors are available
ping -s 1400 10.0.0.112
- You should _never_ see a high data rate and a low connectivity percentage.
- This is because if there are connectivity problems, the 802.11 card will lower the data rate automatically
- If you do see this, wait 2 minutes and see what happens. Try to send a few pings.
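The rules of thumb in this section can be combined into a rough first-pass check. This is only a sketch under stated assumptions: "low connectivity" is taken to mean both directions below 70%, a "high" rate is taken to mean code 55 or 110, the percentages are whole numbers, and the helper name link_verdict is hypothetical. It does not replace watching the linkinfo page for a few minutes.

```shell
# link_verdict CONN_IN CONN_OUT RATE_CODE
# CONN_IN/CONN_OUT are whole-number percentages, RATE_CODE is 10, 20, 55 or 110.
# The cut-offs below are assumptions drawn from the notes above.
link_verdict() {
    conn_in=$1
    conn_out=$2
    rate=$3
    if [ "$conn_in" -lt 70 ] && [ "$conn_out" -lt 70 ]; then
        if [ "$rate" -ge 55 ]; then
            # High rate with low connectivity should never persist.
            echo "unexpected: wait 2 minutes and ping again"
        else
            echo "poor"
        fi
    elif [ "$rate" -le 20 ]; then
        # Delivery looks fine but the card has dropped to 1 or 2 Mbps.
        echo "caution"
    else
        echo "ok"
    fi
}

link_verdict 95 92 110   # prints ok
link_verdict 95 92 10    # prints caution
```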
Talking with the CDCC over WIRELESS
- Set up the wireless or serial connection on the deployment laptop
- Run the script 'seismic_me' with sudo or as root
- If the script does not exist, enter the following commands as root
- To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will require you to enter the current user's password the first time only
iwconfig wlan0 mode Ad-Hoc
iwconfig wlan0 channel 11
prism2_param wlan0 pseudo_ibss 1
iwconfig wlan0 essid perunet
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
# yes, do the line above twice
- Ping the CDCC wirelessly from the laptop (we will assume you are next to box 103)
ping 10.0.0.103
- If ping fails, go to serial connection instructions below
- If the ping works, attempt to connect to the CDCC's web page with the url:
http://10.0.0.103:8080/
- This will bring up a webpage that shows the status of the CDCC
- The section above will explain how to understand what you see
Talking with the CDCC over SERIAL or SSH
- These instructions are for you if
- You are plugged in directly to a stargate over serial
- You are ssh'ing into a stargate from a laptop or another stargate
- ssh is the preferred method because the terminal is a lot nicer (you will understand once you try), but serial may sometimes be necessary
Serial setup
- You need to have minicom properly set up
- It is too annoying to explain here. Talk to Igor, Martin, or Vinayak on how to do this.
SSH setup
- You can set up keys to work with ssh
- The deployment laptops should have these keys on them already and there should be a script that lets you just connect to the cdcc no questions asked
- If you do not have the keys, obtain them from Martin
- He will send you a tar.gz file
- extract the contents of the file with 'tar zxvf peru-keys.tar.gz'
- then, 'mkdir ~/ssh-peru' and copy the id_rsa to the ~/ssh-peru directory with 'cp id_rsa ~/ssh-peru'
over ethernet
- To set up your ethernet connection, run the following commands as root or with sudo
- To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will require you to enter the current user's password the first time only
ifconfig eth0 netmask 255.255.255.0 broadcast 192.168.100.255 192.168.100.95 up
ifconfig eth0 netmask 255.255.255.0 broadcast 192.168.100.255 192.168.100.95 up
# yes enter the command twice
- You can check if everything is configured correctly by typing ifconfig and looking for the eth0 interface
- Once the ethernet is configured, type:
ssh root@192.168.100.100 # the CDCC's ethernet address _always_ ends in 100
# Or if you have the ssh keys, type
ssh -i ~/ssh-peru/id_rsa root@192.168.100.100
over wireless
- To set up the wireless connection, run the 'seismic_me' script, OR run the following commands as root or with sudo
- To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will require you to enter the current user's password the first time only
iwconfig wlan0 mode Ad-Hoc
iwconfig wlan0 channel 11
prism2_param wlan0 pseudo_ibss 1
iwconfig wlan0 essid perunet
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
# yes, do the line above twice
- You can check if everything is configured correctly by typing ifconfig and looking for the wlan0 interface
- Once the wireless is configured, type:
ssh root@10.0.0.103 # Make sure to replace the 103 with the ID of the CDCC
# Or if you have the ssh keys, type
ssh -i ~/ssh-peru/id_rsa root@10.0.0.103
Using the console
- There is a lot of useful information available in the console.
CF card layout
- The CF card is mounted to /opt on the stargate
- It is important to understand the layout so you can find information quickly and not mess things up :D
- If you forget this information, there is a file in /opt which shows it all again... you can just cat README.txt to get it
# cat /opt/README.txt
bin - extra programs
conf - station name, and anything else to be read by any apps
cron - all the cron directories
data - duiker will place data here
dts - hidden runtime information for dts... do not delete
duiker - duiker binarys and configuration files
emstar - all our code
filemover - any files in here will be moved to the next hop WITH a dts header
log - system log will go here if turned on
.log - for the systemmanager temp info... do not delete!
startup - put scripts to run on startup here
tmp - play here
xfer - any files in here will be moved to the next hop without a dts header
- Do your best not to add extra files or directories directly to /opt. If you have to do stuff, use the tmp subdirectory
Checking various status information
Disk usage and file counts
- This is important to do to understand how much data is currently on the node. The data is on the compact flash card, which is mounted to /opt
- There are a variety of ways to do this. Below you see the quickest ways, complete with examples
- Indented text shows the output of the command
# You can ask the running software about the diskspace
cat /dev/diskmanager/status
    Data directory /opt/data/ contains 0 files, which are lined for deletion
    Filemover directory /opt/filemover/ contains 0 files, which are lined for deletion
    Xfer directory /opt/xfer/ contains 0 files, which are lined for deletion
    Total diskspace = 0.97GB
    Free diskspace threshold = 25.00% which is 0.24GB
    Free diskspace delete threshold = 5.00% which is 0.05GB
    Free diskspace = 10.43% which is 0.10GB
# You can ask the OS about the diskspace... the thing to look for is /opt since that is where the CF card is mounted.
df -h
    Filesystem Size Used Avail Use% Mounted on
    rootfs 30M 16M 14M 54% /
    /dev/root 30M 16M 14M 54% /
    /dev/hda1 996M 893M 103M 90% /opt
# You can see how many files are in each of the data directories
# The data directory is where local duiker data goes first, the filemover directory is where in-transit data and local data end up,
# and the xfer directory is where logs and anything extra in transit go
ls /opt/data | wc
ls /opt/filemover | wc
ls /opt/xfer | wc
# The first number in the output is the number of files
Duiker status
# Check if duiker is running. If you see two entries, one of them being 'grep duiker', then duiker is running
ps awx | grep duiker
    3723 ? S 8:37 ./duiker
    30520 pts/0 S 0:00 grep duiker
# Alternatively, you can do the following, which shows a lot of status information
cat /dev/duiker/status
# Check if duiker is collecting data. Run the following command a few times in a row. Look to see if one of the packets files is
# increasing in size. The fifth column (the one right before the date) is what you look at
ls -l /opt/duiker/*.packets
    -rw------- 1 root root 208616 Jan 21 18:58 /opt/duiker/20080121190026.TO.LECS.bundle_q330_packets.packets
ls -l /opt/duiker/*.packets
    -rw------- 1 root root 212616 Jan 21 18:58 /opt/duiker/20080121190026.TO.LECS.bundle_q330_packets.packets
Software status
- A lot of this is similar to the information shown on the web page
- To see the sinktree information, type the following. See the information above about the web page to understand what this shows
sinkstatus # Or, you can type cat /dev/dts/sink_status
- To see the recent transfers, type
xfers # Or, you can type cat /dev/xfer/status
- To see the linkinfo, type the following. Note this is an advanced version of what is on the webpage. Try to use
cat /dev/linkinfo/status-wlan0 # Look for the percentages at the end of the line indicating each neighbor, and look for the data rates
- You can check the status of all the software with the following command
- The thing to look for here is that everything is 'running'. If you see something is 'looping' or 'waiting', record the output and let Martin know
status # or type cat /dev/emrun/status
- Also, if you do see something funny, run the following and record the output for Martin
cat /dev/emrun/last_msg
Using rbsh
- rbsh will let you talk to one or more nodes simultaneously without having to log in with ssh
- It is like being ssh'ed to one or more nodes all at once
- It is useful for quickly checking the status of a node and it is highly recommended over using ssh!
- It is on every single cdcc as well as on the deployment laptops
- NOTE: rbsh is 'best effort'. It will try its best to send commands and get responses from nodes, but it does not guarantee any reliability!!!
Setting up for rbsh
- If you are on a deployment laptop, you need to set up the wireless for this to work
- To set up the wireless connection, run the 'seismic_me' script, OR run the following commands as root or with sudo
- To enter the commands as root, either type 'su -' and enter the root password (then just type in the commands), or add 'sudo' in front of each command, which will require you to enter the current user's password the first time only
iwconfig wlan0 mode Ad-Hoc
iwconfig wlan0 channel 11
prism2_param wlan0 pseudo_ibss 1
iwconfig wlan0 essid perunet
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
ifconfig wlan0 netmask 255.255.255.0 broadcast 10.0.0.255 10.0.0.3
# yes, do the line above twice
Running rbsh
- To run rbsh at the command line on the laptop or the CDCC run the following
rbsh -b wlan0
- Once it starts, press 'enter' a few times, and you should see something like
[4] rbsh 4>
Node=0.0.0.13, reply to seqno=4: Exit status 0
Node=0.0.0.140, reply to seqno=4: Exit status 0
Node=0.0.0.129, reply to seqno=4: Exit status 0
Node=0.0.0.126, reply to seqno=4: Exit status 0
[4] rbsh 5>
- Pressing enter sends an empty command to all the nodes. It is a quick way to see what other nodes you can talk to
Entering commands
- To enter a command, type it in, and press enter
- When you press enter, nothing happens at first, because rbsh is giving you the option to enter more commands
- Once you have entered all the commands you want to send, just push enter to submit a blank line and issue the commands
- The responses should come back within a few seconds
- Here is an example:
[7] rbsh 5> df -h
[7] rbsh 5>
Node=0.0.0.140, reply to seqno=5: Exit status 0
Filesystem Size Used Avail Use% Mounted on
/dev/mtdblock2 30M 22M 8.2M 73% /
... [cut]
Node=0.0.0.7, reply to seqno=5: Exit status 0
Filesystem Size Used Avail Use% Mounted on
rootfs 30M 16M 14M 54% /
/dev/hda1 996M 896M 100M 90% /opt
170 [1 missed]
[6] rbsh 6>
- Note the '170 [1 missed]' right at the bottom: This means rbsh thinks it did not receive a response from a node it had talked to previously. This is useful to look out for.
checking or aborting commands
- rbsh also lets you check the commands you have entered so far or abort them. Typing help shows what is available:
[6] rbsh 6> help
Prompt format: [X] rbsh Y>, where
X is the number of nodes that replied to the last request
Y is the sequence number of the next request
Local commands:
help: prints this message
abort: discards current command
check: prints current command
delta: prints out any missing/added nodes since last command
sorted: prints out sorted list of node IDs
exit: exits
[6] rbsh 6>
- To exit rbsh, type exit
Talking to only certain nodes
- To talk to only one node or to ignore certain nodes, exit rbsh and rerun it with the --ignore or --dest flags followed by a comma separated list of node IDs.
- For instance, to send commands only to nodes 140 and 129, do the following:
rbsh -b wlan0 --dest 140,129
- To send commands to all nodes except 140 and 129, use the --ignore flag instead
Using dts
Manual filemover
- Edit /etc/conf/filemover-sg.conf so the destination IP and user are correct. It should look something like this:
NEXT_HOP=192.168.100.6
FILEMOVER_DIRS="/opt/filemover /opt/xfer"
SSH_KEYS=/opt/.ssh/id_rsa
USER=uclanet
- The SSH_KEYS above should point to the private key of the transmitting host
- If you are on a CDCC, it is /root/.ssh/id_rsa ... note /root/.ssh soft links to /opt/.ssh/ !!!!
- Append the public key to authorized_keys
- If you are on a CDCC the public key could be /etc/conf/filemover_idrsa or /opt/.ssh/id_rsa.pub or /root/.ssh/id_rsa.pub
- authorized_keys is in ~/.ssh/authorized_keys
- Copy /opt/bin/filemover-sg to /opt/cron/cron.hourly
cp /opt/bin/filemover-sg /opt/cron/cron.hourly/filemoversg # NOTE: You MUST not have any -'s or .'s in the filename
- This will make the files be copied every hour to the remote host
- To make this happen only once a day, use cron.daily or setup a crontab file in /opt/cron/cron.d/ ... ask Martin how to do this.
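Because a script name containing '-' or '.' must not be used in the cron directories, it can help to check the name before copying. A minimal sketch, with a hypothetical helper name valid_cron_name:

```shell
# valid_cron_name NAME: succeed only if NAME contains neither '-' nor '.',
# following the naming note above. Hypothetical helper for illustration.
valid_cron_name() {
    case "$1" in
        *[-.]*) return 1 ;;   # contains a dash or a dot
        *)      return 0 ;;
    esac
}

for name in filemoversg filemover-sg filemover.sh; do
    if valid_cron_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: rename before copying"
    fi
done
```

Only filemoversg passes; the other two names would need to be renamed first.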
Upgrading a live CDCC
- Get the latest CF image!
- The basic idea is like this:
- Stop cron
- Stop all the processes you can. If you can not stop some, prevent them from starting up and reboot.
- Save some configuration files
- Format card if files are corrupt OR delete certain files to ensure new versions
- Extract the tarfile, save logs
REPLACE RUNNING SYSTEM VERSION
killall duiker
cd /opt/startup
rm *
remount rw
cd /opt/duiker
cp duiker.conf /root
remount ro
# wait 30 seconds for duiker to stop, check with ps awx |grep duiker
shutdown -r now
# once back up, scp CF image to /opt/tmp, then
cd /opt/
tar zxvf /opt/tmp/peru-CFcard_200...
# wait 10 seconds
sync
# start duiker
/opt/startup/duikstart start
# wait 10 seconds
# start dts
/opt/startup/dts start
# make sure everything is running!
# duiker: cat /dev/duiker/status
# dts: status
# Look to make sure everything is 'running' and not 'looping'.
# If you see 'waiting' then they are waiting for the time from neighbors.
# If the node has duiker and things are 'waiting', then check to make sure duiker is running (then restart dts)
# It is ok if 'bozohttpd' is looping or disabled.
# If things are looping, then some of the
# binaries may have gotten corrupted on the
# overwrite, to fix this:
/opt/startup/dts stop
sleep 5
killall emrun
sleep 5
killall bozohttpd emrun sinktree dis_service emproxy \
    systemmanager mhsyncf timekeeper udpd dts linkinfo \
    diskmanager filemover tcpxfer
cd /opt
rm -rf emstar
cd /opt/tmp
tar zxvf peru-CFcard_200...
sync
cp -r emstar ../
sync
sleep 10
# try again:
/opt/startup/dts start
# if still problems... try this all again?
FORMAT CARD VERSION
/etc/init.d/cron stop
/opt/startup/duikstart stop
/opt/startup/dts stop
sleep 10
killall bozohttpd emrun sinktree dis_service emproxy \
    systemmanager mhsyncf timekeeper udpd dts linkinfo \
    diskmanager filemover tcpxfer
sleep 10
killall bozohttpd emrun sinktree dis_service emproxy \
    systemmanager mhsyncf timekeeper udpd dts linkinfo \
    diskmanager filemover tcpxfer
sleep 5
killall -9 bozohttpd emrun sinktree dis_service emproxy \
    systemmanager mhsyncf timekeeper udpd dts linkinfo \
    diskmanager filemover tcpxfer
sleep 2
# stop here if anything is still running (exit assumes this is run as a script)
ps awx |grep -E "(linkinfo|systemman|timekeep|dts|diskman|filemover|tcpxfer|emrun|mhsyncf)" \
    | grep -q -v grep
if [ $? -eq 0 ]; then echo "Stuff still running. Try kill by hand"; exit 1; fi
ps awx |grep duiker | grep -q -v grep
if [ $? -eq 0 ]; then echo "duiker still running. wait 30 seconds and try this again"; exit 1; fi
remount rw
cp /opt/duiker/duiker.conf /root/
cp -r /opt/.log /root
cd ..
umount /opt
echo "Card should have umounted successfully"
echo "If not, rm /opt/startup/* then reboot, and try to unmount again"
echo "Do these next commands by hand"
# mkreiserfs /dev/hda1
# mount /opt
# cd /opt/
# # copy cfimage to the card
# tar zxvf peru-CFcard_200*
# cp -r /root/.log /opt/
# cp /root/duiker.conf /opt/duiker/duiker.conf
# sync
# echo "RESTART CRON "
# /etc/init.d/cron start
# remount ro
# # reboot, wait 10 mins, or do this:
# /opt/startup/duikstart start
# sleep 5
# /opt/startup/dts start
Current Bugs
- Timekeeper: If duiker and dts start at the same time, duiker actually creates the time device, so timekeeper thinks it has duiker time
- Leave duiker as is (in future fix so only creates device once has time)
- Have timekeeper recheck when it has duiker time as well
- Linkinfo
- add and test refractory setting for status client
- add rate limiting to probing
Current Feature requests
- Make console linkinfo simpler and similar to web version
- Something to be able to check sensor lock/unlock state
- Setup to log Q330 event detection with duiker
- Refine centering script - add configurability and randomization - distribute
- Partial duiker file recover script - distribute
- Teach Richard DTS
- Setup base emstar stack on all Peru PCs for logging
- Figure out and setup rsync to replace manual filemover - dont forget log output from it
- Get last two internet connected CDCC's
- Is delete bug really still there?
- CF tests on local testbed
Web Page
- Make shutdown button smaller and move it out of the easy-to-click spot
- Add rbsh interface and dtsh interface
Peru Web App
- Organize Day-Site inspection page better
- Add Q330 event display from logs
- Add filesystem error log output
- Sparklines of boom positions and power and temp
- Figure out problems with link quality display on google maps - do we average? Do we show ett? How do we show both directions?
- Create CDCC-PE lookup page... make sure it is 'correct'
- Fix all scripts to do proper CDCC-PE lookups
- Do GPS averaging page
- Figure out how to show repeater status and on what pages it would be useful
Log output
- All entries have the following information with them:
time - 1197934497.044046
node - 114
seqno - 2354
- time is in seconds.microseconds since 1970 (it is the standard struct timeval)
- When making the table, please include two time columns, one called systime and the other called tstime. Set systime to the above value
- node is between 1 and 255
- seqno is a 64 bit unsigned int
- seqno are mostly unique per node (two nodes can and will have the same seqnos)
- BUT, as you will see below, a single node can use the same seqno a few times in certain situations
- The first is for reporting multiple data events at the same time. An unfortunate consequence of how things are reported.
- The second is if the seqno gets reset. We will know when this happens since the system is aware of it.
Note: any of these formats can be converted as needed before being inserted into the db
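As a rough sketch of how the three common fields could be checked before insertion, the Python below validates the ranges described above and turns the time into a UTC datetime for the systime column. The function names and the idea of receiving the fields as three strings are assumptions made for illustration, not part of the logging software.

```python
from datetime import datetime, timezone


def parse_common_fields(time_s, node_s, seqno_s):
    """Validate the time, node and seqno that accompany every log entry."""
    # Split the timeval as text so the microseconds are not rounded by a float.
    sec_s, _, usec_s = time_s.partition(".")
    sec = int(sec_s)
    usec = int(usec_s.ljust(6, "0")[:6]) if usec_s else 0

    node = int(node_s)
    if not 1 <= node <= 255:
        raise ValueError(f"node out of range: {node}")

    seqno = int(seqno_s)
    if not 0 <= seqno < 2**64:
        raise ValueError(f"seqno does not fit in 64 bits: {seqno}")

    return {"sec": sec, "usec": usec, "node": node, "seqno": seqno}


def systime_utc(entry):
    """Render the timestamp as a UTC datetime, e.g. for the systime column."""
    base = datetime.fromtimestamp(entry["sec"], tz=timezone.utc)
    return base.replace(microsecond=entry["usec"])


entry = parse_common_fields("1197934497.044046", "114", "2354")
print(entry, systime_utc(entry).isoformat())
```

Since a node can reuse a seqno, (node, seqno) alone should not be treated as a unique key when designing the table.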
Log information
- This is provided for every log file processed
- Each node generates one log file an hour
- It is useful to keep track of this information since it attempts to show what software version is running on the node!
node - node
start - seconds.usec representation of time the log started
startasc - ascii representation of time the log started. Format is: %Y%m%d%H%M%S
end - seconds.usec representation of time the log ended
endasc - ascii representation of time the log ended
fs - file system version - done as ascii time the fs was created like with above format
cf - creation date of software on the CF card - ascii time format as above
processtime - time log file was processed
proccesstime_asc - ascii time log file was processed
# example:
182 1198019849.654704 20071218231729 1198020149.658536 20071218232229 200712171818 0 1199486688.58 20080104224448
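A minimal sketch of reading such a record in Python, assuming the values are whitespace-separated in the order listed (as the example suggests). It also cross-checks the ascii times against the numeric ones. In the example they agree when the numeric time is read as UTC, which is an observation from that one line, not a documented guarantee. The helper names are made up for this illustration.

```python
from datetime import datetime, timezone

# Field names in the order listed above (spelling kept as in the log format).
LOGINFO_FIELDS = ["node", "start", "startasc", "end", "endasc",
                  "fs", "cf", "processtime", "proccesstime_asc"]

EXAMPLE = ("182 1198019849.654704 20071218231729 1198020149.658536 "
           "20071218232229 200712171818 0 1199486688.58 20080104224448")


def parse_loginfo(line):
    """Split one log information line into a dict of strings."""
    values = line.split()
    if len(values) != len(LOGINFO_FIELDS):
        raise ValueError(f"expected {len(LOGINFO_FIELDS)} fields, got {len(values)}")
    return dict(zip(LOGINFO_FIELDS, values))


def epoch_to_asc(epoch_s):
    """Format the whole-second part of a seconds.usec time as %Y%m%d%H%M%S (UTC)."""
    whole = int(epoch_s.split(".")[0])
    return datetime.fromtimestamp(whole, tz=timezone.utc).strftime("%Y%m%d%H%M%S")


def asc_times_agree(rec):
    """True if startasc/endasc match the numeric start/end times."""
    return (epoch_to_asc(rec["start"]) == rec["startasc"]
            and epoch_to_asc(rec["end"]) == rec["endasc"])


rec = parse_loginfo(EXAMPLE)
print(rec["node"], rec["fs"], asc_times_agree(rec))
```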
q330 status information
- We get status reports every 10 mins about the q330
- They come as three separate messages
- The times of the three messages in a group differ by about 10-20 milliseconds
- The seqno for each of the three messages is unique
- We can do things so that the time is all the same and these get put into the DB as one entry into one table
# global info
clockqual - clock quality - displayed as hex number... stored as 16 bit unsigned int (uint16_t)
minsinceloss - minutes since loss (of gps) - stored as uint16_t
secoffset - seconds offset - uint
micsecoffset - microseconds offset - uint
totalsec - total time in seconds - uint
powsec - power on time in seconds - uint
lastsync - time of last resync - uint
vco - current vco - uint16_t
miscin - misc inputs - displayed as hex - uint16_t
site_code - optional - 6 chars max
# example:
0x51 10119 249436455 999996 137585117 76251468 249436453 2081 0x00
# gps info
powtime - power on time - uint16_t
powind - power on indicator - uint16_t
numsatuse - number of satellites in use - uint16_t
numsatrange - number of satellites in range - uint16_t
gpstime - gps time string - 10 characters max
gpsdate - gps date string - 12 characters max
gpsfix - gps fix string - 6 chars max
gpsheight - gps height string - 12 chars max
lat - latitude string - 14 chars max
lon - longitude string - 14 chars max
lastgood1pps - time of last good 1PPS signal - uint
site_code - optional - 6 chars max
# example:
36 1 0 12 "23:16:26" "17/12/2007" "NONE" "124.1M" "3404.1748N" "11826.5046W" 250642564
# power and temp info ( all are uint16_t's )
boomone - channel one boom position
boomtwo - channel two boom position
boomthree - channel three boom position
possup - positive power supply (10mv incr)
inpow - input power supply (150mv incr)
systemp - system temperature (C)
maincurr - main current (1ma incr)
gpsantcurr - gps antenna current (1ma incr)
site_code - optional - 6 chars max
# example:
13 89 89 546 99 27 60 0
# command info
cmd - one of: unlock, lock, center
site_code - optional - 6 chars max
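The sketch below shows one way the gps and power/temp examples above could be read in Python. It assumes whitespace-separated values with string fields in double quotes (as in the gps example) and applies the increments given for possup and inpow. The helper names are hypothetical; only the field names and scale factors come from the lists above.

```python
import shlex

GPS_FIELDS = ["powtime", "powind", "numsatuse", "numsatrange", "gpstime",
              "gpsdate", "gpsfix", "gpsheight", "lat", "lon",
              "lastgood1pps", "site_code"]
POWER_FIELDS = ["boomone", "boomtwo", "boomthree", "possup", "inpow",
                "systemp", "maincurr", "gpsantcurr", "site_code"]


def parse_record(line, fields):
    """Split a status line into a dict; the trailing site_code may be absent."""
    values = shlex.split(line)  # handles the double-quoted string fields
    if len(values) not in (len(fields) - 1, len(fields)):
        raise ValueError(f"unexpected field count: {len(values)}")
    return dict(zip(fields, values))


def power_in_physical_units(rec):
    """Apply the increments listed above; voltages are kept in integer mV."""
    return {
        "possup_mv": int(rec["possup"]) * 10,    # 10 mV increments
        "inpow_mv": int(rec["inpow"]) * 150,     # 150 mV increments
        "systemp_c": int(rec["systemp"]),
        "maincurr_ma": int(rec["maincurr"]),     # 1 mA increments
        "gpsantcurr_ma": int(rec["gpsantcurr"]),
    }


gps = parse_record('36 1 0 12 "23:16:26" "17/12/2007" "NONE" "124.1M" '
                   '"3404.1748N" "11826.5046W" 250642564', GPS_FIELDS)
power = power_in_physical_units(parse_record("13 89 89 546 99 27 60 0", POWER_FIELDS))
clockqual = int("0x51", 16)  # hex-displayed fields such as clockqual and miscin

print(gps["gpstime"], gps["gpsfix"], power, clockqual)
```

Keeping the voltages as integer millivolts avoids floating-point rounding; divide by 1000 at display time if volts are wanted.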
Link information
- Every 10 mins information about the links is reported
- Each report contains information about multiple nodes so it is split up
- Because it is split up, multiple lines being put into the database will have the same seqno and time
host - the neighboring node - an ip address
stat - the status - one word, one of: UNKNOWN STARTUP ACTIVE ASYMM DEAD Inval
rss - ewma of receive signal strength - float
rssdev - running stddev of rss - float
sil - ewma of silence value (do rss - sil to get snr) - float
sildev - running stddev of sil - float
recvr - most common recv data rate for the last 10 seconds - integer < 255 - divide by 10 to get actual data rate
sendr - most common send data rate for the last 10 seconds - integer < 255 - divide by 10 to get actual data rate
succp - success percentage of incoming packets - float - 0.00 to 100.00
succewma - success percentage of outgoing packets - float - 0.00 to 100.00
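The derived values mentioned in the field list (snr from rss - sil, and the data rates divided by 10) can be sketched in Python as follows. The sample record is invented purely for illustration since no example line is given here, and the function name is hypothetical.

```python
LINK_STATES = {"UNKNOWN", "STARTUP", "ACTIVE", "ASYMM", "DEAD", "Inval"}


def link_summary(rec):
    """Compute derived values for one neighbor from a link information record."""
    if rec["stat"] not in LINK_STATES:
        raise ValueError(f"unknown link status: {rec['stat']}")
    return {
        "host": rec["host"],
        "stat": rec["stat"],
        "snr": rec["rss"] - rec["sil"],     # rss - sil, as noted in the field list
        "recv_rate": rec["recvr"] / 10,     # stored value is 10x the data rate
        "send_rate": rec["sendr"] / 10,
    }


# Hypothetical values, not taken from a real log.
sample = {"host": "10.0.0.112", "stat": "ACTIVE", "rss": -70.0, "sil": -95.0,
          "recvr": 110, "sendr": 55}
summary = link_summary(sample)
print(summary)
```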
filemover information
- When a file is received it generates one of these messages
- We can get this information from elsewhere and I think it may be a more reliable source and easier to work with
file - the filename (see example below)
dst - the node which received the file - ip address
src - the node which sent the file - ip address
xtime - time in seconds the transfer took
btime - time in seconds the transfer took including the wait in between retries
ret - number of retries
bw - approximate bandwidth
size - file size
# example:
/opt/filemover/20070430040711.TO.LECS.bundle_q330_packets.tar.gz.dts 10.0.0.97 10.0.0.182 9351 9351 0 274.812744
- There are also some error messages which talk about problems accessing or deleting files
file - the filename
msgsrc - the error id
msg - some message about the error
path information
- Every 10 mins information about the path to the sink is reported
changes - number of times path has changed since the last report
ett - full path metric to sink - value ranges from 0.00001 to a large integer
path - a variable length string or 0. For example <182<192<149
# example:
16 0.001122 <182
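A small Python sketch of splitting a path report, assuming three whitespace-separated values as in the example. The hop list simply keeps the node numbers in the order they appear in the string; which end is the sink is not interpreted here. The function names are hypothetical.

```python
def parse_path_field(path_s):
    """Turn '<182<192<149' into [182, 192, 149]; '0' means no path."""
    if path_s == "0":
        return []
    return [int(hop) for hop in path_s.split("<") if hop]


def parse_path_report(line):
    """Split one path information line into (changes, ett, hops)."""
    changes_s, ett_s, path_s = line.split()
    return int(changes_s), float(ett_s), parse_path_field(path_s)


changes, ett, hops = parse_path_report("16 0.001122 <182")
print(changes, ett, hops)
```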
disk status/deletion information
- Every 10 mins the disk space is reported
freeds - free disk space (in megabytes)
freedsp - free disk space percentage
usedds - used disk space (in megabytes)
useddsp - used disk space percentage
total - total disk space (in megabytes)
thresh - threshold at which no files are accepted
dthresh - threshold at which files are deleted
# example:
767.72 78.55 209.68 21.4 977.40 25.00 5.00
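As an illustrative sanity check before loading a disk report, the Python below parses the example line and confirms that free plus used space adds up to the total within a small tolerance. The tolerance value and function names are assumptions; the thresholds are parsed but deliberately not interpreted, since their units are not stated above.

```python
DISK_FIELDS = ["freeds", "freedsp", "usedds", "useddsp", "total",
               "thresh", "dthresh"]


def parse_disk_report(line):
    """Split one disk status line into a dict of floats."""
    values = [float(v) for v in line.split()]
    if len(values) != len(DISK_FIELDS):
        raise ValueError(f"expected {len(DISK_FIELDS)} fields, got {len(values)}")
    return dict(zip(DISK_FIELDS, values))


def sizes_consistent(rec, tol_mb=0.05):
    """True if free + used matches total within tol_mb megabytes."""
    return abs(rec["freeds"] + rec["usedds"] - rec["total"]) <= tol_mb


disk = parse_disk_report("767.72 78.55 209.68 21.4 977.40 25.00 5.00")
print(disk["freeds"], disk["total"], sizes_consistent(disk))
```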
- Whenever a file is deleted, a message is added to the log
file - the filename
# example:
20070430070711.TO.LECS.bundle_q330_packets.tar.gz.dts
reboot and seqno information
- There are three system messages that show up in the logs
- They have no extra information beyond the time, the node, and the seqno that all the nodes report
SYSMAN_REBOOT - The node experienced a reboot. This message happens when a node starts up
BUTTON_RESET - The button to reboot the node was pressed
SYSMAN_RESET_SEQ - The CF card was probably switched because a seqno was found but it was for a different node. So we reset the seqno.
Data processing
- In addition to this we have log messages for all the data files that get processed on the RAID. This is basically a timestamp and the filename.
