Quad WAR Crashes on Save (Beta-14)


nelson05
03-29-2006, 04:00 PM
After evaluating the recently added WEP support and having good luck in the lab, we went for it this weekend and put our first WAR into production as an AP.

We have a very basic setup at the moment where only one out of the four CM9s is being used with the other three waiting to connect to other WARs to feed our backhauls at 5.x. The card is configured to hide its SSID, InterBSS Relay is Off, Short Preamble is On, SuperA/G is on, and Operating Mode is 802.11g only. The BSS channel is set to 2412, Transmit Rate to 24, Link Distance to 6.00 miles, Tx Power Override is default, and cloaking is off. WEP is enabled using two 104-bit keys, Shared Authentication and Key 1 chosen as the default. The interface is also running the dhcp-autoauth service. We were originally running mesh in the lab and left it enabled to prepare for the other links, but disabled it when we started having problems. The only other services running are ntp and syslog which I only enabled for troubleshooting purposes.
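Two of the settings above can be sanity-checked with a few lines of code. This is only a sketch, not anything from the WAR firmware: the function names `channel_from_mhz` and `is_valid_wep104_hex` are made up for illustration. It relies on the standard 2.4 GHz channel plan (2412 MHz is channel 1, channels are 5 MHz apart, 2484 MHz is channel 14) and on a 104-bit WEP key being 13 bytes, or 26 hex digits.

```python
import string


def channel_from_mhz(freq_mhz):
    """Map a 2.4 GHz centre frequency in MHz to its 802.11b/g channel number."""
    if freq_mhz == 2484:
        return 14  # channel 14 sits outside the regular 5 MHz spacing
    if 2412 <= freq_mhz <= 2472 and (freq_mhz - 2407) % 5 == 0:
        return (freq_mhz - 2407) // 5
    raise ValueError(f"{freq_mhz} MHz is not a 2.4 GHz 802.11 channel centre")


def is_valid_wep104_hex(key):
    """Return True if key is 26 hex digits, i.e. 13 bytes = 104 bits."""
    return len(key) == 26 and all(c in string.hexdigits for c in key)


if __name__ == "__main__":
    print(channel_from_mhz(2412))  # the BSS channel quoted in this post
    print(is_valid_wep104_hex("0123456789abcdef0123456789"))  # placeholder key
```

A check like this is mainly useful when hand-entering the same keys on many client boards, where one dropped digit is easy to miss.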

All wireless clients are WRAP boards running StarOS V2 with CM9s with the lowest signal around -76.

The AP is completely routed and connection tracking is disabled. The WAR is fed by a 10/100 switch that connects to our existing Proxim backhaul. A remote monitoring board that I can't say enough good things about is also connected to the switch (I think these guys have been mentioned on this forum before, but check out www.bndcom.com (http://www.bndcom.com) and their Remote Monitoring System Version 2; it's a lifesaver). A Cisco AP is also in the mix. The WAR is powered by a 4A 12 to 24 V DC to DC converter. I've also tried running it with an inverter and the 800 mA 24V AC to DC adapter Valemount originally shipped with the board.

Now that I've written a book about our configuration, I'll get to what actually happened. The conversion went extremely well and performance was great: we moved 61 clients in stages off of our Cisco AP to the WAR. All links appeared to be stable with no packet loss. Along the way, we ran into a couple of serious issues that would have been major problems if we hadn't had the Remote Monitoring board I mentioned above. After switching about ten clients over and entering their client names in the description field on the client display list, we went to save our changes. The SSH screen showed the BUSY indicator in the corner and then the interface seemed to freeze. We noticed our ping to one of the associated clients had stopped. The WAR had basically become inaccessible from both the wireless and Ethernet interfaces. The only way to restore connectivity was to power cycle the board, which we accomplished using the RMS board referenced above.

Once the board was accessible, we started entering the descriptions again, held our breath, and then saved changes. This time the busy indicator appeared and then went away as expected and the WAR was ready for more. Thinking we had run into a fluke, we converted a few more clients, hit save and the system responded as expected. We continued to add clients, saving every five or so and then ran into the same problem where the WAR completely locked and the only way to recover was to power cycle. The same pattern or lack thereof was observed as we kept converting clients (with fingers crossed) to the WAR. Sometimes the save would process with no problems multiple times in a row and other times it would freeze on two consecutive saves or the first time after a reboot. We encountered the same issue over multiple days with this morning being our most recent encounter with the crash. As long as I don't save changes, the WAR is stable, throughput is good, and everyone's happy. I haven't noticed a lockup when simply activating changes... is there something going on behind the scenes when a save is initiated?
One last thing... after power cycling the board, I've had it freeze a couple of times if I go in too quickly via SSH after it reboots. I'll get the login prompt and enter the username and password, only to have the WAR immediately lock up and stop responding to pings. If I wait a bit (a couple of minutes) after the restart, I can log on. I'm guessing the system is being bombarded with re-associations and DHCP requests, though it still seems that it shouldn't lock up entirely. The syslog doesn't offer any clues.
Things are working now though I'm not saving changes if I can help it.
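The "wait a couple of minutes after the restart" workaround can be sketched as a small helper that only reports the board as ready once it has answered continuously for a settle period. This is an illustration under assumptions: `wait_until_settled` and its parameters are hypothetical names, `is_up` is whatever reachability check you supply (for example a wrapper around a ping), and the 120-second default simply mirrors the couple of minutes mentioned above.

```python
import time


def wait_until_settled(is_up, settle_seconds=120, poll_seconds=5,
                       timeout_seconds=900, sleep=time.sleep):
    """Return True once is_up() has been True for settle_seconds in a row.

    is_up is a caller-supplied function returning True when the board
    answers. Any failed check resets the settle counter. Returns False
    if the board has not settled within timeout_seconds.
    """
    stable = 0
    waited = 0
    while waited < timeout_seconds:
        if is_up():
            stable += poll_seconds
            if stable >= settle_seconds:
                return True
        else:
            stable = 0  # start the settle period over after any miss
        sleep(poll_seconds)
        waited += poll_seconds
    return False
```

Passing `sleep` in as a parameter keeps the helper easy to try out without actually waiting; in real use the default `time.sleep` applies.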

tog
03-29-2006, 04:47 PM
Are you syslogging to a remote syslog server so you have the last syslog message(s) before the lock up?

Do you have another WAR board to try and/or another power supply? What kind of power supply are you using? This could be an odd hardware problem just as easily as it could be buggy new software at the moment...

nelson05
03-29-2006, 05:11 PM
Yeah, I was thinking the same thing. I just ordered five more Quad WARs for our other links with one extra for reserve, so hopefully I will be able to test the flaky hardware theory soon. It just seems weird that everything runs and runs until I save changes (and even then, the save often works fine).

I'm not looking forward to reconfiguring a replacement board. It takes a long time to hand-enter the descriptions for the clients... will starutil be updated soon to be able to download and upload configurations?

It was hidden in my book of a post, but here it is again regarding the power supplies I'm using: I've tried it with a 4A 12 to 24V DC to DC converter and the power supply (24V 800 mA AC to DC adapter) Valemount shipped with the board originally.

Finally, I am remote syslogging... nothing gets sent or logged when save changes is selected.
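For anyone who wants to confirm what, if anything, arrives just before a lockup, here is a minimal sketch of a receiver that keeps the last few syslog datagrams in memory. It is not a replacement for a real syslog server, and nothing here is specific to StarOS: the function names and the buffer size of 50 are assumptions, and UDP port 514 is just the conventional syslog port (binding to it normally needs elevated privileges).

```python
import socket
import time
from collections import deque


def make_buffer(size=50):
    """Create a ring buffer that keeps only the most recent `size` entries."""
    return deque(maxlen=size)


def record(buffer, data, addr):
    """Store one received datagram as (timestamp, sender_ip, text) and return it."""
    text = data.decode("utf-8", errors="replace").strip()
    entry = (time.time(), addr[0], text)
    buffer.append(entry)
    return entry


def listen(host="0.0.0.0", port=514, size=50):
    """Print syslog datagrams as they arrive; return the last `size` on Ctrl-C."""
    buffer = make_buffer(size)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        try:
            while True:
                data, addr = sock.recvfrom(4096)
                print(record(buffer, data, addr))
        except KeyboardInterrupt:
            return list(buffer)
```

Calling `listen()` and comparing the timestamp of the final entry with the time the save was started would show whether the board goes silent before or after the save begins.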

tog
03-29-2006, 05:21 PM
Yeah, sorry I missed that. So you have reproduced the problem with a couple different power supplies...

It is worth trying to reproduce it with another board, too. If you do reproduce it on the new board, you might as well put the old fully configured one back anyway, assuming swapping boards is less hassle than redoing the config.

I have also noticed that the process of saving no longer logs anything like it did in v2.

tony
03-29-2006, 05:29 PM
When saving failed (strange one), do you recall what the CPU load was before you hit save?

greg
03-29-2006, 05:30 PM
I had a two port War do that to me last night. I changed a route, saved and applied changes when it didn't come back up. I waited over 5 mins and got ready to drive to the shop for a power cycle. I could get into the next hop and logged into it. There are 3 CM9's in it (PC Server) and the one that connects to the War was searching for the AP. I applied settings to it et voila, the link came back up. Saved me a trip in to town but I'm wary of applying changes from a remote site.

tony
03-29-2006, 05:35 PM
Greg, what version are you using?

nelson05
03-29-2006, 06:47 PM
I'm not sure what the CPU load was the first time around, but subsequently it has varied from 2 to 10%. After the second freeze, I monitored it very carefully the next time around before clicking save changes as I was also suspicious of the CPU load. I tried to click it when it was around 2% though it would often jump to 8% or so, no more. The board usually hovers around this range, even with all of the clients associated. While I know this isn't what you were after, I took a couple of screen shots of the frozen WAR screen after a few of the crashes. You'll notice the number of clients varies and that I turned off mesh partway through to eliminate it as a possible problem.

http://www.springvillewireless.com/images/WARCrash1.PNG
http://www.springvillewireless.com/images/WARCrash2.PNG
http://www.springvillewireless.com/images/WARCrash3.PNG
http://www.springvillewireless.com/images/WARCrash4.PNG

Greg, my issue is a little different than yours as I can't even ping the device from any interface. No radio link, no ethernet response...nothing. The only thing that fixes it is a remote power cycle. Also, my issue appears to be related to saving changes, not applying for whatever reason (though I did see that you saved and applied settings, so maybe the save caused your issue).

tony
03-30-2006, 07:48 AM
I have been able to duplicate, once, the saving issue shown above. We will work towards correcting it for the next release.

greg
03-30-2006, 09:28 AM
Beta-12

greg
03-30-2006, 09:33 AM
Nelson, I was unable to access the war from my location. I did get into the closest hop (pc server) that connects to the War via radio. That is where I was able to apply changes which brought back up the link. I can't explain why but it worked and saved me a drive in at night. The link had been down at least 10 mins at that point. The CM9 that connects back to the war was scanning for a link when I logged in.

nelson05
03-30-2006, 10:00 AM
Glad that you didn't have to make a trip. My issue was that I couldn't access the WAR from any interface. Activating changes on a client radio, clearing my ARP table on a machine that could access the unit through its wireless interface... nothing short of a power cycle could wake the unit up.


Things have changed a little bit now. After struggling with clients dropping off of the AP (switching to "N" in the client display list) and remotely power cycling the board to try to bring them back up, my Quad WAR finally stopped responding, even after a power cycle. The unit does not appear to respond to pings on the ethernet interface or broadcast anything on any of the wireless interfaces. I wish I could access the console remotely and see how far it is getting when it is starting up. Looks like I'll have to make a trip up the hill and bring the board down to see what is going on in the lab. At least I'll be able to replace the unit and take a flaky piece of hardware out of the equation. However, it will have to be a WAR-2 as my new WAR-4s probably won't make it till next week.

One last bit: the WAR-4 is from the first batch of boards Valemount offered. I put a pre-order in, so it should be one of the first units. Not sure if the hardware has changed a bit since this version. Also, when I was able to get into the unit after the power cycle that followed each save crash, I checked and saw that the settings I had changed were never actually saved. So it appears the unit was not even getting far enough to write the changes to flash before it crashed.

greg
03-30-2006, 01:58 PM
I was flashing the beta 14 upgrade to my 2-port WAR this morning when it died. It took the upload fine but locked up during the flashing process. I waited about 10 mins to see if it would finish, to no avail. I couldn't ping either ethernet IP address. This was only my main router with 50 static routes in it. It impacted 100's of customers and was down for about an hour until I could get it all configured again. Shortened my life to some degree, I'm sure. Sure would be nice to have a config backup and restore very soon! Things have calmed down again and I've been looking at the unit to see if I can access it but it looks to be long gone.

I would have just put my old PC server back up but it didn't like a flash upgrade I did on it last night. All went well but it won't boot. I tried to take it back to an earlier ver but the mac was an old one and I had no record of the key. Lonnie has since provided that but it wasn't an option at the time. I've tried upgrading several PC's over the past couple of weeks and all won't boot after the upgrade. I haven't flashed any in the field yet, just no confidence that they won't roll over too. They all went to 4693, except two, with no problems. Most are 2.4 to customers and I'd like to be able to play with the preamble settings.

tony
03-30-2006, 03:17 PM
The problem has been resolved. Sorry that some of you have been afflicted with it.

This specific problem can hit during a firmware upgrade or a save if your system load is above 15-25% on average (that is, the average before the save or upgrade begins).

When upgrading to beta-15 (due out shortly), please make sure your system is not busy at all. Shape your Atheros cards to a suitable rate to accomplish this if needed.

Update: These precautions need to be taken with all past releases. While the problem does not occur all the time, it's best to be safe.
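The load precaution above can be sketched as a small pre-save check. This is a minimal illustration only and not part of StarOS: the function name `safe_to_save`, the idea of collecting a handful of load readings by hand, and the choice of 15% (the low end of the 15-25% range mentioned above) as the cutoff are all assumptions.

```python
from statistics import mean

# Hypothetical cutoff: the low end of the 15-25% range quoted above.
LOAD_LIMIT_PERCENT = 15.0


def safe_to_save(load_samples, limit=LOAD_LIMIT_PERCENT):
    """Return True if the average of recent CPU load readings (percent) is below limit.

    load_samples is a list of readings noted before saving or upgrading.
    An empty list is treated as "not safe" because there is nothing to
    base the decision on.
    """
    if not load_samples:
        return False
    return mean(load_samples) < limit


if __name__ == "__main__":
    print(safe_to_save([2, 8, 10, 4]))     # light load
    print(safe_to_save([20, 30, 25, 18]))  # busy system
```

Averaging several readings rather than trusting a single glance matches the "on average" wording above, since the displayed load can jump around from moment to moment.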

nelson05
03-31-2006, 05:49 PM
Are we getting close to Beta 15? I was hoping to give it a go this weekend as I am just about to run the replacement WAR up the tower and would rather not have the save issue bite me.

tony
03-31-2006, 05:57 PM
It's on the website, but I have not posted it in the forums yet.

Thanks!

nelson05
04-02-2006, 11:22 PM
Thanks Tony. Beta-15 is on the replacement Dual WAR and everything has been working perfectly since.

Unfortunately, my Quad WAR is in pretty bad shape. When I pulled it down off the tower and fired it up with the console cable, I saw that it had reset to factory defaults. I'm sure the power cycles I put it through after it froze when saving are to blame. I would reconfigure it and it would be no big deal other than the tower climb and service interruption, except that this board's ethernet 1 went out.

I called Lonnie and he had a feeling it was due to static damage and asked if I could get by without the first port. Not great, but not a major problem except that I can't get into and configure it other than setting it back to the defaults through the console. I actually can get to the SSH interface through the terminal and see all of the status information (such as WPCI scanning for an AP in 5.x), but can't activate any of the menu options to configure Ethernet 2's IP.

Do you have any suggestions? I was nervous putting this board up because I was afraid something like this would happen and I would be sunk if it ever reverted to the defaults, but it was my only Quad WAR and I wanted to get the sector fired up. I PMed you with a suggestion to incorporate an ethernet 2 IP of 192.168.2.1 into the factory defaults for the next firmware or activate the DHCP client on one of the other interfaces, but that won't really help me now either.

Thanks.

tony
04-03-2006, 07:59 AM
The factory restore you see is a symptom of the save problem, unfortunately. Before you put it back in service, upgrade it to beta-15 on the bench first, then re-configure it as needed. Having a default IP on the second ethernet port is a good idea, and will be added in the next release.

Thanks!

nelson05
04-03-2006, 10:23 AM
Unfortunately, my problem is that I can't upgrade it to beta 15 because I can't communicate over ethernet 1 and that is the only interface with an IP assigned after the reset. Any suggestions on a way to communicate with the board, or will it have to be sent back to be re-programmed via the JTAG interface?

tony
04-03-2006, 10:54 AM
If Ethernet 1 is the only interface with an IP, and that port is no longer functioning (e.g. no link light), then it may have to be sent back for RMA. We will be able to upgrade the system for you, and place a factory IP on the secondary interface.