OSPF issues with Activate Settings


lonnie
03-23-2006, 02:15 PM
A number of people have been reporting an issue with activations in that OSPF will lose track of devices and generally do bad things to the network.

We have identified a new way to activate that might make this less of an issue. Our current activation is the way it is because of PCMCIA devices which are totally dynamic. We had to rip the network down and rebuild it as the only way to ensure PCMCIA devices were properly configured. Unfortunately this seems to be bad for OSPF and RIP which like to be left alone and running.

We are testing a new activate method that will not affect routes or IPs unless you change a route or IP. The Quagga programs are no longer restarted, which means they will not make any down/up announcements. I cannot say whether this will affect anything later on, but it should help with the problem as identified.
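As a rough illustration of the "only touch what changed" idea, here is a minimal sketch that compares the running settings with the edited settings and plans steps only for the IPs and routes that differ. The config shape, the function names and the sample addresses are all assumptions made up for the example; this is not how StarOS actually stores or applies its settings.

```python
from typing import Dict, List, Tuple

# Hypothetical config shape, not the real StarOS format:
# {"ips": {interface: address}, "routes": {destination: gateway}}
Section = Dict[str, str]
Step = Tuple[str, str, str]


def diff_section(old: Section, new: Section) -> List[Step]:
    """List the (action, key, value) steps that turn `old` into `new`."""
    steps: List[Step] = []
    for key in sorted(old.keys() - new.keys()):
        steps.append(("remove", key, old[key]))
    for key in sorted(new.keys() - old.keys()):
        steps.append(("add", key, new[key]))
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            steps.append(("change", key, new[key]))
    return steps


def plan_activation(running: Dict[str, Section],
                    edited: Dict[str, Section]) -> Dict[str, List[Step]]:
    """Plan only the IP and route changes; untouched entries yield no steps."""
    return {
        name: diff_section(running.get(name, {}), edited.get(name, {}))
        for name in ("ips", "routes")
    }


# Made-up example: only the wlan0 address was edited.
running = {
    "ips": {"eth0": "10.0.0.1/24", "wlan0": "10.0.1.1/24"},
    "routes": {"0.0.0.0/0": "10.0.0.254"},
}
edited = {
    "ips": {"eth0": "10.0.0.1/24", "wlan0": "10.0.2.1/24"},
    "routes": {"0.0.0.0/0": "10.0.0.254"},
}
print(plan_activation(running, edited))
```

With this input the plan contains a single change for wlan0 and no route steps, so nothing else on the box would need to be torn down.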

Expect a new beta by the end of this week.

Equis
03-23-2006, 02:35 PM
Cool, Thanks :-)

oscarBravo
03-23-2006, 04:18 PM
That's good news. I never understood why activating changes tore everything down; now it's clearer.

Looking forward to the new approach.

That said, 2.11.0 is still unusable for us because of the WPA-PSK issue, so if the new beta is built incrementally on it, we won't even be in a position to test it unless that's fixed. :(

bminish
03-23-2006, 04:22 PM
we won't even be in a position to test it unless that's fixed. :(
On client-facing AP nodes, that is. Few of these run OSPF.

.brendan

tony
03-23-2006, 05:10 PM
That's good news. I never understood why activating changes tore everything down; now it's clearer.

Looking forward to the new approach.

That said, 2.11.0 is still unusable for us because of the WPA-PSK issue, so if the new beta is built incrementally on it, we won't even be in a position to test it unless that's fixed. :(

Can you PM me your WPA-PSK settings, including the last known version in which it worked? All my tests with WPA-PSK seem to work well, so I would like to try to duplicate your setup as closely as possible.

Thanks!

lonnie
03-24-2006, 08:38 AM
The v2 release will be delayed until next week. We still have to deal with PCMCIA because it is supported and people use it. Doing the new activate sequence with V3 was easy and was released yesterday but v2 will take more time.

tony
03-24-2006, 08:42 AM
This will also give us time to look more closely into the WPA-PSK issue.

Beebe
03-24-2006, 09:07 PM
I think maybe you are barking up the wrong tree. My OSPF problems often occur when I'm not messing with anything: routes just disappear and part of the network drops offline. Often it's an activate changes that fixes it, not breaks it. And when it breaks in this way, I can log into the interface on my side of the radio with the downed link and then ssh into the next radio, so the link is actually working, but for some reason OSPF is not passing on its routes.

I have noticed that sometimes when this happens, I can ping through the network to my side of a radio, but I cannot ping the IP on another interface on that same radio. OSPF seems to lose all knowledge of a subnet on its own radio.

Anyone concur that this is the problem?

Thanks,
Roger

lonnie
03-24-2006, 11:24 PM
What version are you using? What are the radio cards?

bradg
03-24-2006, 11:32 PM
Anyone concur that this is the problem?

I can confirm this behaviour and resolution as well - again. I think I've posted this several times over the last 6-9 months. Using 2.10.1b5 on every OSPF-enabled WRAP, CM9s for every radio link.

bminish
03-25-2006, 01:49 AM
Anyone concur that this is the problem?


Yes, this is all the same problem. If you go into OSPF when it's in this state, you will see that S I O R shows no awareness of the locally attached interface that is problematic.
I have only ever seen this happen on Atheros links.
It is common for only one interface to be missing; the other interfaces will be present and fully routable.
It happens following an 'apply changes' or following a break in the link.
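For anyone wanting to spot this state automatically, a minimal sketch of the check might look like the following: take the addresses configured on the local interfaces and flag any whose connected subnet is absent from the prefixes OSPF currently knows about. How the two inputs are collected is left out on purpose, since the thread does not show the output format; the function name and the sample addresses are made up for illustration.

```python
import ipaddress
from typing import Dict, Iterable


def missing_connected_subnets(interface_addrs: Dict[str, str],
                              ospf_prefixes: Iterable[str]) -> Dict[str, str]:
    """Return {interface: subnet} for local subnets OSPF has no route for.

    interface_addrs: e.g. {"ath0": "10.0.1.1/24"} (hypothetical input)
    ospf_prefixes:   prefixes listed in the OSPF route table, as strings
    """
    known = {ipaddress.ip_network(p) for p in ospf_prefixes}
    missing = {}
    for iface, addr in interface_addrs.items():
        subnet = ipaddress.ip_interface(addr).network
        if subnet not in known:
            missing[iface] = str(subnet)
    return missing


# Made-up example: OSPF knows eth0's subnet but has lost ath0's.
local = {"eth0": "10.0.0.1/24", "ath0": "10.0.1.1/24"}
ospf = ["10.0.0.0/24", "10.0.5.0/24"]
print(missing_connected_subnets(local, ospf))
```

A non-empty result would match the symptom described above: the interface is up and addressed, but OSPF shows no awareness of it.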

Usually SSHing across to the neighbour of the affected node and applying changes will bring things back up, but this may in turn break its neighbours.
Restarting OSPF does not seem to have the same effect.
Lonnie, does restarting OSPF only restart the ospfd daemon, or does it also restart zebra and watchquagga?

Lonnie's proposed change to allow changes to be applied sounds like a good one in general, but can he also alter things slightly so that ALL the Quagga daemons are restarted if OSPF is restarted? This should allow us to kick a node back into action much faster than having to do a full reboot when OSPF goes funny.
This could perhaps be made conditional on no other routing daemons such as BGP or RIP being active at the time (last one off stops the other daemons)?
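The "last one off stops the other daemons" rule could be sketched roughly as below. The daemon names are the usual Quagga ones (zebra, ospfd, ripd, bgpd, watchquagga), but the function, the idea of passing in a set of enabled daemons, and the order of the returned list are assumptions for illustration only, not a description of what StarOS does.

```python
from typing import List, Set

ROUTING_DAEMONS = {"ospfd", "ripd", "bgpd"}


def restart_plan(requested: str, enabled: Set[str]) -> List[str]:
    """Decide which daemons to restart when `requested` is restarted.

    If another routing daemon is still enabled, leave zebra and
    watchquagga alone so that daemon is not disturbed. If `requested`
    is the only routing daemon, restart the whole set.
    """
    others = (enabled & ROUTING_DAEMONS) - {requested}
    if others:
        return [requested]
    # Assumed ordering: zebra first, then the routing daemon,
    # then the watchdog that supervises them.
    return ["zebra", requested, "watchquagga"]


print(restart_plan("ospfd", {"ospfd"}))
print(restart_plan("ospfd", {"ospfd", "bgpd"}))
```

The first call plans a full restart because OSPF is the only routing protocol in use; the second restarts only ospfd because BGP is also active.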

.brendan

bminish
03-25-2006, 01:57 AM
Just a thought.

Lonnie, if you are making the proposed change to 'activate changes', perhaps adding an option to do a 'warm init', which would do what 'activate changes' does in the current version, would be a good idea?


.brendan

tony
03-25-2006, 09:49 AM
This is something we can look into.

Thanks!

Beebe
03-25-2006, 06:35 PM
I'm using all Atheros on my backhauls, StarOS 2.10.0 (4693), all on WRAP boards.


lonnie
04-15-2006, 10:34 AM
The incremental activate v2 has been scrapped. The reason is simple: we must still take into account the PCMCIA interface, and it requires the rebuild-all for a proper activate.

Tony has tried many different things, but there are too many issues to deal with. In reality, this is the sort of thing that led us to do the ground-up rebuild for V3. The v2 architecture has too many things that force the process and make changes and additions difficult.

bradg
04-15-2006, 12:24 PM
How should we interpret this report, Lonnie?

The question that I need answered is whether the reported OSPF issues will get some real attention, work and debugging toward finding the fix and, with some luck, work properly in v2:

1) working on it right now
2) in the very near future
3) some indeterminate date in the future
4) never
5) none of the above

I can report to you that 2.11.0 did nothing to help or hurt the OSPF issues, if that's of any value. According to the changelog, nothing approaching OSPF was touched, so I didn't expect it to do anything really.


Brad

tony
04-15-2006, 12:36 PM
The OSPF fix / solution is related to the way the system activates changes. At the moment, in V2, the OSPF daemon is restarted each time the system is activated. Once in a while, if the Atheros or Prism cards have not finished associating before OSPF sends its initial 'hello', it will sit in limbo for a little while before attempting to do so again. Of course, the rest of the network notices that the system is no longer 'available', which causes larger network issues.
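To make the timing issue concrete, here is a minimal sketch of one way to avoid the race: poll the wireless interfaces until they report as associated, or a timeout expires, before the routing daemon is started. The `is_associated` callback is hypothetical and stands in for whatever the platform offers to query link state; the timeout and poll values are arbitrary. This is only an illustration of the idea, not what StarOS or Quagga actually do.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_links(interfaces: Iterable[str],
                   is_associated: Callable[[str], bool],
                   timeout_s: float = 30.0,
                   poll_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> Set[str]:
    """Wait until every interface is associated or the timeout expires.

    Returns the interfaces that are still not associated when we stop
    waiting; an empty set means it is safe to start the daemon.
    """
    pending = set(interfaces)
    deadline = time.monotonic() + timeout_s
    while pending:
        pending = {i for i in pending if not is_associated(i)}
        if not pending or time.monotonic() >= deadline:
            break
        sleep(poll_s)
    return pending


# Made-up usage: pretend ath0 associates on the third poll.
polls = {"ath0": 0}


def fake_is_associated(iface: str) -> bool:
    polls[iface] += 1
    return polls[iface] >= 3


still_down = wait_for_links(["ath0"], fake_is_associated, poll_s=0.0)
print("start ospfd" if not still_down else f"links still down: {still_down}")
```

The caller decides what to do on timeout; starting the daemon anyway and logging the interfaces that were not ready would be one option.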

The v3 technique for activation is incremental, and only restarts the services or options that need to be handled at that time. OSPF and the other routing daemons are not restarted and automatically pick up on any system changes.

We were hoping to bring this activation mechanism to V2, however the system design has provided a large roadblock that would require a massive redesign.

The 2.11.0 release resolves several outstanding issues, but nothing relating to OSPF unfortunately.

tog
04-15-2006, 01:08 PM
Is it possible that rebooting rather than activating changes in v2 would be a reliable workaround for those running OSPF? StarOS does not take that long to reboot.

Or does this described OSPF problem occur sometimes during initial startup as well?

tony
04-15-2006, 01:28 PM
With wireless, there is always a chance that the clients have not fully associated by the time the OSPF services are started. On bootup, there is a longer period of time between the wireless card activation and the point where OSPF is started, giving the clients enough time to associate.

lonnie
04-15-2006, 04:26 PM
The way to interpret it is that we are not able to do the incremental activate as we had hoped. The issue with OSPF timing out before stations associate is one that Quagga should address. OSPF should keep trying on a device that is listed as being used but is not alive when it does its checks. As soon as the device comes live OSPF should be able to work.

We are not OSPF experts and we never claimed to be. We offer a third party GPL module and we have to leave it up to them for maintenance and updates. I am sure they are aware of the issue but like any software project they have to find the time to address it.

bminish
04-15-2006, 05:01 PM
The way to interpret it is that we are not able to do the incremental activate as we had hoped. The issue with OSPF timing out before stations associate is one that Quagga should address. OSPF should keep trying on a device that is listed as being used but is not alive when it does its checks. As soon as the device comes live OSPF should be able to work.

Have you actually identified the bug as being in this area?
If not, then this is just your hypothesis, Lonnie.


We are not OSPF experts and we never claimed to be. We offer a third party GPL module and we have to leave it up to them for maintenance and updates. I am sure they are aware of the issue but like any software project they have to find the time to address it.


I have not seen this behaviour in OSPF using Quagga on any platform other than StarOS. I have done extensive testing in a number of test environments, including scenarios where the interfaces are not up before Quagga starts (it recovers just fine every time).
I run Quagga on a number of platforms other than StarOS.
If you truly have identified a bug, then file your bug report with the Quagga developers.
If you haven't identified a bug, then stop blaming everyone else. You are selling us a product that claims to have OSPF support.

Tony, I offered to set you up with a test environment that would demonstrate the problem that we are seeing if you supplied a few WRAPs with Atheros links. You never took me up on the offer.

.brendan

lonnie
04-15-2006, 07:05 PM
Tony cannot do the WRAP thing because we do not have any WRAP boards in our backhaul any longer. It is all WAR boards. All of our WRAP boards are at client sites.

My idea is just a guess. I have nothing but theory based on what we have heard. We attempted a fix but are not able to do anything with v2. V3 was done and it makes an activate real quick. Nobody seems to be using WAR boards.

I guess all I can do is wish everybody a Happy Easter.


tog
04-16-2006, 03:05 AM
I am very much looking forward to being able to use v3 when my last remaining showstopper bug is resolved. I have a big box of WAR boards. Judging from your brisk sales, I'm not the only one with a big box of WAR boards! They must have gone somewhere :)


lonnie
04-16-2006, 11:03 AM
Sorry, but I am not aware of any remaining show stopper bugs with the WAR V3 software. The latest release, beta16, has been solid. I updated our backbone 4 days ago and all machines have not skipped a beat. One machine has three 5 GHz feeds and one 2.4 GHz AP. Another has a 5 GHz feed and three 2.4 GHz AP for customer use.

Mostly people are using the WAR boards and obtaining the performance improvements. We regret there are some missing features but for the most part what we do have works well. I particularly like the mesh routing.

Sales have been very brisk, and we are probably going to be sold out of the duals before they even arrive here. The next stock will be in the middle of May.



tog
04-16-2006, 05:04 PM
I have been 100% concentrating on v3's ability to act as my 2.4GHz PtMP client-bearing access points since I believe that's the most difficult job for the driver. My only "in production" v3 setup is a WAR2 with a single CM9 in it handling the same clients that v2 was handling perfectly (aside from reboots every 5 - 15 days) with even the same CM9 the WRAP was using.

I have had what I think are three separate showstopper issues with v3 "freaking out" when it is just left alone to do its job and tony has fixed two out of three over the last few weeks and it's much much better than when we started.

The first and second problems were definitely the worst and they seem to be fixed so my v3 AP works much better than it did.

The last problem I am seeing basically results in my needing to reboot the AP manually every 4 - 24 hours, whenever it starts pinging like 3, 3, 1000, 2000, drop, 2000, drop, 2000, 1000, 1000, 3, 3, 1000, 1000... It will stay like that for as long as I leave it; as soon as it is rebooted it is perfectly happy again. It does not drop existing associations or stop accepting new associations, but all the clients on it can only get about 10 - 20Kbit of throughput, with a minimum of 10 - 20% packet loss, when it gets like that. It's definitely not sudden interference or high traffic: interbss relay is off of course, and traffic is usually very quiet when it gets into that state, below 50kbit.
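A symptom like this can at least be detected automatically rather than by eye. The sketch below classifies a window of ping results as degraded when either the loss ratio or the median round-trip time crosses a threshold. The 10% loss figure comes from the description above; the 500 ms latency threshold, the function name and the data layout (round-trip times in ms, with None for a drop) are assumptions chosen for the example.

```python
from statistics import median
from typing import List, Optional


def looks_degraded(samples: List[Optional[float]],
                   latency_ms: float = 500.0,
                   loss_ratio: float = 0.10) -> bool:
    """Return True if a window of pings shows heavy loss or high latency.

    samples: round-trip times in milliseconds, None for a dropped ping.
    """
    if not samples:
        return False
    replies = [s for s in samples if s is not None]
    loss = 1 - len(replies) / len(samples)
    if loss >= loss_ratio:
        return True
    return bool(replies) and median(replies) >= latency_ms


# The ping pattern quoted above, with drops recorded as None.
bad_window = [3, 3, 1000, 2000, None, 2000, None, 2000,
              1000, 1000, 3, 3, 1000, 1000]
good_window = [3, 3, 4, 3, 3, 5]
print(looks_degraded(bad_window), looks_degraded(good_window))
```

A check like this only flags the state; whether to alert or reboot the AP in response is a separate decision.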

The rest of the time when it has not yet reached that state it works great.

I feel this is a showstopper for me because I'm afraid if I put 20 of these up now I will spend all day and all night manually rebooting the 2.4GHz client-bearing APs and I'll never get any sleep.

Given the progress made over the last few weeks between beta 1 and 16, I am feeling pretty good about v3 and the WAR platform. But, as I said, with this one final problem plaguing me still, I'm not ready to deploy it everywhere.

Feature-wise, all I absolutely *NEED* is WEP and a dhcp server and I have both!


lonnie
04-16-2006, 05:49 PM
Have you tried to disable power saving on the AP? All it takes is a single out of sync client to ask for power saving and not come back and announce it is ready. This depletes ram resources big time and will crash the AP.


tog
04-16-2006, 06:21 PM
Have you tried to disable power saving on the AP? All it takes is a single out of sync client to ask for power saving and not come back and announce it is ready. This depletes ram resources big time and will crash the AP.

That was problem 1 of 3 that was fixed :)
(In other words, yes, I've disabled power saving)
I've never crashed v3 before, only made it act funny.

The general oddities tony was reproducing when he blasted the lab v3 AP with interference was problem 2 of 3 fixed.

lonnie
04-16-2006, 08:12 PM
Why not try it at one more site? If it does not act the same, then you have a clue that it is site related. Then the task will be to see what is different at that site.


tog
04-17-2006, 12:52 AM
Why not try it at one more site? If it does not act the same, then you have a clue that it is site related. Then the task will be to see what is different at that site.

I'm going to do that. The next site will be worse than this one. More 2.4GHz noise exists in the area, twice as many clients, two 2.4GHz APs on one system.

lonnie
04-17-2006, 12:30 PM
You know, if you have that much noise you would be much better off to use the X2 or even X4 cloaking.

tog
04-17-2006, 01:14 PM
Of course I would, but I've got 40+ clients on this AP currently incapable of cloaking. I think about half of them are WRAP boards running v2, the other half will never be capable of cloaking. When I have a new generation of cloaking-capable cheapie CPE, I will actually run around and swap some existing non-StarOS CPE out...

When v3/x86 is available I will probably update a lot of the WRAPs to v3 and when they're all ready, convert one of those two APs to 2X cloaking. That will be nice. With QoS, selling voip around there will be a snap. Everybody's already beating down my door wanting my voip but I am having to tell them to wait.

I am very much looking forward to using cloaking, but it will take time to make the transition. I haven't tested it outdoors yet, but from what I can see I will be able to put those APs in auto rate (yay!) and clients should be able to get up to 1500K/sec using 2X depending on how good their path/signal is. This would be a huge improvement over my 11mbit rate APs which can do about 600K/sec under perfect conditions and use giant sloppy chunks of spectrum.

The general 2.4GHz noise doesn't really faze StarOS v2. Performance when forced to an 11mbit transmit rate at the AP is excellent, and general background noise doesn't bother it. The only thing that has any noticeable effect on the performance is the kind of interference that screams in your ear louder than your clients and actually affects everybody's displayed signal strength.

bairdc
04-18-2006, 12:27 PM
Putting the thread back on topic...

The incremental activate v2 has been scrapped.

Would it be possible to have association trigger a restart of Quagga? In other words, if association happens on any wireless client interface, at any time, simply restart Quagga. Maybe I'm over-simplifying things, but it seems to me like this would work.

Craig

bminish
04-18-2006, 01:25 PM
It would not work. Restarting Quagga from the interface does not, as a rule, resolve this issue. You have to use apply changes, often on the neighbouring node.
At best, the proposed change was going to be a partial workaround to avoid breaking OSPF for those of us who need to apply changes from time to time due to network administration issues.
At worst, it was going to remove the only remaining tool we had, short of a full reboot, to get things going again.
It was not going to fix the underlying problem with OSPF in StarOS.

.brendan