An OSPF issue and a Fix
bminish
06-28-2005, 04:46 AM
Some of us have been having some issues with OSPF failing on occasion within 30 minutes or so of a reboot.
Paul and I think we may have found the problem, and we have a workaround.
The problem is caused by big jumps in the system time due to NTP updating the system clock for the first time following a boot.
WRAP boards do not have a battery-backed clock, so they come up from a boot or power cycle with a time and date that is some years in the past.
Zebra (the open source routing agent that StarOS uses to do OSPF) uses system time to keep track of the age of entries in its routing database, and big jumps in system time cause Zebra to get into a state that it cannot always recover from.
This is not a problem that is specific to StarOS. We have confirmed that this is a Zebra issue by recreating the problem with Zebra running under Debian.
Paul (oscarBravo) has brought this issue to the attention of the Zebra developers, so a fix may be a possibility.
For now the fix is to turn off NTP clock setting in StarOS. This prevents the time jump that causes Zebra to break.
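To make the failure mode concrete, here is a toy Python sketch of why ages derived from the wall clock go wrong when the clock is stepped. It is not Zebra's code; the class and function names are made up for illustration, and the two clocks are simulated as plain numbers rather than read from the system.

```python
class DbEntry:
    """Toy routing-database entry that remembers when it was installed."""

    def __init__(self, wall_now, mono_now):
        self.installed_wall = wall_now  # wall-clock timestamp (can be stepped)
        self.installed_mono = mono_now  # monotonic timestamp (only moves forward)

    def age_wall(self, wall_now):
        # Age as seen by code that trusts the system (wall-clock) time.
        return wall_now - self.installed_wall

    def age_mono(self, mono_now):
        # Age as seen by code that uses a jump-free clock.
        return mono_now - self.installed_mono


def simulate_ntp_step(step_seconds, elapsed=10.0):
    """Return (wall_age, mono_age) after `elapsed` real seconds, with the
    wall clock stepped by `step_seconds` somewhere in between."""
    wall, mono = 0.0, 0.0           # both simulated clocks start at zero
    entry = DbEntry(wall, mono)
    wall += elapsed + step_seconds  # real time passes AND the clock is stepped
    mono += elapsed                 # only real time passes
    return entry.age_wall(wall), entry.age_mono(mono)


if __name__ == "__main__":
    several_years = 5 * 365 * 24 * 3600  # a board booting with a date years in the past
    wall_age, mono_age = simulate_ntp_step(several_years)
    print("age by wall clock:", wall_age)
    print("age by monotonic clock:", mono_age)
```

An entry that is really ten seconds old looks years old by the wall clock, and a backward step would give it a negative age. In real Python code the jump-free clock is `time.monotonic()`.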
I hope this helps explain some issues some of you may have been seeing.
.Brendan
bairdc
06-28-2005, 11:55 PM
I wish this were the fix for OSPF in my case. Unfortunately, I don't have NTP running on any of my OSPF boxes, and yet I still see the occasional OSPF weirdness when a box reboots. I sure wish I knew in what way my OSPF setup is different from that of those who aren't having this problem.
In the meantime, running RIP along with OSPF has stabilized things. I just wish I didn't have to resort to that...
Craig
bminish
06-29-2005, 02:33 AM
... and yet I still see the occasional OSPF weirdness when a box reboots.
Craig
What sort of weirdness? What does
sh ip os ne
sh ip os ro
sh ip os da
show when things aren't running smoothly?
Does it recover eventually or do you have to intervene?
What's your network topology?
The NTP date thing was the issue that we were seeing.
.Brendan
bairdc
06-30-2005, 12:56 PM
Occasionally, when a machine reboots, or if I do an "activate changes", one of its neighbors will stop exchanging routes with it. When this happens "sh ip ospf neighbor" will sometimes show:
Full/DR
or:
Full/Backup
and other times it will show:
Init/DRother
Doing a "sh ip ospf route" on the machine that was just rebooted shows routes from everywhere except those coming through its neighbor that is no longer talking. If I do a "sh ip ospf route" on the neighbor that isn't talking, it shows all routes coming from everywhere except the machine that just rebooted.
There are two things that fix this problem. Either I get into the router that is not exchanging routes, and do an "Activate Changes", or I can wait for 30 minutes, at which point, things seem to synchronize, the routers exchange routes, and everything is fine. I'm assuming that after 30 minutes, the routers do an LSA refresh.
Anyway, the interesting thing I've noticed with this is that I've only ever had this happen on Atheros 802.11a interfaces. I've got OSPF running on ethernet, and over Orinoco-based wireless links, as well as Atheros. The only links affected by this problem so far are the Atheros ones.
It makes me wonder if it has something to do with the time Atheros cards take to associate. I've noticed that Atheros 802.11a links sometimes take considerably longer to associate than Orinoco or Prism based 802.11b links. Could it be that OSPF is coming up and trying to establish a link with its neighbor before an association has been established? Just a guess...
Craig
Beebe
06-30-2005, 02:32 PM
I'm also having OSPF troubles. Unlike some people though, I don't have to do anything to mess up my network... It just messes itself up. Sometimes it'll go a day and sometimes it will mess up 5 times a day... and all the problems seem to be with the same backhaul radio.
If I wait half an hour it'll probably just start working again. If I log into a radio, sometimes it just starts working. A few times OSPF has been disabled, and I have to go into the advanced routing menu before it comes back on.
The funny thing is though, it's worked great in all my other 7 wrap boards on the network, smooth sailing, no troubles, but on this one it's where all the problems seem to be. And as far as I can tell the configuration is pretty much the same as all the others. And I've tried all manner of different settings including messing around with the router id to no avail.
It's a simple WRAP board with a single Atheros card in it for backhaul. It's running in client mode.
I have the same hardware configuration here at my house and it gives me no problems.
I think what we really need here is to compare notes. To come up with theories as to what might be the problem and to disprove them.
Maybe as a first step we can list the hardware configuration of the radios giving trouble. It's been mentioned before that the problem appears to be prevalent on Atheros links.
First question: Has anyone seen these problems on a non-Atheros link?
oscarBravo
06-30-2005, 03:10 PM
We were able to produce all sorts of problems on our test network, which consists of virtual PCs in VMware. These PCs have no radio cards as such; we simulated all our radio links with virtual Ethernet links, and still had problems.
oscarBravo
06-30-2005, 03:17 PM
Occasionally, when a machine reboots, or if I do an "activate changes", one of its neighbors will stop exchanging routes with it. When this happens "sh ip ospf neighbor" will sometimes show:
Full/DR
or:
Full/Backup
Those are cool.
and other times it will show:
Init/DRother
That's not good, if it lasts longer than 10 seconds (or whatever your "hello" interval is). "DRother" just means that the router is neither the designated nor the backup router for that link. "Init" means a hello has been seen from the neighbour but two-way communication hasn't been established yet, so it shouldn't stay that way for long - it should move on through "ExStart" and then to "Full".
I'm assuming that after 30 minutes, the routers do an LSA refresh.
Think so. 30 minutes is the LSA refresh time (LSRefreshTime); MaxAge is an hour.
It makes me wonder if it has something to do with the time Atheros cards take to associate. I've noticed that Atheros 802.11a links sometimes take considerably longer to associate than Orinoco or Prism based 802.11b links. Could it be that OSPF is coming up and trying to establish a link with its neighbor before an association has been established? Just a guess...
Does it take longer than 40 seconds (assuming that's your dead interval)? When a router comes up, if it has neighbours explicitly specified, it will be in "Attempt/DRother" state until it makes contact with the neighbour, or until the dead interval has passed - whichever comes first. Even at that, it should establish contact as soon as it associates.
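As a rough illustration of the rule of thumb above, the Python sketch below flags a neighbour that has sat in a transitional state for longer than the dead interval. The function names are made up for this example, the 40-second default is only the dead interval assumed in this thread, and the state strings are the "State/Role" form quoted above. It is a sketch of the reasoning, not of anything Zebra does.

```python
# Neighbour states in the order an adjacency normally progresses (RFC 2328).
NEIGHBOR_STATES = [
    "Down", "Attempt", "Init", "2-Way",
    "ExStart", "Exchange", "Loading", "Full",
]

# States a healthy neighbour may legitimately stay in indefinitely.
STABLE_STATES = {"2-Way", "Full"}

DEAD_INTERVAL = 40  # seconds; assumed value, use whatever your routers are set to


def split_state(field):
    """Split a 'State/Role' string such as 'Init/DRother' into its two parts."""
    state, _, role = field.partition("/")
    return state, role


def looks_stuck(state, seconds_in_state, dead_interval=DEAD_INTERVAL):
    """Return True if a neighbour has lingered in a transitional state
    for longer than the dead interval."""
    if state not in NEIGHBOR_STATES:
        raise ValueError("unknown neighbour state: %r" % state)
    if state in STABLE_STATES:
        return False
    return seconds_in_state > dead_interval


if __name__ == "__main__":
    state, role = split_state("Init/DRother")
    print(state, role, looks_stuck(state, seconds_in_state=120))
```

Note that this only catches the "Init" case. As reported above, a neighbour can show "Full" and still not be exchanging routes, so a check like this cannot rule a problem out.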
Beebe
06-30-2005, 03:42 PM
Well, I just noticed the router next to the one I thought was giving problems has rebooted itself the last two times the problem occurred, at least judging by its uptime. Its uptime matches the time in the call log on my cell phone from the customer who told me it was down.
Will keep an eye on this to see if it happens every time.
bairdc
07-01-2005, 10:36 AM
Those are cool.
Well, they're cool if OSPF is working. They're not cool if it's not. :-) I realize you're saying that this means OSPF is up and running. However, that's not always the case.
Does it take longer than 40 seconds (assuming that's your dead interval)? When a router comes up, if it has neighbours explicitly specified, it will be in "Attempt/DRother" state until it makes contact with the neighbour, or until the dead interval has passed - whichever comes first. Even at that, it should establish contact as soon as it associates.
I don't think it usually takes anywhere close to 40 seconds. However, I'm not always around to watch when a router reboots, so I can't say for sure.
bairdc
07-01-2005, 10:44 AM
The funny thing is though, it's worked great in all my other 7 wrap boards on the network, smooth sailing, no troubles, but on this one it's where all the problems seem to be. And as far as I can tell the configuration is pretty much the same as all the others. And I've tried all manner of different settings including messing around with the router id to no avail.
I have one suggestion on this. Interestingly, I have one Atheros link that will not do multicasts. It's very strange. If I tcpdump both sides, I can see the multicasts being transmitted on one side, but the other side never hears them. This doesn't happen on any of my other Atheros links. Anyway, because of this strange behavior I have all my neighbors specified in both OSPF and RIP. Things work fine on this link so long as I specify neighbors. If you're not already specifying your neighbors, maybe give that a try on the box that is acting up, as well as its neighbors.
Craig
Beebe
07-02-2005, 08:54 AM
I have all the wireless neighbors specified since they are set to non-broadcast. The Ethernet links are broadcast and each router discovers its own neighbors...
Now here's something interesting...
Yesterday my network looked like this...
client1.........ap1____NOC____ap2............client2
......=atheros link
____=ethernet link.
Before, OSPF would have its problems on client2 - that's where OSPF would turn itself off etc. So I reversed the roles, made ap2 into client2 and client2 into ap2. Now the problem has moved to ap1. ap1 now keeps turning off OSPF, whereas before it had been solid ever since I turned on OSPF.
Does this help anyone come up with any theories? Can anyone duplicate this?
Thanks,
Roger