OSPF Reliability Issues
mmc1800
05-09-2005, 07:47 PM
Hi,
OSPF randomly turns itself off on my StarOS boxes from time to time, and I am at a total loss as to what could possibly be the problem. What happens is that after I get a notification that a StarOS AP has lost connectivity, I log into the StarOS box with SSH and it shows the OSPF service stopped in the service listing. I open the "Advanced Routing" dialog, and the check mark is already selected for OSPF to be on but the service shows "Stopped" and as soon as I open the dialog the service turns itself back on without me clicking anything.
All the boxes are currently running Build 4619, but they had the same behavior before I made the upgrade. I was hoping beyond hope the new Zebra in 4619 would fix this, but it hasn't so I thought I would reach out for some advice. I also have them talking to Zebra (actually Quagga) running on Linux machines, but the Zebra on the linux machines has been 100% solid never needing any attention from the day I turned it on.
Currently I have 12 StarOS WRAP APs making up the backbone of my network, all running OSPF. They have no default routes (except the boxes that run HotSpot and are forced to have a default route in place), and they behave perfectly as long as OSPF stays turned on. The problem is that sometimes, for unknown reasons, the OSPF service stops on some of the boxes.
To add to the problem, it seems to spread from one to another, for example, one StarOS box will drop OSPF and then suddenly one adjacent will turn off OSPF etc. until there are 4 or 5 boxes all with no OSPF and half my network is down. Sometimes the service turns itself off as above so I have to log in and open the Advanced routing box so it just starts itself, or sometimes OSPF will stay on, but it will refuse to reconnect to its neighbors that have had the above problem and I need to reset the boxes completely.
Generally a run around the network via direct hop SSHs and rebooting will bring everything back online, but this is happening at least every couple days (sometimes twice a day) and it is very annoying.
Rather than post all of my configs I will post one that is pretty typical.
ospfd# show running-config
Current configuration:
!
hostname ospfd
password ####
!
!
!
interface eth0
!
interface lo
!
interface tunl0
!
interface gre0
!
interface wpci0
 ip ospf network non-broadcast
 ip ospf cost 54
!
interface wpci1
 ip ospf network non-broadcast
 ip ospf cost 54
!
interface ecb
!
interface ipacct
!
interface beacon
!
interface wlanbr
!
interface br0
!
interface cbq
!
router ospf
 network 10.0.0.0/24 area 0.0.0.0
 network 10.100.0.0/30 area 0.0.0.0
 network 10.100.0.12/30 area 0.0.0.0
 network 10.100.0.32/30 area 0.0.0.0
 network 192.168.0.0/24 area 0.0.0.0
 neighbor 10.100.0.2
 neighbor 10.100.0.14
 neighbor 10.100.0.34
!
access-list vtylist permit 127.0.0.1/32
access-list vtylist deny any
!
line vty
 access-class vtylist
!
end
I use /30 subnets to make the PTP connection from one router to the next (this one peers with 3 other StarOS boxes over wireless and 3 other OSPF routers via broadcast over eth0).
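For anyone comparing their own config against this one, here is a rough Python sketch (not part of StarOS or Zebra, just a helper I am assuming you would run on a workstation) that checks each `neighbor` line falls inside one of the `network` statements and works out the local end of each /30 link. The addresses are copied from the config above; the function names are made up for illustration.

```python
import ipaddress

# Values copied from the running-config posted above.
NETWORKS = [
    "10.0.0.0/24",
    "10.100.0.0/30",
    "10.100.0.12/30",
    "10.100.0.32/30",
    "192.168.0.0/24",
]
NEIGHBORS = ["10.100.0.2", "10.100.0.14", "10.100.0.34"]


def covering_network(neighbor, networks):
    """Return the first network statement that contains the neighbor, or None."""
    addr = ipaddress.ip_address(neighbor)
    for net in networks:
        if addr in ipaddress.ip_network(net):
            return net
    return None


def local_peer(neighbor, network):
    """On a two-host (/30) link, return the other usable address, else None."""
    hosts = [str(h) for h in ipaddress.ip_network(network).hosts()]
    others = [h for h in hosts if h != neighbor]
    return others[0] if len(others) == 1 else None


if __name__ == "__main__":
    for n in NEIGHBORS:
        net = covering_network(n, NETWORKS)
        print(n, "->", net, "local end:", local_peer(n, net) if net else None)
```

A neighbor that comes back with `None` for its network would be one that no `network` statement covers, which is worth ruling out before chasing anything stranger.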
Any ideas or experience would be greatly appreciated.
mmc1800
05-12-2005, 12:26 PM
Just another little piece of information on this: the problem seems to happen when a connected OSPF box is reset. I have not done a lot of testing on this, and we have not been keeping logs, but we have been running OSPF for a few months, and thinking about it, I am pretty sure this holds true most of the time.
As long as we are 100% green, OSPF seems to stay very solid. It is when a connected OSPF router is reset that the OSPF service will stop and need to be manually restarted. Of course, it is unfortunate that when radios are resetting and dropping out is exactly when we need OSPF the most.
bairdc
05-16-2005, 10:51 AM
Yes, from what I have seen, OSPF works great so long as the network is stable. As soon as the network starts to have issues, OSPF falls to pieces about 60% of the time.
I personally have seen issues similar to yours, but not exactly the same. I haven't seen any issues with OSPF turning off. However, I have had problems with having to either reboot boxes or manually turn OSPF off, then back on again before the machine will start exchanging routes with its neighbors. Like you said, every time this happens, it's after a network "event". For example, I have some WRAP boards that will occasionally reboot for no apparent reason. When they reboot, OSPF will often times stop working. The strange thing is that it's not the OSPF box that rebooted that seems to have the trouble. It's always a box one, two, or three hops away from the box that rebooted. In every case I've seen, logging into the dead OSPF box, and stopping OSPF, then starting it again fixes the problem.
Also, one other interesting thing that I've noticed: I've only ever had this happen on machines with Atheros Interfaces. I've got OSPF running on machines with ethernet, Orinoco, Prism, and Atheros interfaces, and the only boxes this ever happens on are the Atheros ones. All the rest have been rock solid.
Anyway, due to all the trouble I've seen with OSPF, I'm going to be switching to RIPv2. It's not an option I really like, but it appears to be the only one at the present time if I want something that works properly.
Craig
mmc1800
05-16-2005, 03:57 PM
All of my OSPF enabled wraps are Atheros, so I have nothing to compare it with.
RIPv2 yuck.. I guess I am going to have to make that change also on my StarOS wraps which is a total drag. On a few though I really need OSPF features that RIP just doesn't have which was one of the primary reasons why I am using StarOS in the first place.
I am probably going to have to go to a non-StarOS product that will run on my WRAPs and let me install a stable version of Quagga that will allow for equal-cost multipath load balancing and failover. This is really not something I am looking forward to, as it puts us back quite a chunk of time on our rollout plans. StarOS really should not advertise itself as an OSPF platform if we are not going to get at least basic OSPF functionality working and stable. There seems to be some sort of institutional denial regarding this issue, as it has not been officially accepted as a known issue as far as I can see from the boards here. Nothing personal, but if it isn't working they should at least let people know if they are not going to fix it.
For 2 or 3 redundant connections, with failover features and load balancing when they are all working, OSPF is the only ticket in town as far as I can tell. RIP just does not support this in any way I can find documented.
mmc1800
05-17-2005, 03:31 PM
I made some configuration changes last night that seem to have had a positive effect on OSPF. It is a little early to tell if it is really the magic bullet, but I have not had any of my typical OSPF failures, and I have been resetting radios on purpose to try to break it, which has always brought OSPF down on the neighbors in the past.
Oddly enough, hard coding the OSPF router-id to the values that were showing up in the "show ip ospf neighbor" list when no router-id was defined seems to have helped. I was having some problems with the OSPF router-id early on that made some of the radios not connect with OSPF at all, so I removed the router-id from the configs, but I was using a different set of IPs for the IDs than what shows up when there is none defined.
It seems not to be following the OSPF standard for which IP it picks as the router ID, but I just logged into a neighbor radio, went into the OSPF configuration, did a "show ip ospf neighbor", and used the router-id that was showing up in the list for every radio when there was no router-id defined. Since then everything seems to be better.
Neighbor ID     Pri   State           Dead Time   Address         Interface            RXmtL RqstL DBsmL
10.200.2.1        1   Full/Backup     00:00:32    10.100.0.41     wpci1:10.100.0.42        0     0     0
^^^^^ used this address where before I was usually ^^^^^^^ using this address
These are the addresses that showed up when there was no router-id defined on the neighbor.
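If you want to pull those two columns out for every radio rather than eyeballing them, something like this Python sketch would do it. It assumes the whitespace-separated column order shown in the output above, and it assumes the part after the colon in the Interface column is the local interface address; the function name is made up.

```python
def parse_neighbor_line(line):
    """Split one row of the neighbor table into named fields.

    Assumes the column order shown above:
    Neighbor ID, Pri, State, Dead Time, Address, Interface, RXmtL, RqstL, DBsmL.
    """
    fields = line.split()
    # The Interface column looks like "wpci1:10.100.0.42".
    ifname, _, local_ip = fields[5].partition(":")
    return {
        "neighbor_id": fields[0],  # the router-id the neighbor is announcing
        "priority": int(fields[1]),
        "state": fields[2],
        "dead_time": fields[3],
        "address": fields[4],      # the neighbor's address on the shared link
        "interface": ifname,
        "local_ip": local_ip,
    }


if __name__ == "__main__":
    row = parse_neighbor_line(
        "10.200.2.1 1 Full/Backup 00:00:32 10.100.0.41 wpci1:10.100.0.42 0 0 0"
    )
    print("router-id seen:", row["neighbor_id"], "link address:", row["address"])
```

The point is only that `neighbor_id` and `address` are different things, and it was the first one that I copied into the configs.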
I don't think this should make any difference, but it seems to. It might be that OSPF is getting confused by the router-ids and that is why it won't reconnect or shuts itself down, as the case may be. (I have my fingers crossed.)
Like I said it is a little early to claim victory, but I thought someone else who is having problems might experiment along with me and see if this helps (I don't think it can hurt).
bairdc
05-18-2005, 12:29 AM
Okay, I'm in for the test! I've got router-id set on all my OSPF routers now (previously it was unset). I really hope this works! Thanks for the suggestion!
Craig
mmc1800
05-18-2005, 04:05 PM
Looking good so far... Things have been very solid the last couple days, whew what a relief. Hope this holds.
lonnie
05-18-2005, 05:57 PM
I always use a router-id so maybe that is why I have never seen OSPF go unstable.
Good luck if you think quagga is more stable. If you remember we "upgraded" a few releases ago and we almost had a mutiny so that within 2 days we had removed it and put the zebra back.
OSPF is very easy and difficult to use. It is difficult to understand but once you do understand it, it is quite easy. Have you purchased the Cisco OSPF book that was recommended (under Library topic, set date filter to more than 2 months)? I did, and fell asleep more times than enough, but you know some of it stuck and now OSPF makes sense. It is stable from our own experience.
May we continue to advertise that we do in fact have a basic, stable OSPF?
mmc1800
05-18-2005, 10:58 PM
This is strange behavior for OSPF, but appears to be working better now. There is no reason letting the router-id default should make OSPF drop in and out (and eventually stay out) on neighboring radios, and require that you log in to the radio and restart OSPF to make it active, but it appears to have this behavior.
I have done a ton of reading on OSPF in the last few months, and know more about it now than I ever wanted to know, or needed to know for that matter. What I am doing with OSPF in StarOS I handle different ways on the other networks I have been maintaining for 20 years, BGP, Spanning Tree, 802.1q etc. This whole OSPF issue has been very frustrating as what we have been attempting to do is simple, not requiring any real in depth knowledge of the protocol (single area, small number of routers, basically static topology).
Nobody seems to have much trouble in the non StarOS world with OSPF on linux either Quagga or Zebra with this kind of simple network, so I don't think that is a big deal either way (I have no real preference but Gentoo for example only has quagga in the portage tree).
I am not going to comment on if this is a Zebra issue or a StarOS issue as I do not have access to try to fix it, and you never really know what causes an issue until you can fix it, but no doubt the interface between StarOS and OSPF is for sure buggy and I am hoping that is resolved, but for the moment at least it does not appear to be mission critical.
I will let you know how this all works out. I am very glad to have found at least an apparent resolution to this issue, but having my network bouncing up and down and needing babysitting for these months has been a hassle to say the least. I rewrote my configurations in just about every possible way before I came up with this, and it is not just having the router-id defined, it is having the router-id defined the "correct" way. Since we are dealing with a bug and not documented behavior, the "correct" way is kind of a hack to make OSPF work with StarOS and not really part of the OSPF spec, and certainly not anything a book would have solved. Attempting to point to your customer's lack of experience with a protocol to mask some clear bugs, and not trusting them enough to investigate and recreate the situation yourself, is not proper IMHO, but then again that is just my opinion, and I understand you have to pick and choose what you work on with limited resources. Just FYI, because something happens to work for some people doesn't mean there isn't something wrong with it that you might need to look at.
I have never found the chapter in any OSPF book that says the routers will turn OSPF off on neighbors if you allow OSPF to define a default router-id, and that hard coding the exact same ID in the configs will make it more stable. This behavior is specific to StarOS in my experience so far, and I still think it should be corrected (but buggy and working with a nonsense hack is much better than buggy and broken in my book). My opinion is that this issue will continue to cause various problems until it is fixed, the advanced routing checkbox window is flawed to say the least.
Just glad to have slept through a night finally without my pager yelling at me it was time to get up and reset another StarOS box.
lonnie
05-19-2005, 01:57 AM
Michael, I do resent your saying that there is a problem with StarOS specifically and zebra. The zebra is absolutely unmodified and the kernel is Linux with totally straight up TCP. There is nothing different here than ANY other system you can try this on. As for undocumented behaviour, well I am tired of that. Go RTFM from the zebra group or buy the Cisco book.
You found out you have to assign a router-id. For some reason I did that automatically since all the examples I saw had the router-id and mine worked. Go figure.
I am glad you are getting some sleep, but you gotta stop bashing us and just admit you did not have it configured properly.
mmc1800
05-19-2005, 02:17 AM
If you missed the details, none of the straight Linux boxes running zebra or quagga exhibit this behavior (3 different distributions, 6 different machines, different versions etc.) configured the same way.
I did not have OSPF misconfigured, and the only OSPF I have ever seen show this behavior is on your StarOS machines. Having router-id set is optional in OSPF, and not having it set is not a misconfiguration. Regardless, it should never end up with the behavior StarOS shows in that situation, which is impossible to duplicate on any Linux distro and simple to duplicate on StarOS.
I am running a large preexisting network which I fit StarOS into, and as other people have reported, only the StarOS OSPF routers have these problems.
I have no access to see how your software deals with starting and stopping the services, and making sure they are running, that is where I would look first if I was looking, but since your interface is closed source we don't have that option and it is not our job.
Sorry to report these facts, but this is what they are. Resent it all you want, pretending it isn't true isn't going to change it.
lonnie
05-19-2005, 09:09 AM
I am not pretending anything. I have systems running for weeks and months with no troubles like you report. You seem to have stumbled onto a "fix" that I have always used because all examples I have ever seen assign a router-id rather than letting the system choose. Call me picky but I like to assign anything that is an ID that gets used to identify a machine.
Anyway, this is going nowhere. We will keep an eye out for a new zebra release and will upgrade when one comes out.
There really is not much more we can do.
mmc1800
05-19-2005, 11:16 AM
It may be going nowhere for you, I am trying to get my network working.
I started with router-id defined in every configuration, that is not the only caveat to the "fix". There are several situations where putting a perfectly valid router-id makes the OSPF behave far worse than having none. If just adding any valid IP on the radio in router-id configuration was the answer to this problem I would have not had a problem about 2 hours after the first time I saw this issue.
I have been running some more testing as I have had a whole segment of my network that had previously been working fine start to fall apart after I added a router-id into the configurations, which gave me a chance to chase the issue down to a little more detail. This is how I ended up with no router-id in most of my configurations, because the network was more stable without one.
The only reason I am posting here is to attempt to return the favor of people who have posted before to get me as far as I am, but honestly there are a couple more steps between here and where we should be and I am not trying to attack you, just trying to help us all get there. I really appreciate everything you do, and I am for better or worse in your StarOS boat and I am not currently unhappy I am there. I would still make the choice to use your products, and I recommend them early and often. Just trying to get things to work as they should.
I have a few WRAPs that have no IP address on eth0 at all, or that pull their eth0 address from DHCP so that it cannot be hard coded into the configuration. I think the key to my earlier "fix" was that I happened to be entering one of the IP addresses of eth0 into the router-id field (which was also, FYI, the original configuration I used for all my radios). The reason the rest of my network was working relatively well was because there were no router-ids defined at all. The problem appears to be putting a certain type of IP in the router-id field, rather than whether you have one or not.
When there is no IP on eth0, or you cannot define it, the natural tendency is to put a static IP from one of the Atheros interfaces into the router-id field. I think this is where things start to go south. On the repeater radios that have no IP on eth0, or where eth0 is dynamic, I have just taken out router-id altogether, as putting any of the available IPs in that field makes MANY of the radios on the network unstable.
It is not just the radio with the router-id defined or not defined that starts acting up, it is the entire segment of the network around that radio. When the radios blow up, they do one of the following 3 things at what seems like random (but probably isn't), sometime within the first couple hours of OSPF being turned on (much sooner if you start resetting radios around the network and letting OSPF try to heal itself):
1] The radios will just turn OSPF off completely (the earlier mentioned symptoms, where you have to log in and just open the "Advanced Routing" window and the service will restart itself because it is already checked). Most of the time this is all that is required, but sometimes a full restart is needed.
or
2] they will leave OSPF on and functioning but only carry routes in one direction (i.e. they get all the routes from their OSPF peers, but they no longer send out updates for the networks they are responsible for), resulting in loss of connectivity to anything attached to the radio, while it still shows good pings to the NOC and the "ping watchdog" does not bother to reset the radio (this is a most annoying behavior).
3] they will leave OSPF on but they will no longer send updates to any of their neighbors, and no matter what you do (including restarting every radio on both sides) OSPF will never reestablish. The "show ip ospf route" output will only show directly attached networks, but "show ip ospf neighbor" will show the correct neighbors as if they are all in a good state, even though none of the routes are in the table (I could do some more packet sniffing here to see what is actually going on). The only way to bring the radio back is to log into the radio and restart OSPF manually, or restart the radio completely.
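To keep the three cases straight while testing, the checks can be written down as a small Python sketch. This is only a way of organizing the observations, not a StarOS feature; all four inputs are hypothetical values you would have to collect yourself (from the service listing, the neighbor table, the OSPF route table, and a look at a peer's routing table).

```python
def classify_failure(service_running, full_neighbors, learned_routes,
                     peers_see_our_networks):
    """Map observed state to the symptom numbers listed above (0 = healthy).

    service_running        -- True if the OSPF service shows as started
    full_neighbors         -- number of neighbors listed in a Full state
    learned_routes         -- OSPF routes beyond the directly attached networks
    peers_see_our_networks -- True if a peer has routes to this radio's networks
    """
    if not service_running:
        return 1  # OSPF has turned itself off
    if full_neighbors > 0 and learned_routes == 0:
        return 3  # neighbors look fine but only attached networks are known
    if learned_routes > 0 and not peers_see_our_networks:
        return 2  # routes come in, but nothing is being sent out
    return 0


if __name__ == "__main__":
    print(classify_failure(True, 3, 12, False))  # one-way routes
```

Case 2 is the one a ping watchdog will not catch, since the radio itself still answers, so it needs the check from a peer's side.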
I am still trying to lock down the exact symptoms and how to get them to repeat with some testing (things get slower to test when they start working better), we are up for 4 hours now with no OSPF crashes after changing the radios that have no static eth0 address to just not have a router-id defined at all rather than the atheros interface IP.
Craig reported he only had issues on his atheros radios, and it seems like I might only be having issues when I use an IP from an atheros interface, so possibly at this point it looks like it might be a good idea not to put the IP of an atheros radio in the router-id field, and no router-id seems to be better than one of those.
I am curious how things are going for you Craig after making the changes, and if any of the router-ids you ended up with were atheros interfaces?
bradg
05-19-2005, 02:02 PM
Not jumping into the fray here really, just putting in my experience with Star-OS and OSPF.
My hardware setup is mostly as follows:
1) All Atheros radios operating in the 5GHz bands, SSID hidden, all features enabled, and Inter-BSS relay disabled
2) All WRAP or Soekris net4501/4801 hardware
3) Latest Star-OS release 2.01.5 (although I had it working on 2.01.2 as well)
My OSPF configuration is as follows:
1) All wireless interfaces configured for non-broadcast
2) All wireless neighbors defined
3) All units in area 0 (so far anyway)
4) Redistribute kernel, connected are defined (all other redistribution statements removed)
5) Router ID uniquely defined as an arbitrary private IP address
6) Unused interfaces (ethernet or wireless), or customer access interfaces are defined as passive
7) Ethernet neighbors are not defined, allowed to be discovered via broadcast
And, after all of this, it "just works" finally.
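For comparison with the config posted earlier in the thread, the list above can be turned into a config fragment with a short Python sketch like the one below. It is a guess at the shape of such a config, not Brad's actual file: the interface names and addresses in the example call are placeholders, and the statements emitted (`ip ospf network non-broadcast`, `ospf router-id`, `redistribute`, `passive-interface`, `network`, `neighbor`) are the standard Zebra/Quagga ospfd ones.

```python
def render_ospfd_conf(router_id, wireless, passive, networks, neighbors):
    """Render an ospfd-style fragment following the seven points above."""
    lines = []
    for ifname in wireless:
        # Point 1: wireless interfaces are non-broadcast.
        lines += ["interface " + ifname, " ip ospf network non-broadcast", "!"]
    lines.append("router ospf")
    # Point 5: a unique, explicitly defined router-id.
    lines.append(" ospf router-id " + router_id)
    # Point 4: only kernel and connected routes are redistributed.
    lines += [" redistribute kernel", " redistribute connected"]
    # Point 6: unused or customer-facing interfaces are passive.
    lines += [" passive-interface " + ifname for ifname in passive]
    # Point 3: everything in area 0.
    lines += [" network " + net + " area 0.0.0.0" for net in networks]
    # Point 2: wireless neighbors are listed explicitly
    # (point 7: Ethernet neighbors are left out and discovered by broadcast).
    lines += [" neighbor " + ip for ip in neighbors]
    lines.append("!")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_ospfd_conf(
        router_id="10.255.255.1",          # placeholder
        wireless=["wpci0"],                # placeholder interface name
        passive=["wpci1"],                 # placeholder interface name
        networks=["10.100.0.0/30"],
        neighbors=["10.100.0.2"],
    ))
```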
General observations are:
I did have a *lot* of trouble when I was attempting to integrate it into our existing Cisco network, but I took a shortcut/cop-out and just statically routed the three Class C's needed on the wireless network over to the first Star-OS box acting as the uplink, and it's not been an issue since.
I probably won't revisit the Cisco integration issue until we're much further along, but as I said, it's working great now.
I think disabling Inter-BSS relay is part of the issue with needing to define neighbors on the wireless interfaces (which does make sense), but even with Inter-BSS relaying enabled it was a little hit-and-miss.
I had the "service showing stopped, then started" issue at one point, but that was when I was integrated into the Cisco network, and I've not seen it since.
Other than that, it's been working very well for a couple of months now, and we just added (upgraded) a site on the backhaul that is fully "OSPF'ed", and it came up and worked first time, and continues to behave as it should.
My $0.02 worth.
Brad
mmc1800
05-19-2005, 03:07 PM
Thanks Brad.
Just curious when you say the routers have a random private IP in the router-id, is that IP usually assigned to any of the interfaces?
I seem to be homing in on the issue at hand, and it seems to be tied to the choice of the router-id IP and where it is assigned on the radio (not 100% sure yet, but I can get the symptoms to start and stop by changing that variable). I have not tried using a random address that is not even assigned to the radio (that seems like a bad idea, but who knows at this point).
Unfortunately I cannot work around having OSPF stretch back past the StarOS radios onto my other routing equipment as the default routes change directions from one end of the network to the other depending on network outages and backhaul availability, and jump off in the middle for certain VPN routes that need to be rerouted if those links go down. So I cannot set static routes to one end or the other and work from there, I need dynamic routing all the way back. Fortunately integrating with the linux routers I am using does not seem to be any problem, the links from the StarOS OSPF to the Linux OSPF are almost 100% reliable so far except when the OSPF shuts off on the StarOS side and even then usually the linux/StarOS part of the OSPF will work if OSPF stays on at all (and for now I have gotten those particular radios pretty solid).
Thanks for throwing in your experience, the more the merrier when collecting data.
bradg
05-19-2005, 03:16 PM
Not necessarily random, but unique, and in a private IP range that's not routed or assigned anywhere.
A long time ago, related to Cisco reading, the Cisco router ID would be assigned based on the highest (or lowest - I forget which) numbered IP address on an interface in the system, unless the loopback interface was defined, in which case it used that IP address.
According to what I remember, the ID was mostly irrelevant, as long as it was unique to each router. Real OSPF experts here could probably chime in here if my memory is bad.
But, what I've been doing is assigning each OSPF enabled router a unique, sequentially numbered ID from a private IP address range, starting at 10.255.255.1 (at the uplink), and increasing the last dotted quad by one for each hop deeper in the network (up to a maximum of 40 per area IIRC). It's not assigned to any interfaces, just used as the router ID.
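As a rough illustration of that numbering scheme, a few lines of Python can generate the list. The hostnames are made up, the list is assumed to be ordered from the uplink outward, and the limit of 40 is just the figure recalled above, not something checked against a spec.

```python
import ipaddress


def assign_router_ids(hostnames, first="10.255.255.1", max_per_area=40):
    """Hand out sequential router-ids, one per router, starting at `first`.

    `hostnames` is assumed to be ordered from the uplink outward.
    """
    if len(hostnames) > max_per_area:
        raise ValueError("more routers than the per-area limit allows")
    base = ipaddress.IPv4Address(first)
    # IPv4Address supports integer addition, so base + i is the i-th ID.
    return {name: str(base + i) for i, name in enumerate(hostnames)}


if __name__ == "__main__":
    for name, rid in assign_router_ids(["uplink", "hop1", "hop2"]).items():
        print(name, rid)
```

The resulting table also doubles as the chart mapping each router-id back to a physical box, since the IDs are not assigned to any interface.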
It's worked flawlessly for me as long as I've been doing OSPF on Cisco, and now Star-OS.
I really should be purchasing the Cisco OSPF book for an "all in one" reference, but thus far have learned and found what I needed online.
Brad
mmc1800
05-19-2005, 03:43 PM
Interesting. I will play with that a bit and see what happens. OSPF is supposed to choose the highest IP address for its router-ID by default if it is not defined, but I had not thought to assign a router-ID that was not actually assigned to any IP on the box anywhere, that might help.
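For reference, the textbook selection rule (an explicit router-id wins; otherwise, on Cisco gear, the highest loopback address; otherwise the highest interface address) looks like this as a Python sketch. It describes the commonly documented convention, not what StarOS or its Zebra build actually does, which is exactly what is in question here. The function name and arguments are made up.

```python
import ipaddress


def default_router_id(interface_ips, loopback_ips=(), configured=None):
    """Pick a router-id the way the usual convention describes it."""
    if configured:
        return configured
    # Loopback addresses take precedence when any are defined.
    pool = list(loopback_ips) or list(interface_ips)
    if not pool:
        return None
    # Compare numerically; comparing the strings would rank 9.x above 10.x.
    return str(max(ipaddress.IPv4Address(ip) for ip in pool))


if __name__ == "__main__":
    print(default_router_id(["10.100.0.42", "10.200.2.1", "10.0.0.5"]))
```

Comparing the result against the Neighbor ID a peer actually reports is a quick way to see whether a box is following this rule.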
Of course you are supposed to be able to use any actual IP on the box for the router-ID. Having the ID be an actual routed IP on the box helps when you are trying to find out which box is having problems or advertising a route it shouldn't be (not having tracepath on the StarOS box can make that kind of a long process sometimes), or where a route is actually going from the router tables, but I can for sure live with keeping a separate chart that shows my actual IPs mapped to the unused (and unrouted) IPs for router-id only.
bairdc
05-19-2005, 05:37 PM
Well, following Michael's suggestion, I put router-ids on all my StarOS boxes. Since doing this, I haven't seen any OSPF issues, but then I haven't had any network trouble either. The only time I ever saw OSPF have issues was when my WRAP's were unexpectedly rebooting, or when I had to do an "activate changes" somewhere on an OSPF router. I haven't had any of that happen since I put in the router-id on the boxes. I'd like to try rebooting some boxes or doing some "activate changes" to see if the problem is fixed, but I have too many users on the network to have the luxury of experimentation. Unfortunately, I'll just have to wait for a "network event" that causes OSPF to have to recalculate its routes.
My personal opinion is that this is an issue with Zebra doing OSPF over Atheros interfaces. This would explain why Michael has not seen the problem on his non StarOS routers running Zebra. I've been running OSPF since October of last year, and like I mentioned before, the only machines that I've ever had trouble with are my Atheros-based backbone routers. I've got several Orinoco links and a lot of ethernet links doing OSPF, and I've never seen a router lose communication with its neighbor on those--only on Atheros links. I've got a total of four WRAPs with Atheros cards in them running OSPF, and I have seen this happen on every one of them at one time or another. OTOH, I've got several non-Atheros WRAP and PC-based StarOS boxes running OSPF, and they've never had a problem.
This is a real issue. Michael isn't just blowing smoke. However, there also has to be an explanation for why Michael and I are seeing it, but others aren't. The router-id thing definitely seems to be a common factor. Michael and I weren't using it. Others who seem to not be having this problem are using it. Personally, I'm excited to think that maybe this is the cure for the problem I've been seeing. Maybe I won't have to switch to RIPv2 after all...
Craig