OSPF and the clock


Skaught
08-19-2005, 01:56 AM
Does the clock have to be in sync for OSPF to work?

We seem to be able to get it to work if the clocks are in sync, but without a route to the internet, the other end cannot sync its clock, and it always thinks it is February 2000 after a reboot.

Has anyone tried installing a battery on the Wrap on the RTC battery connection? Am I smoking something?

bradg
08-19-2005, 01:10 PM
Does the clock have to be in sync for OSPF to work?

We seem to be able to get it to work if the clocks are in sync, but without a route to the internet, the other end cannot sync its clock, and it always thinks it is February 2000 after a reboot.

Has anyone tried installing a battery on the Wrap on the RTC battery connection? Am I smoking something?

I've been casually watching the Zebra mailing list on this for a couple months now, and there appears to be a possible "fix" for it, but I'm not sure it's been implemented, or even can be. See here - http://marc.theaimsgroup.com/?l=quagga-users&m=111982795304388&w=2

Paul Cunnane (who I believe posts on this board also) first pointed it out to the Zebra team back in May. One discussion about it is here - http://marc.theaimsgroup.com/?l=zebra&m=111221861625809&w=2 and another here - http://marc.theaimsgroup.com/?l=zebra&m=111973314518527&w=2

I may do some digging to see if anything like what was discussed in the Quagga thread was checked into CVS. I'm not a programmer, but if this would "fix" OSPF in Zebra and make it behave, I'd gladly pay a halfway decent sum of money to help motivate someone to implement it.

Brad

bradg
08-19-2005, 01:35 PM
I've been casually watching the Zebra mailing list on this for a couple months now, and there appears to be a possible "fix" for it, but I'm not sure it's been implemented, or even can be. See here - http://marc.theaimsgroup.com/?l=quagga-users&m=111982795304388&w=2

Well, further digging in the Zebra 0.95 source shows that all time references I could find in Zebra appear to be of the "gettimeofday" variety, and not a single instance of "clock_gettime" - needed for "monotonic time".

man gettimeofday - http://www.cl.cam.ac.uk/cgi-bin/manpage?2+gettimeofday

man clock_gettime - http://www.cl.cam.ac.uk/cgi-bin/manpage?3+clock_gettime

I have not looked in the current CVS to see if it's been addressed.

Grrrrr, wish I were a more experienced programmer to try this out.
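
For anyone following along, here is a minimal sketch of the difference between the two calls. It is not taken from the Zebra source: the function names wall_seconds and monotonic_seconds are made up for illustration, and it assumes a POSIX system that provides CLOCK_MONOTONIC.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for clock_gettime() and CLOCK_MONOTONIC */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock seconds since the epoch. This value steps whenever NTP or an
 * administrator sets the date. */
double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);              /* not expected to fail with a valid pointer */
    (void)rc;
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Seconds since an arbitrary starting point. Setting the system date does
 * not move this clock. Returns -1.0 if the clock cannot be read. */
double monotonic_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

Only the difference between two monotonic readings is meaningful; the absolute value is not a date. On older glibc versions the program may need to be linked with -lrt.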

Skaught
08-19-2005, 04:21 PM
So basically OSPF will not work for our network?

How is everyone else doing it? I must be missing something here.

bradg
08-19-2005, 11:50 PM
So basically OSPF will not work for our network?

How is everyone else doing it? I must be missing something here.

Well, it does "work", just not as solid as I think it should be - definitely not "Cisco" solid, and that's my end goal.

I did a bunch of playing with "clock_gettime" today and I'm going to take a blind, Hail-Mary poke at it this weekend just to see if it will compile if nothing else. I couldn't program my way out of a wet paper bag anymore, but if nothing else, it will help brush some of the dust off - even if it is a kludge.

I couldn't say for sure if the clock issue is the root of the overall OSPF weirdness, but it is at least confirmed to be an issue in certain cases (like NTP setting the system clock after routes come up).
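
To illustrate why a clock step can upset timers, here is a toy model in C. It assumes timers are stored as absolute deadlines on the wall clock. Whether ospfd stores them exactly this way is an assumption; the thread only establishes that Zebra reads time with gettimeofday. The struct and function names are hypothetical, and the 40 second interval is simply the usual OSPF dead-interval default on broadcast links.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A timer stored as an absolute deadline, in seconds, on some clock. */
struct deadline_timer {
    long long expires_at;
};

void timer_arm(struct deadline_timer *t, long long now, long long interval)
{
    assert(t != NULL);
    t->expires_at = now + interval;
}

bool timer_expired(const struct deadline_timer *t, long long now)
{
    assert(t != NULL);
    return now >= t->expires_at;
}

/* Arm a 40 s timer at `start`, let `real_elapsed` seconds really pass, and
 * apply a clock step of `step` seconds in between (positive = forward).
 * Returns true if the timer is seen as expired afterwards. */
bool fires_after_step(long long start, long long real_elapsed, long long step)
{
    struct deadline_timer t;
    timer_arm(&t, start, 40);
    return timer_expired(&t, start + real_elapsed + step);
}
```

In this model a forward step of several years (a box that boots thinking it is 2000 and is then corrected by NTP) makes every pending timer look overdue at once, while a backward step keeps overdue timers from firing. With a monotonic clock the step term is always zero.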


Brad

oscarBravo
08-20-2005, 12:08 PM
Paul Cunnane (who I believe posts on this board also) first pointed it out to the Zebra team back in May.

I did indeed. It's interesting to note that Paul Jakma (the lead developer of Quagga) responded to that thread, but Kunihiro (Zebra's developer) didn't.

OSPF in Zebra seems to be fairly badly broken in a number of ways. Our testing shows Quagga is a lot less broken, but Lonnie's had bad experiences with it.

oscarBravo
08-20-2005, 12:12 PM
I did a bunch of playing with "clock_gettime" today and I'm going to take a blind, Hail-Mary poke at it this weekend just to see if it will compile if nothing else. I couldn't program my way out of a wet paper bag anymore, but if nothing else, it will help brush some of the dust off - even if it is a kludge.

I've done some half-assed C hacking in my time - you have my gmail address if you want to run anything by me.

bradg
08-20-2005, 12:58 PM
I've done some half-assed C hacking in my time - you have my gmail address if you want to run anything by me.


Well, at this point, you're the resident expert on Zebra's OSPF implementation! Care to elaborate on other brokenness you've seen?

I'm just wondering if it's worth hacking on to see if it gets any better, or try to convince Tony/Lonnie to try (a newer version of) Quagga again?

I think fixing the time calls will be relatively easy if the calls are simply replaced with an appropriate bit of code calling clock_gettime (CLOCK_MONOTONIC, &...), although it may break some of the portable-ness for non *NIX platforms. But the real question of the day is - even as a Linux specific "hack", will it significantly help iron out the goofiness?
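
A rough sketch of what such a replacement could look like, assuming the existing code only needs a struct timeval for measuring intervals. The helper names stable_gettimeofday and elapsed_usec are hypothetical and do not come from the Zebra or Quagga sources. The compile-time check keeps the wall-clock call as a fallback on platforms without CLOCK_MONOTONIC.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for clock_gettime() and CLOCK_MONOTONIC */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Fill *tv from the monotonic clock when the platform offers one, and fall
 * back to the wall clock otherwise. Returns 0 on success, -1 on failure. */
int stable_gettimeofday(struct timeval *tv)
{
    assert(tv != NULL);
#if defined(CLOCK_MONOTONIC)
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
        tv->tv_sec = ts.tv_sec;
        tv->tv_usec = ts.tv_nsec / 1000;   /* nanoseconds to microseconds */
        return 0;
    }
#endif
    return gettimeofday(tv, NULL);          /* wall-clock fallback */
}

/* Microseconds elapsed between two readings taken with the helper above. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}
```

One caveat with a blanket search-and-replace: anything that needs a real date, such as log timestamps, would still have to use gettimeofday, because a monotonic reading is only useful for measuring intervals.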

I really, really hate having to dig into this so deep as it distracts me from the real task at hand, but I also really, really need OSPF to "just work", and I do not want to have to revert to RIP if there is anything I can possibly do to help it. God Forbid if I ever have to go back to static routing. Oh, I'm getting a headache...

I also wonder if Tony/Lonnie would be willing to implement a "fixed" Zebra on a private beta basis, assuming I or someone can get the time issues fixed to see if it helps? I guess that's the only way we'd know since I don't have any other machines running Zebra. And, at this point, I'm more than willing to be a Guinea Pig on a brand new site I'm building (3 dual radio WRAP's) if it's of any value.


Brad

oscarBravo
08-20-2005, 02:42 PM
Well, at this point, you're the resident expert on Zebra's OSPF implementation! Care to elaborate on other brokenness you've seen?

ospfd crashes sometimes, which is mucho bad. It also has some blatantly erroneous behaviour: the worst I've seen is a recurring situation where a router is unaware of the network directly attached to an interface, even though "sh ip os in" shows that OSPF is running on the interface, and "sh ip os ne" even shows neighbours attached to the network in question. There's no way this should be possible, and I've brought it to Kunihiro's attention, but didn't get anywhere.

I'm just wondering if it's worth hacking on to see if it gets any better, or try to convince Tony/Lonnie to try (a newer version of) Quagga again?

I've been pestering Lonnie to have another go at Quagga... right Lonnie?

I think fixing the time calls will be relatively easy if the calls are simply replaced with an appropriate bit of code calling clock_gettime (CLOCK_MONOTONIC, &...), although it may break some of the portable-ness for non *NIX platforms. But the real question of the day is - even as a Linux specific "hack", will it significantly help iron out the goofiness?

clock_gettime() is a POSIX function, so it shouldn't really break portability. Even if it did, at this stage I'm happy to live with a hack that will make my network sane again.

I really, really hate having to dig into this so deep as it distracts me from the real task at hand, but I also really, really need OSPF to "just work", and I do not want to have to revert to RIP if there is anything I can possibly do to help it. God Forbid if I ever have to go back to static routing. Oh, I'm getting a headache...

Sounds like we're in very similar situations. OSPF really is our only option, as we need the routing redundancy. Problem is, babysitting flakey OSPF routers is becoming a full-time job.

I also wonder if Tony/Lonnie would be willing to implement a "fixed" Zebra on a private beta basis, assuming I or someone can get the time issues fixed to see if it helps? I guess that's the only way we'd know since I don't have any other machines running Zebra. And, at this point, I'm more than willing to be a Guinea Pig on a brand new site I'm building (3 dual radio WRAP's) if it's of any value.

We've got a virtual lab where we're more than happy to stress-test any experimental builds. C'mon Lonnie, can we, huh please, can we can we can we, c'mon pleeease?

bradg
08-20-2005, 03:31 PM
ospfd crashes sometimes, which is mucho bad. It also has some blatantly erroneous behaviour: the worst I've seen is a recurring situation where a router is unaware of the network directly attached to an interface, even though "sh ip os in" shows that OSPF is running on the interface, and "sh ip os ne" even shows neighbours attached to the network in question.

I've seen that too (unaware of local interface networks), although I never paid enough attention to the circumstances at the time to see if it was a result of something else going on. You know, customers to keep happy, data to move. That's what it's about.

Later this weekend, I'll download the latest stable and CVS Quagga sources and peek to see if it suffers the same timestamp issues that Zebra does.

I've been pestering Lonnie to have another go at Quagga... right Lonnie?

I really hate to pester Lonnie/Tony much on the issue right this instant, but I would at least love to have a list of possible fixes (maybe even tested) that I can give or point them to and ask/beg for a little attention after I've done some of the footwork. OSPF is a really, really important part of the network right now, and going another route is going to be pretty painful at best.

... babysitting flakey OSPF routers is becoming a full-time job.

I agree, it's the *only* remaining nuisance I have with my almost 100% Star-OS wireless network. It'll run fine for days, or even a couple of weeks, but as soon as a link goes down (due to whatever - including "activate changes"), I get paged and need to chase the ghost through the network one or more hops to get things to converge again. You shouldn't need to worry about that - ever.

We've got a virtual lab where we're more than happy to stress-test any experimental builds. C'mon Lonnie, can we, huh please, can we can we can we, c'mon pleeease?

The newest site I'm building is a sectorized setup with three /27 networks at each sector radio, three /30 PtP radios, and a shared /29 on the ethernet interfaces. OSPF is/will be active on all but the AP interfaces. One of the three PtP radios will connect another site, which will then connect back to our main site and close the loop for redundancy on that set of links. Since there will be only a couple (very tolerant) customers on the two new sites until into early October, I'd be more than happy to volunteer them for test in the name of making this problem go away.

I'm going to do what I can to dig into it deeper. I can't afford to make it my full time crusade, but I really want to get to the bottom of it. It is the last remaining oddity to iron out, and it's going to be nice to see this particular bug go away!


Brad

oscarBravo
08-20-2005, 03:44 PM
Later this weekend, I'll download the latest stable and CVS Quagga sources and peek to see if it suffers the same timestamp issues that Zebra does.

We tested Quagga on our virtual lab. It loses routes when the clock changes, but it seems to reliably rebuild the link state databases.

The monotonic mod is still an important change, but I'd like to see Quagga in StarOS even as is - it's hard to imagine how it could be any worse than Zebra.
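
One plausible way a clock change could cost routes, sketched in C. This is a guess at a mechanism and has not been confirmed from the Zebra or Quagga code. The assumption is that an LSA's age is derived from the wall-clock time at which it was received. The only fact used is MaxAge = 3600 seconds from the OSPF specification (RFC 2328); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define OSPF_MAX_AGE 3600          /* seconds, MaxAge in RFC 2328 */

/* Hypothetical bookkeeping for one link state advertisement. */
struct lsa_entry {
    long long received_at;         /* clock reading when the LSA arrived */
    int age_at_receipt;            /* LS age field carried in the LSA */
};

/* Current age of the LSA, clamped to the range 0..MaxAge. */
long long lsa_current_age(const struct lsa_entry *lsa, long long now)
{
    assert(lsa != NULL);
    long long age = lsa->age_at_receipt + (now - lsa->received_at);
    if (age < 0)
        age = 0;
    if (age > OSPF_MAX_AGE)
        age = OSPF_MAX_AGE;
    return age;
}

/* An LSA that has reached MaxAge is no longer used for route calculation. */
int lsa_is_expired(const struct lsa_entry *lsa, long long now)
{
    return lsa_current_age(lsa, now) >= OSPF_MAX_AGE;
}
```

Under that assumption, a forward step of more than an hour would age out the whole database in one go, and the routes would only return once neighbours re-flood their LSAs. That would be consistent with losing routes and then rebuilding, but it remains a hypothesis.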

bairdc
08-26-2005, 12:56 PM
Geez, I'm glad to hear that others are still having the same OSPF issues as me! I was under the impression that almost everyone except me had figured out how to make it behave, and that either my config or my network was just plain screwy.

So does Tony's and Lonnie's silence on this mean that:

A. They aren't reading this thread.
B. They aren't willing to give Quagga another try no matter what.
C. They're just plain too busy with other stuff to bother with it.
D. They're currently in the process of testing a new StarOS release with Quagga, which fixes the OSPF problem and makes it stable, and therefore don't have time to respond to this thread (Let's hope...).

Seriously, I know how busy Tony and Lonnie are, and that they're being pulled in a million different directions. But I really feel that this is a major issue. I implemented OSPF in order to have better stability through redundant routes. However, I think I've ended up with less stability than before.

Oh how I long for a stable OSPF.... :-(

Craig

Skaught
08-26-2005, 03:59 PM
We tried it on one link that had 2 subnets behind it, and the route would go up and down every 10 minutes. So I think our config was sound, since it works 50% of the time.

I went back to static routes until someone tells me why it behaved that way.

lonnie
08-26-2005, 10:46 PM
bairdc, I have asked the guys to test quagga in their lab and tell me that it fixes the problems. They have a VMWare test system and it is easier for them to test quagga than it is for us to rebuild with quagga.

Sorry, but the last time was something I won't repeat. As I said, we had been told that quagga was solid yet when we took the time to build with it, it was a disaster and we had to undo it and bring out a new release in less than 2 days. Sure it is under active development but that does not mean it is better. It just means they are adding and fixing, but could be fixing things they have added.

It is not fair to us to go through that process unless you are sure it is better. That is all I am asking. If it is so important spend some time and help us out here.

bminish
08-27-2005, 03:54 AM
bairdc, I have asked the guys to test quagga in their lab and tell me that it fixes the problems. They have a VMWare test system and it is easier for them to test quagga than it is for us to rebuild with quagga.


We tested Quagga in late June, and as far as OSPF is concerned it has none of the issues that have been plaguing us with Zebra. Quagga does not like the clock jump when NTP sets the correct time either, but it does recover just fine from it.

A test setup based on Debian is going to be significantly different from StarOS, since we have no idea how Lonnie's kernel is configured and we only have virtual ethernet ports. However, we were able to do the following:

1/ Set up a 10-node interlinked StarOS network in VMware (all virtual ethernet ports) and observe the issues that we are seeing on the real network.

2/ Set up a 10-node virtual Debian network (2.4 kernel, Zebra) and observe the issues that we were seeing on the StarOS network, including being able to fully diagnose the clock jump problem, since we could manipulate the clock at will.

3/ Set up a 10-node virtual Debian network as above, except with Quagga as the routing agent. We could not see any serious issues and were unable to get the Quagga-routed network into any of the broken states that we were seeing with Zebra. Quagga also has a clock jump issue, BUT Quagga will rebuild its databases and recover, whereas Zebra often won't.

I do not know if Quagga will introduce RIP problems. I have never used RIP nor do I really understand RIP. I just know that RIP can't do what we need for our network

We reported our findings to Lonnie at the end of June. As it currently stands, StarOS DOES NOT have a stable working OSPF. We also offered to test a StarOS build with Quagga before it becomes a general release. I can also set it up in VMware so that Lonnie can play with it (to verify that RIP is OK, for example).
We are still waiting (and babysitting an unstable network).

I have a recent version of FreeBSD, and it uses Quagga as the routing agent; Debian uses Quagga as the routing agent too. These are conservatively managed distros with large user bases that have moved away from Zebra in favour of Quagga.

The ball is very much in Valemount's court on this one

.Brendan

oscarBravo
09-01-2005, 03:33 PM
Any thoughts, Lonnie? As it stands, OSPF in StarOS doesn't work. It's going to be an obstacle to the adoption of StarOS if it can't handle a redundantly-routed network. We've put a lot of work into diagnosing this so far, and we're prepared to put more work into testing potential fixes.

This needs to be fixed.

lonnie
09-01-2005, 06:43 PM
We'll try Quagga but do not expect it for at least a month.

I have been running an OSPF segment for 6 months right now and it does not have issues. I suspect that if I tried to do more than I am it might, but to say OSPF does not work is a bit extreme.

This config is on a repeater that has a CM9 in AP mode and Client mode. I do not have very much happening and it never quits or gets fooled. What would you add to my configuration? Not much I hope, since this works perfectly.

Current configuration:
!
hostname ospfd
password 1234
!
!
!
interface eth0
 ip ospf cost 100
!
interface wpci0
 ip ospf cost 500
!
interface wpci1
 ip ospf cost 500
!
interface lo
!
interface tunl0
!
interface gre0
!
interface eth1
!
interface ecb
!
interface ipacct
!
interface beacon
!
interface wlanbr
!
interface cbq
!
router ospf
 ospf router-id 10.10.237.254
 network 10.0.0.0/16 area 0.0.0.0
!
access-list vtylist permit 127.0.0.1/32
access-list vtylist deny any
!
line vty
 access-class vtylist
!
end
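
A small sketch of how the "ip ospf cost" values above feed into path selection: OSPF adds up the cost of each outgoing interface along a path and prefers the lowest total. The costs 100 (ethernet) and 500 (wireless) come from the configuration above; the paths being compared are hypothetical, since the thread does not describe the topology, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Total OSPF cost of a path, given the cost of each outgoing interface. */
unsigned long path_cost(const unsigned *costs, size_t n)
{
    assert(costs != NULL || n == 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += costs[i];
    return total;
}

/* Returns 0 if path a is cheaper, 1 if path b is cheaper, -1 on a tie.
 * What a router does with a tie (for example installing both routes) is
 * up to the implementation and is not modelled here. */
int preferred_path(const unsigned *a, size_t na, const unsigned *b, size_t nb)
{
    unsigned long ca = path_cost(a, na);
    unsigned long cb = path_cost(b, nb);
    if (ca == cb)
        return -1;
    return (ca < cb) ? 0 : 1;
}
```

With these weights, three ethernet hops (300) beat a single wireless hop (500), and it takes five ethernet hops to tie with one wireless hop.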

lonnie
09-01-2005, 06:46 PM
This unit is at the end, where my ADSL lines are located.

Current configuration:
!
hostname ospfd
password 1234
!
!
!
interface eth0
 ip ospf cost 100
!
interface eth1
 ip ospf cost 100
!
interface lo
!
interface tunl0
!
interface gre0
!
interface eth2
!
interface ecb
!
interface ipacct
!
interface beacon
!
interface wlanbr
!
interface eth3
!
interface eth4
!
interface cbq
!
router ospf
 ospf router-id 10.0.0.1
 network 10.0.0.0/16 area 0.0.0.0
 network 204.50.71.112/28 area 0.0.0.4
 network 204.50.234.88/30 area 0.0.0.5
 default-information originate
!
access-list vtylist permit 127.0.0.1/32
access-list vtylist deny any
!
line vty
 access-class vtylist
!
end
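
As a reading aid for the "network" lines above: each statement enables OSPF on any interface whose address falls inside the given prefix and places it in the named area. The sketch below shows that prefix test in C. The prefixes come from the configuration above, but the interface addresses used in the usage note are hypothetical, because the thread does not list them. The function names are also made up.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four octets into a 32-bit IPv4 address in host byte order. */
uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a < 256 && b < 256 && c < 256 && d < 256);
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* True if addr lies inside prefix/len, where 0 <= len <= 32. */
int prefix_contains(uint32_t prefix, unsigned len, uint32_t addr)
{
    assert(len <= 32);
    /* A /0 mask is handled separately to avoid shifting by 32 bits. */
    uint32_t mask = (len == 0) ? 0u : (0xFFFFFFFFu << (32 - len));
    return (addr & mask) == (prefix & mask);
}
```

For example, an interface at 204.50.234.90 would match 204.50.234.88/30, while 204.50.234.92 would not. Note also that 10.10.237.254 lies outside 10.0.0.0/16. That address appears in this thread only as a router-id, which does not have to match any network statement, so this is an observation about the prefix and not evidence of a misconfiguration.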

lonnie
09-01-2005, 06:48 PM
This is an end unit. It has no statics and no default route. It gets everything from OSPF along the chain.

Current configuration:
!
hostname ospfd
password 1234
!
!
!
interface eth0
!
interface lo
!
interface tunl0
!
interface gre0
!
interface eth1
 ip ospf cost 100
!
interface eth2
!
interface ecb
!
interface ipacct
!
interface beacon
!
interface wlanbr
!
interface cbq
!
interface eth0.100
!
router ospf
 ospf router-id 10.10.250.9
 network 10.10.250.0/24 area 0.0.10.250
 area 0.0.10.250 stub no-summary
!
access-list vtylist permit 127.0.0.1/32
access-list vtylist permit 10.10.250.0/24
access-list vtylist deny any
!
line vty
 access-class vtylist
!

lonnie
09-01-2005, 06:50 PM
This is another router in my ADSL section.

Current configuration:
!
hostname ospfd
password 1234
!
!
!
interface eth0
 ip ospf cost 100
!
interface eth1
 ip ospf cost 100
!
interface lo
!
interface tunl0
!
interface gre0
!
interface eth2
!
interface ecb
!
interface ipacct
!
interface beacon
!
interface wlanbr
!
interface cbq
!
router ospf
 ospf router-id 10.0.0.2
 network 10.0.0.0/16 area 0.0.0.0
 network 204.50.234.92/30 area 0.0.0.1
!
access-list vtylist permit 127.0.0.1/32
access-list vtylist deny any
!
line vty
 access-class vtylist
!
end

oscarBravo
09-02-2005, 03:44 AM
I have been running an OSPF segment for 6 months right now and it does not have issues. I suspect that if I tried to do more than I am it might, but to say OSPF does not work is a bit extreme.

When we had a relatively small network, it worked also. The problems seem to scale non-linearly with the size of the network - Brad, has that been your experience also?

This config is on a repeater that has a CM9 in AP mode and Client mode. I do not have very much happening and it never quits or gets fooled. What would you add to my configuration? Not much I hope, since this works perfectly.

Our configurations are basically vanilla also. The problems don't seem to arise with configuration issues, so much as with scaling the network and adding redundant routes.

bradg
09-02-2005, 08:40 AM
When we had a relatively small network, it worked also. The problems seem to scale non-linearly with the size of the network - Brad, has that been your experience also? Our configurations are basically vanilla also. The problems don't seem to arise with configuration issues, so much as with scaling the network and adding redundant routes.

It seems on my network, that the further (in hops) away from the default route you get, the more pronounced the problem gets - which does make some sense when I think about it.

And, the hops aren't all wireless - in three places it's 100Base-T, so link quality isn't an issue there. And the wireless links are all rock solid as well (qualities in the mid 20's at a minimum).

But it may be worth noting that I have never once lost the route from our office (bandwidth source - default route) to our main uplink hub site which then feeds four PtP sites out of town, and grows from there.

Additionally, I experienced the most instability when I had the wireless (mostly Star-OS) network fully OSPF integrated with my other Cisco OSPF areas. I've now statically routed the wireless netblocks to the Star-OS uplink box, and everything in the wireless network is running a standalone area 0. I was going to attempt another area within the wireless network this fall, but I'm hesitant to do so at this time for fear of adding another variable to the mix.

I've also not implemented any redundant routes yet, since it would involve the landline (Cisco) network right now. I'm in the process of building a couple new sites which will allow me to close two rings and give those areas some redundancy.

Also, it appears to be the router furthest away from the default route that goes wonky and loses its route out (toward the default route). When a router or network segment drops off, I have always been able to SSH into the router that's reachable deepest in the network toward the breakage from the default route, then use the Star-OS SSH client to SSH across the hop that isn't routing to reach the other router - the link itself has always been fine. Then, either "activate changes", or reboot the other machine and it will usually come up.

The "usually" part is what drives me crazy though. I'd venture about a quarter of the time, I have to "activate changes" or reboot both sides of the non-working link to get routes to propagate. Unfortunately, when it's down to that, I'm in a hurry to get the network back up, so I haven't taken the time to try to diagnose why sometimes both units need to be poked to bring OSPF back to life - I just know that if I poke in a certain order, it'll return to normal operation for an unpredictable period of time.

But therein lies the problem - it's in a production environment, with paying customers, and we need a stable OSPF that doesn't cause these headaches in the first place.

I really don't mind chasing hardware related issues in my network - antennas damaged or blown out alignment, power supply issues, CPE issues - all of those I can diagnose and fix. The OSPF issue is one that we as users of Star-OS cannot fix ourselves - there appears to be no magic configuration or design workaround.

I know time is an issue - it always is and always will be for all of us. But Lonnie, you need to trust us that these issues do exist, and that these problems obviously do have a profound impact on our network when they decide to seemingly randomly appear. What I think Paul and I are asking is that a stable OSPF implementation in v2 become some kind of a priority on the to-do list. I know there are things that are of higher priority on the list, but please - at least put this on the priority list somewhere.

bairdc
09-02-2005, 01:01 PM
I also didn't see problems when I had OSPF running on a small segment. I ran it on a three-hop segment from last October to around January or February. It never missed a beat. It wasn't until I moved a large portion of my network over to it that it started going nuts. Incidentally, I also was not receiving a default route via OSPF on my initial test. Default was still static. When I expanded my OSPF use, I started injecting the default route, and that's also when my trouble started. So whether the problem has to do with running OSPF on a large network or with receiving a default route, I don't know.

My problems only seem to occur when I do an activate changes or if I reboot an OSPF router somewhere (or in the case of a WRAP, if it reboots on its own). After doing that, about half the time I end up with a router, either one or sometimes several hops away, that just loses its mind. It basically stops exchanging routes on the interface facing the router that rebooted (or where changes were activated).

To fix it, I can either wait for about 30 minutes, at which point the problem usually seems to resolve itself. Or, I can jump in and start poking at stuff to try to get it to work again. I usually end up SSH'ing into the router that is having trouble, and stopping and re-starting OSPF. As Brad mentioned, sometimes that fixes it, and sometimes it doesn't. If it doesn't, I may have to get into the other end of the link, and restart OSPF there. In some cases, I've found that restarting OSPF on the machine having trouble fixes it, but then another router down the line somewhere goes nuts and I have to jump in and fix it. All I can say here is thank goodness that StarOS has an SSH client in it so I can reach the dead router(s).

With regard to default routes, I probably have a different setup from most others. I'm injecting two default routes into my network, since I have two physically separate points where traffic leaves my network to the Internet. I don't know if this affects my OSPF stability or not, but if the problem is related to the default route, I suppose this could be an issue.

Right now, I'm running both OSPF and RIP. Since doing this, things are now stable. However, I don't know how long this will work. I'm wanting to move more of my network over to dynamic routing, but I'm afraid RIP's 16-hop limit is going to cause me grief.
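
For reference, the limit works like this: RIP treats a metric of 16 as "unreachable", so the largest usable metric is 15. A minimal sketch in C, assuming every hop adds a cost of 1 (the function names are made up for illustration):

```c
#include <assert.h>

#define RIP_INFINITY 16            /* a metric of 16 means unreachable */

/* Metric after the route crosses one more hop, capped at infinity. */
int rip_add_hop(int metric)
{
    assert(metric >= 1 && metric <= RIP_INFINITY);
    return (metric + 1 >= RIP_INFINITY) ? RIP_INFINITY : metric + 1;
}

/* Metric after `hops` further hops, starting from `start_metric`. */
int rip_metric_after_hops(int start_metric, int hops)
{
    int metric = start_metric;
    for (int i = 0; i < hops; i++)
        metric = rip_add_hop(metric);
    return metric;
}

/* A route is usable only while its metric is below infinity. */
int rip_reachable(int metric)
{
    return metric < RIP_INFINITY;
}
```

A route that starts with metric 1 is still usable 14 hops later (metric 15) and becomes unreachable one hop after that, which is why a long chain of routers eventually runs out of room under RIP.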

Lonnie, I appreciate your offer to try Quagga one more time. Since it's a concern to you, you could release it as a beta (or alpha) with a big announcement saying that it is for testing purposes only and may break your network. After it's been tested fully and proven not to have issues, you could make a general release of it.

Craig

lonnie
09-02-2005, 01:34 PM
We will try Quagga, but it will be impossible for us in the next two weeks so it will be at least 3 weeks before we can get a test image.