OSPF-over-wifi links flapping when no other traffic is running on them?


kb1_kanobe
02-04-2005, 03:21 PM
I've been trying to get some simple OSPF route exchange working between my Wrap-based StarOS 2.01.0 boards. Everything works fine until the link goes quiet for an extended period; after that it is hit and miss whether the Hellos make it over the air, which usually results in a link-down propagating across the area.

The general arrangement is:

EndA(Quagga)--[100Mbps Ethernet]--RtrA(Zebra)--[100Mbps Ethernet]--WifiA(Zebra)--[802.11g]--WifiInterchange(Zebra)--[802.11g]--WifiB(Zebra)--[100Mbps Ethernet]--RtrB(Zebra)--[100Mbps Ethernet]--EndB(Quagga)

If I watch the local routing table on EndA with nothing else running on the network, the routes associated with WifiInterchange and beyond flap quite regularly (I've lowered my hello and dead intervals for testing). If I open an SSH session to WifiInterchange and leave it sitting there, the routes associated with WifiB and beyond keep flapping, but everything else stabilises. If I SSH to WifiB, everything stays up solid. I get the same effect if I push streams of pings across the wire to the various hosts.

My two 802.11g legs are on separate CM-9 radios using separate channels (albeit hosted on the same Wrap board), and they operate without problems under all other circumstances. Losses on the links are minimal: more than 3 million pings at 10 ms intervals lost 38 packets end to end, with an average RTT of 5 ms.
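A rough back-of-the-envelope check on those ping figures, as a minimal sketch: it assumes losses are independent and that a neighbour is dropped only after roughly dead/hello consecutive Hellos go missing (k = 3 with the 1 s / 3 s test timers). Both are simplifying assumptions, and the ping test ran while the link was busy, which is exactly when it behaves.

```python
def p_consecutive_loss(loss_rate, k):
    """Probability that k consecutive, independently lost packets are all lost."""
    return loss_rate ** k


def mean_years_between_runs(loss_rate, k, interval_s):
    """Rough mean time between runs of k consecutive losses when one packet
    is sent every interval_s seconds and losses are independent."""
    seconds = interval_s / p_consecutive_loss(loss_rate, k)
    return seconds / (365.25 * 24 * 3600)


loss = 38 / 3_000_000   # measured: 38 lost out of ~3 million pings
k = 3                   # ~dead-interval 3 s / hello-interval 1 s

print(f"loss rate            {loss:.2e}")
print(f"P({k} in a row)        {p_consecutive_loss(loss, k):.2e}")
print(f"years between runs   {mean_years_between_runs(loss, k, 1.0):.2e}")
```

Under these assumptions a run of three lost Hellos would be expected roughly once in ten million years, and even with k = 2 it is still about once in two centuries, so independent random loss at the measured rate doesn't account for the flaps. That fits the idea that something happens specifically when the link is idle.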

What I'm wondering is whether the radios are going into some sort of 'power save' or 'frame consolidation' mode that might cause excessive delays when transferring small, infrequent traffic such as multicast OSPF Hello frames.

For those of you more familiar with OSPF internals, consider this debug output from around a link-down event (notice the apparently delayed Hello at 01:28:48, which arrives just after the inactivity timer has expired):

...
2000/03/04 01:28:41 OSPF: Hello sent to [224.0.0.5] via [eth0:10.255.255.6].
2000/03/04 01:28:41 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:42 OSPF: Hello received from [135.132.23.1] via [wpci0:10.1.59.200]
2000/03/04 01:28:42 OSPF: src [10.1.59.3],
2000/03/04 01:28:42 OSPF: dst [224.0.0.5]
2000/03/04 01:28:42 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (HelloReceived)
2000/03/04 01:28:42 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: nsm_ignore called
2000/03/04 01:28:42 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (2-WayReceived)
2000/03/04 01:28:42 OSPF: Hello sent to [224.0.0.5] via [eth0:10.255.255.6].
2000/03/04 01:28:42 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:43 OSPF: Hello received from [135.132.23.1] via [wpci0:10.1.59.200]
2000/03/04 01:28:43 OSPF: src [10.1.59.3],
2000/03/04 01:28:43 OSPF: dst [224.0.0.5]
2000/03/04 01:28:43 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (HelloReceived)
2000/03/04 01:28:43 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: nsm_ignore called
2000/03/04 01:28:43 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (2-WayReceived)
2000/03/04 01:28:43 OSPF: Hello sent to [224.0.0.5] via [eth0:10.255.255.6].
2000/03/04 01:28:43 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:44 OSPF: Hello received from [135.132.23.1] via [wpci0:10.1.59.200]
2000/03/04 01:28:44 OSPF: src [10.1.59.3],
2000/03/04 01:28:44 OSPF: dst [224.0.0.5]
2000/03/04 01:28:44 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (HelloReceived)
2000/03/04 01:28:44 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: nsm_ignore called
2000/03/04 01:28:44 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (2-WayReceived)
2000/03/04 01:28:44 OSPF: Hello sent to [224.0.0.5] via [eth0:10.255.255.6].
2000/03/04 01:28:44 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:45 OSPF: Hello received from [135.132.23.1] via [wpci0:10.1.59.200]
2000/03/04 01:28:45 OSPF: src [10.1.59.3],
2000/03/04 01:28:45 OSPF: dst [224.0.0.5]
2000/03/04 01:28:45 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (HelloReceived)
2000/03/04 01:28:45 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: nsm_ignore called
2000/03/04 01:28:45 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Full (2-WayReceived)
2000/03/04 01:28:45 OSPF: Hello sent to [224.0.0.5] via [eth0:10.255.255.6].
2000/03/04 01:28:45 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:46 OSPF: Hello sent to [224.0.0.5] via [eth0:10.255.255.6].
2000/03/04 01:28:46 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:47 OSPF: Hello sent to [224.0.0.5] via [eth0:10.255.255.6].
2000/03/04 01:28:47 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Timer (Inactivity timer expire)
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: State change Full -> Down
2000/03/04 01:28:48 OSPF: nsm_change_state(): scheduling new router-LSA origination
2000/03/04 01:28:48 OSPF: DR-Election[1st]: Backup 10.1.59.200
2000/03/04 01:28:48 OSPF: DR-Election[1st]: DR 10.1.59.200
2000/03/04 01:28:48 OSPF: DR-Election[2nd]: Backup 0.0.0.0
2000/03/04 01:28:48 OSPF: DR-Election[2nd]: DR 10.1.59.200
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: neighbor deleted
2000/03/04 01:28:48 OSPF: Hello received from [135.132.23.1] via [wpci0:10.1.59.200]
2000/03/04 01:28:48 OSPF: src [10.1.59.3],
2000/03/04 01:28:48 OSPF: dst [224.0.0.5]
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Down (HelloReceived)
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: State change Down -> Init
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Init (2-WayReceived)
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: State change Init -> ExStart
2000/03/04 01:28:48 OSPF: DR-Election[1st]: Backup 10.1.59.3
2000/03/04 01:28:48 OSPF: DR-Election[1st]: DR 10.1.59.200
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: ExStart (AdjOK?)
2000/03/04 01:28:48 OSPF: DR-Election[1st]: Backup 0.0.0.0
2000/03/04 01:28:48 OSPF: DR-Election[1st]: DR 10.1.59.3
2000/03/04 01:28:48 OSPF: DR-Election[2nd]: Backup 10.1.59.200
2000/03/04 01:28:48 OSPF: DR-Election[2nd]: DR 10.1.59.3
2000/03/04 01:28:48 OSPF: DR-Election[1st]: Backup 10.1.59.200
2000/03/04 01:28:48 OSPF: DR-Election[1st]: DR 10.1.59.3
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: ExStart (AdjOK?)
....
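To make gaps like this easier to spot in a long capture, here is a small Python sketch that scans ospfd debug lines in the format shown above, reports every gap between consecutive Hellos received from the same neighbour that meets a threshold, and lists inactivity-timer expiries. The regular expressions are written against these sample lines only, not against any documented log format, and the timestamps have one-second resolution, so gaps are only accurate to about a second.

```python
import re
from datetime import datetime

TS = r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})"
HELLO_RX = re.compile(TS + r" OSPF: Hello received from \[([\d.]+)\] via \[([^\]]+)\]")
EXPIRE_RX = re.compile(TS + r" OSPF: NSM\[([^\]]+)\]: Timer \(Inactivity timer expire\)")
TS_FMT = "%Y/%m/%d %H:%M:%S"


def hello_gaps(lines, threshold_s):
    """Yield (router_id, interface, gap_seconds, ended_at) for each gap between
    consecutive Hellos from the same neighbour that is >= threshold_s."""
    last = {}
    for line in lines:
        m = HELLO_RX.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group(1), TS_FMT)
        key = (m.group(2), m.group(3))          # (router ID, interface:address)
        if key in last:
            gap = (ts - last[key]).total_seconds()
            if gap >= threshold_s:
                yield key[0], key[1], gap, m.group(1)
        last[key] = ts


def inactivity_expiries(lines):
    """Return (timestamp, NSM tag) for every inactivity-timer expiry."""
    matches = (EXPIRE_RX.match(line.strip()) for line in lines)
    return [m.groups() for m in matches if m]


log = """\
2000/03/04 01:28:45 OSPF: Hello received from [135.132.23.1] via [wpci0:10.1.59.200]
2000/03/04 01:28:46 OSPF: Hello sent to [224.0.0.5] via [wpci0:10.1.59.200].
2000/03/04 01:28:48 OSPF: NSM[wpci0:10.1.59.200:135.132.23.1]: Timer (Inactivity timer expire)
2000/03/04 01:28:48 OSPF: Hello received from [135.132.23.1] via [wpci0:10.1.59.200]
""".splitlines()

for gap in hello_gaps(log, threshold_s=2):
    print(gap)
print(inactivity_expiries(log))
```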

Any thoughts?

kb1_kanobe
02-04-2005, 03:38 PM
I forgot to mention that, as the debug output I posted shows, OSPFd on RtrB (i.e. via eth0) was not running when it was captured. That was by design, to help isolate the origin of the flap (and to cut down on the volume of debug data generated).

bminish
02-04-2005, 05:33 PM
Are your wireless links defined as non-broadcast, and do they have their neighbours defined?
There seems to be some broadcast traffic going over the wireless links, and this does not always work correctly there, although it's fine over Ethernet interfaces.
With wireless interfaces you are better off defining the interfaces as non-broadcast and specifying the neighbours.

Have you disabled unneeded redistributes? In particular, 'redistribute kernel' breaks things, but you should probably disable all redistributes and enable only the ones you actually need; otherwise you run the risk of creating feedback loops, which will cause route flapping.

What do your costings look like?

It would be helpful to see your OSPF config files.

.Brendan
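A likely reason the non-broadcast suggestion helps, assuming standard 802.11 MAC behaviour: in broadcast mode OSPF sends Hellos to the multicast group 224.0.0.5 (AllSPFRouters), and group-addressed 802.11 frames are neither acknowledged nor retransmitted, whereas the unicast Hellos sent to configured neighbours in non-broadcast mode are retried by the radio until acknowledged or a retry limit is reached. The toy model below compares the two; the 20% per-attempt loss and the four transmission attempts are hypothetical numbers, and losses are treated as independent, which real radio links rarely are.

```python
def hello_miss_prob(frame_loss, attempts):
    """Chance a Hello is never delivered when the MAC makes `attempts`
    independent transmissions, each lost with probability frame_loss."""
    return frame_loss ** attempts


def adjacency_drop_prob(frame_loss, attempts, hellos_per_dead):
    """Chance that every Hello sent within one dead interval is missed."""
    return hello_miss_prob(frame_loss, attempts) ** hellos_per_dead


FRAME_LOSS = 0.2   # hypothetical per-attempt loss on a marginal link
HELLOS = 3         # e.g. hello-interval 1, dead-interval 3

multicast = adjacency_drop_prob(FRAME_LOSS, attempts=1, hellos_per_dead=HELLOS)  # no MAC retries
unicast = adjacency_drop_prob(FRAME_LOSS, attempts=4, hellos_per_dead=HELLOS)    # hypothetical retry budget

print(f"multicast Hellos: {multicast:.1e}")
print(f"unicast Hellos:   {unicast:.1e}")
```

Even on this deliberately poor hypothetical link, MAC-level retries cut the chance of losing the adjacency within one dead interval by roughly six orders of magnitude.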

kb1_kanobe
02-04-2005, 05:46 PM
At the moment everything on all the links is broadcast. I can try an NBMA or point-to-point configuration over the wireless and see if that takes care of it, although I had understood NBMA to be unreliable in Zebra.

I am also redistributing kernel and connected routes; I will remove kernel and see what happens. The configuration file from the interchange location (essentially the same as on the other Wrap boxes) reads:

hostname i80211bg
password xxxx
!
interface eth0
 ip ospf hello-interval 1
 ip ospf dead-interval 3
!
interface lo
!
interface tunl0
!
interface gre0
!
interface eth1
 ip ospf hello-interval 1
 ip ospf dead-interval 3
!
interface wpci0
 ip ospf hello-interval 1
 ip ospf dead-interval 5
!
interface wpci1
 ip ospf hello-interval 1
 ip ospf dead-interval 3
!
interface ecb
!
interface ipacct
!
interface beacon
!
interface wlanbr
!
interface cbq
!
router ospf
 redistribute kernel
 redistribute connected
 network 10.0.59.0/24 area 0.0.0.1
 network 10.1.59.0/24 area 0.0.0.1
!
access-list vtylist permit 127.0.0.1/32
access-list vtylist deny any
!
line vty
 access-class vtylist
!
end

As you can see, it's pretty close to the default apart from the reduced hello and dead intervals for testing. I haven't got any other hosts involved yet; I want things to be stable on a straightforward network before making it complicated. :-D
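One thing worth checking with tweaked timers: RFC 2328 requires the HelloInterval and RouterDeadInterval to match on both ends of a link, and a router discards received Hellos whose values differ from its own. The posted config uses dead-interval 5 on wpci0 but 3 everywhere else, so here is a minimal Python sketch that pulls the per-interface timers out of two ospfd.conf files and compares one link. It assumes Zebra's defaults of 10 and 40 seconds for interfaces with no explicit setting, and the peer config in the example is hypothetical.

```python
DEFAULT_HELLO, DEFAULT_DEAD = 10, 40   # Zebra defaults (seconds)


def ospf_timers(config_text):
    """Map interface name -> (hello, dead) seconds for a Zebra ospfd.conf."""
    timers, current = {}, None
    for raw in config_text.splitlines():
        words = raw.split()
        if not words:
            continue
        if words[0] == "interface" and len(words) > 1:
            current = words[1]
            timers[current] = [DEFAULT_HELLO, DEFAULT_DEAD]
        elif words[0] in ("!", "router"):
            current = None                     # interface stanza has ended
        elif current and words[:3] == ["ip", "ospf", "hello-interval"]:
            timers[current][0] = int(words[3])
        elif current and words[:3] == ["ip", "ospf", "dead-interval"]:
            timers[current][1] = int(words[3])
    return {name: tuple(t) for name, t in timers.items()}


def timer_mismatches(cfg_a, if_a, cfg_b, if_b):
    """List the Hello/Dead settings that differ across one link."""
    hello_a, dead_a = ospf_timers(cfg_a)[if_a]
    hello_b, dead_b = ospf_timers(cfg_b)[if_b]
    problems = []
    if hello_a != hello_b:
        problems.append(f"hello-interval {hello_a} vs {hello_b}")
    if dead_a != dead_b:
        problems.append(f"dead-interval {dead_a} vs {dead_b}")
    return problems


interchange = "interface wpci0\n ip ospf hello-interval 1\n ip ospf dead-interval 5\n!\n"
peer = "interface wpci0\n ip ospf hello-interval 1\n ip ospf dead-interval 3\n!\n"  # hypothetical far end
print(timer_mismatches(interchange, "wpci0", peer, "wpci0"))
```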

kb1_kanobe
02-04-2005, 07:54 PM
Hmmm... I don't seem to be able to get non-broadcast OSPF running between the three Wrap boards. Would anyone mind posting a sample or reference configuration for non-broadcast mode over a wifi link?

Thanks.

bminish
02-05-2005, 04:32 AM

Post 14 in this thread (http://forums.star-os.com/showthread.php?t=3537) is pretty much how we have it working here.

A few things about your config:

1# Since the Ethernet links can stay broadcast, the interfaces that 'meet' Quagga will not change. Non-broadcast requires you to specify your neighbours, but the Zebra version implemented in StarOS is rock solid: our whole network runs OSPF exclusively, and all our wireless interfaces are non-broadcast.

Non-broadcast interfaces require the neighbours to be defined in the router ospf section, like this:
neighbor x.x.x.x

2# 'redistribute kernel' definitely breaks things, including at the Quagga ends. The kernel routing table is populated by OSPF, so redistributing it to other OSPF nodes creates all sorts of problems (feedback loops), and I cannot think of a situation where this would be desirable.

Don't use 'redistribute connected' unless you are sure you need it.

The node at the network edge that has the default route needs 'default-information originate' set under router ospf.

3# Get rid of your tweaked timings, at least until you get things working properly. The defaults (hello-interval 10 and dead-interval 40, in seconds) are reasonable settings.

4# If this is a backbone, you should be in area 0, since area 0 is for backbone routes.

.Brendan
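The four points above can be rolled into a rough sanity check for an ospfd.conf. This Python sketch is a heuristic, not a parser for the full Zebra grammar: it assumes the radio interfaces are named wpci* as in the posted StarOS config, it cannot tell which interfaces OSPF actually runs on, and the area warning only matters if the network really is a backbone.

```python
def lint_ospfd(config_text, wireless_prefixes=("wpci",)):
    """Return heuristic warnings for a Zebra/Quagga ospfd.conf."""
    warnings = []
    section = None                      # ("interface", name), ("router", None) or None
    wireless, nonbroadcast, neighbors = set(), set(), []
    for raw in config_text.splitlines():
        line = raw.strip()
        words = line.split()
        if line == "!":
            section = None
        elif line.startswith("interface "):
            section = ("interface", words[1])
            if words[1].startswith(wireless_prefixes):
                wireless.add(words[1])
        elif line == "router ospf":
            section = ("router", None)
        elif section and section[0] == "interface":
            if line == "ip ospf network non-broadcast":
                nonbroadcast.add(section[1])
        elif section and section[0] == "router":
            if words[:2] == ["redistribute", "kernel"]:
                warnings.append("redistribute kernel: feeds OSPF-learned routes back into OSPF")
            elif words[:2] == ["redistribute", "connected"]:
                warnings.append("redistribute connected: keep only if actually needed")
            elif words and words[0] == "neighbor":
                neighbors.append(words[1])
            elif words and words[0] == "network" and words[-1] not in ("0", "0.0.0.0"):
                warnings.append(f"{words[1]} is in area {words[-1]}, not the backbone (area 0)")
    for name in sorted(wireless - nonbroadcast):
        warnings.append(f"{name}: wireless interface is not 'ip ospf network non-broadcast'")
    if nonbroadcast and not neighbors:
        warnings.append("non-broadcast interfaces configured but no 'neighbor' statements")
    return warnings


posted = """\
interface eth0
 ip ospf hello-interval 1
!
interface wpci0
!
interface wpci1
!
router ospf
 redistribute kernel
 redistribute connected
 network 10.1.59.0/24 area 0.0.0.1
!
"""
for warning in lint_ospfd(posted):
    print(warning)
```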

kb1_kanobe
02-06-2005, 11:52 AM
bminish wrote:
Non broadcast interfaces require the neighbours to be defined in the router ospf section like this
neighbor x.x.x.x

I was trying the same thing with the interface defined as 'point-to-point' and/or 'point-to-multipoint' rather than 'ip ospf network non-broadcast'. Now that I've corrected that, I'm actually seeing the unicast Hellos on the wire.
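Since a sample non-broadcast configuration was asked for earlier, here is a Python sketch that renders matching minimal ospfd.conf fragments for both ends of one wireless link: each interface gets 'ip ospf network non-broadcast', and each router's 'neighbor' statement names the other end's interface address (not its router ID). The addresses come from the debug log above; the far end's interface name and the use of area 0.0.0.0 are assumptions for illustration.

```python
def nbma_link(network, side_a, side_b, area="0.0.0.0"):
    """Render ospfd.conf fragments for both ends of one non-broadcast link.

    side_a and side_b are (interface_name, interface_address) tuples; each
    end's 'neighbor' statement names the *other* end's interface address."""
    def render(iface, peer_addr):
        return "\n".join([
            f"interface {iface}",
            " ip ospf network non-broadcast",
            "!",
            "router ospf",
            f" network {network} area {area}",
            f" neighbor {peer_addr}",
            "!",
        ]) + "\n"

    return render(side_a[0], side_b[1]), render(side_b[0], side_a[1])


# Addresses from the debug log; the far end's interface name is a guess.
conf_a, conf_b = nbma_link("10.1.59.0/24",
                           ("wpci0", "10.1.59.200"),
                           ("wpci0", "10.1.59.3"))
print(conf_a)
print(conf_b)
```

On the interchange board, with two radios, each wpci interface would need the same treatment, with one 'neighbor' line per far-end address.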

bminish wrote:
2# redistribute kernel definitely breaks things, including at the quagga ends. the kernel routing table is generated by OSPF, redistributing this to other OSPF nodes creates all sorts of problems (feedback loops) and I cannot think of a situation where this might be desirable.

Noted.

bminish wrote:
The node at the network edge that has the default route needs to be set
default-information originate
under router ospf

There isn't a 'default' route out of this network; OSPF is just managing multipath route availability for an entirely private network.

bminish wrote:
4# If this is a backbone you should be in area 0 since area 0 is for backbone routes.

Ah yes. I already have other OSPF devices running on the Ethernets on both sides, one of which is in area 0. I was using area 1 to avoid propagating this information into the production side until I could work out why the flaps were occurring. Now that it's behaving, I'll get the right areas configured.

Thanks again for that first tip; it was the key missing piece of information for me.

:-)