View Full Version : Olsr


therealboss
04-29-2007, 02:08 AM
I know I have a routing issue but can't find what it is. See below for the layout.

                AP1 10.20.1.2
                     |
                     |
AP2 10.20.1.1   AP3 10.20.1.3   AP4 10.20.1.4
     |               |               |
     |               |               |
    AP5             AP6             AP7
     |               |               |
     |               |               |
    AP8             AP9             AP10


All traffic going to the internet has to pass through AP1, but every now and then there is a packet storm and traffic bounces between AP1, AP3 and AP4, back to AP2, AP1 and AP4, and then back to AP1 before going out to the internet.

A Trace Route can give any of the following:-

AP8 > AP5 > AP1

or

AP8 > AP5 > AP1 > AP4 > AP3 > AP1 > AP2 > AP1
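
A path that visits the same node more than once is the signature of a routing loop. A minimal sketch of how the two traces above could be checked automatically: the function name is hypothetical, and the second hop list assumes a ">" between AP3 and AP1, which the original trace does not show.

```python
def find_routing_loop(hops):
    """Return (node, first_index, repeat_index) for the first node that
    appears twice in a traceroute path, or None if no node repeats."""
    seen = {}
    for index, hop in enumerate(hops):
        if hop in seen:
            return hop, seen[hop], index
        seen[hop] = index
    return None


# Hop lists transcribed from the two traces in this post.
good_trace = ["AP8", "AP5", "AP1"]
bad_trace = ["AP8", "AP5", "AP1", "AP4", "AP3", "AP1", "AP2", "AP1"]

print(find_routing_loop(good_trace))
print(find_routing_loop(bad_trace))
```

In the second trace the packet has already reached AP1 and is then forwarded away from the gateway again, so the fault is most likely in the route that AP1 itself holds at that moment, not on the far side of the network.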

I am running OLSR on all AP's. This has run fine for months, but over the past few weeks I have had this problem on and off.

I have checked all the configs over and over.
I have had another WISP engineer check my configs and he can't find any problem either.

We have changed the above layout and given all 3 AP's being fed from AP1 their own subnet, and at the same time set up 2 VLAN's.

This has forced the traffic to take the correct route, but now we see traffic going from AP2 to AP1, back to AP2, and then out to the internet over AP1.

After looking deeper into this I now see packet loss radio to radio and also LAN to LAN. I have replaced the CAT5 and replaced the radios, but when I get these packet storms I get packet loss over all the connections.

Links are in the range of -72 with a quality of 90% or above.

This is killing us and our heads are wrecked at this stage, over the past 3 weeks we have replaced ALL the equipment from the Gateway AP along for the next 4 hops.

Are there any bugs in OLSR that might give strange results when feeding 3 AP's from 1 Radio??

Please HELP, or I think I will go mad!!!!!!!!!!!!!


UPDATE

As you can see below, the packet loss shows up as a TTL exceeded error. This was a ping from Eth0 on one AP to Eth0 on the next WAR, over a 25' cable run. The cable was replaced with one I tested all night with no loss.

It's all pointing to routing, but I can't find anything, and I don't understand why it all worked fine for the past 6 months and has now all gone to pot. Anything new added over the past months has been turned off, and the problem is still the same.


10.30.1.3 : [23], 84 bytes, 0.92 ms (1.41 avg, 0% loss)
10.30.1.3 : [24], 84 bytes, 0.80 ms (1.38 avg, 0% loss)
10.30.1.3 : [25], 84 bytes, 0.91 ms (1.36 avg, 0% loss)
10.30.1.3 : [26], 84 bytes, 1.12 ms (1.35 avg, 0% loss)
ICMP Time Exceeded from 10.30.1.1 for ICMP Echo sent to 10.30.1.3
10.30.1.3 : [28], 84 bytes, 1.82 ms (1.37 avg, 3% loss)
10.30.1.3 : [29], 84 bytes, 10.8 ms (1.71 avg, 3% loss)
10.30.1.3 : [30], 84 bytes, 10.1 ms (2.00 avg, 3% loss)
10.30.1.3 : [31], 84 bytes, 0.72 ms (1.96 avg, 3% loss)
10.30.1.3 : [32], 84 bytes, 1.19 ms (1.93 avg, 3% loss)
10.30.1.3 : [33], 84 bytes, 0.91 ms (1.90 avg, 2% loss)
10.30.1.3 : [34], 84 bytes, 0.97 ms (1.87 avg, 2% loss)
10.30.1.3 : [35], 84 bytes, 0.91 ms (1.84 avg, 2% loss)
10.30.1.3 : [36], 84 bytes, 0.91 ms (1.82 avg, 2% loss)
10.30.1.3 : [37], 84 bytes, 0.91 ms (1.79 avg, 2% loss)
10.30.1.3 : [38], 84 bytes, 1.01 ms (1.77 avg, 2% loss)
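
For long overnight runs it can help to pull the interesting events out of a log like this automatically. The following is only a sketch: it assumes the two line formats seen in the output above, and the function and field names are made up for illustration.

```python
import re

# Line formats assumed from the ping output in this post.
REPLY = re.compile(r"^(\S+) : \[(\d+)\], \d+ bytes, ([\d.]+) ms")
EXCEEDED = re.compile(r"^ICMP Time Exceeded from (\S+) for ICMP Echo sent to (\S+)")


def summarize(lines):
    """Collect missing sequence numbers, Time Exceeded senders and the worst RTT."""
    seqs, rtts, exceeded_from = [], [], []
    for raw in lines:
        # Drop any terminal frame or scroll-marker characters around the line.
        line = raw.strip(" │^*\n")
        reply = REPLY.match(line)
        if reply:
            seqs.append(int(reply.group(2)))
            rtts.append(float(reply.group(3)))
            continue
        exceeded = EXCEEDED.match(line)
        if exceeded:
            exceeded_from.append(exceeded.group(1))
    missing = sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs)) if seqs else []
    return {
        "missing": missing,
        "exceeded_from": exceeded_from,
        "max_rtt_ms": max(rtts, default=None),
    }


SAMPLE = """\
10.30.1.3 : [26], 84 bytes, 1.12 ms (1.35 avg, 0% loss)
ICMP Time Exceeded from 10.30.1.1 for ICMP Echo sent to 10.30.1.3
10.30.1.3 : [28], 84 bytes, 1.82 ms (1.37 avg, 3% loss)
10.30.1.3 : [29], 84 bytes, 10.8 ms (1.71 avg, 3% loss)
"""

summary = summarize(SAMPLE.splitlines())
print(summary)
```

The point of recording who sent the Time Exceeded message is that a TTL only runs out when the packet keeps being forwarded, so the lost ping was being routed around rather than dropped on the cable.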


I really need to sort this out, as I have about 25 users who use our network for VPN. They keep losing their connections and they are starting to get pissed.

tog
04-29-2007, 05:36 AM
If you believe you may have a dynamic routing problem, simply add static routing entries in for the duration of your troubleshooting.

You may have packet loss over one of your links and it may be causing OLSR to drop the neighbor.

therealboss
04-29-2007, 06:50 AM
Tog
I have a static default route on each of these boxes but still get the same problem. Each time I run a trace route to AP1, sometimes it gets there in 4 hops, and other times it takes 9 hops or just times out.

lonnie
04-29-2007, 09:48 AM
Does AP4 have any knowledge of AP5? It should not have a static route or HNA statement for it.

It would seem that AP2 to AP1 is having trouble and AP4 steps in to "help" out.

therealboss
04-29-2007, 10:42 AM
Lonnie
As per TOG's post, I have placed statics on these first 4 hops and things are running better but still not right. I understand that I should not put in any statics and never have, but this problem is killing me, so if a static helps get the clients off my back while I look deeper into this, that's OK with me.

These are the results from some tests I have run and am still running.


AP2 ping to AP1 No loss, 3600+ pings (So this is good)
AP2 ping to AP5 No loss, 3700+ pings (So this is good)
AP2 Ping to AP8 No Loss 3799+ pings (So this is good)
AP8 ping to AP1 10% loss 3600+ pings (Big problem)

lonnie
04-29-2007, 05:53 PM
We'll try and get a quickie release out next week with new kernel and newest OLSR and I note a new quagga. It could be a busy week.

I just replied to your PM but then I decided we should get you updating with the upcoming release and newly fixed programs, so just sit tight and we'll get moving on this right away.

therealboss
04-29-2007, 06:19 PM
Lonnie
This is killing me: over 150 clients with little or no internet for 3 days solid. This is one of those problems that has got bigger and bigger over the past 2 weeks, and now it's so bad that clients are ringing me 24/7.

lonnie
04-29-2007, 11:54 PM
I just logged in and you have design issues. I cannot begin to change things from here because of all the interactions that I can mess up.

Basically you have an AP on the first unit with subnets of 10.20.1.x, 10.20.2.x, and 10.20.3.x.

It is my strict rule to have a single subnet on a physical segment, so please get that renumbered and make the AP 10.20.1.1 with the clients as 10.20.1.254, 10.20.1.253, and 10.20.1.252. I believe it complicates the network and possibly confuses routing to have multiple subnets on a segment. I realize it should not, but I think OLSR has trouble and quagga certainly does. Many things that interface with kernel routing only function with the first IP subnet on an interface.

I cannot start to make changes like that since I could easily lose contact and have changes half made, which would spell disaster for you.

So, please get those systems renumbered, and remember that any physical segment (an AP and its clients, or an Ethernet switch with all the Ethernets that connect to it) should carry a single subnet. There is NO reason to have multiple subnets, and the system becomes a lot easier to remember with fewer subnets and IP's to think of.

It is a design issue for us, and we always make the main unit on the subnet .1, thus the AP is .1 and the clients work down from .254, assuming a /24 subnet of course. Adjust as necessary for the subnet size.
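
The numbering convention described above can be sketched with Python's standard ipaddress module. This is an illustration only: the function name is hypothetical, and the /24 is just the example subnet from this thread.

```python
import ipaddress


def plan_segment(cidr, client_count):
    """Give the AP the first host address and number clients down from the last."""
    hosts = list(ipaddress.ip_network(cidr).hosts())
    if client_count + 1 > len(hosts):
        raise ValueError("subnet too small for the AP plus %d clients" % client_count)
    ap = hosts[0]
    clients = list(reversed(hosts))[:client_count]
    return ap, clients


ap, clients = plan_segment("10.20.1.0/24", 3)
print("AP:", ap)
for client in clients:
    print("client:", client)
```

Because the helper works from the list of usable host addresses, the same call adjusts by itself for a smaller subnet such as a /27.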

kbldawg
04-30-2007, 10:17 AM
It is my strict rule to have a single subnet on a physical segment,...I believe it complicates the network and possibly confuses routing to have multiple subnets on a segment. I realize it should not, but I think OLSR has trouble and quagga certainly does. Many things that interface with kernel routing only function with the first IP subnet on an interface.

There is NO reason to have multiple subnets and the system becomes a lot easier to remember with fewer subnets and IP's to think of.



WOW, I didn't know that.

ripv
05-29-2007, 09:16 AM
So, please get those systems renumbered, and remember that any physical segment (an AP and its clients, or an Ethernet switch with all the Ethernets that connect to it) should carry a single subnet. There is NO reason to have multiple subnets, and the system becomes a lot easier to remember with fewer subnets and IP's to think of.

It is a design issue for us, and we always make the main unit on the subnet .1, thus the AP is .1 and the clients work down from .254, assuming a /24 subnet of course. Adjust as necessary for the subnet size.

Sorry Lonnie - just to check my design
I have a Backhaul & 2 AP's switched together as follows

Star4 - Backhaul - eth 10.10.2.1/24
wpci1 - 172.16.1.2/24 [ other end 172.16.1.1/24]

Star5 - AP
eth 10.10.2.2/24
wpci1 - 172.16.50.1/24
[clients dhcp -> 172.16.50.10/24 to 172.16.50.200/24]

Star6 - AP
eth 10.10.2.3/24
wpci1 - 172.16.51.1/24
[clients dhcp -> 172.16.51.10/24 to 172.16.51.200/24]

ripv
05-30-2007, 04:02 AM
I guess what I'm really asking is for confirmation that

1) a switch with everything connected to it is 1 physical segment and
2) an AP Radio side with connected clients is another physical segment and
3) another AP Radio side (even though the ethernet side is on the same switch as 1 & 2 above) with its connected clients is another physical segment?

lonnie
05-30-2007, 09:09 AM
A segment is denoted by a switch for Ethernet. Everything that plugs to a common switch is on the same segment.

For a wireless segment each AP denotes a segment and all radios that connect to that AP are the same segment.

DrLove73
05-30-2007, 12:00 PM
lonnie, I know one VERY GOOD reason to have 2, 3 or 4 subnets on the same physical device (same radio card, right?).

If you need client1 to see client2 and client3, which are all on the same radio, you cannot do that on the same subnet without InterBSS. Those are your own words, and so true.

It has something to do with the router "knowing" that if you are on the same subnet you can see each other without its involvement, and since InterBSS is OFF (for a VERY good reason: low pings, etc.) you cannot.

Oh, yes, if I missed the point, I apologise.

tog
05-30-2007, 12:42 PM
InterBSS off = inter-client communications ignored, only AP<-->client can communicate.

InterBSS on = inter-client communications allowed, client<-->client can broadcast arp and see each other directly and the Linux kernel level will never see it, it's all handled by the Atheros AP playing its role as a hub/switch. That means no CBQ limiting, no firewall rules, etc.

As I've said before, if you need two clients on the same AP to be able to communicate with each other the most hassle-free way to get it done that I've used is routing a /30 to each client.
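
A rough sketch of what carving a /30 per client might look like, again using the standard ipaddress module. The parent block, client names and function name are all placeholders, not anything from a real config.

```python
import ipaddress


def carve_client_links(block, client_names):
    """Assign each client its own /30: first host for the AP side, second for the client."""
    subnets = list(ipaddress.ip_network(block).subnets(new_prefix=30))
    if len(client_names) > len(subnets):
        raise ValueError("not enough /30 subnets in %s" % block)
    plan = {}
    for name, subnet in zip(client_names, subnets):
        ap_side, client_side = subnet.hosts()
        plan[name] = {"subnet": subnet, "ap_side": ap_side, "client": client_side}
    return plan


# Hypothetical parent block and client names, for illustration only.
plan = carve_client_links("10.20.1.0/24", ["client1", "client2", "client3"])
for name, link in plan.items():
    print(name, link["subnet"], "AP side", link["ap_side"], "client", link["client"])
```

With each client in its own /30, client-to-client traffic has to be routed through the AP, so it passes through the kernel where firewall rules and rate limiting apply, and InterBSS can stay off.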

lonnie
05-30-2007, 01:08 PM
It is acceptable for this and possibly a few other scenarios. The main thing I was getting at is to keep it organized and not spread a subnet over several (bridged) segments. Don't have a bunch of /24 subnets on a segment. Try and keep it to a minimum. Just don't bridge.

knolan
06-03-2007, 08:13 AM
Lonnie,

I'm having the same type of OLSR problems that "TheRealBoss" was having.


I believe the network design we have is what you advise.

Fully routed
No Bridges
One IP per interface (except where we have multiple subnets for customers)
OLSR HNA statements only for CPE subnets
One HNA Statement for 0.0.0.0/0

Using Togs cut down version of the OLSR script.


OLSR is in use because we have 2 routes back to our core (soon to be 4 routes), and each of these nodes also has a route to other nodes that does not go via the core.


Before I go back to using RIP, which works very well except when you have a ring in the network (we have 1 today, with another 2 rings planned), can you give me any idea why OLSR would send a tracert to the wrong host, with that host ending up sending the tracert back before it times out?

The effect on the network is that everything looks OK and pings work throughout the network; then for a time I get timeouts, with a host on a different segment of the network reporting "TTL expired in transit" before "Request timed out", and then everything works OK again.


Thanks
Keith

tog
06-03-2007, 09:58 AM
It sounds like you are losing your OLSR conversation, perhaps your link has some packet loss or is actually cutting out. If you want to troubleshoot you can temporarily slap some static routes on there alongside your active OLSR and the static routes will remain in place even if the OLSR session goes down.

knolan
06-03-2007, 05:14 PM
I'm back running RIP and all the traffic is routing correctly on the network now, with no packet loss etc. The only issue is that I needed to disable the radio card which provided a secondary route to our core, as RIP in StarOS doesn't like multiple routes to a destination.


Regarding what the problem was, I'm not too sure, but I am interested in the next version of StarV3 which will have a new kernel and the newest OLSR.

We'll try and get a quickie release out next week with new kernel and newest OLSR and I note a new quagga. It could be a busy week.

I just replied to your PM but then I decided we should get you updating with the upcoming release and newly fixed programs, so just sit tight and we'll get moving on this right away.



I'm thinking that OLSR got itself confused, which caused routing loops, which in turn caused lots of traffic to be routed back and forth over some links. This caused packet loss, which caused the LQ of the links to drop, which caused more routing loops as OLSR tried different routes to get the traffic to the destination.

The only question is what caused the initial confusion, and how do I stop this happening again?

It sounds like you are losing your OLSR conversation, perhaps your link has some packet loss or is actually cutting out.

I don't think it was packet loss or a link issue as RIP is working correctly with no packet loss.

I've been running OLSR for months with no issues and I am very impressed at how fast it can see a link being down and routes the traffic over alternative links.
I believe OLSR is the best routing protocol for use on wireless networks because it can make decisions based on the quality of a link, unlike OSPF, RIP etc.

The entire network is running the latest build v1.1.13 which we upgraded from v1.1.4. I'm thinking I either wait for the new release, or revert back to v1.1.4 before trying OLSR again.


Regards,
Keith

therealboss
06-04-2007, 04:32 AM
I found that OLSR was working well for me too, but my problems started after I upgraded most of the AP's to build v1.1.13. I too changed the network back to RIP and it fixed the problem. Once there is an update to OLSR in StarOS I will try it again, as I too think it's the best thing to use. I would like to add an extra gateway as a backup and mesh every AP by removing all the WAR2's I have for AP's and replacing them with the new Metro WAR4's.

mimbach
06-20-2007, 11:47 AM
Guys,
I am using olsr and I drop my connections every hour or so (dropped pings etc. for anywhere from about 30 seconds to 10 minutes). If I log into all the radios and restart olsr, it fixes itself right away.
Is there a setting or something I should be trying?
I spent over 100 hours trying to make ospf work properly before having to give up on it. Then I had to re-enter all my static routes all 92 of them.
Someone please tell me this isn't the case with olsr.
Thanks,

Mimbach

tog
06-20-2007, 11:57 AM
My OLSR network shows 84 active routes, I never have problems like that.

The config I'm using on my whole network is here:
http://staros.tog.net/wiki/OLSR

mimbach
06-20-2007, 12:03 PM
Tog,
Can you give me some pointers. I have 92 active routes and 12 towers.
We have been building loops for redundancy and go a max of 5 links deep.
Thanks,

Mimbach

tog
06-20-2007, 12:15 PM
You need to narrow the problem down when it occurs using traceroute and by checking the HTTP interface plugin (http://ip.address:8001/) on the systems nearest the problem. Need more specific information than just "I drop my connections"

You might want to try using "LinkQualityMult" on each system that has secondary/backup links. Make the "secondary" neighbor's LinkQualityMult real low like 0.20 to make sure your routes don't flop around.

Normally this isn't a problem but this is for troubleshooting purposes.
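
To illustrate why a low multiplier pins routes to the primary link, here is a small sketch. It assumes the usual ETX cost of 1 / (LQ x NLQ) and that the multiplier simply scales the locally measured link quality before the cost is computed; the exact behaviour inside olsrd may differ, and the function name is made up.

```python
def etx(lq, nlq, multiplier=1.0):
    """Expected transmission count for a link; lower is better.

    lq and nlq are the link qualities measured in each direction (0.0 to 1.0).
    The multiplier is assumed to scale lq, as LinkQualityMult is described above.
    """
    scaled = lq * multiplier
    if scaled <= 0 or nlq <= 0:
        return float("inf")
    return 1.0 / (scaled * nlq)


primary = etx(0.90, 0.90)                # slightly lossy primary link
backup = etx(1.0, 1.0, multiplier=0.20)  # perfect backup link, penalised

print("primary ETX: %.2f" % primary)
print("backup ETX:  %.2f" % backup)
```

Under these assumptions even a perfect backup link ends up costing several times more than a 90% primary, so a brief dip on the primary should not be enough to make the route flop over.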

It's entirely possible with lots of backup links in there you're running into OLSR bugs. It has been said that olsrd 0.4.10 is far from perfect and 0.5.0 fixes a lot of problems.

I believe olsrd 0.5.0 is slated to be included in the upcoming 1.2.1b/1.3.0.

Also, are any of your systems anywhere near fully-loaded? Do they transfer over 15mbit/sec of data? The WAR's CPU architecture makes it impossible to run routing protocols on a heavily/fully-loaded system, interrupts take full priority over all user-space applications (like routing software) so if you load the system to its maximum, your routing sessions can drop. But they'll come back up again by themselves quickly after the routes drop and the interrupt load disappears.

lonnie
06-20-2007, 12:19 PM
The new beta 1.2 will include the latest OLSR. Don't quote me on this, and don't ever expect a repeat, but the new release will likely happen this afternoon.

mimbach
06-20-2007, 12:23 PM
Guys,
I will hold out and try the new version if Lonnie is able to get it out today. If not I understand and will not complain. I will update on if it fixes my problem or not. Tog if it does not I will get some more in depth data for everyone.
Thanks,

Mimbach

tog
06-20-2007, 12:35 PM
So are you running a mix of olsrd 0.4.10 and 0.5.0 on your own network? They are ok talking to each other?

lonnie
06-20-2007, 12:48 PM
It's just a bug fix release.

mimbach
06-20-2007, 04:40 PM
Lonnie,
Does bug fix mean it will have the updated olsr version 0.5?
Thanks,

Mimbach

lonnie
06-20-2007, 05:09 PM
That is what we have said. 0.5 OLSR is a bug fix release.

lonnie
06-20-2007, 11:44 PM
The release did not happen today. I have one quarter of our LAN upgraded and so far it is nice but there are a couple of minor issues that I want to fix before we release.

mimbach
06-21-2007, 09:46 AM
Thank you for the update Lonnie.

amp
06-22-2007, 04:43 PM
Also, are any of your systems anywhere near fully-loaded? Do they transfer over 15mbit/sec of data? The WAR's CPU architecture makes it impossible to run routing protocols on a heavily/fully-loaded system, interrupts take full priority over all user-space applications (like routing software) so if you load the system to its maximum, your routing sessions can drop. But they'll come back up again by themselves quickly after the routes drop and the interrupt load disappears.

Yeah, let's not forget about ospf too. It's affected as well, and you don't exactly have to be at 100% load either. We've had it drop on links with 0 interference and 1 in a million packets lost. The only way to help it is to increase the number of hellos ospf sends and increase the number of dropped ones it tolerates before it shows the interface as down.

I think there has to be some sort of solution. It is the only thing keeping the stability of our network from being at maximum. I have seen a WAR4 get hammered with under 20mbit.