Release of 4.8.0.0-9165
#21
Thank you.

I will double check the code I have added. I believe the Client was never acting up before, so I have introduced something. It has always been the AP stalling.

BTW, I have 9201 available with the ping watchdog mods to do an activate.

Have you had some success with the AP? Or have you rolled them all back as well?
Reply
#22
lonnie Wrote: I will double check the code I have added. I believe the Client was never acting up before, so I have introduced something. It has always been the AP stalling.
Well, yes, sort of.

The clients have also been acting up since the day the WAR1B's came out. Yes, the behavior of 4.8.1.0-9175 is kinda new and different on a client, but absolutely the WAR1B clients have been acting up forever.

So - with the early firmwares, both the AP and the Clients would always just stop passing traffic. On the AP we've been calling it a 'stall' and on the CPE it was less well defined, but they'd basically stop passing traffic until they were rebooted. Our typical CPE programming has it ping the AP it's connected to and reboot if the AP is unpingable for 6 minutes, so on problematic WAR1B CPE's this could happen often, while on others it might happen only occasionally. When Tony added the reset, we said 'that's horrible, we now have no way to know how often this is happening' and then the reset counter was added. Some clients reset 100 times per day. So yes, on the AP we see that as a 'stall' and yes, that is absolutely the most problematic WAR1B issue. But the clients have also always stopped passing traffic and reset - absolutely.
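For anyone following along, the CPE behavior described above (ping the AP, reboot after 6 minutes without a reply, and count the reboots) can be sketched roughly like this. This is only an illustration of the logic, not the StarOS implementation: the `PingWatchdog` class and its field names are made up for the example, and the ping itself is left out so the timing logic can be checked with plain numbers.

```python
from dataclasses import dataclass
from typing import Optional

REBOOT_AFTER_SECONDS = 6 * 60  # "unpingable for 6 minutes", from the post above


@dataclass
class PingWatchdog:
    """Hypothetical helper: tracks ping results and says when a reboot is due."""

    timeout: float = REBOOT_AFTER_SECONDS
    last_ok: Optional[float] = None  # time of the last successful ping
    reboot_count: int = 0            # plays the role of the reset counter

    def record(self, now: float, ping_ok: bool) -> bool:
        """Record one ping result; return True if a reboot should happen now."""
        if self.last_ok is None:
            # Start the clock at the first observation.
            self.last_ok = now
        if ping_ok:
            self.last_ok = now
            return False
        if now - self.last_ok >= self.timeout:
            self.reboot_count += 1
            self.last_ok = now  # restart the clock after the reboot
            return True
        return False


# Simulated run: one good ping, then the AP stops answering.
wd = PingWatchdog()
for t, ok in [(0, True), (60, False), (300, False), (360, False)]:
    if wd.record(t, ok):
        print(f"t={t}s: AP unreachable for {wd.timeout:.0f}s, rebooting")
print("reboots so far:", wd.reboot_count)
```

Keeping a counter alongside the reboot decision is what makes the "how often is this happening" question answerable after the fact.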

So - what was new about 4.8.1.0-9175 on the client side is that I upgraded about 20 WAR1B clients, and they all updated fine, came back up fine and worked. What was new and different is that by the next morning, about half of them weren't associated. Some of those people called in, did a power cycle, and their CPE's came back online, and I downgraded them to 4.7.1.0 at that time. A couple of them we went out to and power cycled; they came back online and we downgraded them to 4.7.1.0 as well.

What I think (maybe, maybe not) I saw last night was that the last couple of missing-in-action WAR1B's were back online, and I putty'd into them, and while the sessions were open, one of them did something that made it go 'Session BUSY' (which seems to me like it was doing its own 'reset' due to whatever detected condition) and then it didn't seem to re-associate properly on its own after that. It didn't come back online until later, after it had been power cycled, and at that time I downgraded it to 4.7.1.0.

Of course, I have a sample-set-size of 1 here, so I could be wrong on my guess. All that I really know for sure is that I was watching a 4.8.1.0-9175 client, that it did a 'Session BUSY' all on its own, and that it didn't re-associate after that. My theory is that something in the 'reset code' is different in 9175 and that's caused them to not really reset correctly (for example, it disables the WiFi but fails to re-enable it, or whatever), but that is just a guess based on the fact that about 10 of the 20 upgraded WAR1B's disappeared overnight that first night, and on the fact that I saw a client try to do a 'Session BUSY' just before it went away.
Reply
#23
lonnie Wrote: Have you had some success with the AP? Or have you rolled them all back as well?
I am still running 4.8.1.0-9175 on the AP's (3x WAR1B and 1x APU) and it doesn't yet seem any different than 4.7 or 4.6. We've had a couple stalls as already mentioned.

But of course, it's quite usual after a power cycle or a reboot, that an AP can work fine for a day or two. We've even had them work up to 6 days after a new firmware and the required reboot, and we've thought "Ah ha! This is it!!" But, after those 6 days, they've always ultimately gone back to their routine of 'stalling' and if it's just an 'activate' (or a self-internal-reset) they may then do that once per day, or 10 times per day, or not for 4 days, or whatever.

So, for now we will keep 4.8.1.0-9175 on there and see how it goes with more run time - it doesn't seem any different than anything else so far.

BTW, here are the uptimes currently as of Dec 14/17 - 10:50 AM CST
#3 (APU) = 1 day, 12 hours
#10.3 (WAR1B) = 12 hours, 25 minutes
#10.4 (WAR1B) = 3 hours, 35 minutes
#49 (WAR1B) = 1 day, 17 hours
Reply
#24
AP #10.4 - WAR1B, 2.4 GHz, 18 clients - all StarOS, but a mix of WAR1B plus WAR1A-SIAM and B/G clients.

It's been up 1 day 5 hours since its last reboot, and there are 52 system detected resets - so on average, once every 33 minutes or so, everyone gets kicked offline. It's really been a bit more often than this, because we've also done our own 'activates' a few times over this last day, when the automated 'kick everybody offline routine' didn't detect a stall. So, about every 1/2 hour on average.
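The "once every 33 minutes" figure is just uptime divided by the reset count. A quick way to check it (the function name here is only for illustration):

```python
def mean_minutes_between_resets(uptime_hours: float, resets: int) -> float:
    """Average gap between resets, in minutes, over the given uptime."""
    if resets <= 0:
        raise ValueError("resets must be positive")
    return uptime_hours * 60 / resets


# 1 day 5 hours = 29 hours of uptime, 52 system detected resets.
interval = mean_minutes_between_resets(uptime_hours=29, resets=52)
print(f"{interval:.1f} minutes between resets")  # prints "33.5 minutes between resets"
```

Manual 'activates' are not in the counter, so the real interval between interruptions is somewhat shorter than this.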


Here is a screenshot showing 52 resets in the last 1 day, 5 hours, with 18 clients, on 4.8.1.0-9175:


[Attachment: 4.8.1.0-9175-52ResetsPerDay.jpg]
Reply
#25
Hi. Just to confirm, after a few days of using this on the AP, this version isn't really any better than any of the others from the last couple of years. Dozens and dozens of stalls, resets, reboots, slowdowns (~200 KB total throughput), plus the classic "not being able to ping some client(s) while simultaneously working OK on other clients".

So, any of the APU/Ventana/WAR1B Access Points, on any of the busier AP's with more clients, in noisier environments, that have been online longer since the last reboot, are "FUBAR'ing" dozens of times per day.
Reply
#26
I am finally able to duplicate this again. I'm running 2 Access Points on a Ventana. One has 2 clients and the other has 1 client. I am finding that the single client one has issues after about 1.5 hrs. Not easy to troubleshoot but it is better than not seeing anything after 24 hrs.

ninedd, thanks for the testing.
Reply
#27
I'm still surprised that you're able to duplicate what we're seeing at all. We find that AP's with 18 - 25 clients show the issue multiple times per day. Then, when clients quit, and the sector count gets down to less than... 10 or so, these problems go away with them. We have some WAR1B AP's with only 4 clients or 1 client left on them, and they'll run a month or so with no noticeable issues.

For example, our AP#86 used to have 18 clients and it was FUBAR'ing 50 times per day. There is now (for pretty obvious reasons) only 1 remaining client on it, and it has 39 days of uptime. The ONLY change there has been the user count.
Reply
#28
ninedd Wrote: I'm surprised. We find that AP's with 18 - 25 clients show the issue multiple times per day. When clients quit, and the sector count gets down to less than... 10 or so, these problems go away with them. We have some WAR1B AP's with only 4 or 1 client left on them, and they'll run a month or so with no noticeable issues.

For example, our AP#86 used to have 18 clients and it was FUBAR'ing 50 times per day. There is now (for obvious reasons) only 1 remaining client on it, and it has 39 days of uptime. The ONLY change there has been the user count.

Same here - the only way we can run a WAR1B AP is with 8-10 or fewer clients.
Reply
#29
ninedd and mickeym, this totally explains why other people I am working with see uptimes of months. They mostly do point to point and very small clusters.

WAR1B's are pretty cheap. Is it possible to use multiple units and keep the client count <10?

If you use multiple units, you need more channels, so here is another option. I have found that an HT10 channel can deliver 30+ Mbps, which might be enough if you only have 10 clients. The key, though, is that it gives you some extra channels to use. The new firmware has support for HT10 and also has sync to make it easier to play with those settings.

Don't get me wrong though. I am not proposing this as the final solution. I will be doing my best to find out why it locks clients out. That is not the standard way to handle an error.
Reply
#30
Good lord guys, 18 to 20 clients on a WAR1B?? With the streaming demands of the kids alone, plus all the Amazon Echos and home automation, I would think it would be kind of like trying to haul 10 tons of asphalt in an S-10 pickup. For anything as important as an AP I would be opting for a Ventana just for the horsepower.

I am rebuilding a head end at a hospital, and 2 of the 4 backhauls to the ground distribution points are going to be Ventanas with a single radio card. I never put clients and backhauls on the same box. Maybe a short dogleg segment with a couple of sites, but nothing more.

I just installed 4810-9201 on a whole WAR1B network and my dropout problems went away. I also switched to HT-10 on the whole network and it is amazing how much better the system works. Lonnie will find the cause, but in the meantime everything is so much better. I didn't even have to invoke the new watchdog reactivator, as the stalls went away.
(They have come back on low and unstable paths.) Maybe that is a clue as to why the stalls happen?

Thank you for all the hard work, Lonnie, and ninedd for all the great testing. I am sure it is a big help in finding a permanent solution. And for those of you that celebrate as I do... a Merry Merry Christmas! And the best of the New Year.


Additional note:
I thought we all knew and agreed that there were deeply hidden, esoteric problems between the FW and the hardware platforms.
I was just reporting that I have found a version that is working for me and what I did to make the problems manageable.
My daily PRTG graphs are reporting vast improvements over other versions.
I repeat: unfortunately, the stalls have reappeared on low quality and unstable PTP paths, but the AP side reactivator has kicked them back on. As I said, this may be a clue to the cause of the stalls.
I believe Lonnie is working on the possibility of a multi watchdog function that looks at everyone in the association list.
That would be a huge breakthrough for several of the PTMP guys.
I am forever thankful for his actions to provide a workable solution until the Permanent Big Fix is discovered.
There is no debate that there is something wrong.
The ever-present quest has been: how to duplicate it, so we can know what causes it, so Lon can fix it.

Like Jack Welch (GE CEO) once said: "If you can't track it, you can't manage it"

Merry Christmas to all and good luck hunting Lonnie
Dave
<
@)
(Cooter>>
^^^^^^^^^^
Vrablic

"I have no excuses, Just reasons"

Reply

