[wxqc] Intermittent Missing Data
dave at aprsfl.net
Thu Apr 12 16:56:22 CDT 2007
Well I was going to post an explanation since I just woke up, but I'll just
answer the concerns here for all...
>The 4 Core servers (rotate.aprs.net) have been working
>for less than 1 day combined. During these outages, some
Those "outages" shown by the uptime counter were long enough to reload the
software. That takes 1 second on the Linux/Mac platforms (which first,
second and third run), as a simple service restart is all that's needed. The
fourth box needs a system reboot, because Windows locks memory pages, and
after it's been running for over 50 days you can't expect a contiguous block
of 256 MB of memory to be available when you restart the service. You know
this. Those outages totaled seconds to accommodate a new server addition,
and happened over 36 hours ago.
>THIRD.aprs.net dropped 600 WX stations the last few hours
There were two "third" servers under the DNS name of third for a period of
time today. A new location for third was found, and for 90 minutes today
that new location was run in parallel as third, while the "third" at my
location was third-2.
Unfortunately I discovered after 8:30am, when the DNS addition was done, that
a Perl module on the box was tying up the box at 100% for 500-800 ms whenever
it executed, so the box was locking up for about a second every 2 minutes.
At 9:40am, when I realized there was no way I could resolve this in a timely
fashion, the new IP was removed from the pool until I have time to recompile
a few things to resolve it.
At no time was the old third box ever turned off, and anyone using its IP as
a resolved host would still have gotten data in. Bottom line, this
affected only 1% of the traffic, on the -NEW- third box alone, which was only
taking 40 connects at the time (not even 1/4 of what third NORMALLY takes);
the rest of the users were still on the OLD third box. You can see even as
of now, all 5 servers are still interlinked, although the new third box is
not in DNS at all. This was a 90 minute issue, on one of 5 servers, and
most of the load that was missed during the time it took me to establish
this problem ALL floated around to the other servers during restarts on that
-one- box alone.
Note the totals on the checkservers page though: the totals did stay the
same, as the load floated around just as it was supposed to when I was
shutting down and restarting the new box. If you were connecting to the
new third using rotate while I had it down intermittently to work on it,
the traffic FLOATED to the next server on the next poll.
Since you talked about check servers, let's look again at when I took third
DOWN to move it, and also remember those that were on the new third box were
not counted here anywhere (that 1% that used it that had issues every 2
minutes for 1 second):
17  122  ***  44  318  181  1769  2434
16  136  ***  45  374  193  1675  2423
15  431  ***  51  533  211  1135  2361
14  424  ***  47  541  210  1125  2347
13  405  ***  48  548  223  1095  2319
12  152  ***  45  486  219  1378  2280
11  105  ***  43  305  179  1688  2320
10  106  ***  41  294  181  1688  2310
09  104  ***  42  338  172  1674  2330
08  100  ***  48  333  197  1712  2390
07   97  ***  47  320  178  1757  2399
06  108  ***  47  323  178  1745  2401
Let's note that during the 13/14/15 intervals, when I was working on third,
all the stations that were on third rotated over to fourth (300+ of
them), first (200+ of them) and second (50 of them), plus the 40 or so who
started using the new third server when it was added to DNS. Once I
normalized everything, it took another 1/2 hour for traffic to all balance
back out and for those using third to come back.
Note the totals in the right column. They did not fluctuate any more than a
normal change in hourly submissions during a day, aside from about 80-130
users running CWOP software that seems to only have third as a server to
use. Those 80-100 never move from third, and reasons like this are why we
emphasize to users not to use a single server. Either hard code all 4 of
them or use rotate.
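As a quick check of the totals argument, the rows above can be summed in a few lines of Python. This is only a sketch: it assumes the last number on each row is the total, that the other numbers are per-server connect counts, and that the `***` field carries no count. None of those column meanings are stated in the table itself; they are inferred from the arithmetic.

```python
# checkservers rows as posted: hour, per-server counts, "***" placeholder, total
ROWS = """\
17 122 *** 44 318 181 1769 2434
16 136 *** 45 374 193 1675 2423
15 431 *** 51 533 211 1135 2361
14 424 *** 47 541 210 1125 2347
13 405 *** 48 548 223 1095 2319
12 152 *** 45 486 219 1378 2280
11 105 *** 43 305 179 1688 2320
10 106 *** 41 294 181 1688 2310
09 104 *** 42 338 172 1674 2330
08 100 *** 48 333 197 1712 2390
07 97 *** 47 320 178 1757 2399
06 108 *** 47 323 178 1745 2401
"""


def parse(rows):
    """Return (hour, per_server_counts, total) per line, skipping '***'."""
    parsed = []
    for line in rows.splitlines():
        fields = line.split()
        if not fields:
            continue
        numbers = [int(f) for f in fields[1:] if f != "***"]
        parsed.append((fields[0], numbers[:-1], numbers[-1]))
    return parsed


table = parse(ROWS)

# Assumption check: the right-hand column really is the sum of the others.
for hour, counts, total in table:
    assert sum(counts) == total, hour

totals = [total for _, _, total in table]
spread = max(totals) - min(totals)
print(f"totals range {min(totals)}..{max(totals)}, spread {spread} "
      f"({100 * spread / max(totals):.1f}% of peak)")
# prints: totals range 2280..2434, spread 154 (6.3% of peak)
```

Every row sums to its right-hand total, so whatever left one column in those hours shows up in the others or in the small movement of the total.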
Rotate WORKED today just as it should, traffic moved around just as it was
supposed to, Russ's page shows this with zero ambiguity.
>have pointed out. By the time rotate assigns a different
>server, data is lost. Rotate can't detect a downed or
>intermittent server. By the time a new DNS is assigned to
>your PC, data hasn't gone anywhere.
That is because some software does not do a fresh DNS lookup on every connect
and does not attempt a retry right away (the important key to note here) if
there is a failure. If it retried and did fresh lookups on a failure, this
would work perfectly with not a single report missed. In the case of most of
the applications, the first poll when a server goes down is lost, then on the
next poll you hit the next server in the list. That -is- the fix, and a
whitepaper for developers is still in the works that would address all
issues in client software. Software already working correctly (there are
some that do) has no issues at all.
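A minimal sketch of the client behavior described above: look the name up again on every attempt, and on a failed connect move straight on to the next address instead of waiting for the next poll. This is an illustration only, not the pending whitepaper's method; the function names, the `rounds` parameter and the injectable `resolver`/`sender` hooks are hypothetical, and only standard-library `socket` calls are used.

```python
import socket


def resolve_fresh(host, port, resolver=socket.getaddrinfo):
    """Resolve the name on every call; never reuse an earlier address list."""
    seen, addrs = set(), []
    for family, _type, _proto, _canon, sockaddr in resolver(
            host, port, type=socket.SOCK_STREAM):
        if sockaddr not in seen:          # drop duplicate answers, keep order
            seen.add(sockaddr)
            addrs.append((family, sockaddr))
    return addrs


def tcp_send(family, sockaddr, payload, timeout=10.0):
    """Open a TCP connection, send the payload, and close."""
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)
        sock.sendall(payload)


def send_report(host, port, payload, rounds=2,
                resolver=socket.getaddrinfo, sender=tcp_send):
    """Try each resolved address in turn, retrying immediately on failure.

    Returns the address that accepted the report; raises ConnectionError
    if every address failed in every round.
    """
    last_error = None
    for _ in range(rounds):
        try:
            addrs = resolve_fresh(host, port, resolver)  # fresh lookup each round
        except OSError as exc:            # socket.gaierror is an OSError
            last_error = exc
            continue
        for family, sockaddr in addrs:
            try:
                sender(family, sockaddr, payload)
                return sockaddr
            except OSError as exc:
                last_error = exc          # go on to the next server right away
    raise ConnectionError("no server accepted the report") from last_error
```

A client would call something like `send_report("rotate.aprs.net", port, packet)` once per poll, with `port` and `packet` supplied by the application. A fresh lookup in the program does not bypass any caching done by the operating system's resolver, and a retry like this only helps when the connect or send actually fails; it cannot detect a server that accepts the connection but does not pass data upstream.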
But that's still -not- what caused this today. The problem with data loss
in the hours since last night appears to be a slowly worsening disk
failure on first that was causing very, very odd things to happen with
traffic flowing through it.
Once that was determined to be the cause of the problem that started last
night, which was not obvious to identify right away, first was pulled out of
rotate, and it is now shut down.
Gerry's working on rebuilding it now, and as of the last checkservers it
appears those that were dropping off weather on first have moved around to
second and fourth, just as rotate is supposed to do. There are still a few
that haven't moved around; those 60 or so are using software with none
of the rotate fixes in it, so until they -RESTART- their program they'll
be down (again, no matter what server you set it to connect to, this is a
problem with the client software not doing a fresh DNS lookup at every
connect).
Again, let's look at checkservers:
21  278  ***  49    0  355  1572  2254
20  146  ***  52  239  236  1683  2356
First went to zero when Gerry shut it down, and second went up 130 or so
connects, fourth went up about the same, and the numbers that went up
represent those who were on first. It floated around just as it should with
rotate.
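The shift between those two checkservers rows can be worked out directly. A small sketch, assuming the columns are in the same order on both rows and that the last column is the total; which column belongs to which server is inferred from the prose (the one that drops to 0 being first), not stated in the table.

```python
# Hours 20 and 21 from the checkservers rows above, with the "***" field dropped.
hour_20 = [146, 52, 239, 236, 1683, 2356]
hour_21 = [278, 49, 0, 355, 1572, 2254]

# Change per column from hour 20 to hour 21 (positive = gained connects).
deltas = [after - before for before, after in zip(hour_20, hour_21)]
print(deltas)  # [132, -3, -239, 119, -111, -102]

# Connects picked up by the columns that grew, versus those shed by the
# column that went to zero (index 2, taken here to be first).
gained = sum(d for d in deltas[:-1] if d > 0)
print(gained, -deltas[2])  # 251 239
```

On these numbers the two columns that grew picked up 251 connects between them, against 239 shed by the column that went to zero.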
What REALLY CAUSED the data loss was that even though the server was -UP-
and running, taking connects normally and displaying a status page normally,
it just was not passing data upstream. A kernel panicking box tends to do
odd things.
>Maybe it's time to admit there is a problem and face up
>to the fact that something isn't as 'reliable' as
>originally thought. The burden shouldn't be placed on WX
>users that have little control or guarantees that their
>data will end up where it is supposed to.
The only thing to admit is that the timing of a new server addition for 90
minutes and a disk failure coincided and caused this since last night.
>Please listen to list members... there is a problem and
>they deserve straight answers...
They've been given.
I simply can't be more forthcoming.
I was up till almost noon from LAST NIGHT trying to diagnose this, so I had
to take a nap and didn't post my findings till now.
>Solution: Don't use rotate.aprs.net. It doesn't work on
For today's problem it would not have mattered how you got to the server - be
it rotate, static names, etc. If the server was not passing data upstream,
which was the case, no server name change solves the problem.
Rotate, again, was -NOT- a problem today. I know it's starting to sound
like a broken record, but just look at the checkservers page; you'll see
folks are moving around when issues with a single server in the pool occur,
the total number of users changes very little, and it's folks with single
servers hard coded that are being affected.
>It works great. I'd suggest 5 min. intervals. If a server
Only users running Windows XP pre service pack 2 have the "bug" you are
talking about, and if you remember, I'm the one who reported it to Microsoft
and got a hot fix issued for it.
The FIX for pre SP2 XP boxes is on the core homepage - or those users need to
update to service pack 2.
The issue today with lost data was a failed disk on a server, which has since
been removed from the active server list, as Gerry posted a couple hours
ago.
And Dick...
I simply cannot be more forthcoming. I don't know why you have to post so
venomously about a failure when the core has run for months with no problems
at all. Besides, with hardware failures and OS problems, you've had your
share too; it's not wise to throw rocks at a glass house, my friend. The
difference now compared to the past with me is that you'll not get an
aggressive response from me anymore. I now realize you're trying to take
advantage of my type A personality and desire to defend myself from attack.
You'll not get such a response from me any longer to make issues look
political or personal. All you'll see from me is facts and evidence
supporting my postings regarding issues such as this in a professional
manner just as I did above. You threw in your support of CWOP months ago
with no warning to ANYONE when you turned off ALL of the T2 CWOP servers one
late Friday, resulting in many folks losing a lot more than a few
observations. Dave Helms worked his tail off to get folks the new
information when you pulled that little surprise, and I had to set up another
server to handle the additional load trouble free. It's clear you're not
interested in the well being of the CWOP user base, they are a pawn to you,
or you would at least have given time for folks to move before you pulled
the plug. I'm sorry if this offends you or anyone else in this forum, but
it needs to be said as a reminder. In closing, you've been asked by several
individuals in this forum to not post here about this material any longer,
can you please just honor that request....
Seeya,
Dave
Sysop Third & Fourth