[wxqc] Intermittent Missing Data

dave at aprsfl.net dave at aprsfl.net
Thu Apr 12 16:56:22 CDT 2007


Well I was going to post an explanation since I just woke up, but I'll just
answer the concerns here for all...
 
     >The 4 Core servers (rotate.aprs.net) have been working 
     >for less than 1 day combined. During these outages, some 

Those "outages" shown by the uptime counter were just long enough to reload
the software.  That takes 1 second on the Linux/Mac platform (which first,
second and third run), as a simple service restart is all that's needed.
The fourth box needs a full system reboot, because Windows locks memory
pages and after it's been running for over 50 days you can't expect a
contiguous block of 256 MB of memory to be available when you restart the
service.  You know this.  Those outages totaled seconds to accommodate a
new server addition, and they happened over 36 hours ago.


     >THIRD.aprs.net dropped 600 WX stations the last few hours 

There were two "third" servers under the DNS name of third for a period of
time today.  A new location for third was found, and for 90 minutes today
that new location was run in parallel as third while the "third" at my
location was third-2.

Unfortunately, after the DNS addition was done at 8:30am, I discovered that
a Perl module on the new box pegs the CPU at 100% for 500-800ms whenever it
executes, so the box was locking up for about a second every 2 minutes.  At
9:40am, when I realized there was no way I could resolve this in a timely
fashion, the new IP was removed from the pool until I have time to
recompile a few things to resolve it.

At no time was the old third box ever turned off, and anyone using its IP as
a resolved host would still have gotten data in.    Bottom line, this
affected only 1% of the traffic on the -NEW- third box alone which was only
taking 40 (not even 1/4 what third NORMALLY takes) connects at the time, the
rest of the users were still on the OLD third box.  You can see even as of
now, all 5 servers are still interlinked, although the new third box is not
in DNS at all.      This was a 90 minute issue, on one of 5 servers, and
most of the load that was missed during the time it took me to establish
this problem ALL floated around to the other servers during restarts on that
-one- box alone.

Note the totals on the checkservers page though, the totals did stay the
same, as the load floated around just as it was supposed to when I was
shutting it down and restarting the new box.  If you were connecting to the
new third using rotate, when I had third down intermittently when working on
it, the traffic FLOATED to the next server next poll.

Since you talked about check servers, let's look again at when I took third
DOWN to move it, and also remember those that were on the new third box were
not counted here anywhere (that 1% that used it that had issues every 2
minutes for 1 second):

17	122	***	44	318	181	1769	2434
16	136	***	45	374	193	1675	2423
15	431	***	51	533	211	1135	2361
14	424	***	47	541	210	1125	2347
13	405	***	48	548	223	1095	2319
12	152	***	45	486	219	1378	2280
11	105	***	43	305	179	1688	2320
10	106	***	41	294	181	1688	2310
09	104	***	42	338	172	1674	2330
08	100	***	48	333	197	1712	2390
07	97	***	47	320	178	1757	2399
06	108	***	47	323	178	1745	2401

Let's note that during the 13/14/15 intervals, when I was working on third,
all the stations that were on third rotated over to fourth (300+ of
them), first (200+ of them) and second (50 of them) plus the 40 or so who
started using the new third server when it was added to DNS.  Once I
normalized everything, it took another 1/2 hour for traffic to all balance
back out and for those using third to come back.

Note the totals on the right column.  They did not fluctuate any more than a
normal change in hourly submissions during a day aside from about 80-130
users running CWOP software that seems to only have third as a server to
use.   Those 80-100 never move from third, and reasons like this are why we
emphasize to users not to use a single server.  Either hard code all 4 of
them or use rotate.
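
The point about the totals can be checked directly from the rows quoted
above. The following is a minimal sketch, not part of the checkservers
page itself. It assumes only that the first field is the hour, the last
field is the total, and "***" is a placeholder that carries no count; it
does not assume which column belongs to which server.

```python
# Rows copied from the checkservers excerpt above (hours 17 down to 06).
ROWS = """\
17 122 *** 44 318 181 1769 2434
16 136 *** 45 374 193 1675 2423
15 431 *** 51 533 211 1135 2361
14 424 *** 47 541 210 1125 2347
13 405 *** 48 548 223 1095 2319
12 152 *** 45 486 219 1378 2280
11 105 *** 43 305 179 1688 2320
10 106 *** 41 294 181 1688 2310
09 104 *** 42 338 172 1674 2330
08 100 *** 48 333 197 1712 2390
07 97 *** 47 320 178 1757 2399
06 108 *** 47 323 178 1745 2401
"""


def parse(text):
    """Return {hour: (per_column_counts, total)} for each row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: "***" marks a column with no count and is skipped.
        counts = [int(f) for f in fields[1:-1] if f != "***"]
        rows[fields[0]] = (counts, int(fields[-1]))
    return rows


rows = parse(ROWS)

# Hours where the per-column counts do not add up to the listed total.
mismatches = [h for h, (counts, total) in rows.items() if sum(counts) != total]

# How far the grand total moved across the whole period.
totals = [total for _, total in rows.values()]
total_spread = max(totals) - min(totals)

# How far each individual column moved across the same period.
columns = list(zip(*(counts for counts, _ in rows.values())))
column_swings = [max(col) - min(col) for col in columns]

print("rows that do not sum to their total:", mismatches)
print("spread of the total column:", total_spread)
print("swing of each count column:", column_swings)
```

Comparing the largest per-column swing with the spread of the total
column shows how much load moved between servers versus how much
actually disappeared.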

Rotate WORKED today just as it should, traffic moved around just as it was
supposed to, and Russ's page shows this with zero ambiguity.


     >have pointed out. By the time rotate assigns a different 
     >server, data is lost. Rotate can't detect a downed or 
     >intermittent server. By the time a new DNS is assigned to 
     >your PC, data hasn't gone anywhere.
     
That is because some software does not do a fresh DNS lookup every connect
and does not attempt a retry right away (the important key to note here) if
there is a failure.  If it retried and did fresh lookups on a failure this
works perfectly with not a single report missed.   In the case of most of
the applications, the first poll when a server goes down is lost, then the
next poll you hit the next server in the list.   That -is- the fix, and a
whitepaper for developers is still in the works that would address all
issues in client software.   Software already working correctly (there are
some that do) has no issues at all.
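
As an illustration of the client behaviour described above (a fresh DNS
lookup on every connect and an immediate retry on failure), here is a
minimal sketch. It is not taken from the whitepaper or from any existing
client. The function name send_report, the attempt count, and the timeout
are hypothetical; the resolve and connect parameters exist only so the
logic can be exercised without a network.

```python
import socket


def send_report(host, port, payload, attempts=4, timeout=10.0,
                resolve=socket.getaddrinfo,
                connect=socket.create_connection):
    """Send one report, re-resolving the host name before every attempt.

    Returns the IP address that accepted the report.
    """
    last_error = None
    for _ in range(attempts):
        try:
            # Fresh lookup each time: never reuse an address cached at
            # program start, so a changed DNS answer is picked up.
            infos = resolve(host, port, type=socket.SOCK_STREAM)
        except OSError as exc:
            last_error = exc
            continue
        for _family, _type, _proto, _canon, sockaddr in infos:
            try:
                with connect(sockaddr[:2], timeout=timeout) as conn:
                    conn.sendall(payload)
                return sockaddr[0]
            except OSError as exc:
                # Retry right away on the next address instead of
                # waiting for the next poll.
                last_error = exc
    raise ConnectionError(f"could not deliver report to {host}") from last_error
```

A client would call this once per poll with the rotate name (or each of
the hard-coded names in turn) and whatever port it already uses, so a
server that drops out costs at most a retry rather than a lost report.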



But that's still -not- what caused this today.   The data loss over the
hours since last night appears to come from a slowly worsening disk failure
on first that was causing very, very odd things to happen with the traffic
flowing through it.

Once that was determined to be the cause of the problem that started last
night, which was not obvious to identify right away, first was pulled out of
rotate, and it is now shut down.

Gerry's working on rebuilding it now, and as of the last checkservers it
appears those that were dropping off weather on first have moved around to
second and fourth, just as rotate is supposed to do.  There are still a few
that haven't moved around; those 60 or so are ones using software with none
of the rotate fixes in them, so until they -RESTART- their program they'll
be down (again no matter what server you put to connect to, this is a
problem with the client software not doing a fresh DNS lookup at every
connect).

Again, let's look at checkservers:

21	278	***	49	0	355	1572	2254
20	146	***	52	239	236	1683	2356

First went to zero when Gerry shut it down, and second went up 130 or so
connects, fourth went up about the same, and the numbers that went up
represent those who were on first.  It floated around just as it should with
rotate.

What REALLY CAUSED the data loss was that even though the server was -UP-
and running, taking connects normally and displaying a status page
normally, it just was not passing data upstream.    A kernel-panicking box
tends to do odd things.
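
The failure mode described above (connects accepted, status page normal,
nothing forwarded) is one that an "is it up?" check cannot see. As a
purely illustrative sketch, and not something the core servers are known
to do, a monitor could compare two counters over an interval. The Sample
type and both counter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    accepted: int   # hypothetical counter: packets received from clients
    forwarded: int  # hypothetical counter: packets passed upstream


def passing_upstream(before: Sample, after: Sample) -> bool:
    """Return False when data came in during the interval but none went out."""
    received = after.accepted - before.accepted
    sent = after.forwarded - before.forwarded
    if received <= 0:
        # Nothing arrived, so there is no evidence of a stall either way.
        return True
    return sent > 0
```

A server that keeps accepting packets while its forwarded count stays
flat would fail this check even though it still answers connects.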


     >Maybe it's time to admit there is a problem and face up 
     >to the fact that something isn't as 'reliable' as 
     >originally thought. The burden shouldn't be placed on WX 
     >users that have little control or guarantees that their 
     >data will end up where it is supposed to.
     
The only thing to admit is that the timing of a new server addition for 90
minutes and a disk failure coincided and caused this since last night.


     >Please listen to list members... there is a problem and 
     >they deserve straight answers...
     
They've been given.

I simply can't be more forthcoming.

I was up till almost noon from LAST NIGHT trying to diagnose this, so I had
to take a nap and didn't post my findings till now.

     
     >Solution: Don't use rotate.aprs.net. It doesn't work on 

For today's problem it would not have mattered how you got to the server -
be it rotate, static names, etc.  If the server is not passing data
upstream, which was the case, no server name change solves the problem.

Rotate, again, was -NOT- a problem today.  I know it's starting to sound
like a broken record, but just look at the checkservers page: you'll see
folks are moving around when issues with a single server in the pool occur,
the total number of users changes very little, and it's the folks with a
single server hard coded who are being affected.


     >It works great. I'd suggest 5 min. intervals. If a server 

Only users running Windows XP pre service pack 2 have the "bug" you are
talking about, and if you remember, I'm the one who reported it to Microsoft
and got a hot fix issued for it. 

The FIX for pre SP2 XP boxes is on the core homepage - or those users need to
update to service pack 2.


The issue today with lost data was a failed disk on a server, which has since
been removed from the active server list, as Gerry posted a couple hours
ago.



And Dick...

I simply cannot be more forthcoming.   I don't know why you have to post so
venomously about a failure when the core has run for months with no problems
at all.  Besides, with hardware failures and OS problems, you've had your
share too; it's not wise to throw rocks at a glass house, my friend.   The
difference now compared to in the past with me is that you'll not get an
aggressive response from me anymore.  I now realize you're trying to take
advantage of my type A personality and desire to defend myself from attack.
You'll not get such a response from me any longer to make issues look
political or personal.  All you'll see from me is facts and evidence
supporting my postings regarding issues such as this in a professional
manner just as I did  above.   You threw in your support of CWOP months ago
with no warning to ANYONE when you turned off ALL of the T2 CWOP servers one
late Friday resulting in many folks losing a lot more than a few
observations.    Dave Helms worked his tail off to get folks the new
information when you pulled that little surprise, and I had to set up another
server to handle the additional load trouble free.    It's clear you're not
interested in the well being of the CWOP user base, they are a pawn to you,
or you would at least have given time for folks to move before you pulled
the plug.    I'm sorry if this offends you or anyone else in this forum, but
it needs to be said.   In closing, you've been asked by several
individuals in this forum to not post here about this material any longer,
can you please just honor that request....


Seeya,
Dave
Sysop Third & Fourth



