Thread Status: Active
Total posts in this thread: 18
This topic has been viewed 25444 times and has 17 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
MCM: Seeing interspersed Invalid results with 7.26, passing via PVal > PVerification

Only a few, but since this Linux box was solid from the first beta through 7.24, it raises suspicion: is it rigidity of the validator, or instability in the processing/concatenation of result pieces after a restart? Homogeneous redundancy applies, so all wingmen are Linux too.

1) MCM1_ 0000148_ 4938_ 2-- 2524499 Invalid 11/20/13 15:13:58 11/20/13 19:58:42 3.58 / 3.59 70.3 / 35.4
2) MCM1_ 0000110_ 8524_ 0-- 2524499 Invalid 11/19/13 16:34:04 11/19/13 23:02:03 5.89 / 5.91 82.7 / 46.7

Wingman list 1):

MCM1_ 0000148_ 4938_ 4-- 726 Valid 11/20/13 22:05:07 11/21/13 11:44:14 3.63 68.7 / 70.8
MCM1_ 0000148_ 4938_ 3-- - Detached 11/20/13 20:51:27 11/20/13 21:19:59 0.00 0.0 / 0.0
MCM1_ 0000148_ 4938_ 2-- 726 Invalid 11/20/13 15:13:58 11/20/13 19:58:42 3.58 70.3 / 35.4
MCM1_ 0000148_ 4938_ 1-- 726 Error 11/20/13 15:10:31 11/20/13 15:13:42 0.00 79.8 / 0.0
MCM1_ 0000148_ 4938_ 0-- 726 Valid 11/20/13 15:10:29 11/20/13 20:51:02 2.71 72.9 / 70.8

Wingman list 2):

MCM1_ 0000110_ 8524_ 2-- 726 Valid 11/20/13 02:23:13 11/20/13 22:44:38 8.38 88.6 / 93.3
MCM1_ 0000110_ 8524_ 1-- 726 Valid 11/19/13 16:34:08 11/20/13 02:20:36 4.24 98.1 / 93.3
MCM1_ 0000110_ 8524_ 0-- 726 Invalid 11/19/13 16:34:04 11/19/13 23:02:03 5.89 82.7 / 46.7

The invalid and error results show restarts; note that LAIM (Leave Applications In Memory) was off. [I suspect I know why the restarts occurred: I have several <exclusive_app> processes, run manually or on a schedule, that stop BOINC while they run. That makes the next 4 tasks candidates for a check-up once the wingmen have checked in.]

2314 21-11-2013 07:59 Suspending computation - an exclusive app is running
2315 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000166_6184_1 (removed from memory)
2316 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000173_5530_0 (removed from memory)
2317 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000174_9099_0 (removed from memory)
2318 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000174_4101_0 (removed from memory)
2319 21-11-2013 08:00 Resuming computation
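
For anyone wanting to build the same check-up list from their own event log, here is a minimal sketch. It assumes the log lines look like the excerpt above (a "[cpu_sched] Preempting <task> (removed from memory)" message); the function name `evicted_tasks` is made up for illustration and is not part of BOINC.

```python
import re

# Matches the "[cpu_sched] Preempting <task> (removed from memory)" message
# as it appears in the log excerpt above. Assumption: other client versions
# may word this message differently.
PREEMPT_RE = re.compile(r"\[cpu_sched\] Preempting (\S+) \(removed from memory\)")


def evicted_tasks(log_text):
    """Return the names of tasks that were preempted and removed from memory.

    Names are returned in the order they first appear; duplicates are dropped.
    """
    seen = []
    for line in log_text.splitlines():
        match = PREEMPT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Feeding it the six log lines above should give back the four MCM1 task names, which are the ones worth watching once the wingmen report.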

Normally run with LAIM on, but hey, are we testing to break things or not?

The restart theory gets wobbly, and the random number generator gets to slip a foot in. The log of one of the 4 tasks above looks EXACTLY the same as the logs of the invalid results, yet it has gone valid:

MCM1_ 0000174_ 9099_ 0-- 2524499 Valid 11/21/13 06:17:17 11/21/13 10:09:48 3.50 / 3.55 68.5 / 71.9

Result Log

Result Name: MCM1_ 0000174_ 9099_ 0--
<core_client_version>7.2.28</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000174_9099.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
[07:19:08]: Computing pass 0
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000174_9099.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
[08:00:08]: Computing pass 0
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000174_9099.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
[08:46:38]: Computing pass 0
Result.out = 3899480.000000
Run complete, CPU time: 12612.136050
11:09:22 (17667): called boinc_finish

</stderr_txt>
]]>
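
A restart shows up in these logs as a repeated "Commandline =" line, so counting them is a quick way to compare results. Below is a rough sketch, assuming the stderr text has the same shape as the log above; `summarize_stderr` is a hypothetical helper name, not something the science application or BOINC provides.

```python
import re

# "Result.out = 3899480.000000" as printed near the end of the stderr text.
RESULT_RE = re.compile(r"^Result\.out\s*=\s*([0-9.]+)\s*$")


def summarize_stderr(stderr_text):
    """Count application starts and pull out the Result.out value.

    Assumption: each start of the science app prints one "Commandline =" line,
    so restarts = starts - 1. result_out is None if the line is missing.
    """
    starts = 0
    result_out = None
    for raw_line in stderr_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Commandline ="):
            starts += 1
            continue
        match = RESULT_RE.match(line)
        if match:
            result_out = float(match.group(1))
    return {
        "starts": starts,
        "restarts": max(starts - 1, 0),
        "result_out": result_out,
    }
```

Run against the log above, this should report 3 starts (2 restarts), which is the pattern being compared between the valid and invalid results.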

The other 3 are listed as PV (Pending Validation), with identical logs indicating interruption:

MCM1_ 0000166_ 6184_ 1-- 2524499 Pending Validation 11/21/13 00:00:28 11/21/13 08:50:57 8.51 / 8.56 164.7 / 0.0
MCM1_ 0000173_ 5530_ 0-- 2524499 Pending Validation 11/21/13 05:04:38 11/21/13 08:47:46 3.48 / 3.52 67.7 / 0.0
MCM1_ 0000174_ 4101_ 0-- 2524499 Pending Validation 11/21/13 06:24:38 11/21/13 09:38:31 2.99 / 3.03 58.3 / 0.0

Leaving LAIM off for now, but I will eventually turn it on to see whether that makes a difference... no more invalids then [or is Murphy riding high?]

Edit: Title
----------------------------------------
[Edit 1 times, last edit by Former Member at Dec 2, 2013 10:07:29 AM]
[Nov 21, 2013 12:33:34 PM]
BobCat13
Senior Cruncher
Joined: Oct 29, 2005
Post Count: 295
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

Same thing here, on Linux Mint 15 64-bit only. 3 invalid and 4 in Pending Verification; all were reported to WCG at the same time. Of the 3 copies of each workunit that were sent out, very few have any restarts in the stderr.txt, but on my Invalids the Result.out file is 1 or 2 bytes smaller than in the valid results.
[Nov 21, 2013 6:30:51 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

I now have 5 passing via PVal to PVer, and sadly, it's also befalling Windows. In all cases one of the first 2 copies had a restart, which is visible in the log. The next thing to check is what happens if 2 restarted copies meet 1 non-restarted copy. I have not encountered one yet, but would anticipate a _3 to be issued.
[Nov 21, 2013 6:38:01 PM]
gomeyer
Senior Cruncher
USA
Joined: Jul 11, 2008
Post Count: 161
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

I got two on a solid Linux box, both had a restart.
(EDIT: I restarted the machine I mean, the work unit didn't restart itself.)
Also one on a Windoz machine with a "no heartbeat" message; the first time I've seen that on this box.
----------------------------------------
[Edit 1 times, last edit by gomeyer at Nov 22, 2013 1:50:05 AM]
[Nov 22, 2013 1:48:26 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7244
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

I also have quite a rash of these invalids, on three separate Linux machines. The Result.out line is different from the one in the results which come back valid. They all have several lines about "No heartbeat from client for 30 sec - exiting". However, some of my valid units also have the no heartbeat for 30 seconds line in them.
Here is one which is marked as "Valid."


Result Log

Result Name: MCM1_ 0000154_ 1457_ 1--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000154_1457.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
13:41:20 (8644): No heartbeat from client for 30 sec - exiting
13:41:20 (8644): timer handler: client dead, exiting

Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000154_1457.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
Result.out = 1144692.000000
Run complete, CPU time: 17591.939426
18:36:32 (8676): called boinc_finish

</stderr_txt>
]]>

The bolded lines (the "No heartbeat" lines) only show up occasionally in the valid units, but always show up in the invalid units. In the invalid units the Result.out line is always different from that of the valid units. Beats me as to the cause of the invalids and the Result.out difference.
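
One way to check that observation across the copies of a workunit is sketched below. It is only a comparison aid under my own assumptions, not how the WCG validator works: the input is a hand-collected list of (result name, status, Result.out) tuples, and both function names are invented for this example.

```python
from collections import defaultdict


def result_out_by_status(copies):
    """Group Result.out values by status.

    copies: iterable of (result_name, status, result_out) tuples collected
    by hand from the result status pages and result logs.
    """
    values = defaultdict(set)
    for _name, status, result_out in copies:
        values[status].add(result_out)
    return dict(values)


def invalids_differ_from_valids(copies):
    """True if no Invalid copy shares a Result.out value with a Valid copy.

    Also True when there are no Invalid (or no Valid) copies at all.
    """
    by_status = result_out_by_status(copies)
    valid_values = by_status.get("Valid", set())
    invalid_values = by_status.get("Invalid", set())
    return valid_values.isdisjoint(invalid_values)
```

If `invalids_differ_from_valids` ever returns False for a workunit, an invalid copy produced the same Result.out as a valid one, and the difference in Result.out cannot be the whole story.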

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 22, 2013 3:33:04 AM]
cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

I have just noticed that I have 7 invalids that were generated within two days. I have had a few invalids before (two at the most), but nothing to really worry about.

MCM1_ 0000281_ 7216_ 0-- PBYBWHW Invalid 11/25/13 02:41:18 11/30/13 07:41:51 2.94 / 2.96 65.0 / 39.6
MCM1_ 0000302_ 5492_ 2-- R8XZ4P5 Invalid 11/27/13 00:21:00 11/30/13 05:58:54 5.23 / 5.44 118.9 / 55.5
MCM1_ 0000278_ 3201_ 0-- PBYBWHW Invalid 11/24/13 23:20:00 11/30/13 02:37:26 5.61 / 5.75 124.4 / 69.2
MCM1_ 0000275_ 3673_ 0-- PBYBWHW Invalid 11/24/13 22:38:11 11/29/13 22:33:38 5.58 / 5.73 119.9 / 67.5
MCM1_ 0000299_ 9235_ 2-- R8XZ4P5 Invalid 11/26/13 22:08:27 11/29/13 22:28:37 4.44 / 4.80 79.7 / 40.4
MCM1_ 0000275_ 2037_ 0-- PBYBWHW Invalid 11/24/13 22:35:06 11/29/13 20:57:03 5.56 / 5.71 119.9 / 71.6
MCM1_ 0000275_ 6725_ 2-- PBYBWHW Invalid 11/24/13 22:32:01 11/29/13 02:07:53

They all seem to have basically the same output:
Result Name: MCM1_ 0000302_ 5492_ 2--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_windows_intelx86 -SettingsFile MCM1_0000302_5492.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_windows_intelx86 -SettingsFile MCM1_0000302_5492.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_windows_intelx86 -SettingsFile MCM1_0000302_5492.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
Result.out = 3326321.000000
Run complete, CPU time: 18812.536777
23:29:26 (5892): called boinc_finish

</stderr_txt>
]]>
The Invalids happened on 2 Intel i5 laptops. Any comments?

[EDIT]: forget about the request for comments... I see that the thread for this topic is here

Thanks, CJSL

Crunching for a better world...
----------------------------------------
I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team


----------------------------------------
[Edit 1 times, last edit by cjslman at Dec 2, 2013 12:21:04 AM]
[Dec 1, 2013 3:09:46 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

We are looking into the invalids for MCM1. It looks like it could be a checkpoint/restart issue, so if possible, change the Leave Application in Memory setting to yes while we investigate.

Thanks,
armstrdj
[Dec 2, 2013 3:57:20 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

Regrettably, LAIM only prevents half the problem. Client and device restarts cause the same thing. This will predominantly affect any cruncher who runs part-time without making use of hibernation/sleep.

What is disturbing is that even with matching Result.out values, the validator would not let the results pass muster. On top of that, we've also seen all 5 copies, the maximum, come back with different Result.out values and then get moved into Too Late [which you will have seen in the take-out reports].

Anyway, thanks for looking into this [and with bated breath, the Beta hunters are not far behind, to take that urgency away]
[Dec 2, 2013 4:15:14 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

We are looking into the invalids for MCM1. It looks like it could be a checkpoint/restart issue, so if possible, change the Leave Application in Memory setting to yes while we investigate.

Thanks,
armstrdj

Having LAIM checked doesn't seem to make any difference. I have 5 tasks from yesterday and today, from 1 machine running with LAIM checked, that are pending verification. 3 restarted, 2 didn't. It's the machine named "Home Premium 764" if the techs want to take a look. The machine and client have not been restarted that I know of. I can confirm what Sek said about restarts: I had to reboot one of my machines several times yesterday, and all the tasks that were running at the time are headed to pending verification purgatory.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


----------------------------------------
[Edit 4 times, last edit by nanoprobe at Dec 2, 2013 4:39:09 PM]
[Dec 2, 2013 4:27:53 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7244
Status: Offline
Re: MCM: Seeing interspersed Invalid results with 7.26, passing via Pending Validation

What is disturbing is that even with matching Result.out values, the validator would not pass the mustering,


I wondered if anyone else had noticed this. Hopefully the techs can fix this.
MCM1_ 0000317_ 5063_ 4-- 726 Valid 12/1/13 20:44:40 12/2/13 13:41:26 9.94 162.8 / 155.8
MCM1_ 0000317_ 5063_ 3-- - No Reply 11/28/13 20:44:31 12/1/13 20:44:31 0.00 0.0 / 0.0
MCM1_ 0000317_ 5063_ 2-- 726 Invalid 11/27/13 19:37:58 11/28/13 20:43:52 5.93 124.0 / 77.9
MCM1_ 0000317_ 5063_ 1-- 726 Invalid 11/26/13 16:40:50 11/27/13 19:37:36 6.63 167.1 / 77.9 <Mine
MCM1_ 0000317_ 5063_ 0-- 726 Valid 11/26/13 16:40:31 11/27/13 12:37:45 6.67 148.7 / 155.8

All have the same result out: Result.out = 3982717.000000

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Dec 2, 2013 11:48:01 PM]