Thread Status: Active
Total posts in this thread: 11
This topic has been viewed 14278 times and has 10 replies
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
The thread for DDDT2 WUs that behave strangely or badly

Maybe this post should go in the DDDT-2 Run time Ranges thread, but it doesn't really belong there.
And there may be other WUs that behave strangely, so let's have a new thread.

erlc_d002_pr89b1_1 | Valid | 26/03/10 08:20:59 | 26/03/10 09:53:00 | 0.43 | 6.9 / 7.0
erlc_d002_pr89b1_0 | Valid | 26/03/10 08:20:57 | 26/03/10 11:57:06 | 0.27 | 7.2 / 7.0 (mine)
This WU progressed to an indicated 30% or so complete, then suddenly finished with a normal exit.
My device (Q9650 @ 3.9GHz) has crunched 2 other "pr" WUs in around 0.8h, and is currently running 1 other with an extrapolated completion time of about 0.8h too.
The wingman has claimed similar credit for this short WU, so he probably did a similar amount of work on it, and probably experienced early termination too.
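
As a rough cross-check of the figures above, the time spent on the short WU can be compared with the typical run time for this WU type. This is a minimal Python sketch, assuming the 0.27 figure is CPU hours and that progress scales roughly linearly with CPU time (neither is confirmed in this thread); the function name is illustrative only:

```python
def fraction_of_typical(cpu_hours: float, typical_hours: float) -> float:
    """Return the share of a typical run that was actually crunched."""
    if typical_hours <= 0:
        raise ValueError("typical_hours must be positive")
    return cpu_hours / typical_hours

# Figures from the post: 0.27 h for the short WU, about 0.8 h for other "pr" WUs.
share = fraction_of_typical(0.27, 0.8)
print(f"{share:.0%} of a typical run")
```

The result lands in the same ballpark as the roughly 30% progress indicated when the task ended.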
----------------------------------------
[Edit 1 times, last edit by Rickjb at Mar 26, 2010 12:22:43 PM]
[Mar 26, 2010 12:20:46 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

These sciences checkpoint every 2% of progress, i.e. if you activate <checkpoint_debug> in cc_config.xml and set a very short write-to-disk interval of 30 seconds, you should be able to find out, if you wish to of course.

Personally, I wish all checkpoint lines (even those that the client setting does not permit writing to the message log) were stored in the result log. It's only slightly more info, but quite comforting when we laymen do some self-diagnostics, i.e. we should see 50 of them for a full run.
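
The expected count follows from the 2% interval: 100 / 2 = 50 checkpoints for a full run, so a task that stops near 30% should show about 15. Below is a minimal Python sketch for counting them in a saved copy of the message log. The sample lines and the wording of the checkpoint message are assumptions for illustration, so adjust the `marker` text to whatever your client version actually writes:

```python
def expected_checkpoints(progress_pct: float, interval_pct: float = 2.0) -> int:
    """Number of checkpoints expected by the time a task reaches progress_pct."""
    return int(progress_pct // interval_pct)

def count_checkpoints(log_lines, task_name: str, marker: str = "checkpointed") -> int:
    """Count log lines that mention both the task name and the checkpoint marker."""
    return sum(1 for line in log_lines if task_name in line and marker in line)

# Hypothetical log excerpt; real message text may differ between client versions.
sample_log = [
    "26/03/2010 8:30:01 PM|World Community Grid|result erlc_d002_pr89b1_0 checkpointed",
    "26/03/2010 8:31:07 PM|World Community Grid|result erlc_d002_pr89b1_0 checkpointed",
    "26/03/2010 8:31:40 PM|World Community Grid|result some_other_task_0 checkpointed",
]

print(expected_checkpoints(100))  # full run
print(expected_checkpoints(30))   # task that stopped near 30%
print(count_checkpoints(sample_log, "erlc_d002_pr89b1_0"))
```

Comparing the counted value with the expected one would show whether a task really stopped at the progress figure the client displayed.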
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Mar 26, 2010 12:30:39 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

Sek: I have checkpoint_debug turned off for most devices, and will leave counting checkpoints in logs to someone with more time on their hands.
Some of the contributors to your DDDT-2 Run time Ranges thread might find some early terminators and thus identify which WU types are affected. That would make searching through results and log files much easier.

[OT but related]: A suggestion: have Type A WUs checkpoint more often than every 2%. That would reduce crunching time lost to restarts.
[Mar 26, 2010 2:23:38 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

Sek,

Checkpointing on type A work units is every 2%. But on type B and C it is much much quicker. It tries to checkpoint every x times through the loops, which I believe ends up around 10 seconds...

Rickjb,

The early termination is OK. It is one of the positive negatives I have talked about in other threads. Multiple checks were done to make sure both you and your wingman encountered similar situations, so although the run was unfortunately short, it did provide information for the researchers.

Thanks,
-Uplinger
[Mar 26, 2010 2:37:02 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

Rickjb,

Sorry, due to a limitation of the dynamics loop, we are not able to make it checkpoint more often than every 2%. We have tried, and that was the best we were able to achieve.

-Uplinger
[Mar 26, 2010 2:38:40 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

"... to someone with more time on their hands."
Thanks for reminding me.
[Mar 26, 2010 2:41:13 PM]
X-Files 27
Senior Cruncher
Canada
Joined: May 21, 2007
Post Count: 391
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

uplinger wrote:
> Sorry due to a limitation of the dynamics loop, we are not able to make it checkpoint sooner than every 2%. We have tried and that was the best we were able to achieve.

I also have this dilemma of losing some precious time due to:
a) BOINC running its periodic benchmark
b) switching applications
c) EDF mode
[Mar 26, 2010 2:53:58 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

Point a) is fixed in a coming 6.10 version, I hope in the one that WCG is going to recommend. During the 30-second benchmark the sciences will then not be unloaded. Generally I've got LAIM (Leave Applications In Memory) on, but I see that for some that is not an option. Switching apps is done at checkpoints anyway, i.e. lossless, which leaves EDF (Earliest Deadline First)... which should be rare with caches that are not excessively sized **.

edit: ** and in the current high variability environment of jobs for DDDT2, HCMD2 and HFCC that means keeping it near 1.00 or lower.
----------------------------------------
[Edit 1 times, last edit by Sekerob at Mar 26, 2010 3:11:15 PM]
[Mar 26, 2010 3:06:48 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

uplinger wrote:
> Sek,
>
> Checkpointing on type A work units is every 2%. But on type B and C it is much much quicker. It tries to checkpoint every x times through the loops, which I believe ends up around 10 seconds...
>
> Rickjb,
>
> The early termination is ok. It is one of the positive negatives I have talked about in other threads. Multiple checks were done to make sure both you and your wingman encountered similar situations, so unfortunately the run was short, it did provide information for the researchers.
>
> Thanks,
> -Uplinger

Thx for connecting the dots I've missed. Glad I've got the WTD (write to disk) set to 5 minutes, else the message log would get truly overlong. As per mikaok's comment yesterday, it's also good to be able to eliminate and in this case even move on quicker.
[Mar 26, 2010 3:24:44 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: The thread for DDDT2 WUs that behave strangely or badly

WU that ended early with Error -161
These WUs have been discussed in the problematic batches? thread.
The techs & scientists are working on the problem, but it's still happening.
From the log file of my example (2 previous wingmen had the same):
> <file_name>erlc_e019_pda004_0_2</file_name>
> <error_code>-161</error_code>
> </file_xfer_error>
No-one else has quoted the corresponding messages from the BOINC client's Messages tab:
27/03/2010 7:03:51 PM|World Community Grid|Computation for task erlc_e019_pda004_3 finished
27/03/2010 7:03:51 PM|World Community Grid|Output file erlc_e019_pda004_3_2 for task erlc_e019_pda004_3 absent
Times are UTC+11. HTH - Rick
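
For anyone wanting to scan a saved message log for the same symptom, here is a minimal Python sketch. It assumes the pipe-separated `timestamp|project|message` layout of the two lines quoted above; the function name and the regular expression are illustrative and not part of BOINC:

```python
import re

# Matches the message text quoted above: "Output file <file> for task <task> absent".
ABSENT_RE = re.compile(r"^Output file (\S+) for task (\S+) absent$")

def find_absent_outputs(log_lines):
    """Return (timestamp, task, file) for every 'output file absent' message."""
    hits = []
    for line in log_lines:
        parts = line.rstrip("\n").split("|", 2)
        if len(parts) != 3:
            continue  # not in timestamp|project|message form
        timestamp, _project, message = parts
        match = ABSENT_RE.match(message.strip())
        if match:
            hits.append((timestamp, match.group(2), match.group(1)))
    return hits

log = [
    "27/03/2010 7:03:51 PM|World Community Grid|Computation for task erlc_e019_pda004_3 finished",
    "27/03/2010 7:03:51 PM|World Community Grid|Output file erlc_e019_pda004_3_2 for task erlc_e019_pda004_3 absent",
]
for timestamp, task, missing in find_absent_outputs(log):
    print(f"{timestamp}: task {task} is missing {missing}")
```

Run over a whole log, this would list every task that finished without its expected output file, which makes it easier to tell which batches are affected.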
----------------------------------------
[Edit 2 times, last edit by Rickjb at Mar 27, 2010 9:35:24 AM]
[Mar 27, 2010 9:33:11 AM]