Thread Status: Active
Total posts in this thread: 20
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Changes to distribution of error work units

Greetings all,

For Discovering Dengue Drugs - Together Phase 2, we are going to change the maximum number of errors for a work unit to 3. In the past this was set to 5. Almost all of the errors we have seen with the application have been consistent errors, meaning they are the false positives that the researchers expect. The change will decrease the number of copies that are sent out with the 16MB input file only to fail quickly. The main reason for doing this is to spare members large downloads that end in quick errors. It will also increase the speed at which we can return batches to the researchers.

Thanks,
-Uplinger

PS: If you have any questions, feel free to ask.
[Apr 14, 2010 3:13:58 PM]
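As a rough illustration of the change announced in this post, the sketch below models a work unit whose every copy fails and compares the download volume under the old limit of 5 and the new limit of 3. The `WorkUnit` class and the `wasted_download_mb` function are hypothetical names invented for this example, not the actual server code. It also assumes copies are sent one at a time, so it ignores copies that are already in flight when the limit is reached.

```python
from dataclasses import dataclass

INPUT_FILE_MB = 16  # size of the input file quoted in the post


@dataclass
class WorkUnit:
    """Hypothetical model of one work unit; not the actual server code."""
    max_errors: int
    errors_returned: int = 0
    copies_sent: int = 0

    def can_send_copy(self) -> bool:
        # Assumption: no further copies go out once the error limit is reached.
        return self.errors_returned < self.max_errors

    def send_copy(self) -> None:
        self.copies_sent += 1

    def report_error(self) -> None:
        self.errors_returned += 1


def wasted_download_mb(max_errors: int) -> int:
    """Download volume for a work unit whose every copy fails,
    assuming copies are sent one at a time."""
    wu = WorkUnit(max_errors=max_errors)
    while wu.can_send_copy():
        wu.send_copy()
        wu.report_error()
    return wu.copies_sent * INPUT_FILE_MB


print(wasted_download_mb(5))  # 80
print(wasted_download_mb(3))  # 48
```

Under these assumptions, a consistently failing work unit costs members 48 MB of downloads instead of 80 MB.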
pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Re: Changes to distribution of error work units


uplinger: "PS: If you have any questions, feel free to ask."
Only one: when will the deluge start? :D ;)
[Apr 14, 2010 4:42:40 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Changes to distribution of error work units

Après nous... :D :D :D
[Apr 14, 2010 4:47:30 PM]
JSYKES
Senior Cruncher
Joined: Apr 28, 2007
Post Count: 200
Re: Changes to distribution of error work units

Thanks for the update, Uplinger, but will that have any effect on the distribution of WUs away from crunchers who have had a high percentage of errors relative to good returns? I guess everyone has had errors, but the distribution seems to have been very uneven, and hence some of us have only had a very small number of WUs in total (I've had fewer than 20, with only 14 validated), which could skew the stats against us for a while (or for how long?) despite having very quick PCs. What's the score with this?
[Apr 14, 2010 8:02:32 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Changes to distribution of error work units

It needn't be DDDT-2 results that you return to make up for the error WUs. HCC gives the smallest downloads, but HCMD-2 appears to have the lowest average turnaround time for a good result. For each error WU you had, grab a dozen or so WUs from either of those projects on each machine and (assuming they're returned without error) your machine(s) should be rated 'reliable' again within a day and eligible for A and B types. At least that's what I did after every one of my machines got hit with at least 1 bad WU; a couple are now crunching A's and B's as I type.
[Apr 14, 2010 9:34:40 PM]
JSYKES
Senior Cruncher
Joined: Apr 28, 2007
Post Count: 200
Re: Changes to distribution of error work units

Thanks, ZoSo. I've been crunching loads of other stuff, as DDDT2 is so unreliable for continuity, and all of it without a single error, so it sounds as though it should be self-correcting in due course.
[Apr 15, 2010 6:13:19 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Changes to distribution of error work units

The good (now old) news is that you really only need 1:1 for errors to stay in the RR class once you have arrived there, after having done 77 good ones! Case in point: my client flunked a DDDT2 job and shortly afterwards got an HCC repair job.

Be Happy, Crunch Happy.
[Apr 15, 2010 6:26:16 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Re: Changes to distribution of error work units

uplinger: "If you have any questions, feel free to ask" ...
1. When does the change to the max number of errors on a WU (the limit that prevents further copies being sent out) take effect?
Perhaps it is already in place for the next batch of new WUs, but does not apply to the ones that have already started.
For example, the 7th copy (name *_06) has just been sent out for the WU that I described in "The Bad Type A WUs" thread.

2. Within a quorum, should the values of all of the data in each of the members' running science programs be bit-for-bit identical throughout the run, so that if errors occur, they occur at the same place?
In the abovementioned TS05/ps WU, 3 copies terminated at pctComplete=0.688000 and 1 at 0.447600. In the thread "DDDT2 - now an Intermittent project", mweisensee says that ts05_a193_ps0000 gave 3 error exits with different % completion. JmBoullier mentions more WUs that terminated at different points, including one where 2 members completed the WU successfully. Furthermore, mweisensee thinks that forcing periodic restarts from checkpoints avoids the errors. Do these things make sense?
Does CHARMM use Monte Carlo methods deep inside its ancient FORTRAN machinery?
----------------------------------------
[Edit 5 times, last edit by Rickjb at Apr 17, 2010 12:59:42 PM]
[Apr 17, 2010 12:24:39 PM]
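The remark in this post that the 7th copy carries the name suffix `_06` implies that copies are numbered from zero. A minimal sketch of that arithmetic follows, assuming a copy's name ends in an underscore followed by its zero-based index. The function name `copy_ordinal` and the sample names are hypothetical and chosen only for illustration.

```python
def copy_ordinal(result_name: str) -> int:
    """Return the 1-based copy number encoded in a name's numeric suffix.

    Assumption: the name ends in "_NN", where NN is a zero-based index.
    """
    _, _, suffix = result_name.rpartition("_")
    return int(suffix) + 1


print(copy_ordinal("example_wu_06"))  # 7
print(copy_ordinal("example_wu_00"))  # 1
```

Read this way, a `_06` copy means six earlier copies had already been issued for the same work unit.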
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Changes to distribution of error work units

uplinger: "PS: If you have any questions, feel free to ask."

I submitted the 3rd error for ts05_a256_ps0000 this morning at about 2:15, yet at 3:24 another copy of that WU was sent out.

How are the 3 errors being counted?

Thanks. :)

[edit1 - added screen grab]

All were exit code 29 (0x1d), by the way.

[edit2 - added exit code]
----------------------------------------
[Edit 2 times, last edit by Former Member at Apr 17, 2010 1:20:47 PM]
[Apr 17, 2010 1:04:51 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Re: Changes to distribution of error work units

Rickjb: "JmBoullier mentions more WUs that terminated at different points, including one where 2 members completed the WU successfully."
Rick,
Sorry if my wording was confusing, but in my sentence "the most consistent one is the only one which completed fine for both my wingman and me. :)", "one" stands for "quorum". Obviously this sentence applies to a particular WU which was valid for both wingmen, not to a WU with two valid results and x errors.
[Apr 17, 2010 1:36:02 PM]