Thread Status: Active
Total posts in this thread: 20
This topic has been viewed 25733 times and has 19 replies
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Changes to distribution of error work units

Jean: Yes, your wording was a bit imprecise :) However, if some copies of a WU error at different places, and there is some randomness in those places, then some might make it to 100%. The techs/scientists could check to see whether this is happening. Different error exit points could be explained by the use of Monte Carlo methods, in which case restarting from a checkpoint might allow a WU to make further progress. A successful completion may not be meaningful, though. I don't see how mweisensee's preemptive restarts could work.
[Edit]: Copy _3 of my error-29 WU ts05_b001_ps0000, described at The Bad Type A WUs Thread, has completed, and is PV. Thus we have an example of a WU with some copies completing, but other copies erroring out with a frequency that is way beyond the average WCG device failure rate.
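To make the argument above concrete, here is a toy simulation. It is only a sketch under loud assumptions: every copy of a WU is modelled as running a fixed number of steps, and each step fails independently with the same small probability. The names `run_copy`, `simulate`, and `p_fail_per_step` are hypothetical and have nothing to do with the real DDDT2 application.

```python
import random


def run_copy(rng, total_steps, p_fail_per_step):
    """Return the step at which this copy failed, or total_steps if it finished."""
    for step in range(total_steps):
        # Assumption: each step fails independently with the same probability.
        if rng.random() < p_fail_per_step:
            return step
    return total_steps


def simulate(copies, total_steps, p_fail_per_step, seed=0):
    """Run several copies of the same hypothetical WU with one seeded generator."""
    rng = random.Random(seed)
    return [run_copy(rng, total_steps, p_fail_per_step) for _ in range(copies)]


results = simulate(copies=5, total_steps=100, p_fail_per_step=0.01)
completed = sum(1 for r in results if r == 100)
print("exit points:", results)
print("copies that reached 100%:", completed)
```

Under this model, copies stop at different places and a fraction of them can still run to the end, which is the pattern described above. Whether the real errors behave like this is for the techs and scientists to confirm.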
----------------------------------------
[Edit 1 times, last edit by Rickjb at Apr 18, 2010 9:27:08 AM]
[Apr 17, 2010 2:00:14 PM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Re: Changes to distribution of error work units

I submitted the 3rd error for ts05_a256_ps0000 this morning about 2:15, yet at 3:24 another copy of that WU was sent out.

How are the 3 errors being counted?

The standard BOINC server code checks for a count larger than the limit when it comes to max "success" tasks and max "error" tasks for a WU. So if the limit is set to 3 errors, the WU won't error out before the 4th error has been reported.

There's also a limit on max "total" tasks, where outdated server code also checks for a count larger than this limit. With more recent code, on the other hand, this limit won't be exceeded.
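The off-by-one effect described above can be sketched as follows. This is an illustration of the comparison only, not the actual BOINC server source; `WorkUnitLimits`, `errors_out_strict`, and `errors_out_inclusive` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WorkUnitLimits:
    # Hypothetical container; the field name is illustrative only.
    max_error_tasks: int


def errors_out_strict(error_count: int, limits: WorkUnitLimits) -> bool:
    """Strict check as described in the post: only a count larger than the limit stops the WU."""
    return error_count > limits.max_error_tasks


def errors_out_inclusive(error_count: int, limits: WorkUnitLimits) -> bool:
    """What one might expect instead: stop as soon as the limit is reached."""
    return error_count >= limits.max_error_tasks


limits = WorkUnitLimits(max_error_tasks=3)
for n in range(1, 5):
    print(n, errors_out_strict(n, limits), errors_out_inclusive(n, limits))
```

With a limit of 3, the strict check still allows another copy to go out after the 3rd error and only stops the WU at the 4th.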
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Apr 17, 2010 8:01:06 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Changes to distribution of error work units

The standard BOINC server code checks for a count larger than the limit when it comes to max "success" tasks and max "error" tasks for a WU. So if the limit is set to 3 errors, the WU won't error out before the 4th error has been reported.

There's also a limit on max "total" tasks, where outdated server code also checks for a count larger than this limit. With more recent code, on the other hand, this limit won't be exceeded.



OK, yes - that makes sense... because they were getting more than 5 errors when the limit was set to 5, also. Forgot about that.

Thanks.
[Apr 18, 2010 2:58:38 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Changes to distribution of error work units

(Edited my last post, which responded to JMBoullier.)

New (?) Information:
I have attempted to answer my own question (above): Does CHARMM use Monte Carlo methods ...?
I searched the DDDT2 program file, wcg_dddt2_charmm_6.17_windows_intelx86, looking for the ASCII strings "onte" and "MONTE", using the strings program in (the excellent freeware) Readypak Unix-like-utilities-for-Windows. We get:
> QMDEFN> Some nuclei will be treated quantum mechanically.
> The number of QM path integral atoms =
> The number of quasi-particles per atom =
> The number of Monte Carlo moves (av) =
> The number of Monte Carlo moves (eq) =
...
> MONTE CARLO : Sampling from Boltzmann Distribution
...
> GA_Evolve: Monte Carlo: Starting temerature :
> GA_Evolve: Monte Carlo: Final temerature :
> GA_Evolve: Monte Carlo: Temperature increment;frequency :
...
> GA_E
> MONTE CARLO energies per structure
...
> HB MONTE CARLO : GENERATION NUMBER
...
> SITE atoms using
> Monte Carlo points in
...
> MONTE CARLO :
> MONTE CARLO : GENERATION # TEMPERATURE $
> ANAL: BOND>
If this code is activated in DDDT2, it might explain what we are seeing.
Furthermore, I suspect that using results where some copies gave an error while others ran to completion would introduce bias and render the WU scientifically invalid. This might also apply to the forced restarts that mweisensee has performed; see "exited with code 29 (0x1d, -227)". That's all up to the scientists to decide, of course.
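For anyone without a strings utility, the search described above can be approximated in a few lines of Python. This is a minimal sketch, not the tool I used: the helper names are hypothetical, and the sample bytes are made up to show the behaviour.

```python
import re
from pathlib import Path


def printable_strings(data: bytes, min_len: int = 4):
    """Yield runs of printable ASCII at least min_len bytes long, in the spirit of `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


def find_mentions(data: bytes, needles=("onte", "MONTE")):
    """Return the extracted strings that contain any of the search terms."""
    return [s for s in printable_strings(data) if any(n in s for n in needles)]


def find_mentions_in_file(path):
    """Convenience wrapper that reads a binary file from disk."""
    return find_mentions(Path(path).read_bytes())


# Made-up sample standing in for the program file.
sample = (
    b"\x00\x01The number of Monte Carlo moves (av) =\x00\xff\x02abc\x00"
    b"MONTE CARLO : Sampling from Boltzmann Distribution\x00"
)
for line in find_mentions(sample):
    print(">", line)
```

To run it against the real binary, pass its path to `find_mentions_in_file`. As with the original search, a match only shows that the strings are present in the file, not that the code is executed.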
HTH - Rick
----------------------------------------
[Edit 2 times, last edit by Rickjb at Apr 18, 2010 11:01:09 AM]
[Apr 18, 2010 9:29:20 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Changes to distribution of error work units

Jean: Yes, your wording was a bit imprecise :) However, if some copies of a WU error at different places, and there is some randomness in those places, then some might make it to 100%. The techs/scientists could check to see whether this is happening.
Sure, it's possible for the same WU to have both copies in error and valid copies. After all, that is the purpose of distributing repair copies.
But, still, I don't see how I could have found such a quorum "the most consistent one". :D

Anyway, back to the point, I'll make sure that Uplinger does not miss these strange cases when he comes back tomorrow. They might signal something that the techs and the scientists have not noticed yet.

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Apr 18, 2010 11:12:49 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Changes to distribution of error work units

Monte Carlo principles in DDDT are mentioned in a 2009 paper: http://www.utmb.edu/discoveringdenguedrugs-to...s/Watowich-IDDT-Jun09.pdf
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Apr 18, 2010 11:26:35 AM]
mclaver
Veteran Cruncher
Joined: Dec 19, 2005
Post Count: 566
Status: Offline
Re: Changes to distribution of error work units

It looks like I got some errors I had not noticed.

Result Name | Device Name | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit
ts05_ b436_ ps0000_ 0-- | msi-920 | Error | 4/14/10 20:50:54 | 4/16/10 16:19:10 | 38.47 | 627.4 / 0.0
ts05_ a175_ ps0000_ 0-- | WS-USSP-77417 | Error | 4/14/10 19:40:02 | 4/16/10 07:00:57 | 28.80 | 542.0 / 0.0

I was not the only one who got an error. Multiple people are getting errors, and both WUs have been sent out again and are in progress. I will be curious to see whether anyone can complete these WUs.

One ran for 28 hours and one ran for 38 hours. That is a pretty long time to run and get no credit. All the other errors I had on DDDT2 had no CPU time. Any thought of giving credit to those who processed for a long time before they got an error?

Also, if everyone is getting errors, why do they keep going out? On the first one, I got an error after 38 hours, my wingman got an error after 27 hours, it went out to someone else, who got an error after 28 hours, and then it went out to two other people whose copies are now in progress.
----------------------------------------



[Apr 18, 2010 8:34:07 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Changes to distribution of error work units

Mitch,
If your WUs in error show something like
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
in their Result Log, then this case has been covered at length in several threads of the DDDT2 forum.

In short, it is a "normal" error and your WUs will be credited when their respective quora are complete.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Apr 19, 2010 12:29:40 AM]
mclaver
Veteran Cruncher
Joined: Dec 19, 2005
Post Count: 566
Status: Offline
Re: Changes to distribution of error work units

Mitch,
If your WUs in error show something like
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
in their Result Log, then this case has been covered at length in several threads of the DDDT2 forum.

In short, it is a "normal" error and your WUs will be credited when their respective quora are complete.


I will wait for the credit. One of my errors was that, but the other was:


Result Log

Result Name: ts05_ b436_ ps0000_ 0--



<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 29 (0x1d, -227)
</message>
<stderr_txt>
pctComplete = 0.405600
wcgStepsDone = 1500 wcgSteps1 = 5000 wcgCyclesDone = 20 wcgCycles = 50 pctComplete = 0.406000
wcgStepsDone = 1600 wcgSteps1 = 5000 wcgCyclesDone = 20 wcgCycles = 50 pctComplete = 0.406400
wcgStepsDone = 1700 wcgSteps1 = 5000 wcgCyclesDone = 20 wcgCycles = 50 pctComplete = 0.406800
wcgStepsDone = 1800 wcgSteps1 = 5000 wcgCyclesDone = 20 wcgCycles = 50 pctComplete = 0.407200
wcgStepsDone = 1900 wcgSteps1 = 5000 wcgCyclesDone = 20 wcgCycles = 50 pctComplete = 0.407600
wcgStepsDone = 2000 wcgSteps1 = 5000 wcgCyclesDone

Is this the same?

- Mitch
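As a side note on reading this log, the pctComplete values appear to follow from the other counters. The formula below is inferred from the numbers shown above, not taken from the application source, and `pct_complete` is a hypothetical helper name.

```python
def pct_complete(steps_done, steps_per_cycle, cycles_done, cycles):
    """Fraction done: completed cycles plus the partial current cycle, over all cycles."""
    return (cycles_done + steps_done / steps_per_cycle) / cycles


# Counter values copied from the log lines above (wcgSteps1 = 5000, wcgCycles = 50).
for steps in (1500, 1600, 1700, 1800, 1900):
    print(steps, f"{pct_complete(steps, 5000, 20, 50):.6f}")
```

For wcgStepsDone = 1500 this gives (20 + 1500/5000) / 50 = 0.406, which matches the logged pctComplete, so this copy stopped at roughly 40% of the way through.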
----------------------------------------



[Apr 19, 2010 1:38:30 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Changes to distribution of error work units

It is the error code which matters, so both WUs are in the right category and will be credited.

The format of the error message may vary depending on the OS and/or the BOINC version.
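To compare two Result Logs by code rather than by wording, something like the following could be used. It is only a sketch that handles the two message formats quoted in this thread; other OS or BOINC versions may word the message differently, and `extract_exit_code` is a hypothetical helper.

```python
import re

# Matches both "exit code 29 (...)" and "process exited with code 29 (...)".
EXIT_CODE_RE = re.compile(r"exit(?:ed with)? code\s+(-?\d+)")


def extract_exit_code(message: str):
    """Return the decimal exit code found in a result-log message, or None if absent."""
    match = EXIT_CODE_RE.search(message)
    return int(match.group(1)) if match else None


messages = [
    "The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)",
    "process exited with code 29 (0x1d, -227)",
]
for msg in messages:
    print(extract_exit_code(msg), "<-", msg)
```

Both messages yield the same code, 29 (0x1d in hexadecimal), even though the surrounding text differs.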
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Apr 19, 2010 2:39:49 AM]