This topic has been viewed 23976 times and has 5 replies.
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
BOINC: Results Status page - What does xyz status mean?

There have been a number of questions regarding the Results Status page. The definitions of these statuses are shown below:
  • Work In Progress - One of your computers is currently working on this result.

  • Aborted - Appears with one of two prefixes, Server Aborted or User Aborted, for jobs canceled before or while running. Server Aborted marks server-side cancellations, instructing the client to automatically abort redundant or known-bad tasks. These server-side aborts only take effect when the client next contacts the project servers, which have such an instruction waiting.

  • Detached - When a newer client is detached from this project with tasks still in its buffer, a message is sent to the servers to ensure that these tasks get quickly redistributed. With older clients this would not happen, and new copies of the tasks would not be sent until the "No Reply" condition occurred.
    NB: Those who experience this -unexpectedly- and are users of BOINCStats Account Manager [BAM], need to ensure the "Attach new host by default?" column box for WCG is ticked on their My Projects page.

  • Error - Some event occurred to keep the result from finishing properly. This could be due to a BOINC error or a science application error.

  • No Reply - The result was not returned to the server by the time it was due.

  • Pending Validation - The result was returned to the server but not enough results have been returned yet for that workunit to allow a validation test. Occasionally a Pending Validation result without a wingman is moved into a Too Late status to permit clearance from the system and facilitate credit granting.

  • Valid - The result was returned to the server and was equal to the majority of results returned for the workunit.
  • Invalid - The result was returned to the server and was not found to be equal to the majority of results returned for the workunit.

  • Pending Verification - [FKA Inconclusive] The result was returned to the server and validation was attempted, but the system could not determine which result(s) it should consider to be valid. New results were sent out for this workunit and validation will be attempted again when those results are returned. Additionally, for the Zero Redundancy sciences, one or more results are at times randomly marked Pending Verification to force out an additional copy for computation and confirmation. Clients which produced an invalid/error result will see this more frequently until their reliability rate has returned to a high standard [a series of over 20 valid results is required].

  • Too Late - The result was returned to the server long after it was due. Occasionally a result previously marked Pending Validation has its distribution stopped due to too many errors, without a complete quorum [maximum errors vary per science]. The non-error results are then converted to the status Too Late. Credit is granted as claimed [with delay]. Internally these task results are moved to a take-out list for later review. Also see the Pending Validation status.

  • Other - The most common reason for this status is a last-minute retraction: the feeder decided that the workunit copy had become superfluous because a missing No Reply workunit was received and validated seconds before sending.

  • Waiting to be Sent - A transient condition in which not all copies of an initial distribution have been downloaded by volunteer clients, because not enough 'same platform' hosts are momentarily asking for work from that project. Also seen when a science/feeder has been temporarily stopped.

The initial "Work In Progress" status changes only after the Result has been returned and the BOINC client "Ready to Report" task status line has cleared.
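The Valid/Invalid distinction above comes down to a majority comparison within a workunit's quorum. As a toy illustration only (the function name is made up; this is not WCG's actual validator):

```python
from collections import Counter

def classify_results(results):
    # Toy majority rule: a result is Valid if it matches the most
    # common answer returned for the workunit, Invalid otherwise.
    majority, _ = Counter(results).most_common(1)[0]
    return ["Valid" if r == majority else "Invalid" for r in results]

print(classify_results(["A", "A", "B"]))  # → ['Valid', 'Valid', 'Invalid']
```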
Below are the Result Statuses in languages for which website translations are available:


NB (edit): Where appropriate, the term "science" has replaced "project".
----------------------------------------
[Edit 27 times, last edit by Former Member at May 18, 2015 6:04:34 PM]
[Feb 11, 2006 2:46:53 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Results Status page - what does xyz status mean and how to Review Easily!

Frequent visitors will have figured out the various functions on the 'Result Status page' and maintain bookmarks for quick access to listings of the various Work Unit statuses. All links produce lists with the latest/newest result returned or received at top and the oldest at bottom. More links can be created for bookmarking, with additional filtering for the individual devices a member may be operating.

Sorting can be achieved on the following 5 criteria by hitting the respective links in the column headers:
  • Device Name: Alpha-Numerically sorted
  • Sent Time: Ordered with latest transmitted tasks on top
  • Time Due: Sorted with latest due at top
  • Return Time: Latest work at top of page
  • CPU Time (hours): Longest job at top, sorted descending.
The links provided always go to the first page. By replacing the trailing 1 with 999, you can force the list to jump to the last page. This helps with longer tabulations and with quickly identifying old work, e.g. work units awaiting additional copies after No Reply, Inconclusive & Error returns.

Filtering of work units can be done on (1) Device Name, (2) the 9 Result Statuses, and (3) the Projects for which current results are returned. The previously selected sort order will be maintained.

For further detail on the distribution & quorum of a particular Work Unit, select the one of interest (example circled red), which will then open a pop-up list or a new tab, depending on the browser configuration (lower half of the screenshot below). The list can again be sorted by selecting one of the 4 circled column headers. More expanded information is provided in the FAQ: Minimum Quorum & Initial Distribution of Work Units (Tasks)

Notably, Results are only listed until 4 days after validation, where applicable the quorum is complete, and/or no more work needs to be sent out to achieve validation. Distribution quantities per Work Unit can vary by project and are subject to change per the requirements of the project managers and scientists.



NB: www.wcgrid.org is a valid address which redirects to www.worldcommunitygrid.org
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 18 times, last edit by Former Member at Nov 14, 2011 7:59:03 AM]
[Feb 14, 2007 6:40:28 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Results Status page - HPF2 project

The Human Proteome Folding project - Phase 2 is unique. Instead of sending out 2 (or more) identical work units and expecting 2 (or more) identical results back, each work unit is computed slightly differently according to a random variable. The validator checks to ensure that the results are all similar though not identical. If a result is very different it is marked as an error. Currently we send out 19 copies of each work unit and expect to get 19 different results back. knreed discusses some of the details in this post: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=11960#102255
Some background and information on how HPF2 works on BOINC.

1) The goal of HPF2 is to generate about 33,000 structure predictions per gene. These predictions are then further analyzed by the researchers to identify the most likely structures for the protein. See the official description of the project for more info: http://homepages.nyu.edu/~rb133/Human_Proteome_Folding_Project.html

2) Each result returned by a computer contains between 10 and 35 predictions, depending on a rough estimate of how tough the predictions will be to compute for the gene.

So on a given gene, if we ask each computer to compute 25 structures, then it will take 1,320 results to generate the set of 33,000 structures. We could create one workunit for the gene and send it out enough times to make sure we get 1,320 results. However, given the way that BOINC works, this would create some inefficiencies in the system.

What we do is create a set of workunits that will generate the required number of results for the gene. Over the entire set of workunits for the gene, we need to generate an average of about 17.2 valid results per workunit. In this example, where 25 structures are being generated per result, this means that we will create 77 workunits for the gene - each of which needs to average 17.2 valid results with 25 structures per result. 77*17.2*25 = 33,110 structure predictions.

If we have a gene that is tougher and only 10 structures will be generated per result then we would create 192 workunits. 192*17.2*10 = 33,024 structure predictions.
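The workunit arithmetic in the two examples above can be reproduced with a short sketch (the function name and default values are illustrative, taken from the figures in this post):

```python
import math

def workunits_needed(structures_per_result,
                     target_structures=33000,
                     valid_results_per_workunit=17.2):
    # Structures produced per workunit = average valid results per
    # workunit * structures in each result; round up to cover the target.
    per_workunit = valid_results_per_workunit * structures_per_result
    return math.ceil(target_structures / per_workunit)

print(workunits_needed(25))  # → 77  (77 * 17.2 * 25 ≈ 33,110)
print(workunits_needed(10))  # → 192 (192 * 17.2 * 10 ≈ 33,024)
```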

In order to generate the average of 17.2 valid results per workunit, we only have three variables to modify:

1) Initial replicas sent out (initial_replicas)
2) The min quorum (min_quorum)
3) The min results in agreement within a quorum before accepting the result as valid (min_agreement)

The way that BOINC works is that for each workunit, BOINC will initially send out 'initial_replicas' copies of the workunit to be processed. As the results come back in, BOINC will wait until 'min_quorum' is reached before attempting to validate the workunit. During this phase, if a result is aborted or returned as an error, an additional copy will be sent out.

Once 'min_quorum' results are returned, validation is attempted. Validation is successful if at least 'min_agreement' results are determined to be valid. If there are not at least 'min_agreement' valid results, then all the results are marked 'INCONCLUSIVE' and an additional result is sent out. Each time an additional result is returned, validation is attempted again. Once we have 'min_agreement' valid results, the results are marked as valid or invalid as appropriate and credit is awarded.

Normally at this point BOINC would go ahead and 'assimilate' the result (this means to copy the results off to be returned to the researchers). However, we have modified the assimilator for HPF2 so that it waits until the last result is returned or the result with the latest deadline has missed its deadline. The assimilator then collects all valid results returned to pass to the scientists. Also, once the validation has been successful, no additional results will be sent out even if additional errors or invalid results are returned.

We currently have initial_replicas set to 19, min_quorum set to 15 and min_agreement at 13. This is yielding about 17.6 valid results per workunit. We want to keep it somewhat above the 17.2 target to ensure that we have the required set of structure predictions.

Edit: Minimum Quorum number adjusted. See BOINC: Minimum Quorum & Initial Distribution of Work Units (Tasks) FAQ for current distribution info.
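The quorum logic described above can be sketched as follows, using the parameter values quoted in this post (a toy model, not the actual server code; as the edit notes, the numbers have since been adjusted):

```python
def validation_step(results_returned, min_quorum=15, min_agreement=13):
    # results_returned: list of booleans, True = result judged valid.
    if len(results_returned) < min_quorum:
        return "waiting for quorum"
    if sum(results_returned) >= min_agreement:
        return "validated"
    # Not enough agreement: mark INCONCLUSIVE, send one more copy.
    return "inconclusive - send additional result"

print(validation_step([True] * 13 + [False] * 2))  # → 'validated'
print(validation_step([True] * 12 + [False] * 3))  # → 'inconclusive - send additional result'
print(validation_step([True] * 10))                # → 'waiting for quorum'
```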
----------------------------------------
[Edit 1 times, last edit by Sekerob at Mar 6, 2008 6:05:26 PM]
[May 22, 2007 8:48:20 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Credit computed for Quorum 2 - FightAIDS@Home, Discovering Dengue Drugs & Help Conquer Cancer

Preface: If the Zero Redundancy projects FightAIDS@Home and Discovering Dengue Drugs - Together require a second copy for verification, the credit rule is as per this post.

We have recently switched to a quorum of 2 for FAAH, meaning that BOINC will send out 2 copies of each work unit for the FightAIDS@Home project. If the results match, then no additional copies will be sent. knreed has posted the method for granting credit in http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=15592#120765
The standard BOINC way of awarding credit when there is a quorum of 2 is to simply award the lower of the claimed credits. This doesn't account for some computers which incorrectly claim low, for many reasons. As a result we are using a different policy.

If the two 'claimed' credits are similar, then the credit granted will be the average of the two values. However, if the two claimed credits are far apart, then BOINC will check the 'recent credit per cpu sec' value for each computer and compare it against the claimed credit/cpu time for the result. Whichever computer is closer to its recent average will be used to assign the granted credit for the quorum.

Kevin

Further, knreed posted about additional refinements related to the Quorum 2 Credit System Logic in the Discovering Dengue Drugs - Together forum equally applying to Help Conquer Cancer and FightAids@Home :
We have been doing some investigation. Since the upgrade to the new server code, there has been an increase in complaints about the credit awarded. We dug into this, found a problem with the update, and fixed that last Thursday. However, we have also looked deeper into some of the reported problems and found a more fundamental problem with the 'two result' credit granting strategy.

The way that credit is awarded in a quorum of two is that the two claimed credits are compared and if they are within 30% of each other, then they are averaged and the average value is granted. Over 85% of workunits have the granted credit determined this way.

If the two claimed credit values are further than 30% apart, then the code looks at a field in the database which stores the recent average credit granted per second for each computer. Whichever computer's claimed credit per second for the workunit is closer to its recent average credit granted per second has its claimed credit used as the credit granted for the workunit.

What we found was that there were a few computers that were extremely consistent about claiming very low so they always caused the workunit to check the recent average history. Because they were consistently claiming low and it was matching their average granted credit they were being selected as the credit to use for the granted credit. We determined that these computers were claiming low by looking at the history of computers they were paired with and seeing their history - and indeed those other computers had a much lower grant when paired with one of these computers.

As a result, we are going to change how the 2nd part of the process works. Instead of selecting the credit that is closest to its history, we will average the recent average histories of the two computers. We have been simulating the impact of this for the past couple of days, and it turns out that in a strong majority of cases the result cpu time * host recent average credit per cpu second is actually quite consistent between different computers, even if their claimed credits are further apart. This is what we had hoped to see, and as a result we will start to use this policy in the near future.

[Mar 20, 2008 1:11:36 AM]
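The quorum-of-2 credit rules quoted above (average the claims when they agree within 30%, otherwise fall back to the hosts' recent-average credit rates, refined to average both hosts' predictions) can be sketched as follows. All names are illustrative, and the exact definition of "within 30%" is an assumption here (measured against the larger claim):

```python
def quorum2_credit(claim_a, claim_b,
                   rac_per_sec_a, rac_per_sec_b,
                   cpu_sec_a, cpu_sec_b):
    # Rule 1: claims within 30% of each other -> grant their average.
    # (Here "within 30%" is measured against the larger claim.)
    if abs(claim_a - claim_b) <= 0.30 * max(claim_a, claim_b):
        return (claim_a + claim_b) / 2.0
    # Rule 2 (refined fallback): average the credit each host's recent
    # history predicts for this workunit, instead of trusting either claim.
    return (rac_per_sec_a * cpu_sec_a + rac_per_sec_b * cpu_sec_b) / 2.0

print(quorum2_credit(100, 90, 0.5, 0.5, 200, 180))  # → 95.0 (claims agree)
print(quorum2_credit(100, 30, 0.5, 0.5, 200, 198))  # → 99.5 (history fallback)
```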

----------------------------------------
[Edit 2 times, last edit by Sekerob at Aug 16, 2008 8:45:23 AM]
[Aug 9, 2007 2:29:02 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Credit computed for Nutritious Rice for the World

knreed explains it in http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=20230#165594
Each result returned by the project contains unique data. We have some additional information in the results that we will use to help us make sure that the very strong majority of results returned contain the correctly computed data.

We are starting out sending 19 results per workunit and validating when 10 are returned **. Each workunit will run for about 8 hours on each computer. A computer running this project will generate an additional structure prediction in about 90 seconds of execution time on an average computer.

This means that if you have a really powerful computer, you may generate 1000 structure predictions in 8 hours while a slower computer may only generate 160. We will be awarding credit by averaging the claimed credit per structure prediction, then awarding each result its # of structure predictions * the average claimed credit per structure prediction. So in this case, if the slower computer had a claimed credit of 80 and the larger computer had a claimed credit of 450, then the average credit per structure prediction would be 0.475. This means that the slower computer would be granted 76 credits and the faster computer would be awarded 475 credits.

The key point though is that all computers should have an estimated time to completion of around 8 hours.


** Due to the nature of the RICE computations, it can happen that parts of the initial distribution set will not be sent out until the first results have been returned. This ensures that all tasks compute unique structures, thus optimizing project efficiency.

Edit: Minimum quorum reduced from 14 to 10 on 2008/06/24
Edit: Added explanation of why not all jobs are sent out at the same time (**)
Edit: Current constant runtime is 7 hours.
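The credit calculation from the worked example above can be reproduced as a short sketch (illustrative names; not the actual server code):

```python
def rice_credit(results):
    # results: host -> (structures_computed, claimed_credit).
    # Credit per structure = average of each host's claim per structure;
    # each host is then granted structures * that average.
    per_structure = [claim / n for n, claim in results.values()]
    avg = sum(per_structure) / len(per_structure)
    return {host: n * avg for host, (n, _) in results.items()}

granted = rice_credit({"slow": (160, 80), "fast": (1000, 450)})
print(granted)  # → {'slow': 76.0, 'fast': 475.0}, matching the post
```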
----------------------------------------
[Edit 3 times, last edit by JmBoullier at Oct 13, 2009 3:38:26 AM]
[May 12, 2008 6:53:43 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Credit Method for Zero Redundancy distribution projects [AutoDock based]

knreed explains the new method of work distribution, validation and credit calculation for the zero/non-redundant DDDT [phase 1] and FA@H jobs (now also HFCC, CEP2, GFAM, SN2S, DSFL & C4CW): http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=21348#176125

We have just started the second part of the beta test for single validation for FightAIDS@Home and Discovering Dengue Drugs - Together.

Single validation is going to work in the following way:

Work units are loaded into BOINC with a quorum size of 1. This means that 1 replica is created and it will only take 1 successfully run result in order for validation to be attempted on the result.

However, there are a few checks in place:

1) The server maintains a value that serves to inform us of how 'reliable' a given computer is. When a result is assigned to the computer for a work unit with a quorum of 1, then the reliable measurement is checked to make sure that the computer is sufficiently reliable. If it is, then all is good. If it isn't, then the work unit is changed to have a min_quorum of two and a second copy is sent.

2) When validation is attempted, the value for the host is checked again. If the value has fallen below the required level, then the result is marked PENDING VERIFICATION and another result is sent. **

3) Additionally, during validation, there is a certain random chance that the result will be flagged to be checked again. Any result picked in this case will be marked INCONCLUSIVE until the validation with the additional result occurs. All computers are subject to random checking. **

4) We have also added some additional checks within the research apps to detect errant results. Part of this is a short run of the application that computes a known result (this was part of what we ran last week). This short run will be used to help ensure that the computation ran correctly, and it will also be used to determine the appropriate credit to award. *

When a single redundancy result is returned, the mini work unit needs to match up with the value computed during the beta test (otherwise the result is marked invalid and the work unit is sent to someone else).***


* Point 4 means that the granted credit might differ from the claimed credit even for a single unit without any redundancy.

** In case of Pending Verification, when a second copy is needed for re-validation purposes, the credit rule reverts to the old method of taking the average of the 2 claims in the quorum, or applying the exception rule for outliers, normalizing the credit award. When an Error/Invalid result occurs, at least 25 subsequent results must validate in sequence before single-redundancy distribution for the host is resumed for the respective science [as of Mar. 2013]

*** In the case of "Invalid", when a second or third copy is needed for make-up/repair/verification purposes, the credit rule reverts to the old method: the invalid result receives half of the credit awarded to the valid result if it does not compare correctly against the main result, and full credit if the first result proves valid after all. If invalid, the half-credit award is capped by the original claim of the invalid task. E.g., if the valid result is granted 100 and the invalid one 'claimed' 25, then the invalid one gets only 25 [what it wanted for the work it processed].
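The make-up credit rule in the *** note can be stated compactly (a sketch with illustrative names, covering only the invalid-result case):

```python
def repair_credit(valid_credit, invalid_claim):
    # The invalid result gets half of the valid result's credit,
    # capped by what it originally claimed for the work it processed.
    return min(invalid_claim, valid_credit / 2.0)

print(repair_credit(100, 25))  # → 25 (claim is below the half-credit cap)
print(repair_credit(100, 80))  # → 50.0 (half of the valid result's credit)
```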
----------------------------------------
----------------------------------------
[Edit 10 times, last edit by Former Member at Jul 7, 2013 8:25:46 AM]
[Aug 4, 2008 5:38:11 PM]