Thread Status: Active
Total posts in this thread: 76
This topic has been viewed 59833 times and has 75 replies
martin64
Senior Cruncher
Germany
Joined: May 11, 2009
Post Count: 445
Status: Offline
Parents, children, grandchildren WUs - how does it work?

Having looked through various threads, I still couldn't find an explanation of how the process with the large HCMD2 WUs works. Here is what I think I have understood so far:

1. All WUs have a distribution of 2, and a quorum of 2
2. The WUs' runtime is hard to predict, resulting in some rather long runtimes
3. Due to the nature of the WUs calculating "positions", these long WUs can be split by the WUs. That is, if the WU's run time reaches 6 hours, it will continue if the estimated progress is at least 60%; otherwise it is stopped. It will be stopped at 12 hours anyway.
4. The work left over is then distributed again, where the "children" sort of "inherit" the rest of the WU and continue where their "parents" stopped.
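
The cutoff rule in point 3 can be sketched roughly as follows. This is only an illustration of the rule as I understand it, not the actual client code: the function name `should_stop` and its parameters are hypothetical, and the 6 hour, 12 hour and 60% values are simply the ones quoted above.

```python
def should_stop(hours_elapsed, progress,
                soft_limit=6.0, hard_limit=12.0, min_progress=0.60):
    """Return True if the task should stop and report what it has finished.

    Hypothetical sketch: `progress` is the estimated fraction done (0.0-1.0).
    """
    # The hard limit ends the task no matter how far it has progressed.
    if hours_elapsed >= hard_limit:
        return True
    # Past the soft limit, keep going only if enough progress has been made.
    if hours_elapsed >= soft_limit:
        return progress < min_progress
    # Before the soft limit the task always keeps running.
    return False
```

For example, a task at 6 hours with 50% estimated progress would stop, while one at 6 hours with 60% would run on until it finishes or reaches 12 hours.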

Now to the stuff I haven't understood:

Client-side termination (or better: truncation) of a WU in a quorum-2 environment means that the number of positions I have calculated differs from what my wingman has calculated. So how is the validation done? Of course, if my WU is the "shorter" one, the common positions can be validated against each other, so my result will be valid. But what about the "longer" one, whose extra positions have no second result to validate against? Do we assume that the rest is likely to be correct if part of the result has been validated? Does the "child" WU start at the first position that was not calculated, or at the first position that could not be validated?

Regards,
Martin
----------------------------------------

[Oct 22, 2009 11:36:55 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

I could guess but I'll leave this beauty for knreed to answer, the one who devised the algorithm now being out east of the Spanish plain in an annual BOINC workshop.

Validation is done on the same [common] positions each has done. The extra positions by the faster device are assumed to be okay, as far as I remember reading.

Where a child starts off, don't recollect. From the highest position completed in a quorum or the last one matched? If there is a 100% match requirement it would be the lowest, implying an amount of redundancy. The increased number of homogeneity groups for this project i.e. P3 in P3 group, P4 in P4 group etc already couples devices of similar capability to reduce that redundancy, if so.

Aside, think there is enough statistical data to assign a confidence level... for instance is the device rated as reliable?

As said though, I'll leave the techs to answer the intricacies, as I'm quite puzzled too.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Oct 22, 2009 12:07:42 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

I've been wondering about this, too.
Associated with the method of validation is the question of what positions does the wingman get when I get a child or later-generation WU?

Here is a scenario. I forget how many total positions there are in HCMD2 WUs, but for the example, I'll assume 600.
Starting with a virgin WU, cruncher A completes all 600 positions.
On the 2nd stream, cruncher B does 200.
I'm cruncher C and I get positions 201-600.
What does my wingman (D) get? My Results Status says that he always gets a WU with the same name as mine, except for the last digit, which is the number of the copy. If he really gets positions 201-600, this means that these positions will be crunched 3 times. If C and D do not both complete their 400 positions, yet another duplication of crunching will be added. Et cetera. Such a system would be very wasteful.

Or are cruncher D and his WU imaginary, with his WU made up of positions extracted from result A? The latter scheme would be most efficient, and also simplest to implement. A new WU would be split into 2 streams. Each stream would be split into as many real WUs as needed to crunch all positions. The initial, parent WUs would be real and identical, while most other wingman WUs would be imaginary. For any WU returned, when all corresponding positions in the opposite stream have been crunched, the imaginary wingman WU could be synthesised if necessary and validation could proceed.
Comments?
----------------------------------------
[Edit 1 times, last edit by Rickjb at Oct 26, 2009 8:31:31 AM]
[Oct 26, 2009 8:26:23 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Let's first have knreed explain this; part of your question is a paraphrase of what I already said and what martin64 already asked.

Theoretically a single child result could be run to compute the missing matches of the parents, but then what would it be validated against? What if the device that does the spare positions is not in the same homogeneity group? You still need a second result in the quorum to get any validation at all... probably a reason why having the complete 5000 parents' positions done takes longer, as WCG will want to determine which to repackage into the child and grandchild.

edit: 5000 parent pairs:

http://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,25861

When WCG upped the initial cut off time from 4 hours to 6 hours, there was a huge reduction in results... the mean project run time went from 3 hours to 4.6 hours, so do not know if the 15000 descendants is still a valid number.
----------------------------------------
[Edit 1 times, last edit by Sekerob at Oct 26, 2009 9:36:09 AM]
[Oct 26, 2009 9:18:00 AM]
Mysteron347
Senior Cruncher
Australia
Joined: Apr 28, 2007
Post Count: 179
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Well, I'll join the puzzled club.

Suppose we have IDENTICAL machines which process 40% of a WU in the allotted time.

The second generation would appear to be FOUR tasks, starting at the 40% mark and completing to 80%

We'd then have a third generation, with EIGHT tasks.

There would be massive wasted effort in this case.

Suppose we have one machine twice as fast as the others so that the first generation returned results at 40% and 80%.

The next generation could be TWO tasks started at the 40% mark, wasting the extra processing by the faster processor and effectively limiting the processing speed to the SLOWER of the processors selected.

OR the next generation could start at the 80% mark, violating the entire concept of matching, and hence unlikely.

Any other way would seem extremely complex to implement - perhaps trying to combine the 40% processed only once with the 20% totally unprocessed and split it between two tasks. Very ugly - and does not seem to fit with the numbering system for later generations (whereas the start-at-40% and start-at-80% scenarios would - both next-generation tasks carry the same number, bar the replication number.)

And it can't be the case that a task is simply completing a partially-completed task, as they ALWAYS appear in pairs.

If only the incomplete part was being passed on, every generation beyond the first would have a unique start-position number as part of its task designation.

I regret that the "homogeneity group" argument also doesn't hold water. knreed is concerned in that thread about the difference between SSE2 and non-SSE2 processors. From my own results, I have instances where my crunching partners have apparently taken between 36% and 142% of the time I took to crunch what appear to be identical tasks. This means that within my "homogeneity group" there is nearly a 4:1 speed ratio (142/36 is about 3.9) - which must be far greater than the SSE2/no-SSE2 scenario.

I theen' someone has some 'splainin' to do...
[Oct 26, 2009 7:06:37 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Parent is singular in meaning: 1 parent (in quorum 2), 1 child (in quorum 2) etc. It's not exponential, so originally with the shorter run times 5000 parent tasks generated 5000 children, generating 5000 grandchildren, then even great-grandchildren. Those I've not seen for a long time.

Homogeneity is matching CPUs of similar features. Where other projects are just matched Windows/Windows, with HCMD2 there's further specialization, the consequence being more equal run times. Yes, those can still show a substantial spread, but still less than when matching a P3 with an i7-920.
----------------------------------------
[Edit 1 times, last edit by Sekerob at Oct 26, 2009 7:18:58 PM]
[Oct 26, 2009 7:16:11 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

"... then even great-grandchildren. Those I've not seen for a long time." The greats aren't so unusual. They get a short return date, so you have to be a fast returner to get them:
> CMD2_0132-ARAF.clustersOccur-3BT2_A.clustersOccur_0_17879_22158_21086_22158_21655_21907_0 | rjb-a64x2 | Pending Validation | 26/10/09 09:11:25 | 26/10/09 17:25:03 | 0.14 | 2.2 / 0.0
It's the great-great grandkids that are unusual.
[Oct 26, 2009 8:14:03 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

On what platform/CPU combo do you get those GGCs? Maybe it's driven by those homogeneity groups; I suspect a parent and its descendants remain in the same family.

Vaguely I remember the deadlines are shorter on those GGCs, else the whole sequence would take way too long. Even with a return time of less than 48 hours, I have not been getting them for quite a few days, even though the short-deadline tasks do come through here.

Today BTW had a child finishing in ~9.5 hours and few days before a parent ending in ~3.5 hours so the variety is wide.
----------------------------------------
[Oct 26, 2009 8:39:22 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Example: Parent workunit is set up to compute 1-20,000.

Parent Replica 0 computes 1-5,500 in 6 hours
Parent Replica 1 computes 1-5,000 in 6 hours

Validation occurs on structures 1-5,000. Structures 1-5,000 are saved. Credit is awarded to parent replica 0 and 1 based upon the average credit per structure (thus replica 0, with 5,500 structures against 5,000, is awarded 10% more credit than replica 1).
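
As a rough illustration of that step (a sketch only, not the actual back-end code; the function names and the flat per-structure credit are assumptions made for the example), each replica computes structures 1..N, so only the range both replicas reached can be compared:

```python
def validate_pair(done_a, done_b):
    """Return (validated, surplus) for two replicas that each computed 1..done_x.

    Hypothetical helper: `validated` is the last structure both replicas
    reached; `surplus` is how many structures only one replica computed.
    """
    validated = min(done_a, done_b)
    surplus = abs(done_a - done_b)
    return validated, surplus


def credit_for(done, avg_credit_per_structure):
    """Credit proportional to structures computed (assumed flat rate)."""
    return done * avg_credit_per_structure


validated, surplus = validate_pair(5_500, 5_000)
print(validated, surplus)  # 5000 500
```

With the numbers from the example, structures 1-5,000 are validated and 500 structures were computed by replica 0 alone.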

Since child workunits are required, the back-end code determines that the most structures that should be computed by a child workunit will be those that could be computed in the 6 hour basic limit by an 'average' computer. This results in the following new workunits.

Workunit A: 5,001-8,750
Workunit B: 8,751-12,500
Workunit C: 12,501-16,250
Workunit D: 16,251-20,000

This process then repeats if necessary.
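
A minimal sketch of how such a split could be planned, assuming the remaining structures are simply cut into consecutive ranges with a fixed cap per child. The function name `plan_children` is hypothetical, and the cap of 3,750 structures is only inferred from the ranges in the example; how the back end actually estimates what an 'average' computer can do in 6 hours is not described here.

```python
def plan_children(total, validated, max_per_child):
    """Split structures validated+1..total into consecutive child ranges.

    Hypothetical sketch: each range holds at most `max_per_child` structures.
    Returns a list of (first, last) tuples, inclusive on both ends.
    """
    ranges = []
    start = validated + 1
    while start <= total:
        end = min(start + max_per_child - 1, total)
        ranges.append((start, end))
        start = end + 1
    return ranges


print(plan_children(20_000, 5_000, 3_750))
# [(5001, 8750), (8751, 12500), (12501, 16250), (16251, 20000)]
```

If everything had already been validated, the list is empty and no child workunits are needed.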

For a set of 7 batches that finished yesterday, they had the following distribution of parents, children.....

8372 Parents (26.9%, cumulative: 26.9%)
18248 Children (58.7%, cumulative: 85.6%)
3943 Grandchildren (12.7%, cumulative: 98.3%)
507 Great-grandchildren (1.6%, cumulative: 99.9%)
25 Great-great-grandchildren (0.1%, cumulative: 100.0%)
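
The percentages can be rechecked from the raw counts with a few lines (the variable and function names are just for illustration). The five counts add up to 31,095 workunits.

```python
counts = {
    "Parents": 8372,
    "Children": 18248,
    "Grandchildren": 3943,
    "Great-grandchildren": 507,
    "Great-great-grandchildren": 25,
}


def distribution(counts):
    """Return (name, percent, cumulative percent) rows, rounded to 0.1."""
    total = sum(counts.values())
    running = 0
    rows = []
    for name, n in counts.items():
        running += n
        rows.append((name,
                     round(100 * n / total, 1),
                     round(100 * running / total, 1)))
    return rows


for name, pct, cum in distribution(counts):
    print(f"{counts[name]:>6} {name} ({pct}%, cumulative: {cum}%)")
```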

While most of the children represent 'splits' (i.e. 2 or more children are created), the majority of the grandchildren are 'finishers' (i.e. only 1 additional workunit was created to finish off the workunit).

If a descendant is required, then the difference in structures between what one host computes and what the second one computes is discarded.

It is important to note though, that most workunits have no descendants. Those that do generally have a small number of structures that are computed by one host and not the other. There is a very small percentage of work that is 'lost' due to this technique.
[Oct 26, 2009 9:17:21 PM]
mreuter80
Advanced Cruncher
Joined: Oct 2, 2006
Post Count: 82
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Thanks knreed for the information.
It makes very good sense to me. However, I hope you do some performance matching. Otherwise you will see some results like this one:
CMD2_0139-2A5AA.clustersOccur-1RW6_A.clustersOccur_122_1-- | 614 | Valid | 10/18/09 03:50:57 | 10/19/09 14:56:10 | 12.00 | 181.0 / 217.7 <--- mine
CMD2_0139-2A5AA.clustersOccur-1RW6_A.clustersOccur_122_0-- | 614 | Valid | 10/18/09 03:47:51 | 10/24/09 14:33:09 | 6.02 | 35.6 / 30.5


Well, I know this result is very unusual, but it still makes me wonder whether roughly 10 hours went down the drain for nothing.
Don't get me wrong, I believe this is a good system to handle the unpredictable running time of the WUs - I just wanted to mention that such odd situations exist.
----------------------------------------
[Edit 3 times, last edit by mreuter80 at Oct 26, 2009 9:47:44 PM]
[Oct 26, 2009 9:38:42 PM]