Total posts in this thread: 76
This topic has been viewed 59824 times and has 75 replies.
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Parents, children, grandchildren WUs - how does it work?

Rick ... how about you publish when there is enough data to be statistically significant? Right now I only see 1 or 2 examples of how it is not working, but no statistics on how often it does work. If the working sets from the samples you posted are in the 100s, then the 1 or 2 bad runs do not amount to much. I am all for the most efficient crunching possible (I would love to see unnecessary WUs server-aborted once a quorum is met, for any project), but knreed has made a substantive improvement; let's see where the numbers bring us once things settle in. Maybe if knreed needs another *break*, and if there is a quick way to see whether the single and double "split" stats have been reduced at both the 6 and 12 hour limits, we could get a better idea of the efficiencies the new methodology has provided. On the other hand ... maybe we don't want to see how inefficient it was before :-)
[Nov 2, 2009 12:12:03 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Re: Parents, children, grandchildren WUs - how does it work?

@Snow Crash: The guys who can best assess the statistics are the techs, since they have access to all of the data. I wanted to let them know it was happening, but I waited until I had a 2nd example of a single-split so that no-one could say I just had a one-off event.
FWIW, the machine that crunched these 2 single-split WUs returned 14 HCMD2 WUs from the time the New System came into effect, up to and including the 2nd single-split. One WU that stopped at 6.01h is still PV.
[Nov 2, 2009 2:43:15 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Parents, children, grandchildren WUs - how does it work?

Something to keep in mind is that we had to write the code for this so that workunits cannot get 'stuck' as is described in this thread: http://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,27826

In that case it took 2 days for the wingman to find a matching host.

The way the code works is that it first scans the work array using 'tight' criteria. If enough work is found, then all is good. If not enough workunits are found, then a 'loose' search is made. If there is still not enough work, a final search is made that doesn't do this matching at all.

Even with this logic in place, these additional rules can slow down the pace at which a 'match' can be identified for a host.

Currently we have broken the 'power' metric and the turnaround-time metric into 10 segments each. Each segment contains the same amount of the overall grid's computing within it. On each request, a host is rated and assigned to one of the 10 segments for the 'power' metric and one of the 10 for turnaround time. Between these two criteria, there are 100 'cells' where matching is performed (think of a 10x10 table).

In a 'tight' search, the code will match a workunit to a host if the host is in the same cell as the workunit or if the host is adjacent to the cell of the workunit. (9 out of 100 cells - except on boundaries)

In a loose search, the workunit and host can be up to two cells apart (25 out of 100 cells - except on boundaries).

The issue is that the 10x10 grid doesn't have equal power in each cell. Cells with long turnaround and high power metrics or short turnaround and low power metrics are going to have fewer machines. We are watching but it might be necessary to account for this issue by expanding the search when we are dealing with these cells.
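The three-pass matching described above can be sketched in a few lines. This is hypothetical code, not the actual WCG scheduler; the cell layout, the tight/loose distances, and the final unrestricted pass follow the description in the post.

```python
# Hypothetical sketch of the cell matching: hosts and workunits each
# live in a 10x10 grid of (power decile, turnaround decile) cells.
# A 'tight' pass accepts matches within 1 cell, a 'loose' pass within
# 2 cells, and a final pass ignores the cells entirely.

def cell_distance(a, b):
    """Chebyshev distance between two (power, turnaround) cells."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def find_work(host_cell, work_array, needed):
    """Three scans with widening criteria: tight, loose, unrestricted."""
    for max_dist in (1, 2, None):
        matches = [wu for wu in work_array
                   if max_dist is None
                   or cell_distance(host_cell, wu["cell"]) <= max_dist]
        if len(matches) >= needed:
            return matches[:needed]
    return matches  # whatever the final unrestricted pass found

# An interior host sees 9 of the 100 cells in a tight search (a 3x3
# block) and 25 in a loose one (a 5x5 block), as stated in the post.
```

On the grid's edges the 3x3 and 5x5 blocks are clipped, which is the "except on boundaries" caveat above.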
[Nov 2, 2009 3:20:06 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Parents, children, grandchildren WUs - how does it work?

Count | % difference in granted credit
 2437 |  0.0
  208 |  5.0
  128 | 10.0
   63 | 15.0
   49 | 20.0
   85 | 25.0
   69 | 30.0
   50 | 35.0
   49 | 40.0
   36 | 45.0
   19 | 50.0
   16 | 55.0
   15 | 60.0
    9 | 65.0
    8 | 70.0
    6 | 75.0
    1 | 80.0
    1 | 85.0
    1 | 90.0

75% of HCMD2 results have no difference in granted credit.
85% of HCMD2 results have a difference of 10% or less in granted credit.
----------------------------------------
[Edit 1 times, last edit by knreed at Nov 2, 2009 5:03:44 PM]
[Nov 2, 2009 5:03:33 PM]
Mysteron347
Senior Cruncher
Australia
Joined: Apr 28, 2007
Post Count: 179
Re: Parents, children, grandchildren WUs - how does it work?

Not sure why multiple searches are required.

If you build lists of candidate WUs keyed by MAX(ABS(difference in columns), ABS(difference in rows)), then:
1. If you reach enough_units in list_0, you have a list of units to allot.
2. If you reach the end of the pool, then allot units from list_0, list_1, list_2, ... until enough_units are allocated.

You could even use a speed-up to reduce the list-build time by observing that once enough_units have been recorded across all lists, there's no point in appending any more to list_9; once 2*enough_units have been recorded, don't add to list_8, and so on.

IOW, each time another enough_units have been transferred to the lists, decrement the max_target_list_number. Naturally, don't bother adding to a list if list_n.count >= enough_units.

(I say lists, but arrays would work just as well....)
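A sketch of this single-pass bucketing, with the names (enough_units, list_0 ... list_9, max_target_list_number) taken from the post. Hypothetical code, not the actual scheduler:

```python
# Bucket each candidate WU by its cell distance from the host, then
# allot from the nearest buckets outward. Includes both pruning rules
# from the post: a per-list cap, and shrinking the farthest open list
# each time another enough_units have been banked overall.

def allot(host_cell, pool, enough_units):
    max_dist = 9                       # max_target_list_number
    lists = [[] for _ in range(10)]    # list_0 .. list_9, by distance
    total = 0
    for wu in pool:
        d = max(abs(host_cell[0] - wu["cell"][0]),
                abs(host_cell[1] - wu["cell"][1]))
        if d > max_dist or len(lists[d]) >= enough_units:
            continue                   # the post's pruning rules
        lists[d].append(wu)
        total += 1
        # decrement the max target list number as units accumulate
        while max_dist > 0 and total >= (10 - max_dist) * enough_units:
            max_dist -= 1
    chosen = []
    for bucket in lists:               # allot from list_0 outward
        for wu in bucket:
            if len(chosen) == enough_units:
                return chosen
            chosen.append(wu)
    return chosen
```

The effect is one scan of the pool instead of three, at the cost of keeping up to 10 partial lists in memory.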
[Nov 2, 2009 7:09:48 PM]
Randzo
Senior Cruncher
Slovakia
Joined: Jan 10, 2008
Post Count: 339
Re: Parents, children, grandchildren WUs - how does it work?

knreed you do a great job thanks a lot
[Nov 3, 2009 9:39:45 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Re: Parents, children, grandchildren WUs - how does it work?

We are still wasting a lot of crunching time with HCMD2, due to devices hitting the 60%/6h barrier.
Deselecting HCMD2 in my device profiles is not a satisfactory idea because, if I do that, my share of HCMD2 will go to other crunchers, who will waste some of their time crunching the HCMD2 WUs I've left instead of helping me crunch other projects.
I have, however, deselected HCMD2 for the machines where device speed-matching is very poor, because the wastage is worst for these.
That leaves my 2.5GHz AMD, for which speed-matching is much better - mostly.
Some recent examples of wastage for the AMD:
Name | My Time / Wingman's Time (My Credits Awarded / His Credits Awarded) | Credits Wasted / Time Wasted
CMD2_0148-TPM1A.clustersOccur-1Z2C_D.clustersOccur_260_2 | 6.01 / 6.01 (87.0 / 51.5) 28.5 / 1.97h (both devices hit 60/6 barrier)
CMD2_0164-MYH2A.clustersOccur-2RHK_C.clustersOccur_11_1 | 9.54 / 6.00 (130.1 / 52.7 ) 77.4 / 5.68h (one device hit 60/6 barrier - a "single-hit")
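The single-hit figures above can be reproduced with a little arithmetic. A sketch, assuming granted credit is proportional to structures computed, so everything the faster device did beyond the wingman's stopping point is wasted:

```python
# Reproduce the single-hit wastage from the AMD example: the faster
# device ran 9.54 h for 130.1 credits, the wingman stopped at the
# 6-hour barrier with 52.7 credits' worth of structures done.

def single_hit_wastage(fast_time, fast_credit, slow_credit):
    """Credit and time the faster device spent past the wingman's stop."""
    wasted_credit = fast_credit - slow_credit
    wasted_time = fast_time * wasted_credit / fast_credit
    return wasted_credit, wasted_time

wc, wt = single_hit_wastage(9.54, 130.1, 52.7)
print(round(wc, 1), round(wt, 2))  # 77.4 credits, 5.68 hours wasted
```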
Would the following modifications to the current scheme be feasible?
First, some observations:
- Wastage occurs whenever a device hits the 60/6 barrier.
- Most hits on the 60/6 barrier are with 1st-generation WUs.
- The vast majority of 1st-generation WUs that I get, hit the barrier.
- Wastage is greater when speed mismatch is greater, and is worst by far with single-hits.
- The probability of a single-hit increases with mismatch. Hence wastage increases more than linearly with mismatch.
Suggestion 1:
Shorten the 1st-generation WUs that are sent, to increase the % that run to completion. Do not send the full number of structures, just enough to get a good estimate of the amount of work per structure - I am thinking of about 25% of the full job. Do not send them to the slower crunchers, as that would increase the proportion of 60/6 hits. Do not send them to the very fast devices unless the completion rate is very high, because speed-matching is poor among sparsely-populated speed classes.
Then, having determined the amount of work per structure, carve up the remaining parts of the WUs into descendant WUs tailored to the various device classes, with lengths calculated so that "all" devices in the target class can clear the 60/6 hurdle.
There would be some increase in server load :-( but there would also be an increase in cruncher happiness :-)

Suggestion 2:
Whether this would help depends on the proportions of the total wastage in the project that are coming from single-hits versus double-hits.
For double-hits, wastage would be reduced by cutting the work off earlier, eg move the hurdle to 30%/3h (for 10h-max WUs).
However, this would increase the wastage being caused by the single-hits.
Theoretically it would have no effect on the proportion of WUs that get barrier hits, or the ratio of single- to double-.
For single-hits, wastage would be minimised by letting the slower device crunch for as long as possible, since everything that the faster device does beyond where the slow device quits is wasted.
To minimise wastage by single-hits, move the 60%/6h hurdle to 100%/10h !!
I don't know where the optimum % value lies, but knreed's New (-ish) System is likely to have changed it. And unless 60% was chosen as a result of doing the maths, it is unlikely to be the best value to minimise wastage.
Perhaps the statisticians among the Project Scientists should exercise their black art skills.

I'd also vote for longer HCMD2 times if it helped reduce wastage.
[Nov 20, 2009 10:24:52 AM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Parents, children, grandchildren WUs - how does it work?

I just looked at results returned over the past 24 hours for HCMD2. Given the new strategy, the following is occurring:

88% of results have no difference
91% of results have 10% or less difference
95% of results have 40% or less difference
[Nov 20, 2009 10:02:23 PM]
Mysteron347
Senior Cruncher
Australia
Joined: Apr 28, 2007
Post Count: 179
Re: Parents, children, grandchildren WUs - how does it work?

Hmm - so it appears that we have improved from 75% "no difference" to 88% simply by knreed's having a break from administration work.

Obviously then, knreed should have more such breaks ;-)

Bearing in mind knreed's comment
It would require substantial re-writing of BOINC to be able to do as you are saying (i.e. after the two results are returned, send out a third that only completes the work not done by the shorter). I agree that this would be the best - but it simply wasn't feasible.


But it does seem possible to create arbitrary-length replicas, given
Example: Parent workunit is set up to compute 1-20,000.

Parent Replica 0 computes 1-5,500 in 6 hours
Parent Replica 1 computes 1-5,000 in 6 hours

Validation occurs on structures 1-5,000. Structures 1-5,000 are saved. Credit is awarded to parent replica 0 and replica 1 based upon the average credit per structure (thus replica 0 is awarded 11.1% more credit than replica 1).

Since child workunits are required, the back-end code determines that the most structures that should be computed by a child workunit will be those that could be computed in the 6-hour basic limit by an 'average' computer. This results in the following new workunits.

Workunit A: 5,001-8,750
Workunit B: 8,751-12,500
Workunit C: 12,501-16,250
Workunit D: 16,251-20,000


((aside - not sure why 3,750 was chosen for the size here and not 5,000 - but perhaps because it's an example))


Now - not trying to trip anyone up here, just attempting to address a perceived problem - could the returned result unit be split?

The scenario I envisage would thus be:

Parent Replica 0 computes 1-5,500 in 6 hours
Parent Replica 1 computes 1-5,000 in 6 hours

Validation occurs on structures 1-5,000. Structures 1-5,000 are saved.

New Workunits are created:

Workunit A: 5,001-5,500
Workunit B: 5,501-9,125
Workunit C: 9,126 -12,750
Workunit D: 12,751-16,375
Workunit E: 16,376-20,000

And results 5,001 - 5,500 logged as a returned result for fastercruncher so that only the second replica of workunit A is physically despatched.

In Rickjb's single-hit scenario where Rickjb has crossed the 60% Rubicon at 6 hours and proceeds to completion at 9.54 whereas Rickjbs_wingman hits the 6hr barrier at 59.9% (on a nominal 10,000-position unit, just to make the maths easier) this would be:

Parent Replica 0 computes 1-10,000 in 9.54 hours
Parent Replica 1 computes 1-5,990 in 6 hours

Validation occurs on structures 1-5,990. Structures 1-5,990 are saved.

New Workunits are created:

Workunit A: 5,991-10,000

And results 5,991 - 10,000 logged as a returned result for Rickjb so that only the second replica of workunit A is physically despatched.


Even if Workunit A hits the 6-hr barrier because it lands up on a tortoise,
having processed say 3,000 positions, then the same mechanism is invoked;

Workunit A Replica 0 computes 5,991-10,000 in 0 hours (Rickjb's overage)
Workunit A Replica 1 computes 5,991-8,990 in 6 hours

Validation occurs on structures 5,991-8,990. Structures 5,991-8,990 are saved.

New Workunits are created:

Workunit B: 8,991-10,000

And results 8,991 - 10,000 logged as a returned result for Rickjb so that only the second replica of workunit B is physically despatched.


Result: No wasted crunched results, ever.

Sure - I can see that there would be more work for the server in splitting the returns and handling the small replicas generated - but those short jobs would seem ideal to send to slower machines by preference.
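The split-on-return bookkeeping described above could be sketched as follows. This is hypothetical code, not the actual back-end: validate the overlap, bank the faster replica's overage as a ready-made first result, and carve the rest into fixed-size child workunits.

```python
# Split a returned pair of replicas: the slower host covered
# 1..slow_end, the faster host covered 1..fast_end (fast_end >=
# slow_end), out of a parent spanning 1..total_end.

def split_on_return(total_end, fast_end, slow_end, child_size):
    """Return (validated range, overage WU, remaining child WUs)."""
    validated = (1, slow_end)            # both replicas covered this
    overage = (slow_end + 1, fast_end)   # already crunched once
    children = []
    start = fast_end + 1
    while start <= total_end:
        end = min(start + child_size - 1, total_end)
        children.append((start, end))
        start = end + 1
    return validated, overage, children

# Rickjb's single-hit case: 10,000 structures, the fast host finished,
# the wingman stopped at 5,990 -- only 5,991-10,000 needs one replica.
print(split_on_return(10_000, 10_000, 5_990, 3_750))
# -> ((1, 5990), (5991, 10000), [])
```

Only the second replica of the overage range needs dispatching, since the faster host's work already counts as the first; that is the "logged as a returned result" step in the scenario.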
[Nov 22, 2009 7:32:53 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Re: Parents, children, grandchildren WUs - how does it work?

For those who are interested, this crunching diagram may help to visualise the WU crunching rules:

Lines A to G show various crunching graphs.
D and E show non-linear progress, the others show linear progress.
C is the limiting case of 60% progress at 6 hours. WUs that crunch linearly above this line (A and B) can progress to completion. WUs that crunch below it (F and G) hit The Wall.
D and E have made more than 60% progress at the 6h decision-point, so they are allowed to continue, but they slow down and miss the 10h completion target. They are allowed an extra 2h, which enables D to finish, but E hits the 12h absolute limit. D and E are very rare.
Whenever a WU hits the 6h (or 12h) barrier, the difference in time/progress between it and its wingman's WU is wasted. Now consider G's other copy. All of its progress above G's finishing progress value (dotted line) is wasted. If it was A, B or D ("single-hit"), 40% plus G's 6h shortfall from 60% is wasted. If the wingman got F ("double-hit"), the maximum wastage is G's 6h shortfall.

I suggest you imagine the effects of moving the various limiting parameters of the chart, eg moving the wall to the right or changing its relative or absolute height, or changing the 10-hour target maximum time.
The 3 independent parameters are:
. W = position of the Wall [6 hours]
. M = absolute Maximum CPU time [12 hours]
. P = Percentage completion at top of Wall [60%]
The extrapolated Target time T to complete a WU that just clears the Wall [10 hours] is dependent on W and P:
. T = 100 * W / P
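The parameter relationships above can be written out directly (a sketch using the bracketed values from the post):

```python
# The three independent parameters of the crunching rules, and the
# derived target time T = 100 * W / P.

W = 6    # position of the Wall, in hours
M = 12   # absolute Maximum CPU time, in hours
P = 60   # Percentage completion required at the top of the Wall

T = 100 * W / P   # extrapolated target completion time: 10.0 hours

def survives_wall(progress_pct_at_wall):
    """A WU continues past the Wall only if it has reached P percent."""
    return progress_pct_at_wall >= P

def hits_absolute_limit(cpu_hours):
    """No WU runs past the absolute maximum, even if it cleared the Wall."""
    return cpu_hours >= M
```

With W = 6 and P = 60 this gives T = 10 hours, matching the 10-hour completion target in the diagram; lines D and E are WUs that clear the Wall but overshoot T, up to the M = 12 limit.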
----------------------------------------
[Edit 3 times, last edit by Rickjb at Nov 24, 2009 8:35:15 AM]
[Nov 23, 2009 7:33:20 AM]