rbotterb
Senior Cruncher
United States
Joined: Jul 21, 2005
Post Count: 401
Re: No Tasks Available [RESOLVED]

I picked up two new 7.28 MCM1 WUs about 2 hours ago, so it looks like something is flowing through. At the time I got these two new ones, though, I didn't have any other MCM1 WUs in my cache....
[Dec 19, 2013 1:57:09 AM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: No Tasks Available [RESOLVED]

I am a little confused as to how this works. If I read rightly, there is a large amount of short-deadline, high-priority work that needs to be worked through.

New devices are being restricted to downloading one task per virtual core to keep them from grabbing all these high-priority tasks. After 3 hours, this restriction is lifted.

I can see how this will overload the server, as there will be more server requests when the short-deadline tasks report more often.

However, is there a way of staggering these out so that we can fetch work? I'm not demanding, just asking. Was the backlog that large?


Let me work backwards on this:

We send and receive data to/from the researchers in batches. While a batch is in progress but does not yet have all of its workunits completed, we have to store it on our system, and the researchers don't get the benefit of the data until the whole batch is completed.

The Mapping Cancer Markers batches have 10,000 workunits per batch. Within each batch there are what we call "hard luck workunits". These are workunits where the first couple of assignments go to computers that disappear (the owner goes on vacation, they reformat their computer, they uninstall, etc.), so it takes 7 days before the next copies are sent out. If left to its own devices, one of those next copies might go to someone with a large cache who happens to accidentally unplug their machine, and the job ends in an error. Then the next one goes to someone else who never returns it. And so on. When we first started using the BOINC software there were workunits that took longer than 90 days to complete due to "bad luck".

As a result, the first copies that are sent out and any additional copies sent out in the first three days are assigned to any computer that asks for them. However, after that time they are only sent to computers that have proven themselves to quickly return jobs correctly. They are also assigned a shorter deadline (30% of the original deadline). This allows most batches to finish in about 160% of the original deadline (1 full deadline and then two shorter deadlines).
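
To put rough numbers on that (an illustrative calculation only, assuming the 7-day MCM deadline cited elsewhere in this thread):

# Deadline arithmetic for a "hard luck" workunit (illustrative, not server code).
ORIGINAL_DEADLINE_DAYS = 7.0   # MCM's full deadline, per figures in this thread
REPAIR_FRACTION = 0.30         # repairs to reliable hosts get 30% of it

repair_deadline = REPAIR_FRACTION * ORIGINAL_DEADLINE_DAYS    # 2.1 days
worst_case = ORIGINAL_DEADLINE_DAYS + 2 * repair_deadline     # 11.2 days
print(worst_case / ORIGINAL_DEADLINE_DAYS)                    # 1.6, i.e. about 160%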

However, there are only a certain number of these "reliable" hosts. If there are more jobs requiring reliable hosts than there are reliable hosts asking for jobs, then job distribution gets 'clogged' and we are not able to distribute the work efficiently. As a result, we want to minimize the number of these reliable jobs.

Recently, people have been spinning up a number of EC2 Spot Instances to contribute to World Community Grid. These are virtual machines where you state how much you are willing to spend; if there is sufficient excess capacity on Amazon's cloud and no one else is willing to pay more than you, then your machines are started and will run whatever workload you provide. The catch is that if excess capacity on EC2 shrinks, or if someone offers to spend more than you do, then your instances are stopped and removed.

What this means for us is that 32-core virtual machines were starting up, requesting 5*32 jobs from us, and then being killed off with the results never returned. All 160+ of those jobs then became "no result" jobs 7 days later and generated 160 jobs needing reliable hosts. Enough of this was happening that it exceeded the grid's ability to assign all of those 'reliable' jobs to reliable hosts.

So in order to reduce the number of reliable jobs being produced, we are now limiting new devices to only 1*(number of cores on device) jobs for the first 3 hours of their existence. This will reduce the number of jobs assigned and quickly abandoned, while still allowing devices that continue to participate to get plenty of work as soon as they start finishing their first jobs. Note that if a computer completes and reports a job during this three-hour window, it will still be able to get a new job to work on - it just won't be able to build up a buffer until the window expires.
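
As a sketch of that rule (the record and helper names here are invented for illustration; this is not actual WCG/BOINC server code):

from dataclasses import dataclass
import time

NEW_DEVICE_WINDOW_SECS = 3 * 3600      # the 3-hour window described above

@dataclass
class Host:                            # hypothetical record, for illustration
    create_time: float
    ncpus: int
    tasks_in_progress: int

def max_in_progress(host: Host, now: float) -> int:
    if now - host.create_time < NEW_DEVICE_WINDOW_SECS:
        # New device: at most 1 task per core. Reporting a completed task
        # frees a slot, so work keeps flowing; the host just can't build a
        # buffer until the window expires.
        return host.ncpus
    return host.ncpus * 5              # e.g. the 5-per-core figure quoted above

def can_send_task(host: Host, now: float) -> bool:
    return host.tasks_in_progress < max_in_progress(host, now)

# A brand-new 32-core spot instance can now hold 32 tasks, not 5*32 = 160.
spot = Host(create_time=time.time(), ncpus=32, tasks_in_progress=0)
print(max_in_progress(spot, time.time()))   # 32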

I want to note that a lot of these EC2 spot instances are also doing significant work for World Community Grid, so we do not want to hinder their use and involvement. We are just taking steps to ensure that this works for everyone.
[Dec 19, 2013 3:11:20 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: No Tasks Available [RESOLVED]

Reminds me, there's another option for cc_config, and that is this one:

<abort_jobs_on_exit>0|1</abort_jobs_on_exit>
If 1, abort jobs and update projects when client exits. Useful on grids where disk gets wiped after each run. New in 6.6.10

The description says that abort notices are immediately sent to the server. Think this is used/useful for the deep-freeze machines too.
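
For reference, the flag goes in the <options> section of cc_config.xml, the standard BOINC client config layout:

<cc_config>
  <options>
    <abort_jobs_on_exit>1</abort_jobs_on_exit>
  </options>
</cc_config>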

Detach, as invoked with e.g. --detach_project URL, is a bit of a funny one. Is the message sent instantly to the server [should it be]? Somewhere someone made the point that it is not sent until WCG is added back, but those EC2 instances will likely never be recognized as returners... the detach never reaching WCG. So however these are dealt with, after a number of hours they could have the full default buffer on them [0.1 or 0.3 days?], or whatever the user set the profile to. At that point, after the first 3 hours, there's a bunch of No Reply going to emanate from these when they are ended. True/False? Suppose, though, that you've determined some tipping point with these clients... once they get past the 3-hour mark they're bound to run for longer, i.e. not too many cases in a day to swamp the scheduler with repairs going out by priority [though my interpretation of the rule, edited in below, is that they would go to anyone, since it happened within 3 days of original distribution].

edit: On this
As a result, the first copies that are sent out and any additional copies sent out in the first three days are assigned to any computer that asks for them. However, after that time they are only sent to computers that have proven themselves to quickly return jobs correctly. They are also assigned a shorter deadline (30% of the original deadline). This allows most batches to finish in about 160% of the original deadline (1 full deadline and then two shorter deadlines).

Interpreting this:

1) Repairs that go out within 3 days of the original deadline get a 7-day deadline and go to anyone [seeing repairs with 7 days coming through]
2) Repairs that come about after the first 3 days get a 30% deadline and only go to the 'quick' returners.

In effect, say, an MCM _2 copy can have either a 30% deadline [2.1 days] or a 100% deadline [7 days], depending on when it occurred in the cycle.

Please correct me if something got lost in translation.
----------------------------------------
[Edit 1 times, last edit by Former Member at Dec 19, 2013 8:59:32 AM]
[Dec 19, 2013 8:45:25 AM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: No Tasks Available [RESOLVED]

1) Repairs that go out within 3 days of the original deadline get a 7-day deadline and go to anyone [seeing repairs with 7 days coming through]
2) Repairs that come about after the first 3 days get a 30% deadline and only go to the 'quick' returners.


From the time a workunit is created until 72 hours later, jobs created for the workunit do not require a reliable device to process them. The timing is keyed to when the workunit was created, not to when the first job was assigned or to the deadline of any of the jobs. At some point we would like to change this, but the workunit create time was the only data already present in the transitioner that could be used for this. We load in work between 18 and 36 hours before it starts being distributed.
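
In sketch form (illustrative pseudocode of the rule as stated, not the actual transitioner source):

RELIABLE_WINDOW_SECS = 72 * 3600   # 72 hours from workunit creation

def requires_reliable_host(wu_create_time: float, now: float) -> bool:
    # Keyed to the workunit's create time, not first-send time or deadline.
    return now - wu_create_time > RELIABLE_WINDOW_SECS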
[Dec 19, 2013 4:04:55 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: No Tasks Available [RESOLVED]

You almost made me happy. Taking the median of 27 hours, it's only for just about the first 2 days after the first result goes out that they go to anyone. Taking client v7 behavior into account, any repair arriving later than that on a host with a buffer of > 1 day rushes straight to the head of the queue, for MCM. If a client gets too many, a true EDF cram can develop [it's been observed]. Maybe cap them so no more are assigned than there are active cores in a device.
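
Something like this, say (purely illustrative of the suggestion, not existing server behaviour):

def can_send_repair(repairs_in_progress: int, active_cores: int) -> bool:
    # Never hand a host more short-deadline repairs than it has active cores,
    # so EDF has at most one repair per core to cram in.
    return repairs_in_progress < active_cores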
[Dec 19, 2013 4:25:53 PM]
yoro42
Ace Cruncher
United States
Joined: Feb 19, 2011
Post Count: 8976
Re: No Tasks Available [RESOLVED]

At what point should I abort running jobs that show they are past deadline? The system is aborting tasks that have not started and are past the deadline.
[Dec 19, 2013 5:15:33 PM]
Byteball_730a2960
Senior Cruncher
Joined: Oct 29, 2010
Post Count: 318
Re: No Tasks Available [RESOLVED]

Knreed, thank you for the very detailed answer. I never knew that the EC2 instances were all about cheap spare capacity that could be yanked just like that.

That would explain an awful lot. As always, keep up the good work.
[Dec 20, 2013 5:34:58 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: No Tasks Available [RESOLVED]

Kevin,

re http://www.worldcommunitygrid.org/forums/wcg/...ead,36060_offset,0#443811 and the first 3-hour capping: theoretically a machine that has idle cores starts asking for work, and reports results with those requests. What happens if these first arrivals complete work within 3 hours? E.g. my quad and octo do FAHV in under 1.5 hours, many an MCM finishes in under that time, and presently CEP2 does so too. Do they still have to wait for the 3 hours to pass [with the incrementing deferral added]?
[Dec 20, 2013 10:09:48 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: No Tasks Available [RESOLVED]

Errata: as a possible solution (critiquing is always easy), would a time-limited rule work akin to the one for BETA? E.g. 1 in progress per core, and then 1 for 1 reported, up to 6-12 hours. Most mistakes happen on the first day.
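
In sketch form (the names are invented; this illustrates the suggestion above, not existing WCG behaviour):

RAMP_WINDOW_SECS = 12 * 3600   # upper end of the suggested 6-12 hours

def max_in_progress(host_age_secs: float, ncpus: int,
                    tasks_reported: int, normal_limit: int) -> int:
    if host_age_secs < RAMP_WINDOW_SECS:
        # 1 in progress per core, plus 1 more for each task reported back.
        return ncpus + tasks_reported
    return normal_limit        # regular buffer rules apply afterwards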

(It's anyway good when testing not to get a boatload, because my default profile is set to 0.5 days; but if I cycle multiple test client installs/uninstalls and detach/resets in a day, and the server catches it as the same device, there's little testing, just maybe a 'Recovering lost result/task'.)
[Dec 20, 2013 11:45:53 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Re: No Tasks Available [RESOLVED]

... would a time-limited rule work akin to the one for BETA? E.g. 1 in progress per core, and then 1 for 1 reported, up to 6-12 hours...
That's the way I had read Kevin's post (and I found it fine), but your interpretation is possibly correct too.
Needs some clarification from Kevin...
[Dec 20, 2013 12:16:25 PM]