Total posts in this thread: 30
This topic has been viewed 50115 times and has 29 replies
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: work units not finishing

Rickjb, are you running multiple CEP2 workunits on this computer when you have the issue?

Thanks,
armstrdj
[Mar 4, 2011 2:43:55 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: work units not finishing

Thanks, armstrdj. Just running 1 x CEP2, 3 x FAAH.

More info as of today:
I have the O/S (XP) and pagefile on the old 8.4GB IDE drive, BOINC data on the WD Caviar Green 500GB/32MB. I can hear & feel when the heads seek on the IDE, and the activity LED flickers. The WD does its R/W mostly silently. During a lockup, there is no sound or vibration from the IDE, and the LED comes on bright & steady, strongly suggesting the WD, ie the BOINC data drive, is the culprit. Perhaps the WD is doing full-buffer 32MB r/w operations when it is only being asked to r/w smaller blocks.
I had the same problems with this computer running 4 x DDDT2 when the WD was the sole drive, and the drive activity symptoms were the same as now. The IDE is too small to keep BOINC data with CEP2, but 4 x DDDT2 fitted & they ran OK. When doing that, it would periodically have extended periods of intense HDD activity with the LED on but flickering a little, and the system would slow but not freeze, and no WUs fell over.

I normally run an instance of Task Manager so that I can have a look around if I catch a lockup. (You may be unable to load TM then, so it's best to have it there already.) Today I came upon the machine when system CPU usage had dropped to 75%. The CEP2 was at 0%, the 3 FAAHs 25% each. Usually, if left alone in this situation, the CEP2 will time out, exit, try to restart, and the dominoes will fall. Today, after a delay, I was able to get into BOINC Manager and Suspend everything from the Activity menu (LAIM is ON). The HDD LED went out very soon, and everything resumed to 100% CPU as soon as I clicked "Run always". No WUs had timed out.

I will try swapping the WD drive with a Samsung unit, but I'll be away next week & this will have to wait until afterwards.
----------------------------------------
[Edit 3 times, last edit by Rickjb at Mar 5, 2011 2:53:51 PM]
[Mar 5, 2011 10:38:59 AM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
Re: work units not finishing

"I have the O/S (XP) and pagefile on the old 8.4GB IDE drive, BOINC data on the WD Caviar Green 500GB/32MB.
............
I will try swapping the WD drive with a Samsung unit, but I'll be away next week & this will have to wait until afterwards."

Can you run it in AHCI mode? I think it will be a lot happier with the writes, even if it means that you have to put everything on a single drive. Perhaps even more importantly, the write cache on an 8.4 GB drive can't be very large. I would use a later-generation drive for the OS and page file anyway.
----------------------------------------
[Edit 2 times, last edit by Jim1348 at Mar 5, 2011 6:35:42 PM]
[Mar 5, 2011 6:13:09 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: work units not finishing

The Samsung is an SP2014N (200GB/7200rpm/8MB cache/PATA). I think AHCI only works on SATA drives.
I switched from IDE to AHCI Mode with the WD Cav Green 500GB unit but that has not helped.

The large cache may be the cause of the problem. With a small cache, the drive spends less time doing lookups & other stuff with the cache and just goes and does real r/w instead. Smaller caches might be faster for r/w of smaller blocks of data in random places on the drive. (The cache in the o/s was invented to deal with that stuff).
Perhaps a drive with slow transfer speed and a large cache needs to hold the DMA request line for too long: it could cause the o/s to miss important stuff, or, vice versa, the o/s keeps interrupting long DMA transfers so it can do more important stuff. Or, the "green" drive just has low sustained r/w speeds. OTOH, surely it's faster than an 11-year-old IDE drive.
Just theories, and I could be entirely wrong.
Can any of the WCG techs call on some of the hardware guys at their Big Blue Brother for opinions? (The word "Winchester" comes to mind).
----------------------------------------
[Edit 5 times, last edit by Rickjb at Mar 19, 2011 2:32:26 PM]
[Mar 6, 2011 4:18:26 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: work units not finishing

Are you running with the BM always loaded? Yes, large caches are known to cause substantial delay in the CC/BM exchange, particularly if the BM is up. For that reason the button in the task view was created to show only active tasks. Even when the BM is not loaded, the CC has to work harder when hundreds of tasks are in cache.

--//--
[Mar 6, 2011 10:05:58 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: work units not finishing

@Sek: sorry, there are 3 caches involved & we are talking about different ones. You: BOINC work cache. Me: HDD onboard RAM cache slowing down random disc access, and o/s disc cache for speeding things up, within limits (software overheads of very large o/s disc caches could slow things down again).
Irrelevant now, but I usually close the BM window. However, boincmgr.exe remains in the Task Manager Processes list after you do this.
[Mar 6, 2011 3:48:38 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: work units not finishing

I suggest you exit the BM **, which removes one more thing that could be holding anything up. In the status bar it still fetches the CC state every other second. Some have observed a memory leak, with the BM/CC bloating ever bigger. I just looked on the Linux box: it takes 328.5 MB VM and 10 MB RAM. Even when minimized it takes 1/100 of a second every few seconds. So I look in, then exit it routinely.

** If BOINC is installed as user, the core client might exit too. The GPU crunchers would know since they have to install on Windows without PAE.

Just loose thoughts to steal more microseconds.

--//--
[Mar 6, 2011 4:02:21 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: work units not finishing

Who was seeing suspending/preempting tasks and the jobs breaking? Just checked in with the 6.12.16 alpha test client:
client: wait 15 seconds (instead of 5) for an app to exit before killing it. Apparently some apps take ~10 sec on slow computers.

and

client: in the loop that starts up apps, check if we've been in the loop for 10 sec. If so, break out of it and reschedule. Avoid starving GUI RPCs and heartbeats.
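
For anyone wondering what those two changes amount to, here is a rough sketch of the logic in Python. It is not the BOINC client's code (that is C++); the function names, parameters, and the injectable clock are all made up for illustration. Only the 15-second and 10-second figures come from the notes quoted above.

```python
import time


def wait_for_exit(has_exited, grace_seconds=15.0, poll_seconds=0.1,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll has_exited() until it is true or the grace period runs out.

    Returns True if the app exited on its own. A False result means the
    caller would go on to kill the app. (Hypothetical helper.)
    """
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if has_exited():
            return True
        sleep(poll_seconds)
    # One last check at the deadline before giving up.
    return has_exited()


def start_pending(tasks, start_one, budget_seconds=10.0, clock=time.monotonic):
    """Start tasks in order until the time budget is used up.

    Returns (started, deferred). Deferred tasks are left for the next
    scheduling pass, so other work (GUI RPCs, heartbeats) gets a turn.
    (Hypothetical helper.)
    """
    began = clock()
    started = []
    for i, task in enumerate(tasks):
        if clock() - began >= budget_seconds:
            return started, list(tasks[i:])
        start_one(task)
        started.append(task)
    return started, []
```

The point of the second function is that a slow disk can make each app start take several seconds, so a loop with no time budget could hold everything else up for as long as it takes to start every task.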


Not sure if related, but worth a trial run. Currently this test version is only available for Windows!

--//--

To get it, just take the standard BOINC installer download link from Berkeley and replace the 6.10.58 with 6.12.16 in the address. Available in 32-bit and 64-bit versions.
----------------------------------------
[Edit 1 times, last edit by Former Member at Mar 7, 2011 10:57:24 AM]
[Mar 7, 2011 10:55:31 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: work units not finishing

"No heartbeat from core client for 30 sec - exiting" + "exited with zero status but no 'finished' file" : Problem solved!
Cause: Poor performance of hard disk containing the BOINC data. Solution: Replace drive with one of a "better" model.
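
As I understand it, the heartbeat message means the science app expects to hear from the core client regularly and gives up if nothing arrives within the timeout. A minimal sketch of that kind of watchdog, in Python, follows. The class and its names are my own invention for illustration, not BOINC code; only the 30-second figure comes from the message quoted above.

```python
import time


class HeartbeatWatch:
    """Track how long it has been since the last heartbeat (hypothetical)."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = clock()

    def beat(self):
        """Record that a heartbeat has just arrived."""
        self.last_seen = self.clock()

    def expired(self):
        """True once more than timeout_seconds have passed without a beat."""
        return self.clock() - self.last_seen > self.timeout_seconds
```

If the data drive stalls the whole system for longer than the timeout, a check like expired() would come back true even though nothing is actually wrong with the client, which would fit the symptoms.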

I have solved my problem of WUs exiting with zero status and restarting from their last checkpoint, which I described in my long post earlier in this thread.
The drive containing the BOINC DATA was a Western Digital "Caviar Green" Model WD5000AADS-00S9B0 500GB SATA drive which has 32MB cache, and I replaced it with an older Samsung SP2014N 200GB IDE unit having 8MB of onboard cache.
The WD drive passes the WD diagnostics with a clean bill of health, so the problem is with the drive model rather than my particular unit.
As stated above, another of my machines runs happily with a similar WD drive with the same firmware version number, but with 750GB capacity and only 16MB of cache.
With its IDE interface, the raw I/O speed of the Samsung drive is probably slower than the WD's SATA interface, especially when the WD is in AHCI Mode. The latency of the 5400rpm WD may however be greater than for the 7200rpm Samsung, due to its slower rotation and perhaps slower track-to-track seek times, which are compromises made by WD to reduce power consumption. The way that the WD's firmware handles its large onboard cache may also cause delays. Or, the WD may be holding a system bus for long periods while it is performing DMA activity, freezing out the CPU.
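
To put a number on the rotation part of that: average rotational latency is the time for half a revolution. A quick Python calculation using the two spindle speeds mentioned above (the function name is just for illustration, and seek time and cache behaviour are not modelled):

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in milliseconds: half a revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


for rpm in (5400, 7200):
    print(f"{rpm} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
# prints:
# 5400 rpm: 5.56 ms
# 7200 rpm: 4.17 ms
```

So rotation alone accounts for roughly 1.4 ms extra per random access on the slower drive, which adds up under heavy random I/O but probably does not explain multi-second freezes by itself.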

With the Samsung drive, there are still times when the HDD LED comes on and CPU usage falls to zero, but these are for much shorter periods than before, and no WUs have timed out, even when running 4 CEP2 WUs simultaneously. Also, the running of other tasks that are not requesting large amounts of disk I/O seems to be hardly affected, whereas before the computer would almost freeze.

Note that it is the drive that contains the BOINC data that causes the problem, not the drive containing the system, pagefile or BOINC program files. My Samsung with the BOINC data is currently the 2nd drive, and all of the intense HDD activity is on this, not the system drive.

I repeat my suggestion above that if you are experiencing the problems that I have described, please post details of the drive that contains your BOINC DATA.

Also, if you have a "green" or "eco" low-power drive that contains your BOINC data and runs CEP2 without problems, you might post the details of that too.
[Mar 15, 2011 1:34:41 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: work units not finishing

@cleanenergy/WCG programmers:
I notice that one of the times that there is heavy HDD activity and a drop in CPU usage/increase in System Idle is while an instance of wcgrid_cep2_6.40_windows_intelx86 is starting up a wcgrid_cep2_qchem_6.40_windows_intelx86 process. I have only observed this during startup of the first qchem process of a WU, but guess each Job of the WU is similar. [B.S] sTrey has also complained that his CEP2 units ... Bogart the host when they start up. If the HDD activity is the result of a single command to the O/S to load and execute qchem, there's nothing you can do, but if it is possible to insert some short delays between HDD requests at these times, that might help machines that have slow storage devices, and might reduce any perceived slowdowns on all machines. My affected machine also had WU timeouts running multiple DDDT2 WUs, and on at least 1 occasion when restarting 4 FAAH WUs, but of course fixing CEP2 would not help in those cases.
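
To show what I mean by short delays between HDD requests, here is a rough Python sketch. I have no idea how CEP2 actually loads qchem or its data, so this is only the general idea; the function name, chunk size, and pause length are assumptions picked for illustration.

```python
import time


def read_throttled(path, chunk_bytes=1 << 20, pause_seconds=0.05,
                   sleep=time.sleep):
    """Read a file in chunks, pausing briefly after each one.

    Returns the total number of bytes read. The pauses give other
    processes a chance to get their own disk requests serviced.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
            sleep(pause_seconds)
    return total
```

The trade-off is a slightly slower startup for the WU in exchange for less chance of starving everything else on a slow drive.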
----------------------------------------
[Edit 1 times, last edit by Rickjb at Mar 16, 2011 7:14:13 AM]
[Mar 15, 2011 2:06:53 PM]