World Community Grid Forums
Category: Active Research
Forum: Mapping Cancer Markers Forum
Thread: Neverending MCM tasks
Thread Status: Active Total posts in this thread: 86
shanen0
Cruncher Joined: Feb 4, 2021 Post Count: 20 Status: Offline |
Have three of them now. Normal working time is around 4 hours, but two of them have been running for close to three days and are one day past their deadline, and the third has been running for over a day already. Eventually they apparently do get killed off, but I think no credit is granted, and it isn't my fault that the code is buggy. I noticed another a few days ago. So far it's only been on my largest machine, but I don't watch the others as closely.
Anyone else seeing this sort of thing?
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7243 Status: Offline Project Badges: |
I am going to hazard a guess you are running some version of Windows. I have seen this before, and the easiest thing to do to correct the problem (if you notice it) is to reboot. Then the work units should come to a normal conclusion. If that does not solve the problem (when it occurs), please post some log entries and perhaps there will be a clue in there. I have not seen nor heard of this problem on Linux. If anyone has, please post and let us know of any proposed solutions. I don't think it is faulty code from MCM, but faulty memory management from the OS.
Cheers
Sgt. Joe
----------------------------------------
*Minnesota Crunchers*
[Edit 1 times, last edit by Sgt.Joe at Sep 6, 2021 1:26:35 AM]
||
|
shanen0
Cruncher Joined: Feb 4, 2021 Post Count: 20 Status: Offline |
Yes, that was from a Windows machine and perhaps I should have noted that I am aware that rebooting often fixes the hung-task problem. But that's a machine that I prefer to avoid rebooting. All of my machines have primary uses and WCG runs on unused cycles. (Kind of reminds me of how IBM managed WCG, actually.)
My general concern is that buggy software produces unreliable results. Bugs include tasks that hang or that fail to checkpoint. Perhaps the new "management" will do a better job. I also hope they will stop with the short-deadline tasks. They are annoying and I generally nuke them on sight, even on the machines that will probably be able to complete the tasks within their short deadlines. Never understood the point of the deadlines except to waste donated cycles when some machines can't achieve some arbitrary deadline. If the deadline encourages people to run machines they otherwise wouldn't run, then I tend to see the deadlines as counterproductive as well as wasteful.
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7243 Status: Offline Project Badges: |
Just curious: how often do you get a hung MCM task? Do you have hung tasks on any other projects? Do you ever have any other software which hangs your system, or is it just a BOINC problem?
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
||
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 325 Status: Offline Project Badges: |
If the problem is with one task an alternative to try is:
set 'leave application in memory' to off
suspend the task
wait a couple of minutes to ensure it is removed from memory
resume task
set 'leave application in memory' to on
It should then resume from the last saved checkpoint
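For anyone who prefers the command line, the suspend / wait / resume part of the steps above can be roughly sketched in Python around `boinccmd --task <url> <task_name> suspend|resume`. This is only a sketch under assumptions: the project URL and the task name are placeholders (check `boinccmd --get_project_status` and `boinccmd --get_tasks` for the real values on your machine), `boinccmd` is assumed to be on the PATH, and the 'leave application in memory' preference is not touched here, so it still has to be switched off and back on by hand.

```python
import subprocess
import time

# Assumption: the project URL as registered in the local BOINC client.
# The real value may differ; check `boinccmd --get_project_status`.
PROJECT_URL = "http://www.worldcommunitygrid.org/"


def task_command(task_name, action, project_url=PROJECT_URL):
    """Build the boinccmd argument list for one task operation."""
    if action not in ("suspend", "resume"):
        raise ValueError("action must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, action]


def restart_from_checkpoint(task_name, wait_seconds=120, dry_run=True):
    """Suspend a task, wait for it to leave memory, then resume it.

    With dry_run=True the commands are only printed, not executed.
    """
    steps = [
        task_command(task_name, "suspend"),
        task_command(task_name, "resume"),
    ]
    for index, cmd in enumerate(steps):
        if dry_run:
            print(" ".join(cmd))
            continue
        subprocess.run(cmd, check=True)
        if index == 0:
            # Give the client a couple of minutes to unload the task.
            time.sleep(wait_seconds)
    return steps


if __name__ == "__main__":
    # Hypothetical task name, used for illustration only.
    restart_from_checkpoint("MCM1_example_task_0", dry_run=True)
```

Running it with `dry_run=True` first shows the two commands without touching the client; only switch to `dry_run=False` once the task name and URL match what `boinccmd --get_tasks` reports.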
||
|
shanen0
Cruncher Joined: Feb 4, 2021 Post Count: 20 Status: Offline |
Not monitoring WCG that closely these days, but I've only noticed those stuck tasks on my main machine, which rarely gets rebooted. Not so much RAM that I want to encourage any apps to remain in RAM, though I haven't noticed any problems that seem symptomatic of memory problems.
Basically I just want WCG to run with fewer problems. The most common reason I move from one project to another is because of persistent intrusions. However, remembering back to my days with researchers, buggy software reduces confidence in the results.
||
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges: |
MCM1 tasks work fine for me on my amazing 106 days of uptime on Windows 10.
----------------------------------------
For those with hung MCM1 tasks where the wingman's computer completed the same work unit, my guess is that this is a hardware problem: RAM, CPU, motherboard, or something similar. On non-ECC RAM, I have seen some hung MCM1 tasks, invalid ARP1 results, random MIP1 computation errors, BSODs / crashes, and freezes. The worst was file system corruption, after which Windows 10 would no longer start.

I switched to ECC unbuffered DIMMs (UDIMM) for most of my computers with a supported CPU and motherboard. They work fine with nice uptime. I have never seen any hung MCM1 tasks with ECC RAM, but ECC RAM can fail differently from non-ECC. I have had a failing ECC DDR4 module for which the Windows log showed hundreds of WHEA corrected errors, along with frequent resets and reboots. Just sitting idle on Windows 10 at less than 1% CPU usage with one faulty ECC module got me 6 reboots in 1 hour. After I removed the faulty memory: no more random reboots, no invalids, no computation errors, no file system corruption.

My computers (there was a power outage in my area 107 days ago):
Ryzen 3900x, Asus B550-E, Win10, 32GB (2x16) DDR4-3200 ECC, Uptime 2 days (changed ECC RAM)
Ryzen 2700x, Asus Prime B350 Plus, Win10, 32GB (2x16) DDR4-3200 ECC, Uptime 106 days
AMD FX-4100, Asus M5A97 R2.0, Linux Debian, 32GB (4x8) DDR3-1600 ECC, Uptime 106 days, edac-util 1 corrected
Intel Atom N270, HP Mini 100-1000, Linux Debian 32bit, 2GB DDR2 no-ECC, Uptime 42 days
Intel i7-2600, Asus P8H77-m, Linux Debian, 16GB (2x8) DDR3-1333 no-ECC, Uptime 3 days
Unused: Ryzen 2400g (non-Pro) doesn't support ECC. MSI Tomahawk B450 doesn't support ECC.
[Edit 3 times, last edit by sam6861 at Sep 25, 2021 12:08:41 AM]
||
|
rcthardcore
Cruncher United States Joined: Jan 29, 2009 Post Count: 13 Status: Offline Project Badges: |
I have verified that it IS faulty MCM code. No other stuck/forever running tasks on any of my other BOINC projects. It ONLY happens on MCM.
----------------------------------------
AMD Ryzen 9 5950x
NVIDIA RTX 3090 FE
128 GB DDR4-3200
Windows 10 64-bit 21H1
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7243 Status: Offline Project Badges: |
"I have verified that it IS faulty MCM code. No other stuck/forever running tasks on any of my other BOINC projects. It ONLY happens on MCM."

If you have verified the code used for MCM is faulty, did you indeed look at the code and find the offending bug? If you have found it, did you notify the project that you found a bug? I have run over 250,000 MCM units on both Windows and Linux and have never seen a stuck unit. That doesn't mean that there is not a bug somewhere in the code, but I would speculate that the code is not the problem and that there is a hardware issue on the machine in question. In addition, if a large number of users were seeing this problem, a code problem would be more likely, but there do not seem to be a lot of complaints about this issue. Finally, if there were a software bug, why would a simple reboot fix the problem for a specific work unit? If it were a code problem, it would probably continue to occur in the same work unit even after a reboot. Ergo, back to a probable hardware issue. Good luck.
Cheers
Edit: spelling
Sgt. Joe
----------------------------------------
*Minnesota Crunchers*
[Edit 1 times, last edit by Sgt.Joe at Sep 30, 2021 8:47:23 PM]
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 1886 Status: Offline Project Badges: |
11,191 MCM units crunched here. Not a single stuck unit.
----------------------------------------
Correlation does not imply causation. Cum hoc ergo propter hoc
[Edit 2 times, last edit by Grumpy Swede at Sep 30, 2021 8:57:25 PM]
||
|
|