Details
- Type: Bug
- Resolution: Fixed
- Priority: High
Due to problems with the worker system, the logging of the workers had to be changed and improved. A delay for the startup of workers that could not be started was also introduced, to reduce the risk of internal endless loops.
Description
The current worker garbage collection is not sufficient, since only workers in state "done" are deleted. In addition, workers are processed sequentially, so if a worker remains in state "running" in the DB, no other worker gets started.
Workers can remain in the running state after an Apache restart, after an Apache or PHP segmentation fault, or because of poorly written workers.
In this case the whole part of the application relying on workers is blocked!
A solution would be to implement a check whether the process behind a worker still exists. This check should be performed when the runtime of a worker exceeds X seconds. To choose a sensible value for X, keep in mind the following statistics from a Zf_worker table of 4343 TermTaggerImport requests (with 50 segments per call; durations in seconds):
Avg Duration: 21.32
Max Duration: 1171
Min Duration: 0
Request Count: 4343
Grouped Count:
[<=10] => 481
[10 - 50] => 3610
[50 - 100] => 184
[100 - 200] => 51
[200 - 500] => 11
[500 - 1000] => 0
[>=1000] => 2
That means that for termtagger workers running longer than ~60 seconds, the existence of the process should be checked.
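The selection step described above can be sketched as follows (Python, purely illustrative: the `workers` rows, their field names, and the `find_suspect_workers` helper are assumptions standing in for the real Zf_worker table access; the 60-second threshold comes from the statistics above):

```python
import time

RUNTIME_THRESHOLD = 60  # seconds, derived from the duration statistics above

def find_suspect_workers(workers, now=None):
    """Return running workers whose runtime exceeds the threshold and
    whose process existence should therefore be verified."""
    now = now if now is not None else time.time()
    return [w for w in workers
            if w["state"] == "running"
            and now - w["started"] > RUNTIME_THRESHOLD]

# Hypothetical rows as they might come from the worker table:
workers = [
    {"id": 1, "state": "running", "started": time.time() - 10},
    {"id": 2, "state": "running", "started": time.time() - 600},
    {"id": 3, "state": "done",    "started": time.time() - 900},
]
suspects = find_suspect_workers(workers)  # only worker 2 qualifies
```

Only workers that pass this filter need the (more expensive) process-existence check described below.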
Checking the existence of the process is not as trivial as it looks, since storing the process ID is not sufficient: when PHP runs as an Apache module, the reported PID is always that of the Apache process. Since the Apache process keeps the same PID, multiple workers end up with the same PID, so we cannot check against it. PID checking is also difficult in cross-platform applications (see http://stackoverflow.com/questions/9874331/check-if-specified-pid-is-currently-running-using-php-possible-without-using-p).
As far as I can see, the only reliable way would be the following:
Each worker creates a temporary file and acquires an exclusive lock (LOCK_EX) on it. The file name is saved in the DB instead of the PID. When a worker finishes, crashes, or is killed by Apache, the OS releases the lock automatically. So at the place where we find workers running longer than X seconds, we additionally try to acquire a LOCK_EX on the stored file. If the LOCK_EX fails, the process is still running. If we get the LOCK_EX, the worker is gone, and we can delete the lock file and the worker table entry.
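The lock-file technique can be sketched in Python using `fcntl.flock` (the real implementation would use PHP's `flock()`; the function names and the lock-file directory here are illustrative assumptions). The key property is that the OS drops the lock when the owning process dies, for any reason:

```python
import fcntl
import os
import tempfile

def start_worker_lock(directory=None):
    """Worker side: create a lock file and hold an exclusive lock on it
    for the lifetime of the process. The OS releases the lock
    automatically if the process exits, crashes, or is killed."""
    fd, path = tempfile.mkstemp(prefix="worker_", dir=directory)
    fcntl.flock(fd, fcntl.LOCK_EX)
    # 'path' would be stored in the worker DB row instead of a PID.
    return fd, path

def worker_is_alive(path):
    """Cleanup side: try to acquire the same lock without blocking.
    Failure means the worker process still holds it; success means
    the worker is gone and its lock file can be removed."""
    try:
        fd = os.open(path, os.O_RDWR)
    except FileNotFoundError:
        return False  # lock file already gone -> worker is gone
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return True   # lock held elsewhere -> process still running
    # We got the lock: the worker is dead, so clean up its lock file.
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
    os.remove(path)
    return False
```

Note that `flock`-style locks are tied to the open file description, not the file name, which is exactly why a crashed or killed worker releases its lock without any cleanup code of its own.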
Issue Links
- relates to TRANSLATE-1161 Task locking clean up is only done on listing the task overview (Done)
- relates to TRANSLATE-3381 Start workers as plain processes instead using HTTP requests (Done)