  translate5 / TRANSLATE-514

Improve Worker garbage clean up and implement a dead worker recognition


Details

    • High
    • Due to problems with the worker system, the logging of the workers had to be changed and improved. A startup delay for workers that could not be started was also introduced to reduce the risk of internal endless loops.

    Description

      The current worker garbage cleanup is not sufficient, since only workers that are done get deleted. We also have the problem that workers are processed sequentially, so if a worker remains marked as running in the DB, no other worker gets started.
      Workers can be left in the running state by restarting apache, by apache or PHP segmentation faults, or by poorly written workers.

      In this case the whole part of the application relying on workers is blocked!

      A solution would be to detect when the process behind a worker no longer exists. This check should be done once the runtime of a worker exceeds X seconds. To choose a reasonable value for X, keep in mind the following statistics from a Zf_worker table of 4343 TermTaggerImport requests (with 50 segments per call), durations in seconds:
      Avg Duration: 21.32
      Max Duration: 1171
      Min Duration: 0
      Request Count: 4343
      Grouped Count:
      [<=10] => 481
      [10 - 50] => 3610
      [50 - 100] => 184
      [100 - 200] => 51
      [200 - 500] => 11
      [500 - 1000] => 0
      [>=1000] => 2

      That means that for termtagger workers running longer than ~60 seconds, the existence of the process should be checked.
      Checking the existence of the process is not as trivial as it looks, since storing the process id is not sufficient. Running PHP as an apache module always returns the pid of the apache process. Since the apache process keeps the same pid, multiple workers have the same pid, so we can't check for this pid. PID checking is also difficult for cross-platform applications (see http://stackoverflow.com/questions/9874331/check-if-specified-pid-is-currently-running-using-php-possible-without-using-p).
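      To make the threshold easier to judge, the grouped counts above can be turned into cumulative shares. The following is only a small sketch in Python (not part of translate5; the function name is made up for illustration). Note that the listed buckets add up to 4339, four fewer than the reported request count of 4343, so the shares are computed against the bucket total. Roughly 94% of the requests fall into the buckets up to 50 seconds and about 98.5% into the buckets up to 100 seconds, which is consistent with a check threshold of around 60 seconds.

```python
# Grouped counts copied from the ticket (durations in seconds).
BUCKETS = [
    ("<=10", 481),
    ("10 - 50", 3610),
    ("50 - 100", 184),
    ("100 - 200", 51),
    ("200 - 500", 11),
    ("500 - 1000", 0),
    (">=1000", 2),
]


def cumulative_shares(buckets):
    """Return (label, share) pairs, where share is the fraction of all
    bucketed requests that fall into this bucket or an earlier one."""
    total = sum(count for _, count in buckets)
    running = 0
    shares = []
    for label, count in buckets:
        running += count
        shares.append((label, running / total))
    return shares


if __name__ == "__main__":
    for label, share in cumulative_shares(BUCKETS):
        print(f"up to bucket [{label}]: {share:.1%}")
```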

      As far as I can see, the only reliable way would be:
      Each worker creates a temporary file and takes a lock_ex (exclusive lock) on it. The file name is saved in the DB instead of the PID. When a worker finishes, crashes, or is killed by an apache kill, the lock is released. That means that wherever we find workers running for more than X seconds, we also have to try to get a lock_ex on the stored file. If the lock_ex fails, the process is still running. If we get the lock_ex, the process is gone and we can delete the lock file and the worker table entry.
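      The lock file idea could look roughly like the following. This is a minimal sketch in Python, not translate5 code: the function names and the lock directory are hypothetical, the DB handling (storing the file name, deleting the worker table entry) is left out, and it assumes a POSIX system where `fcntl.flock` is available. The important detail is that the check uses a non-blocking lock attempt, so the cleanup never waits on a worker that is still alive.

```python
import fcntl
import os
import tempfile


def start_worker_lock(lock_dir):
    """Worker side: create a temporary lock file and hold an exclusive
    lock on it. The returned path is what would be stored in the DB
    instead of the PID; the descriptor must stay open while the worker
    runs. The OS releases the lock when the process exits or is killed."""
    fd, path = tempfile.mkstemp(prefix="worker-", suffix=".lock", dir=lock_dir)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd, path


def worker_is_alive(path):
    """Cleanup side: try to take the exclusive lock without blocking.
    Returns True if another holder still has the lock (worker alive).
    Returns False if the lock could be taken or the file is missing;
    in that case the stale lock file is deleted."""
    try:
        fd = os.open(path, os.O_RDWR)
    except FileNotFoundError:
        return False
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Lock is held elsewhere, so the worker process still exists.
            return True
        # We got the lock: the owning process is gone, remove the file.
        os.unlink(path)
        return False
    finally:
        os.close(fd)
```

      In the cleanup routine, `worker_is_alive` would only be called for workers whose runtime exceeds X seconds; when it returns False, the corresponding worker table entry can be deleted as well.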

            People

              tlauria Thomas Lauria
              tlauria Thomas Lauria
              Aleksandar Mitrev