  1. translate5
  2. TRANSLATE-2764

Refactor MatchAnalysis code and reduce memory usage

Details

    Description

      problem

      The MatchAnalysis code must be refactored:

      • Wrong class hierarchy
      • Too complex, deeply nested code
      • The handling of repetitions is very memory-consuming (probably the reason for TS-1406)
      • The whole process must be changed so that the sequence of steps is clearer:
        1. check repetitions (but load only the segment IDs, not all data)
        2. then load matches from the TM (but do not mix up match info with repetition info by overwriting the match rate with 102 here; this loses information and causes problems later on)
        3. then, if needed, load matches from MT
        4. save the analysis
        5. then pretranslate
      • currently these steps are carried out in a highly chaotic way, which makes the code hard to change or fix
      • Problem with error handling: during a pre-translation against DeepL, one request failed with a timeout; instead of being retried, the affected segments were simply not pre-translated (see also TRANSLATE-2217)
      • TRANSLATE-2255 and TRANSLATE-2364 MUST be implemented together with this issue
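
      The step order listed above can be sketched roughly as follows. This is only an illustration in Python, not the translate5 code: SegmentAnalysis, find_repetitions, analyse, the query_tm/query_mt callbacks and the mt_threshold that decides when MT is "needed" are all assumptions made for this sketch. The point it shows is that repetition info and the TM match rate are stored separately, and 102 is only derived when the effective rate is read.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

REPETITION_RATE = 102  # match rate used for repetitions, as mentioned in the ticket


@dataclass
class SegmentAnalysis:
    """Hypothetical per-segment result; repetition and TM info stay separate."""
    segment_id: int
    repetition_of: Optional[int] = None  # ID of the first occurrence, if repeated
    tm_match_rate: Optional[int] = None  # best TM match rate, never overwritten
    mt_used: bool = False

    @property
    def effective_rate(self) -> int:
        # Derived on demand, so the original TM rate is not lost.
        if self.repetition_of is not None:
            return REPETITION_RATE
        return self.tm_match_rate or 0


def find_repetitions(source_hashes: dict[int, str]) -> dict[int, int]:
    """Step 1: map each repeated segment ID to the ID of its first occurrence.

    Only IDs and hashes are needed here, not the full segment data.
    """
    first_seen: dict[str, int] = {}
    repetitions: dict[int, int] = {}
    for segment_id in sorted(source_hashes):
        digest = source_hashes[segment_id]
        if digest in first_seen:
            repetitions[segment_id] = first_seen[digest]
        else:
            first_seen[digest] = segment_id
    return repetitions


def analyse(
    source_hashes: dict[int, str],
    query_tm: Callable[[int], Optional[int]],
    query_mt: Callable[[int], bool],
    mt_threshold: int = 70,  # assumption: MT is "needed" below this rate
) -> dict[int, SegmentAnalysis]:
    repetitions = find_repetitions(source_hashes)  # step 1
    results: dict[int, SegmentAnalysis] = {}
    for segment_id in sorted(source_hashes):
        entry = SegmentAnalysis(segment_id, repetition_of=repetitions.get(segment_id))
        entry.tm_match_rate = query_tm(segment_id)  # step 2
        if entry.effective_rate < mt_threshold:
            entry.mt_used = query_mt(segment_id)  # step 3
        results[segment_id] = entry
    return results  # steps 4 and 5 (save, pretranslate) would consume this
```

      With this shape, a repeated segment still carries its own TM match rate next to the repetition reference, so later steps can decide which of the two to use.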

      According to TS-1655, not only the matchRate must be considered when deciding which of the found matches to use, but also the segment validation. For example, a purely text-based repetition may be too long if the max length differs between the repeated segments. The solution is to check the segment length / segment validation before the match is used (or to reduce the match rate with a penalty?).
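
      Both options mentioned here (dropping a match that fails validation, or keeping it with a penalty) could look roughly like this. The sketch is in Python and purely illustrative: Candidate, pick_match and the penalty parameter are hypothetical names, and the length check simply counts characters, which may differ from how translate5 actually validates segment length.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical match candidate (repetition, TM or MT result)."""
    target: str
    match_rate: int


def pick_match(
    candidates: Iterable[Candidate],
    max_length: Optional[int],
    penalty: Optional[int] = None,
) -> Optional[Candidate]:
    """Return the best candidate after validating the target length.

    Without a penalty, candidates that are too long are discarded.
    With a penalty, they stay in the pool but with a reduced match rate.
    """
    usable: list[Candidate] = []
    for candidate in candidates:
        too_long = max_length is not None and len(candidate.target) > max_length
        if not too_long:
            usable.append(candidate)
        elif penalty is not None:
            reduced = max(candidate.match_rate - penalty, 0)
            usable.append(Candidate(candidate.target, reduced))
    return max(usable, key=lambda c: c.match_rate, default=None)
```

      Note that with the penalty variant a too-long match can still win if its reduced rate stays above the alternatives, so the penalty size matters.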

      The issue came up again in Stefan's comment in https://jira.translate5.net/browse/TS-1871. The error was out of memory. Task: https://smartspokes.translate5.net/editor/#project/2498/2517/focus

      Run analysis in parallel

      Basic idea of how we should achieve this:

      General concept, similar to AutoQA:

      • 2 framing workers that do the initial work (cloning TMs, ...) and the cleanup (MatchAnalysisStart & MatchAnalysisFinish workers)
      • in between, MatchAnalysisSegment workers (1 ... n), depending on the available pretranslation resources etc.
      • these "eat" through the segments in batches and save the results to a temporary data model (maybe like in AutoQA, JSON per segment, or maybe file-based; must be discussed)
      • maybe we need to assign segment ranges to the workers on queueing to have "nailed" intermediate models
      • if a segment fails (pretrans), it can be retried (see AutoQA and MittagQI\Translate5\Segment\Processing\State)
      • if t5memory or termtagger is not available, the MatchAnalysisSegment worker can be delayed until the problem is solved (delaying does not create stuck processes!)
      • this should make MA much more robust against problems in t5memory, termtagger, or unavailable MTs
      • the process should be split: a first run pretranslates against TMs and Termcollections, then a second run pretranslates the non-found/fuzzy matches against the assigned MTs
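
      A minimal sketch of the range assignment and the retry/delay behaviour described in the list above, written in Python for illustration only. The State enum here is a simplified stand-in and not the actual MittagQI\Translate5\Segment\Processing\State class; assign_ranges, run_worker, max_attempts and batch_size are likewise assumptions of this sketch.

```python
from __future__ import annotations

from enum import Enum
from typing import Callable


class State(Enum):
    """Simplified, hypothetical processing states for a segment."""
    UNPROCESSED = "unprocessed"
    PROCESSED = "processed"
    RETRY = "retry"
    FAILED = "failed"


def assign_ranges(segment_ids: list[int], worker_count: int) -> list[list[int]]:
    """Split the ordered segment IDs into contiguous, nearly equal ranges."""
    size, extra = divmod(len(segment_ids), worker_count)
    ranges: list[list[int]] = []
    start = 0
    for index in range(worker_count):
        end = start + size + (1 if index < extra else 0)
        ranges.append(segment_ids[start:end])
        start = end
    return ranges


def run_worker(
    segment_ids: list[int],
    states: dict[int, State],
    attempts: dict[int, int],
    pretranslate: Callable[[int], None],
    service_available: Callable[[], bool],
    max_attempts: int = 3,
    batch_size: int = 2,
) -> bool:
    """Process one worker's range in batches.

    Returns False if the worker should be delayed and queued again later.
    """
    open_states = (State.UNPROCESSED, State.RETRY)
    pending = [s for s in segment_ids if states.get(s, State.UNPROCESSED) in open_states]
    for offset in range(0, len(pending), batch_size):
        if not service_available():
            return False  # delay: no segment is marked as failed
        for segment_id in pending[offset:offset + batch_size]:
            attempts[segment_id] = attempts.get(segment_id, 0) + 1
            try:
                pretranslate(segment_id)
                states[segment_id] = State.PROCESSED
            except TimeoutError:
                exhausted = attempts[segment_id] >= max_attempts
                states[segment_id] = State.FAILED if exhausted else State.RETRY
    return True
```

      Because the worker only looks at segments that are still unprocessed or marked for retry, running it again after a timeout or a delay picks up exactly the segments that are still open.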

      Use only one of the assigned TMs for internal Fuzzy search

      If internal fuzzy is active, the internal fuzzy search is currently done against each of the TMs. So each TM is cloned, and every segment is written with a hash into each cloned TM after the matches for that segment have been queried.

      When more than one TM is used for a task, this is highly inefficient, since the internal fuzzy matches will be the same for all assigned TMs. It is therefore enough to run the internal fuzzy process only against the first of the assigned TMs.
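
      The proposed behaviour can be illustrated with a small Python sketch. InMemoryTm and analyse_with_internal_fuzzy are hypothetical stand-ins, not translate5 classes, and matching is reduced to exact string lookup to keep the example short. Only the first assigned TM is cloned and written to; all other TMs are queried as they are.

```python
from __future__ import annotations

import copy


class InMemoryTm:
    """Hypothetical stand-in for a TM resource; matching is exact-only here."""

    def __init__(self, name: str, entries: tuple[str, ...] | list[str] = ()):
        self.name = name
        self.entries = list(entries)

    def has_match(self, source: str) -> bool:
        return source in self.entries

    def save(self, source: str) -> None:
        self.entries.append(source)


def analyse_with_internal_fuzzy(sources: list[str], assigned_tms: list[InMemoryTm]):
    """Query all TMs, but clone and update only the first one."""
    if not assigned_tms:
        return {}, None
    fuzzy_tm = copy.deepcopy(assigned_tms[0])  # the single internal fuzzy clone
    fuzzy_tm.name += "-internal-fuzzy"
    queried = [fuzzy_tm] + list(assigned_tms[1:])
    hits: dict[int, list[str]] = {}
    for index, source in enumerate(sources):
        hits[index] = [tm.name for tm in queried if tm.has_match(source)]
        fuzzy_tm.save(source)  # written after the matches were queried
    return hits, fuzzy_tm
```

      In this sketch the originally assigned TMs are never modified, and the number of clones stays at one regardless of how many TMs are assigned to the task.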

            People

              leonkiz Leon Kiz
              tlauria Thomas Lauria