  1. translate5
  2. TRANSLATE-2764

Refactor MatchAnalysis code and reduce memory usage


Details

    • High
    • -

    Description

      problem

      The MatchAnalysis code must be refactored:

      • Wrong class hierarchy
      • too complex, deeply nested code
      • Regarding repetitions it is very memory consuming (probably the reason for TS-1406)
      • The whole process must be changed so that the order of the consecutive steps is clear:
        1. check repetitions (but load only the segment IDs, not all data)
        2. then load matches from TM (but do not mix up match info with repetition info by overwriting the match rate with 102 here; this loses information and causes problems later on)
        3. then if needed from MT
        4. save the analysis
        5. then pretranslate
      • currently these steps are performed in a highly chaotic order, which makes it hard to change or fix the code here
      • Problem with error handling: on a pre-translation against DeepL one request failed with a timeout; instead of retrying, the segments were simply not pre-translated (see also TRANSLATE-2217)
      • TRANSLATE-2255 and TRANSLATE-2364 MUST be implemented together with this issue
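
      The step order above could be sketched roughly as follows. This is a minimal illustration in Python, not the translate5 implementation: `SegmentAnalysis`, `with_retry`, `analyse`, `tm_lookup` and `mt_lookup` are hypothetical names, the lookups are assumed to return a `(match_rate, target)` tuple or `None`, and "if needed" is assumed to mean "no TM match was found". The point is that the repetition flag is stored separately from the match rate, and that a timed-out request is retried instead of silently skipping the segment.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SegmentAnalysis:
    segment_id: int
    is_repetition: bool = False        # kept separate from the match rate
    match_rate: Optional[int] = None   # rate from TM or MT, never overwritten with 102
    target: Optional[str] = None


def with_retry(call: Callable[[], Optional[tuple]], attempts: int = 3):
    """Retry a timed-out request instead of leaving the segment untranslated."""
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except TimeoutError as error:
            last_error = error
    raise last_error


def analyse(segments, tm_lookup, mt_lookup):
    """segments: list of (segment_id, source_text) tuples.
    tm_lookup / mt_lookup: hypothetical callables, see the lead-in."""
    results = {}
    first_seen = {}
    # Step 1: detect repetitions; only a hash and the first segment ID are kept.
    for segment_id, source in segments:
        key = hashlib.md5(source.encode("utf-8")).hexdigest()
        first_id = first_seen.setdefault(key, segment_id)
        results[segment_id] = SegmentAnalysis(segment_id, is_repetition=first_id != segment_id)
    # Step 2: TM matches; the repetition flag is not touched.
    for segment_id, source in segments:
        match = with_retry(lambda: tm_lookup(source))
        if match:
            results[segment_id].match_rate, results[segment_id].target = match
    # Step 3: MT only for segments that got no TM match (assumption).
    for segment_id, source in segments:
        if results[segment_id].match_rate is None:
            match = with_retry(lambda: mt_lookup(source))
            if match:
                results[segment_id].match_rate, results[segment_id].target = match
    # Step 4: the analysis is complete at this point and could be saved.
    # Step 5: pre-translation is a separate pass over the finished analysis.
    pretranslated = {sid: entry.target for sid, entry in results.items() if entry.target is not None}
    return results, pretranslated
```

      Because each step is its own loop, a single step can be changed or fixed without touching the others.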

      According to TS-1655, not only the matchRate must be considered when deciding which of the found matches should be used, but also the segment validation. For example, a purely text-based repetition may be too long if the max length differs between the repeated segments. The solution is to check the segment length / segment validation before the match is used (or to reduce the match rate with a penalty?).
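
      Both variants (discarding invalid matches, or reducing their match rate with a penalty) can be illustrated with a small sketch. It is only an illustration under assumptions: `Match`, `select_match` and the `penalty` parameter are hypothetical names, and the segment validation is reduced to a plain character-count check against a max length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    target: str
    match_rate: int


def select_match(matches, max_length: Optional[int], penalty: Optional[int] = None):
    """Pick the best match while honoring a max length.

    penalty=None: matches that are too long are discarded.
    penalty=int:  matches that are too long stay usable, with a reduced rate.
    """
    candidates = []
    for match in matches:
        too_long = max_length is not None and len(match.target) > max_length
        if too_long and penalty is None:
            continue
        rate = match.match_rate - penalty if too_long else match.match_rate
        candidates.append((rate, match))
    if not candidates:
        return None
    rate, best = max(candidates, key=lambda item: item[0])
    return Match(best.target, rate)
```

      With a repetition that exceeds the max length and a shorter 95% TM match, the discard variant returns the TM match, while a penalty of 5 would still prefer the (now lower rated) repetition.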

      The issue came up again in Stefan's comment in https://jira.translate5.net/browse/TS-1871. The error was out of memory. Task: https://smartspokes.translate5.net/editor/#project/2498/2517/focus

      Run analysis in parallel

      Basic idea how we should achieve this:

      For each language resource one worker should be started, so that they can run in parallel? There must also be the option to run the analysis and pre-translation against the TermCollections and TMs first, and only in a second step against the MT resources for all segments that were not pre-translated.
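
      A rough sketch of this idea, assuming one worker per language resource and two phases. Threads from Python's `concurrent.futures` merely stand in for whatever worker mechanism is actually used; `run_phase`, `analyse_in_two_phases` and the resource callables (returning `(match_rate, target)` or `None`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_phase(resources, segments):
    """Query every resource in its own worker.

    segments: {segment_id: source_text}
    Returns {segment_id: (match_rate, target)} with the best match per segment.
    """
    def worker(resource):
        return {sid: resource(source) for sid, source in segments.items()}

    best = {}
    with ThreadPoolExecutor(max_workers=max(1, len(resources))) as pool:
        for result in pool.map(worker, resources):
            for sid, match in result.items():
                if match and (sid not in best or match[0] > best[sid][0]):
                    best[sid] = match
    return best


def analyse_in_two_phases(segments, tm_resources, mt_resources):
    # Phase 1: TermCollections and TMs, all queried in parallel.
    first = run_phase(tm_resources, segments)
    # Phase 2: MT resources, only for segments without a result from phase 1.
    remaining = {sid: src for sid, src in segments.items() if sid not in first}
    second = run_phase(mt_resources, remaining) if remaining else {}
    return {**second, **first}
```

      Running MT as a second phase means that MT requests are only sent for segments that really need them.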

      Use only one of the assigned TMs for internal Fuzzy search

      If internal fuzzy is active, the internal fuzzy search is currently done against each of the TMs. So each TM is cloned, and in each cloned TM every segment is written with a hash after the matches for this segment have been queried.

      When using more than one TM for a task, this is highly inefficient, since the internal fuzzy matches will be the same for all assigned TMs. Therefore it is enough to do the internal fuzzy process only with the first of the assigned TMs.
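
      The proposed behaviour could look roughly like this. The sketch makes several assumptions: `InMemoryTm` is a toy stand-in for a TM, an exact lookup stands in for the fuzzy search, and the clone of the first TM is queried in place of the original. Only the first assigned TM is cloned and written to; all other TMs are queried read-only.

```python
class InMemoryTm:
    """Toy TM: exact lookup only, standing in for a real fuzzy search."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def clone(self):
        return InMemoryTm(self.entries)

    def query(self, source):
        return self.entries.get(source)

    def update(self, source, target):
        self.entries[source] = target


def analyse_with_internal_fuzzy(segments, tms):
    """segments: list of (segment_id, source_text); tms: assigned TMs in order."""
    # Clone only the first TM; it replaces the original for the analysis run.
    fuzzy_tm = tms[0].clone() if tms else None
    resources = ([fuzzy_tm] if fuzzy_tm else []) + list(tms[1:])
    found = {}
    for segment_id, source in segments:
        found[segment_id] = [hit for tm in resources if (hit := tm.query(source)) is not None]
        # After querying, write the segment into the single clone so that
        # later segments can find it as an internal fuzzy match.
        if fuzzy_tm is not None:
            fuzzy_tm.update(source, f"internal-fuzzy:{segment_id}")
    return found
```

      With N assigned TMs this needs one clone and one write per segment instead of N.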

            People

              leonkiz Leon Kiz
              tlauria Thomas Lauria