Details
-
Bug
-
Resolution: Unresolved
-
None
-
None
-
High
-
Description
Problem
The MatchAnalysis code must be refactored:
- Wrong class hierarchy
- Too complex, deeply nested code
- The handling of repetitions is very memory-consuming (probably the reason for TS-1406)
- The whole process must be changed so that the sequence of steps is clearer:
  - check repetitions (but load only the segment IDs, not all data)
  - then load matches from the TM (but do not mix up match info with repetition info by overwriting the match rate with 102 here; this loses information and causes problems later on)
  - then, if needed, from MT
  - save the analysis
  - then pretranslate
- Currently these steps are done in a highly chaotic way, which makes the code hard to change and fix
- Problem with error handling: during a pre-translation against DeepL one request failed with a timeout; instead of retrying, the segments were simply not pre-translated (see also TRANSLATE-2217)
- TRANSLATE-2255 and TRANSLATE-2364 MUST be implemented with this issue
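The intended step order can be illustrated with a minimal Python sketch. This is not the translate5 implementation; `AnalysisResult`, `find_repeated_ids`, `analyse` and the lookup callbacks are hypothetical names chosen for illustration. The point it shows is that the repetition flag is stored next to the match rate instead of replacing it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class AnalysisResult:
    segment_id: int
    match_rate: int = 0            # best rate found in TM/MT, never overwritten
    source: Optional[str] = None   # "TM", "MT" or None
    is_repetition: bool = False    # kept separately from match_rate


def find_repeated_ids(hashes: Dict[int, str]) -> Set[int]:
    """Step 1: only segment IDs and source hashes are needed, not full segments."""
    seen: Set[str] = set()
    repeated: Set[int] = set()
    for seg_id in sorted(hashes):
        if hashes[seg_id] in seen:
            repeated.add(seg_id)
        seen.add(hashes[seg_id])
    return repeated


def analyse(
    hashes: Dict[int, str],
    tm_lookup: Callable[[int], Optional[int]],
    mt_lookup: Optional[Callable[[int], Optional[int]]] = None,
    tm_threshold: int = 100,
) -> List[AnalysisResult]:
    repeated = find_repeated_ids(hashes)                 # step 1: repetitions
    results: List[AnalysisResult] = []
    for seg_id in sorted(hashes):
        res = AnalysisResult(seg_id, is_repetition=seg_id in repeated)
        tm_rate = tm_lookup(seg_id)                      # step 2: TM matches
        if tm_rate is not None:
            res.match_rate, res.source = tm_rate, "TM"
        if mt_lookup is not None and res.match_rate < tm_threshold:
            mt_rate = mt_lookup(seg_id)                  # step 3: MT if needed
            if mt_rate is not None and mt_rate > res.match_rate:
                res.match_rate, res.source = mt_rate, "MT"
        results.append(res)
    # Steps 4 and 5 (save the analysis, then pretranslate) work on this list.
    return results
```

Because `is_repetition` and `match_rate` are separate fields, later steps can still see the real TM rate of a repeated segment.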
According to TS-1655, not only the matchRate but also the segment validation must be considered when deciding which of the found matches should be used. For example, a purely text-based repetition may be too long if the max length differs between the repeated segments. The solution is to check the segment length / segment validation before the match is used (or to reduce the match rate with a penalty?).
The issue came up again in Stefan's comment in https://jira.translate5.net/browse/TS-1871. The error was out of memory. Task: https://smartspokes.translate5.net/editor/#project/2498/2517/focus
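Both options mentioned here (discarding an invalid match, or reducing its match rate by a penalty) can be sketched as follows. This is a Python illustration under assumptions: `Match`, `pick_match` and the `penalty` parameter are hypothetical, and the length check stands in for the full segment validation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Match:
    target: str
    match_rate: int


def pick_match(
    matches: Iterable[Match],
    max_length: Optional[int],
    penalty: Optional[int] = None,
) -> Optional[Match]:
    """Pick the best match after checking the target length.

    penalty=None discards matches that are too long;
    otherwise their match rate is reduced by `penalty`.
    """
    best: Optional[Match] = None
    for match in matches:
        rate = match.match_rate
        if max_length is not None and len(match.target) > max_length:
            if penalty is None:
                continue                      # variant A: do not use the match
            rate = max(0, rate - penalty)     # variant B: penalise the match
        if best is None or rate > best.match_rate:
            best = Match(match.target, rate)
    return best
```

With the discard variant, a too-long repetition with rate 102 loses against a valid 95% match; with a small penalty it may still win, which is the trade-off that needs to be decided.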
Run analysis in parallel
Basic idea of how we should achieve this:
General concept like AutoQA:
- 2 framing workers that do the initial work (cloning TMs, ...) and the cleanup (MatchAnalysisStart & MatchAnalysisFinish workers)
- in between, MatchAnalysisSegment workers (1 ... n), depending on available pretrans resources etc.
- these "eat" through the segments in batches and save the results to a temporary data model (maybe like in AutoQA, JSON per segment, or maybe file-based; must be discussed)
- maybe we need to assign segment ranges to the workers on queueing to have "nailed" intermediate models
- if a segment fails (pretrans), it can be retried (see AutoQA and MittagQI\Translate5\Segment\Processing\State)
- if t5memory or termtagger is not available, the MatchAnalysisSegment worker can be delayed until the problem is solved (delaying does not create stuck processes!)
- this should make MA much more robust against problems in t5memory, termtagger, or unavailable MTs
- the process should be split: first pretranslation against TMs and Termcollections, then in a second run pretranslate the non-found/fuzzy matches against the assigned MTs
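The worker concept above could look roughly like the following Python sketch. All names here (`State`, `ServiceUnavailable`, `split_into_ranges`, `process_batch`) are hypothetical and only loosely modelled on the idea of a per-segment processing state; the real workers and the `MittagQI\Translate5\Segment\Processing\State` class may look quite different.

```python
from enum import Enum
from typing import Callable, Dict, List


class State(Enum):
    UNPROCESSED = "unprocessed"
    PROCESSED = "processed"
    FAILED = "failed"


class ServiceUnavailable(Exception):
    """Assumed signal that t5memory, termtagger or an MT is currently down."""


def split_into_ranges(segment_ids: List[int], worker_count: int) -> List[List[int]]:
    """Assign a fixed segment range to each worker when queueing."""
    size = max(1, -(-len(segment_ids) // worker_count))  # ceiling division
    return [segment_ids[i:i + size] for i in range(0, len(segment_ids), size)]


def process_batch(
    segment_ids: List[int],
    pretranslate: Callable[[int], None],
    states: Dict[int, State],
    attempts: Dict[int, int],
    max_retries: int = 3,
) -> bool:
    """Process one batch; return False if the worker should be delayed."""
    for seg_id in segment_ids:
        if states.get(seg_id) in (State.PROCESSED, State.FAILED):
            continue
        try:
            pretranslate(seg_id)
        except ServiceUnavailable:
            return False  # delay the whole worker, segment states stay untouched
        except TimeoutError:
            # A single failing segment stays retryable until max_retries is hit.
            attempts[seg_id] = attempts.get(seg_id, 0) + 1
            exhausted = attempts[seg_id] >= max_retries
            states[seg_id] = State.FAILED if exhausted else State.UNPROCESSED
            continue
        states[seg_id] = State.PROCESSED
    return True
```

A worker that gets `False` back would be re-queued with a delay, while a `True` result with remaining `UNPROCESSED` segments simply means another pass over the same range is needed.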
Use only one of the assigned TMs for internal Fuzzy search
If internal fuzzy is active, the internal fuzzy search is currently done against each of the TMs: each TM is cloned, and all segments are written into each TM with a hash after the matches for the segment have been queried.
When using more than one TM for a task, this is highly inefficient, since the internal fuzzy matches will be the same for all assigned TMs. Therefore it is enough to do the internal fuzzy process only with the first of the assigned TMs.
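A minimal Python sketch of the proposed selection, assuming TMs are identified by name and that `plan_tm_usage` is a hypothetical helper (not an existing translate5 function): all assigned TMs are still queried for matches, but only the first one is cloned and written to for internal fuzzies.

```python
from typing import Dict, List


def plan_tm_usage(assigned_tms: List[str], internal_fuzzy: bool) -> Dict[str, List[str]]:
    """Decide which TMs are queried and which one is cloned for internal fuzzies."""
    plan: Dict[str, List[str]] = {
        "query": list(assigned_tms),          # every assigned TM is still queried
        "clone_for_internal_fuzzy": [],       # TMs that receive the segment writes
    }
    if internal_fuzzy and assigned_tms:
        # Internal fuzzy matches are the same for all TMs, so one clone suffices.
        plan["clone_for_internal_fuzzy"] = [assigned_tms[0]]
    return plan
```

For a task with three TMs this reduces the number of clones and segment writes from three to one, while the lookup against all three TMs is unchanged.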
Issue Links
- blocks
  - TRANSLATE-2255 Only send non-pretranslated segments to MT in pre-translation process (Backlog)
  - TRANSLATE-2364 Improve batch algorithm in pre-translation and analysis (Open)
- causes
  - TRANSLATE-3068 Fix repetition behaviour in pre-translation with MT only (Done)
- is blocked by
  - TRANSLATE-2945 Missing pre-translations if internal fuzzies enabled (Open)
- relates to
  - TRANSLATE-2335 Do not query MT when doing analysis in batch mode without MT pre-translation (Done)
  - TRANSLATE-2834 Change repetition behaviour in pre-translation (Done)
  - TRANSLATE-2217 List refactoring and code maintenance needs in translate5 (Selected for dev)