problem

The MatchAnalysis Code must be refactored:

Wrong class hierarchy
Code is too complex and too deeply nested
Repetition handling is very memory-consuming (probably the reason for TS-1406)
The whole process must be changed so that the sequence of steps is clear:
1. check repetitions (but load only the segment IDs, not all data)
2. then load matches from the TM (but do not mix up match info with repetition info by overwriting the match rate with 102 here; this loses information and causes problems later on)
3. then, if needed, load matches from MT
4. save the analysis
5. then pre-translate
Currently these steps are done in a highly chaotic order, which makes the code hard to change and fix.
Problem with error handling: during a pre-translation against DeepL, one request failed with a timeout; instead of being retried, the affected segments were simply not pre-translated (see also ~~TRANSLATE-2217~~)
TRANSLATE-2255 and TRANSLATE-2364 MUST be implemented together with this issue
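The intended step order can be sketched roughly as follows. This is only an illustration in Python, not translate5 code: `SegmentAnalysis`, `check_repetitions`, `load_matches`, `run_analysis` and the callback parameters are hypothetical names, and using a source hash to detect repetitions is an assumption. The point is that the repetition flag is stored next to the match rate instead of replacing it with 102.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentAnalysis:
    segment_id: int
    is_repetition: bool = False         # repetition info lives in its own field
    match_rate: Optional[int] = None    # real TM/MT rate, never overwritten with 102
    match_source: Optional[str] = None  # "tm" or "mt"


def check_repetitions(segments):
    """Step 1: works on (segment_id, source_hash) pairs only, not on full segment data."""
    seen = set()
    results = []
    for segment_id, source_hash in segments:
        results.append(SegmentAnalysis(segment_id, is_repetition=source_hash in seen))
        seen.add(source_hash)
    return results


def load_matches(results, query, source, only_missing=False):
    """Steps 2 and 3: query(segment_id) returns a match rate or None."""
    for result in results:
        if only_missing and result.match_rate is not None:
            continue  # already matched by an earlier step
        rate = query(result.segment_id)
        if rate is not None:
            result.match_rate, result.match_source = rate, source


def run_analysis(segments, query_tm, query_mt, save, pretranslate):
    results = check_repetitions(segments)         # 1. repetitions
    load_matches(results, query_tm, "tm")         # 2. TM matches
    if query_mt is not None:                      # 3. MT only if needed
        load_matches(results, query_mt, "mt", only_missing=True)
    save(results)                                 # 4. save the analysis
    pretranslate(results)                         # 5. pre-translate
    return results
```

For simplicity the sketch sends only segments without any TM match to MT; which fuzzy matches should also go to MT is left open here.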

According to TS-1655, the match rate alone is not enough to decide which of the found matches should be used; the segment validation must be considered as well. For example, a purely text-based repetition may be too long if the maximum length differs between the repeated segments. The solution is to check the segment length / segment validation before the match is used (or to reduce the match rate with a penalty?).
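Both options mentioned above (rejecting an invalid match or applying a penalty) could look roughly like this. It is a minimal sketch under assumptions: `Candidate`, `pick_match`, the plain character-count length check and the penalty value are all hypothetical and stand in for the real segment validation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Candidate:
    target: str
    match_rate: int


def pick_match(
    candidates: Iterable[Candidate],
    max_length: Optional[int],
    penalty: Optional[int] = None,
) -> Optional[Tuple[int, Candidate]]:
    """Return (effective_rate, candidate) for the best usable match, or None.

    A candidate whose target exceeds max_length is skipped when penalty is None,
    otherwise its rate is reduced by penalty before comparison.
    """
    best = None
    for candidate in candidates:
        rate = candidate.match_rate
        if max_length is not None and len(candidate.target) > max_length:
            if penalty is None:
                continue  # validation failed: do not use this match at all
            rate = max(0, rate - penalty)
        if best is None or rate > best[0]:
            best = (rate, candidate)
    return best
```

With strict rejection, a too-long repetition loses against a shorter fuzzy match; with a penalty it may still win, which is exactly the trade-off that has to be decided.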

The issue came up again in Stefan's comment in https://jira.translate5.net/browse/TS-1871. The error was out of memory. Task: https://smartspokes.translate5.net/editor/#project/2498/2517/focus

Run analysis in parallel

Basic idea how we should achieve this:

General Concept like AutoQA:

2 framing workers that do the initial work (cloning TMs, ...) and the cleanup (MatchAnalysisStart & MatchAnalysisFinish workers)
in between, MatchAnalysisSegment workers (1 ... n), depending on the available pre-translation resources etc.
these "eat" through the segments in batches and save the results to a temporary data model (maybe like in AutoQA, JSON per segment, or maybe file-based; must be discussed)
maybe we need to assign segment ranges to the workers on queueing to have "nailed" intermediate models
if a segment fails (pre-translation), it can be retried (see AutoQA and MittagQI\Translate5\Segment\Processing\State)
if t5memory or the termtagger is not available, the MatchAnalysisSegment worker can be delayed until the problem is solved (delaying does not create stuck processes!)
this should make the match analysis much more robust against problems in t5memory, the termtagger, or unavailable MTs
the process should be split: first pre-translate against TMs and term collections, then in a second run pre-translate the not-found/fuzzy matches against the assigned MTs
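The retry and delay behaviour of a segment worker described above might be modelled like this. The sketch is purely illustrative and makes several assumptions: the `State` enum here is not the API of `MittagQI\Translate5\Segment\Processing\State`, and `ServiceUnavailable`, `SegmentFailed`, `run_segment_worker`, `batch_size` and `max_attempts` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    UNPROCESSED = "unprocessed"
    RETRY = "retry"
    PROCESSED = "processed"
    FAILED = "failed"


class ServiceUnavailable(Exception):
    """A whole service (e.g. t5memory, termtagger, an MT) is down."""


class SegmentFailed(Exception):
    """A single segment request failed, e.g. with a timeout."""


def run_segment_worker(states, attempts, results, process, batch_size=2, max_attempts=3):
    """Process pending segments in batches.

    Returns "delayed" if a service is down (states stay untouched, the worker
    can be re-queued later), "retry" if segments are left in RETRY state,
    otherwise "finished".
    """
    pending = [sid for sid, st in states.items() if st in (State.UNPROCESSED, State.RETRY)]
    for start in range(0, len(pending), batch_size):
        for sid in pending[start:start + batch_size]:
            try:
                results[sid] = process(sid)  # intermediate result per segment
                states[sid] = State.PROCESSED
            except ServiceUnavailable:
                return "delayed"
            except SegmentFailed:
                attempts[sid] = attempts.get(sid, 0) + 1
                states[sid] = State.RETRY if attempts[sid] < max_attempts else State.FAILED
    return "retry" if State.RETRY in states.values() else "finished"
```

Because the state is kept per segment, a delayed or re-queued worker simply continues with whatever is still `UNPROCESSED` or `RETRY`; nothing is silently skipped as in the DeepL timeout case.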

Use only one of the assigned TMs for internal Fuzzy search

If internal fuzzy is active, the internal fuzzy search is currently done against each of the TMs. So each TM is cloned, and all segments are written to each clone with a hash after the matches for the segment have been queried.

When using more than one TM for a task, this is highly inefficient, since the internal fuzzy matches will be the same for all assigned TMs. Therefore it is enough to do the internal fuzzy process only with the first of the assigned TMs.
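As a rough illustration of the proposed behaviour, the following sketch clones only the first assigned TM and writes the analysed segments to that clone alone. Everything in it is an assumption for demonstration purposes: `InMemoryTm` is a toy stand-in for a real TM, the similarity is computed with Python's `difflib` rather than a real fuzzy matcher, and the hash that is written along with each segment is left out.

```python
from difflib import SequenceMatcher


class InMemoryTm:
    """Toy TM that stores source texts only."""

    def __init__(self, name, entries=()):
        self.name = name
        self.entries = list(entries)
        self.writes = 0

    def clone(self):
        return InMemoryTm(self.name + "-internal-fuzzy", self.entries)

    def best_rate(self, source):
        rates = (round(100 * SequenceMatcher(None, source, e).ratio()) for e in self.entries)
        return max(rates, default=0)

    def add(self, source):
        self.entries.append(source)
        self.writes += 1


def analyse_with_internal_fuzzy(segments, assigned_tms):
    """Query all TMs, but clone and fill only the first one for internal fuzzies."""
    if not assigned_tms:
        return [0 for _ in segments], None
    fuzzy_tm = assigned_tms[0].clone()            # the only clone that is created
    query_targets = [fuzzy_tm] + list(assigned_tms[1:])
    rates = []
    for source in segments:
        rates.append(max(tm.best_rate(source) for tm in query_targets))
        fuzzy_tm.add(source)                      # written after the query, once per segment
    return rates, fuzzy_tm
```

With n assigned TMs this does one clone and one write per segment instead of n of each, while the internal fuzzy matches found stay the same.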

blocks
TRANSLATE-2255 Only send non-pretranslated segments to MT in pre-translation process (Backlog)
TRANSLATE-2364 Improve batch algorithm in pre-translation and analysis (Open)

causes
TRANSLATE-3068 Fix repetition behaviour in pre-translation with MT only (Done)

is blocked by
TRANSLATE-2945 Missing pre-translations if internal fuzzies enabled (Open)

relates to
TRANSLATE-2335 Do not query MT when doing analysis in batch mode without MT pre-translation (Done)
TRANSLATE-2834 Change repetition behaviour in pre-translation (Done)
TRANSLATE-2217 List refactoring and code maintenance needs in translate5 (Selected for dev)
