Definitions

The IBM concept of exact-exact matches and context matches is respected.
- An exact-exact match is a 100% match that has the same document name as the document currently being translated. In the analysis, it is handled as a 101% match.
- A repetition is a segment that already appeared further up in the same task with the same words and the same tag order. Same tag order means: tags must be present at the same positions and in the same number in the segment, but they do not have to be the same tags. The repetition algorithm is the same as the one used by the translate5 repetition editor. In the analysis, a repetition is handled as a 102% match.
- A context match is an exact-exact match that, in addition, has the same context set in the TM as in the document. The context is usually an ID, often the line number or segment ID (depending on what the import does). In the analysis, it is handled as a 103% match.
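
The TM-side part of these definitions can be sketched as a small mapping function. This is only an illustration under assumptions: the class name, method name and parameters are hypothetical and not taken from translate5, and the 102% repetition rate is deliberately left out because it depends on the other segments of the task rather than on the TM hit.

```java
/** Hypothetical sketch: maps a single TM hit to the analysis match rate. */
final class MatchRateRules {
    static final int EXACT_EXACT = 101;
    static final int CONTEXT = 103;

    /**
     * @param tmRate      raw match rate returned by the TM (0..100)
     * @param tmDocName   document name stored with the TM entry (may be null)
     * @param taskDocName name of the document currently being translated
     * @param tmContext   context stored with the TM entry (may be null)
     * @param segContext  context of the segment in the document
     */
    static int analysisRate(int tmRate, String tmDocName, String taskDocName,
                            String tmContext, String segContext) {
        if (tmRate < 100) {
            return tmRate; // fuzzy match: rate stays unchanged
        }
        boolean sameDocument = tmDocName != null && tmDocName.equals(taskDocName);
        if (!sameDocument) {
            return 100; // plain 100% match
        }
        boolean sameContext = tmContext != null && tmContext.equals(segContext);
        return sameContext ? CONTEXT : EXACT_EXACT;
    }
}
```

For example, a 100% hit from the same document with a matching context would be rated 103, while the same hit with a different context would be rated 101.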

Basic match analysis structure

After a task is imported, a match analysis should be created based on the assigned TM-based MatchRessources.
To get the analysis results, each segment is sent to the assigned MatchRessources. For each queried MatchRessource, the best match rate received is stored in a separate DB table.
All desired analyses can be calculated from this table.
The analysis DB table should contain:

the taskGuid
- This foreign key (between taskguid and LEK_task table. tmmt) should not contain any delete FK, since the statistics should remain even if the tmmt (match resource) is deleted. The deletion of a tmmt should therefore be logged in a separate log table (or in LEK_task_log) to preserve the TMMT type and name belonging to the id.
the segmentId
the segmentNrInTask
the tmmtid (match resource id)
the best matchrate
analysisId
- foreign key to LEK_match_analysis_taskassoc
the word count of the segment (see below)

Analysis to task assoc table:

id
taskGuid - foreign key to the LEK_task table
created - timestamp of when the analysis was created
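
To illustrate how an analysis could be calculated from the stored rows, here is a minimal sketch. It assumes one row per segment and match resource, takes the best rate per segment across all resources, and sums the word counts per match rate. The class `AnalysisSketch`, the `Row` type and the method name are hypothetical; only the field names follow the column list above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an aggregation over the analysis table rows. */
final class AnalysisSketch {
    /** One row of the analysis table (subset of the columns listed above). */
    static final class Row {
        final long segmentId;
        final int tmmtId;
        final int matchRate;
        final int wordCount;

        Row(long segmentId, int tmmtId, int matchRate, int wordCount) {
            this.segmentId = segmentId;
            this.tmmtId = tmmtId;
            this.matchRate = matchRate;
            this.wordCount = wordCount;
        }
    }

    /** Sums word counts per match rate, using each segment's best rate across all resources. */
    static Map<Integer, Integer> wordsPerMatchRate(List<Row> rows) {
        Map<Long, Row> bestPerSegment = new HashMap<>();
        for (Row row : rows) {
            // keep the row with the higher match rate for each segment
            bestPerSegment.merge(row.segmentId, row,
                    (a, b) -> a.matchRate >= b.matchRate ? a : b);
        }
        Map<Integer, Integer> result = new TreeMap<>();
        for (Row best : bestPerSegment.values()) {
            result.merge(best.matchRate, best.wordCount, Integer::sum);
        }
        return result;
    }
}
```

Grouping the rates into bands (for example fuzzy ranges) would be a further step on top of this map and is not shown here.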

Repetitions

Only source repetitions are relevant for the analysis.
Repetitions (100% matches that will be generated by the same task while translating) are counted with the same algorithm that the repetition editor uses.
A repetition that is also found as a 103% match in the TM is NOT counted as a repetition, and it is pretranslated. But if there are normal 100% matches, 101% matches or fuzzy matches for the same segment, the repetition is counted and not the other matches. This is achieved by setting the matchrate to 102% for repetitions.
The first segment of a repetition group gets the best matchrate from the TM; only the repetitions of this first segment are set to 102%.
For data storage and visualization we assume a matchrate of 102% for internal repetitions, so the context match (103%) is rated higher, as desired.
The target md5 hashes must always be recalculated when the target changes in the first workflow step of translation tasks (TRANSLATE-885)
- The same applies to pretranslation: whenever we get a pretranslation result, the target md5 hash must be recalculated
As a consequence, the autostate and matchrate columns must be displayed in the repetition editor, so that the user can decide what to do with the repetitions found.
For pretranslation: 102% segments (repetitions) are pretranslated from the TM if there is a >=100% match from the TM.
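
The repetition rules above can be sketched as follows. This is an illustration, not the translate5 implementation: it assumes that each segment already has a repetition key (hypothetically a hash over the source text with normalized tags, as the repetition editor would compute it) and a best TM match rate. All names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: assigns 102% to repetitions within one task. */
final class RepetitionSketch {
    static final int REPETITION = 102;
    static final int CONTEXT = 103;

    /**
     * @param sourceKeys repetition key per segment, in task order (assumed precomputed)
     * @param tmRates    best TM match rate per segment, in the same order
     * @return the match rate to store per segment
     */
    static List<Integer> applyRepetitions(List<String> sourceKeys, List<Integer> tmRates) {
        Set<String> seen = new HashSet<>();
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < sourceKeys.size(); i++) {
            int tmRate = tmRates.get(i);
            // add() returns false if the key was already seen further up in the task
            boolean isRepetition = !seen.add(sourceKeys.get(i));
            if (isRepetition && tmRate != CONTEXT) {
                result.add(REPETITION); // repetition wins over 100%, 101% and fuzzy matches
            } else {
                result.add(tmRate);     // first occurrence, or context match from the TM
            }
        }
        return result;
    }
}
```

With the keys `a, b, a, a, b` and the TM rates `85, 100, 85, 103, 101`, this sketch yields `85, 100, 102, 103, 102`: the first segment of each group keeps its TM rate, and the context match is not overwritten.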

Word count

The word count of each segment is calculated as follows:

All whitespace tags (the new type of tags Thomas just introduced) are replaced by a single space
All other tags and other markup are deleted from the segment
All punctuation characters are removed from the segment (for a list of punctuation characters see https://www.xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#PunctuationCharacters)
The segment is split into word chunks with a regular expression. The regex is based on a definite list of non-word characters, which is added below. This list is the practical implementation of the https://www.xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#Words standard.
The words are then counted based on the result

    /** All the Unicode whitespace chars. */
    public static final char[] WHITESPACE_CHARS = {
                    9,
                    0x0A,
                    0x0B,
                    0x0C,
                    0x0D,
                    0x20,
                    0x85,
                    0xA0,
                    0x1680,
                    0x2000,
                    0x2001,
                    0x2002,
                    0x2003,
                    0x2004,
                    0x2005,
                    0x2006,
                    0x2007,
                    0x2008,
                    0x2009,
                    0x200A,
                    0x200D,
                    0x2028,
                    0x2029,
                    0x202F,
                    0x205F,
                    0x3000
    };
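
A minimal sketch of the last steps (punctuation removal, splitting and counting) could look like this. It assumes that the tag handling has already happened, so the input is tag-free text. The Unicode class `\p{P}` is only a stand-in for the GMX-V punctuation list, and the separator class mirrors the character list above. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the word count for whitespace-separated languages. */
final class WordCountSketch {
    // Assumption: \p{P} approximates the GMX-V punctuation character list.
    private static final Pattern PUNCTUATION = Pattern.compile("\\p{P}");

    // Runs of the separator characters from the list above.
    private static final Pattern SEPARATORS = Pattern.compile(
            "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A"
            + "\\u200D\\u2028\\u2029\\u202F\\u205F\\u3000]+");

    /** Counts the word chunks of a segment whose tags were already removed or replaced. */
    static int countWords(String tagFreeSegment) {
        String text = PUNCTUATION.matcher(tagFreeSegment).replaceAll("");
        int count = 0;
        for (String chunk : SEPARATORS.split(text)) {
            if (!chunk.isEmpty()) { // skip the empty chunk produced by leading separators
                count++;
            }
        }
        return count;
    }
}
```

Because punctuation is removed rather than replaced by a space, a hyphenated or apostrophized word such as "don't" counts as one word in this sketch.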

Word counts for East Asian languages

In East Asian languages there are no spaces between words. Therefore, according to GMX-V 2.0, for Chinese, Japanese, Korean and Thai we will

remove all tags and punctuation characters from the segment
count the graphemes of each segment with grapheme_strlen
and divide the count by the language-specific factor listed in https://www.xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#LogographicScripts

For Lao, Khmer and Myanmar we will simply use the character count as the word count until someone provides a good factor.
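
The grapheme-based count can be sketched in Java as well, where `java.text.BreakIterator` plays a role similar to PHP's `grapheme_strlen`. This is a sketch under assumptions: the input is already free of tags and punctuation, the factor is passed in by the caller instead of being hard-coded, and no rounding is applied because the description does not specify any. The class and method names are hypothetical.

```java
import java.text.BreakIterator;

/** Hypothetical sketch of the word count for languages without word spacing. */
final class LogographicCountSketch {
    /** Counts user-perceived characters by walking the character boundaries. */
    static int graphemeCount(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    /**
     * @param cleanedSegment segment without tags and punctuation
     * @param factor         language-specific factor from the GMX-V table (supplied by the caller)
     */
    static double wordCount(String cleanedSegment, double factor) {
        return graphemeCount(cleanedSegment) / factor;
    }
}
```

For Lao, Khmer and Myanmar, the same sketch could be called with a factor of 1.0, which gives the plain character count.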

Issue Links

relates to

TRANSLATE-1254 Show "other" whitespaces in a unified whitespace-tag

Match analysis basic back-end implementation
