Details
-
Sub-task
-
Resolution: Fixed
Description
Definitions
- The IBM concept of exact-exact matches and context matches is followed.
- An exact-exact match is a 100% match that has the same document name as the document currently being translated. For analysis, this is handled as a 101% match.
- A repetition is a segment that has already appeared further up in the same task with the same words and the same tag order. Same tag order means that tags must be present at the same positions and in the same number in the segment, but they do not have to be the same tags. The repetition algorithm is the same one used by the translate5 repetition editor. For analysis, a repetition is handled as a 102% match.
- A context match is an exact-exact match that additionally has the same context set in the TM as in the document. The context is usually an ID, often the line number or segment ID (depending on what the import does). For analysis, this is handled as a 103% match.
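The definitions above can be read as a small precedence rule. The following is a minimal sketch of that reading, not the translate5 implementation; the function name and its parameters are hypothetical, and it assumes the caller already knows whether the document name and context matched and whether the segment is a later member of a repetition group.

```python
def analysis_match_rate(tm_rate, same_document=False, same_context=False,
                        is_repetition=False):
    """Map a raw TM match rate onto the 100-103 scale used for analysis.

    All parameter names are illustrative assumptions:
    tm_rate        best match rate returned by the TM (0-100)
    same_document  the TM entry has the same document name
    same_context   the TM entry has the same context (e.g. segment ID)
    is_repetition  the segment repeats an earlier segment of the same task
    """
    exact_exact = tm_rate == 100 and same_document
    if exact_exact and same_context:
        return 103  # context match, rated above everything else
    if is_repetition:
        return 102  # repetition inside the task
    if exact_exact:
        return 101  # exact-exact match
    return tm_rate  # plain 100% match or fuzzy match


print(analysis_match_rate(100, same_document=True))   # 101
print(analysis_match_rate(85, is_repetition=True))    # 102
```

Ordering the checks from 103 downwards keeps the rule "a context match is never counted as a repetition" in one place.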
Basic match analysis structure
After a task has been imported, a match analysis should be created based on the assigned TM-based MatchRessources.
To get the analysis results, each segment is sent to the assigned MatchRessources. For each queried MatchRessource, the best match rate received is stored in a separate DB table.
All desired analyses can be calculated from this table.
The analysis DB table should contain:
- the taskGuid
- This foreign key (between taskGuid and the LEK_task table; tmmt) should not define any delete action, since the statistics should remain even if the tmmt (match resource) is deleted. The deletion of a tmmt should therefore be logged in a separate log table (or in LEK_task_log), so that the TMMT type and name belonging to the id are preserved.
- the segmentId
- the segmentNrInTask
- the tmmtid (match resource id)
- the best matchrate
- analysisId
- foreign key to LEK_match_analysis_taskassoc
- the word count of the segment (see below)
Analysis to task assoc table:
- id
- taskGuid - foreign key to the LEK_task table
- created - timestamp of when the analysis was created
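To make the two tables more concrete, here is a minimal sketch using an in-memory SQLite database. Only LEK_match_analysis_taskassoc and the column list come from the description above; the table name LEK_match_analysis, the column names matchRate and wordCount, the column types, and both helper functions are assumptions made for illustration. The tmmtid column carries no foreign key constraint here, which is one possible reading of the requirement that statistics must survive the deletion of a match resource.

```python
import sqlite3

SCHEMA = """
CREATE TABLE LEK_match_analysis_taskassoc (
    id       INTEGER PRIMARY KEY,
    taskGuid TEXT NOT NULL,            -- would reference the LEK_task table
    created  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE LEK_match_analysis (      -- hypothetical table name
    id              INTEGER PRIMARY KEY,
    taskGuid        TEXT NOT NULL,
    segmentId       INTEGER NOT NULL,
    segmentNrInTask INTEGER NOT NULL,
    tmmtid          INTEGER,           -- no FK in this sketch: rows outlive the tmmt
    matchRate       INTEGER NOT NULL,  -- best match rate from this tmmt
    wordCount       INTEGER NOT NULL,
    analysisId      INTEGER NOT NULL
                    REFERENCES LEK_match_analysis_taskassoc(id)
);
"""


def open_analysis_db():
    """Create an in-memory database containing the two sketch tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def words_per_match_rate(conn, analysis_id):
    """Sum word counts per match rate, using each segment's best rate."""
    rows = conn.execute(
        """
        SELECT best, SUM(words) FROM (
            SELECT segmentId, MAX(matchRate) AS best, MAX(wordCount) AS words
            FROM LEK_match_analysis
            WHERE analysisId = ?
            GROUP BY segmentId
        ) GROUP BY best
        """,
        (analysis_id,),
    ).fetchall()
    return dict(rows)
```

Because one row is stored per segment and queried match resource, the inner query first reduces each segment to its best rate before the word counts are summed.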
Repetitions
- Only source repetitions are relevant for the analysis.
- Repetitions (100% matches that will be generated by the same task while translating) are counted with the same algorithm that the repetition editor uses.
- A repetition that is also found as a 103% match in the TM is NOT counted as a repetition; it is pre-translated instead. If, however, the same segment only has normal 100% matches, 101% matches, or fuzzy matches, it is counted as a repetition and not as one of those matches. This is achieved by setting the match rate of repetitions to 102%.
- The first segment of a repetition group gets the best match rate from the TM; only the subsequent repetitions of this first segment are set to 102%.
- To handle data storage and visualization, a match rate of 102% is assumed for internal repetitions. The context match (103%) is thus rated higher, as desired.
- The target MD5 hashes must always be recalculated when the target changes in the first workflow step of translation tasks (TRANSLATE-885). The same applies to pretranslation: whenever a pretranslation result is received, the target MD5 hash must be recalculated.
- As a consequence, the autostate and match rate columns must be displayed in the repetition editor, so that the user can decide what to do with the repetitions found.
- For pretranslation: 102% segments (repetitions) are pretranslated from the TM if the TM provides a match of 100% or higher.
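The repetition rules above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of the translate5 repetition editor: it assumes repetitions are detected by comparing an MD5 hash of the source text in which every tag is replaced by one generic placeholder, and the tag pattern and both function names are hypothetical.

```python
import hashlib
import re

# Assumed tag syntax, for illustration only.
TAG = re.compile(r"<[^>]+>")


def source_hash(source):
    """MD5 of the source with every tag replaced by one generic placeholder.

    Tags then only need to match in position and number, not in identity.
    """
    normalized = TAG.sub("<tag/>", source)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def apply_repetitions(segments):
    """Return the analysis match rate per segment.

    segments: list of (source, best_tm_rate) tuples in task order.
    The first member of a repetition group keeps its TM rate; later
    members are set to 102 unless the TM delivered a 103% context match.
    """
    seen = set()
    rates = []
    for source, tm_rate in segments:
        key = source_hash(source)
        if key in seen and tm_rate < 103:
            rates.append(102)
        else:
            rates.append(tm_rate)
        seen.add(key)
    return rates


task = [
    ("Hello <b>world</b>", 85),   # first of its group: keeps 85
    ("Hello <i>world</i>", 85),   # same words, same tag positions: 102
    ("Hello world", 100),         # different tag count: not a repetition
    ("Hello <b>world</b>", 103),  # context match wins over repetition
]
print(apply_repetitions(task))  # [85, 102, 100, 103]
```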
Word count
The word count of each segment is calculated as follows:
- All whitespace tags (the new type of tags Thomas just introduced) are replaced by a single space
- All other tags and other markup are deleted from the segment
- All punctuation characters are removed from the segment (for a list of punctuation characters see https://www.xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#PunctuationCharacters)
- The segment is split into word chunks with a regular expression. The regex is based on a definite list of non-word characters, which is added below. This list is the practical implementation of the https://www.xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#Words standard.
- Based on the result, the words can be counted
/** All the Unicode whitespace chars. */
public static final char[] WHITESPACE_CHARS = {
    9, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0xA0, 0x1680,
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007,
    0x2008, 0x2009, 0x200A, 0x200D, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
};
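The word count steps can be sketched as below. The character list is copied from the constant above; everything else is an assumption for illustration: the tag syntax, the whitespace tag names (space, tab, newline), and the use of the Unicode general category "P" as a stand-in for the GMX-V punctuation list, which may differ in detail.

```python
import re
import unicodedata

# Copied from the WHITESPACE_CHARS list above.
WHITESPACE_CHARS = [
    9, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0xA0, 0x1680,
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007,
    0x2008, 0x2009, 0x200A, 0x200D, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
]

# Hypothetical tag syntax and whitespace tag names.
WHITESPACE_TAG = re.compile(r"<(?:space|tab|newline)\b[^>]*/>")
ANY_TAG = re.compile(r"<[^>]+>")

# One regex character class built from the list of non-word characters.
SPLITTER = re.compile(
    "[" + "".join(re.escape(chr(c)) for c in WHITESPACE_CHARS) + "]+"
)


def count_words(segment):
    """Count the words of a segment following the steps listed above."""
    text = WHITESPACE_TAG.sub(" ", segment)   # whitespace tags -> one space
    text = ANY_TAG.sub("", text)              # drop all remaining tags
    text = "".join(                           # drop punctuation (approximation)
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return len([chunk for chunk in SPLITTER.split(text) if chunk])


print(count_words("Hello, <b>world</b>!<space/>How are you?"))  # 5
```

Since punctuation is removed rather than replaced, a hyphenated compound such as "well-known" counts as one word in this sketch.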
Word counts for East Asian languages
In East Asian languages there is no whitespace between words. Therefore, according to GMX-V 2.0, for Chinese, Japanese, Korean and Thai we will
- remove all tags and punctuation characters from the segment
- count the graphemes of each segment with grapheme_strlen
- and divide the count by the language-specific factor listed in https://www.xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#LogographicScripts
For Lao, Khmer and Myanmar we will simply use the character count as the word count until someone provides a suitable factor.
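A minimal sketch of this calculation follows. Several points are assumptions: the factor values are given as I recall them from the GMX-V 2.0 table linked above and should be verified there; the language codes and function names are hypothetical; grapheme counting is approximated by counting code points that are not combining marks, which is only a rough stand-in for grapheme_strlen; and tags are assumed to be stripped already. The description does not say how the result is rounded, so the sketch returns the unrounded quotient.

```python
import unicodedata

# Factors recalled from the GMX-V 2.0 logographic scripts table;
# verify them against the linked specification before relying on them.
GMX_V_FACTORS = {"zh": 2.8, "ja": 3.0, "ko": 3.3, "th": 6.0}

# Languages for which the plain character count is used as word count.
CHAR_COUNT_ONLY = {"lo", "km", "my"}


def grapheme_count(text):
    """Approximate grapheme count: code points that are not combining marks."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


def east_asian_word_count(text, lang):
    """Word count for a tag-free segment in one of the languages above."""
    cleaned = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    count = grapheme_count(cleaned)
    if lang in CHAR_COUNT_ONLY:
        return count
    if lang not in GMX_V_FACTORS:
        raise ValueError(f"no factor defined for language {lang!r}")
    return count / GMX_V_FACTORS[lang]


# Six characters plus one punctuation mark, divided by the Japanese factor.
print(east_asian_word_count("日本語の文章。", "ja"))  # 2.0
```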
List of Thai punctuation chars
https://en.wiktionary.org/wiki/Category:Thai_punctuation_marks
List of Khmer punctuation chars
https://en.wiktionary.org/wiki/Category:Khmer_punctuation_marks
List of Lao punctuation chars
https://en.wiktionary.org/wiki/Category:Lao_punctuation_marks
List of Myanmar punctuation chars
Issue Links
- relates to TRANSLATE-1254 Show "other" whitespaces in a unified whitespace-tag (Backlog)