Details

    • Sub-task
    • Resolution: Fixed
    • translate5 - 2.8.1

    Description

      Definitions

      • The IBM idea of exact-exact matches and context matches is respected.
        • An exact-exact match is a 100% match that has the same document name as the document currently being translated. In the analysis, it is handled as a 101% match.
        • A repetition is a segment that already appeared further above in the same task with the same words and tag order. Same tag order means: tags must be present at the same positions and in the same number in the segment, but they do not have to be the same tags. The repetition algorithm is the same as the one used by the translate5 repetition editor. In the analysis, a repetition is handled as a 102% match.
        • A context match is an exact-exact match that additionally has the same context set in the TM as in the document, usually an ID, which is often the line number or segment id (depending on what the import does). In the analysis, it is handled as a 103% match.
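The definitions above can be read as a small decision rule for 100% TM matches. The following is a minimal sketch of that rule, not the translate5 implementation; the class name, method name and all parameters are hypothetical, and repetitions (102%) are left out because they are detected inside the task rather than from the TM result.

```java
class MatchClassifier {
    static final int EXACT = 100;
    static final int EXACT_EXACT = 101;
    static final int CONTEXT = 103;

    /**
     * Upgrades a 100% TM match to 101% or 103% following the definitions above.
     * All parameter names are assumptions made for this sketch.
     */
    static int classify(int tmMatchRate, String tmDocName, String taskDocName,
                        String tmContext, String segmentContext) {
        if (tmMatchRate < EXACT) {
            return tmMatchRate; // fuzzy matches are left unchanged
        }
        boolean sameDocument = tmDocName != null && tmDocName.equals(taskDocName);
        if (!sameDocument) {
            return EXACT; // plain 100% match
        }
        // A context match is an exact-exact match with the same context (e.g. a segment id).
        boolean sameContext = tmContext != null && tmContext.equals(segmentContext);
        return sameContext ? CONTEXT : EXACT_EXACT;
    }
}
```

Because a context match is defined as a special case of an exact-exact match, the document name is checked first and the context only afterwards.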

      Basic match analysis structure

      After a task is imported, a match analysis should be created based on the assigned TM-based match resources.
      To obtain the analysis results, each segment is sent to the assigned match resources. For each queried match resource, the best match rate received is stored in a separate DB table.
      All desired analyses can be calculated from this table.
      The analysis DB table should contain:

      • the taskGuid
        • This foreign key (between taskGuid and the LEK_task table; tmmt) must not define a delete action (no cascading delete), because the statistics should remain available even after the tmmt (match resource) is deleted. The deletion of a tmmt should therefore be logged in a separate log table (or in LEK_task_log), so that the TMMT type and name stay associated with the id.
      • the segmentId
      • the segmentNrInTask
      • the tmmtid (match resource id)
      • the best matchrate
      • analysisId
        • foreign key to LEK_match_analysis_taskassoc
      • the word count of the segment (see below)

      Analysis to task assoc table:

      • id
      • taskGuid - foreign key to the LEK_task table
      • created - timestamp of when the analysis was created
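To illustrate how analyses can be calculated from the table described above, here is a minimal sketch that models one table row and sums the word counts per match rate. It assumes that the best match rate across all match resources is what counts for a segment; the class, field and method names are hypothetical and only mirror the columns listed above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MatchAnalysis {
    /** One analysis row: the best match rate of one match resource for one segment. */
    static final class Row {
        final int segmentId;
        final int tmmtId;
        final int matchRate;
        final int wordCount;

        Row(int segmentId, int tmmtId, int matchRate, int wordCount) {
            this.segmentId = segmentId;
            this.tmmtId = tmmtId;
            this.matchRate = matchRate;
            this.wordCount = wordCount;
        }
    }

    /** Sums word counts per match rate, using the best rate per segment across all resources. */
    static Map<Integer, Integer> wordsPerMatchRate(List<Row> rows) {
        Map<Integer, Row> bestPerSegment = new HashMap<>();
        for (Row row : rows) {
            bestPerSegment.merge(row.segmentId, row,
                    (a, b) -> a.matchRate >= b.matchRate ? a : b);
        }
        Map<Integer, Integer> result = new TreeMap<>();
        for (Row best : bestPerSegment.values()) {
            result.merge(best.matchRate, best.wordCount, Integer::sum);
        }
        return result;
    }
}
```

Grouping the resulting match rates into ranges (for example fuzzy bands) would be a second step on top of this map.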

      Repetitions

      • Only source repetitions are relevant for the analysis!
      • Repetitions (100% matches that will be generated by the task itself during translation) are counted with the same algorithm that the repetition editor uses.
      • A repetition that is also found as a 103% match in the TM is NOT counted as a repetition, and it is pre-translated. But if there are normal 100% matches, 101% matches or fuzzy matches for the same segment, the repetition is counted and not the other matches. This is solved by setting the match rate to 102% for repetitions.
      • The first segment of a repetition group gets the best match rate from the TM; only the repetitions of this first segment are set to 102%.
      • To solve data storage and visualization, we assume a match rate of 102% for internal repetitions. This way the context match (103%) is rated higher, as desired.
      • The target md5 hashes must always be recalculated when the target changes in the first workflow step of translation tasks (TRANSLATE-885).
        • The same applies to pre-translation: whenever we get a pre-translation result, the target md5 hash must be recalculated.
      • As a consequence, the autostate and matchrate columns must be displayed in the repetition editor, so that the user can decide what to do with the repetitions found.
      • For pre-translation: 102% segments (repetitions) are pre-translated from the TM if there is a >=100% match from the TM.
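The repetition rules above can be sketched as follows. This is an illustration under stated assumptions, not the repetition editor's algorithm: tags are assumed to look like `<...>`, every tag is replaced by the same placeholder so that only tag position and count matter, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RepetitionAnalysis {
    static final int REPETITION = 102;
    static final int CONTEXT = 103;

    /** Key that is equal for segments with the same words and the same tag positions. */
    static String repetitionKey(String source) {
        // Assumption: any <...> sequence is a tag; the tag itself is irrelevant.
        return source.replaceAll("<[^>]+>", "<tag/>");
    }

    /**
     * @param sources      source segments in task order
     * @param tmMatchRates best TM match rate per segment, in the same order
     * @return the match rates to store in the analysis
     */
    static List<Integer> applyRepetitions(List<String> sources, List<Integer> tmMatchRates) {
        Set<String> seen = new HashSet<>();
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < sources.size(); i++) {
            int tmRate = tmMatchRates.get(i);
            boolean firstOccurrence = seen.add(repetitionKey(sources.get(i)));
            if (firstOccurrence || tmRate >= CONTEXT) {
                result.add(tmRate);      // first of its group, or a context match
            } else {
                result.add(REPETITION);  // overrides 100%, 101% and fuzzy matches
            }
        }
        return result;
    }
}
```

For example, a second occurrence of a segment with a 100% TM match is stored as 102%, while a later occurrence with a 103% TM match keeps its 103%.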

      Word count

      The word count of each segment is calculated as follows, based on this list of whitespace characters:

          /** All the Unicode whitespace chars. */
          public static final char [] WHITESPACE_CHARS = {
                          9,
                          0x0A,
                          0x0B,
                          0x0C,
                          0x0D,
                          0x20,
                          0x85,
                          0xA0,
                          0x1680,
                          0x2000,
                          0x2001,
                          0x2002,
                          0x2003,
                          0x2004,
                          0x2005,
                          0x2006,
                          0x2007,
                          0x2008,
                          0x2009,
                          0x200A,
                          0x200D,
                          0x2028,
                          0x2029,
                          0x202F,
                          0x205F,
                          0x3000
          };
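As a minimal sketch of how the list above could drive the word count, the following counts maximal runs of non-whitespace characters. It assumes that the segment text has already been stripped of tags and that every such run counts as one word; the class and method names are hypothetical.

```java
import java.util.Arrays;

class WordCounter {
    /** Same code points as in the list above, in ascending order for binary search. */
    private static final char[] WHITESPACE_CHARS = {
        0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0xA0, 0x1680,
        0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007,
        0x2008, 0x2009, 0x200A, 0x200D, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
    };

    static boolean isWhitespace(char c) {
        return Arrays.binarySearch(WHITESPACE_CHARS, c) >= 0;
    }

    /** Counts maximal runs of non-whitespace characters as words. */
    static int countWords(String text) {
        int words = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            if (isWhitespace(text.charAt(i))) {
                inWord = false;
            } else if (!inWord) {
                inWord = true; // a new word starts here
                words++;
            }
        }
        return words;
    }
}
```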
      

      Word counts for East Asian languages

      In East Asian languages, words are not separated by whitespace. Therefore, according to GMX-V2, for Chinese, Japanese, Korean and Thai we will

      For Lao, Khmer and Myanmar we will simply report the character count as the word count until someone provides a good factor.
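A character-based count for languages without word separators could be sketched like this. It is only an illustration: the assumption that whitespace and punctuation are excluded from the count, the rounding, and the `charsPerWord` parameter are not taken from this ticket, and the punctuation set has to be supplied by the caller (for example from the lists linked below). A factor of 1.0 yields the plain character count.

```java
import java.util.Set;

class CharBasedWordCounter {
    /** Counts code points that are neither whitespace nor in the given punctuation set. */
    static int countChars(String text, Set<Integer> punctuation) {
        return (int) text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp) && !Character.isSpaceChar(cp))
                .filter(cp -> !punctuation.contains(cp))
                .count();
    }

    /**
     * Converts the character count to a word count.
     * charsPerWord is a hypothetical factor; 1.0 reports the character count itself.
     */
    static int wordCount(String text, Set<Integer> punctuation, double charsPerWord) {
        return (int) Math.round(countChars(text, punctuation) / charsPerWord);
    }
}
```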

      List of Thai punctuation chars

      https://en.wiktionary.org/wiki/Category:Thai_punctuation_marks

      List of Khmer punctuation chars

      https://en.wiktionary.org/wiki/Category:Khmer_punctuation_marks

      List of Lao punctuation chars

      https://en.wiktionary.org/wiki/Category:Lao_punctuation_marks

      List of Myanmar punctuation chars

      https://en.wikipedia.org/wiki/Burmese_alphabet#Punctuation

            People

              aleksandar Aleksandar Mitrev
              marcmittag Marc Mittag [Administrator]