  1. translate5
  2. TRANSLATE-3206

Protect and auto-convert numbers and general patterns during translation


Details

    • Medium
    • Numbers are protected with tags for all translation jobs. Custom patterns for number protection can be defined in a separate UI.

    Description

      General idea

      Numbers should be protected during import if the import job is a translation job (all target units are empty).

      This means numbers should be protected as internal tags under the following conditions:

      • It should be possible to configure, separately from other tags, whether numbers that are protected as tags are drawn out of the segment when they are positioned at the start or end of a segment.
      • The number that is protected as a tag should be visible in the tooltip of the tag and in the full tag view.
      • Numbers that change their format in another language must be auto-adjusted.
      • Numbers added by the user are not protected as tags.

       
      1. Create 3 tabs: Number recognition, Input mapping, Output mapping.
       In the output section, only lang, type, name and format are provided as columns.
      One lang per row in input and output.
       
      2. For recognition, the source language should be used for regex filtering.
      3. It should be possible to disable regex rules.
      4. Flag for default rules. Such rules cannot be changed or deleted.
      5. Toggle button to show/hide default rules.
      6. For output formats, only already existing names can be set.
      7. Rethink the name of the format column: parsing format, source format, output format, target format..?
      8. Create a Confluence page with instructions for users.
      9. Replace the icon in the tokens and number pages with a question icon.

      Effect on the translation and the translation memory and MT

      • Segments that contain only numbers, whitespace and other tags should not be exposed for translation; their numbers should be converted automatically. They were initially meant to be left out of the segment table (edited: no, number-only segments should be included for context).
      • Effect on TM:
        • If a segment with a translated number is saved to the TM, only a PH tag is saved to the TM in place of the number. That tag should contain an identifier that marks it as a number tag. The raw number should also be contained in the tag, so that on TMX export the raw number is exported instead of the tag. This needs to be supported by t5memory, which it currently is not.
        • When the TM is searched for matches for a segment (fuzzy search or analysis), the segment that is sent to the TM should likewise contain only a PH tag with the number identifier instead of the number. This way, matches are found regardless of differences in the numbers.
        • If a match from the TM is taken over into a segment (during pre-translation or manually by the translator), the tag that protects the number in the source is automatically placed in the target segment at the position of the empty PH tag from the TM.
        • TMX import: when a TMX is imported, translate5 has to pre-parse the TMX and replace plain numbers with the placeholders for t5memory (edit: according to all active input mappings for that language).
        • The exact syntax of the number tag that is sent to t5memory and the communication between translate5 and t5memory are documented in the linked https://jira.translate5.net/browse/T5TMS-174
          It needs to be implemented for
          • the update request that saves/updates a segment
          • the fuzzy request that searches for fuzzy matches
          • TMX import and export, when parsing the TMX
          • concordance search, when the search request contains a number, and the answer of t5memory, which may contain number tags (analogous to TMX parsing for import and export: a number needs to be converted to a number tag when the search request is sent to t5memory, and unprotected again when it comes back in the answer from t5memory)
          • the regex in the number tag (see linked T5TMS-174) is encoded as base64(gzdeflate(regex)).
          • we also provide an id for each number tag for ordering purposes
      • Effect on MT: For the MT it is important to have the whole number without protection. So when a text is sent to the MT, the number protection should be removed, but a tag pair that still carries the number protection should be placed around the number. When the translated text comes back from the MT, the protected number should be recreated from our tag pair, and the number between the tags should be removed.
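
      The MT round trip described above could look roughly like the following sketch. The paired form `<t5:n ...>number</t5:n>` is an assumption made for illustration (the ticket only says "a tag pair"), and the function names `expose_for_mt` and `reprotect_from_mt` are hypothetical. The single-tag syntax is the one given under "Tag format".

```python
import re

# Self-closing number tag as described under "Tag format".
SINGLE = re.compile(r'<t5:n id="(\d+)" n="([^"]*)" r="([^"]*)"/>')
# Hypothetical paired form used only for the MT request (assumption).
PAIR = re.compile(r'<t5:n id="(\d+)" n="([^"]*)" r="([^"]*)">.*?</t5:n>', re.S)


def expose_for_mt(segment: str) -> str:
    """Turn each protected number into a tag pair wrapping the plain number."""
    def open_pair(match: re.Match) -> str:
        opening = match.group(0)[:-2] + ">"  # drop "/>" and close the start tag
        return f"{opening}{match.group(2)}</t5:n>"
    return SINGLE.sub(open_pair, segment)


def reprotect_from_mt(mt_output: str) -> str:
    """Rebuild the self-closing tag and discard whatever the MT put inside."""
    def close_pair(match: re.Match) -> str:
        return f'<t5:n id="{match.group(1)}" n="{match.group(2)}" r="{match.group(3)}"/>'
    return PAIR.sub(close_pair, mt_output)
```

      The text between the tags is thrown away on the way back, so a number that the MT engine reformatted does not leak into the target; the conversion is then driven by the protected tag only.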

      Tag format

      <t5:n id=":int:" n=":string:" r=":string:"/>

      The r in the format stands for the regex that was used to find the number in the text. It is encoded as

      base64(gzdeflate(regex))

      The n is the number that was found in the source text. Its value must be a valid XML attribute value.

      Example:

      <t5:n id="2" n="2023-09-15" r="049JikmpNqnV1TCINtS1jK0xjDbQNYqtAXM04aJA0igWKANkGgMpQ5iCmCR9AA=="/> 
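
      As an illustration of the encoding, here is a minimal Python sketch. It assumes that gzdeflate refers to PHP's function of that name, which emits a raw DEFLATE stream without a zlib header; Python's zlib produces the same kind of stream with a negative wbits value. The helper names `encode_regex`, `decode_regex` and `protect` are hypothetical, and the compressed bytes need not be identical to those of another implementation, only decodable by it.

```python
import base64
import re
import zlib
from xml.sax.saxutils import quoteattr


def encode_regex(regex: str) -> str:
    """base64(gzdeflate(regex)): raw DEFLATE stream, then base64."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits = no zlib header
    raw = compressor.compress(regex.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")


def decode_regex(encoded: str) -> str:
    """Inverse of encode_regex."""
    return zlib.decompress(base64.b64decode(encoded), wbits=-15).decode("utf-8")


def protect(text: str, regex: str) -> str:
    """Replace every match of regex with a <t5:n/> tag, numbering ids from 1."""
    encoded = encode_regex(regex)
    counter = 0

    def to_tag(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        # quoteattr escapes the value so that n stays a valid XML attribute value
        return f'<t5:n id="{counter}" n={quoteattr(match.group(0))} r="{encoded}"/>'

    return re.sub(regex, to_tag, text)
```

      Calling `protect("He lived there from 19-08-1965 till 23-09-1987", r"\d{2}-\d{2}-\d{4}")` yields two tags with ids 1 and 2 whose r attribute decodes back to the regex that was passed in.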

      source in translate5

      He lived there from 19-08-1965 till 23-09-1987

      source in translate5 with n-tag

      He lived there from <t5:n id="1" n="19-08-1965" r="regexA"/> till <t5:n id="2" n="23-09-1987" r="regexA"/>

      source from t5memory

      He lived there from <t5:n id="23" n="12-05-1907" r="regexA"/> till <t5:n id="7" n="11-05-1987" r="regexA"/>

      target from t5memory

      Er lebte von <t5:n id="7" n="11-05-1987" r="regexA"/> bis <t5:n id="23" n="12-05-1907" r="regexA"/>

      target in translate5 with n-tag

      Er lebte von <t5:n id="2" n="23-09-1987" r="regexA"/> bis <t5:n id="1" n="19-08-1965" r="regexA"/>
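
      The example above suggests that the tags of a TM match are mapped to the tags of the current source by their order of appearance in the TM source, and that the TM target then determines the final order. A minimal sketch of that mapping, assuming exactly this positional rule (the function name `take_over_match` is hypothetical):

```python
import re

TAG = re.compile(r'<t5:n id="(\d+)" n="([^"]*)" r="([^"]*)"/>')


def take_over_match(current_source: str, tm_source: str, tm_target: str) -> str:
    """Put the number tags of the current source into the TM target."""
    current_tags = [m.group(0) for m in TAG.finditer(current_source)]
    tm_ids = [m.group(1) for m in TAG.finditer(tm_source)]
    if len(current_tags) != len(tm_ids):
        raise ValueError("number of tags differs between segment and TM match")
    # The n-th tag of the TM source corresponds to the n-th tag of our source.
    by_tm_id = dict(zip(tm_ids, current_tags))
    # Tags whose id is unknown are left untouched (assumption).
    return TAG.sub(lambda m: by_tm_id.get(m.group(1), m.group(0)), tm_target)
```

      Fed with the three segments of the example (source in translate5 with n-tag, source from t5memory, target from t5memory), the function returns the "target in translate5 with n-tag" line shown above.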

      What is a number

      • Any integer that is surrounded only by whitespace. Other tags should be ignored for this evaluation, meaning the following should be treated as a number to protect:
        • SEGMENTSTART<tag>NUMBER<tag>BLANK
        • TAB<tag>NUMBER<tag>SEGMENTEND
        • LINEBREAK<tag>NUMBERSEGMENTEND
        • TEXT<tag>NUMBER<tag>TEXT → WOULD NOT BE TAGGED!
        • etc.
      • Anything matched by a regular expression (with a few technical limitations that might need to be defined) under the same conditions as described above for integers.
        • These regular expressions can be used to tag all kinds of numbers, such as decimals, numbers with thousands separators, or date formats.
        • Such formatted numbers often must be adjusted between different languages. To achieve this automatically:
          • numbers must be found in a language-specific way: by a search regular expression defined for each number type (decimal / date) and each language
          • to finally format the number value found this way in the format of a different language, the output format strings / characters are configured for each language:
            • the decimal separator
            • the thousand separator
            • valid date format patterns (here not only the separator, since the order of day / month may change)
          • More regular expressions to match more patterns can be added at any time by the admin.
      • If no regular expressions are defined for a sub-language but there are some for its corresponding main language, those defined for the main language are used.
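
      The integer rule can be sketched as follows: tags are removed for the evaluation only, and an integer counts if it is then bounded by whitespace or by the segment start / end. The tag pattern `<[^>]+>` is a deliberate simplification and the function name `protectable_integers` is hypothetical; a real implementation would also have to map the matches back to their positions in the tagged segment.

```python
import re

ANY_TAG = re.compile(r"<[^>]+>")             # simplified notion of "a tag"
INTEGER = re.compile(r"(?<!\S)\d+(?!\S)")    # digits not touching non-whitespace


def protectable_integers(segment: str) -> list[str]:
    """Return the integers that would be protected, ignoring tags around them."""
    plain = ANY_TAG.sub("", segment)
    return [match.group(0) for match in INTEGER.finditer(plain)]


print(protectable_integers("<b>12</b> apples"))  # SEGMENTSTART<tag>NUMBER<tag>BLANK
print(protectable_integers("abc<b>12</b>def"))   # TEXT<tag>NUMBER<tag>TEXT
```

      The first call reports the 12, the second reports nothing, because after removing the tags the digits are directly attached to text.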

      Language administration in the GUI

      To be able to administer the configuration of regular expressions and formatting described above, the language administration of translate5 must be reflected in the GUI.

      For each language, it should be possible to set the regular expressions described above for each number type, together with the output formats.
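
      A minimal sketch of how such per-language output formats could be used, including the fallback from a sub-language to its main language. The table `FORMATS`, the language codes (assumed to be of the form "de" / "de-AT") and the function names are hypothetical and only illustrate the idea; they are not the translate5 data model.

```python
# Hypothetical per-language output formats (illustrative values only).
FORMATS = {
    "de": {"decimal": ",", "thousands": "."},
    "en": {"decimal": ".", "thousands": ","},
}


def format_for(language: str) -> dict:
    """Use the sub-language entry if present, else the main language entry."""
    if language in FORMATS:
        return FORMATS[language]
    main_language = language.split("-")[0]
    return FORMATS[main_language]


def convert_number(number: str, source_lang: str, target_lang: str) -> str:
    """Re-write the separators of a formatted number for the target language."""
    source, target = format_for(source_lang), format_for(target_lang)
    integer_part, _, fraction = number.partition(source["decimal"])
    groups = integer_part.split(source["thousands"])
    result = target["thousands"].join(groups)
    if fraction:
        result += target["decimal"] + fraction
    return result


print(convert_number("1.234.567,89", "de", "en"))  # 1,234,567.89
```

      Because "en-GB" and "de-AT" have no entries of their own in this sketch, `convert_number("1,234.5", "en-GB", "de-AT")` falls back to "en" and "de".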

      It may be easier to define multiple date types:

      • full_date (dd.mm.YYYY) with its own regex for identification and the corresponding output format string (regex similar to: /[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4}/ output format: dd.mm.YYYY)
      • short_date (dd.mm.YY) with (regex similar to: /[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{2}/ output format: dd.mm.YY)
      • This might mean more effort in the initial configuration, but it is easier in the final implementation, since no automatic detection of whether YY or YYYY should be used in the output is needed.
      • With that approach, users will also be able to add more formats (datetime, time for example) on their own.
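
      The two date types could be modelled as pairs of a recognition regex and a parse format, as in the sketch below. It assumes that the format strings dd.mm.YYYY / dd.mm.YY are mapped to Python's strptime directives; the table `DATE_TYPES` and the function `convert_date` are hypothetical names.

```python
import re
from datetime import datetime

# Date type -> (recognition regex, parse format). Names follow the ticket,
# the strptime directives are this sketch's mapping of dd.mm.YYYY / dd.mm.YY.
DATE_TYPES = {
    "full_date": (re.compile(r"[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4}"), "%d.%m.%Y"),
    "short_date": (re.compile(r"[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{2}"), "%d.%m.%y"),
}


def convert_date(value: str, date_type: str, output_format: str) -> str:
    """Parse value as the given date type and render it in output_format."""
    pattern, parse_format = DATE_TYPES[date_type]
    if pattern.fullmatch(value) is None:
        raise ValueError(f"{value!r} is not a {date_type}")
    return datetime.strptime(value, parse_format).strftime(output_format)


print(convert_date("15.09.2023", "full_date", "%Y-%m-%d"))  # 2023-09-15
```

      Since each type fixes the year width, the output side never has to guess between a two-digit and a four-digit year.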

      These settings for now will NOT be overwritable on client level.

       

      Prefilling of the language administration

      The language administration will be pre-filled with all languages currently available in translate5.

      Please use this page as the source of information on which decimal and thousand separators are used in different languages and countries:

      https://en.wikipedia.org/wiki/Decimal_separator

      And this for the date:

      https://en.wikipedia.org/wiki/Date_format_by_country

      Place of conversion in the translation process

      The conversion should be handled in the GUI, when a translator takes over a tag. This makes it possible to have the right tooltip on the tag.

      It also takes place in the back-end, when the pre-translation takes over matches.
       
       
       

       


            People

              oleksandrmikhliaiev Oleksandr Mikhliaiev
              marcmittag Marc Mittag [Administrator]
              Thomas Lauria