Type: New Feature
Resolution: Fixed
Fix Version/s: translate5 - 6.9.0b, translate5 - 7.4.0
Affects Version/s: None
Component/s: Configuration, Import/Export

Urgency:
Medium
ChangeLog Description:
Numbers are protected with tags for all translation jobs. Custom patterns for number protection can be defined in a separate UI.

General idea

Numbers should be protected during the import, if the import job is a translation job (all target units empty).

This means numbers should be protected as internal tags under the following conditions:

Separately from other tags, it should be configurable whether numbers protected as tags are pulled out of the segment when they are positioned at the start or end of a segment.
The number protected as a tag should be visible in the tooltip of the tag and in the full tag view.
Numbers that change their format in another language must be auto-adjusted.
Numbers added by the user are not protected as a tag.

1. Create 3 tabs: Number recognition, Input mapping, Output mapping
In the output section only lang, type, name and format are provided as columns.
One lang per row in input and output (edited)

2. For recognition, the source language should be used for regex filtering
3. It should be possible to disable regex rules
4. Flag for default rules. Such rules cannot be changed or deleted (edited)
5. Toggle button to show/hide default rules
6. For output formats, only already existing names can be set
7. Rethink the name of the format column: parsing format, source format, output format, target format..?
8. Create a Confluence page with instructions for users
9. Replace the icon in the tokens and number pages with a question icon

Effect on the translation and the translation memory and MT

Segments that contain only numbers, whitespace and other tags should not be exposed for translation; the numbers should be converted automatically. They should not be part of the segment table (edited: no, number-only segments should be included for context)
Effect on TM:
- If a segment with a translated number is saved to the TM, only a PH tag is saved to the TM. That tag should contain an identifier that marks it as a number tag. The raw number should also be contained in the tag, so that on TMX export the raw number is exported instead of the tag. This needs to be supported by t5memory, which it currently is not.
- When the TM is searched for matches (fuzzy search or analysis), the segment that is sent to the TM should also contain only a PH tag with the number identifier instead of the number. This way matches are found independently of differences in the numbers.
- If a match from the TM is taken over into a segment (during pre-translation or manually by the translator), the tag that protects the number in the source is automatically taken over into the target segment at the position of the empty PH tag from the TM.
- TMX import: when a TMX is imported, translate5 has to pre-parse the TMX and replace plain numbers with the placeholders for t5memory (edit: according to all active input mappings for that language)
- The exact syntax of the number tag that is sent to t5memory and the communication between translate5 and t5memory are documented in the linked https://jira.translate5.net/browse/T5TMS-174
  It needs to be implemented for
  - the update request that saves/updates a segment
  - the fuzzy request that searches for fuzzy matches
  - TMX import and export, when parsing the TMX
  - concordance search, when the search request contains a number and when the answer of t5memory contains number tags (analogous to TMX parsing for import and export: the number is converted to a number tag when the search request is sent to t5memory, and it is unprotected again when it comes back in the answer from t5memory)
  - the regex in the number tag (see linked T5TMS-174) is encoded as base64(gzdeflate(regex)).
  - we also provide an id for each number tag for ordering purposes
Effect on MT: For the MT it is important to receive the whole number without protection. So when a text is sent to the MT, the number protection should be removed, but a tag pair that still carries the number protection should be placed around the number. When the translated text comes back from the MT, the protected number should be recreated from that tag pair, and the number between the tags should be removed.
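
The MT round trip described above could be sketched as follows. This is only an illustration: the ticket does not define the tag pair, so the <x-num> element name and its attributes are hypothetical, and the sketch assumes attribute values contain no double quotes.

```python
import re

# Self-closing number tag as described in the "Tag format" section.
N_TAG = re.compile(r'<t5:n id="(\d+)" n="([^"]*)" r="([^"]*)"/>')

# Hypothetical paired tag used only while the text is at the MT engine.
MT_PAIR = re.compile(r'<x-num id="(\d+)" n="([^"]*)" r="([^"]*)">.*?</x-num>')


def expose_for_mt(text: str) -> str:
    """Replace each protected number with a tag pair around the plain number."""
    return N_TAG.sub(
        lambda m: f'<x-num id="{m[1]}" n="{m[2]}" r="{m[3]}">{m[2]}</x-num>',
        text,
    )


def reprotect_after_mt(mt_text: str) -> str:
    """Rebuild the number tag from the pair's attributes and drop whatever
    number the MT engine left between the tags."""
    return MT_PAIR.sub(
        lambda m: f'<t5:n id="{m[1]}" n="{m[2]}" r="{m[3]}"/>',
        mt_text,
    )
```

Because the pair carries the same id, n and r values as the original tag, the number tag can be restored without keeping any state between the two calls.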

Tag format

<t5:n id=":int:" n=":string:" r=":string:"/>

The r attribute holds the regex that was used to find the number in the text. It is encoded as

base64(gzdeflate(regex))

The n attribute holds the number that was found in the source text. Its value must be a valid XML attribute value.

Example:

<t5:n id="2" n="2023-09-15" r="049JikmpNqnV1TCINtS1jK0xjDbQNYqtAXM04aJA0igWKANkGgMpQ5iCmCR9AA=="/>
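
For illustration, the encoding of the r attribute can be reproduced outside of PHP. The sketch below is in Python and assumes that gzdeflate produces a raw DEFLATE stream (no zlib header), which corresponds to a negative wbits value in Python's zlib module. The function names are made up for this example. The compressed bytes can differ between implementations and compression levels, so the encoded string need not be byte-identical to the one translate5 generates, but either side should be able to decode the other's value.

```python
import base64
import zlib
from xml.sax.saxutils import quoteattr


def encode_regex(regex: str) -> str:
    """base64(gzdeflate(regex)): raw DEFLATE, then Base64."""
    compressor = zlib.compressobj(wbits=-15)  # -15 = raw DEFLATE stream
    raw = compressor.compress(regex.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")


def decode_regex(encoded: str) -> str:
    """Inverse of encode_regex."""
    raw = base64.b64decode(encoded)
    return zlib.decompress(raw, wbits=-15).decode("utf-8")


def build_number_tag(tag_id: int, number: str, regex: str) -> str:
    """Assemble a number tag; quoteattr escapes the number so that it is a
    valid XML attribute value and adds the surrounding quotes."""
    return f'<t5:n id="{tag_id}" n={quoteattr(number)} r="{encode_regex(regex)}"/>'
```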

source in translate5

He lived there from 19-08-1965 till 23-09-1987

source in translate5 with n-tag

He lived there from <t5:n id="1" n="19-08-1965" r="regexA"/> till <t5:n id="2" n="23-09-1987" r="regexA"/>

source from t5memory

He lived there from <t5:n id="23" n="12-05-1907" r="regexA"/> till <t5:n id="7" n="11-05-1987" r="regexA"/>

target from t5memory

Er lebte von <t5:n id="7" n="11-05-1987" r="regexA"/> bis <t5:n id="23" n="12-05-1907" r="regexA"/>

target in translate5 with n-tag

Er lebte von <t5:n id="2" n="23-09-1987" r="regexA"/> bis <t5:n id="1" n="19-08-1965" r="regexA"/>
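
The example above suggests that the tags of the TM source are paired with the tags of the translate5 source by position, and that the ids are then used to place the translate5 tags in the TM target. A minimal sketch of that reading, with a hypothetical function name and the assumption that attribute values contain no double quotes:

```python
import re

N_TAG = re.compile(r'<t5:n id="(\d+)" n="([^"]*)" r="([^"]*)"/>')


def take_over_match(segment_source: str, tm_source: str, tm_target: str) -> str:
    """Put the number tags of the current segment into a TM target."""
    own_tags = [m.group(0) for m in N_TAG.finditer(segment_source)]
    tm_ids = [m.group(1) for m in N_TAG.finditer(tm_source)]
    if len(own_tags) != len(tm_ids):
        raise ValueError("segment and TM source contain different tag counts")
    # The k-th tag of the TM source corresponds to the k-th tag of our source.
    by_tm_id = dict(zip(tm_ids, own_tags))
    # Swap every tag in the TM target for our tag with the matching TM id.
    return N_TAG.sub(lambda m: by_tm_id[m.group(1)], tm_target)
```

Applied to the three strings from the example, this yields the "target in translate5 with n-tag" line shown above.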

What is a number

Any integer that is surrounded only by whitespace. Other tags should be ignored for this evaluation, meaning the following should be treated as a number to protect:
- SEGMENTSTART<tag>NUMBER<tag>BLANK
- TAB<tag>NUMBER<tag>SEGMENTEND
- LINEBREAK<tag>NUMBERSEGMENTEND
- TEXT<tag>NUMBER<tag>TEXT → would NOT be tagged!
- etc.
Any regular expression (with a few technical limitations that may still need to be defined) that matches something under the same boundary conditions as described above for integers.
- These regular expressions can be used to tag all kinds of numbers, such as decimals, numbers with thousands separators or date formats
- Such formatted numbers often must be adjusted between languages. To achieve this automatically:
  - numbers must be found language-specifically: by a search regular expression defined for each number type (decimal / date) and each language
  - to format the number value found this way for another language, the output format strings / characters are configured for each language:
    - the decimal separator
    - the thousand separator
    - valid date format patterns (here not only the separator, since the order of day / month may change)
  - More regular expressions to match more patterns can be added at any time by the admin.
If no regular expressions are defined for a sub-language but some are defined for its main language, those of the main language are used.
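
The boundary rule for plain integers could be expressed as a single regular expression. The sketch below is an assumption about how such a rule might look, not the pattern shipped with translate5. It treats anything of the form <...> as a tag and assumes tags contain no whitespace-delimited digits of their own.

```python
import re

# An integer counts only if, ignoring adjacent tags, it is bounded by
# whitespace or by the start / end of the segment on both sides.
INTEGER = re.compile(r'(?:^|(?<=\s))(?:<[^>]*>)*(\d+)(?:<[^>]*>)*(?=\s|$)')


def find_protectable_integers(segment: str) -> list[str]:
    """Return all integers in the segment that satisfy the boundary rule."""
    return [m.group(1) for m in INTEGER.finditer(segment)]
```

With this pattern, "text<tag/>5<tag/>text" yields no match, and a date such as 19-08-1965 is not picked up either, which is why dates and decimals need their own regular expressions.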

Language administration in the GUI

To administer the configuration of regular expressions and formatting described above, the language administration of translate5 must be reflected in the GUI.

For each language it should be possible to set the regular expressions described above for each type, together with the output formats.
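
As a rough sketch of what such a per-language configuration has to support, the following example converts a decimal number between two languages and falls back from a sub-language to its main language. The data structure, the function names and the sample entries are illustrative assumptions, not the translate5 data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumberFormat:
    decimal: str     # decimal separator
    thousands: str   # thousand separator, may be empty


# Sample entries only; the real values come from the language administration.
FORMATS = {
    "en": NumberFormat(decimal=".", thousands=","),
    "de": NumberFormat(decimal=",", thousands="."),
}


def format_for(lang: str) -> NumberFormat:
    """Use the sub-language entry if present, else the main language."""
    if lang in FORMATS:
        return FORMATS[lang]
    return FORMATS[lang.split("-")[0]]  # KeyError if nothing is configured


def convert_number(number: str, source_lang: str, target_lang: str) -> str:
    """Re-format a number found in source_lang for target_lang."""
    src, tgt = format_for(source_lang), format_for(target_lang)
    integer_part, sep, fraction = number.partition(src.decimal)
    if src.thousands:
        integer_part = integer_part.replace(src.thousands, tgt.thousands)
    return integer_part + (tgt.decimal + fraction if sep else "")
```

For example, convert_number("1,234.56", "en", "de-AT") returns "1.234,56", because no entry exists for de-AT and the entry for de is used instead.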

Maybe it will be easier to define multiple date types:

full_date (dd.mm.YYYY) with its own regex for identification and the corresponding output format string (regex similar to: /[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4}/ output format: dd.mm.YYYY)

short_date (dd.mm.YY) with (regex similar to: /[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{2}/ output format: dd.mm.YY)

This might be more effort in the initial configuration, but easier in the final implementation, since no automatic recognition of whether YY or YYYY should be used in the output is needed.
With that approach the user will also be able to add more formats (datetime, time for example) on their own.
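
A small sketch of how separate date types keep the conversion simple: each type pairs one recognition regex with one format string, so the matching rule already determines whether the year has two or four digits. The rule table, the token mapping to Python's strptime directives and the target formats are assumptions made for this example.

```python
import re
from datetime import datetime

# name -> (recognition regex, source format string), as proposed above
SOURCE_RULES = {
    "full_date": (r"[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4}", "dd.mm.YYYY"),
    "short_date": (r"[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{2}", "dd.mm.YY"),
}

# name -> output format string of a hypothetical target language
TARGET_FORMATS = {"full_date": "mm/dd/YYYY", "short_date": "mm/dd/YY"}


def to_strptime(fmt: str) -> str:
    """Translate dd / mm / YY / YYYY tokens into strptime directives."""
    return (fmt.replace("YYYY", "%Y").replace("YY", "%y")
               .replace("dd", "%d").replace("mm", "%m"))


def convert_date(value: str) -> str:
    """Find the date type that matches and re-format with its output format."""
    for name, (regex, source_format) in SOURCE_RULES.items():
        if re.fullmatch(regex, value):
            parsed = datetime.strptime(value, to_strptime(source_format))
            return parsed.strftime(to_strptime(TARGET_FORMATS[name]))
    raise ValueError(f"no date rule matches {value!r}")
```

Adding a datetime or time type would then only mean adding one more entry to each table.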

These settings for now will NOT be overwritable on client level.

Prefilling of the language administration

The language administration will be pre-filled with all languages currently available in translate5.

Please use this page as the source of information on which decimal and thousand separators are used in different languages and countries:

https://en.wikipedia.org/wiki/Decimal_separator

And this for the date:

https://en.wikipedia.org/wiki/Date_format_by_country

Place of conversion in the translation process

The conversion should be handled in the GUI, when a translator takes over a tag. This makes it possible to have the right tooltip on the tag.

It also takes place in the back-end, when the pre-translation takes over matches.

blocks

TRANSLATE-3723 Validate usage and content of toSort fields

Selected for dev

relates to

TRANSLATE-2784 RTL languages may lead to wrong tag order

Open

1.

Check and rework tag-only segment handling in import

Open

Sanya Mikhliaiev
