Type: Improvement
Resolution: Unresolved
Fix Version/s: None
Affects Version/s: None
Component/s: Okapi integration

Urgency:
Critical
Important release notes:

IMPORTANT: The SRX rules have been refreshed from the Okapi repository. If you have customized your SRX rules and use your own in translate5, we recommend merging them with the new translate5 default rules, so that you pick up what is new in the translate5 defaults while keeping the important parts of your own changes and additions.
Checklist:

Empty


Problem

So far, translate5 has no segmentation rules for many languages.

We also need to do some clean-up; see the developer comment.

Solution

Okapi now supports many more languages. Please see

~~https://bitbucket.org/okapiframework/srx-repository/src/master/srx-common/src/main/resources/~~

https://gitlab.com/okapiframework/srx-repository/

We should add to our SRX rules the languages from there that we do not support yet.

For the languages we already support, we should also check via diff whether Okapi has added rules that we do not have yet, and add them. At the same time, we must ensure that the modifications we made for those languages are preserved. The order of the rules matters as well: they are applied from top to bottom. So for those languages a classical diff merge should be done. In case of doubt, our content takes precedence and should therefore be placed closer to the top.
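As a starting point for such a merge, the following is a minimal sketch, assuming both files are SRX 2.0 documents in the `http://www.lisa.org/srx20` namespace. The helper names (`rule_key`, `rules_by_language`, `merge_language_rules`) are hypothetical and not part of translate5 or Okapi. It keeps our rules on top and appends upstream rules we do not have yet. It is not a real diff merge: because rules apply from top to bottom, an appended rule may be shadowed by an earlier one, so the result still needs a manual review.

```python
import xml.etree.ElementTree as ET

# SRX 2.0 namespace (assumption: both rule files use it)
NS = {"srx": "http://www.lisa.org/srx20"}


def rule_key(rule):
    """Identify a rule by its break flag and its before/after patterns."""
    before = rule.findtext("srx:beforebreak", default="", namespaces=NS)
    after = rule.findtext("srx:afterbreak", default="", namespaces=NS)
    # A missing break attribute is treated as "yes"
    return (rule.get("break", "yes"), before, after)


def rules_by_language(root):
    """Map each languagerulename to its list of <rule> elements, in file order."""
    return {
        lang.get("languagerulename"): lang.findall("srx:rule", NS)
        for lang in root.iterfind(".//srx:languagerule", NS)
    }


def merge_language_rules(ours, theirs):
    """Return our rules first, followed by upstream rules we do not have yet."""
    seen = {rule_key(rule) for rule in ours}
    merged = list(ours)
    for rule in theirs:
        key = rule_key(rule)
        if key not in seen:
            seen.add(key)
            merged.append(rule)
    return merged
```

Calling `merge_language_rules(ours_rules, okapi_rules)` for one language returns a list whose tail is exactly the set of upstream additions, which makes those candidates easy to review and reposition by hand.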

Please coordinate carefully with Axel whenever rulesets for the same language have to be merged.

Please place a note in the header of the file stating that the file is under the Apache 2.0 license and that most of its content stems from the Okapi project, with a link to the Okapi bitbucket. Copy a typical Apache 2.0 license file header into the file and link from it to the original Apache 2.0 license on the web.

Quick Notes:

important links:
https://de.wikipedia.org/wiki/Liste_der_Unicode-Eigenschaften#Allgemein

RegEx rules for SRX.
https://unicode-org.github.io/icu/userguide/strings/regexp.html#regular-expressions

To detect numbers, the following Unicode selector should be used:
\p{Nd}

and NOT this:
[0-9]
Otherwise a defined "no-break" rule may be overridden by an existing "break" rule, because Unicode selectors have a higher "specificity" than the other ones.
This "specificity" can be compared to CSS selectors, where id-based is higher than class-based, which is higher than tag-based.

As with the selector for numbers, the following character selectors should also be replaced:
[a-z] [A-Z]

The German rules in particular contain a lot of these outdated rules.
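To locate and rewrite such outdated selectors in rule patterns, a minimal sketch could look like the following. Only the `[0-9]` to `\p{Nd}` mapping comes from the notes above. Mapping `[a-z]` to `\p{Ll}` and `[A-Z]` to `\p{Lu}` is an assumption (they are the standard Unicode properties for lowercase and uppercase letters), and the function names are hypothetical. The sketch only handles these exact standalone classes; combined classes such as `[a-zA-Z]` are deliberately left alone for manual review.

```python
# ASCII-only classes and their Unicode property replacements.
# \p{Nd} is taken from the ticket; \p{Ll} and \p{Lu} are assumed equivalents.
REPLACEMENTS = {
    "[0-9]": r"\p{Nd}",
    "[a-z]": r"\p{Ll}",
    "[A-Z]": r"\p{Lu}",
}


def find_outdated_classes(pattern):
    """List the ASCII-only classes that occur literally in a rule pattern."""
    return [cls for cls in REPLACEMENTS if cls in pattern]


def modernize(pattern):
    """Replace each standalone ASCII-only class by its Unicode selector."""
    for cls, prop in REPLACEMENTS.items():
        pattern = pattern.replace(cls, prop)
    return pattern
```

Running `find_outdated_classes` over every `beforebreak` and `afterbreak` pattern first gives a list of rules to check before any pattern is actually rewritten.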

some ideas

no default srx rulesets.

We had one customer with very special translations. For them it would be easier to have NO SRX rules at all (apart from perhaps a few really needed ones, like the t5_xyz) and to define only a handful of rules.
So maybe it should be possible to have SRX rulesets without any default rules, perhaps via a checkbox to deactivate them.

Assignee: Stephan Bergmann
Reporter: Marc Mittag [Administrator]
Peer developer: Axel Becher
Votes: 0
Watchers: 2

Created: 02/Jun/2025 06:11
Updated: 09/Oct/2025 09:36
