  1. translate5
  2. TRANSLATE-4693

Clean-up and complete srx rules


    • Type: Improvement
    • Resolution: Unresolved
    • Okapi integration
    • Critical
    • IMPORTANT: The SRX rules have been refreshed from the Okapi repository. If you have customized your SRX rules and use your own in translate5, we recommend merging them with the new translate5 default rules, so that you pick up what is new in the translate5 defaults and keep the important parts of your own changes and additions.

      Problem

      Many languages are not yet supported by the segmentation rules in translate5.

      We also need to do some clean-up; see the developer comment.

      Solution

      Okapi now supports many more languages. See:

      https://bitbucket.org/okapiframework/srx-repository/src/master/srx-common/src/main/resources/

      https://gitlab.com/okapiframework/srx-repository/

      We should add the languages from there that our SRX rules do not support yet.
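
A minimal sketch of how the missing languages could be listed, assuming both files are plain SRX 2.0 documents and that rule sets can be compared by their `languagerulename` attribute. The function names `rule_names` and `missing_languages` are hypothetical, and the names used in the two files may not line up one to one, so the result is a starting point for a manual check.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.lisa.org/srx20}'."""
    return tag.rsplit("}", 1)[-1]


def rule_names(srx_text: str) -> list[str]:
    """Return the languagerulename of every <languagerule>, in file order."""
    root = ET.fromstring(srx_text)
    return [
        el.get("languagerulename", "")
        for el in root.iter()
        if local(el.tag) == "languagerule"
    ]


def missing_languages(ours: str, okapi: str) -> list[str]:
    """Rule sets that exist in the Okapi file but not in ours."""
    have = set(rule_names(ours))
    return [name for name in rule_names(okapi) if name not in have]
```

Called with the contents of both files, `missing_languages(ours, okapi)` returns the Okapi rule set names that have no counterpart in our file.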

      For the languages we already support, we should diff our rules against Okapi's and add any rules Okapi has that we lack. At the same time, the modifications we made for those languages must be preserved. Rule order matters, because the rules are applied from top to bottom, so a classical diff merge is needed for those languages. In case of doubt our own rules take precedence, i.e. they are placed closer to the top.
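
The merge described above could be sketched as follows. This is only an illustration under stated assumptions: each rule is reduced to a hypothetical `(break, beforebreak, afterbreak)` tuple, `merge_rules` is an invented name, and Python's `difflib` stands in for whatever diff tool is actually used. Where the two lists differ, our rules are emitted before Okapi's, so they end up closer to the top.

```python
from difflib import SequenceMatcher

# Hypothetical representation of one SRX rule: (break, beforebreak, afterbreak)
Rule = tuple[str, str, str]


def merge_rules(ours: list[Rule], okapi: list[Rule]) -> list[Rule]:
    """Order-preserving merge of two rule lists; ours wins on conflicts."""
    merged: list[Rule] = []
    seen: set[Rule] = set()

    def emit(rules: list[Rule]) -> None:
        for rule in rules:
            if rule not in seen:  # a rule that only moved must not be duplicated
                seen.add(rule)
                merged.append(rule)

    matcher = SequenceMatcher(a=ours, b=okapi, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            emit(ours[i1:i2])
        else:
            # 'replace', 'delete', 'insert': our block first, then Okapi's
            emit(ours[i1:i2])
            emit(okapi[j1:j2])
    return merged
```

The result still needs a manual review per language, since a rule that is textually different but functionally overlapping cannot be detected this way.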

      Please carefully coordinate with Axel in case of having to merge rulesets for the same language.

      Please add a note to the file header stating that the file is licensed under Apache 2.0 and that most of its content stems from the Okapi project, with a link to the Okapi Bitbucket repository. Copy a typical Apache 2.0 license file header into the file and have it link to the original Apache 2.0 license text on the web.
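
One way such a header could be added is sketched below. The license paragraph is the standard Apache 2.0 boilerplate notice; the attribution sentence, the constant `APACHE_HEADER` and the function `add_header` are assumptions for illustration, and any copyright line would still have to be supplied by the project.

```python
APACHE_HEADER = """<!--
  Most of the rules in this file stem from the Okapi Framework SRX repository:
  https://bitbucket.org/okapiframework/srx-repository/src/master/srx-common/src/main/resources/

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
"""


def add_header(srx_text: str) -> str:
    """Insert the header comment after the XML declaration, once."""
    if "www.apache.org/licenses/LICENSE-2.0" in srx_text:
        return srx_text  # header (or another Apache notice) already present
    if srx_text.startswith("<?xml"):
        end = srx_text.index("?>") + 2
        return srx_text[:end] + "\n" + APACHE_HEADER + srx_text[end:].lstrip("\n")
    return APACHE_HEADER + srx_text
```

The comment is placed after the XML declaration because the declaration has to remain the very first thing in the file.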

      Quick Notes:

      important links:
      https://de.wikipedia.org/wiki/Liste_der_Unicode-Eigenschaften#Allgemein

      RegEx rules for SRX.
      https://unicode-org.github.io/icu/userguide/strings/regexp.html#regular-expressions

      To detect numbers, the following "unicode-selector" should be used:
      \p{Nd}

      and NOT this:
      [0-9]
      Otherwise a defined "No-Break" rule may be overridden by an existing "Break" rule, because Unicode selectors have a higher "specificity" than other selectors.
      This "specificity" is comparable to CSS selectors, where id-based beats class-based, which in turn beats tag-based.
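
Separately from the priority issue described above, the two selectors also differ in what they match. The sketch below illustrates this. Python's standard `re` module does not support `\p{Nd}`, so the Unicode general category is checked through `unicodedata` instead; `is_nd` is a hypothetical helper name.

```python
import re
import unicodedata

ASCII_DIGIT = re.compile(r"[0-9]")


def is_nd(ch: str) -> bool:
    # True if ch belongs to Unicode general category Nd (decimal digit),
    # which is the set that the Nd property selector covers.
    return unicodedata.category(ch) == "Nd"


# ASCII seven, Arabic-Indic three, fullwidth seven
samples = ["7", "\u0663", "\uff17"]
for ch in samples:
    print(repr(ch), bool(ASCII_DIGIT.fullmatch(ch)), is_nd(ch))
```

All three sample characters are decimal digits in the Unicode sense, but only the first one is matched by `[0-9]`.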

      As with the selector for numbers, the following character selectors should also be replaced:
      [a-z] [A-Z]

      The German rules in particular contain many of these "outdated" rules.
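
A rough sketch of how such outdated selectors could be rewritten in bulk. The ticket only names `\p{Nd}` as the replacement for digits; using `\p{Ll}` and `\p{Lu}` (the Unicode general categories for lowercase and uppercase letters) for `[a-z]` and `[A-Z]` is an assumption here, and `modernize` is a hypothetical name. Only standalone classes are touched; combined classes such as `[a-zA-Z]` are left alone and need manual review.

```python
# Assumed mapping: \p{Nd} comes from the ticket, \p{Ll} and \p{Lu} are the
# analogous Unicode general categories for lowercase and uppercase letters.
REPLACEMENTS = {
    "[0-9]": r"\p{Nd}",
    "[a-z]": r"\p{Ll}",
    "[A-Z]": r"\p{Lu}",
}


def modernize(pattern: str) -> str:
    """Replace standalone ASCII character classes with Unicode properties."""
    for old, new in REPLACEMENTS.items():
        pattern = pattern.replace(old, new)
    return pattern


print(modernize(r"[0-9]\.\s[A-Z]"))
```

Each rewritten rule should be re-tested afterwards, because the Unicode properties match more characters than the ASCII ranges did.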

      some ideas

      no default srx rulesets.

      We had one customer with very special translations. For them it would be easier to have NO default SRX rules at all (apart perhaps from a few really necessary ones, like the t5_xyz) and to define only a small number of rules themselves.
      So maybe it should be possible to have SRX rule sets without any default rules, for example via a checkbox to deactivate them.

            Stephan Stephan Bergmann
            marcmittag Marc Mittag [Administrator]
            Axel Becher