• High
    • Enable tag handling configuration for each resource. Introduce a new XML tag handler with tag repair functionality.

      Problem

      Currently, the system interacts with multiple translation resources (DeepL, OpenAI, Google, Microsoft, etc.) for translating content. Each resource can process tags, but they have different methods and options for handling tag processing. Additionally, there is no uniform way to configure how tags are processed or repaired across resources.

      Tasks:

      Create a configurable system for tag processing and repair that allows:

      1. Defining the type of tags (HTML or XLIFF) sent to each translation resource.
      2. Configuring whether the tag repair functionality is applied post-translation on the backend.
      3. Aligning the tag repair functionality with the type of tags sent to resources.
      4. Documenting in Confluence how each MT/LLM resource currently handles tags with translate5.

      Implementation ideas:

      1. Tag Repair Functionality:
        • Introduce a tag repair mechanism for XLIFF tags.
        • Evaluate whether to:
          • Develop a single, unified tag repair class to handle both HTML and XLIFF tags.
          • Create separate tag repair classes for HTML and XLIFF tags.
        • Evaluate how the current tag repair for DeepL works. In Marc's understanding, it makes sure that
          • no tag is missing
          • tags are syntactically correct
          • if a tag has to be inserted or moved, it is placed in a position similar to the one it had in the source segment (e.g. after the same number of blocks of word characters and non-word characters). If that logic already exists, keep it.
      2. Resource-Specific Configuration:
        • Allow configuration for each resource to specify:
          • The type of tags it processes (HTML or XLIFF).
          • Whether tag repair should be enabled or disabled.
        • Ensure tag repair type aligns with the tag type sent to the resource (e.g., if XLIFF tags are sent, only XLIFF tag repair should be applied).
      3. Validation Logic:
        • Implement validation to prevent mismatches between tag type and tag repair functionality. For example:
          • If XLIFF tags are sent, ensure HTML repair is not attempted.
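
      The validation rule above can be sketched as follows. This is a minimal illustration in Python, not translate5 code: the class `ResourceTagConfig`, its field names, and the `TagType` enum are hypothetical and only show how a mismatch between the tag type sent and the repair type applied could be rejected at configuration time.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TagType(Enum):
    HTML = "html"
    XLIFF = "xliff"


@dataclass(frozen=True)
class ResourceTagConfig:
    """Hypothetical per-resource settings; all names are illustrative only."""
    resource: str
    tag_type: TagType                      # type of tags sent to the resource
    repair_enabled: bool                   # whether repair runs after translation
    repair_type: Optional[TagType] = None  # which repair implementation to apply

    def validate(self) -> None:
        """Raise ValueError if the repair type does not match the tag type sent."""
        if not self.repair_enabled:
            return  # nothing to align when repair is switched off
        if self.repair_type is None:
            raise ValueError(f"{self.resource}: repair is enabled but no repair type is set")
        if self.repair_type is not self.tag_type:
            raise ValueError(
                f"{self.resource}: {self.tag_type.value} tags are sent, "
                f"so {self.repair_type.value} repair must not be applied"
            )
```

      Running `validate()` when a configuration is saved, rather than at translation time, would surface a mismatch before any request is sent.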

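      The first two guarantees listed under Tag Repair Functionality (no tag is missing, tags are syntactically correct) can be checked with a few lines of Python. This is a sketch under assumptions: the regular expression only covers simple attribute-free tags such as `<t5x_123>`, `</t5x_123>` and `<t5x_456 />`, the function names are hypothetical, and the position-based reinsertion logic is deliberately not covered.

```python
import re
from collections import Counter
from typing import List

# Matches attribute-free opening, closing and self-closing tags.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")


def normalized_tags(segment: str) -> List[str]:
    """Return all tags in a segment with whitespace inside the tag removed."""
    return [f"<{closing}{name}{single}>" for closing, name, single in TAG_RE.findall(segment)]


def missing_tags(source: str, target: str) -> List[str]:
    """Return the tags that occur in the source but are absent from the target."""
    diff = Counter(normalized_tags(source)) - Counter(normalized_tags(target))
    return sorted(diff.elements())


def is_well_formed(segment: str) -> bool:
    """Check that every opening tag is closed in the correct nesting order."""
    stack: List[str] = []
    for closing, name, single in TAG_RE.findall(segment):
        if closing and single:
            return False  # something like </x /> is never valid
        if single:
            continue      # self-closing tags do not affect nesting
        if closing:
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack
```

      A repair step could call `missing_tags` first and only attempt reinsertion for the tags it reports.
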
      Suggested presets

      In general, it is always a good idea to set sendWhitespaceAsTag to active.
      For more details, see Test-Details.txt

      DeepL:

      • runtimeOptions.LanguageResources.deepl.sendWhitespaceAsTag: "active" (if disabled, whitespace can get lost)
      • runtimeOptions.plugins.DeepL.api.parametars.tagHandling: "none" (if that means that the parameter is not sent at all)
      • runtimeOptions.LanguageResources.deepl.tagHandler: "xliff_paired_tags"
        Sample: <t5x_123>abc</t5x_123> or <t5x_456 />
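
      To illustrate why sending whitespace as tags keeps it from getting lost, here is a minimal Python sketch. It is an assumption about the general idea, not the translate5 implementation: the functions `protect_whitespace` and `restore_whitespace` are hypothetical, and the placeholder format merely reuses the self-closing `<t5x_… />` form from the sample above.

```python
import re
from typing import Dict, Tuple

# Line breaks, tabs and runs of two or more spaces are treated as protectable whitespace.
WHITESPACE_RE = re.compile(r"\r\n|[\n\t\r]| {2,}")


def protect_whitespace(text: str, start_id: int = 1) -> Tuple[str, Dict[str, str]]:
    """Replace special whitespace with self-closing placeholder tags."""
    mapping: Dict[str, str] = {}
    counter = start_id

    def replace(match: "re.Match[str]") -> str:
        nonlocal counter
        tag = f"<t5x_{counter} />"
        mapping[tag] = match.group(0)  # remember the original whitespace
        counter += 1
        return tag

    protected = WHITESPACE_RE.sub(replace, text)
    return protected, mapping


def restore_whitespace(text: str, mapping: Dict[str, str]) -> str:
    """Put the original whitespace back in place of the placeholder tags."""
    for tag, whitespace in mapping.items():
        text = text.replace(tag, whitespace)
    return text
```

      The restore step only works if the resource returns the placeholder tags unchanged, which is one reason a tag repair step after translation is useful.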

      additional:

      • "split_sentences" => "nonewlines" should be removed from the request. DeepL then sets this automatically in the right way.
      • "preserve_formatting" => false is the default and can be removed.
        OR better: set it to true, otherwise \n will be lost and zeile1\nzeile2 will end up as "line1line2" instead of "line1 line2"
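
      The DeepL recommendations above could translate into request parameters roughly as follows. This is a hedged Python sketch: `build_deepl_params` is a hypothetical helper, and it assumes that a tagHandling value of "none" means the `tag_handling` parameter is omitted from the request entirely, which the ticket itself leaves open.

```python
from typing import Any, Dict, List


def build_deepl_params(
    texts: List[str],
    target_lang: str,
    tag_handling: str = "none",
) -> Dict[str, Any]:
    """Build a translate request body following the suggested DeepL preset."""
    params: Dict[str, Any] = {
        "text": texts,
        "target_lang": target_lang,
        # Set explicitly so that line breaks are not dropped.
        "preserve_formatting": True,
    }
    # Assumption: "none" means the parameter is not sent at all.
    if tag_handling != "none":
        params["tag_handling"] = tag_handling
    # "split_sentences" is intentionally left out so DeepL picks its own default.
    return params
```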

      OpenAI

      runtimeOptions.LanguageResources.openai.sendWhitespaceAsTag activated
      runtimeOptions.LanguageResources.openai.tagHandler xliff_paired_tags

      additional

      Recommended by ChatGPT "for professional translations":
      'model' => 'gpt-4',
      or
      'model' => 'gpt-4-turbo',
      The model can be selected when the actual language resource is created.
      !!! The list that is offered contains some models (at least one) that are not able to translate at all.
      Selecting such a model ends in the error

      "This is not a chat model and thus not supported in the v1/chat/completions endpoint. Did you mean to use v1/completions?"
      

      So maybe the list can be filtered by some kind of attribute that indicates which models are able to translate. Otherwise it is a real pain for the user.
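
      One possible way to narrow the offered list is a name-based filter, sketched below in Python. This is an assumption, not a confirmed approach: the prefixes and the exclusion marker are a hypothetical allowlist that would need to be maintained by hand and checked against the models actually offered.

```python
from typing import Iterable, List

# Hypothetical allowlist: prefixes of model families assumed to support chat completions.
CHAT_PREFIXES = ("gpt-4", "gpt-3.5-turbo")
# Hypothetical exclusion: name fragments of models assumed to be completions-only.
NON_CHAT_MARKERS = ("instruct",)


def filter_chat_models(model_ids: Iterable[str]) -> List[str]:
    """Keep only model ids that look usable with the chat completions endpoint."""
    return sorted(
        model_id
        for model_id in model_ids
        if model_id.startswith(CHAT_PREFIXES)
        and not any(marker in model_id for marker in NON_CHAT_MARKERS)
    )
```

      A name-based filter is fragile, so a fallback that catches the "not a chat model" error and reports it clearly to the user would still be needed.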

      Google

      Really hard to decide; none of them is perfect. The "best" results are with:
      runtimeOptions.LanguageResources.google.sendWhitespaceAsTag activated
      runtimeOptions.LanguageResources.google.tagHandler xliff_paired_tags
      runtimeOptions.LanguageResources.google.format text

      Microsoft

      Hard to decide; formally all are OK. Must/can be decided by preferred results.
      Maybe html_image is not as good as the other two.

      runtimeOptions.LanguageResources.microsoft.sendWhitespaceAsTag activated
      runtimeOptions.LanguageResources.microsoft.tagHandler xlf_repair

            aleksandar Aleksandar Mitrev
            aleksandar Aleksandar Mitrev
            Axel Becher, Leon Kiz