translate5 / TRANSLATE-4778

Add entities to File-Format-Settings to patch imported XML files


    • Critical
    • This Feature needs the latest OKAPI-java11 container to work properly (translate5/okapi-longhorn:combined-java11)
    • Added entities to File-Format-Settings to patch imported XML files

      Problem

      If XML documents contain entities such as &nbsp; that are not declared in the DOCTYPE, parsing with OKAPI may fail with XPath exceptions during XML processing with the codefinder or other regex-based steps.

      Solution

      When entities are used in XML, we should add special entity declarations to the DOCTYPE declaration of the imported XML files and remove them on export. The entity declaration(s) can be added like this:

      <!DOCTYPE frntcover PUBLIC "-//USA-DOD//DTD TM Assembly REV C" "production.dtd" [
          <!ENTITY nbsp "&#160;">
          <!ENTITY copy "&#169;">
      ]>
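
The internal subset shown above could be generated from a list of entity names. The following is a minimal sketch, not the translate5 implementation: the function name `buildEntityDeclarations` is hypothetical, and it assumes the input consists of plain HTML entity names without `&` and `;`.

```php
<?php
// Hypothetical helper for illustration; not taken from the translate5 code base.
// Assumes $names holds plain entity names such as "nbsp" (no "&", no ";").
function buildEntityDeclarations(array $names): string
{
    $lines = [];
    foreach ($names as $name) {
        // Decode the named entity to the character it stands for.
        $char = html_entity_decode('&' . $name . ';', ENT_HTML5, 'UTF-8');
        // Unknown names stay undecoded (more than one character): skip them.
        if (mb_strlen($char, 'UTF-8') !== 1) {
            continue;
        }
        // Declare the name as a numeric character reference.
        $lines[] = '    <!ENTITY ' . $name . ' "&#' . mb_ord($char, 'UTF-8') . ';">';
    }
    // Wrap the declarations in the brackets of a DOCTYPE internal subset.
    return "[\n" . implode("\n", $lines) . "\n]";
}

echo buildEntityDeclarations(['nbsp', 'copy']), "\n";
```

The returned string is meant to be placed between the external identifier and the closing `>` of the DOCTYPE.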

       

      The only predefined entities in XML are:

      &lt; represents `<`
      &gt; represents `>`
      &amp; represents `&`
      &quot; represents `"`
      &apos; represents `'`

       
       

      What needs to be implemented

      • add a data-model field "patchedEntities" to LEK_okapi_bconf, holding a comma-separated list of entities
      • the list will look like "copy,nbsp,shy"
      • the list must only contain valid entities (check with html_entity_decode('&' . $entity . ';'): the result must be a single character)
      • the numbered entity can be generated programmatically
      • any input file that uses the okf_xml parser will be patched
      • on import, the DOCTYPE must be patched; the original DOCTYPE will be escaped and saved as a comment below the new patched DOCTYPE: <!--t5doctype ESCAPED_DOCTYPE t5doctype -->
      • on export, the original DOCTYPE will be restored
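
The validation and the import/export round trip listed above might look roughly as follows. This is a sketch under several assumptions, not the actual implementation: all function names are hypothetical, base64 stands in for the unspecified escaping of ESCAPED_DOCTYPE, the name pattern and the rejection of the five predefined entities are additions for safety, and only a DOCTYPE without an existing internal subset is handled.

```php
<?php
// Hypothetical helpers: names, regexes and the base64 escaping are assumptions.

/** Validates a list like "copy,nbsp,shy"; returns name => numeric reference. */
function parsePatchedEntities(string $list): array
{
    $predefined = ['lt', 'gt', 'amp', 'quot', 'apos']; // never redeclared here
    $result = [];
    foreach (explode(',', $list) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue;
        }
        $char = html_entity_decode('&' . $name . ';', ENT_HTML5, 'UTF-8');
        // Valid only if the name is alphanumeric and decodes to one character.
        if (in_array($name, $predefined, true)
            || !preg_match('/^[A-Za-z][A-Za-z0-9]*$/', $name)
            || mb_strlen($char, 'UTF-8') !== 1
        ) {
            throw new InvalidArgumentException('Invalid entity: ' . $name);
        }
        $result[$name] = '&#' . mb_ord($char, 'UTF-8') . ';';
    }
    return $result;
}

/** Import: adds the declarations and keeps the original DOCTYPE as a comment. */
function patchDoctype(string $xml, array $entities): string
{
    $decls = '';
    foreach ($entities as $name => $ref) {
        $decls .= "\n    <!ENTITY " . $name . ' "' . $ref . '">';
    }
    // Simplification: matches only a DOCTYPE without an internal subset.
    return preg_replace_callback(
        '/<!DOCTYPE\s+[^\[\]>]+>/',
        function (array $m) use ($decls): string {
            $patched = rtrim(substr($m[0], 0, -1)) . ' [' . $decls . "\n]>";
            // base64 cannot contain "--", so it is safe inside an XML comment.
            return $patched . "\n<!--t5doctype " . base64_encode($m[0]) . ' t5doctype -->';
        },
        $xml,
        1
    ) ?? $xml;
}

/** Export: replaces the patched DOCTYPE and its comment with the original. */
function restoreDoctype(string $xml): string
{
    return preg_replace_callback(
        '/<!DOCTYPE\s[^\[>]*\[.*?\]>\n<!--t5doctype (\S+) t5doctype -->/s',
        fn (array $m): string => base64_decode($m[1]),
        $xml,
        1
    ) ?? $xml;
}

$xml = "<!DOCTYPE doc SYSTEM \"doc.dtd\">\n<doc>a&nbsp;b</doc>";
$patched = patchDoctype($xml, parsePatchedEntities('copy,nbsp'));
echo $patched, "\n";
```

A document without any DOCTYPE is returned unchanged by this sketch; how such files should be handled is not specified in the list above.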

      Code to convert named to numbered entity:

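      // Decode the named entity to its character, then build the numeric character reference.
      $entity = '&copy;';
      $char = html_entity_decode($entity, ENT_HTML5, 'UTF-8');
      $numberedEntity = '&#' . mb_ord($char, 'UTF-8') . ';'; // '&#169;'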
      

      see https://www.freeformatter.com/html-entities.html

      Used Texts / Entities column

      "Entities" (all languages)
      Tooltip:
      en: "If you use non-standard entities in your imported XML documents, they need to be declared here for proper processing with OKAPI. The standard entities are: lt, gt, amp, quot, apos. Please enter all other entities you use in this field."
      de: "Wenn nicht-standardisierte Entities in importierten XML-Dateien vorhanden sind, müssen diese deklariert werden, damit die Verarbeitung mit OKAPI funktioniert. Die Standard-Entities sind: lt, gt, amp, quot, apos. Bitte geben Sie alle anderen in diesem Feld an."

      Additional requirement

      Since the worker has to be changed anyway, we should finally switch to dedicated import and export workers (a long-standing problem) and unify the shared code in a separate class.

      Amendment 6th of August 2025

      Tests showed that, even with the solution implemented above, there are rare cases where Java 11 still runs into an exception (see the related TS issue).

      As Denis (Okapi dev) stated, with Java 17 only one problem remains: an undeclared named entity directly before the closing root tag of the document.

      Since the problem itself is rare, and even rarer with the solution implemented in this issue, we will not address the remaining issues with Java 11. They will be solved with https://jira.translate5.net/browse/TRANSLATE-4846, when we upgrade to Java 17.

      For the remaining problem with Java 17 we will implement the following solution:

      • if we declare a named entity with the solution implemented in this issue, we add a small comment directly before the closing root tag of the XML document
      • on export, we remove the added comment
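
The two steps above could be sketched as follows. The marker text `<!--t5entityguard-->` and both function names are hypothetical, since the ticket does not specify the comment's content. The sketch also assumes that the closing root tag is the last markup in the file, followed at most by whitespace.

```php
<?php
// Hypothetical marker; the actual comment text is not specified in the ticket.
const T5_GUARD_COMMENT = '<!--t5entityguard-->';

/** Import: inserts the guard comment directly before the closing root tag. */
function addGuardComment(string $xml): string
{
    // The only closing tag followed by nothing but whitespace is the root's.
    if (!preg_match('/<\/[^<>\s]+\s*>\s*$/', $xml, $m, PREG_OFFSET_CAPTURE)) {
        return $xml; // e.g. a self-closing root element: nothing to guard
    }
    $pos = $m[0][1]; // byte offset of the closing root tag
    return substr($xml, 0, $pos) . T5_GUARD_COMMENT . substr($xml, $pos);
}

/** Export: removes the guard comment again. */
function removeGuardComment(string $xml): string
{
    return str_replace(T5_GUARD_COMMENT, '', $xml);
}

$xml = "<doc><p>a&nbsp;</p>&nbsp;</doc>\n";
echo addGuardComment($xml);
```

With the comment in place, an entity reference at the very end of the root element is no longer the last node before the closing tag.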

       

       

            volodymyr@mittagqi.com Volodymyr Kyianenko
            axelbecher Axel Becher
            Axel Becher