Details

    • Improvement
    • Resolution: Fixed
    • pdf2htmlex
    • Medium

    Description

      background

      The HTML is generated from a PDF with pdf2htmlex. The generation is triggered by code written by Aleks and is not part of this issue. Please read T5DEV-126 for the overall picture.

      the parser

      A parser must insert tags into the HTML that mark the borders of the target segments we have in translate5. Each tag must contain the segmentNrInTask found in translate5. (segmentNrInTask is preferable to the segmentID, because in the future we might build a feature to migrate tasks from one translate5 instance to another. The segmentNrInTask will always be unique within the task.)

      Attached you will find an example HTML of a very complex PDF (the complete data, including images, is in T5DEV-126). Starting at line 17695 there is an example of the markup I think would make sense: inserting
      <span class="t5-link" data-segNr="4711-1"> tags inside each div tag. Simply adding an attribute to the div tag is not possible, because segment borders will not always coincide with div-tag borders. There may even be several segments within one div tag.

      As an algorithm I would suggest the following:

      • remove all tags from the HTML, but remember their offsets
      • replace all white-space characters with spaces
      • collapse all repeated spaces (only single spaces are allowed), but remember the offsets of the deleted white-space
      • find the target segments and remember the offsets where they start and end
      • re-insert all tags and spaces, together with the new tags for the segment borders
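
The steps above could be sketched roughly as follows in Python. This is only an illustration under several assumptions, not the implementation used in translate5: the function name mark_segments and the input format (a list of (segment number, target text) pairs in document order) are hypothetical, tags are detected with a simple regular expression, and HTML entities are not decoded. Instead of physically removing and re-inserting the tags, the sketch keeps an offset map from the normalized text back to the original HTML, which serves the same purpose. A segment that crosses a tag is wrapped in several spans, one per uninterrupted text run, so the result stays well nested.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # simplistic tag matcher (assumption)


def mark_segments(html, segments):
    """Wrap each segment found in html in t5-link spans.

    segments: list of (seg_nr, text) pairs in document order (hypothetical format).
    """
    norm = []  # normalized characters: no tags, single spaces only
    pos = []   # pos[i] is the offset in html of norm[i]
    i = 0
    while i < len(html):
        m = TAG_RE.match(html, i) if html[i] == "<" else None
        if m:
            i = m.end()              # skip the tag, its offset is implicit
            continue
        ch = html[i]
        if ch.isspace():
            # keep one space per white-space run, drop the rest
            if norm and norm[-1] != " ":
                norm.append(" ")
                pos.append(i)
        else:
            norm.append(ch)
            pos.append(i)
        i += 1
    text = "".join(norm)

    inserts = []  # (html offset, 0 = closing tag / 1 = opening tag, markup)
    cursor = 0
    for seg_nr, seg_text in segments:
        needle = " ".join(seg_text.split())
        start = text.find(needle, cursor) if needle else -1
        if start < 0:
            continue                 # segment not found, leave it unmarked
        end = start + len(needle)
        cursor = end
        open_tag = '<span class="t5-link" data-segNr="%s">' % seg_nr
        run_start = start
        for k in range(start, end):
            # close the current run at the segment end or before a tag
            if k == end - 1 or "<" in html[pos[k] + 1:pos[k + 1]]:
                inserts.append((pos[run_start], 1, open_tag))
                inserts.append((pos[k] + 1, 0, "</span>"))
                run_start = k + 1

    out, prev = [], 0
    for idx, _, markup in sorted(inserts):  # closing tags first on ties
        out.append(html[prev:idx])
        out.append(markup)
        prev = idx
    out.append(html[prev:])
    return "".join(out)
```

Known limitation of this sketch: two text runs separated only by a tag (no white-space) are joined without a space in the normalized text, so a segment that expects a space at that position will not be found.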

      Please also see T5DEV-155. The segmenter has to provide the JS, in some simple way, with the information about which segments are present on which PDF page. This is necessary to easily load the information about all changed segments from the database into the frontend. One way to implement this would be as specified in T5DEV-155. But this should be designed together with tlauria and aleksandar before you start implementing this part.
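
Purely as a starting point for that discussion, the page-to-segment information could be derived from the marked-up HTML itself. The sketch below is an assumption and not the approach specified in T5DEV-155: it assumes that each page container carries a data-page-no attribute, that pages follow each other sequentially, and that segments are marked with data-segNr as described above. The names PageSegmentIndex and segments_per_page are hypothetical.

```python
from html.parser import HTMLParser


class PageSegmentIndex(HTMLParser):
    """Collects, per page, the segment numbers found in data-segNr attributes."""

    def __init__(self):
        super().__init__()
        self.pages = {}    # page id -> ordered list of segment numbers
        self._page = None  # page currently being read

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # HTMLParser lower-cases attribute names
        if "data-page-no" in attr:
            # assumed page marker: everything that follows belongs to this page
            self._page = attr["data-page-no"]
            self.pages.setdefault(self._page, [])
        seg = attr.get("data-segnr")
        if seg is not None and self._page is not None:
            if seg not in self.pages[self._page]:
                self.pages[self._page].append(seg)


def segments_per_page(html):
    """Return a dict mapping each page id to the segment numbers on that page."""
    parser = PageSegmentIndex()
    parser.feed(html)
    parser.close()
    return parser.pages
```

A segment that continues onto the next page is listed under both pages, so the frontend could request the segments of a single page without missing any.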

            People

              Stephan Stephan Bergmann
              marcmittag Marc Mittag [Administrator]