  1. translate5
  2. TRANSLATE-5034

Filter TMX on import


    • Type: Improvement
    • Resolution: Unresolved
    • None
    • None
    • t5memory
    • High
    • Apply filters to TMX files at import time on the translate5 side

      t5memory stores source duplicates (variants) as one of its features.

      This is dictated partly by the TMX standard and partly by the usual convention for how a TM should behave.

      We are now considering a more robust way to filter out translation units that are of little use to the end user.

      To that end, we will introduce 3 config parameters that allow duplicates to be filtered regardless of author, document or context.

      In the current logic, the uniqueness of a segment is determined by the combination of source text, author, document and context.
      A segment is replaced only if all of those fields are the same for the incoming segment and its timestamp is newer than that of the one already in t5memory.

      Once the improvement is done, we will be able to include or omit author, document and context from that combination depending on the config. Only the source text will always be part of it; in theory, all additional fields may be omitted.
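
      The uniqueness rule can be sketched roughly as below. This is only an illustration of the described behaviour, not the translate5 or t5memory implementation: the Segment class, uniqueness_key, dedupe and the skip_* keyword names are all hypothetical, and the "-" fallback assumes a missing context is stored as an empty string.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Segment:  # hypothetical model of one translation unit
    source: str
    target: str
    author: str
    document: str
    context: str
    timestamp: datetime


def uniqueness_key(seg, skip_author=False, skip_document=False, skip_context=False):
    """Source text is always part of the key; the other fields are optional."""
    key = [seg.source]
    if not skip_author:
        key.append(("author", seg.author))
    if not skip_document:
        key.append(("document", seg.document))
    if not skip_context:
        # Assumption: an empty context is replaced by the "-" placeholder
        key.append(("context", seg.context or "-"))
    return tuple(key)


def dedupe(segments, **flags):
    """Keep only the freshest segment for each uniqueness key."""
    freshest = {}
    for seg in segments:
        key = uniqueness_key(seg, **flags)
        current = freshest.get(key)
        if current is None or seg.timestamp > current.timestamp:
            freshest[key] = seg
    return list(freshest.values())


segments = [
    Segment("Hello", "Hallo", "anna", "a.docx", "", datetime(2024, 1, 1)),
    Segment("Hello", "Hallo!", "bob", "a.docx", "", datetime(2024, 2, 1)),
    Segment("Hello", "Guten Tag", "bob", "a.docx", "", datetime(2024, 3, 1)),
]
print(len(dedupe(segments)))                    # 2: one per author
print(len(dedupe(segments, skip_author=True)))  # 1: only the freshest survives
```

      With no flags set, the two segments by "bob" collapse into the newer one while the "anna" variant stays. With skip_author=True all three share one key, so only the newest remains.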

      How do I even test it?

      First, you need a TMX file with duplicates.
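
      If no suitable file is at hand, a small fixture can be generated, for example with the sketch below. It rests on several assumptions: the author is carried in the standard TMX creationid attribute, and the prop types "x-document" and "x-context" are placeholders that must be replaced with whatever names a real t5memory export uses. The helper names make_tu and build_tmx are made up for this example.

```python
import xml.etree.ElementTree as ET


def make_tu(source, target, author, document, context, changedate):
    """Build one <tu> element with en/de variants."""
    tu = ET.Element("tu", {"creationid": author, "changedate": changedate})
    # Hypothetical prop types; check a real t5memory export for the actual names
    ET.SubElement(tu, "prop", {"type": "x-document"}).text = document
    ET.SubElement(tu, "prop", {"type": "x-context"}).text = context
    for lang, text in (("en", source), ("de", target)):
        tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
        ET.SubElement(tuv, "seg").text = text
    return tu


def build_tmx(units):
    """Wrap the units in a minimal TMX 1.4 document and return it as a string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "dup-fixture", "creationtoolversion": "1",
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
        "srclang": "en", "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    body.extend(units)
    return ET.tostring(tmx, encoding="unicode")


units = [
    make_tu("Hello", "Hallo 1", "anna", "a.docx", "ctx1", "20240101T000000Z"),  # baseline
    make_tu("Hello", "Hallo 2", "anna", "a.docx", "ctx1", "20240201T000000Z"),  # fresher exact duplicate
    make_tu("Hello", "Hallo 3", "bob", "a.docx", "ctx1", "20240301T000000Z"),   # only author differs
    make_tu("Hello", "Hallo 4", "anna", "b.docx", "ctx1", "20240401T000000Z"),  # only document differs
    make_tu("Hello", "Hallo 5", "anna", "a.docx", "ctx2", "20240501T000000Z"),  # only context differs
]
tmx_text = build_tmx(units)
```

      Writing tmx_text to a file gives one duplicate per case, each with a distinct target so it is easy to see which unit survived the import.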

      In application/config/installation.ini you have to add the following settings:

      runtimeOptions.LanguageResources.t5memory.import.skipAuthor = 0
      runtimeOptions.LanguageResources.t5memory.import.skipDocument = 0
      runtimeOptions.LanguageResources.t5memory.import.skipContext = 0
      
      runtimeOptions.LanguageResources.t5memory.useTmxUtilsTrim = 0
      runtimeOptions.LanguageResources.t5memory.useTmxUtilsFilter = 0

      With everything set to 0, the memory resulting from the import should contain only the freshest of the exact duplicates, but still keep the variants that differ in author, document or context.

      Segments without context will receive a placeholder context, the "-" symbol.

       

      If skipAuthor is set to 1: all duplicates where only the author differs are skipped and only the freshest one is preserved.

      If skipDocument is set to 1: all duplicates where only the document differs are skipped and only the freshest one is preserved.

      If skipContext is set to 1: all duplicates where only the context differs are skipped and only the freshest one is preserved.

      The configs above may be combined in any way.

      If useTmxUtilsFilter is set to 1: all of the logic above should remain exactly the same. The only difference is processing speed.

       

      To test useTmxUtilsTrim you need a big TMX file, one that will definitely not fit into a single memory on the t5memory side.

      So the test is to import that file into test-lr-1 with useTmxUtilsTrim = 0, then export a TMX file.
      Then import the same file into test-lr-2 with useTmxUtilsTrim = 1, then export a TMX file.

      Compare the files. They should be the same. Once again, the only difference is import speed.
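
      A byte-for-byte diff may report differences that do not matter, such as unit order, so one option is to compare the translation units themselves. The sketch below is a hedged example: tmx_units and compare are hypothetical helpers, and it assumes that creationid, changedate, the props and the segment texts are the fields worth comparing.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_units(root):
    """Return a multiset of translation units as hashable tuples, ignoring order."""
    units = Counter()
    for tu in root.iter("tu"):
        props = tuple(sorted((p.get("type", ""), p.text or "") for p in tu.iter("prop")))
        variants = []
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            variants.append((tuv.get(XML_LANG) or tuv.get("lang", ""), text))
        units[(tu.get("creationid"), tu.get("changedate"), props, tuple(sorted(variants)))] += 1
    return units


def compare(text_a, text_b):
    """Return (units only in A, units only in B)."""
    a = tmx_units(ET.fromstring(text_a))
    b = tmx_units(ET.fromstring(text_b))
    return a - b, b - a


UNIT_1 = ('<tu creationid="anna" changedate="20240101T000000Z">'
          '<tuv xml:lang="en"><seg>Hello</seg></tuv>'
          '<tuv xml:lang="de"><seg>Hallo</seg></tuv></tu>')
UNIT_2 = ('<tu creationid="bob" changedate="20240201T000000Z">'
          '<tuv xml:lang="en"><seg>Bye</seg></tuv>'
          '<tuv xml:lang="de"><seg>Auf Wiedersehen</seg></tuv></tu>')
EXPORT_1 = '<tmx version="1.4"><body>' + UNIT_1 + UNIT_2 + '</body></tmx>'
EXPORT_2 = '<tmx version="1.4"><body>' + UNIT_2 + UNIT_1 + '</body></tmx>'  # same units, other order

only_in_1, only_in_2 = compare(EXPORT_1, EXPORT_2)
print(not only_in_1 and not only_in_2)  # True
```

      For real exports, ET.parse(path).getroot() can be passed to tmx_units instead of parsing a string. Very large files may need a streaming parser such as ET.iterparse.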

            sanya@mittagqi.com Sanya Mikhliaiev
            Leon Kiz
