Uploaded image for project: 'translate5'
  1. translate5
  2. TRANSLATE-5023

Performance issue with segment duplicate detection query

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • None
    • None
    • Auto-QA
    • Critical
    • Improved performance of duplicate segment detection by optimizing queries on large segment datasets. Future updates may further refine indexing and logic to reduce load.

      Problem Description:

      The query that checks for duplicate/variant target segments in \editor_Segment_Consistent_Check::getInconsistentSegmentNrsInTask on segment views may run slowly.

      SELECT  
        GROUP_CONCAT(`segmentNrInTask`) AS `nrs`,
        MAX(DISTINCT `editable`) AS `atLeastOneEditable`,
        MAX(DISTINCT `autoStateId` = 3) AS `atLeastOneLocked`
        FROM `LEK_segment_view_1313b8514edd64be348cd80384eb2dfb`  
        WHERE `targetEditToSort` != ''  
          AND `sourceToSort` != ''
          AND `autoStateId` != '16'
        GROUP BY BINARY `sourceToSort`
          HAVING COUNT(DISTINCT BINARY `targetEditToSort`) > 1
            AND (`atLeastOneLocked` = 0 OR `atLeastOneEditable` = 1); 

      Execution plan shows:

      • Full table scan over ~5.3k rows (type = ALL).
      • No indexes are being used (possible_keys = NULL).
      • Filtering conditions include sourceToSort != '', targetEditToSort != '', and autoStateId != 16.
      • Grouping on BINARY targetEditToSort and HAVING COUNT(DISTINCT BINARY sourceToSort) > 1 requires temporary tables and sorting (Using filesort).

      Impact:

      • Query performance degrades significantly on large datasets (millions of segments).
      • Duplicate detection becomes a bottleneck for workflows that rely on this validation.

      Notes:

      • Because both sourceToSort and targetEditToSort are LONGTEXT, indexing them directly is not feasible.
      • Partial/functional indexes (e.g., on targetEditToSort(255) or on hash columns like targetMd5, sourceMd5) could help.
      • Query may require refactoring to leverage existing Md5 columns instead of operating directly on LONGTEXT fields.
      • Alternative: precompute normalized sort fields in shorter indexed columns.

      Next Steps:

      • Evaluate feasibility of replacing GROUP BY BINARY targetEditToSort with grouping by targetMd5.
      • Add indexes on autoStateId, editable, and possibly (autoStateId, targetMd5) to filter and group more efficiently.
      • Test performance improvements with real-world data volumes.
      • If indexing alone is insufficient, redesign duplicate detection logic to avoid full scans and filesort.

       
       
       

       
       
       
       

       

            pavelperminov Pavel Perminov
            tlauria Thomas Lauria
            Axel Becher, Thomas Lauria
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: