- Bug
- Resolution: Unresolved
- None
- None
- Critical
- Improved performance of duplicate segment detection by optimizing queries on large segment datasets. Future updates may further refine indexing and logic to reduce load.
Problem Description:
The query that checks for duplicate/variant target segments in \editor_Segment_Consistent_Check::getInconsistentSegmentNrsInTask may run slowly on segment views:
SELECT
GROUP_CONCAT(`segmentNrInTask`) AS `nrs`,
MAX(DISTINCT `editable`) AS `atLeastOneEditable`,
MAX(DISTINCT `autoStateId` = 3) AS `atLeastOneLocked`
FROM `LEK_segment_view_1313b8514edd64be348cd80384eb2dfb`
WHERE `targetEditToSort` != ''
AND `sourceToSort` != ''
AND `autoStateId` != '16'
GROUP BY BINARY `sourceToSort`
HAVING COUNT(DISTINCT BINARY `targetEditToSort`) > 1
AND (`atLeastOneLocked` = 0 OR `atLeastOneEditable` = 1);
Execution plan shows (reproducible with the EXPLAIN statement after this list):
- Full table scan over ~5.3k rows (type = ALL).
- No indexes are being used (possible_keys = NULL).
- Filtering conditions include sourceToSort != '', targetEditToSort != '', and autoStateId != 16.
- Grouping on BINARY sourceToSort plus the HAVING COUNT(DISTINCT BINARY targetEditToSort) > 1 condition requires a temporary table and sorting (Using filesort).
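The plan can be re-checked at any time by prefixing the query with EXPLAIN; the view name below is the task-specific one from the example above:
EXPLAIN
SELECT
    GROUP_CONCAT(`segmentNrInTask`) AS `nrs`,
    MAX(DISTINCT `editable`) AS `atLeastOneEditable`,
    MAX(DISTINCT `autoStateId` = 3) AS `atLeastOneLocked`
FROM `LEK_segment_view_1313b8514edd64be348cd80384eb2dfb`
WHERE `targetEditToSort` != ''
  AND `sourceToSort` != ''
  AND `autoStateId` != '16'
GROUP BY BINARY `sourceToSort`
HAVING COUNT(DISTINCT BINARY `targetEditToSort`) > 1
   AND (`atLeastOneLocked` = 0 OR `atLeastOneEditable` = 1);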
Impact:
- Query performance degrades significantly on large datasets (millions of segments).
- Duplicate detection becomes a bottleneck for workflows that rely on this validation.
Notes:
- Because both sourceToSort and targetEditToSort are LONGTEXT, indexing them directly is not feasible.
- Partial/functional indexes (e.g., on targetEditToSort(255) or on hash columns like targetMd5, sourceMd5) could help.
- Query may require refactoring to leverage the existing Md5 columns instead of operating directly on LONGTEXT fields (see the sketch after this list).
- Alternative: precompute normalized sort fields in shorter indexed columns.
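A minimal sketch of the Md5-based refactoring, assuming the view exposes the sourceMd5 and targetMd5 columns mentioned above and that they faithfully mirror the normalized sort fields:
SELECT
    GROUP_CONCAT(`segmentNrInTask`) AS `nrs`,
    MAX(DISTINCT `editable`) AS `atLeastOneEditable`,
    MAX(DISTINCT `autoStateId` = 3) AS `atLeastOneLocked`
FROM `LEK_segment_view_1313b8514edd64be348cd80384eb2dfb`
WHERE `targetEditToSort` != ''
  AND `sourceToSort` != ''
  AND `autoStateId` != '16'
GROUP BY `sourceMd5`                     -- fixed-length hash instead of LONGTEXT
HAVING COUNT(DISTINCT `targetMd5`) > 1   -- variant targets per identical source
   AND (`atLeastOneLocked` = 0 OR `atLeastOneEditable` = 1);
Note that hash collisions would conflate distinct sources, so flagged groups may still need a verification pass against the full text before being reported as inconsistent.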
Next Steps:
- Evaluate feasibility of replacing GROUP BY BINARY sourceToSort with grouping by sourceMd5, and COUNT(DISTINCT BINARY targetEditToSort) with COUNT(DISTINCT targetMd5), as sketched in the notes above.
- Add indexes on autoStateId, editable, and possibly (autoStateId, targetMd5) to filter and group more efficiently (illustrative DDL at the end of this ticket).
- Test performance improvements with real-world data volumes.
- If indexing alone is insufficient, redesign duplicate detection logic to avoid full scans and filesort.
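Illustrative DDL for the index experiments above (index names are made up; prefix lengths and column order would need validation against real data volumes):
ALTER TABLE `LEK_segment_view_1313b8514edd64be348cd80384eb2dfb`
    ADD INDEX `idx_autoStateId` (`autoStateId`),
    ADD INDEX `idx_editable` (`editable`),
    -- or (`autoStateId`, `sourceMd5`) if grouping moves to sourceMd5:
    ADD INDEX `idx_autoState_targetMd5` (`autoStateId`, `targetMd5`);
Since the LEK_segment_view_* tables appear to be generated per task, such indexes would presumably have to be added as part of view (re)creation rather than as a one-off migration.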