-
Bug
-
Resolution: Unresolved
-
None
-
None
-
Medium
-
Issue description
For the document in question, most segments look fine. However, for a number of segments only part of the segment is connected to the layout.
The affected segments in the first test file are 211, 201, 195, 184, 182, 78 and 46.
I would expect the following changes to the algorithm to improve this, not only for this document, but to lead to a better connection rate in general:
- Currently, segments with fewer than 10 characters are connected in a second loop (is this number correct?).
- In the future, segments with fewer than 3 words should also be connected in the second loop. Treating spaces as the only word boundary should be sufficient here (please challenge this). That way, e-mail addresses and domains, which sometimes hinder correct connections, would also be handled in the second loop.
- The second loop should also handle segments that contain only numbers and/or the following characters: / \ - _ , . ; : ! ? + * ~ = ( ) { } ( ) [ ] " # ' ' `
- The same applies to segments that contain only the characters from the previous bullet, whitespace, and uppercase letters.
With these changes, all of the segments mentioned above should be connected correctly. The third bullet does not change anything in the first test file, but it does affect segment 97 of the second test file.
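To make the four rules easier to discuss, here is a minimal Python sketch of how they could be checked. It is an illustration only, not the existing implementation: all function names are hypothetical, the threshold of 10 characters is the unconfirmed value from the first bullet, and the sketch assumes that digits also count as allowed characters in the uppercase rule.

```python
# Characters from the third bullet, taken literally from the issue text.
ALLOWED_SYMBOLS = set("/\\-_,.;:!?+*~=(){}[]\"#'`")

MAX_CHARS = 10  # "fewer than 10 characters" (value still to be confirmed)
MAX_WORDS = 3   # "fewer than 3 words"


def is_short(text: str) -> bool:
    """Rule 1: the segment has fewer than MAX_CHARS characters."""
    return len(text) < MAX_CHARS


def has_few_words(text: str) -> bool:
    """Rule 2: fewer than MAX_WORDS words, using only spaces as boundary."""
    words = [w for w in text.split(" ") if w]
    return len(words) < MAX_WORDS


def is_numeric_or_symbols(text: str) -> bool:
    """Rule 3: only digits and/or the listed symbols (no whitespace)."""
    return all(c.isdigit() or c in ALLOWED_SYMBOLS for c in text)


def is_uppercase_with_symbols(text: str) -> bool:
    """Rule 4: only listed symbols, whitespace and uppercase letters.

    Assumption: digits are allowed here as well.
    """
    return all(
        c.isdigit() or c in ALLOWED_SYMBOLS or c.isspace() or c.isupper()
        for c in text
    )


def defer_to_second_loop(text: str) -> bool:
    """Return True if any of the four rules applies to the segment text."""
    return (
        is_short(text)
        or has_few_words(text)
        or is_numeric_or_symbols(text)
        or is_uppercase_with_symbols(text)
    )
```

One point that may be worth checking when reviewing the rules: read literally, a segment matching the third rule contains no spaces, so it also matches the word-count rule. The third rule only adds something on its own if whitespace is allowed in it or if words are counted differently.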
Concept of solving this
Each segment gets a new property "do not search in first loop". When set, the segment is skipped in the first loop of the segment-finding process and is only handled in the second loop. The property is set when the segment object is initialized, based on the rules defined above.
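As a rough illustration of this concept, the following Python sketch sets such a flag once at construction and then splits the segments between the two loops. The class, the attribute name `do_not_search_in_first_loop`, and the helper `split_by_loop` are hypothetical, and the default rule is a stand-in that only covers the character-count rule.

```python
from typing import Callable, Iterable, List, Tuple


def _short_segment(text: str) -> bool:
    # Stand-in rule for this sketch only: "fewer than 10 characters".
    return len(text) < 10


class Segment:
    """Hypothetical segment object carrying the new property."""

    def __init__(self, text: str,
                 rule: Callable[[str], bool] = _short_segment) -> None:
        self.text = text
        # Evaluated once on initialization, as described in the concept.
        self.do_not_search_in_first_loop = rule(text)


def split_by_loop(
    segments: Iterable[Segment],
) -> Tuple[List[Segment], List[Segment]]:
    """Return (segments for the first loop, segments for the second loop)."""
    first: List[Segment] = []
    second: List[Segment] = []
    for segment in segments:
        if segment.do_not_search_in_first_loop:
            second.append(segment)
        else:
            first.append(segment)
    return first, second


segments = [Segment("A full sentence of body text."), Segment("p. 12")]
first_loop, second_loop = split_by_loop(segments)
```

Passing the rule in as a callable keeps the flag logic separate from the segment object, so the rule set can be adjusted without touching the loops.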