- Type: Improvement
- Resolution: Unresolved
- None
- Affects Version/s: None
- Component/s: file format settings
- Critical

File Format Settings: Improve Segmentation after Colons: Take quotes into account
Problem
- a) Segmentation after colons is frequently incorrect: no break is made when the text after the colon starts with a quotation mark.
- b) Problem in OKAPI implementation: https://groups.google.com/g/okapi-users/c/pZgAi_tsn28/m/N5L_RgPlCAAJ?utm_medium=email&utm_source=footer&pli=1
Solution
a)
Wherever our default SRX files contain the following break rule,
<rule break="yes"> <beforebreak>:</beforebreak> <afterbreak>\s+\p{Lu}</afterbreak> </rule>
replace it with this one:
<rule break="yes"> <beforebreak>:</beforebreak> <afterbreak>\s+[„»‚›\"']? ?\p{Lu}</afterbreak> </rule>
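The effect of the changed afterbreak pattern can be tried outside of an SRX engine. The following is a minimal sketch, not the Okapi implementation: `unicode_class` and `segment` are hypothetical helpers, `\p{Lu}` is emulated with an explicit character class because Python's `re` module does not support `\p{...}`, and the way whitespace is assigned to segments is a simplification.

```python
import re
import sys
import unicodedata


def unicode_class(*categories):
    """Hypothetical helper: character class emulating \\p{...} for the given categories."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    body = "".join(re.escape(c) for c in chars if unicodedata.category(c) in categories)
    return "[" + body + "]"


LU = unicode_class("Lu")  # stands in for \p{Lu}

BEFORE = re.compile(r":")                             # <beforebreak>
OLD_AFTER = re.compile(r"\s+" + LU)                   # old <afterbreak>
NEW_AFTER = re.compile(r"\s+[„»‚›\"']? ?" + LU)       # new <afterbreak>


def segment(text, before, after):
    """Cut text wherever `before` ends and `after` matches right behind it."""
    cuts = [m.end() for m in before.finditer(text) if after.match(text, m.end())]
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(text[start:cut])
        start = cut
    pieces.append(text[start:])
    return pieces


text = "Er sagte: „Hallo Welt“. Hinweis: Das ist ein Test."
print(segment(text, BEFORE, OLD_AFTER))  # no break after "Er sagte:"
print(segment(text, BEFORE, NEW_AFTER))  # additional break after "Er sagte:"
```

With the old pattern, the colon followed by an opening quote is skipped because `„` is not an uppercase letter; the new pattern allows one optional quote character and one optional space before the uppercase letter.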
b)
For us, the solution is to replace the beforebreak of this break="yes" rule
<beforebreak>[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]\s*[\p{Pe}\p{Pf}\p{Po}"'"''’""]\s[\.?!]\s*[\p{Pe}\p{Pf}\p{Po}"'"''’""]*</beforebreak>
with this one
<beforebreak>[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]\s*[\p{Pe}\p{Pf}\p{Po}"'"''’""]\s[\.?!]\s+[\p{Pe}\p{Pf}\p{Po}"'"''’""]*</beforebreak>
The only change is that the last \s* in the regex is replaced with \s+.
This solves our problem (for details, see the Okapi group link above) and, I think, many similar problems.
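What the change from `\s*` to `\s+` means for the beforebreak pattern alone can be checked with a small sketch. Assumptions: `unicode_class` is a hypothetical helper that emulates `\p{...}` in Python's `re` module, and the explicit quote characters from the rule are left out of the character class, since the straight quotes and `’` are already covered by `\p{Po}` and `\p{Pf}`. The sketch only shows which strings the pattern accepts, not how Okapi applies the rule.

```python
import re
import sys
import unicodedata


def unicode_class(*categories):
    """Hypothetical helper: character class emulating \\p{...} for the given categories."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    body = "".join(re.escape(c) for c in chars if unicodedata.category(c) in categories)
    return "[" + body + "]"


LETTER_OR_DIGIT = unicode_class("Ll", "Lu", "Lt", "Lo", "Nd")
CLOSER = unicode_class("Pe", "Pf", "Po")  # closing and other punctuation

# Old beforebreak: whitespace after the sentence terminator is optional.
OLD_BEFORE = re.compile(
    LETTER_OR_DIGIT + r"\s*" + CLOSER + r"\s[\.?!]\s*" + CLOSER + "*"
)
# New beforebreak: at least one whitespace character must follow the terminator.
NEW_BEFORE = re.compile(
    LETTER_OR_DIGIT + r"\s*" + CLOSER + r"\s[\.?!]\s+" + CLOSER + "*"
)

for candidate in ["e) .", "e) . "]:
    print(repr(candidate),
          bool(OLD_BEFORE.fullmatch(candidate)),
          bool(NEW_BEFORE.fullmatch(candidate)))
```

The old pattern can end directly after the terminator, so a break position may fall immediately behind `.`, `?` or `!`; the new pattern only ends after at least one whitespace character has been consumed.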