Details
Task
Resolution: Unresolved
Medium
Description
Current status and open questions:
- We have a Docker container that starts LanguageTool and downloads the LibreOffice Hunspell dictionaries from our own fork: https://bitbucket.org/mittagqi/hunspell-dictionaries-for-languagetool
- The repo was forked and the common word lists were added manually (they had to be converted from XML to txt); the config.properties file was also created manually.
- The fork contains a fixed, manually created config file for using those files. Common word lists from https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries were also added.
- Open questions here:
  - What happens if a dictionary is added for a language that LanguageTool already provides natively? Which data is then used for that language?
  - The LanguageTool documentation suggests using Morfologik data instead of the slower Hunspell data: https://dev.languagetool.org/hunspell-support.html
  - How would such dictionary files be included? By passing them via the config, as is currently done for the Hunspell dict files, or are they picked up automatically?
  - Are the common word lists really needed? As far as I understand, they are only used for language detection, and we already know the language to be used.
- N-grams are downloaded automatically from https://languagetool.org/download/ngram-data/. Are there other sources?
- The dictionaries and n-grams are downloaded once after start. To update the local files, they have to be removed; a container restart or a run of the download scripts then downloads them again.
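The "download once, remove to refresh" behaviour described above could be sketched roughly as follows. This is only an illustration, not the actual download script from the container: the function names `ensure_downloaded` and `fetch_with_urllib` and the `.part` temp-file convention are assumptions made up for this example.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def fetch_with_urllib(url: str, target: Path) -> None:
    """Default fetcher (hypothetical helper): download url to target."""
    urlretrieve(url, str(target))


def ensure_downloaded(
    url: str,
    target: Path,
    fetch: Callable[[str, Path], None] = fetch_with_urllib,
) -> bool:
    """Download url to target only if target does not exist yet.

    Returns True if a download happened, False if the local file was reused.
    Deleting the local file is what triggers a fresh download on the next run.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first, so an aborted download is not
    # mistaken for a complete file on the next container start.
    tmp = target.with_name(target.name + ".part")
    fetch(url, tmp)
    tmp.replace(target)
    return True
```

Calling `ensure_downloaded` for each dictionary and n-gram archive at container start would reproduce the current behaviour: existing files are kept, removed files are fetched again.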
Todo:
- Download the dictionaries automatically from the root repo, and do the same for the word lists; integrate the languages not provided by LanguageTool (depending on the answer to the first open question above).
- Use Morfologik files?
- Compare the list of languages with the list of languages in translate5.
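For the language comparison, a minimal sketch could look like the following. It assumes both lists are available as plain language codes; how they are obtained from LanguageTool and translate5 is left open here. The function names and the normalization rule (lower-case, `_` replaced by `-`) are assumptions for illustration only.

```python
def normalize(code: str) -> str:
    """Normalize a language code for comparison: lower-case, '_' becomes '-'."""
    return code.strip().lower().replace("_", "-")


def compare_languages(languagetool_codes, translate5_codes):
    """Group language codes by where they occur (after normalization)."""
    lt = {normalize(c) for c in languagetool_codes}
    t5 = {normalize(c) for c in translate5_codes}
    return {
        # Candidates for additional dictionaries.
        "missing_in_languagetool": sorted(t5 - lt),
        # Provided by LanguageTool but not used in translate5.
        "unused_in_translate5": sorted(lt - t5),
        "common": sorted(lt & t5),
    }
```

The `missing_in_languagetool` group would be the starting point for deciding which extra dictionaries have to be integrated.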