James Scott-Brown

Compiling a comprehensive biomedical/chemical spell-check dictionary

I often type biomedical or chemical terms which are not recognised by default spell-checkers.

When all technical terms are highlighted as being misspelt, it is much harder to spot where you have actually made typos.

A while ago, I therefore programmatically created a large word-list of technical terms. This was derived from a number of sources:

The NCBI lexicon contains lexical details of words (part of speech, etc.), so I extracted only the words with a regex. I then concatenated the files, sorted them with sort, and removed duplicates with uniq. I also did a little additional processing.
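As a rough illustration of that pipeline, here is a minimal Python sketch. It is not the exact set of commands I ran. The record layout it assumes (each lexicon entry starting with a line such as `{base=aspirin`) and the function names `extract_words` and `merge_word_lists` are assumptions made for the example, so adjust the regex to whatever the downloaded file actually looks like.

```python
import re
from typing import Iterable, List

# Assumed record layout: each lexicon entry begins with a line like "{base=aspirin".
# This pattern is an illustration only; adapt it to the real file format.
BASE_LINE = re.compile(r"^\{base=(.+)$")


def extract_words(lexicon_lines: Iterable[str]) -> List[str]:
    """Keep only the headword of each entry, discarding part of speech etc."""
    words = []
    for line in lexicon_lines:
        match = BASE_LINE.match(line.strip())
        if match:
            words.append(match.group(1).strip())
    return words


def merge_word_lists(*word_lists: Iterable[str]) -> List[str]:
    """Concatenate several word lists, drop duplicates and sort the result.

    Roughly equivalent to `cat files | sort | uniq`, except that blank
    lines are dropped and the ordering is by Unicode code point rather
    than by the shell's locale.
    """
    unique = set()
    for words in word_lists:
        for word in words:
            word = word.strip()
            if word:
                unique.add(word)
    return sorted(unique)
```

Each source file would be read into a list of lines, passed through `extract_words` if it carries extra lexical detail, and then combined with `merge_word_lists`.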

The result is a single file with 371,261 entries. This includes almost all the terms that I am likely to use, plus many that I am not. There are some gene names that possibly shouldn’t be there (such as whsc1l2p), but I’ve resisted the urge to manually delete entries.

The file is comprehensive enough that, after appending it to the end of ~/Library/Spelling/LocalDictionary on my Mac and logging out and back in again, almost all of the spurious squiggly lines disappear. Even more usefully, I now get suggestions for how to spell obscure words like woytkowskii or wiedemann-rautenstrauch.
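The appending step can be sketched in Python as below. This is a hedged example, not a tested installer: it assumes the local dictionary is a UTF-8 text file with one word per line, and the function name `append_word_list` is made up for the illustration. It is worth making a backup copy of the dictionary first.

```python
from pathlib import Path


def append_word_list(word_list: Path, dictionary: Path) -> int:
    """Append every non-blank line of word_list to dictionary.

    Assumes both files are UTF-8 text with one word per line.
    Returns the number of words appended.
    """
    words = [
        line.strip()
        for line in word_list.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    existing = dictionary.read_text(encoding="utf-8") if dictionary.exists() else ""
    # Make sure the first new word does not get glued onto the last existing one.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with dictionary.open("a", encoding="utf-8") as handle:
        handle.write(separator)
        handle.writelines(word + "\n" for word in words)
    return len(words)
```

For example, `append_word_list(Path("technical_terms.txt"), Path.home() / "Library/Spelling/LocalDictionary")` would add the list to the dictionary mentioned above, where `technical_terms.txt` is a placeholder name for the downloaded word list. Logging out and back in is still needed for the change to take effect.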
