Hunspell

 

Hunspell is a widely used spell-checker that can handle numerous languages. Notable applications that use it are LibreOffice, Firefox, Thunderbird, and Chrome. The examples on this page employ Ole Schonburg's Java binding which includes the Hunspell code, but no dictionaries.

There are many sources for suitable dictionaries, including LibreOffice, but I downloaded the standard US dictionary (and its larger version) from SCOWL.

Almost all of my examples are related to word puzzles, including word wheels, grids, anagrams, and ladders, and so mostly use Hunspell to check if a string is a dictionary word. They employ my HunspellChecker class to interact with Hunspell since it handles the loading of dictionaries and spelling corrections.

Word puzzle programs of this type often rely upon word lists (i.e. a text file containing one word per line). Typical lists of this type can be found here, here, here, and here. For instance the Rosetta code problems for word ladders, word grid search, word wheels, and anagrams all suggest the use of unixdict.txt. This is a fine approach, but in larger tests I found that using Hunspell resulted in a significant speed up (probably because of the optimized storage of its dictionaries). On the downside, there was a 1-2 second delay when Hunspell was first used, triggered by the creation of those data structures and the start-up of the Hunspell service. This overhead didn't occur in subsequent calls to Hunspell, even from different Java programs.

1. Basic Tests

HunspellTest.java has Hunspell spellcheck a user-supplied word, and corrections are suggested if necessary.

HunspellChecker.java is used as a class by the other examples to invoke Hunspell, check words, and obtain suggested corrections. It also includes a main() test rig that checks the spelling of all the words in a supplied text file, such as testSpell1.txt.

2. Jumble: a Word Game

JumbleGame.java is a supremely exciting guess-the-word game which, rather than containing a hardwired list of words for the user's enjoyment, creates a random string of letters which Hunspell corrects, producing a dictionary word for the game.

JumbleGame.java also utilizes my WordUtils.java, a collection of support functions related to word puzzles. The main areas it covers are

It also comes with a main() function, for testing its functionality.

3. Splitting Strings into Words

SplitWords.java splits a user-supplied string into words without any reordering of the text. One irritation of using Hunspell for this task is the large number of two- and three- letter "Scrabble" type words it knows, which reduces the quality of the output (in my opinion). When I coded the anagram solver (see below), I filtered out those results by testing words of those lengths against common words stored in WordUtils.java.

4. Word Wheels and Graphs

[Word Wheel PIC] WordWheel.java finds all the words made from letters in a supplied string, using the letters at most once. Also, all the words must include the 'center' letter, which is supplied as a separate argument. For example, the wheel shown on the right is solved by calling:

run WordWheel tendiop d

This program is noticeably much faster when Hunspell is used to find words rather than by searching a words list.

[Lettergram PIC] Lettergram.java extends the wheel idea to an undirected graph with the letters as its nodes (e.g. see the example on the right). The code employs a depth-first search of the graph, generating a string at every step along the path until all the nodes have been visited. The graph is encoded as an adjacency list, built with data read from a text file (e.g. adjsLetters.txt). Each node is encoded as an AdjNode object.

5. Word Grids

WordsGrid.java is a simplified version of the Rosetta Code puzzle. The words may be placed horizontally, vertically, or diagonally, can be spelled backwards, and overlap. They are not allowed to zigzag, or wrap around. Unlike the Rosetta code problem, there's no message hidden in the grid.

The more complex output, which includes the starting and ending coordinates of each word, is implemented using a GridWord class.

Two example input grids are: wordSq1.txt and rosettaSq.txt.

[PathFinder PIC] PathFinder.java reuses the character grid format but attempts to generate a path of words starting from the center of the grid which covers it entirely. Each letter can be used only once, and the next letter in a word can be North, South, East, or West of the current one.

This is computational much tougher than the basic grid puzzle: it takes between 200 and 400 seconds to generate an answer on a 9 x 9 grid. As a result, I made the puzzle a little easier by restricting valid words to be between three and seven letter long.

Two example input grids are: wordSq1.txt and wordSq2.txt.

6. Anagrams

SingleAnagrams.java generates anagrams of a supplied word. The 'single' in the name is to distinguish this approach from the next program.

Anagrams.java allows the input string to be split, in order to generate several words based on the input. For an input string of 11 characters or less, the program can be implemented by calling WordUtils.permute() or WordUtils.ipermute(), the recursive and iteration permutation functions. However, the program runs out of memory when asked to find anagrams of a 12-character word. The problem is that both functions generate all the permutations at once, storing them in a single, very big, list before passing them to WordUtils.partitions() and Hunspell. This list is simply too large, with 12! strings (479,001,600). It's possible to allocate Java a larger heap space, but a better solution is to calculate the permutations using the factoradic number system. This is explained in FactPermutes.java, and implemented as WordUtils.fpermute().

From a practical point of view, the factoradic approach means that the n! permutations can be generated incrementally, rather than all at once. That means there's no need to build a massive list before calling partitions() and Hunspell.

7. Word Ladders

WordLadder.java is supplied with two words – a begin word and an end word, and attempts to convert the first into the second by changing one letter at a time. Each change represents one step (rung) in a ladder. The other aim is to create a ladder with the shortest number of steps.

WordLadder.java loads a text file of words (words.txt), and optimizes its search by only keeping those words of the same length as the begin and end words. The code also deletes used words from the list during execution. Neither of these optimizations is possible with Hunspell, so I decided to code both versions to compare their ladder lengths and speeds.

WordLadder.java uses breadth-first search (BFS), while the Hunspell version, Doublets.java, uses a priority queue and a simple scoring scheme supported by the WordNode class. The two versions take about the same time to find typical ladders, but the BFS ladders are sometimes shorter due to the simplicity of the scoring scheme in Doublets.java.


 

Downloads and Other Links


Dr. Andrew Davison
E-mail: ad@coe.psu.ac.th
Back to the third-party libraries page