Friday, March 26, 2010

How to write your own spell checker ? C#.net

Hi There,
Today I wrote a readable & usable implementation of Norvig's Spell corrector.
Norvi's implementation of spell check is awesome! On his site he has link to many other implementation in various languages, including C#.

Most of the implementation are focused on minimum number of lines , so they are not readable. I have written my own implementation which you can use on your own risk.

Here is the source code

You can also check-out entire code using this svn command




            svn checkout https://jugad.googlecode.com/svn/trunk/C%23/SpellCheck SpellCheck




How Algorithm works?



1.Initially Algorithms builds a dictionary* data-structure of words, from a huge English text. (so algorithm does not uses dictionary# file as such). This dictionary* contains words-to-frequency-of-occurrence mapping. lets say it as nWords.



2.When you will fetch a word(lets say 'w') for spell correction,Then the algorithm builds a list of all possible words that can be formed from w by doing following operation

  1. Deletion (delete 1 character, for all possible combination)
  2. Transposition (flip 1 character, for all possible combination)
  3. Alteration (modify 1 character, for all possible combination)
  4. Insertion (insert 1 character, for all possible combination)

Lets say name of this list is 'edits' , note that all words in this list are just 1 step away from original word.
3. If edits has 1 or more words that exists in nWords, then it returns the word from nWords with maximum frequency, as a result.
4. If No words in 'edits' are present in nWords, then it creates another edits using edits to get second level word (word which are 2 step away from original input word)
5.Repeats from step-3 (or give up if your are already 2/3 steps apart from original word 'w')
Read more Here.

How To Use This code ?
To use this code instantiate class spell by passing huge-string of English text (from where it will build dictionary)
Then call function correct() to get the corrected word as result.

Note:This code is just to demonstrate the algorithm, modify constructor to build dictionary directly from file. You can also avoid calling sort operation in function 'correct'. Thoroughly understand the code before using it. Notify me if you find any critical problem.


*dictionary data-structure of programming
#real dictionary

No comments:

Post a Comment