How to Write a Spelling Corrector in Racket
In September, 2008, I translated Peter Norvig’s spelling corrector into Ruby. My current favorite language is Racket, so I thought it would be a good exercise to port it to Racket. After some helpful tips by Vincent St-Amour and Sam Tobin-Hochstadt in the #racket IRC channel, I came up with the following. I’ll show it two different ways, the first minimizes the line count (without sacrificing too much stylistically) to 27 lines, and the second is closer to how I’d normally format it:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
#lang racket (define (words text) (regexp-match* #rx"[a-z]+" (string-downcase text))) (define (train features) (define model (make-hash)) (for ([f features]) (hash-update! model f add1 1)) model) (define nwords (train (words (file->string "big.txt")))) (define (edits1 word) (let* ([alphabet "abcdefghijklmnopqrstuvwxyz"] [splits (for/list ([i (in-range (+ (string-length word) 1))]) (cons (substring word 0 i) (substring word i)))] [deletes (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 0)) (string-append a (substring b 1)))] [transposes (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 1)) (string-append a (substring b 1 2) (substring b 0 1) (substring b 2)))] [replaces (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 0) [c alphabet]) (string-append a (string c) (substring b 1)))] [inserts (for*/set ([(a b) (in-dict splits)] [c alphabet]) (string-append a (string c) b))]) (set-union deletes transposes replaces inserts))) (define (known-edits2 word) (for*/set ([e1 (edits1 word)] [e2 (edits1 e1)] #:when (hash-has-key? nwords e2)) e2)) (define (known words) (for/set ([w words] #:when (hash-has-key? nwords w)) w)) (define (nes set) (if (set-empty? set) #f set)) (define (correct word) (let ([candidates (or (nes (known (list word))) (nes (known (edits1 word))) (nes (known-edits2 word)) (set word))]) (argmax (λ (x) (hash-ref nwords x 1)) (set->list candidates)))) |
And here is a more aesthetically pleasing format:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 |
#lang racket (define (words text) (regexp-match* #rx"[a-z]+" (string-downcase text))) (define (train features) (define model (make-hash)) (for ([f features]) (hash-update! model f add1 1)) model) (define nwords (train (words (file->string "big.txt")))) (define (edits1 word) (let* ([alphabet "abcdefghijklmnopqrstuvwxyz"] [splits (for/list ([i (in-range (+ (string-length word) 1))]) (cons (substring word 0 i) (substring word i)))] [deletes (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 0)) (string-append a (substring b 1)))] [transposes (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 1)) (string-append a (substring b 1 2) (substring b 0 1) (substring b 2)))] [replaces (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 0) [c alphabet]) (string-append a (string c) (substring b 1)))] [inserts (for*/set ([(a b) (in-dict splits)] [c alphabet]) (string-append a (string c) b))]) (set-union deletes transposes replaces inserts))) (define (known-edits2 word) (for*/set ([e1 (edits1 word)] [e2 (edits1 e1)] #:when (hash-has-key? nwords e2)) e2)) (define (known words) (for/set ([w words] #:when (hash-has-key? nwords w)) w)) (define (nes set) (if (set-empty? set) #f set)) (define (correct word) (let ([candidates (or (nes (known (list word))) (nes (known (edits1 word))) (nes (known-edits2 word)) (set word))]) (argmax (λ (x) (hash-ref nwords x 1)) (set->list candidates)))) |