Lojic Technologies

Archive for October 2015

How to Write a Spelling Corrector in Racket

with one comment

In September, 2008, I translated Peter Norvig’s spelling corrector into Ruby. My current favorite language is Racket, so I thought it would be a good exercise to port it to Racket. After some helpful tips by Vincent St-Amour and Sam Tobin-Hochstadt in the #racket IRC channel, I came up with the following. I’ll show it two different ways, the first minimizes the line count (without sacrificing too much stylistically) to 27 lines, and the second is closer to how I’d normally format it:

#lang racket

(define (words text) (regexp-match* #rx"[a-z]+" (string-downcase text)))

(define (train features)
  (define model (make-hash))
  (for ([f features]) (hash-update! model f add1 1)) model)

(define nwords (train (words (file->string "big.txt"))))

(define (edits1 word)
  (let* ([alphabet "abcdefghijklmnopqrstuvwxyz"]
         [splits (for/list ([i (in-range (+ (string-length word) 1))])
                   (cons (substring word 0 i) (substring word i)))]
         [deletes (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 0))
                    (string-append a (substring b 1)))]
         [transposes (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 1))
                       (string-append a (substring b 1 2) (substring b 0 1) (substring b 2)))]
         [replaces (for/set ([(a b) (in-dict splits)] #:when (> (string-length b) 0) [c alphabet])
                     (string-append a (string c) (substring b 1)))]
         [inserts (for*/set ([(a b) (in-dict splits)] [c alphabet])
                    (string-append a (string c) b))])
    (set-union deletes transposes replaces inserts)))

(define (known-edits2 word)
  (for*/set ([e1 (edits1 word)] [e2 (edits1 e1)] #:when (hash-has-key? nwords e2)) e2))

(define (known words) (for/set ([w words] #:when (hash-has-key? nwords w)) w))

(define (nes set) (if (set-empty? set) #f set))

(define (correct word)
  (let ([candidates (or (nes (known (list word))) (nes (known (edits1 word)))
                        (nes (known-edits2 word)) (set word))])
    (argmax (λ (x) (hash-ref nwords x 1)) (set->list candidates))))

 

And here is a more aesthetically pleasing format:

 

#lang racket

(define (words text)
  (regexp-match* #rx"[a-z]+" (string-downcase text)))

(define (train features)
  (define model (make-hash))
  (for ([f features])
    (hash-update! model f add1 1))
  model)

(define nwords
  (train (words (file->string "big.txt"))))

(define (edits1 word)
  (let* ([alphabet "abcdefghijklmnopqrstuvwxyz"]
         [splits (for/list ([i (in-range (+ (string-length word) 1))])
                   (cons (substring word 0 i) (substring word i)))]
         [deletes (for/set ([(a b) (in-dict splits)]
                            #:when (> (string-length b) 0))
                    (string-append a (substring b 1)))]
         [transposes (for/set ([(a b) (in-dict splits)]
                               #:when (> (string-length b) 1))
                       (string-append a
                                      (substring b 1 2)
                                      (substring b 0 1)
                                      (substring b 2)))]
         [replaces (for/set ([(a b) (in-dict splits)]
                             #:when (> (string-length b) 0)
                             [c alphabet])
                     (string-append a (string c) (substring b 1)))]
         [inserts (for*/set ([(a b) (in-dict splits)]
                             [c alphabet])
                    (string-append a (string c) b))])
    (set-union deletes transposes replaces inserts)))

(define (known-edits2 word)
  (for*/set ([e1 (edits1 word)]
             [e2 (edits1 e1)]
             #:when (hash-has-key? nwords e2))
    e2))

(define (known words)
  (for/set ([w words] #:when (hash-has-key? nwords w))
    w))

(define (nes set)
  (if (set-empty? set)
      #f
      set))

(define (correct word)
  (let ([candidates (or (nes (known (list word)))
                        (nes (known (edits1 word)))
                        (nes (known-edits2 word))
                        (set word))])
    (argmax (λ (x) (hash-ref nwords x 1)) (set->list candidates))))

Written by Brian Adkins

October 16, 2015 at 12:36 am

Posted in programming

Tagged with

DSL Embedding in Racket

leave a comment »

Here’s a great, two part, video by Matthew Flatt about embedding DSLs in Racket. Being able to hack the language is one of Racket’s/Lisp’s killer features:

Written by Brian Adkins

October 15, 2015 at 6:39 pm

Posted in programming

Tagged with ,

Programming Language Popularity – Part Nine

with one comment

I made a number of Google searches of the forms below and summed the results:

"implemented in <language>"
"written in <language>"
"developed in <language>"
"programmed in <language>"

See Part Eight for prior results 17 months ago. I finally got around to writing a program (in Racket) to automate the collection of the search results, so it was much easier!

I’ve divided the table into sections based on percentage increases of more than 50% from one language to the next. To compute the rank delta, I excluded the newly added languages.

|------+-----------------+------------+-------+-------|
| Rank | Language        | # Search   | Prev. |  Rank |
|      |                 | Results    |  Rank | Delta |
|------+-----------------+------------+-------+-------|
|    1 | C               | 54,696,000 |     2 |     1 |
|------+-----------------+------------+-------+-------|
|    2 | R               | 22,784,600 |    12 |    10 |
|------+-----------------+------------+-------+-------|
|    3 | Java            |  1,490,000 |     6 |     3 |
|    4 | C++             |  1,256,600 |     3 |    -1 |
|    5 | PHP             |  1,082,000 |     1 |    -4 |
|    6 | C#              |  1,065,900 |     5 |    -1 |
|    7 | Python          |  1,016,500 |     4 |    -3 |
|    8 | Lisp Family (1) |    951,300 |    11 |     3 |
|    9 | JavaScript      |    764,100 |     8 |    -1 |
|   10 | FORTRAN         |    565,000 |     9 |    -1 |
|   11 | Perl            |    541,700 |     7 |    -4 |
|   12 | Ruby            |    509,900 |    10 |    -2 |
|   13 | ML Family (2)   |    445,716 |    13 |       |
|   14 | Scheme          |    381,200 |    19 |     5 |
|   15 | Go              |    378,200 |    16 |     1 |
|   16 | (S)ML (3)       |    376,116 |    21 |     5 |
|   17 | Lisp            |    357,900 |    15 |    -2 |
|------+-----------------+------------+-------+-------|
|   18 | Scala           |    236,770 |    23 |     5 |
|   19 | Haskell         |    220,580 |    17 |    -2 |
|   20 | Prolog          |    152,450 |    20 |       |
|   21 | Lua             |    133,360 |    22 |     1 |
|   22 | COBOL           |    113,700 |    14 |    -8 |
|   23 | Erlang          |    101,840 |    18 |    -5 |
|   24 | Common Lisp     |    100,580 |    25 |     1 |
|   25 | Rust            |     90,860 |   N/A |   N/A |
|   26 | Clojure         |     85,930 |    27 |     2 |
|   27 | Smalltalk       |     75,000 |    24 |    -2 |
|   28 | OCaml           |     69,600 |    26 |    -1 |
|   29 | Coffeescript    |     60,420 |    29 |     1 |
|   30 | Julia           |     51,315 |   N/A |   N/A |
|   31 | Forth           |     50,920 |    28 |    -1 |
|------+-----------------+------------+-------+-------|
|   32 | Racket          |     25,660 |    30 |       |
|------+-----------------+------------+-------+-------|
|   33 | Elixir          |      8,112 |   N/A |   N/A |
|------+-----------------+------------+-------+-------|

Update Oct. 6: I added Bing search results, and they’re significantly different. For a while Elixir was #2 because one of the four phrases resulted in no results, so Bing automatically switched to an unquoted search which skewed the results. I caught that and looked over the raw data and didn’t see anything similar for other languages. It’s interesting that C# came out #1 on Bing 🙂

|------+-----------------+----------|
| Rank | Language        | # Search |
|      |                 |  Results |
|------+-----------------+----------|
|    1 | C#              | 32100000 |
|    2 | C++             | 31900000 |
|------+-----------------+----------|
|    3 | C               |  4845000 |
|    4 | Java            |  3319000 |
|------+-----------------+----------|
|    5 | R               |  1219310 |
|    6 | Python          |  1123700 |
|    7 | Php             |  1093800 |
|    8 | Javascript      |   945200 |
|------+-----------------+----------|
|    9 | Lisp Family (1) |   525530 |
|   10 | Perl            |   460360 |
|   11 | Common Lisp     |   398039 |
|   12 | Ruby            |   301740 |
|   13 | Go              |   270593 |
|   14 | FORTRAN         |   245500 |
|   15 | Lua             |   214200 |
|   16 | Scala           |   188570 |
|   17 | Haskell         |   188520 |
|   18 | Lisp            |   160930 |
|   19 | COBOL           |   140910 |
|   20 | Scheme          |    98012 |
|   21 | ML Family (2)   |    97425 |
|   22 | Prolog          |    75740 |
|   23 | Erlang          |    63270 |
|   24 | (S)ML (3)       |    56375 |
|   25 | OCaml           |    41050 |
|   26 | Smalltalk       |    27960 |
|   27 | Julia           |    25319 |
|   28 | Clojure         |    23091 |
|   29 | Coffeescript    |    20322 |
|   30 | Forth           |    17529 |
|   31 | Rust            |    17086 |
|------+-----------------+----------|
|   32 | Racket          |     6388 |
|------+-----------------+----------|
|   33 | Elixir          |     2542 |
|------+-----------------+----------|

Here is the delta difference between Google and Bing:

|--------+------+-------+-----------------|
| Google | Bing | Delta | Language        |
|   Rank | Rank | G - B |                 |
|--------+------+-------+-----------------|
|      1 |    3 |    -2 | C               |
|      2 |    5 |    -3 | R               |
|      3 |    4 |    -1 | Java            |
|      4 |    2 |     2 | C++             |
|      5 |    7 |    -2 | Php             |
|      6 |    1 |     5 | C#              |
|      7 |    6 |     1 | Python          |
|      8 |    9 |    -1 | Lisp Family (1) |
|      9 |    8 |     1 | Javascript      |
|     10 |   14 |    -4 | FORTRAN         |
|     11 |   10 |     1 | Perl            |
|     12 |   12 |       | Ruby            |
|     13 |   21 |    -8 | ML Family (2)   |
|     14 |   20 |    -6 | Scheme          |
|     15 |   13 |     2 | Go              |
|     16 |   24 |    -8 | (S)ML (3)       |
|     17 |   18 |    -1 | Lisp            |
|     18 |   16 |     2 | Scala           |
|     19 |   17 |     2 | Haskell         |
|     20 |   22 |    -2 | Prolog          |
|     21 |   15 |     6 | Lua             |
|     22 |   19 |     3 | COBOL           |
|     23 |   23 |       | Erlang          |
|     24 |   11 |    13 | Common Lisp     |
|     25 |   31 |    -6 | Rust            |
|     26 |   28 |    -2 | Clojure         |
|     27 |   26 |     1 | Smalltalk       |
|     28 |   25 |     3 | OCaml           |
|     29 |   29 |       | Coffeescript    |
|     30 |   27 |     3 | Julia           |
|     31 |   30 |     1 | Forth           |
|     32 |   32 |       | Racket          |
|     33 |   33 |       | Elixir          |
|--------+------+-------+-----------------|

(1) combines Lisp, Common Lisp, Scheme, Clojure & Racket
(2) combines (S)ML & OCaml
(3) summed separate searches for standard ml, sml & ml

Written by Brian Adkins

October 6, 2015 at 3:53 am