Discussion:
From JoyceUlysses.txt -- words occurring exactly once
HenHanna
2024-05-30 20:09:39 UTC
i'd not use Gauche for this, but maybe someone can change my mind.


_______________________
From JoyceUlysses.txt -- words occurring exactly once


Given a text file of a novel (JoyceUlysses.txt) ...

could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?

-- Also, a list of words occurring once, twice or 3 times



re: hyphenated words (you can treat it anyway you like)

ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.
Jeff Barnett
2024-05-30 22:33:30 UTC
Post by HenHanna
i'd not use Gauche for this, but maybe someone can change my mind.
_______________________
From JoyceUlysses.txt -- words occurring exactly once
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?
              -- Also, a list of words occurring once, twice or 3 times
re: hyphenated words        (you can treat it anyway you like)
       ideally, i'd treat  [editor-in-chief]
                           [go-ahead]  [pen-knife]
                           [know-how]  [far-fetched] ...
       as one unit.
Make a list (or array) of the individual words (as strings or symbols in
a special package) of the original document then sort the list using the
Lisp-supplied sort function. You then write a loop using your favorite
tools and look for interior sequences of the required length. This gives
you a program that is asymptotically efficient as the theoretical
run-time will look something like (* c N (log N)), where N is the length
of the list produced by the first step and c is some constant.
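
A minimal sketch of the sort-then-scan idea above, written in shell with awk rather than Lisp (a swapped-in language, not what the post proposes). The function name words_up_to and its MAX argument are hypothetical, and a "word" is assumed to be a run of ASCII letters and hyphens. The sort step dominates the cost; the scan is a single linear pass over the sorted lines.

```shell
# Hypothetical helper (not from the thread): print each word whose run in the
# sorted list is at most MAX lines long. Reads text on stdin; MAX defaults to 1.
words_up_to() {
    max="${1:-1}"
    # Stages: one token per line, drop empty lines, sort so that equal words
    # become adjacent, then scan the sorted list for short runs.
    tr -cs 'A-Za-z-' '\n' |
    grep . |
    LC_ALL=C sort |
    awk -v max="$max" '
        # A new word starts: report the previous run if it was short enough.
        $0 != prev { if (NR > 1 && run <= max) print prev; prev = $0; run = 0 }
        { run++ }
        END { if (NR > 0 && run <= max) print prev }
    '
}
```

Something like `words_up_to 3 < JoyceUlysses.txt` would then list the words occurring once, twice or 3 times. Case is not folded here, so "The" and "the" count as different words.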

Note, any solution resembling this one is not really what you want. For
example it would think "Snark" and "Snarks" are different words. Some
differences such as capitalization can be suppressed by choosing a sort
predicate that is case insensitive. You can, of course, write your own
sort predicate. The thing to note is that the predicate (the <= operator
used by sort) will not access the words or maintain state between
invocations; otherwise, the complexity can become arbitra
Stefan Monnier
2024-05-30 22:45:00 UTC
Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?
tr ' .;:,?!' '\n' | sort | uniq -u

?


- Stefan
Kaz Kylheku
2024-05-30 23:20:08 UTC
Post by Stefan Monnier
Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?
tr ' .;:,?!' '\n' | sort | uniq -u
Yep, that's pretty much how Doug McIlroy famously shut down Knuth.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Madhu
2024-06-08 16:47:18 UTC
Post by Kaz Kylheku
Post by Stefan Monnier
Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?
tr ' .;:,?!' '\n' | sort | uniq -u
Yep, that's pretty much how Doug McIlroy famously shut down Knuth.
https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf

(how do you cite this?)

Knuth didn't invent the "hash trie" data structure for this article;
it was already there in TeX. In this article Knuth credits Frank Liang's
PhD thesis for the data structure.

This was one of the first things I coded up at the time of the
article. The fun was in designing how to best modify the structure
without sacrificing space.

Phil Bagwell's paper "Ideal Hash Trees" described its invention
correctly as Hash Array Mapped Tries. However, at some point (probably
coming from Clojure developers with "functional" pretensions?)
the term "hash trie" was appropriated to mean something else,
something "immutable" and all that.

At least there isn't a wiki page for it.
steve g
2024-08-11 22:34:45 UTC
< > On 2024-05-30, Stefan Monnier <***@iro.umontreal.ca> wrote:
< >>> Given a text file of a novel (JoyceUlysses.txt) ...
< >>> could someone give me a pretty fast (and simple) program that'd give me
< >>> a list of all words occurring exactly once?
< >>
< >> tr ' .;:,?!' '\n' | sort | uniq -u
< >
< > Yep, that's pretty much how Doug McIlroy famously shut down Knuth.
Post by Madhu
https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf
(how do you cite this?)
you would think the university would have figured this out.

http://www.cs.tufts.edu/comp/250NN/neuralfaq.html
https://www-cs-faculty.stanford.edu/~knuth/lp.html

like I said before, an FTP server can be useful. imagine having to do
your assignments with a web browser or even worse: email. poor children.
Paul Rubin
2024-05-31 07:40:59 UTC
Post by HenHanna
could someone give me a pretty fast (and simple) program that'd give
me a list of all words occurring exactly once?
To a first approximation, this works for me (bash command):

tr -c "[a-zA-Z-]" "\n" < ulysses.txt |sort|uniq -c|sort -n
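
The pipeline above prints every word with its count, rarest first. A possible refinement, offered only as a sketch, keeps just the words seen at most a given number of times, folds case, and trims stray hyphens so that compounds like go-ahead stay whole. The function name rare_words and its MAX argument are hypothetical; treating a run of two or more hyphens as a dash between words is an assumption about the text.

```shell
# Hypothetical helper (not from the thread): print "count word" for every word
# seen at most MAX times. Reads text on stdin; MAX defaults to 3.
rare_words() {
    max="${1:-3}"
    # Stages: turn runs of 2+ hyphens (dashes) into spaces, put one token per
    # line, fold case, trim leading/trailing hyphens, drop empty lines,
    # count, keep the rare words, and order by count and then by word.
    sed 's/---*/ /g' |
    tr -c 'A-Za-z-' '\n' |
    tr 'A-Z' 'a-z' |
    sed -e 's/^-*//' -e 's/-*$//' |
    grep . |
    LC_ALL=C sort | uniq -c |
    awk -v max="$max" '$1 <= max { print $1, $2 }' |
    LC_ALL=C sort -k1,1n -k2
}
```

With this, `rare_words 1 < ulysses.txt` would give the words occurring exactly once, and `rare_words 3 < ulysses.txt` those occurring once, twice or 3 times.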
B. Pym
2024-05-31 10:13:50 UTC
Post by HenHanna
i'd not use Gauche for this, but maybe someone can change my mind.
_______________________
From JoyceUlysses.txt -- words occurring exactly once
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me a list of all words occurring exactly once?
-- Also, a list of words occurring once, twice or 3 times
re: hyphenated words (you can treat it anyway you like)
ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.
Gauche Scheme

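(use file.util) ;; file->string
(use srfi-13)   ;; string-tokenize
(use srfi-14)   ;; character sets (char-set:letter, char-set-adjoin)

;; Word (uppercased) -> number of occurrences.
(define h (make-hash-table 'string=?))

;; A token is a maximal run of letters and hyphens.
(dolist
    (s
     (string-tokenize (file->string "Alice.txt")
                      (char-set-adjoin char-set:letter #\-)))
  (hash-table-update! h
    ;; Fold case and strip leading/trailing hyphens before counting.
    (regexp-replace* (string-upcase s) #/^-+/ "" #/-+$/ "")
    (pa$ + 1) 0))

;; Keep the words seen once or twice.
(filter (lambda (kv) (< (cdr kv) 3))
        (hash-table->alist h))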

===>

(("LASTED" . 2) ("WAY--NEVER" . 1) ("VISIT" . 1) ("CHANCED" . 1)
("WILDLY" . 2) ("BEHEAD" . 1) ("PROMISE" . 1) ("MEANWHILE" . 1)
("ENGAGED" . 1) ("KNIFE" . 2) ("ROARED" . 1) ("RETIRE" . 1)
("BLACKING" . 1) ("HATED" . 1) ("BRIGHT-EYED" . 1)
("SHEEP-BELLS" . 1) ("PROTECTION" . 1) ("CRIES" . 1) ("ADA" . 1)
("ENJOY" . 1) ("WRITHING" . 1) ("RAW" . 1) ("APPEALED" . 1)
("RELIEVED" . 1) ("CHILDHOOD" . 1) ("WEPT" . 1) ("RACE-COURSE" . 1)
("THEIRS" . 1) ("MAD--AT" . 1) ("SPOKEN" . 1) ("PENCILS" . 1)
("CLEAR" . 2) ("TREADING" . 2) ("RETURNED" . 2) ("CHERRY-TART" . 1)
("UNEASY" . 1) ("LOW-SPIRITED" . 1) ("BONE" . 1) ("PROMISED" . 1)
("HAPPENING" . 1) ("OYSTER" . 1) ("PATIENTLY" . 2) ("NEEDS" . 1)
("LESSON-BOOK" . 1) ("PITIED" . 1) ("UNCOMFORTABLY" . 1)
("ANTIPATHIES" . 1) ("PICTURED" . 1) ("DESPERATE" . 1)
("ENGRAVED" . 1)
...
)