Several people have asked me about Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words", Nature Scientific Reports 12/10/2012. The abstract (emphasis added):
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature.
The paper is thought-provoking, and the conclusions definitely merit further exploration. But I feel that the paper as published is guilty of false advertising. As the emphasized material in the abstract indicates, the paper claims to be about the frequency distributions of words in the vocabulary of English and other natural languages. In fact, I'm afraid, it's actually about the frequency distributions of strings in Google's 2009 OCR of printed books — and this, alas, is not the same thing at all.
It's possible that the paper's conclusions also hold for the distributions of words in English and other languages, but it's far from clear that this is true. At a minimum, the paper's quantitative results clearly will not hold for anything that a linguist, lexicographer, or psychologist would want to call "words". Whether the qualitative results hold or not remains to be seen.
In the 20090715 version of the Google dataset that the paper relies on for its analysis of English, there are about 360 billion string tokens in total, representing 7.4 million distinct string types. Of these, 292 billion string tokens are all-alphabetic (81.2%), representing 6.1 million string types (83%). 66 billion string tokens are non-alphabetic (18.3%), representing 560 thousand string types (7.6%). And 1.8 billion string tokens are mixed alphabetic and non-alphabetic (0.5%), representing 692,713 types (9.4%). More exactly:
                          Tokens       Types
All              359,675,008,445   7,380,256
Alphabetic       292,060,845,596   6,127,099
Non-alphabetic    65,799,990,649     560,444
Mixed              1,814,172,200     692,713
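For anyone who'd like to check figures like these, the tally is easy to reproduce. Here's a minimal sketch in Python, assuming an aggregated histogram with one "count string" pair per line (the format of the file linked in the Note at the end); the file name, and the exact character classes that count as "alphabetic", are my assumptions rather than anything specified in the paper:

```python
import re

# "Alphabetic" here means all-ASCII-letter; "non-alphabetic" means no letters
# at all. Whether apostrophes, accented letters, etc. should count is a guess.
ALPHA = re.compile(r'^[A-Za-z]+$')
NON_ALPHA = re.compile(r'^[^A-Za-z]+$')

tokens = {'alphabetic': 0, 'non-alphabetic': 0, 'mixed': 0}
types = {'alphabetic': 0, 'non-alphabetic': 0, 'mixed': 0}

# Assumed format: one "count string" pair per line (hypothetical file name).
with open('eng-all-1gram-20090715-histogram.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split(None, 1)
        if len(parts) != 2:
            continue
        count, s = int(parts[0]), parts[1]
        if ALPHA.match(s):
            cls = 'alphabetic'
        elif NON_ALPHA.match(s):
            cls = 'non-alphabetic'
        else:
            cls = 'mixed'
        tokens[cls] += count
        types[cls] += 1

for cls in tokens:
    print(cls, tokens[cls], 'tokens,', types[cls], 'types')
```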
Why have I referred to "string tokens" and "string types" rather than to "words"? Well, here's a random draw of 10 from that set of 7,380,256 string types (preceded in each case by the count of occurrences of that string type in the collection):
116856 discouragements
2485 NH4CI
42 attorniea
425 PEPPERED
51 prettyboys
191 iillll
506 Mecir
68 inkdrop
3926 LATTIMORE
18631 cart's
Here's another:
174 Pfii
51 Lnodicea
126 almofb
82 0egree
672 Garibaldina
47 excllence
5693 Eoosevelt
118 Dypwick
83 opinion19
65 VVouldst
And another:
55 txtx
218 suniving
91 fn_
48 Kultursysteme
54 notexpected
137 handsoap
46 tornarmene
9551 Rohault
48 Blrnstlngl
150 1037C
As samples of English "words", these are not very persuasive. About half of them are typographical or OCR errors, and of those that are not, many are regularly-derived forms of other words ("cart's", "discouragements") or strings involving digits ("1037C"). The fact that Google's collection of such strings exhibits certain statistical properties is interesting, but it's not clear that it's telling us anything much about the English language, rather than about typographical practices and the state of Google's OCR as of 2009.
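If you'd like to make spot-checks of this kind yourself, a few lines of Python will do it. Here's a sketch, under the same assumptions about the histogram file as in the previous snippet; the draw is uniform over string types, not weighted by token count:

```python
import random

# Read the aggregated histogram (hypothetical file name; format assumed to be
# one "count string" pair per line, as in the draws shown above).
with open('eng-all-1gram-20090715-histogram.txt', encoding='utf-8') as f:
    entries = [line.rstrip('\n').split(None, 1) for line in f]
entries = [e for e in entries if len(e) == 2]   # skip any malformed lines

# Draw 10 string types uniformly at random and show them with their counts.
for count, s in random.sample(entries, 10):
    print(count, s)
```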
Though several of the key results in their paper deal with the full dataset — numbers, OCR errors, and all — the authors do recognize this problem, while (in my opinion) seriously underplaying it:
The word frequency distribution for the rarely used words constituting the “unlimited lexicon” obeys a distinct scaling law, suggesting that rare words belong to a distinct class. This “unlimited lexicon” is populated by highly technical words, new words, numbers, spelling variants of kernel words, and optical character recognition (OCR) errors.
In fact, the order in which they give these categories is rather the reverse of the truth: "highly technical words [and] new words" are radically less common in this dataset than "numbers, spelling variants of kernel words, and OCR errors". Or to put it another way, the majority of the "words" in this list are not words of English at all — not "highly technical words", not "new words", not any kind of words.
The authors propose to remedy the problem this way:
We introduce a pruning method to study the role of infrequent words on the allometric scaling properties of language. By studying progressively smaller sets of the kernel lexicon we can better understand the marginal utility of the core words.
They set their word-count threshold to successively larger powers of two, ending with a threshold of 2^15 = 32,768 as defining the "kernel" or "core" lexicon of English. This reduces the set to 353 billion tokens representing 143 thousand types, of which 286 billion tokens (81%) representing 132 thousand types (92%) are all-alphabetic, 65 billion tokens (18.5%) representing 5 thousand types (4%) are non-alphabetic, and 1.3 billion tokens (0.4%) representing 6 thousand types (4%) are mixed.
                                  Tokens     Types
All ("core")             353,131,491,190   142,700
Alphabetic ("core")      286,352,289,855   131,598
Non-alphabetic ("core")   65,465,833,242     5,360
Mixed ("core")             1,313,368,093     5,742
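The pruning step itself is trivial to reproduce. Here's a sketch of the final cut at 2^15 = 32,768, under the same assumptions about the histogram file as in the earlier snippets:

```python
THRESHOLD = 2 ** 15   # 32,768, the paper's final "kernel"/"core" cutoff

core_types = 0
core_tokens = 0
# Same assumed one-"count string"-pair-per-line histogram as before.
with open('eng-all-1gram-20090715-histogram.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        count = int(parts[0])
        if count >= THRESHOLD:
            core_types += 1
            core_tokens += count

print(core_types, 'core string types;', core_tokens, 'core tokens')
```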
Do we now have a set representing the vocabulary of English? It's certainly better, as this random draw of 20 suggests:
103971 tardive
34879 Lichnowsky
40031 punctiliously
40543 recumbency
85620 Lupton
50627 GLP
156373 fertilize
54410 Niu
139924 ogre
38535 Burnett's
108385 chymotrypsin
70076 rigueur
680293 staunch
56995 56.6
467320 Habsburg
57726 populists
41164 occu
47133 Scapula
42483 Buhl
242641 Olmsted
But there are still some issues. We still have about 8% (by types) and 19% (by tokens) of numbers, punctuation, and so on. And about half of the higher-frequency string-types in the "kernel" lexicon are proper names; these are certainly part of the language, but it seems likely that their dynamics are quite different from those of the rest of the vocabulary.
And now that most of the egregious typos and OCR errors are out of the way, we need to consider the issue of variant capitalization and regular inflection. Here are the dataset's 22 diverse capitalizations of the inflected forms of the word copy, arranged in increasing order of overall frequency (with 9 frequent enough to make it into the "core" lexicon):
46 copyIng
56 COPiES
83 cOpy
99 CoPY
107 copY
144 COPYing
222 coPy
280 COPy
367 COpy
435 CoPy
484 CopY
4601 COPIED
14651 COPYING
50412 Copied
54338 COPIES
194846 COPY
374302 Copying
545386 Copies
1143116 Copy
2633786 copied
7316913 copies
13920809 copy
Here are the 23 variants of the word break, with 13 in the "core" lexicon:
51 breaKs
73 breaKing
82 BreaK
88 BReak
187 broKe
276 breakIng
356 broKen
512 breaK
14875 BROKE
21849 BREAKS
54462 BREAKING
74913 BROKEN
103631 BREAK
132482 Breaks
149392 Broke
560479 Breaking
629544 Break
740875 Broken
4564785 breaks
8538111 breaking
12330476 broke
19325762 break
19617163 broken
Or the 14 forms of succeed, of which 7 make it into the "core" lexicon:
46 SUCCeed
51 SUCceeded
4897 SUCCEEDING
6800 SUCCEEDS
7026 SUCCEEDED
12619 SUCCEED
22154 Succeeds
43856 Succeeded
67645 Succeed
88912 Succeeding
1556533 succeeds
3466786 succeeding
6919764 succeed
13166229 succeeded
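Lists like these can be pulled out of the histogram by matching case-folded strings against a small set of inflected forms. Here's a sketch, with the forms supplied by hand and the same assumptions about the file format as above:

```python
# Collect all capitalization variants of the inflected forms of "succeed"
# (hypothetical file name; one "count string" pair per line assumed).
FORMS = {'succeed', 'succeeds', 'succeeded', 'succeeding'}

variants = []
with open('eng-all-1gram-20090715-histogram.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split(None, 1)
        if len(parts) == 2 and parts[1].lower() in FORMS:
            variants.append((int(parts[0]), parts[1]))

for count, s in sorted(variants):   # increasing order of overall frequency
    print(count, s)
```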
Increasing the threshold successively prunes these lists, obviously. Monocasing everything reduces the number of all-alphabetic types to 4,472,529 in the "unlimited lexicon" (a reduction of 39.4% from the full type count of 7,380,256), and to 98,087 in the "core lexicon" of items that occur at least 32,768 times (a reduction of 31.3% from the full type count of 142,700). Limiting the list to all-lower-case alphabetic words (which will tend to decrease the number of proper nouns) reduces the "unlimited lexicon" to 2,642,277 items (down 64.2%), and the "core lexicon" to 67,816 (down 52.5%). But it's not clear which properties of these successively reduced distributions are due to the nature of the English language, and which are due to the interaction of typographical practice, OCR performance, and sampling effects.
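The case-folding and lower-casing reductions are also easy to reproduce, at least approximately. Here's a sketch, with the caveat that I'm assuming the 32,768 threshold is applied to the merged counts; applying it before folding would give somewhat different numbers:

```python
import re
from collections import defaultdict

ALPHA = re.compile(r'^[A-Za-z]+$')
THRESHOLD = 2 ** 15

folded = defaultdict(int)   # counts summed over all case variants
lower = defaultdict(int)    # all-lower-case forms only

# Same assumed "count string" histogram format as in the earlier sketches.
with open('eng-all-1gram-20090715-histogram.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split(None, 1)
        if len(parts) != 2 or not ALPHA.match(parts[1]):
            continue
        count, s = int(parts[0]), parts[1]
        folded[s.lower()] += count
        if s.islower():
            lower[s] += count

# Threshold applied after merging case variants (an assumption on my part).
print('monocased:', len(folded), 'types;',
      sum(c >= THRESHOLD for c in folded.values()), '"core" types')
print('lower-case only:', len(lower), 'types;',
      sum(c >= THRESHOLD for c in lower.values()), '"core" types')
```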
And even if we were to limit our attention to alphabetic words, to fold or screen out capital letters, and to restrict attention to items above some frequency threshold, we'd still be left to unravel the (sampled) distribution across time of the various inflected forms of words, as compared to the (sampled) distribution across time of different words. Or the development (and its reflection in print) of derived and compound words, or of proper nouns, which probably also differs from that of novel coinages or borrowings.
So to sum up: The authors' conclusions about the "unlimited lexicon" should be seen as conclusions about the sets of strings that result from typographical practices, Google's OCR performance as of 2009, and Google's tokenization algorithms. Even their conclusions about the "kernel" or "core" lexicon are heavily influenced, and perhaps dominated, by the distribution of proper names and by variant capitalization, as well as by the residue of the issues affecting the "unlimited" lexicon. Questions about the influence of inflectional and derivational variants also remain to be addressed.
Given these problems, the quantitative results cannot be trusted to tell us anything about the nature and growth of natural-language vocabulary. And the qualitative results need to be checked against a much more careful preparation of the underlying data.
It's too bad that the authors (who are all physicists or economists) didn't consult any computational linguists or others with experience in text analysis, and that Nature's reviewers apparently didn't include anyone in this category either.
Note: The underlying data is available here. For convenience, if you'd like to try some alternative models on various subsets or transformations of the "unlimited lexicon", here is a (90-MB) histogram of string-types with their counts from the files
googlebooks-eng-all-1gram-20090715-0.csv
googlebooks-eng-all-1gram-20090715-1.csv
googlebooks-eng-all-1gram-20090715-2.csv
googlebooks-eng-all-1gram-20090715-3.csv
googlebooks-eng-all-1gram-20090715-4.csv
googlebooks-eng-all-1gram-20090715-5.csv
googlebooks-eng-all-1gram-20090715-6.csv
googlebooks-eng-all-1gram-20090715-7.csv
googlebooks-eng-all-1gram-20090715-8.csv
googlebooks-eng-all-1gram-20090715-9.csv
According to the cited paper, the authors accessed the data on 14 January 2011, which means that this is the version they worked from for English. A newer and larger English version (Version 20120701) is now available — at some point I'll post about the properties of the new dataset…