
Language Log: Word String frequency distributions

Several people have asked me about Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words", Nature Scientific Reports 12/10/2012. The abstract (emphasis added):

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This "cooling pattern" forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature.

The paper is thought-provoking, and the conclusions definitely merit further exploration. But I feel that the paper as published is guilty of false advertising. As the emphasized material in the abstract indicates, the paper claims to be about the frequency distributions of words in the vocabulary of English and other natural languages. In fact, I'm afraid, it's actually about the frequency distributions of strings in Google's 2009 OCR of printed books — and this, alas, is not the same thing at all.

It's possible that the paper's conclusions also hold for the distributions of words in English and other languages, but it's far from clear that this is true. At a minimum, the paper's quantitative results clearly will not hold for anything that a linguist, lexicographer, or psychologist would want to call "words". Whether the qualitative results hold or not remains to be seen.

In the 20090715 version of the Google dataset that the paper relies on for its analysis of English, there are about 360 billion string tokens in total, representing 7.4 million distinct string types. Of these, 292 billion string tokens are all-alphabetic (81.2%), representing 6.1 million string types (83%); 66 billion string tokens are non-alphabetic (18.3%), representing 560 thousand string types (7.6%); and 1.8 billion string tokens are mixed alphabetic and non-alphabetic (0.5%), representing 692,713 string types (9.4%). More exactly (a sketch of this three-way classification follows the samples below):

                    Tokens           Types
All                 359,675,008,445  7,380,256
Alphabetic          292,060,845,596  6,127,099
Non-alphabetic       65,799,990,649    560,444
Mixed                 1,814,172,200    692,713

Why have I referred to "string tokens" and "string types" rather than to "words"? Well, here's a random draw of 10 from that set of 7,380,256 string types (preceded in each case by the count of occurrences of that string type in the collection):

116856 discouragements
2485 NH4CI
42 attorniea
425 PEPPERED
51 prettyboys
191 iillll
506 Mecir
68 inkdrop
3926 LATTIMORE
18631 cart's

Here's another:

174 Pfii
51 Lnodicea
126 almofb
82 0egree
672 Garibaldina
47 excllence
5693 Eoosevelt
118 Dypwick
83 opinion19
65 VVouldst

And another:

55 txtx
218 suniving
91 fn_
48 Kultursysteme
54 notexpected
137 handsoap
46 tornarmene
9551 Rohault
48 Blrnstlngl
150 1037C

As samples of English "words", these are not very persuasive. About half of them are typographical or OCR errors, and of those that are not, many are regularly-derived forms of other words ("cart's", "discouragements") or numeric strings ("1037C").
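For readers who want to check or extend tallies like the ones in the table above, here is a minimal sketch of the three-way classification involved (not the script actually used for those numbers). It assumes a dict mapping each string type to its total token count, and it counts only A-Z/a-z as "alphabetic"; both the helper names and those assumptions are mine, not the paper's.

import re

ALPHA_RE = re.compile(r"[A-Za-z]")
NON_ALPHA_RE = re.compile(r"[^A-Za-z]")

def classify(s):
    """Label a string type as 'alphabetic', 'non-alphabetic', or 'mixed'."""
    has_alpha = bool(ALPHA_RE.search(s))
    has_other = bool(NON_ALPHA_RE.search(s))
    if has_alpha and not has_other:
        return "alphabetic"
    if has_other and not has_alpha:
        return "non-alphabetic"
    return "mixed"

def tally(histogram):
    """Given a dict mapping string type -> total count, return
    {class: (token_count, type_count)} for the three classes."""
    totals = {"alphabetic": [0, 0], "non-alphabetic": [0, 0], "mixed": [0, 0]}
    for s, count in histogram.items():
        cls = classify(s)
        totals[cls][0] += count   # tokens
        totals[cls][1] += 1       # types
    return {cls: tuple(pair) for cls, pair in totals.items()}

On this reading, anything containing at least one letter and at least one non-letter lands in "mixed", which is how strings like "1037C" and "opinion19" get counted.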
The fact that Google's collection of such strings exhibits certain statistical properties is interesting, but it's not clear that it's telling us anything much about the English language, rather than about typographical practices and the state of Google's OCR as of 2009.

Though several of the key results in their paper deal with the full dataset — numbers, OCR errors, and all — the authors do recognize this problem, while (in my opinion) seriously underplaying it:

The word frequency distribution for the rarely used words constituting the “unlimited lexicon” obeys a distinct scaling law, suggesting that rare words belong to a distinct class. This “unlimited lexicon” is populated by highly technical words, new words, numbers, spelling variants of kernel words, and optical character recognition (OCR) errors.

In fact, the order in which they give these categories is rather the reverse of the truth: "highly technical words [and] new words" are radically less common in this dataset than "numbers, spelling variants of kernel words, and OCR errors". Or to put it another way, the majority of the "words" in this list are not words of English at all — not "highly technical words", not "new words", not any kind of words.

The authors propose to remedy the problem this way:

We introduce a pruning method to study the role of infrequent words on the allometric scaling properties of language. By studying progressively smaller sets of the kernel lexicon we can better understand the marginal utility of the core words.

They set their word-count threshold to successively larger powers of two, ending with a threshold of 2^15 = 32768 as defining the "kernel" or "core" lexicon of English. This reduces the set to 353 billion tokens representing 143 thousand types, of which 286 billion tokens (81%) representing 132 thousand types (92%) are all-alphabetic, 65 billion tokens (18.6%) representing 5 thousand types (4%) are non-alphabetic, and 1.3 billion tokens (0.4%) representing 5 thousand types (4%) are mixed:

                         Tokens           Types
All ("core")             353,131,491,190  142,700
Alphabetic ("core")      286,352,289,855  131,598
Non-alphabetic ("core")   65,465,833,242    5,360
Mixed ("core")             1,313,368,093    5,742

Do we now have a set representing the vocabulary of English? It's certainly better, as this random draw of 20 suggests:

103971 tardive
34879 Lichnowsky
40031 punctiliously
40543 recumbency
85620 Lupton
50627 GLP
156373 fertilize
54410 Niu
139924 ogre
38535 Burnett's
108385 chymotrypsin
70076 rigueur
680293 staunch
56995 56.6
467320 Habsburg
57726 populists
41164 occu
47133 Scapula
42483 Buhl
242641 Olmsted

But there are still some issues. We still have about 8% (by types) and 19% (by tokens) of numbers, punctuation, and so on. And about half of the higher-frequency string types in the "kernel" lexicon are proper names — these are certainly part of the language, but it seems likely that their dynamics are quite different from those of the rest of the vocabulary.
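For anyone who wants to reproduce the pruning step itself, here is a minimal sketch (again, not the authors' code). It reuses the hypothetical type-to-count dict and the tally() helper from the earlier sketch, and the function name is mine.

def prune(histogram, threshold=2**15):
    """Keep only string types whose total count meets the cutoff (32768 by default)."""
    return {s: n for s, n in histogram.items() if n >= threshold}

# Reusing the hypothetical type->count dict and tally() from the earlier sketch:
#   core = prune(histogram)                  # the "kernel"/"core" lexicon
#   core_totals = tally(core)                # per-class token/type tallies
#   import random; random.sample(sorted(core), 20)   # a draw of 20 core types

Here the cutoff is applied to counts summed over the whole dataset, which is the reading behind the "core" table above.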
And now that most of the egregious typos and OCR errors are out of the way, we need to consider the issue of variant capitalization and regular inflection.

Here are the dataset's 22 diverse capitalizations of the inflected forms of the word copy, arranged in increasing order of overall frequency (with 9 frequent enough to make it into the "core" lexicon):

46 copyIng
56 COPiES
83 cOpy
99 CoPY
107 copY
144 COPYing
222 coPy
280 COPy
367 COpy
435 CoPy
484 CopY
4601 COPIED
14651 COPYING
50412 Copied
54338 COPIES
194846 COPY
374302 Copying
545386 Copies
1143116 Copy
2633786 copied
7316913 copies
13920809 copy

Here are the 23 variants of the word break, with 13 in the "core" lexicon:

51 breaKs
73 breaKing
82 BreaK
88 BReak
187 broKe
276 breakIng
356 broKen
512 breaK
14875 BROKE
21849 BREAKS
54462 BREAKING
74913 BROKEN
103631 BREAK
132482 Breaks
149392 Broke
560479 Breaking
629544 Break
740875 Broken
4564785 breaks
8538111 breaking
12330476 broke
19325762 break
19617163 broken

Or the 14 forms of succeed, of which 7 make it into the "core" lexicon:

46 SUCCeed
51 SUCceeded
4897 SUCCEEDING
6800 SUCCEEDS
7026 SUCCEEDED
12619 SUCCEED
22154 Succeeds
43856 Succeeded
67645 Succeed
88912 Succeeding
1556533 succeeds
3466786 succeeding
6919764 succeed
13166229 succeeded

Increasing the threshold successively prunes these lists, obviously. Monocasing everything reduces the number of all-alphabetic types to 4,472,529 in the "unlimited lexicon" (a reduction of 39.4% from the full type count of 7,380,256), and to 98,087 in the "core lexicon" of items that occur at least 32,768 times (a reduction of 31.3% from the full type count of 142,700). Limiting the list to all-lower-case alphabetic words (which will tend to decrease the number of proper nouns) reduces the "unlimited lexicon" to 2,642,277 items (down 64.2%), and the "core lexicon" to 67,816 (down 52.5%). (A sketch of these two reductions appears after the summary below.)

But it's not clear which properties of these successively reduced distributions are due to the nature of the English language, and which are due to the interaction of typographical practice, OCR performance, and sampling effects. And even if we were to limit our attention to alphabetic words, to fold or screen out capital letters, and to restrict item frequencies, we'd be left to unravel the (sampled) distribution across time of the various inflected forms of words, compared to the (sampled) distribution across time of different words. Or the development (and its reflection in print) of derived and compound words, or of proper nouns, which probably also differs from that of novel coinages or borrowings.

So to sum up: The authors' conclusions about the "unlimited lexicon" should be seen as conclusions about the sets of strings that result from typographical practices, Google's OCR performance as of 2009, and Google's tokenization algorithms. Even their conclusions about the "kernel" or "core" lexicon are heavily influenced, and perhaps dominated, by the distribution of proper names and by variant capitalization, as well as by the residue of the issues affecting the "unlimited" lexicon. Questions about the influence of inflectional and derivational variants also remain to be addressed.

Given these problems, the quantitative results cannot be trusted to tell us anything about the nature and growth of natural-language vocabulary. And the qualitative results need to be checked against a much more careful preparation of the underlying data. It's too bad that the authors (who are all physicists or economists) didn't consult any computational linguists or others with experience in text analysis, and that Nature's reviewers apparently didn't include anyone in this category either.
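Here, as promised, is a sketch of those two reductions (not the script used for the counts above). It assumes the same hypothetical type-to-count dict as the earlier sketches, and it again treats only ASCII letters as alphabetic; the function names are mine.

from collections import Counter

def monocase(histogram):
    """Fold case variants together: 'Copy', 'COPY', 'copy', ... become one type."""
    folded = Counter()
    for s, n in histogram.items():
        folded[s.lower()] += n
    return dict(folded)

def lowercase_alpha_only(histogram):
    """Keep only types consisting entirely of lower-case ASCII letters."""
    return {s: n for s, n in histogram.items()
            if s.isascii() and s.isalpha() and s == s.lower()}

Monocasing merges the counts of case variants, while the lower-case-only filter simply discards everything else, which is part of why the two reductions come out so differently.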
Note: The underlying data is available here. For convenience, if you'd like to try some alternative models on various subsets or transformations of the "unlimited lexicon", here is a (90-MB) histogram of string types with their counts, compiled from the files

googlebooks-eng-all-1gram-20090715-0.csv
googlebooks-eng-all-1gram-20090715-1.csv
googlebooks-eng-all-1gram-20090715-2.csv
googlebooks-eng-all-1gram-20090715-3.csv
googlebooks-eng-all-1gram-20090715-4.csv
googlebooks-eng-all-1gram-20090715-5.csv
googlebooks-eng-all-1gram-20090715-6.csv
googlebooks-eng-all-1gram-20090715-7.csv
googlebooks-eng-all-1gram-20090715-8.csv
googlebooks-eng-all-1gram-20090715-9.csv

According to the cited paper, the authors accessed the data on 14 January 2011, which means that this is the version they worked from for English. A newer and larger English version (Version 20120701) is now available — at some point I'll post about the properties of the new dataset…
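If you'd rather build such a histogram yourself from the raw 1-gram files, something like the following minimal sketch should do it. It assumes (my assumption; check it against the downloaded files) that each line holds tab-separated fields of the form ngram, year, match_count, page_count, volume_count, and that the files sit in the current directory; it sums match_count over all years for each string type.

from collections import Counter
import csv, glob

def build_histogram(pattern="googlebooks-eng-all-1gram-20090715-*.csv"):
    """Sum each 1-gram's match_count over all years, across all matching files."""
    histogram = Counter()
    for path in sorted(glob.glob(pattern)):
        # errors="replace" is a precaution in case of stray non-UTF-8 bytes
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            for row in csv.reader(f, delimiter="\t"):
                # row is assumed to be: ngram, year, match_count, page_count, volume_count
                histogram[row[0]] += int(row[2])
    return histogram

if __name__ == "__main__":
    hist = build_histogram()
    print(len(hist), "string types;", sum(hist.values()), "string tokens")

A Counter over roughly 7.4 million keys fits comfortably in memory; the slow part is just reading the files.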
