Language Log: "Clutter" in (writing about) science writing

Paul Jump, "Cut the Clutter", Times Higher Education:

    Is there something unforgivably, infuriatingly obfuscatory about the unrestrained use of adjectives and adverbs?

In a word, no. But Mr. Jump is about to tell us, approvingly, about some "science" on the subject:

    Zinsser and Twain are quoted by Adam Okulicz-Kozaryn, assistant professor of public policy at Rutgers University Camden, in support of his view that the greater the number of adjectives and adverbs in academic writing, the harder it is to read.

    Okulicz-Kozaryn has published a paper in the journal Scientometrics that analyzes adjectival and adverbial density in about 1,000 papers published between 2000 and 2010 from across the disciplines.

    Perhaps unsurprisingly, the paper, "Cluttered Writing: Adjectives and Adverbs in academia," finds that social science papers contain the highest density, followed by humanities and history. Natural science and mathematics contain the lowest frequency, followed by medicine and business and economics. The difference between the social and the natural sciences is about 15 percent.

    "Is there a reason that a social scientist cannot write as clearly as a natural scientist?" the paper asks.

I'm not going to discuss the neurotic aversion to modification. Instead, I'm going to explore Paul Jump's apparent ignorance of the norms of scientific communication and of standard English prose, and the much more surprising parallel failures of the editors of the Springer journal Scientometrics.

The first three sentences of "Cluttered writing: adjectives and adverbs in academia":

    Scientific writing is about communicating ideas. Today, simplicity is more important than ever. Scientist are overwhelmed with new information.

This scientist are indeed overwhelmed, but the first overwhelming thing about Okulicz-Kozaryn's paper is its profusion of grammatical errors, like "Readable scientific writing could reach wider audience and have a bigger impact outside of academia", or "Why measuring readability by counting adjectives and adverbs?" Dr. Okulicz-Kozaryn even manages to introduce a grammatical error into his quotation from William Zinsser:

    You will clutter your sentence and annoy the reader if choose a verb that has a specific meaning and then add an adverb that carries the same meaning.

Okulicz-Kozaryn is not a native speaker of English, so I'm not going to blame him, but you'd think that Springer could have hired a competent copy editor, especially given that they want to charge you $39.95 to read this (3-page-long) article.

But I'm even more overwhelmed by papers whose research methods are so badly documented that I can't determine whether my suspicions about elementary experimental artifacts are valid.

There are two good things about how Okulicz-Kozaryn got his adjective and adverb proportions. First, he used a somewhat reproducible source of data, namely JSTOR's Data For Research:

    I use data from JSTOR Data For Research [dfr.jstor.org] The sample is about 1,000 articles randomly selected from all articles published in each of seven academic fields between 2000 and 2010. I made the following selection from JSTOR:

    1. Content type: Journal (to analyze research, not the other option: Pamphlets)
    2. Page count: (5–100) (to avoid short letters, notes, and overly long essays; fewer than five pages may not offer enough to evaluate text, and longer than 100 may have a totally different style than the typical one for a given field)
    3. Article type: Research article (other types such as book reviews may contain lengthy quotes, etc)
    4. Language: English
    5. Year of Publication: (2000–2010) (only recent research; did not select 2011, 2012, since for some fields JSTOR does not offer most recent publications--the number of available articles in most recent years dramatically drops, based on a JSTOR graph available at the selection).

The dataset is only somewhat reproducible, since it represents a random selection made by JSTOR's site, and Okulicz-Kozaryn doesn't give us the selection he got. More important, he doesn't tell us what kind of data he actually got, and how he processed it.

The JSTOR DFR service does not provide full text, but rather offers overall word counts along with a choice among bigrams, trigrams, and quadgrams. And many tokens are indicated in these lists only in terms of the number of characters they contain. Thus for one scientific article whose quadgram data I got from this service, the top of the frequency list was:

    <quadgram weight="212" > ## ## ## ## </quadgram>
    <quadgram weight="38" > ### ### ### ### </quadgram>
    <quadgram weight="29" > ### et al # </quadgram>
    <quadgram weight="21" > ### ## ## ## </quadgram>
    <quadgram weight="20" > ## ## ## ### </quadgram>
    <quadgram weight="16" > ### ### ## ## </quadgram>
    <quadgram weight="14" > ## ## ### ### </quadgram>
    <quadgram weight="13" > ## ### ### ## </quadgram>
    <quadgram weight="13" > ### ### ## ### </quadgram>
    <quadgram weight="11" > of density dependence in </quadgram>
    <quadgram weight="11" > rate of population increase </quadgram>
    <quadgram weight="10" > ### ## ### ### </quadgram>
    <quadgram weight="10" > density dependence in the </quadgram>

In an article from the humanities, the top of the quadgram list looked like this:

    <quadgram weight="8" > ### ### ### ### </quadgram>
    <quadgram weight="3" > scraps of ### chilean </quadgram>
    <quadgram weight="3" > of ### chilean arpilleras </quadgram>
    <quadgram weight="2" > is a tale of </quadgram>
    <quadgram weight="2" > of the ### # </quadgram>
    <quadgram weight="2" > ### ### and ### </quadgram>
    <quadgram weight="2" > it is a tale </quadgram>
    <quadgram weight="2" > of ### ### ### </quadgram>
    <quadgram weight="1" > food ### lack of </quadgram>
    <quadgram weight="1" > # ### distinguishing these </quadgram>
    <quadgram weight="1" > hearts and minds of </quadgram>
    <quadgram weight="1" > disappeared ### hope for </quadgram>
    <quadgram weight="1" > women the usual treatment </quadgram>

I presume that the items indicated with variable-length octothorp sequences are numbers, mathematical symbols, proper names, OCR errors, and other things that JSTOR's algorithms did not recognize as "words". (I couldn't find any documentation on the DFR website about this.)

But Okulicz-Kozaryn doesn't tell us what he did with these all-# tokens. Did he just use JSTOR's word counts, without subtracting the count of #-tokens? If so, then it's not surprising that the proportion of adjectives and adverbs would be lower in disciplines where the #-token count was higher: presumably all part-of-speech categories had smaller proportions in such texts.

Or were the #-words subtracted from the word counts as well as left out of the part-of-speech tagging? If so, then it matters a lot just what those #-tokens were. If the #-tokens included all the tokens that weren't in some list of "words" that JSTOR used, then maybe writing in the natural sciences uses a larger fraction of modifiers that didn't make the list.

A second good thing about Okulicz-Kozaryn's paper is that he used a somewhat reproducible method of analysis:

    I identify parts of speech using Penn Tree Bank in Python NLTK module.

Unfortunately, it's not clear which NLTK tagger he actually used: there is no NLTK tagger called "Penn Tree Bank", though there are several available alternatives that use the Penn Tree Bank tagset.

But this uncertainty is a small thing in comparison with a really big problem. Okulicz-Kozaryn doesn't tell us how he actually tagged the n-gram lists. Did he extract individual words and look them up in one of NLTK's POS lexicons? Did he ask one of the NLTK taggers to cope with things like "### et al #" or "food ### lack of"?

Without answers to these questions about what data he actually analyzed and how he analyzed it, Okulicz-Kozaryn's results, presented only as a graph of proportions, are completely meaningless. The fact that Okulicz-Kozaryn doesn't tell us anything about these issues leads me to expect the worst.

This all raises an important question about Scientometrics (and its publisher, Springer): Never mind copy editors, do they have any reviewers or content editors?

And it also raises a question about Paul Jump (and his publisher, Times Higher Education): Did he actually read this article? If he did, how did he miss (or why did he ignore) its scientific problems and its ironically poor English? If he didn't, why is he telling us about the paper's conclusions as if they were from a credible source?

I can't resist adding Okulicz-Kozaryn's own prescription for improving the scientific literature:

    How do we keep up with the literature? We can use computers to extract meaning from texts. Better yet, I propose here, we should be writing research in machine readable format, say, using Extensible Markup Language (XML). I think, it is the only way for scientists to cope with the volume of research in the future.

Words fail me, adjectives and adverbs included.

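The denominator question raised above can be made concrete with a short Python sketch. This is not a reconstruction of Okulicz-Kozaryn's pipeline, which the paper does not describe. The function names (`masked_share`, `modifier_proportion`) and every count in the example are hypothetical, and the regular expression assumes one quadgram element per line in the layout quoted earlier. Quadgrams overlap, so the weighted share of #-slots is only a rough proxy for the share of #-tokens in the running text.

```python
import re

# Matches one element such as: <quadgram weight="29" > ### et al # </quadgram>
QUADGRAM_RE = re.compile(r'<quadgram weight="(\d+)"\s*>\s*(.*?)\s*</quadgram>')


def masked_share(lines):
    """Weighted share of quadgram token slots that are all-# placeholders."""
    masked = 0
    total = 0
    for line in lines:
        match = QUADGRAM_RE.search(line)
        if match is None:
            continue  # skip anything that is not a quadgram element
        weight = int(match.group(1))
        tokens = match.group(2).split()
        total += weight * len(tokens)
        masked += weight * sum(1 for tok in tokens if set(tok) == {"#"})
    return masked / total if total else 0.0


def modifier_proportion(n_modifiers, n_tokens, n_masked, subtract_masked):
    """Modifier proportion under either choice of denominator."""
    denominator = n_tokens - n_masked if subtract_masked else n_tokens
    return n_modifiers / denominator


# Two of the quadgrams quoted in the post.
sample = [
    '<quadgram weight="29" > ### et al # </quadgram>',
    '<quadgram weight="11" > of density dependence in </quadgram>',
]
print(f"masked share of token slots: {masked_share(sample):.3f}")

# Hypothetical counts: both texts have a 15% modifier rate among real words,
# but text A has 300 placeholder tokens per 1,000 and text B only 100.
text_a = dict(n_modifiers=105, n_tokens=1000, n_masked=300)
text_b = dict(n_modifiers=135, n_tokens=1000, n_masked=100)
for name, counts in (("A", text_a), ("B", text_b)):
    raw = modifier_proportion(subtract_masked=False, **counts)
    adjusted = modifier_proportion(subtract_masked=True, **counts)
    print(f"text {name}: raw={raw:.3f} adjusted={adjusted:.3f}")
```

With the same modifier rate among recognized words, the text with more #-tokens looks less "cluttered" whenever the placeholders stay in the denominator, which is exactly the kind of artifact the paper gives us no way to rule out.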
If you care about the curious phenomenon of modifier phobia, here are some of our many previous posts on the topic:

    "Those who take the adjectives from the table", 2/18/2004
    "Avoiding rape and adverbs", 2/25/2004
    "Modification as social anxiety", 5/16/2004
    "The evolution of disornamentation", 2/21/2005
    "Adjectives banned in Baltimore", 3/5/2007
    "Automated adverb hunting and why you don't need it", 3/5/2007
    "Worthless grammar edicts from Harvard", 4/29/2010
    "Getting rid of adverbs and other adjuncts", 2/21/2013

In an attempt to reduce the volume of irrelevant comments, let me stipulate: It might be true that social scientists use 15% more adjectives and adverbs in their papers than physicists do, but we can't tell from Okulicz-Kozaryn's paper whether they do or don't. He provides no evidence, aside from appeals to unscholarly and hypocritical authorities, that small differences in modifier proportions have any effect on readability, whether positive or negative. And the idea that removing one modifier in eight, on average, would be a good way to reduce the volume of scientific publication is so silly that I wondered for a while whether this whole article might be a sly joke.
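For anyone checking the "one modifier in eight" figure, here is a minimal Python sketch of the arithmetic. It assumes the reported 15% is measured relative to the natural-science density, which the paper as quoted does not make explicit. The function names are hypothetical, and the 10% modifier share used at the end is an invented illustration, not a number from the paper.

```python
def fraction_to_remove(relative_excess):
    """Share of modifiers to delete so the density falls back to the baseline.

    If one field uses (1 + e) times the baseline density, cutting e / (1 + e)
    of its modifiers returns it to the baseline count.
    """
    return relative_excess / (1.0 + relative_excess)


def volume_reduction(modifier_share, relative_excess):
    """Share of all words removed, given the modifiers' share of the text."""
    return modifier_share * fraction_to_remove(relative_excess)


share = fraction_to_remove(0.15)
print(f"remove {share:.1%} of modifiers, about one in {1 / share:.1f}")

# Hypothetical: if modifiers made up 10% of all words, total volume would
# shrink by only this much.
print(f"overall text shrinks by {volume_reduction(0.10, 0.15):.1%}")
```

Under these assumptions the cut comes to roughly one modifier in eight, and the effect on the total volume of text is far smaller than that.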
