By Erin McKean. January 9, 2011
I know, it’s getting a little late to reminisce about 2010. But besides being a new year, it has also been about a year since I started writing about language for Ideas (thanks again to Jan Freeman for sharing this column!), and I thought it would be fun to look back at a year’s worth of the best and worst stories about words.
From a lexicographer’s point of view, the best language story of 2010 was the recent paper in Science about “culturomics.” The authors define this term as “the application of high-throughput data collection and analysis to the study of human culture,” but what they literally did, working with Google Books, was take the full text of a huge number of books — about 4 percent of all titles ever published — and crunch the words as data, on the model of the Human Genome Project.
One amazing finding: They estimated “that 52% of the English lexicon — the majority of the words used in English books — consists of lexical ‘dark matter’ undocumented in standard references.” They found vast quantities of words like aridification, slenthem (a musical instrument), and deletable, none of them in normal dictionaries. Time to get crackin’, fellow lexicographers!
Along with their study came a public release of billions of sorted phrases, ranging up to five words in length, and a tool that allows any user to chart how common specific words are over time (it’s at www.ngrams.googlelabs.com). As fascinating as they are, these graphs have also led to lots of great discussion about one big thing missing from the data: the context in which the words are used. When we use the word class, are we talking about social rank (“middle class”) or school (“geometry class”)? That kind of analysis will have to wait for a different tool.