How can algorithms help us understand books?
Recently the Sunday Times outed J.K. Rowling as the author of the detective novel The Cuckoo’s Calling, published under her nom de plume Robert Galbraith. While devotees of Rowling quickly procured and binge-read her latest work, linguists and language lovers worldwide celebrated the computational analysis of the two scholars who helped reveal the true author of the book in question.
Patrick Juola (Duquesne University) and Peter Millican (Oxford University) were both approached by a Times reporter to compare The Cuckoo’s Calling with the novels of J.K. Rowling and three other possible authors. In a guest post on Language Log, Juola describes his process. He first explains the theory of “forensic stylometry”: “language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices.” By running tests on variables (such as distribution of word length, percentage of the 100 most common words, and frequency of pairs of adjacent words), Juola found that although the results were “mixed,” they pointed to Rowling as the most likely author. Millican ran his own computational tests and arrived at the same conclusion, discovering along the way that Rowling is less likely to use the phrase “as soon as” than the three other writers examined.
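To make those three variables concrete, here is a minimal Python sketch of how each one might be computed for a piece of text. It is not Juola's or Millican's actual software: the function names, the simple tokenizer, and the sample sentence are assumptions made for illustration, and "percentage of the 100 most common words" is read here as the share of all words in a text that belong to its most frequent words, which is only one possible interpretation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def word_length_distribution(tokens):
    """Relative frequency of each word length in the text."""
    counts = Counter(len(t) for t in tokens)
    total = len(tokens)
    return {length: n / total for length, n in sorted(counts.items())}

def top_word_share(tokens, n=100):
    """Fraction of all tokens accounted for by the n most common words."""
    counts = Counter(tokens)
    return sum(c for _, c in counts.most_common(n)) / len(tokens)

def bigram_frequencies(tokens):
    """Relative frequency of each pair of adjacent words."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    return {pair: c / total for pair, c in pairs.items()}

# Invented sample sentence, used only to exercise the functions.
sample = "As soon as the rain stopped, the detective left the office."
tokens = tokenize(sample)

lengths = word_length_distribution(tokens)
share = top_word_share(tokens, n=2)   # "the" and "as" cover 5 of the 11 tokens
bigrams = bigram_frequencies(tokens)
```

In practice an analyst would compute profiles like these for the disputed book and for each candidate author's known work, then ask which candidate's profile sits closest to the disputed text. The comparison step is left out here because the article does not describe how it was done.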
Rowling is not the first mystery writer to have her text subjected to the exacting analysis of computational linguists and their complex algorithms. One episode of the WNYC show Radiolab features Ian Lancashire, a professor at the University of Toronto, who made a startling discovery about Agatha Christie upon running computational word-frequency and vocabulary tests on her novels. In her 73rd detective novel, her vocabulary decreased by a shocking 20% from that of her previous 72 novels. Additionally, Christie’s use of words such as “thing,” “anything,” “something,” and “nothing” increased sixfold. Lancashire concluded that Christie’s 73rd novel, appropriately titled Elephants Can Remember, marked the onset of Alzheimer’s for this cherished author, who was never diagnosed in her lifetime. Lancashire told Radiolab, “I was seeing the author in the text in a way that people haven’t seen the author in the text before.”
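The two measurements described above, vocabulary size and the rate of vague words, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Lancashire's method: the fixed-size sample (so that a longer book does not look richer simply because it is longer), the function names, and both sample sentences are invented for the example.

```python
import re

# The four indefinite words named in the article.
VAGUE_WORDS = {"thing", "anything", "something", "nothing"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def vocabulary_size(tokens, sample_size):
    """Number of distinct words among the first sample_size tokens."""
    return len(set(tokens[:sample_size]))

def vague_word_rate(tokens):
    """Share of tokens that are one of the VAGUE_WORDS."""
    if not tokens:
        return 0.0
    return sum(t in VAGUE_WORDS for t in tokens) / len(tokens)

def percent_change(before, after):
    """Change from before to after, as a percentage of before."""
    return (after - before) / before * 100

# Invented sentences standing in for an earlier and a later text.
early_tokens = tokenize(
    "The colonel examined the revolver, the ledger and the broken window latch."
)
late_tokens = tokenize(
    "She saw the thing and the thing saw something and nothing and the thing."
)

early_vocab = vocabulary_size(early_tokens, 12)   # 9 distinct words
late_vocab = vocabulary_size(late_tokens, 12)     # 7 distinct words
vocab_change = percent_change(early_vocab, late_vocab)
late_vague = vague_word_rate(late_tokens)         # 5 of 14 tokens
```

With whole novels in place of single sentences, the sample size would be in the tens of thousands of words, and the same two numbers could be tracked book by book across an author's career.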
This kind of textual analysis enabled by computers can give readers a richer understanding of books and the authors behind those works. One paper by researchers at the Federal Technological University of Paraná (Brazil) and the University of Aberdeen (UK) explores the social network in the Odyssey, comparing it with modern social networks to suggest that Homer’s epic is based, in part, on actual events. A visualization of character co-occurrences in Les Misérables created by Mike Bostock helps readers instantly grasp relationships among characters that emerge only gradually when reading the book itself.
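A character network of this kind starts from a simple count: how often do two characters appear together? The Python sketch below shows that counting step, assuming the book has already been divided into scenes and each scene tagged with the characters present. The scene list is a made-up placeholder, not the data behind Bostock's visualization or the Odyssey paper, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(scenes):
    """Count how many scenes each pair of characters shares."""
    counts = Counter()
    for scene in scenes:
        # Sort the names so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(scene)), 2):
            counts[(a, b)] += 1
    return counts

def degree(counts):
    """Number of distinct characters each character appears with."""
    neighbors = {}
    for a, b in counts:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {name: len(others) for name, others in neighbors.items()}

# Placeholder scenes invented for illustration only.
scenes = [
    ["Valjean", "Javert"],
    ["Valjean", "Cosette"],
    ["Cosette", "Marius", "Valjean"],
    ["Javert", "Valjean"],
]

edges = cooccurrence_counts(scenes)
connections = degree(edges)   # Valjean is linked to all three other characters
```

Each counted pair becomes a line in the network diagram, with a thicker line for a higher count, and a character's number of connections gives a rough sense of how central that character is to the story.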
The Google Ngram Viewer is an excellent resource for language lovers, historians, or sociologists who wish to look at more than just one book; it allows users to search the various Google Books’ corpora (collections of words and texts) to understand trends of word usage over time, often providing insight into social and cultural implications of these trends. Recently the term Popemobile was added to Dictionary.com. As part of the research for that new entry, lexicographers used the Google Ngram Viewer to generate a visualization of when this word first started appearing in English-language books—the mid-1970s. We can also learn from this graph that Popemobile appears more frequently with an initial capital letter than in all lowercase type. This kind of data helps Dictionary.com provide the most accurate and high-quality definitions for our users.
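The two questions asked of Popemobile, when a word first appears and which capitalization is more common, come down to counting matches per year. The Python sketch below shows the idea on a tiny invented corpus. It does not query the Google Ngram Viewer or use Google Books data: the records, years, and function names are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def yearly_counts(records, term):
    """For each year, count every spelling of term as it appears in the text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    by_year = defaultdict(Counter)
    for year, text in records:
        for match in pattern.findall(text):
            by_year[year][match] += 1
    return dict(by_year)

def first_year(by_year):
    """Earliest year with at least one match, or None if there are none."""
    return min(by_year) if by_year else None

def case_totals(by_year):
    """Total count of each capitalization across all years."""
    totals = Counter()
    for counts in by_year.values():
        totals.update(counts)
    return totals

# Toy records invented for illustration; not real publication data.
records = [
    (1970, "The pope travelled by car."),
    (1976, "Crowds waited for the Popemobile."),
    (1981, "The Popemobile passed; a popemobile replica followed."),
]

counts = yearly_counts(records, "popemobile")
earliest = first_year(counts)      # 1976 in this toy corpus
by_case = case_totals(counts)      # capitalized form appears twice, lowercase once
```

A real corpus also needs each year's count divided by the total number of words published that year, since more books are printed in some years than in others.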
From revealing the true author of mystery books to helping lexicographers write better definitions, technology quickly illuminates books in ways that might have taken a lifetime of research without the aid of computers. Writers who wish to stay anonymous can attempt to outsmart stylometry experts—there’s even a program being developed for this very purpose called Anonymouth. Perhaps J.K. Rowling will use a tool like this to disguise her writing the next time she decides to clandestinely break into a new genre.