How algorithms help us understand books

Recently the Sunday Times outed J.K. Rowling as the author of the detective novel The Cuckoo’s Calling, published under her nom de plume Robert Galbraith. While devotees of Rowling quickly procured and binge-read her latest work, linguists and language lovers worldwide celebrated the computational analysis of the two scholars who helped reveal the true author of the book in question.

Patrick Juola (Duquesne University) and Peter Millican (Oxford University) were both approached by a Times reporter to compare The Cuckoo’s Calling with the novels of J.K. Rowling and three other possible authors. In a guest post on Language Log, Juola describes his process. He first explains the theory of “forensic stylometry”: “language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices.” By running tests on variables such as the distribution of word lengths, the percentage of the 100 most common words, and the frequency of pairs of adjacent words, Juola found that though the results were “mixed,” they pointed to Rowling as the most likely author. Millican ran his own computational tests and arrived at the same conclusion, discovering along the way that Rowling is less likely to use the phrase “as soon as” than the three other writers examined.
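To make those three variables concrete, here is a minimal Python sketch of how each one might be measured. It is an illustration only, not Juola's or Millican's actual software: the function names, the tiny set of common words, and the sample sentence are all made up for this example, and a real study would work with whole novels and the 100 most common words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_length_distribution(tokens):
    """Share of tokens at each word length."""
    counts = Counter(len(t) for t in tokens)
    total = len(tokens)
    return {length: n / total for length, n in sorted(counts.items())}


def common_word_share(tokens, common_words):
    """Fraction of tokens that belong to a given set of common words."""
    return sum(1 for t in tokens if t in common_words) / len(tokens)


def bigram_counts(tokens):
    """How often each pair of adjacent words occurs."""
    return Counter(zip(tokens, tokens[1:]))


# Invented sample sentence (an assumption for illustration, not from any novel).
sample = "As soon as the rain stopped, the detective left as soon as he could."
tokens = tokenize(sample)

# Stand-in for a real list of the 100 most common words.
common_words = {"as", "the", "he"}

print(word_length_distribution(tokens))
print(common_word_share(tokens, common_words))
print(bigram_counts(tokens).most_common(2))
```

Comparing these numbers across a disputed text and each candidate author's known work is the general idea; how the comparison is scored is a separate choice that this sketch leaves out.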

Rowling is not the first mystery writer to have her text subjected to the exacting analysis of computational linguists and their complex algorithms. One episode of the WNYC show Radiolab features Ian Lancashire, a professor at the University of Toronto, who made a startling discovery about Agatha Christie upon running computational word-frequency and vocabulary tests on her novels. In her 73rd detective novel, her vocabulary decreased by a shocking 20% from that of her previous 72 novels. Additionally, Christie’s use of words such as “thing,” “anything,” “something,” and “nothing” increased sixfold. Lancashire concluded that Christie’s 73rd novel, appropriately titled Elephants Can Remember, marked the onset of Alzheimer’s for this cherished author, who was never diagnosed in her lifetime. Lancashire told Radiolab, “I was seeing the author in the text in a way that people haven’t seen the author in the text before.”
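The two measurements mentioned here, vocabulary size and the rate of indefinite words, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Lancashire's actual method: the function names are hypothetical, the two sentences are invented, and vocabulary is counted over equal-sized samples because a longer text will almost always contain more distinct words.

```python
import re

# The indefinite words named in the article.
INDEFINITE_WORDS = {"thing", "anything", "something", "nothing"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_size(tokens, sample_size):
    """Number of distinct words among the first sample_size tokens."""
    return len(set(tokens[:sample_size]))


def indefinite_word_rate(tokens):
    """Fraction of tokens that are indefinite words."""
    return sum(1 for t in tokens if t in INDEFINITE_WORDS) / len(tokens)


# Invented sentences standing in for an earlier and a later novel.
early = tokenize("The butler polished the silver candlestick before dinner.")
late = tokenize("The thing was near the thing, or something, before the thing.")

# Compare the same number of tokens from each text.
sample_size = min(len(early), len(late))

print(vocabulary_size(early, sample_size), vocabulary_size(late, sample_size))
print(indefinite_word_rate(early), indefinite_word_rate(late))
```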

This kind of textual analysis enabled by computers can give readers a richer understanding of books and the authors behind those works. One paper by researchers at the Federal Technological University of Paraná (Brazil) and the University of Aberdeen (UK) explores the social network in the Odyssey, comparing it with modern social networks to suggest that Homer’s epic is based, in part, on actual events. A visualization of character co-occurrences in Les Misérables created by Mike Bostock helps readers instantly understand the interrelationships of characters in a way that is much more subtle when reading the book.
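A character co-occurrence network starts from a simple count: how often two characters appear in the same scene or chapter. The Python sketch below shows that counting step only. The scene list is toy data invented for illustration, not the data behind Bostock's visualization, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(scenes):
    """Count how many scenes each pair of characters shares."""
    counts = Counter()
    for characters in scenes:
        # Sort names so each pair is always stored in the same order.
        for pair in combinations(sorted(set(characters)), 2):
            counts[pair] += 1
    return counts


# Toy scene list (an assumption for illustration only).
scenes = [
    ["Valjean", "Javert"],
    ["Valjean", "Cosette"],
    ["Valjean", "Cosette", "Marius"],
    ["Cosette", "Marius"],
]

edges = cooccurrence_counts(scenes)
for (first, second), weight in edges.most_common():
    print(first, second, weight)
```

Each counted pair becomes an edge in the network, and the count can serve as the edge's weight when the graph is drawn.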

The Google Ngram Viewer is an excellent resource for language lovers, historians, or sociologists who wish to look at more than just one book; it allows users to search the various Google Books corpora (collections of words and texts) to understand trends of word usage over time, often providing insight into the social and cultural implications of these trends. Recently the term Popemobile was added to our dictionary. As part of the research for that new entry, lexicographers used the Google Ngram Viewer to generate a visualization of when this word first started appearing in English-language books: the mid-1970s. We can also learn from this graph that Popemobile appears more frequently with an initial capital letter than in all lowercase type. This kind of data helps provide the most accurate and high-quality definitions for our users.

From revealing the true author of mystery books to helping lexicographers write better definitions, technology quickly illuminates books in ways that might have taken a lifetime of research without the aid of computers. Writers who wish to stay anonymous can attempt to outsmart stylometry experts—there’s even a program being developed for this very purpose called Anonymouth. Perhaps J.K. Rowling will use a tool like this to disguise her writing the next time she decides to clandestinely break into a new genre.


When in Rome: Everyday Latin Phrases


Ibid. is an abbreviation for the Latin word ibidem meaning “in the same place.” The phrase is useful for citations and bibliographies to refer to a source cited in a previous entry, possibly appearing in a citations page as:
[1] Cicero, M.T. Latin 101 (Rome: Academic, 21 BCE), p. 4.
[2] Ibid.
Ibid. is always followed by a period, is capitalized only at the beginning of a sentence or citation, and may or may not be italicized depending on the writer’s preference.

Merry Mix-Ups: 9 British Terms That Flummox



When many of us hear the word saloon we think of an old-timey bar with large swinging doors and ragtime music playing in the background. But in the UK, if you ask to be taken to a saloon, you might be guided to what we in the States refer to as a sedan–a car that seats four or five people. If this outcome is disappointing, remember it’s the perfect opportunity to ask for a ride to the pub.


Proceed with caution when disclosing any information regarding your pants to a Brit because in British English, the word pants means “underpants.” If you must discuss the heavenly breathability or superior fabric grade of your new slacks, consider using the term trousers to ensure it translates accurately.


In the US, the word jumper refers to a person or thing that jumps. But in the UK, jumper has the softer, cozier meaning of “sweater.” This sense of the word can be traced back to the now-obsolete definition of jump as a short coat worn by men in the 1600s and 1700s.


Jim Henson is credited with coining the term Muppet in the US to refer to his beloved part-puppet, part-marionette creations, but in Britain the word muppet has taken on a colloquial sense to refer to an incompetent or ineffective person–an idiot. No doubt Miss Piggy would take umbrage at this less-than-flattering derivation.


When many of us think of biscuits, we think of soft, flaky baked buns slathered in butter or gravy. But US tastebuds be warned: in British English, the word biscuit, as in a digestive biscuit (or sometimes just a digestive), refers to what we might call a cookie or cracker. The word derives from the Latin biscoctum panem, which means “twice-baked bread.”

Agony aunt

The term agony aunt may conjure a family member who pinches her relatives’ cheeks a little too hard, squeezes them a little too tight, and gossips about their marriage prospects (or lack thereof) a little too fervently. But in the UK, this term refers to an editor of what is called an agony column, or what US readers might know as an advice column. The writers of Dear Abby and Ask Ann Landers are examples of agony aunts.


For many of us, the word braces conjures middle-school memories of metal-clad mouths and trips to the orthodontist. For the Brits, braces can also refer to suspenders, as in straps that hold up trousers. Both of these meanings can be traced back to the Old French word brace or braz, meaning “arms,” and its verb form bracier, meaning “to embrace or to render firm or steady by tensing.”

Boot and bonnet

Boots and bonnets are not items that we commonly associate with cars, but in the UK, boot refers to a car’s trunk, and bonnet refers to its hood. Be it with boots, bonnets, or hoods, it’s clear that people like to dress their cars in clothing terminology on both sides of the pond.


If you want what we in the States know as pudding–a soft, thick gelatinous treat–while traveling in Great Britain, you might have to get more specific with your terminology. Across the pond, pudding can refer to any sort of dessert course or to a stuffed entrail or sausage. This broader meaning explains the rise of everyone’s favorite pudding-related proverb, The proof is in the pudding, or as it was originally phrased, All the proof of the pudding is in the eating.