Identifying an author by their punctuation
June 5, 2019 1:20 PM

Neural networks and other machine learning approaches can often guess who wrote a given piece of text by analyzing patterns in the way they use words. But what about identifying an author by the punctuation they use instead? University of Oxford researcher Alex Darmon (and coauthors) has a web app that will compare the punctuation style of any writing sample to the authors in its database. Who do you punctuate like? (details on arxiv here)
posted by dbx (56 comments total) 19 users marked this as a favorite
Interesting! I tried this with one of my longer Metafilter comments and my closest match is George Bernard Shaw. Huh. Definitely not who I would have thought.
posted by hurdy gurdy girl at 1:39 PM on June 5

Alexandra N. M. Darmon, Marya Bazzi, Sam D. Howison, and Mason Porter. Great. Now I have to do four Google searches.
posted by Mogur at 1:45 PM on June 5 [2 favorites]

Lester del Rey, Daniel Defoe, E. Nesbit.
I punctuate with the best!
posted by pipeski at 1:47 PM on June 5 [1 favorite]

Insert three different samples, get three different authors. (In my case, Winston Marks, Anthony Hope, and L. Frank Baum.) It would be nice if it returned a shortlist of a few authors, ordered by degree of similarity.
posted by Iridic at 2:00 PM on June 5 [3 favorites]

Is this the new Oxford comma thread?
posted by sjswitzer at 2:02 PM on June 5 [7 favorites]

Anna Katharine Green, mother of the detective novel.
posted by Don Pepino at 2:05 PM on June 5

I tried a bunch of different samples. Every sample had a different result, but two results were Edgar Rice Burroughs and H. Rider Haggard so now I am feeling so adventurous and macho I can't even tell you...
posted by queensissy at 2:07 PM on June 5 [2 favorites]

So, I learned about this site while attending a conference at which the paper was presented by one of its authors. One of the interesting questions they asked was: can we identify 'good' scientific papers by punctuation style?

I'm happy to confirm that when I input the text from a paper I just finished writing, the machine told me it was similar, punctuationally, to a preeminent Greek historian. Yay!
posted by dbx at 2:08 PM on June 5 [1 favorite]

A long chunk of my writing supposedly resembles Hilaire Belloc. Score!

But a long chunk of The Golden Bowl supposedly resembles Elizabeth von Arnim.

I don't think James Wood has to worry about the robots taking his job quite yet.
posted by sy at 2:10 PM on June 5

Based on my experiences editing collaboratively written documents, this is one of the main ways I can tell who wrote what. (There's someone in upper administration here who uses en-dashes frequently and incorrectly...)
posted by Tesseractive at 2:11 PM on June 5 [5 favorites]

It doesn't inspire confidence in the authors' methodology that they analyse Shakespeare's punctuation without, apparently, being aware that this varies enormously from edition to edition. Ever since the time of Samuel Johnson, editors have freely repunctuated the text of Shakespeare. The claim that (to take one example) 'Shakespeare appears to use more exclamation marks and question marks than H.G. Wells' is thus completely meaningless.

The same goes for most of the earlier texts in their sample, as they are using public domain texts from Project Gutenberg, many of which will have been repunctuated. In other words, their text corpus is totally contaminated and their claims about 'the evolution of punctuation marks over time' are completely untenable. (And that's even before we get into the question of whether the punctuation of nineteenth- and twentieth-century books reflects authors' preferences or printers' house styles...) I'm afraid this is what happens when four mathematicians write a paper without bothering to consult any literary scholars, textual editors or bibliographers.
posted by verstegan at 2:17 PM on June 5 [26 favorites]

Oh god I can't even deal with another app that identifies people
posted by jenfullmoon at 2:19 PM on June 5 [4 favorites]

So I fed it a few of my mefi comments, and it consistently said that I write like various late 19th century American philosophers.

I started to think about what that says about my writing style, but then I realized that maybe it says I write like a late-19th century American philosopher not because I write like a late 19th century American philosopher but instead because their database has an overrepresentation of late-19th century writers — something that I can easily see happening if they're using public-domain sources to build their corpus.

So just as an experiment I copied in a bit of my published work — Slothrop's last scene from GR — just to see what the site claims it resembles. And according to this site, the punctuation style I used back in the early 70s strongly resembled the style of... Jerome K. Jerome, the late 19th century English humorist who brought us Three Men in a Boat.

Go figure...
posted by Reclusive Novelist Thomas Pynchon at 2:20 PM on June 5 [7 favorites]

heavy-set prattle (full of parentheses, em-dashes, compound adjectives, and Oxford commas)...

I don't know if this ill-advised assertion is meant to be ironic—for it contains some (all?) of the items defined as "prattle"—or not; perhaps I'll ask my parents, Ayn Rand and God.
posted by sylvanshine at 2:23 PM on June 5 [4 favorites]

Yeah I was going to ask about the Shakespeare punctuation, thanks verstegan. And even just taking a raw string of the punctuations without any context from words seems likely to be limited in its usefulness. Machine Learning giving us a new rash of "data" driven personality quizzes.
posted by little onion at 2:23 PM on June 5
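
To make "a raw string of the punctuations" concrete, here is a minimal Python sketch of what survives once the words are thrown away: the ordered sequence of marks, plus counts of which mark follows which. The mark set `MARKS` and both function names are assumptions made for illustration; nothing here is taken from the paper's or the web app's actual code.

```python
from collections import Counter

# Assumed set of marks for illustration; the app's real set may differ.
MARKS = set(".,;:!?'\"()-")


def punctuation_sequence(text: str) -> str:
    """Drop every character that is not a punctuation mark, keeping order."""
    return "".join(ch for ch in text if ch in MARKS)


def mark_pairs(text: str) -> Counter:
    """Count consecutive pairs of marks (which mark follows which)."""
    seq = punctuation_sequence(text)
    return Counter(zip(seq, seq[1:]))


if __name__ == "__main__":
    sample = "Hello, world! (Really?)"
    print(punctuation_sequence(sample))
    print(mark_pairs(sample))
```

Even in this toy form the limitation is visible: two passages with very different sentences can reduce to the same string of marks.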

Also according to this site:
  • Jonathan Edwards's Sinners in the Hands of an Angry God resembles Jonathan Swift
  • The punctuation in Cotton Mather's "Theopolis Americana: An Essay on the Golden Street of the Holy City" resembles that of abolitionist Eliza Lee Cabot Follen, who wrote about a century later. (Interestingly, Theopolis Americana is in part an abolitionist or proto-abolitionist text. Is there a common punctuation style used in abolitionist texts? probably not — this is almost certainly just a coincidence)
  • Charles Sanders Peirce's article Nominalism versus Realism resembles the writing of 19th century Scottish poet Andrew Lang.
Shrug! This is one of those situations, common in using machine techniques to analyze large corpuses of text, where the people making the inquiry are definitely uncovering something, but where what they're uncovering seems to bear little resemblance to what they claim to be uncovering. As observed upthread, what they might really be looking at is changes in editorial practice over the years.

If I were in charge of this project, my goal at this point would be to make sure that if I feed it Shaw, it says the writing style resembles Shaw; if I feed it Wilde, it says the writing style resembles Wilde; and so forth. At the very least I'd want my classifier to consistently classify works into the correct centuries, which this doesn't quite do.
posted by Reclusive Novelist Thomas Pynchon at 2:36 PM on June 5 [5 favorites]
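
The "feed it Shaw, get Shaw" check described above can be written down as a small test harness. This is a hedged sketch: `self_match_rate` and the stand-in `toy_classify` are hypothetical names, and the real app exposes no such function as far as this thread shows. The idea is only that any classifier, passed in as a callable, should attribute labeled samples to their true authors most of the time.

```python
from typing import Callable, Iterable, Tuple


def self_match_rate(
    samples: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> float:
    """Fraction of (author, text) samples attributed to their true author."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for author, text in samples if classify(text) == author)
    return hits / len(samples)


def toy_classify(text: str) -> str:
    """Stand-in classifier (an assumption): exclamation marks mean Shaw."""
    return "Shaw" if "!" in text else "Wilde"


if __name__ == "__main__":
    held_out = [("Shaw", "Ha!"), ("Wilde", "Quite."), ("Wilde", "Oh!")]
    print(self_match_rate(held_out, toy_classify))
```

The same harness could group authors by century instead of by name to test the weaker "correct century" requirement.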

"...Ernest Hemingway would have died rather than have syntax. Or semicolons. I use a whole lot of half-assed semicolons; there was one of them just now; that was a semicolon after "semicolons," and another one after "now." The set Dk for this quote is {, , . , . , . , ; , ; , “ , , , ” ..."

-From the arxiv link.

The machine forgot the 'The and 'And' to papas description faux Paridosio.

Punching in a few old comments, apparently the machine posits that I have declared a moratorium on the apostrophe.
posted by clavdivs at 2:36 PM on June 5 [2 favorites]

Feed it Shaw, you break it, feed Gertrude Stein.
"No 3 fouder there is plenty of Complant of the diffulty of pasing those briges Now as it is troue if those giddy people have Liberty to bould A brigg it wont pay but three or four per sent at most then they must have one halfe the passing of my brigg as I call it A mad bisness” (from Folio 2 of Plain Truths in a Homespun Dress)"

-Timothy Dexter, excerpt from the book, on building bridges
posted by clavdivs at 2:45 PM on June 5 [1 favorite]

I was happy to see my quasi-Holmes pastiche pop up as Arthur Conan Doyle, but then I tried with a paragraph from his fairy book to see if ACD punctuates like ACD, but my internet went out. Probably fairies.
posted by betweenthebars at 2:54 PM on June 5 [3 favorites]

Closest Author: Mencken, H. L. (Henry Louis)

That's right, mf'ers.
posted by rhizome at 2:57 PM on June 5 [3 favorites]

this is what happens when four mathematicians write a paper without bothering to consult any literary scholars, textual editors or bibliographers.
I'm not sure why, but it's oddly satisfying to get an erudite smackdown on my first-ever front page post. The historical / literary angle is a really fascinating wrinkle, and I guarantee they'd be interested to hear about it.

Authorship classifiers are here to stay no matter what, being used in plagiarism detection and kremlinology alike. I was surprised to learn that punctuation is often discarded as the first step in such algorithms, and I think the authors here certainly have shown that retaining that extra information can't hurt.

Their methods should be easy enough to reproduce on any corpus one likes; Project Gutenberg makes a lot of sense for a study like this since it's widely available in a consistent form.
posted by dbx at 3:25 PM on June 5 [2 favorites]

William Shatner and Christopher Walken, wtf
posted by St. Peepsburg at 3:26 PM on June 5 [1 favorite]

I don't need a machine to tell me that I rely excessively on parenthetical asides (I'm trying to cut down (Dammit!)); I also use semicolons more than most folks.
posted by adamrice at 3:38 PM on June 5 [2 favorites]

Anna Katharine Green, mother of the detective novel.

Mother of the Detective novels: an hitherto unreported genre dealing with issues like finding affordable daycares without inexplicable thefts or mysterious disappearances amongst their staff; and impressing on teachers the vital importance of responding to students' queries without any extraneous and/or incriminating details. The genre has mostly been replaced by Parents of the Detective novels, a genre in which the father makes an occasional appearance towards the end of the story.
posted by Joe in Australia at 3:48 PM on June 5 [3 favorites]

One of my most recent MetaTalk comments is Twainian; one of my recent longer policy posts in MetaTalk is like unto Joseph Conrad. Hrm.

I'd love to see output from this more like a top-five best fit set of star charts and average correlation values; I think the idea under it of making some sort of weighted multivector correlation between distribution of punctuation marks is fun but the public-facing output on the web app definitely feels like it's aiming for easy namechecks over communicating more complicated and more useful info about what the groupings and strengths of the various matches are.
posted by cortex at 3:54 PM on June 5 [1 favorite]
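
A ranked shortlist with scores, as requested above and by Iridic earlier, is easy to sketch. The version below is an illustration under stated assumptions, not the app's method: it reduces each text to relative frequencies of an assumed mark list `MARKS` and ranks authors by cosine similarity, a measure chosen here for simplicity. The names `profile`, `cosine`, and `top_matches` are hypothetical.

```python
import math
from collections import Counter

# Assumed mark list for illustration; order fixes the vector layout.
MARKS = ".,;:!?'\"()-"


def profile(text: str) -> list[float]:
    """Relative frequency of each mark among all marks in the text."""
    counts = Counter(ch for ch in text if ch in MARKS)
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in MARKS]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(sample: str, corpus: dict[str, str], k: int = 5):
    """Return the k (author, score) pairs most similar to the sample."""
    target = profile(sample)
    scored = [(author, cosine(target, profile(text)))
              for author, text in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    corpus = {"A": "Yes; no; maybe.", "B": "What?! Really?!", "C": "Plain words"}
    for author, score in top_matches("One; two; three.", corpus, k=2):
        print(f"{author}: {score:.3f}")
```

Showing the scores alongside the names makes it obvious when the first and fifth matches are nearly tied, which a single "closest author" result hides.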

Poe, Edgar Allan.

Take that, Montresor!
posted by GenjiandProust at 4:31 PM on June 5 [2 favorites]

my punctuation style is
most like e e cummings

posted by Merus at 4:40 PM on June 5 [4 favorites]

I picked a random longish comment of mine from here: Mary Baker Eddy.

And a paragraph from my most recent grant: Louis Becke.

A few paragraphs from the intro of a published solo author scientific paper I wrote: Mark Twain.

I find the last one funniest.

Separately, wonder how hard it would be to spin up a bot to pull an entire comment history on mefi for punctuation. Or if someone could run their analysis on the entire mefi corpus; who do we, collectively, most punctuate like?
posted by nat at 5:07 PM on June 5

I, Aye. Ne'eer er'e Alas: O'er.
posted by clavdivs at 5:11 PM on June 5

> Their methods should be easy enough to reproduce on any corpus one likes; Project Gutenberg makes a lot of sense for a study like this since it's widely available in a consistent form.

It makes some sense as a test database. Nevertheless, the issues pointed to upthread by verstegan are in fact serious flaws that undermine the project-as-presented rather than like erudite nitpicks or whatever. A more responsible presentation would be something like "we'll tell you what 19th century author (as regularized by Project Gutenberg) your writing resembles by analyzing your punctuation" — of course, that's an inherently less interesting thing to put up on the web.
posted by Reclusive Novelist Thomas Pynchon at 5:16 PM on June 5 [3 favorites]

(>'-')> <> \_( .")> <( ._.)-

These ASCII dancing folks are like something Charles Darwin would write, apparently.

(On preview, apparently they have too many escape characters! Oh well -- dance on, then.)
posted by klausman at 5:21 PM on June 5

Wish they referenced Richard Galpin's erasure work.
posted by recklessbrother at 5:57 PM on June 5

I put in some '60s-era Tom Wolfe and I swear to god I broke it.
posted by queensissy at 6:01 PM on June 5

Randall Garrett (Lord Darcy Investigates). I think my use of exclamation and question marks comes into play here.
posted by SPrintF at 6:04 PM on June 5

First intro to my big ol DW post : H.L. Mencken.

The full post: Andrew Lang (who?) Time to look up.

That said, I want to see what some of my fiction is like according to the syntax analyzers (not this punctuation one), and whether it matches authors I like or ones I really dislike reading.
posted by symbioid at 6:10 PM on June 5

So apparently I really write like Marie LeBert who is apparently a French ethnolinguist and has done some ILO translations and does a lot of writing about... I'm not sure - but Gutenberg project and other open-source literature stuff? What I'm finding seems to be primarily in French.

Andrew Lang is a folklorist, I also had Charles Godfrey Leland who also appears to have been a folklorist.

Analyzing on other sites for general style (not punctuation) I apparently write like Arthur C. Clarke, Doctorow (Cory, not the other one), and Stephen King.

I guess my life literary goal is to write sci-fi horror based on folk and myth or something.
posted by symbioid at 6:48 PM on June 5 [1 favorite]

I don't know if the punctuation approach has any validity, but the database these guys are working with is so limited as to make this experiment almost entirely bogus. Jane Austen writes like Conan Doyle? Samuel Johnson writes like Mark Twain? Give me a break!
posted by yinchiao at 7:05 PM on June 5 [3 favorites]

what about people who serially abuse the ellipsis

asking for a friend...
posted by Mayor West at 7:29 PM on June 5 [3 favorites]

This Is Just To Say

that should
be in here
is missing

and we
cannot possibly
which poet

Wrote this poem
despite our efforts
and so bold
posted by mandolin conspiracy at 8:32 PM on June 5 [4 favorites]

I didn't read the post properly and I inserted samples of famous authors' writing to see if the app guessed them correctly (instead of inserting a sample of my writing to compare to famous authors). I entered a paragraph each from Twain's "Mysterious Stranger," Kate Chopin's "The Story of an Hour," Elizabeth Gaskell's "Cranford," and Tom Robbins's "Jitterbug Perfume." The answers I received in order were 1) Twain, 2) H. Beam Piper, 3) William Dean Howells and 4) Nathaniel Hawthorne. So I'm not sure how much I'd trust this app to accurately match my writing style to those of famous authors.
posted by SA456 at 8:36 PM on June 5 [1 favorite]

I've always wanted someone to analyze the writing of a "person" whose work is written by others to see if there are tells for the others.

Sometimes, when I go back and read something from work, the only way to ID the author is digging through emails to see which one of us wrote it in the voice of the boss.

Emails, personal business: Rabindranath Tagore
Emails, spouse: Arthur Conan Doyle (!?) (US or UK editions? Early or later editions?)
Work memos: Jack London

So, the least personal writing is the author I like the most. Curious.
posted by Lesser Shrew at 9:19 PM on June 5

Oscar Wilde.

Obviously, it doesn't check for humor.
posted by dobbs at 9:36 PM on June 5


Gives an error. Guess we're all dead.
posted by axiom at 9:39 PM on June 5

Gonna see what it comes up with when I feed it @kidswritejokes
posted by rhizome at 10:06 PM on June 5

I was waiting for the e e cummings and This is Just to Say jokes and you did not disappoint.
posted by St. Peepsburg at 10:46 PM on June 5 [2 favorites]

As the app seems to compare writing samples only by the frequency, patterns, and types of punctuation used, absent any context from the text itself (how the punctuation is used in different kinds of sentences, how those sentences are organized, or how it might affect meaning), the use of the tool is pretty limited. It's probably best at noting writing with unusual quantities of certain marks, or more-than-expected use of uncommon marks, or cases where authors follow the same patterns of use because the kind of work they write shares conventions: more quote marks likely means fiction, while parentheses are perhaps more common in certain types of non-fiction, for example.
posted by gusottertrout at 11:15 PM on June 5 [3 favorites]

authors' preferences or printers' house styles

Indeed; I believe Jane Austen’s punctuation was, um, innovative; what we see in published versions is entirely down to her editor.
posted by Segundus at 11:27 PM on June 5 [1 favorite]

Insert three different samples, get three different authors.

I put in three samples from different stories what I wrote and got Balzac, Twain, and Shakespeare.
posted by Segundus at 11:31 PM on June 5

*distant sounds of running, getting closer*

*door bursts open, Mr. Bad Example storms in out of breath*


It doesn't inspire confidence in the authors' methodology that they analyse Shakespeare's punctuation without, apparently, being aware that this varies enormously from edition to edition.

Oh. I, uh...I'll just go fix the door then, shall I?

(When I come back, I may or may not go on a rant about Shakespeare companies I've encountered who think the punctuation in the First Folio serves as SUPER SECRET ACTING CLUES from Shakespeare himself...)
posted by Mr. Bad Example at 1:47 AM on June 6 [7 favorites]

I got Conan Doyle. Knowing now that Tagore is an option, I feel like I've let my Bangladeshi family down.
posted by divabat at 1:47 AM on June 6 [2 favorites]


posted by chavenet at 3:51 AM on June 6 [1 favorite]

I put in an Icelandic text by myself and got Daniel Defoe. I then tried the same text, translated into English, and got Samuel Pepys.

Given that they both wrote diaristic texts about the same time and place, that's pretty impressive by the translator.
posted by Kattullus at 4:58 AM on June 6 [1 favorite]

Pepys! PEPYS! that's freaking amazing
posted by wellred at 5:51 AM on June 6 [2 favorites]

Put in the text of an instructional email I wrote to a client and got John McElroy. Realized that included a bunch of ICD-10 and CPT codes, so it's full of extra periods and dashes. Clipped out that paragraph and then got L. Frank Baum.

I'm not sure what to make of that dramatic difference, but I'm okay with it.
posted by The Man from Lardfork at 5:56 AM on June 6

I got George Gibbs (longer fanfare comment), Arthur Conan Doyle (blog entry), Louis Becke (longer email), William Thackeray (longer metafilter comment).

I like commas.
posted by dinty_moore at 6:16 AM on June 6

I like commas.

Me too! And as long as the app doesn't measure appropriateness of use, I'll be happy to take the comparisons to any real authors who don't engage in rampant comma splicing or any of the other frequent atrocities of punctuation I inflict on readers.
posted by gusottertrout at 7:29 AM on June 6 [1 favorite]
