"Most people are not aware how sensitive their writing style can be"
January 27, 2013 4:46 AM

 
To counter the implications for privacy and security, the researchers have created two open-sourced tools - the first one, JStylo, recognises an individual writer's style. The second tool, Anonymouth, is used to "anonymise" writing by providing user specific suggestions to change writing style.

They need a plug-in for Firefox.
posted by pracowity at 4:50 AM on January 27, 2013 [10 favorites]


Gait analysis for message boards.
posted by ryanrs at 4:55 AM on January 27, 2013 [5 favorites]


Off the cuff, this sounds like the kind of thing which might serve as the basis for probable cause, i.e., enough for the police to arrest you and possibly even for the prosecutor to indict you, but I highly doubt that it's sufficient to constitute "proof beyond a reasonable doubt." Depending on how it works and how reliable it proves to be, it may or may not even be admissible. The courts tend to be pretty dubious of unproven technologies like this one.

So while this sort of thing might get you in trouble, I would be very surprised if this, by itself, was sufficient for a criminal conviction.
posted by valkyryn at 5:24 AM on January 27, 2013 [4 favorites]


Even if it's just enough to put them on your scent, that's bad enough for me. After the cops are pretty sure they know who did something, they can investigate you until they do have something that will stick.
posted by pracowity at 5:30 AM on January 27, 2013 [5 favorites]


..to "anonymise" writing by providing user specific suggestions to change writing style.

ur doin it rong
posted by DU at 5:30 AM on January 27, 2013 [2 favorites]


On the Internet in public I only write in my second language.
And I breadcrumb my text with false data.
Am I as big a Sex And The City fan as I appear?
Who can tell?

Seconding pracowity: we need a plug in stat.
posted by Mezentian at 5:36 AM on January 27, 2013


This is why I only write my manifestos in LOGLAN.
posted by jenkinsEar at 5:37 AM on January 27, 2013 [6 favorites]


Sweet! I've been looking for something like this for a while. Seems pretty complicated to use, though. You have to know where to bury the corpuses.
posted by XMLicious at 5:46 AM on January 27, 2013 [21 favorites]


On the Internet in public I only write in my second language.

It wouldn't surprise me if a detailed analysis of everything you've written gives clues to what your first language is though.
posted by atrazine at 5:47 AM on January 27, 2013 [4 favorites]


I've wondered for a few years as to whether any computer science PhD students have been using MeFi as a training ground for stylometric research to see if you can match up anonymous AskMes with users or link sock puppet accounts to primary accounts. There's ample data to work with, I suppose, so it wouldn't surprise me if someone has tried it.
posted by barnacles at 5:52 AM on January 27, 2013 [1 favorite]


It wouldn't surprise me if a detailed analysis of everything you've written gives clues to what your first language is though.

F0r61v3 m3 m0d5, 8u7 1 h4v3 51nn3d.
posted by Mezentian at 5:54 AM on January 27, 2013 [2 favorites]


Ah, actually Lojban writing can be quite idiosyncratic, mainly in its semantics rather than syntax. In any case, writing in Lojban makes it safe to assume that you associate with the thousand or so people who can read it, so probably not the best security measure!
posted by LogicalDash at 5:55 AM on January 27, 2013 [2 favorites]


I build across this comment through Google translate into Spanish and vice versa, in English. Catch me if you can, stylometry.
posted by klarck at 6:07 AM on January 27, 2013 [1 favorite]


I love you! I kiss you!
posted by Mezentian at 6:11 AM on January 27, 2013


Given how much cut/paste goes on on the Internet, I likely have the writing style of a 15 yr old from the Bronx. Just sayin'.
posted by arcticseal at 6:15 AM on January 27, 2013 [2 favorites]


It's very interesting as a data-mining technique, I think, although research in this general area is not new (see techniques such as latent semantic analysis). I'd be surprised if better-funded secret government agencies are not doing this already.

However, in general, I guess you have to build the corpus in the first place, to do the analysis, and then once you have idents, to link them back to writers IRL. It's one thing to pull an analysis from an existing corpus, and another to build an analysis aimed at a particular question, starting from scratch.

It's still a cause for concern, though.
posted by carter at 6:16 AM on January 27, 2013


I wonder if writers could use the anonymiser in the opposite way - to make their own writing sound more like themselves. You could have a super William Gibson, for instance (which the fake Twitter accounts suggest is possible anyway!)
posted by horopter at 6:21 AM on January 27, 2013 [2 favorites]


Shoot me if I'm wrong, but if these tools are truly effective, wouldn't they be equally effective in helping you adopt another person's writing style? Meaning, if I want to get my nemesis in trouble, I go to his blog, apply some of these algorithms, then write a bunch of anonymous, incriminating shit across the internet in the suggested style?
posted by simen at 6:23 AM on January 27, 2013 [5 favorites]


This is why everything I write is done by dictating my comments telephone-game-style through a string of unwitting accomplices. I don't even know the password to this account, the last dupe in line logs in and types this for me hey wait THIS IS WHAT I HAVE BEEN DOING FOR THE LAST SEVEN YEARS?
posted by caution live frogs at 6:29 AM on January 27, 2013 [11 favorites]


I'm curious as to how this works when writing by committee, a la Anonymous.

#oplastresort
posted by butterstick at 6:48 AM on January 27, 2013


Shoot me if I'm wrong, but if these tools are truly effective, wouldn't they be equally effective in helping you adopt another person's writing style? Meaning, if I want to get my nemesis in trouble, I go to his blog, apply some of these algorithms, then write a bunch of anonymous, incriminating shit across the internet in the suggested style?

That appears to be exactly how it works if I'm understanding it correctly. It looks like you have to give it a document you've written that you want to anonymize, a large corpus of other writing you've done, and then a corpus of writing for it to target as what you'd like to camouflage yourself in.
posted by XMLicious at 6:54 AM on January 27, 2013 [1 favorite]
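
The three inputs described above (a document to anonymize, a corpus of your own writing, and a target corpus to blend into) can be sketched roughly as follows. This is a toy illustration only, not Anonymouth's actual algorithm: the function names `word_rates` and `suggest_changes` are hypothetical, and plain word frequencies stand in for whatever features the real tool uses.

```python
import re
from collections import Counter


def word_rates(texts):
    """Pooled relative frequency of every word across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.items()}


def suggest_changes(document, own_corpus, target_corpus, top_n=5):
    """Rank words whose rate in the document differs most from the target.

    A word is only reported if the author's own corpus deviates from the
    target in the same direction, i.e. it looks like a personal habit
    rather than an accident of this one document. (Hypothetical heuristic.)
    """
    doc = word_rates([document])
    own = word_rates(own_corpus)
    target = word_rates(target_corpus)

    suggestions = []
    for word in set(doc) | set(target):
        gap = doc.get(word, 0.0) - target.get(word, 0.0)
        habit_gap = own.get(word, 0.0) - target.get(word, 0.0)
        if gap * habit_gap > 0:  # same sign: document and habit agree
            advice = "use less" if gap > 0 else "use more"
            suggestions.append((abs(gap), word, advice))

    suggestions.sort(reverse=True)  # largest gap first
    return [(word, advice) for _, word, advice in suggestions[:top_n]]
```

Calling `suggest_changes(my_draft, my_old_posts, someone_elses_posts)` returns a short list of `(word, advice)` pairs. A real tool would need far larger corpora and richer features than single words.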


I needed to clean up smh.com.au's youtube link before watching the 29c3 talk. We mentioned recently that most other 29c3 talks are available at media.ccc.de, but oddly not this one.
posted by jeffburdges at 6:57 AM on January 27, 2013


The courts tend to be pretty dubious of unproven technologies like this one.

Whereas they are mostly credulous w/r/t pseudoscience like fingerprints and polygraphs. A bunch of lawyers are hardly the gold standard for determining scientific validity.
posted by indubitable at 7:02 AM on January 27, 2013 [7 favorites]


In case anyone doubts the usefulness of this sort of analysis, this is exactly what just brought down Jim Letten, the US attorney in New Orleans and the longest serving US attorney in the country. Article

TL;DR: Two of Letten's prosecutors, Jan Mann and Sal Perricone, were leaving anonymous comments on the articles at NOLA.com, the website of the New Orleans Times-Picayune newspaper. The comments were very critical of Fred Heebe, the owner of a local landfill whose dealings were under investigation. Some of the comments reeked of insider knowledge of the investigation, so Heebe hired an investigator to compare the comment writing style with the plentiful documents produced by Letten's office.

Once he had enough data Heebe filed defamation lawsuits against Perricone and Mann, and the shit proceeded to hit the fan.
posted by localroger at 7:19 AM on January 27, 2013 [8 favorites]


Despite their providing an anonymising tool, I note they were funded by DARPA. No prizes for guessing what DARPA's interest is.
posted by FrereKhan at 7:39 AM on January 27, 2013


I just took a look at the .tar file, and the jstylo.txt file is empty. I'm not particularly interested in anonymising my writing, but I'd love to see a stylistic comparison of two files to see if they were apparently by the same author.

Can the package be used for this and if not are there any similar open-source tools that do this?
posted by CheeseDigestsAll at 8:03 AM on January 27, 2013
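
For the two-file comparison asked about above, a very rough sketch is to compare how often each text uses common function words. This is not JStylo's method, just a minimal illustration; the word list and the names `profile` and `style_similarity` are assumptions made up for the example.

```python
import math
import re
from collections import Counter

# A small hand-picked list of English function words (an assumption,
# not JStylo's feature set). Real stylometry uses far more features.
FUNCTION_WORDS = [
    "the", "of", "and", "to", "a", "in", "that", "it",
    "is", "was", "i", "for", "on", "with", "but", "not",
]


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def style_similarity(text_a, text_b):
    """Score between 0 and 1; higher means more similar function-word usage."""
    return cosine_similarity(profile(text_a), profile(text_b))
```

To compare two files, read each into a string and pass both to `style_similarity`. A high score on short texts means very little; the score only becomes suggestive with long samples and a baseline of other authors to compare against.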


Can we use this tool to create a metric space among writers? And then start a dating site to match those whose differences are small?
posted by Obscure Reference at 8:55 AM on January 27, 2013 [1 favorite]


Ah, like I'm going to download and install DARPA-funded software to see how anonymous I am ...

I remember reading Andrew Q. Morton's books on stylometric research back in the early 90s, and farting about with simple test schemes written in Icon. I was an unusual youth.
posted by scruss at 9:07 AM on January 27, 2013 [2 favorites]


There's a New Yorker article (Sorry, paywall!) about "forensic linguistics." As I recall, the conclusion was that while it could be a useful tool, it was not nearly as cut and dried as many people would like it to be. And, it is yet another thing you can hire an expert witness to testify about, if you can afford it.
posted by (Over) Thinking at 9:37 AM on January 27, 2013 [1 favorite]


pseudoscience like fingerprints

I had no idea.
posted by jeather at 9:45 AM on January 27, 2013 [1 favorite]


Hmm. I'm not hugely interested in hiding, but I've got a writing project in the works that involves different narrative voices, and I wonder if these tools might help with that.
posted by reprise the theme song and roll the credits at 11:08 AM on January 27, 2013 [1 favorite]


indubitable: "The courts tend to be pretty dubious of unproven technologies like this one.

Whereas they are mostly credulous w/r/t pseudoscience like fingerprints and polygraphs. A bunch of lawyers are hardly the gold standard for determining scientific validity."

Does it really matter when the black helicopters come for you?
posted by Samizdata at 11:13 AM on January 27, 2013 [1 favorite]


As I recall, the conclusion was that while it could be a useful tool, it was not nearly as cut and dried as many people would like it to be.

It's really not, even if you're dealing with a corpus of 21st-century English, and there's no one really solid method that's guaranteed to identify an author just yet. At best, we can say "I'm X% sure that text A is by author B".

It's quite a bit worse if you're dealing with, say, Elizabethan plays, where the text has been through the hands of editors and compositors before reaching readers. You can't depend on the punctuation being original to the author, or in some cases even word choice--for instance, scholars are still arguing about whether Hamlet's flesh that he wishes would melt, thaw, and resolve itself into a dew was meant to be "solid", "sullied", or even "sallied". (For my money, "solid" is the only word that makes sense there...but I digress.) For another example, you've got to be careful if you're going by the distribution of feminine endings, because printers could sometimes be haphazard about indicating verse--the entire Queen Mab speech in Romeo and Juliet is verse, but it's set as prose in the First Folio.

Apologies for the Shakespeare-themed derail, but this sort of thing is my bag, baby. There's a theory that's been around for over a century now that when writing Titus Andronicus, Shakespeare either collaborated with (or reworked an earlier piece by) a contemporary of his named George Peele. I'm trying to pick out what parts are Peele-ish, if any. (So far my test software is doing okay-ish...I've got 70% accuracy when I train it on a few early 1590s plays and test it on Comedy of Errors.)

Finally, on a six-degrees-of-Kevin-Bacon note: this is a really small field. I've met and had dinner with Patrick Juola, the guy who created the JGAAP framework the Drexel students are using.
posted by Mr. Bad Example at 11:15 AM on January 27, 2013 [6 favorites]
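
The train-on-known-texts, test-on-a-disputed-text setup described above can be sketched with a simplified version of Burrows's Delta, a standard stylometric distance. This is not the commenter's software or the JGAAP framework; the function names are hypothetical, and the z-scores are taken across author profiles rather than across individual texts, which is a simplification.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def rates(text, vocab):
    """Relative frequency in the text of each word in vocab."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocab]


def attribute(unknown, samples_by_author, n_words=30):
    """Return (closest_author, deltas) using a simplified Burrows's Delta.

    samples_by_author maps an author name to a list of known texts and
    must contain at least one non-empty sample.
    """
    # Vocabulary: the most frequent words across all known samples.
    pooled = Counter()
    for texts in samples_by_author.values():
        for text in texts:
            pooled.update(tokenize(text))
    vocab = [word for word, _ in pooled.most_common(n_words)]

    # One frequency profile per author, from their concatenated samples.
    profiles = {
        author: rates(" ".join(texts), vocab)
        for author, texts in samples_by_author.items()
    }

    # Mean and spread of each word's rate across the author profiles.
    columns = list(zip(*profiles.values()))
    means = [statistics.mean(col) for col in columns]
    spreads = [statistics.pstdev(col) or 1.0 for col in columns]

    def z_scores(vector):
        return [(x - m) / s for x, m, s in zip(vector, means, spreads)]

    z_unknown = z_scores(rates(unknown, vocab))
    deltas = {
        author: statistics.mean(
            abs(p - q) for p, q in zip(z_scores(vector), z_unknown)
        )
        for author, vector in profiles.items()
    }
    return min(deltas, key=deltas.get), deltas
```

The function always names the closest candidate, even when the true author is not among them, so the smallest delta is a ranking and not a probability. Getting to a statement like "I'm X% sure" needs held-out testing on texts of known authorship.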


Heh. That's funny. I just started reading G. Willow Wilson's Alif the Unseen, and the main character has just started modifying a keylogger to do style analysis and detection.
posted by dhruva at 11:16 AM on January 27, 2013


Even if stylometry doesn't rise to the level of proof, it can provide direction and lend weight to more admissible things such as discovery requests for IP logs.
posted by localroger at 11:39 AM on January 27, 2013 [2 favorites]


"Most people are not aware how sensitive their writing style can be," said Sadia Afroz, one of the researchers and a PhD candidate in computer science.

Is sensitive at all synonymous with distinct? I am thinking she means distinct and find this quote clanging on me.
posted by bukvich at 11:54 AM on January 27, 2013 [2 favorites]


>The courts tend to be pretty dubious of unproven technologies like this one.

Whereas they are mostly credulous w/r/t pseudoscience like fingerprints and polygraphs. A bunch of lawyers are hardly the gold standard for determining scientific validity.


First, polygraphs are generally not used in legal proceedings. In most jurisdictions, the person being polygraphed has to consent both to the taking of the polygraph and to its admission in court, and even then the judge has discretion to keep it out. In some jurisdictions they're just inadmissible period. They're certainly used in many contexts, including police investigations and security clearance protocols, but formal proceedings are not generally one of those contexts. Getting a polygraph into court is really hard.

Second, what the hell makes you think dactyloscopy is pseudoscience?
posted by valkyryn at 11:57 AM on January 27, 2013




I posted about the paper on this previously, but I like this post.
posted by BrotherCaine at 12:24 PM on January 27, 2013


pseudoscience like fingerprints

There's no doubt about it. It's the pseudoscience of fingerprints. That's why we must learn to live alone.
posted by hattifattener at 1:12 PM on January 27, 2013 [2 favorites]


If I remember right, one of the reasons the Unabomber was caught is that after his manifesto was published, his brother recognized the writing style and called the FBI.

Before the publication of the manifesto, Theodore Kaczynski's brother, David Kaczynski, was encouraged by his wife Linda to follow up on suspicions that Ted was the Unabomber.[77] David Kaczynski was at first dismissive, but progressively began to take the likelihood more seriously after reading the manifesto a week after it was published in September 1995. David Kaczynski browsed through old family papers and found letters dating back to the 1970s written by Ted and sent to newspapers protesting the abuses of technology and which contained phrasing similar to what was found in the Unabomber Manifesto.[78]

Prior to the publishing of the manifesto, the FBI held numerous press conferences requesting the help of the public in identifying the Unabomber. They were convinced that the bomber was from the Chicago area (where he began his bombings), had worked or had some connection in Salt Lake City, and by the 1990s was associated with the San Francisco Bay Area. This geographical information, as well as the wording in excerpts from the manifesto that were released prior to the entire manifesto being published, was what had persuaded David Kaczynski's wife, Linda, to urge her husband to read the manifesto.[79]

posted by thewalrus at 1:33 PM on January 27, 2013 [1 favorite]


from article: "It's been used to question or confirm the authorship of Shakespeare's plays, Homer's Iliad and Odyssey and St Paul's letters for hundreds of years."

It has been "used" to do these things dubiously and terribly. In fact, I think it's fair to say that "stylometry" has been less than worthless when it comes to the study of classics. Not least because it relies on the laughably idiotic assumption that people in the past had no idea that writing style exists and were wholly unconscious of the way they phrased their texts, and on the fervent belief that some dude with a questionable degree and a spreadsheet is more knowledgeable about literary style and the mark it leaves on a text than was Shakespeare. This is leaving aside the fact that considering the authorship of something thousands of years old that was written in a language that is essentially now a dead language and generally lost to the sands of time is already kind of inane. I mean - the Iliad and the Odyssey are almost the sole works we have in their dialect. It doesn't even make sense to begin to talk about authorship comparisons with such a small sample size. And the letters of Paul, well, those of us who believe in their divine provenance are not generally so stupid as to accept the notion that aloof researchers are more versed in their style than was St Paul himself. He was not that dumb.

Seriously, it is a fine thing to look over texts and notice what words are being used most often and how they are being used, and to consider what this means and how a given author's style affects her or his work. But people have been doing this for thousands of years. It's called "reading," and while it may be a more in-depth form of reading than most people nowadays are used to, it is reading nonetheless. I don't object to close reading of texts. What I object to is this intimation that somehow close reading amounts to a kind of modern technological marvel which suddenly allows us to magically identify the authors of texts.
posted by koeselitz at 11:17 PM on January 27, 2013 [2 favorites]


The DARPA association is not very strong evidence for the tool being compromised. DARPA also developed much of the technology necessary for the modern internet. In any case, the software is open source, so if you're worried you can get someone to audit it for you.
posted by LogicalDash at 4:04 AM on January 28, 2013



