

This debt we pay to human guile; with torn and bleeding hearts we smile, and mouth with myriad subtleties.
February 22, 2012 5:33 AM

We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors. [...] In over 20% of cases, our classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors; in about 35% of cases the correct author is one of the top 20 guesses.
On the Feasibility of Internet-Scale Author Identification[pdf] is a draft of a paper for the IEEE Symposium on Security and Privacy.
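
To make "comparing the writing style against a corpus of texts of known authorship" concrete, here is a toy sketch. It is not the paper's classifier: it assumes a tiny hand-picked list of function words as the only feature and ranks candidates by cosine similarity. The names FUNCTION_WORDS, style_vector, and rank_authors are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list; real systems use far richer features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "i", "but", "not"]


def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_authors(anonymous_text, corpus):
    """corpus maps author name -> known text. Returns author names, best match first."""
    target = style_vector(anonymous_text)
    scores = {author: cosine(target, style_vector(text)) for author, text in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Calling rank_authors(some_text, {"alice": "...", "bob": "..."}) returns the candidate names ordered by similarity, which is the shape of the "top 20 guesses" result quoted above, though nothing here should be expected to scale to 100,000 authors.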

Covered yesterday by Cory Doctorow, via the blog of Narayanan, one of the paper's authors. If you want to build a tool to obfuscate your text, you might start by reading Obfuscating Document Stylometry to Preserve Author Anonymity[pdf] by Gary Kacmarcik & Michael Gamon, and Practical Attacks Against Authorship Recognition Techniques[pdf] by Michael Brennan, who presented at the 26th Chaos Communication Congress. Unfortunately, the Automouth tool does not appear to be out of beta.
posted by BrotherCaine (22 comments total) 15 users marked this as a favorite

 
*anonymouth* rather
posted by BrotherCaine at 5:40 AM on February 22, 2012


it wasn't me
posted by fistynuts at 5:48 AM on February 22, 2012


There are perhaps some interesting applications to detecting astroturfing here; certainly Wikipedia might benefit. Also, does anyone know if author identification tricks handle Google-translated documents reasonably?
posted by jeffburdges at 5:50 AM on February 22, 2012 [1 favorite]


I'm going to read this paper, but a quick skim of the first few pages suggests something about stylometric analysis that I personally find chilling: stylometric analysis does not use all of the available information to identify the author. They don't look at content for clues.

What I mean by this is that -- if I'm reading the paper right -- once this analysis has narrowed the pool down to 20 possible authors, you can still read through the corpus looking for the bits that say things like "it's snowing today" and "here in the Midwest" which will enable you to further rule out authors.
posted by gauche at 5:53 AM on February 22, 2012 [2 favorites]


Privacy is the hardest thing.
posted by Grimp0teuthis at 5:57 AM on February 22, 2012


Fuck, they're gonna figure out I'm really a dog!
posted by From Bklyn at 5:59 AM on February 22, 2012 [5 favorites]


Note to self: obtain some new magazines from which to cut the letters.
posted by three blind mice at 6:00 AM on February 22, 2012 [1 favorite]


4 yeers i has been usin dis anti stylometry tool when writin mah communiquez 2 teh resistance.
posted by Foci for Analysis at 6:04 AM on February 22, 2012 [1 favorite]


Wasn't there something like this in one of Douglas Coupland's novels?
posted by jonmc at 6:15 AM on February 22, 2012


This, combined with that post about face recognition dazzle makeup from a few weeks ago, could make for the basis of a really neat cyberpunk novel...
posted by showbiz_liz at 6:39 AM on February 22, 2012


4 yeers i has been usin dis anti stylometry tool when writin mah communiquez 2 teh resistance.
I just had a flashback to JIVE.EXE.
posted by Wolfdog at 6:44 AM on February 22, 2012


This type of thing is one of the reasons that "anonymized" data is such a joke. You can't just strip the usernames and expect everything to be okay. Pretty much anything that is not aggregated for populations will surrender to analysis eventually.
posted by Nothing at 6:47 AM on February 22, 2012


This type of thing is one of the reasons that "anonymized" data is such a joke. You can't just strip the usernames and expect everything to be okay. Pretty much anything that is not aggregated for populations will surrender to analysis eventually.

Yep. I posted a link to this paper last week: it seems that some 87.1% of respondents to the 1990 census can be positively identified by reference to just three facts: birthdate, gender, and zip code.
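
For anyone who wants to check a figure like this against their own "anonymized" dataset, here is a minimal sketch. It assumes each record is a dict with birthdate, gender, and zip fields (those field names and the function name unique_fraction are assumptions) and reports the share of records whose combination of those three values appears exactly once.

```python
from collections import Counter


def unique_fraction(records, keys=("birthdate", "gender", "zip")):
    """Fraction of records whose combination of `keys` is shared with no other record."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(combos) if combos else 0.0
```

A record with a unique combination can be matched to any outside list that carries the same three fields, so a high fraction means stripping names did very little.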

It is amazing how much information is out there, and how much of a Pandora's box computing is turning out to be, in terms of making that information persist.
posted by gauche at 6:57 AM on February 22, 2012 [4 favorites]


We are rapidly entering a new social age. It's an upheaval: our traditions of privacy and anonymity are going, going, gone. Used to be you could up and run away, move off to escape one's reputation. No more. Used to be one's idiocy was noted only by one's friends and co-workers. Now it's likely to be youtubed. Etc.

IMO we are undergoing a massive social change. As an old fart, that scares me.
posted by five fresh fish at 7:11 AM on February 22, 2012 [1 favorite]


I think gauche hits the nail on the head. A 20% success rate for this technique doesn't tell the whole story. If you combine this with the tools already out there (knowledge base, timing information) this success rate will be much higher.
posted by kuatto at 7:17 AM on February 22, 2012


As the EFF has shown us, you can identify your browser in exactly the same way.

There's nothing new about this, but it's becoming increasingly easy to do. Instead of madly hiding, we should have checks in place against the entities that would exploit or punish us for expressing ourselves.

It's an ongoing issue with computer security. Privacy and security are great to a certain level, but if someone really wants to get you, there's always a way. Build yourself a Maginot Line of firewalls, and they'll get better at phishing. :)
posted by Stagger Lee at 8:08 AM on February 22, 2012 [1 favorite]


@five fresh fish

its okay, the world and especially the internet public is 100% just, you can trust their judgment
posted by This, of course, alludes to you at 8:24 AM on February 22, 2012


The lack of privacy/anonymity/pseudonymity is VERY concerning in light of who can access and abuse the information. It can have a very chilling effect on conversations from people belonging to marginalised/vulnerable groups. Not only Governments acting surreptitiously, but also Anonymous themselves doxxing or stalkers targeting individuals. Interesting how the people most likely to use this information themselves hide their identities.
posted by saucysault at 10:00 AM on February 22, 2012


...stylometric analysis does not use all of the available information to identify the author. They don't look at content for clues.

I don't think the constraints adopted were because stylometric analysis is supposed to be free from content analysis, but rather that they were proving the more general case first so that it would be demonstrated that text in a specific jargon-laden or topic-focused domain (technical job, academic article, BDSM erotica, activist rant, etc.) could easily be linked to text outside of that domain and vice versa. It seems apparent that other than tools that ruthlessly purge idiosyncrasies of style or slavishly ape a target style (frame job), the future of anonymity is going to be narrower and narrower. With the body of text most of us have out on the public Internet, any past text posted (ano/pseudo)nymously will be exposed even if we do protect ourselves going forward.

Clearly, as the authors state in the paper, 35% is just the tip of the iceberg.

There are several solutions to the problem of keeping your (ano/pseudo)nymity in the future. One is to restyle or obfuscate your (ano/pseudo)nymous postings, another is to obfuscate your public text that's associated with your name, a third is to create an army of obfuscated sock puppets so that your history is less traceable, and the last is to create false identities that map to real people which slavishly ape your style (although there's a time stamp issue and an ethical issue there for most of us).

Perhaps we could form clubs such that our named public postings serve as stylometric covers for each others' (ano/pseudo)nyms with perfect multilateral reciprocity. Sort of the identity mapping equivalent of a Tor network.

Another thing to consider is how to sign public comments in such a way that you can authenticate them as your own after the fact. As well, I'm not sure what to do about torch and pitchfork provoking comments that ape your style right down to the defensive and angry denial when supposedly called out on it.
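
One rough way to do the "authenticate after the fact" part, sketched with Python's standard hmac module: publish a keyed tag alongside each comment and keep the key secret. This is only an illustration under assumptions. The function names are made up, the tag only helps if it was visibly posted with the comment at the time, and revealing the key to prove authorship lets anyone forge tags from then on, so a public-key signature would be the sturdier choice.

```python
import hashlib
import hmac
import secrets


def new_secret():
    """Generate a random 32-byte key to keep private."""
    return secrets.token_bytes(32)


def comment_tag(secret, comment):
    """HMAC-SHA256 tag to publish next to the comment text."""
    return hmac.new(secret, comment.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_tag(secret, comment, tag):
    """True if `tag` was produced from this exact comment with this secret."""
    return hmac.compare_digest(comment_tag(secret, comment), tag)
```

A forged comment aping your style would carry either no tag or one that fails verify_tag, which addresses the impersonation worry at least for readers who know to look for the tag.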
posted by BrotherCaine at 11:21 AM on February 22, 2012


As well, I'm not sure what to do about torch and pitchfork provoking comments that ape your style right down to the defensive and angry denial when supposedly called out on it.

That wasn't me! How dare you!?
posted by gauche at 1:25 PM on February 22, 2012


So did Peter Gleick write that thing the Heartland Institute says is a fake? Or not? Enquiring minds want to know.
posted by jfuller at 1:46 PM on February 22, 2012


Yikes.
posted by LobsterMitten at 9:10 PM on February 22, 2012




This thread has been archived and is closed to new comments