Taxis, Rainbows and Stars
October 16, 2014 1:23 AM

Earlier this year, Chris Whong made a FOIL request to the New York City Taxi and Limousine Commission, receiving fare and trip data for all licensed cabs in New York in 2013. (previously) The data was anonymised, but as Vijay Pandurangan realised, only partially.

As Pandurangan explains:
This anonymization is so poor that anyone could, with less than 2 hours' work, figure out which driver drove every single trip in this entire dataset. It would even be easy to calculate drivers’ gross income, or infer where they live.
It was taken further, though. In a recent essay on data privacy, the author demonstrates that the dataset can be used to track individuals as well as taxis.
Examining one of the clusters [of customers of the Hustler Club] in the map above revealed that only one of the 5 likely drop-off addresses was inhabited; a search for that address revealed its resident’s name. In addition, by examining other drop-offs at this address, I found that this gentleman also frequented such establishments as “Rick’s Cabaret” and “Flashdancers”. Using websites like Spokeo and Facebook, I was also able to find out his property value, ethnicity, relationship status, court records and even a profile picture!

Of course, I will not publish any of this information here, but this was by no means a unique situation. While the online availability of all this potentially private information is part of a wider discussion, it’s fair to say that this guy has a right to keep his nighttime activities a secret.
Via
posted by frimble (12 comments total) 13 users marked this as a favorite
 
This is why privacy is dead. Not because of what people choose to reveal about themselves on Facebook or other online venues, but because of the amount of remarkably powerful data that companies and governments can collect about us indirectly and even without focused intent.
posted by sonic meat machine at 4:52 AM on October 16, 2014 [8 favorites]


That's why I like my hash … salted
posted by scruss at 5:28 AM on October 16, 2014 [8 favorites]


This is why privacy is dead. Not because of what people choose to reveal about themselves on Facebook or other online venues, but because of the amount of remarkably powerful data that companies and governments can collect about us indirectly and even without focused intent.

I am reading a noir novel from about 1960 right now, and precisely the things that people in the novel are doing for privacy are the things that most expose us now to commercial and governmental scrutiny, including telephone calls, hotel stays, and taxi rides. It's to the point where I wouldn't be surprised at all to learn that my car is uploading all of its data (including gps information from the navigation system) every night, or that my library check-outs are cc'ed to the NSA and Amazon, because my default expectation now is both full governmental intrusion and complete commercial data collection.

I have more or less made my peace with it because it seems entirely unchangeable at the moment, and I have the privilege of living a life where I am not needing to hide something like my sexuality or political beliefs for fear of persecution, but it also makes me aware of how much everyday privacy we have so casually lost.
posted by Dip Flash at 5:29 AM on October 16, 2014 [1 favorite]


Ouch. I likely would have made the MD5 mistake.
posted by Tell Me No Lies at 5:36 AM on October 16, 2014 [1 favorite]


It's not just that companies and governments can collect this data, but that we now have the tools to process it. Indeed, this kind of data has been around pretty much as long as civilization. What is new is the ability of any random person with a bit of knowledge to find the correlations in a set of millions of records. The truth is, it takes work -not- to collect such data. The world is leaky as hell and we leave traces everywhere.

This was a pretty obvious rookie mistake, but note that there are many ways to deanonymize this data without being able to reverse the hashes. You could compare tax records to estimated income, travel or hospital records to gaps in work, immigration or work permit records to starting dates, cell phone location data, posts in taxi driver forums that mention locations... And the point about tracking individuals instead of drivers applies regardless of how well obscured the license numbers are.
posted by Nothing at 5:41 AM on October 16, 2014 [2 favorites]


or that my library check-outs are cc'ed to the NSA and Amazon

FUNNY YOU SHOULD MENTION. Librarians are pretty fuckin' mad about it.
posted by clavicle at 5:46 AM on October 16, 2014


On a guess, I used Facebook search to show me people who live in New York and work as a driver. It's pretty easy to filter that down to taxi drivers. There are thousands of public profiles that match. How many could be correlated to this data? Probably quite a few.
posted by Nothing at 5:48 AM on October 16, 2014


My puzzlement: what does this all mean for the "open data" movement? You can anonymize data. But you can't prevent inferences of identity so easily in an age of popularized machine learning. If data are so scrubbed that inferences become unlikely, then it is likely that the data no longer derive useful information.
posted by 3.2.3 at 6:20 AM on October 16, 2014


"It's OK, we're only collecting data in the aggregate."
posted by stupidsexyFlanders at 6:38 AM on October 16, 2014


Taxis, Rainbows and Stars

Isn't that what Jimi called his band at Woodstock?
 
posted by Herodios at 7:41 AM on October 16, 2014


3.2.3: My puzzlement: what does this all mean for the "open data" movement? You can anonymize data. But you can't prevent inferences of identity so easily in an age of popularized machine learning. If data are so scrubbed that inferences become unlikely, then it is likely that the data no longer derive useful information.

This is addressed in the "that the dataset can be used to track individuals as well as taxis." link. It is possible to add noise to the data in such a way that specific queries are properly scrubbed, but the aggregate data stays meaningful.
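
For anyone wondering what "adding noise" looks like in practice, here is a rough sketch of one standard differential privacy technique, the Laplace mechanism applied to a counting query. This is not necessarily what the linked essay does; the record layout, the `block` field and the `epsilon` value are all made up for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer a counting query with Laplace noise.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical drop-off records; "block" is an invented field name.
dropoffs = [
    {"block": "A"}, {"block": "A"}, {"block": "A"},
    {"block": "B"}, {"block": "C"},
]

rng = random.Random(2014)  # fixed seed so the sketch is repeatable
one_answer = noisy_count(dropoffs, lambda r: r["block"] == "A", 0.5, rng)

# A single answer is blurred, but the noise has mean zero, so the
# average of many answers lands near the true count of 3.
answers = [
    noisy_count(dropoffs, lambda r: r["block"] == "A", 0.5, rng)
    for _ in range(20000)
]
average_answer = sum(answers) / len(answers)
```

A smaller `epsilon` means more noise per answer and stronger protection for any one rider. A real release also has to budget how many queries get answered, since averaging repeated answers (as the last few lines do) washes the noise back out.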

That said, that sort of scrubbing is another level entirely from the more basic type of "md5 the medallion number" scrubbing that the TLC still managed to screw up. And given what little I know about cryptography, small, easily-overlooked variables can make your seemingly impenetrable and robust system trivially easy to circumvent, so even if "Differential Privacy" is sound, it may be exceedingly easy to screw up.
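
To make the md5 problem concrete, here is a minimal sketch of why an unsalted hash over a small ID space gives no protection. The four-character format below (digit, letter, digit, digit) and the ID "7B42" are assumptions for illustration, not the TLC's actual scheme; the keyed variant at the end is one common alternative, not something the TLC is known to have used.

```python
import hashlib
import hmac
import itertools
import string


def md5_hex(text: str) -> str:
    """Unsalted MD5 of an ID, as a hex string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_ids():
    """Yield every ID in the assumed format: digit, letter, digit, digit."""
    parts = (string.digits, string.ascii_uppercase, string.digits, string.digits)
    for combo in itertools.product(*parts):
        yield "".join(combo)


# Precompute hash -> ID for the whole keyspace (10 * 26 * 10 * 10 = 26,000).
lookup = {md5_hex(cid): cid for cid in candidate_ids()}

# An "anonymised" value as it might appear in a released row (made-up ID).
released = md5_hex("7B42")
recovered = lookup[released]  # one dictionary lookup undoes the hash


def keyed_pseudonym(text: str, secret_key: bytes) -> str:
    """HMAC-SHA256 pseudonym; cannot be enumerated without the secret key."""
    return hmac.new(secret_key, text.encode("ascii"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be random and never published.
pseudonym = keyed_pseudonym("7B42", b"hypothetical-secret-key")
```

The keyed version still maps the same cab to the same pseudonym on every row, so it only closes the lookup-table hole. It does nothing about working out who someone is from where and when they were picked up and dropped off.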

So obviously while it's probably possible to create open datasets that are anonymous but also worth working with, it seems unlikely that many organizations will produce them. We will continue to see organizations either keeping this data private based on these concerns, or making them public-but-poorly-anonymized.
posted by grandsham at 8:16 AM on October 16, 2014 [2 favorites]


Isn't that what Jimi called his band at Woodstock?


No, you're thinking of the short-lived '70s supergroup featuring Joni Mitchell, Ritchie Blackmore, and Alex Chilton.
posted by TedW at 8:18 AM on October 16, 2014 [4 favorites]

