an evolutionary cyclic approach to data-sharing
January 6, 2019 9:43 PM   Subscribe

Alice goes to the hospital in the United States. Her doctor and health insurance company know the details ― and often, so does her state government. Thirty-three of the states that know those details do not keep the information to themselves or limit their sharing to researchers. Instead, they give away or sell a version of this information, and often they’re legally required to do so. The states turn to you as a computer scientist, IT specialist, policy expert, consultant, or privacy officer and ask, are the data anonymous? Can anyone be identified? Chances are you have no idea whether real-world risks exist. Here is how I matched patient names to publicly available health data sold by Washington State, and how the state responded. Doing this kind of experiment helps improve data-sharing practices, reduce privacy risks, and encourage the development of better technological solutions.
- Only You, Your Doctor, and Many Others May Know, Latanya Sweeney. Technology Science. 2015092903. September 29, 2015.

Dr. Sweeney's first contribution involved linking de-identified patient-specific medical data to a population register (e.g., a voter list) to re-identify patients by name [cite, cite]. She then showed that "87% of the U.S. Population are uniquely identified by {date of birth, gender, ZIP}."

A decade ago, funding sources refused to fund re-identification experiments unless there was a promise that results would likely show that no risk existed or that all problems could be solved by some promising new theoretical technology under development. Financial resources were unavailable to support rigorous scientific studies otherwise.
posted by the man of twists and turns (10 comments total) 40 users marked this as a favorite
this would appear to bear deep reading on my part, as a near 20-year WA resident and current Apple Health enrollee.
posted by mwhybark at 10:00 PM on January 6, 2019 [1 favorite]

My assumption at this point is that everything I write down everywhere - medical forms, government records, all information in a commercial transaction, social media interactions - is collated and sold as soon as it is processed.
Anonymity is an illusion in the world of Big Data.
posted by madajb at 10:15 PM on January 6, 2019 [9 favorites]

Anonymity is an illusion in the world of Big Data.

Cool. So, legally prosecutable?
posted by filtergik at 4:09 AM on January 7, 2019

Simple probability says there's a 1/365 chance of your birthdate being any particular day of the calendar. And there's a (approximately) 1/2 chance of you being one or other of the bureaucratically-recognized genders. So we get a 1 in 730.5 probability of you matching a given birthdate/gender, and if fewer than 730.5 people live in your zip/postal code, then that's a hit.
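The back-of-envelope argument above can be sketched in a few lines. This is only the naive uniform model (365.25 days per year times two genders gives the 730.5 figure); real birthdates cluster, so it is a rough bound rather than a real risk estimate:

```python
# Naive uniqueness estimate for the {birthdate, gender} pair within a ZIP code.
# Assumes birthdates are uniform over the year and genders split evenly,
# which real populations only approximate.

def expected_matches(zip_population: int, days_per_year: float = 365.25,
                     genders: int = 2) -> float:
    """Expected number of residents sharing a given birthdate and gender."""
    return zip_population / (days_per_year * genders)

# If the expected count is below 1, a birthdate/gender pair is likely unique.
print(expected_matches(500))   # under 1: probably a unique hit
print(expected_matches(5000))  # several candidates remain
```

Adding the birth *year* (full date of birth) splits each of those candidate pools further, which is why the full DOB/gender/ZIP triple is so identifying.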

(I live in the UK. When I go to my GP surgery for an appointment, there's a happy friendly touchscreen that prompts me to choose my gender and year of birth from a drill-down menu; it almost always gets my name and appointment immediately without a supplemental day/month check. The GP practice has 3000-5000 patients and maybe 8 nurse/doctor practitioners seeing about 20-30 people/day/practitioner, but the computer can exclude patients at check-in who have already had their appointment, so ...)
posted by cstross at 7:17 AM on January 7, 2019 [3 favorites]

Anonymization and re-identification are two areas of research I'm pretty familiar with, though not an expert (I study other aspects of privacy).

Latanya Sweeney has done great work here in raising people's awareness of re-identification. Her work is cited in pretty much every privacy report looking at data sharing. However, regarding the reluctance to fund this line of research, I can also understand why. At the time, one of the big issues was that it was unclear how well any approach could ever work.

A famous attack very likely re-identified people in Netflix's challenge. By itself, Netflix's data seemed well anonymized, in that it didn't have anything other than User ID and ratings for movies. However, a team of researchers linked people's votes with votes from IMDB, under the assumption that people who rated a movie on Netflix probably rated it similarly and around the same time on IMDB.
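The linkage idea can be illustrated with a toy sketch. The data, field names, and scoring rule below are invented for illustration; the actual Narayanan–Shmatikov attack on the Netflix data used a more careful weighted similarity metric:

```python
# Toy linkage attack: match an anonymous ratings record to a named public
# profile by agreement on (movie, score, approximate date). All data here
# is made up for illustration.

anon = {"m1": (5, 100), "m2": (3, 102), "m3": (4, 110)}   # movie -> (stars, day)
named_profiles = {
    "alice": {"m1": (5, 101), "m2": (3, 103)},
    "bob":   {"m1": (2, 50),  "m3": (1, 400)},
}

def similarity(a, b, max_day_gap=14):
    """Count movies rated identically within a short time window."""
    score = 0
    for movie, (stars, day) in a.items():
        if movie in b:
            b_stars, b_day = b[movie]
            if stars == b_stars and abs(day - b_day) <= max_day_gap:
                score += 1
    return score

best = max(named_profiles, key=lambda name: similarity(anon, named_profiles[name]))
print(best)  # the anonymous record lines up with "alice"
```

The point of the real attack was that even a handful of matching (movie, rating, date) triples is enough to make one public profile stand out sharply from all the others.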

Another issue is generalizability. OK, you can re-identify people in this data set with enough elbow grease and by linking it to other data sets. What's the science here? It's like a hacker who demonstrates a buffer overflow vulnerability. OK, put it over there with the other 10,000+ buffer overflow attacks that we've already seen in the past 30 years. Yeah, it's a pain operationally, but in terms of science, there's no new knowledge there. And to a large extent, research on anonymity had the same problem.

Fortunately, there is an exciting line of research called Differential Privacy which is seeking to address these problems on a firm mathematical basis. The basic idea is that, given a limited set of query results, you can't easily tell whether a specific data element is in the data set or not. Again, I'm not an expert in this area, but this idea has already strongly influenced how the US Census releases its results, is built into Google Chrome to report which sites people visit in a privacy-preserving way, and is in use by Apple for telemetry (in some unclear way).
posted by jasonhong at 7:22 AM on January 7, 2019 [12 favorites]

The practical upshot of this is that you should always lie on paperwork, given the opportunity, if there's no valid reason for the possessor of one piece of information to need to link it with information possessed by someone else. Too many medical offices ask for way too much information, far in advance of it being clinically relevant, and often seem to do so just for convenience in their own recordkeeping (i.e. they want your SSN or DOB because it's a useful key, not because they actually need those values; a randomly-generated MRN would work fine).

I've never had anyone check my DOB on a piece of registration paperwork against my ID. It's just a lazy natural key. I've taken to generating them more or less randomly, and then putting a record into my password-keeper program (which is, admittedly, a single point of failure for my entire life) with whatever information I decided to give them.

The bigger hospitals and insurance companies actually seem to be a bit better about this than small offices. They generally have recordkeeping systems with a synthetic key (typically an "MRN") and they put that value on everything. This is presumably so you can show up with zero personal info—e.g. some John Doe who shows up in the ED, butt naked and amnesiac—and they can still track you through their internal system. But this is how it ought to be.
posted by Kadin2048 at 8:21 AM on January 7, 2019 [3 favorites]

I signed up for MeFi specifically to comment here that Dr. Sweeney is boss! Black woman STEM expert who has made a huge contribution to data privacy.
posted by schwinggg! at 10:23 AM on January 7, 2019 [4 favorites]

The basic idea is that, given a limited set of query results, you can't easily tell whether a specific data element is in the data set or not.

How interesting. I shall google further, but can you break this down any more? Do you mean that a user only gets returned a small set of data elements from a query?
posted by schwinggg! at 10:26 AM on January 7, 2019

Date of birth is not just the day and month; it also includes your birth year. I thought about it this way: when I was in high school, my high school class contained almost all the people in my zip code who were born the same year I was. There were about 230 kids in my graduating class, give or take a few. So spread 230 people out across 365 days in the year, and sure, probably a few shared birthdates, but most did not. Add in gender and it's around 165 people spread across 365 days, and the probability of shared birthdates goes down some more.
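That intuition checks out numerically. A minimal sketch, assuming birthdates are uniform over the year (real data isn't quite, but close enough for this purpose):

```python
# Expected number of people in a group of n whose birthday is shared by
# nobody else in the group, assuming uniform birthdates.

def expected_unique(n: int, days: int = 365) -> float:
    # P(a given person's birthday differs from all n-1 others)
    p_unique = (1 - 1 / days) ** (n - 1)
    return n * p_unique

print(expected_unique(230))  # over half the class has an unshared birthday
print(expected_unique(165))  # narrowing the pool raises that share further
```

So even before ZIP codes shrink the pool further, birth date plus year alone singles out a majority of a class-sized cohort.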

I went to high school in a smallish town with more kids coming in on the bus from the nearby countryside. The school catchment area didn't match the zipcode, but it would be similar enough in size. In the nearby larger city there were way more people, enough to have multiple high schools, but they also have multiple zipcodes.

So, yeah, given birthdate, gender, and zipcode, I believe you could identify almost everyone in my hometown. And a large portion of people in a denser town. 85% seems very reasonable as an average across the USA.
posted by elizilla at 10:58 AM on January 7, 2019 [3 favorites]

The basic idea is that, given a limited set of query results, you can't easily tell whether a specific data element is in the data set or not.

How interesting. I shall google further, but can you break this down any more? Do you mean that a user only gets returned a small set of data elements from a query?

The rough idea is that sufficient noise is added to the query results so that an optimal attacker can't discern whether a specific data element is in (or is not in) the original data set. So you don't get individual elements, but only aggregate statistics like averages, counts, stdevs, etc. This also means that you have to limit the number of query results released, since an optimal attacker with enough time and resources could figure things out.
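A minimal sketch of that noisy-aggregates idea, using the standard Laplace mechanism for a counting query (sensitivity 1). The epsilon value and data are illustrative; real deployments add privacy-budget accounting, composition tracking, and clamping on top of this:

```python
# Differentially private count via the Laplace mechanism.
# A count query has sensitivity 1 (one person changes the answer by at most 1),
# so Laplace noise with scale 1/epsilon suffices.
import random

def dp_count(values, predicate, epsilon=0.5):
    true_count = sum(1 for v in values if predicate(v))
    # Difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

ages = [34, 29, 41, 52, 38, 61, 27]
print(dp_count(ages, lambda a: a >= 40))  # true answer is 3, plus noise
```

Each released answer "spends" some of the privacy budget, which is why the number of query results has to be limited: average enough noisy answers to the same query and the noise washes out.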

Also, as an aside, one of the deployed differential privacy systems that I've found rather fascinating is Google's RAPPOR, which lets Google aggregate information about where people are going on the web. One of the key ideas is to use the Randomized Response Technique. For example, imagine you want to get an estimate of how many people are Communists in the 1950s State Dept, and no one wants to tell the truth outright. With Randomized Response, you might flip a coin: heads, say "yes"; tails, tell the truth. Given enough results, you can still estimate the total number of Communists in aggregate, while still offering plausible deniability to each individual.
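That coin-flip scheme can be sketched directly. With a fair coin forcing "yes" on heads, P(yes) = 1/2 + (1/2)·p, so the true rate p is recovered as 2·(observed yes rate) − 1. The population and true rate below are made up for illustration:

```python
# Randomized response: heads -> report "yes" regardless; tails -> report truth.
import random

def randomized_response(truth: bool) -> bool:
    return True if random.random() < 0.5 else truth

def estimate_true_rate(responses) -> float:
    yes_rate = sum(responses) / len(responses)
    # Invert P(yes) = 1/2 + (1/2) * p_true
    return 2 * yes_rate - 1

random.seed(42)
population = [random.random() < 0.2 for _ in range(100_000)]  # ~20% truly "yes"
responses = [randomized_response(t) for t in population]
print(estimate_true_rate(responses))  # close to 0.20
```

Any individual "yes" proves nothing (the coin may have forced it), yet the aggregate estimate converges on the true rate as the sample grows.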

From a data science perspective, it's a pretty exciting time, since these differential privacy techniques rest on a firm mathematical foundation, address some longstanding problems of re-identification, and are already showing potential in actual deployments rather than just scientific papers.
posted by jasonhong at 12:47 PM on January 7, 2019 [1 favorite]


This thread has been archived and is closed to new comments