protecting outliers
June 30, 2021 11:28 AM   Subscribe

It's important for the US Census to collect and publish a lot of information. It's also important for individual respondents to retain their privacy. Re-identification techniques pose a problem, so The Markup's Julia Angwin interviews Cynthia Dwork, one of the creators of the "differential privacy" approach, about how differential privacy could help ensure the US can meet both goals.
posted by brainwane (13 comments total) 17 users marked this as a favorite
 
Apparently they are testing this idea on the NYC primary election results. (joke)
posted by save alive nothing that breatheth at 12:04 PM on June 30, 2021


This work is really subtle and interesting. They're at the forefront of solving a very pernicious problem in our era of big data. It's far too easy to pick an individual out of a dataset even if it's been anonymized. The Census asks sensitive questions and publishes a lot of public data; I admire their efforts to get ahead of the implicit privacy problems.

There's pushback though. The Economist this week has an article about the complaints: An important census product may soon use synthetic data. The concern is that the injected noise might disrupt some kinds of useful research.

The Economist article isn't detailed enough to evaluate those concerns; certainly the Census Bureau has considered the problem already! I can't tell if the complaints are from statistics-ignorant people who just don't like anything complicated. See also: the refusal to use statistically corrected data for US political representation. Although that also has a political / racist component to it, it's a tool explicitly used to disenfranchise minorities.
posted by Nelson at 12:06 PM on June 30, 2021 [2 favorites]


Another privacy researcher walks through an example of reconstruction from 2010-style census data, and how it can link an individual's census answers to what commercial data brokers have on them.
posted by away for regrooving at 12:06 PM on June 30, 2021 [2 favorites]


Really fascinating topic. I’ve been exposed to Differential Privacy issues through both my day job and my side hustle in partisan gerrymandering and redistricting. On the day job side, DP seems to be used either as a nerds’-nerd finger trap for excessively clever data scientists or as a helpful measurement of privacy loss in datasets we publish.

On the redistricting side, we’ve seen a bunch of worries about the trustworthiness of US Census data being post-processed through DP algorithms. I’ve become convinced of their value through familiarity with the existing, well-established presence of noise and uncertainty in decennial results. DP advocates like Simson Garfinkel persuasively argue that publishing an epsilon value alongside Census data simply quantifies the amount of noise that was already present. In some places this is becoming a moot point: states like Illinois, Colorado, and Oklahoma are releasing proposed new district plans right now based on ACS statistics rather than not-yet-released decennial counts.
posted by migurski at 1:35 PM on June 30, 2021 [2 favorites]


This is really interesting. Thanks! (I don't entirely understand it yet, but looking into it is on my list of things to do when procrastinating.)

As someone who deals with noisy data that has nothing to do with humans, my knee-jerk thought is that injecting noise is a problem unless your model also perfectly understands all the correlations that someone might ever want to look for. e.g., to use the article's example, if the difference between a town's population being 499 and 501 happened to have an impact on federal school funding that the data-modifiers didn't notice, that information would be obscured. But, if everything is well documented, it's probably worth it. Especially given the job the US census is actually charged with doing.
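To make that worry concrete: a toy simulation (the hard 500-person cutoff and the noise scale are invented purely for illustration, not drawn from any real funding formula or from the Bureau's actual mechanism) shows how often a small symmetric perturbation flips a town across the threshold:

```python
import random

random.seed(1)

THRESHOLD = 500  # hypothetical funding cutoff, per the example above

def noisy_count(true_count, scale=2.0):
    """Add rounded symmetric noise of a few people, loosely mimicking
    the small perturbations a disclosure-avoidance system injects."""
    return true_count + round(random.gauss(0, scale))

# A town of 499 people, published 10,000 independent times
flips = sum(noisy_count(499) >= THRESHOLD for _ in range(10000))
print(f"Published count crossed the threshold in {flips} of 10000 trials")
```

With a true count just below the cutoff, the published count lands on the wrong side of it a large fraction of the time, which is exactly the kind of unnoticed downstream effect being described.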
posted by eotvos at 3:46 PM on June 30, 2021 [1 favorite]


Here is a very good article about database reconstruction attacks (ACM Queue, no paywall), with differential privacy being the main approach for preventing it.

The basic idea behind database reconstruction is that if you have a lot of summary statistics about a dataset, you can reconstruct a lot of the individual rows. Here is another really good article (Communications of the ACM) that highlights the issue:
One of those employees, John Abowd, associate director for research and methodology at the Bureau, worked with a team to investigate whether advances in computing power could enable database reconstruction attacks on the U.S. Census.

The results were shocking.

Abowd and his team retroactively used database reconstruction techniques on these public data summaries, and found they could use advanced computational power and techniques to recreate private data that was never meant to be public.

In fact, Abowd and his team found they could reconstruct all the records contained in the database with approximately 50% accuracy. When they allowed a small error in the age of an individual, the accuracy with which they could associate public data with individuals went up to 70%. And if they allowed getting one piece of personal information like race or age wrong, but everything else right, their reconstruction was more than 90% accurate.
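The flavor of a reconstruction attack is easy to sketch. Given only a few published summary statistics for a tiny hypothetical block (all the numbers below are invented; real attacks use constraint solvers over thousands of published tables), brute force narrows the possible individual records to a handful:

```python
from itertools import combinations_with_replacement

# Toy published summary for one tiny block (illustrative numbers,
# not from any real Census table).
POP, MEAN_AGE, MEDIAN_AGE, NUM_MINORS = 3, 30, 24, 1

# Enumerate every non-decreasing triple of ages 0..100 and keep the
# ones consistent with all four published statistics.
solutions = [
    ages
    for ages in combinations_with_replacement(range(101), POP)
    if sum(ages) == MEAN_AGE * POP
    and ages[POP // 2] == MEDIAN_AGE
    and sum(a < 18 for a in ages) == NUM_MINORS
]
print(len(solutions), solutions[:3])
```

Out of roughly 177,000 possible blocks, only 18 survive, and in every one the middle resident's age is pinned exactly while the adult's age is tied directly to the child's. One more published statistic would often pin the records uniquely, and linking against commercial broker data finishes the job.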
The main reason that Differential Privacy has been seeing traction is that, compared to previous approaches (swapping, k-anonymity, etc), it has a firm mathematical and statistical foundation. Differential Privacy is also already in use in Google Chrome, Apple iOS (though unclear how they are using it), and the US Census. However, there are still lots of challenges, e.g. with respect to usability in choosing parameters and distortion of statistics due to noise.
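For the curious, the core mechanism is small. Here's a minimal sketch of the Laplace mechanism for a counting query (this is the textbook construction, not the Census Bureau's actual TopDown algorithm, which layers much more machinery on top):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon)

print(dp_count(499, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the usability challenge mentioned above is precisely that nobody agrees on how to pick epsilon.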
posted by jasonhong at 6:40 AM on July 1, 2021 [4 favorites]



Migurski and others,

Are the concerns raised by Steven Ruggles (PDF; twitter thread and recent working paper [PDF]) that the Census' most recent implementation will result in nearly useless data at the census-block level legitimate?

(This is not to say that the privacy concerns are not valid - they are! But there are tradeoffs in the use of differential privacy techniques which should be discussed, and if privacy is prioritized over data accuracy, that decision should be explicitly acknowledged; none of this was mentioned in The Markup's article.)
posted by fizzix at 9:29 AM on July 2, 2021


Are the concerns raised by Steven Ruggles (PDF; twitter thread and recent working paper [PDF]) that the Census' most recent implementation will result in nearly useless data at the census-block level legitimate?

Absolutely. CBGs are very small (I think there's a quarter of a million of them), and moving noise around at that level basically invalidates many uses.
posted by MisantropicPainforest at 9:40 AM on July 2, 2021


I've been looking for good, informed complaints about exactly what problems the noise injection will create. Fizzix's links are definitely to a qualified critic, but I'm not sure how serious the problems they point out are.

The linked paper isn't about the quality of the fuzzed data. Instead it aims to show there's not really a privacy problem, by arguing that the privacy-busting examples the Census demonstrated are actually just luck: the same results would be found by chance (instead of being true correlations). I'm not expert enough to evaluate this statistical argument.

The Twitter thread is mostly establishing that Census has upped the amount of noise they're injecting at the block level (the smallest unit, and therefore most problematic for privacy). He highlights absurdities like blocks that show people living there but no occupied housing units, blocks with children but no adults, etc. IMO these absurdities are fuzzing acting as intended. If your approach to adding noise is to occasionally add or subtract children and adults to random blocks, you're going to see some blocks with children and no adults. The main point of the tweet thread is there's more noise at the block level than previously disclosed; that certainly seems plausible. Census should be transparent about exactly what kinds of noise they are injecting and how much.
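It's easy to see why those absurd blocks must appear. A toy simulation (the noise distribution and scale here are my guesses; the Bureau hasn't fully disclosed theirs at the block level) of many tiny blocks, each truly containing one child and one adult:

```python
import random

random.seed(42)

def noisy(count, scale=1.5):
    """Crude stand-in for block-level noise: add rounded symmetric
    noise, clamping at zero since published counts can't be negative."""
    return max(0, count + round(random.gauss(0, scale)))

# 10,000 toy blocks that each truly hold 1 child and 1 adult
absurd = 0
for _ in range(10000):
    children, adults = noisy(1), noisy(1)
    if children > 0 and adults == 0:
        absurd += 1
print(f"{absurd} of 10000 noisy blocks show children but no adults")
```

Under these made-up parameters, roughly a quarter of the toy blocks end up showing children with no adults, purely as an artifact of independent noise on small counts.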

But I'm still wondering about the larger question; does the noise really impact real research? Sure it invalidates some uses, but does it invalidate important uses? I.e.: is anyone really looking at census block data to find out how many children live without adults? I don't think so, but if they are then yes, now that just got a lot harder.

Census blocks have always been problematic statistically; their tiny size means they've always had wild statistical variance (see also: the map of census blocks with 0 population). Most serious research I know of is done at the census tract level, areas carefully designed for usefulness for demographic statistics. (For a sense of scale: there are 11M census blocks in the 2010 census and 74,000 census tracts.)

Also this is kinda off-topic but wringing our hands over privacy of the census is sort of funny in our era of surveillance capitalism. Facebook and Google know way, way more than Census does about individual demographics in the US. So do political analysts; voter file data is remarkably detailed. But Census is a government project and should be held to a higher privacy standard. Also the whole point of Census is to provide accurate data, and the goal of this fuzzing is to make people feel safe enough to answer the questions.
posted by Nelson at 9:47 AM on July 2, 2021


I.e.: is anyone really looking at census block data to find out how many children live without adults?

Yes?

"Absent some 11th hour stop to this DP madness, it will be virtually impossible for all but a few privileged researchers to study residential segregation with Census data. Or environmental racism. Or any other micro-level instance of racial injustice."

https://twitter.com/blfraga/status/1410894851013591049
posted by MisantropicPainforest at 9:50 AM on July 2, 2021


Thanks for highlighting that tweet, MisantropicPainforest, that is an important point. Naively I'd thought studies of segregation and environmental racism were doable or even better done at the census tract level, not census blocks. The maps of census block stuff are cool but they have always been statistically problematic. But I'm way out of my expertise here and don't really know the research landscape.

The flipside argument is that census data is already not good for, say, studying racism because there's racial bias in who trusts the Census Bureau and the ACS enough to answer questions. We know this bias is a problem, it's been a major point of contention in Trump's corrupted 2020 Census. We know current Census methods lead to lack of representation for some racial minorities and particularly non-citizen immigrants. (Census tries to correct for some of these biases in their demographic publications, but the corrections aren't allowed for use in the political count.)

I get the concerns about the fuzzing! I'd hate to lose things like the "Nobody Lives Here" map I linked above. What I don't understand is just how serious the loss is for real sociological and political research. Census started telling us they were going to add noise years ago, surely there's been plenty of time for affected researchers to have their say? I'd like to read more about it!
posted by Nelson at 9:57 AM on July 2, 2021 [2 favorites]


I think the concerns are legit though I am generally suspicious of block-level demographics anyway. The further we get from April 1, 2020, the more likely those block counts are to have changed due to people moving around. Anyone in the past using decennial numbers at the highest granularities either brushed this aside or was simply unaware. It’d be more typical to use ACS numbers which come in a range of time spans with explicit margins of error always attached. I think the Census is doing the right thing by surfacing and emphasizing the added noise.
posted by migurski at 11:29 AM on July 2, 2021 [1 favorite]


Also, the quantitative promise of differential privacy is that the block-level noise disappears at higher aggregations. If you’re using block-level data, you shouldn’t be making arguments about specific blocks. You should be putting them together into larger analytical areas.
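That promise is easy to check numerically. A quick sketch (the per-block population and noise scale below are made up) showing relative error shrinking as blocks are aggregated:

```python
import math
import random

random.seed(7)

def laplace(scale):
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

true_block = 25   # assume every toy block truly holds 25 people
scale = 2.0       # per-block noise scale (made-up value)

for n_blocks in (1, 100, 10000):
    noisy_total = sum(true_block + laplace(scale) for _ in range(n_blocks))
    true_total = true_block * n_blocks
    rel_err = abs(noisy_total - true_total) / true_total
    print(f"{n_blocks:>6} blocks: relative error {rel_err:.4f}")
```

The noise added to each block is independent and zero-mean, so it partly cancels in sums: the total grows linearly with the number of blocks while the noise grows only as its square root, and the relative error falls off accordingly.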
posted by migurski at 11:32 AM on July 2, 2021




This thread has been archived and is closed to new comments