The Internet With a Human Face
May 27, 2014 2:57 AM

"These big collections of personal data are like radioactive waste. It's easy to generate, easy to store in the short term, incredibly toxic, and almost impossible to dispose of. Just when you think you've buried it forever, it comes leaching out somewhere unexpected." A talk by Maciej Ceglowski, founder of Pinboard, about why we have Big Data and why it's frightening.

Check his last talk (previously) if you missed it.
posted by 23 (48 comments total) 89 users marked this as a favorite
 
Man, his talk made me really glad all my fanfiction recs are on Pinboard rather than Delicious, which had me log in with my Yahoo account (and all that related data).

Thanks, Maciej, for giving me a place where I can keep at least one tiny part of my online life separate from the rest. Except when I keep linking to it. Because I'm stupid.

And that's why his talk is important, because people are stupid, and we'll keep giving these companies all our data because we just go "Oh, it's just one time, no one's paying attention."

(Also, all slideshows should have regular baby animal slides. I demand this rule.)
posted by Katemonkey at 3:12 AM on May 27, 2014


My new phrase for the week is "privacy-seeking behavior".

Excellent talk by someone who is a genuine treasure.
posted by hwestiii at 4:06 AM on May 27, 2014 [3 favorites]


This was very good.
posted by bystander at 4:46 AM on May 27, 2014 [1 favorite]


Every time someone posts a Maciej talk, I wish I could figure out how to work pinboard into my day-to-day usage. I just can't seem to get over the "form a habit" step.
posted by DigDoug at 4:57 AM on May 27, 2014


This comment has been deleted before posting.
posted by fairmettle at 5:15 AM on May 27, 2014


DigDoug, I don't know if this will be a great deal of help, but I use this Chrome extension for Pinboard. It sits right next to the omnibox, so it requires just a little less awareness for me to use. I am not a huge bookmarker, but this helped wean me away from the Firefox/Delicious combo I used before moving to Pinboard.
posted by hwestiii at 5:31 AM on May 27, 2014 [2 favorites]


I'm not a cheerleader for government spying, but it continues to irk me how the American public is told how great corporate data collection is while government data collection is evil. I hope this talk is the sort of thing that begins to spread into more mainstream channels to get people like my parents thinking about what they voluntarily give away.
posted by lownote at 5:41 AM on May 27, 2014 [4 favorites]


Often the first step in treating cancer is surgical excision of the active lumps. To rescue the internet we should surgically excise the technolibertarian scourge that is Silicon Valley. Only then can we move on to implementing a more expensive but well-regulated, privacy-focused, and self-managed internet.
posted by Poldo at 6:21 AM on May 27, 2014 [4 favorites]


His talk on startup culture, Barely Succeed! It's Easier!, is also excellent.
posted by Hlewagast at 6:36 AM on May 27, 2014 [7 favorites]


Note that by using the "omnibox" with Chrome, you are sending all of your data to Google, whether you want to search or not. This is why Firefox has a separate search box, and is relevant to this speech for obvious reasons.
posted by gen at 6:44 AM on May 27, 2014 [1 favorite]


A nice continuation and extension of his rant from Our Comrade the Electron (in case others hadn't seen that before).
posted by nubs at 7:13 AM on May 27, 2014 [1 favorite]


And this is also interesting: The right to be forgotten
posted by nubs at 7:34 AM on May 27, 2014


This is fantastic. There are so many salient sentences that it's almost unquotable.
posted by oulipian at 8:00 AM on May 27, 2014


gen, I don't know all the ins and outs of Chrome, but my reference to the omnibox was only positional, as in "it sits next to...". I don't know how that extension actually works, but it does not require any direct omnibox input to function, AFAIK.
posted by hwestiii at 8:07 AM on May 27, 2014


Ugh, so I'm in a CS masters program. I've taken some big-data related classes. And NOT ONCE has anyone teaching these classes so much as suggested that there might be any ethical considerations that a person might consider before charging ahead and collecting & processing that data any way they might like.
posted by Blue Jello Elf at 8:24 AM on May 27, 2014 [4 favorites]


Blue Jello Elf, unfortunately that is all too common in engineering schools. Nobody ever made mention of the ethics of dropping bombs on people in my supersonic aerodynamics course, for example.

(Really the only contact I had with ethics in E-school was a professor making fun of my visible discomfort with his "bombing things" analogies during Dynamics lectures)
posted by indubitable at 8:54 AM on May 27, 2014 [1 favorite]


that was a fantastic read, and my first introduction to Maciej Ceglowski. also cool to see matthowie in the credits slide at the end.
posted by moss free at 9:08 AM on May 27, 2014


This essay is fantastic. It's such a huge problem we have, the way information technology shifts the reality of privacy, the way that obscurity is no longer feasible. I've read very little material that has a convincing argument about what to do about it. I'm grateful to have something other than the Transparent Society to point to. I'm not sure Maciej's proposed solutions are sufficient, but it's a very thoughtful take on the problem.

As an aside, I posted this to Hacker News and one thing a lot of people picked up on is that this style of slide presentation is really good. I feel like I didn't miss much. Really it's an essay that happens to be derived from talk notes and images, but it's really well done.

Maciej's a really interesting writer, very thoughtful and careful in what he publishes. Previously on Metafilter: modern surveillance, a cancer scam, Argentinian steak, and the Alameda-Weehawken burrito tunnel.
posted by Nelson at 10:01 AM on May 27, 2014 [1 favorite]


I've taken some big-data related classes. And NOT ONCE has anyone teaching these classes so much as suggested that there might be any ethical considerations that a person might consider before charging ahead and collecting & processing that data any way they might like.

And that's interesting to me, because where I work we have built up some considerable databases of largely anonymous client interactions over time, but we are moving very very slowly on using them for analysis of any kind (even in large aggregates) out of concern for the ethics/confidentiality around the data - because in human/social services, it's our over-riding concern.

So it's interesting to see how the perspectives depending on the professional background differ in terms of approaching large data sets; the one researcher we've allowed in to do some work with our aggregate data has described the data as "sacred" from his perspective, because (while aggregated) it still represents the stories of hundreds/thousands of people who contacted us, and is therefore worthy of being approached with a lot of respect and consideration of how the data is being used.
posted by nubs at 10:23 AM on May 27, 2014


Back during Star Wars/SDI, we had Computer Professionals for Social Responsibility. They dissolved last year, but are there other similar professional ethics/social responsibility organizations?

Programming is probably a less formal field now than it was in the 80s (and that's a good thing) so professional groups aren't the most important place for ethics discussions to happen, but it would still be a good thing and might at least offer students some exposure to ethical considerations.
posted by jjwiseman at 10:50 AM on May 27, 2014 [1 favorite]


My alma-mater offers an undergrad course on "Professional Practice in Computer Science" which covers ethical responsibilities but it is neither mandatory for a degree nor is it a prerequisite for "Introduction to Data Mining."
posted by RobotHero at 10:56 AM on May 27, 2014 [1 favorite]


Any course on big data really ought to include a section on the AOL search data release. Not only is it fascinating data (and still useful and relevant!), it was a well intentioned effort to aid data analysis research. And it completely backfired because they hadn't thought out the privacy implications, how even completely anonymized search data can be creepy and personally revealing.
posted by Nelson at 11:12 AM on May 27, 2014 [2 favorites]


Back during Star Wars/SDI, we had Computer Professionals for Social Responsibility. They dissolved last year....

That seems like some unfortunate timing, to say the least.

It's a little dissonant to see @matthowie in the credits since Metafilter is one of the few remaining community sites or blogs of any stature to attempt to assert complete control in perpetuity of all member data and accounts. Metafilter completely disregards the right to be forgotten even in those jurisdictions where it is the law of the land, treating (with no real legal basis) every interaction with the site as an irrevocable license to store and publish the complete content of that interaction for all time. Even Facebook isn't that grasping.
posted by enn at 11:12 AM on May 27, 2014 [1 favorite]


Interesting analogy, data = radioactive waste.

What if collectors had to make a financial and legal commitment to protection in perpetuity or responsible decommissioning of these databases? Not dischargeable by bankruptcy, or dissolvable by corporate failure?

Having to pooka away some of that profit as a decommissioning fund might discourage the rampant "more data, always" impulse. Still, it wouldn't prevent Google from buying the tiny startup you trusted.
posted by ctmf at 11:28 AM on May 27, 2014


In my eyes, posting to a world-readable forum like Metafilter is publication. If you publish a book, or a letter to the editor, you don't retain the right to excise those words later from everyone's copy. The point of this site is for people to have discussions, after all, which is difficult if someone can take their half of the argument and go home.
posted by idlewords at 11:35 AM on May 27, 2014 [8 favorites]


Lots of people use Facebook as a world-readable forum, too, so I'm not sure of the relevance there.

It seems obvious to me that posting a comment on a blog is much more like posting a status update on Facebook—or, for that matter, like shooting the shit in the bar with your friends—than it is like publishing a book. In fact, I'm hard pressed to see many parallels at all between posting a blog comment (automated, instantaneous, informal, and completely revocable on 90%+ of blogs out there) and publishing a book (governed by a contract negotiated by a trained agent, goes through a months- or years-long editorial process during which it is approved by many people, often including lawyers).

Even when you publish a book, you usually sign away only specific rights (e.g., to publish a paperback only, to publish for a limited time, or to publish only in one country), reserving all rights not explicitly granted. Metafilter claims the right to reproduce all comments in any form, including new forms (like the structured, SQL-queryable infodumps that it now publishes, which pose considerably greater privacy risks than a simple collection of HTML pages) introduced long after the comment was written, in any country, for all time.

And people have discussions in impermanent media all the time. On the phone, for example, or face-to-face. I don't really see why you think it's impossible to have a discussion without recording every word for posterity.
posted by enn at 11:58 AM on May 27, 2014


Metafilter claims the right to reproduce all comments in any form, including new forms (like the structured, SQL-queryable infodumps that it now publishes, which pose considerably greater privacy risks than a simple collection of HTML pages)

This is a reiteration, out of context, of an argument in Metatalk four years ago and if you really want to have it again it should be in Metatalk again, but in short form:

1. We have never represented any kind of expectation of a right to arbitrarily use people's comments or posts in non-Metafilter contexts.

2. None of the files in the Infodump reiterate the text content of people's comments; a basic familiarity with its actual content would make that clear.

People ask us to remove things for them for privacy reasons, etc, on a regular basis. As a one-off thing—specific regretted comments, posts, questions, etc—we routinely work with them to make that happen.

Casually wiping an entire account history on a community website where account history consists basically entirely of conversation with other people is a much stickier thing, which is why we don't provide a button for it and very strongly prefer to avoid it whenever possible.

It's a complicated question, and, again, if you really want to revisit it you know where Metatalk is and how to use it. If you want to just have mentioned it as an aside in here, fine, but please do not pursue this as derail of the thread.
posted by cortex at 12:07 PM on May 27, 2014 [2 favorites]


In my eyes, posting to a world-readable forum like Metafilter is publication. If you publish a book, or a letter to the editor, you don't retain the right to excise those words later from everyone's copy. The point of this site is for people to have discussions, after all, which is difficult if someone can take their half of the argument and go home.

Hi idlewords - if your name in your profile is accurate, how awesome to have you here!

To my mind, what we really need to do on the internet, is make it clear to people the difference between the public and the private side of our activities - when I write a blog post, or update my Facebook wall or tweet or comment - it's public facing, and I think certainly akin to publication (once you've said it, it should be difficult to remove; although there are many other examples of publications that fall out of print or become inaccessible over time). However, that does not mean that my private data - IP addresses, search history, site visit histories, identifying information, etc should either necessarily be (a) collected or (b) used/stored without my consent (and the ability to withdraw that consent).

I think the consent issue is a huge one - most users (and I include myself in this) on the internet are incredibly unaware of what is being collected, and how it might be/could be used not only now but in the future. How do we begin to explain it, much less ensure that people understand it and that there are means for "forgetting" those things that can be/should be forgotten?
posted by nubs at 12:13 PM on May 27, 2014


I work in the field and it is really alarming to sit in sales and marketing calls where the gist from both vendors and IT shops is, "we don't know what this data holds, but there might be something very useful in there that we could glean, so let's keep stockpiling data".

It is pretty much a fear of missing out at this point as organizations fear their competitors are gaining a competitive advantage with their big data initiatives while they are getting left behind.

Data governance and retention rules are pretty much made up as one goes along, and the closer one gets to Silicon Valley, the worse the attitudes towards data. There are some organizations that are governed by FCC rules, weak as they are, but there are many more that don't seem to be governed by any regulations at all.
posted by viramamunivar at 12:14 PM on May 27, 2014


>Casually wiping an entire account history on a community website where account history consists basically entirely of conversation with other people is a much stickier thing, which is why we don't provide a button for it and very strongly prefer to avoid it whenever possible.

I hope to not derail this, and would accordingly bring it to MetaTalk if appropriate, but I for one would rather like to be capable of retroactively eliminating evidence of my younger idiocy off the internet.

The stated strong preference, tho caveated by Maciej himself above, is in my interpretation exactly in line with the range of personal data that I would like to remain in control of. Further, I find the comparison of a book or a letter to the editor to not be analogous, as a) books involve rather more foresight and can be taken off the market and b) letters to the editor didn't used to be easily cross referenced across all of my other activity.

The point isn't to change what people may have already read but prevent future people from also reading it.

In this day and age, what I today consider harmless information may five years down the road prove to be embarrassingly naive or even dangerous to my career or personal life as circumstances and lifestyle choices change.

What if I *used* to be sexist or racist and have grown to be less of an asshole? What if I underwent a gender transition but wished to keep my extant online identity? I would prefer to remain in control of those decisions rather than being subject to the whims/preferences/etc of an (overworked) metafilter moderator whose interests (the integrity of the discussion forum) are definitionally opposite to my own*.

(*Not that I doubt for a second that my future self-censorship requests would be honoured by mefi staff, as you are all beautiful and wonderful and nice people)

In my own apps, one of the first handful of things I implement is a "delete everything about me forever" button and that gives me inordinate satisfaction. I find this lack of control - if nowhere near by a long shot the passive mining Facebook performs - to be part and parcel of the sense of 'digital alienation' Maciej so wonderfully captured in his talk.
posted by pmv at 12:42 PM on May 27, 2014


1. We have never represented any kind of expectation of a right to arbitrarily use people's comments or posts in non-Metafilter contexts.

You've used this data in the infodump and in things like the Markov filter.

2. None of the files in the Infodump reiterate the text content of people's comments; a basic familiarity with its actual content would make that clear.

I've used the infodump. It is trivial, given the information in the infodump, to use the data in a given comment's (or post's) row to retrieve the full text of the comment (or post) from Metafilter the website.

But, regardless, metadata is data in its own right, and there is no way you can reasonably claim that, in posting comments prior to the initial release of the infodump in 2008, I somehow consented to have the metadata associated with those comments published in easily crunchable form later on.

(A hypothetical prospective employer can now, for example, in a single query, figure out what percentage of my Metafilter posting is done during working hours. Sure, pre-infodump they could have written a scraper to crawl the entire site and eventually figure it out anyway—but that's a lot more work and they almost certainly wouldn't have.)

Casually wiping an entire account history on a community website where account history consists basically entirely of conversation with other people is a much stickier thing, which is why we don't provide a button for it and very strongly prefer to avoid it whenever possible.

This is a mischaracterization of your actual practice, which is not that you strongly prefer to avoid it, but that you categorically refuse to do it, as you have done on the many occasions I have—quite firmly—asked you to wipe my own account.

It's a complicated question, and, again, if you really want to revisit it you know where Metatalk is and how to use it. If you want to just have mentioned it as an aside in here, fine, but please do not pursue this as derail of the thread.

I don't think it's a derail at all. The lifespan of user data on the internet, and who is to have control over that data, are the explicit topics of this post.
posted by enn at 12:45 PM on May 27, 2014


The lifespan of user data on the internet, and who is to have control over that data, are the explicit topics of this post.

Well, actually, the topic discussed in the article linked in the FPP is not about user generated content, but rather is about user behavioral data, like what you search for, what cell tower your smartphone is using, purchases you have made, etc. Stuff which actually says a lot more about you and how you actually live your life than anything you choose to write as a comment here or as a contribution of any form onto any other website.

And this is data which is being gathered without your consent or in many cases even your knowledge. Which is quite different from anything you have posted to this or any other website, which was done with your full knowledge, consent, and agency.
posted by hippybear at 12:57 PM on May 27, 2014 [2 favorites]


You've used this data in the infodump and in things like the Markov filter.

The Infodump by design excludes basically all actual text content of comments and posts from its files. It's a skeleton of site activity by numbers, not qualitative stuff; further, we provide a munge function to obscure userids of folks who would rather not have even that degree of direct userid-in-plaintext connection in the files. MarkovFilter, years ago when it was even operable, was operating off the live database and not creating new records, and was far less public-facing in practice than the comments off which its text was synthesized. The corpus frequency tables we have treat site content in total aggregate with no direct ties to users, and again contain far less info than the actual public-facing site itself.

It is trivial, given the information in the infodump, to use the data in a given comment's (or post's) row to retrieve the full text of the comment (or post) from Metafilter the website.

It is also trivial to retrieve the full text of the comment or post from Metafilter the website without the infodump, because it's content on the website.

Again, if you want to pursue this as a Metafilter policy discussion, start a Metatalk thread and we can go through the whole thing there, but I do not want to derail this thread into a detailed back and forth on the mechanics of Metafilter or your specific hypothetical or grievances about same.
posted by cortex at 1:13 PM on May 27, 2014 [1 favorite]


Oh oh. Two independent thought alarms in one day.
posted by entropicamericana at 1:26 PM on May 27, 2014


This kind of debate is why I limited my remarks to "behavioral data" in the talk. The question of what to do with voluntarily submitted data is much more fraught: one person's sacred right to edit is another person's creepy Orwellian rewriting of the past. But behavioral data, the kind of stuff a site learns about you in passing, sometimes through very invasive means (like javascript that tracks your mouse cursor position), feels much more clear-cut to me.

I apologize for starting a debate specifically about Metafilter, a community whose norms I don't know well enough to spout off about. If only there was some way to go back and change my comment...
posted by idlewords at 2:06 PM on May 27, 2014 [5 favorites]


Video of Maciej's talk at Webstock, "Our Comrade the Electron" - (as mentioned above, previously)
posted by maupuia at 2:09 PM on May 27, 2014 [1 favorite]


idlewords: you did not start that debate. That's a long-standing member regrinding an axe they've held for a long time. There is nothing to apologize for.
posted by hippybear at 2:12 PM on May 27, 2014 [1 favorite]


Thanks for posting this. The talk/essay is very clear, informative, and also hilarious.
posted by medusa at 2:18 PM on May 27, 2014


Of interest: the FTC just released a report on data brokers.

Glad you stopped by, idlewords. Fantastic essay.
posted by ropeladder at 3:12 PM on May 27, 2014


Well, I heard there was a cheese plate.
posted by idlewords at 3:20 PM on May 27, 2014 [2 favorites]


But exactly which cheeses you chose and coupled with which crackers and cold cuts... those have been logged in perpetuity.
posted by hippybear at 3:22 PM on May 27, 2014


bean plate, idlewords
posted by grubby at 3:22 PM on May 27, 2014 [1 favorite]


Privacy policy is my day job and this is just a phenomenally good summary of what is what in that area right now. Plus, hilarious.
posted by Sebmojo at 4:03 PM on May 27, 2014


Privacy policy is my day job and this is just a phenomenally good summary of what is what in that area right now. Plus, hilarious.

Ditto - I just sent it out to all my staff. Thanks for this, idlewords, it's awesome.
posted by His thoughts were red thoughts at 4:19 PM on May 27, 2014


This kind of debate is why I limited my remarks to "behavioral data" in the talk. The question of what to do with voluntarily submitted data is much more fraught: one person's sacred right to edit is another person's creepy Orwellian rewriting of the past. But behavioral data, the kind of stuff a site learns about you in passing, sometimes through very invasive means (like javascript that tracks your mouse cursor position), feels much more clear-cut to me.

I guess my argument would be that you are drawing a distinction that is very much in the eye of the beholder, and that the main rhetorical tactic employed by entities like Google and Facebook to justify what they do is to blur the line between "behavioral data" and "voluntarily submitted data" by redefining as much of the former as the latter as they can get away with (which seems to be quite a lot).

For example, when I "like" a friend's wedding announcement on Facebook, that's at least somewhat of a public act, in that I intend at least my friend to see it, and probably our mutual friends as well—otherwise, it's pretty pointless. But notice how Facebook has cleverly made people use the same "like" mechanic if they want to get updates from an organization or business. When I "like" Joe's Food Truck, it looks and feels the same as when I like my friend's status, and Facebook would certainly argue that this is "voluntarily submitted data." But in fact I'm not trying to perform a public act at all, I just want to be notified when the food truck is near my office, which is surely behavioral data.

I'm not convinced that such a distinction can really be drawn in a way that is consistent enough to let people make reliable predictions about which of these two categories a given datum is likely to be judged (by a website, by the law, by other people in general) to fall into.

Maybe I'm wrong. But I think it's worth trying to define your terms, because at least to me the basis of this distinction is not clear. As a user, I click one link and I go to a different web page—ok, most people would consider that to be behavioral. I click another link and I upvote somebody's post (or "+1" a Google search result)—is that behavioral or is it content? What about the record of a past upvote or +1 which I performed and then changed my mind about and removed? How about adding a post on a discussion forum to a list of "My Watched Threads" or something? Does it matter if the list of watched threads is visible on the interface only to me or to others as well? Do I have to think about POSTs versus GETs to figure out the answer? Personally, I would like for the consensus to be that we should err on the side of the users. At the moment, things seem to be headed the other way.
posted by enn at 1:53 PM on May 28, 2014


With Google at last launching its European Court of Justice–mandated "Right to Be Forgotten" form—but only in the EU—Cegłowski's thoughts are all the more timely.

N.B. His blog post is potentially NSFW if a selfie of Robert Scoble wearing Google Glass in the shower counts.
posted by Doktor Zed at 9:36 AM on May 30, 2014




Omnivore: What's the next big thing in big data?
posted by homunculus at 1:24 PM on June 18, 2014




This thread has been archived and is closed to new comments