The Dreams Of Big Data
March 21, 2013 8:12 AM   Subscribe

Does Big Data Mean The Demise Of The Expert - And Intuition? - "Data-driven decisions are poised to augment or overrule human judgment." What Is Big Data?

Are Smart Gadgets Making Us Dumb?
First, thanks to the proliferation of cheap, powerful sensors, the most commonplace objects can finally understand what we do with them—from umbrellas that know it's going to rain to shoes that know they're wearing out—and alert us to potential problems and programmed priorities. These objects are no longer just dumb, passive matter. With some help from crowdsourcing or artificial intelligence, they can be taught to distinguish between responsible and irresponsible behavior (between recycling and throwing stuff away, for example) and then punish or reward us accordingly—in real time.
A Stunning Vision Of Our Interoperable Future
At the heart of "Trillions" is the power of the microprocessor. To date, it's been mostly put at the service of personal computing, using individualized tools that only interact with each other when we tell them to. For the authors, we've gone about as far as we can go in this direction - and that's why we've been so frustrated of late with the seemingly unmanageable firehose of data coming out of the global economy. If, as the authors put it, we want to tame the complexity of our world, we need to walk back down the mountain of personal computing and begin climbing Trillions Mountain, which is based on pervasive computing.
When Public Data Is Too Public
Everything We Know About What Data Brokers Know About You
Federal law protects the confidentiality of your medical records and your conversations with your doctor. There are also strict rules regarding the sale of information used to determine your credit-worthiness, or your eligibility for employment, insurance, and housing. For instance, consumers have the right to view and correct their own credit reports, and potential employers have to ask for your consent before they buy a credit report about you.

Other than certain kinds of protected data—including medical records and data used for credit reports—consumers have no legal right to control or even monitor how information about them is bought and sold. As the FTC notes, “There are no current laws requiring data brokers to maintain the privacy of consumer data unless they use that data for credit, employment, insurance, housing, or other similar purposes.”
The Information: A History, A Theory, A Flood
In The Information, Gleick neatly captures today’s reality. “We know about streaming information, parsing it, sorting it, matching it, and filtering it. Our furniture includes iPods and plasma screens, our skills include texting and Googling, we are endowed, we are expert, so we see information in the foreground,” he writes. “But it has always been there.”
CIA's CTO Gus Hunt: "We try to collect everything and hang on to it forever."

Big Data Is Neutral: A Tool For Both Good And Evil
David Brooks concludes in "What Data Can't Do": "This is not to argue that big data isn’t a great tool. It’s just that, like any tool, it’s good at some things and not at others. As the Yale professor Edward Tufte has said, “The world is much more interesting than any one discipline.” "

Reinventing Society In The Wake Of Big Data
What those breadcrumbs tell is the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether you are the sort of person who will pay back loans. They can tell you if you're likely to get diabetes.

They can do this because the sort of person you are is largely determined by your social context, so if I can see some of your behaviors, I can infer the rest, just by comparing you to the people in your crowd. You can tell all sorts of things about a person, even though it's not explicitly in the data, because people are so enmeshed in the surrounding social fabric that it determines the sorts of things that they think are normal, and what behaviors they will learn from each other.
"Big Data" is not without its critics.

Evgeny Morozov: The Curse Of 'You May Also Like'
But when it comes to discovery of new talent and the subsequent production of their work, things look much gloomier. After all, recommendation matters only if there's great art to recommend. If that art is selected based on how likely it is to match the success of previous selections and if it's produced based on immediate feedback from the audience, sales might increase, but will anything truly radical emerge out of all this salesmanship?
Toward A Complex, Realistic, and Moral Tech Criticism
Morozov's second book, To Save Everything, Click Here: The Folly of Technological Solutionism, is the most wide-ranging and generative critique of digital technology I've ever read. There's so much substance to argue about between its covers. At the center of it all, there's a brilliant, idiosyncratic mind at work.

Describing and destroying two concepts -- "Internet-centrism" and "solutionism" -- form the core of his book, and both are fascinating frames for the discourse surrounding our network technologies.
A review of Evgeny Morozov’s “To Save Everything, Click Here”
From here it is just one more short step to the buying and selling of our personal data: to insurers in return for lower premiums, to advertisers in return for better deals. Our personal data becomes a new “asset class” and executives respond by “trying to shift the focus [of debate] from purely privacy to what we call property rights” (235). New social pressures emerge as the digitizers follow their path of bits, algorithms and markets (career counsellors now routinely recommend that building a strong presence on LinkedIn is a route to a better job), and we can replace debates about privacy with reassurances about personal choice. “Privacy is mostly an illusion, but you’ll have as much of it as you want to pay for” says Kevin Kelly (236). New companies emerge to optimize our self-presentation on the web (reputation.com), new norms emerge as “If you’re going out with someone, and they don’t have a Facebook profile, you should be suspicious” (Slate’s Farhad Manjoo, quoted on p. 239). Why would you not share your real-time blood alcohol levels with your employer if you don’t have anything to hide? (240).
Douglas Rushkoff: Unlike - Why I'm Leaving Facebook
I can no longer justify this arrangement. Today I am surrendering my Facebook account, because my participation on the site is simply too inconsistent with the values I espouse in my work. In my upcoming book Present Shock, I chronicle some of what happens when we can no longer manage our many online presences. I argue - as I always have - for engaging with technology as conscious human beings, and dispensing with technologies that take that agency away.
Adaptation from Present Shock:
Instead of our offloading time-intensive tasks to our machines, we attempt to match the speed of our network connections. Thanks to the Internet, we travel more on business not less, we work at all hours on demand, and spend our free time answering email or tending to our social networks. Staring into screens, we are less attuned to light of day and the physiological rhythms of our housemates and co-workers. We are more likely to accept the digital clock's illusion that all time is equivalent and interchangeable. But it isn't.
NY Times review of Present Shock: Out of Time: The Sins of Immediacy
Among the intuitive ideas turned tangible by “Present Shock” is “filter failure,” the writer and teacher Clay Shirky’s improved term for what used to be called “information overload.” Mr. Rushkoff’s translation: “Whatever is vibrating on the iPhone just isn’t as valuable as the eye contact you are making right now.”
Nassim Taleb: Beware The Big Errors of 'Big Data' and referenced here: Spurious Correlations Everywhere: The Tragedy Of Big Data

Bruce Schneier (via): Government, Big Data Pose Bigger 'Net Threat Than Criminals
Schneier said the threat is often obfuscated by the tremendous technical advances the big data players have offered. Google mail is a safer alternative for average users because there's almost no chance they'll ever lose a message. Apple's iPhone is wildly popular because it's easy to use and to date has proved largely impervious to real-world malware attacks. But behind the security and reliability, there are threats many don't consider.

"I can't find a program that will erase the data on this thing to a reasonable assurance without jailbreaking it," he said, holding up his iPhone. "For me that's bad."
see also: Battle Of The Internet Giants

Michael Lind: Stop Pretending Cyberspace Exists - "Treating The Internet As A Mythical Country Makes Us Dumber"
It is not a parallel universe, coexisting with our world but in a different dimension. It is just a bad metaphor that has outlived its usefulness. Using the imagery of a fictitious country makes it harder to have rational arguments about government regulation or commercial exploitation of modern information and communications technologies.
Some via. Previously
posted by the man of twists and turns (73 comments total) 148 users marked this as a favorite

 
We should A/B test this.
posted by Artw at 8:17 AM on March 21, 2013 [15 favorites]


I'm going to analyze the sentiments expressed here towards the big data brand before I comment.

OOPS!
posted by Mister_A at 8:26 AM on March 21, 2013


Big Data, BTW, is the same technocratic pipe dream as universal surveillance. It's only as good as the available analysis and the application of findings to the real world, which are processes that are still governed by us humans.

And remember, we are the ones asking the questions. Asking the questions defines the range of possible answers. "Big data" will be a slightly embarrassing term in the future when we realize it's really still just 'data'.
posted by Mister_A at 8:30 AM on March 21, 2013 [12 favorites]


Terrific post though Twisty! May I call you Twisty?
posted by Mister_A at 8:31 AM on March 21, 2013


A bunch of people are going to make a bunch of stupid decisions based on stupid big data and are going to lose a lot of money for it.
posted by effugas at 8:31 AM on March 21, 2013 [1 favorite]


A bunch of people are going to make a bunch of stupid decisions based on stupid big data and are going to lose a lot of money for it.

and/or lives.
posted by Thorzdad at 8:36 AM on March 21, 2013 [4 favorites]


A bunch of people are going to make a bunch of stupid decisions based on stupid big data and are going to lose a lot of money for it.

Indeed. While it's true that your analysts are only as good as their data, it is equally true that your ability to make good data-driven decisions is only as good as your analysts.

(viz. Nate Silver vs. the GOP establishment, who all had access to the same polling dataset leading up to November 2012, and reached, umm, different conclusions)
posted by dersins at 8:37 AM on March 21, 2013 [10 favorites]


Terrific post though Twisty! May I call you Twisty?

No, that is someone else.
posted by the man of twists and turns at 8:39 AM on March 21, 2013 [2 favorites]


This post has introduced me to Farnam Street, for which I sincerely thank you.
posted by jbickers at 8:39 AM on March 21, 2013 [3 favorites]


Well again, it comes down to asking the right questions. Beane was successful with the A's because he asked the right questions. Or rather, questions that had measurable answers and that seemed to have some correlation with on-field success. You see, Beane was not "just a number cruncher," he was a very bright man, an expert in his field, and a highly creative thinker.

As time wears on, we see other people asking other questions, with varying degrees of success. Do the Mets have analysts? Of course! But they (the team) still suck. There's a lot more to it than access to data. You need access to brilliance to really shine the way Beane's A's teams did.

So yes, a lot of people will use data to make decisions without considering the implications fully, and without understanding the limits of the data. I think the second part there is the biggest issue. People have a limited understanding of the value of data, how to tell if it's clean, when to discard it as outdated or as outliers, etc. We have, as a society, even less of a grasp on the ways in which data can be manipulated into supporting almost any position.

So same as it ever was, really.
posted by Mister_A at 8:43 AM on March 21, 2013 [3 favorites]


"Big data" will be a slightly embarrassing term in the future when we realize it's really still just 'data'.

Even then, it's just "data" in the sense of ones and zeroes. Just because it's information doesn't mean it's informative, y'know?
posted by Sys Rq at 8:48 AM on March 21, 2013


I am a bit of a Big Data nerd -- not so much using and implementing it, but looking at what it allows us to do and what the implications may be -- so this is an awesome post. I likely won't have time to read it until much later today, but thank you for compiling these links!
posted by asnider at 8:48 AM on March 21, 2013


"Big data" will be a slightly embarrassing term in the future when we realize it's really still just 'data'.

Yes ... but it will also give us a term to rally around for a while while we discuss these big issues at cocktail parties, around various water coolers.

A bunch of people are going to make a bunch of stupid decisions based on stupid big data and are going to lose a lot of money for it.

and/or lives.


To which I would counter, but how many better informed decisions will be made based on greater accumulations of available information. How many lives will be saved? Which will forever be the issue with progress/innovation -- being humans, it tends to bring out both the good and the bad in us.

Or as Mister_A pointed out:

Well again, it comes down to asking the right questions.

I mean, hate the airplane all you want, it's done magnitudes to bring together people from all over the world who would otherwise never have met ... and the steam engine before it.
posted by philip-random at 8:56 AM on March 21, 2013


If the only concrete example of an actual instance of successful use of 'big data' the author can come up with is picking a baseball team then I find it very hard to believe that these big data decisions are going to have an impact in many other fields.

The article presumes that for most decisions there is some possible data set out there that could somehow inform my choice in the matter. I just don't believe that is true of many things.
posted by mary8nne at 8:58 AM on March 21, 2013


Data — so much that 90% of the data in the world today has been created in the last two years alone

Isn't it a little suspicious? Like aren't there useful pieces of information from before 2011?
posted by Reasonably Everything Happens at 9:04 AM on March 21, 2013 [1 favorite]


You know when it's Thanksgiving, and you've got this plate full of really awesome food and have no idea where to start because it all looks so good? Great post - as a data guy myself I really enjoy reading the opinions of others in related fields.
posted by antonymous at 9:07 AM on March 21, 2013


I find it very hard to believe that these big data decisions are going to have an impact in many other fields.

It's huge in politics (election campaigns, primarily, but also in the partisan hackery that goes on between elections). Big Data makes it a lot easier to target individual households that may represent swing voters who just need that extra push, particularly among demographics that parties would traditionally ignore as not being the type who vote for them.

So...to pick a made-up example: the Yellow Party normally does terrible among African-American women. But women who shop at Target, prefer beer over wine, and have a household income of greater than $60,000 are very likely to vote for the Yellow Party, regardless of their race. Suddenly, instead of not bothering to send someone to go door-knocking in African-American neighbourhoods, you can send someone to go knock on 5 - 10 very specific doors in an African-American neighbourhood and pick up those extra votes that you'd have overlooked in the past.

That may be a really cynical way of running an election campaign but, based on what I've been hearing (mostly anecdotally) for the past few years, it works.
posted by asnider at 9:15 AM on March 21, 2013 [1 favorite]


Isn't it a little suspicious? Like aren't there useful pieces of information from before 2011?

Without knowing the context (as in, which article you're referring to), I'd say it's definitely plausible. To take an example from surveillance tech, there is a technology called ShotSpotter, which helps pinpoint the location of gunfire. This means that in areas of the city where this technology is deployed, there are microphones which are constantly listening to everything, and the technology that this company deploys does an analysis of every sound and alerts the police only when gunfire is detected. Think of all the times that there is NOT gunfire during the day, and you start to imagine the scope of data that is being processed by this technology. While this particular tech has been around for several years, it's a good example of the massive quantity of data (such as sound) that has become easier to analyze in recent years. Now apply facial recognition and gait-detection to the video cameras that have become ubiquitous in recent years, and the data points just multiply...
posted by antonymous at 9:15 AM on March 21, 2013 [2 favorites]


Isn't it a little suspicious? Like aren't there useful pieces of information from before 2011?

The keyword there is useful - we're producing data at an incredibly expanding rate at the moment, but not all data is useful data. Data does not equal information; that requires a lot of work and I don't think we as a species are very good at that yet. I'm currently reading Nate Silver's Signal and the Noise, which is a good introduction to the problems we run into when dealing with making predictions, even when we have large data sets behind us.

I'm on the edge of Big Data, and working with the very large historical databases my organization has created over the years, so this post is very much in my interest zone. Thanks!
posted by never used baby shoes at 9:15 AM on March 21, 2013 [1 favorite]


If the only concrete example of an actual instance of successful use of 'big data' the author can come up with is picking a baseball team then I find it very hard to believe that these big data decisions are going to have an impact in many other fields.

Every decision made by the President's reelection campaign was driven by data, "Big" and small. It's how the campaign knew who to talk to, how to talk to them, and what to say.

Turnout (and persuasion) target universes were created based on demographics, geography, publicly available information, consumer data, field-collected data, and just about every other kind of information you could imagine. Methods of contact, script--hell, even email subject lines-- were data- (and a/b-testing) driven decisions as well.

Digital analytics were also a huge part of an incredibly successful effort (as measured by on-line donations) at targeting on-line advertising.

These data-driven decisions were the effort that made it possible for a very vulnerable President of a country with a slumping (ish) economy to win re-election in what was effectively an electoral college landslide.

This is also exactly how then-Senator Obama won first the 2008 Democratic nomination, then the Presidency.

I'd classify that as, um, kind of an impact.
posted by dersins at 9:19 AM on March 21, 2013 [2 favorites]


As long as there is data, there will be people to misinterpret it.
posted by TheWhiteSkull at 9:21 AM on March 21, 2013 [1 favorite]


The closer to 'ground truth' we are able to sample the better, recognizing that we can rarely get there for any large N.

Seconding 'big data' is just data and we've previously been small-minded and forced to invent statistics and sampling (and over- and under- sampling) for just about any data source of a useful size. That's not a dig on statistics, any truthful practitioner thereof will own that the human at the front end of it who needs to look at it and ask "does that even kind of look right?" is as much a part of the practice as the model.

Some forms of interconnected, data-emitting universal computing matter will appear useful, only until someone actually plots the Big Table that contains all the nodes not reporting. So your garbage can determine if you're recycling? Cool, that's say a few million rows of data in. Now let's go ahead and populate with the guess of total rows we should use if we want to account for all the refuse that's not so enabled.... oh that's a trillion rows. Of almost all nulls. So from that data we can conclude....

Let's just filter out those pesky null reports because... reasons.

There we go! People in Pennsylvania hate the environment and nobody recycles on Tuesdays. Rejoice!

I'm a twelfth-dan data nerd and the first tenet of my personal technical religion is "no magic." Big data presents the same problems as small data with an enormously higher error rate. But, as with most things, when any inference is presented based on data without denominators, some folks will be impressed and find very clever ways to adapt it to the conclusion they were gunning for.

Further, the A's success wasn't big data, it was good analysis. It was falsifying hypotheses so easily falsifiable the work could have been done pre-computers (and was), then the very hard work of building new testable hypotheses and refining them. The hard part was getting someone to apply basic empiricism in a system full of gut, tradition, and a universe of untracked data on which decisions were based.
posted by abulafa at 9:23 AM on March 21, 2013 [5 favorites]
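
A minimal sketch of the denominator problem described in the comment above. The row counts are hypothetical, loosely following the "few million rows" reported versus "a trillion rows" total; `recycling_rows` and the function names are made up for illustration. It compares the rate you get after filtering out the nulls with the range the true rate could fall in once the non-reporting rows are counted.

```python
# Hypothetical counts, for illustration only (not real data).
reporting_rows = 3_000_000            # sensor-equipped bins that sent data
recycling_rows = 600_000              # of those, rows that logged recycling
total_rows = 1_000_000_000_000        # guess at all bins, almost all nulls


def naive_rate(hits, reporting):
    """Rate computed after silently dropping every non-reporting row."""
    return hits / reporting


def coverage(reporting, total):
    """Share of the whole population that actually reported anything."""
    return reporting / total


def true_rate_bounds(hits, reporting, total):
    """Worst-case bounds on the real rate when nothing is known about
    the rows that never reported: they could all be misses or all hits."""
    unobserved = total - reporting
    lower = hits / total
    upper = (hits + unobserved) / total
    return lower, upper


print("naive rate:", naive_rate(recycling_rows, reporting_rows))
print("coverage:  ", coverage(reporting_rows, total_rows))
print("bounds:    ", true_rate_bounds(recycling_rows, reporting_rows, total_rows))
```

With these made-up numbers the filtered rate looks like a tidy 20%, but the sample covers about three millionths of the population, so the bounds on the real rate run from nearly 0 to nearly 1. The conclusion comes almost entirely from the rows that were thrown away.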


I did my undergrad in Poli Sci and Psych, and work in IT professionally. One of the most important things I learned in my undergrad was the difference between analog and digital measurement.

I think more and more about Zeno's Paradox. The more you are counting the segments in movement of an arrow's arc towards its target, there could be a million points where it may seem like it's not going to hit. Having more data doesn't mean you can make accurate predictions.

God that sounds convoluted. I need to brush up on my writing skills. Anyways, thanks for the post. I'll try to read up on what I've missed recently.
posted by lslelel at 9:25 AM on March 21, 2013 [3 favorites]


There will be a lot of failures in 'big data'. There already have been. Measuring the wrong thing, false precision, personality winning out over science, trying to use tools and methods that worked great for 100K samples, but fall over at 100M. That happens everywhere, all the time. It's just more noticeable in new and sexy trends that get print ink, and twitter links.
posted by DigDoug at 9:25 AM on March 21, 2013 [1 favorite]


As long as there is data, there will be people to misinterpret it.

This is the key point. We need, to pick a current hero of big data, Nate Silvers to use their knowledge and intuition to guide the results. Garbage in still gets garbage out. Quality/Sanity checking model outputs is really important.

Stuff like Google Now can be impressive, but automatic inference still gets things wrong more often than not (what happens if I want to stop at the Grocery store instead of just heading home? Traffic info is not so useful for that.)
posted by bonehead at 9:30 AM on March 21, 2013 [1 favorite]


dersins: "Indeed. While it's true that your analysts are only as good as their data, it is equally true that your ability to make good data-driven decisions is only as good as your analysts.

(viz. Nate Silver vs. the GOP establishment, who all had access to the same polling dataset leading up to November 2012, and reached, umm, different conclusions)"

Something else that's occurred to me, as I've begun delving into one of the archaeological equivalents of "big" data, is that there are also two radically different ways to approach massive datasets. You can either ask questions of the data, or you can use the data to create new questions. Just about any koopa can query a dataset for information, but it takes a deeper understanding to synthesize the data into something more than the sum of its parts.

The GOP are just koopas. Nate Silver, on the other hand, is clearly a magikoopa.
posted by barnacles at 9:40 AM on March 21, 2013 [3 favorites]


On the subject of smart gadgets, I thought this looked interesting. A Bluetooth-enabled subdermal sensor that could theoretically warn you a few hours before a heart attack.
posted by CheeseDigestsAll at 9:51 AM on March 21, 2013 [1 favorite]


but it takes a deeper understanding to synthesize the data into something more than the sum of its parts

Yes, exactly! My current way of talking about this during interviews (boo job search) is that you can't just "look at the numbers" to be effective in analyzing data. You have to be able to puzzle out what story the numbers are really trying to tell you, which usually means knowing the right questions to ask-- just like a reporter or an investigator.

But even that's not enough: you have to be able to tell that story in a way that makes sense to the people making the actual data-driven decisions, who usually aren't the people actually elbows-deep in the dataset.
posted by dersins at 9:57 AM on March 21, 2013 [4 favorites]


Wonderful post! I literally just came out of a meeting at our community college about how our math and composition developmental curriculum is being completely and quickly revamped by the state. The faculty have raised numerous concerns where we see our classroom experience conflicting with these new proposals, but we're repeatedly told that the college system is moving to data-driven decision making, and we just have to adapt to this new process.

I don't doubt that some of our anecdotal evidence is not as universal as we think it may be, but when we can actually use data back at them - the numbers you propose are different than the numbers we actually see - we're being told that we must be an "outlier." We're the 2nd largest community college in the state, but we're an outlier.

I'm not optimistic.
posted by bibliowench at 9:57 AM on March 21, 2013


Scientists (more physicists and chemists than biologists) have been swimming in data for years, and I feel like something could be learned there; but the answer "you need to become a scientist to make use of Big Data" seems suspect :). I'm watching (and participating) as biology becomes increasingly data driven and it's pretty horrifying to watch people assume that "data = answers", although of course there have always been people pointing out that data can actively mislead you as well. The physicists with CERN can combine theory with data analysis to produce results, which seems like the best way -- I'm sorry but I can't find a good link that explains how they do it, although I've heard multiple talks about it!
posted by ctitusbrown at 10:00 AM on March 21, 2013 [1 favorite]


The other day an internet pal of mine tipped me off to this "surveillance symposium" on youtube with Jacob Appelbaum, Bill Binney, and Laura Poitras. Claim is the NSA is collecting all our e-mails, phone logs, facebook friends, &c. and they claim it is legally not surveillance to store it without inspecting it. They don't need a court order unless they decide they want to read it, at which time they think they have enough data to read your mind. What they are collecting, just in case, is every datum you have ever created by existing.
posted by bukvich at 10:00 AM on March 21, 2013 [2 favorites]


The faculty have raised numerous concerns where we see our classroom experience conflicting with these new proposals, but we're repeatedly told that the college system is moving to data-driven decision making, and we just have to adapt to this new process.

Oh great, Six Sigma for education... because we totally need to treat everything like a fucking assembly line.
posted by DigDoug at 10:05 AM on March 21, 2013 [1 favorite]


So much data flowing out, sitting around, yet only a fraction of that can probably be made useful.

Much like there is so much methane flowing out into the atmosphere, yet only a fraction of that too can probably be made useful.

And with both: the implications of making them useful are both OK and frightening.

Yet I must get back to work making pages that display various data sets in the most user friendly way possible.

Interesting times, to say the least.
posted by JoeXIII007 at 10:07 AM on March 21, 2013


more physicists and chemists than biologists

Wildlife/population ecology, bioinformatics, to name a couple.

Geologists too: GIS systems are nothing if not Big Data.

Dealing with Big Data has been one of, if not the, largest change and challenge for the current generation of scientists.
posted by bonehead at 10:13 AM on March 21, 2013


Oh my! There are so many things I'd like to say on this post, but will try to pick a few relevant to the comments; I may not be a 12th-dan data nerd but I've "stolen a few horses" in the data world.

...the first tenet of my personal technical religion is "no magic."

I'll second that and turn it back on analytic techniques: There is no such thing as a "magic wand." I am a 12th-dan genetic programmer, but I will be the first to tell you it's not a magic wand. Same with every other analytic technique (yes, even Bayesian statistics, Nate.) There is a theorem in analytics called the NFL Theorem where 'NFL' stands for No Free Lunch. It roughly states: There is no algorithm that will work any better than any other algorithm across all data sets. I.e., there is no magic wand.

One of the things I do is what I call "listening to data." By that I mean one should take the attitude that the data is trying to tell you something based on the results of your analysis, but it's not necessarily the answer to the question you asked. Here is an example: For the last 12 years, I have been working in bioinformatics and in particular in analyzing molecular data. I began a long collaboration with a lab at USC studying bladder cancer. They had a set of particularly reliable data and the lead researcher, Dr. Richard Cote, asked me to analyze the data to see if the physical staging of the cancer (T-staging) could be correlated with molecular characteristics, in this case gene expression levels.

The result of our analysis was messy with poor performance, complex models and a general feeling that we had overfit the data (a common problem with Big Data, but particularly biological data). I decided to look at the samples we were misclassifying and discovered that most of them had the characteristic of having local metastases. So I asked a different question: Is there a genomic signature that identifies tumor samples that have metastasized to local lymph nodes? (Note that this is not analyzing nodal tissue, but the original tumor.) The answer was a swift and unequivocal yes. We eventually found a simple 3 gene model that was very accurate at identifying tumors that had metastasized and, as a bonus, 2 of the 3 genes were associated with metastasis in the literature. Extra bonus: the 3rd gene, while well associated with cancer, was not associated with either metastases or with playing a role in the tumors themselves.

Anyway, to make a short story long, this result came entirely from "listening" to what the data was trying to tell us, not to the question we were asking. As a side note, it was also telling us that T-staging has nothing to do with molecular characteristics and that the latter are probably better than T-staging (and probably TNM staging) in assessing the danger of the cancer. Disclaimer: The literature has many examples where researchers have probably overfit the data in drawing a conclusion. Always check for validation, particularly prospective validation.

For anyone interested in reading about this study, it may be found here, though it does not talk about how we got there from T-staging, only the study of local mets. (Please don't taze me mods!)
posted by BillW at 10:13 AM on March 21, 2013 [17 favorites]


Big Data: when hipsters discovered spreadsheets.
posted by Damienmce at 10:22 AM on March 21, 2013 [5 favorites]


So, I do "Big Data" work in cheminformatics, and while I understand the skepticism in this thread in the context of using data mining to predict when your socks are going to get holes in them or whatever, there really are spaces (like the space of all possible chemical compounds) that are too large for anyone to become an expert at maneuvering their way through them to the interesting bits. The type of analysis we do on predicting compound activity, for example, just beats the pants off of analyses that use more of a first-principles approach, like docking, or SAR -- we've made a lot of screeners very happy. The key to our success, though, is that we're dealing with a relatively low dimensionality, and figuring out which dimensions are relevant, and how to represent them, is where the expert knowledge comes in.
posted by invitapriore at 10:23 AM on March 21, 2013 [3 favorites]



On the subject of smart gadgets, I thought this looked interesting. A Bluetooth-enabled subdermal sensor that could theoretically warn you a few hours before a heart attack.


Hopefully, not enough false positives to give you a heart attack.
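The false-positive worry can be put into numbers with Bayes' rule. This is only a sketch: the prevalence, sensitivity and specificity below are made-up figures for illustration, not specs of the sensor linked above, and `positive_predictive_value` is a hypothetical helper name.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that an alarm is real, given that the alarm went off (Bayes' rule)."""
    true_pos = sensitivity * prevalence                    # real events that trigger an alarm
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # non-events that trigger an alarm
    return true_pos / (true_pos + false_pos)


# Hypothetical numbers: 1 in 1,000 wearers is actually about to have an event,
# the sensor catches 95% of real events and stays quiet for 99% of non-events.
ppv = positive_predictive_value(prevalence=0.001, sensitivity=0.95, specificity=0.99)
print(f"Share of alarms that are real: {ppv:.1%}")
```

Under those assumed numbers, fewer than one alarm in ten would be real, which is why a rare event needs a very high specificity before the alerts are worth acting on.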
posted by KaizenSoze at 10:32 AM on March 21, 2013


If you're not asking the right questions of your data, it doesn't matter how much of it you have.

(Asking the right questions is rly rly rly hard.)
posted by PMdixon at 10:39 AM on March 21, 2013 [6 favorites]


That may be a really cynical way of running an election campaign but, based on what I've been hearing (mostly anecdotally) for the past few years, it works.

Which is amazing considering how stupid the polling can be.

About an hour ago I took a telephone call that was an automated poll. The first question was "How likely are you to vote in the upcoming Republican primary for the special election?" and the only options were "Very Likely", "Moderately Likely", "Somewhat Likely", and "Not Likely". After answering "Not Likely" the poll ended.

I've been trying to figure out what data they're going to take from that. Obviously they bothered calling, so they had some preexisting reason to suspect that I might vote Republican, but does "Not Likely" mean that I'm not voting in the primary because I think Republicans are greedy, sociopathic weasels, or does "Not Likely" mean I'm not excited about the candidates and might change my mind after I receive some campaign calls? They could have included the option "Not Voting Republican" or asked a follow-up question to discern why I'm not a likely voter, but they didn't bother. Though given my political leanings, I can't say I'm upset by this.

Likewise, when I was polled last fall about the assisted suicide ballot measure, the entire poll was yes/no answers with no room to capture even the slightest bit of nuance. Either someone was being incredibly stupid and narrow-minded when designing it, or they just wanted to force a certain kind of result. Either way, it should have resulted in terrible information.
posted by RonButNotStupid at 10:46 AM on March 21, 2013


RonButNotStupid, they don't care why you are not likely. They are trying to optimize who they should spend time/money on and someone who self-selects out, for whatever reasons, is not worth spending resources on.

That said, there are usually a couple of stages in this process depending on where you are in the election cycle and/or how much money the caller has. In an earlier stage, someone might engage you to try to find out why you don't want to vote and whether you are persuadable to vote for their candidate. If they become convinced you are not going to vote for their candidate, they will mark you down as such. Later on, if someone talked to you and you convinced them you were not going to vote for their candidate, even if you were going to vote, they may try to actively convince you of the intrinsic awfulness of your choice (negative campaigning) in order to suppress your vote. Still later, in the final weeks, they are working too hard on getting out the vote (GOTV) and at this point all they are trying to do is get the voters who said they were going to vote for their candidate out to the polls.

Whether this is the right algorithm or not is an interesting question, but it is how it is done currently.
posted by BillW at 10:57 AM on March 21, 2013


Invitapriore: I always hear about the absolute necessity of using people's expertise to find good low-dimensionality representations of the data, but at the same time the forefront of research is definitely heading towards unsupervised learning of these representations.
posted by curuinor at 11:16 AM on March 21, 2013


I've been trying to figure out what data they're going to take from that. Obviously they bothered calling, so they had some preexisting reason to suspect that I might vote Republican, but does "Not Likely" mean that I'm not voting in the primary because I think Republicans are greedy, sociopathic weasels, or does "Not Likely" mean I'm not excited about the candidates and might change my mind after I receive some campaign calls? They could have included the option "Not Voting Republican" or asked a follow-up question to discern why I'm not a likely voter, but they didn't bother.

If they are polling to see which candidate is ahead (like the polls you see in the media), and it sounds like that's what the poll was (PPP maybe?), they do not give a flying fuck why you are not voting. Assuming the poll is one of "likely voters" (as most are), once you self-identify as not a "likely voter", you are dead to them. Completely useless.


Likewise, when I was polled last fall about the assisted suicide ballot measure, the entire poll was yes/no answers with no room to capture even the slightest bit of nuance.


Yes/no answers (or multiple-choice with limited options) are pretty much necessary for fast, useful large-scale analysis. Aggregate enough yes/no responses, and you can have an incredibly accurate picture, much more accurate than you can get by reading through unquantifiable "nuance".
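To put a rough number on "aggregate enough": for a simple random sample, the usual 95% margin of error on a yes/no proportion shrinks with the square root of the sample size. A minimal sketch, assuming simple random sampling (real polls add weighting and design effects), with `margin_of_error` as a hypothetical helper name:

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a yes/no proportion p from n responses."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst case is an even split (p = 0.5); the sample sizes are illustrative only.
for n in (100, 1000, 10000):
    print(n, f"+/- {100 * margin_of_error(0.5, n):.1f} points")
```

Note that this only covers sampling noise. It says nothing about whether the question was worded well or whether the right people were reached.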

"Some" is not a number, "soon" is not a time, and anecdote is not data.
posted by dersins at 11:20 AM on March 21, 2013 [2 favorites]


It's going to be really fun when people start poisoning the data to mess with the competition.
posted by KaizenSoze at 11:20 AM on March 21, 2013 [1 favorite]


These links cover so many different topics, but there seems to be a common theme among most of them: We are losing control (and suffering consequences) over the mess we created and should stop, drop and PANIC! Or at least take some action against these NEW and possibly EVIL forces. (In more grounded terms, the idea is that technological determinism is king and we need to do something about it.)

I took a close read of the last link (Stop Treating Cyberspace as if it Exists) earlier this week. It's so full of dumb I can't even.
"The idea that corporations are “invading” a mythical Oz-like kingdom called cyberspace is just as dopey."
He's talking about the stupidity of the "corporations are people" metaphor, right?
"If you’re not convinced by now that the very notion of cyberspace is silly, try substituting “fax” or “telephone” or “telegraph” for “cyber” in words and sentences. The results will be comical."
Comical because those are nouns and "cyber" is an adjective?
"Again, the point is not that telecommunications should not be structured and governed in the public interest, but rather that the debate about the public interest is not well served by the Land of Oz metaphor."
I think we all can agree that "Land of Oz" is a stupid metaphor. Since it's only come into being as of this article, let's let it live out the rest of its days there as well.
"My guess is that cyber-hype is on the way out, for several reasons. For one thing, the novelty of PCs and wireless phones has worn off. They are no longer mystical portals to another dimension, but mere appliances."
In other news, fortune-telling is also passé. Crystal balls no longer telling futures, but can be used as lovely decorations. Also, Mac Cubes make great fishtanks.
"While we can all get smarter merely by dropping the term “cyberspace,” it’s not necessary to get rid of cyberspace itself. There never was any such thing."
This article never existed either. Click your scrollwheels at least three times to promptly return Home page.
posted by iamkimiam at 11:28 AM on March 21, 2013 [2 favorites]


so anyway, Motron (old friend) was right twenty or so years ago (while high on LSD). Figuring out information is like figuring out the weather. It's that big, it's that complex, it's that important in terms of our day to day lives. You can't predict it but you can forecast it ...

And I've certainly noticed that the accuracy of our weather forecasts has improved over the years. But they're never perfect. Which makes our increasing reliance on the forecasting problematic. Because it is so often correct, it really is cloudy for a while in the early morning, then sunny until mid afternoon when a few clouds start rolling in and rain by evening, snow at higher elevations.

But what happens when there's a powerful blow of wind, a sudden thunderstorm that no data crunching saw coming? These are the moments that remind us what we're dealing with. Chaotic systems that work on probabilities as opposed to possibilities. Which is fine as long as you act based on the former but always be prepared for the latter.
posted by philip-random at 11:29 AM on March 21, 2013 [1 favorite]


I was shooting from the hip a little bit there, sorry. This is a good post, with a lot of good links. I don't mean to pick one glaring flaw and blow it out of proportion (as tempting as it is to do...and clearly I have no self-control to resist such urges).

I'm a huge fan of big data. I'm also a huge fan of mixed methods. I think mixed methods are especially important when dealing with huge amounts of quantitative data (as big data is such). You need a qualitative counterbalance to give all of this some context. And as many have suggested above, to ask the right questions, listen to the data, ask new questions, listen to the data again...and on and on. That process is fun.

Anyways, lately I've been super interested in the technological determinism arguments. Which is underlying wild ideas about the things that gadgets are doing to our brains, how social networks are ruining our relationships, what spaces and places are real and not real and how stupid we are for thinking such magical thoughts...and on and on. That is also fun to explore. But tiresome at times too.

In sum, these sorts of things make me both happy and trigger-happy.
posted by iamkimiam at 11:43 AM on March 21, 2013


In 2002 the Oakland A's had something going for them that people always seem to forget when pointing to Beane's data analysis success: their scouting and draft successes. The four biggest cogs in the 2002 A's team were an amateur international free agent turned PED-using MVP winner (Tejada) and the trio of Zito, Mulder, and Hudson; all four were signed or drafted by the A's and scouted extensively. Sure, Beane's an incredible intellect that asked the right questions and was able to leverage an inefficiency in MLB front offices to score runs but scouting found them the heart of their talent.

Big data has a lot of potential but ultimately it comes down to understanding what the sandbox is you're playing in before you start drawing inference from the data. It also takes a lot of luck.
posted by playertobenamedlater at 11:52 AM on March 21, 2013 [1 favorite]




Having made it through about 2/3 of the links, I can be reasonably certain that this post was the culmination of a deep-seated marketing effort by Amazon to fill my shopping cart with over $100 worth of books.

One thing that's been on my mind recently is how this data enables us to tell stories that weren't able to be told in the past, or stories that were refuted by conventional thinking. I'm a huge basketball fan, and advanced statistics have changed the way that I analyze and think about the game. Instead of raw numbers (player X grabbed 10 rebounds per game this year), I can discuss stats like rebounding percentage (that is to say, the percentage of available rebounds a player collects), which is a different (and in many respects, superior) assessment of a given player's rebounding ability.
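For anyone who wants to see the difference in numbers, here is a sketch of rebounding percentage using a common formulation (an estimate of the share of available rebounds grabbed while the player was on the floor). The function and parameter names are my own, and the box-score numbers are made up:

```python
def rebound_pct(player_reb, player_min, team_min, team_reb, opp_reb):
    """Estimated % of available rebounds a player grabbed while on the floor.

    team_min is total team minutes (5 players x game length, e.g. 240 for a
    48-minute game), so team_min / 5 is the length of the game in minutes.
    """
    available = team_reb + opp_reb
    return 100.0 * (player_reb * (team_min / 5.0)) / (player_min * available)


# Two hypothetical players with the same raw total of 10 rebounds:
starter = rebound_pct(player_reb=10, player_min=36, team_min=240, team_reb=42, opp_reb=44)
reserve = rebound_pct(player_reb=10, player_min=24, team_min=240, team_reb=42, opp_reb=44)
print(f"starter: {starter:.1f}%  reserve: {reserve:.1f}%")
```

Same 10 rebounds in the box score, but the reserve collected a noticeably larger share of the rebounds that were actually there to be had in his minutes.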

In a similar vein, I can assess a player like Bruce Bowen, who does not fare well in "conventional" advanced statistics (because he is on the court primarily for defense), and rather than judge him based on how many points he scores or steals he makes, I can now see how much he reduces a given opponent's shooting percentage, and how that in turn creates a "win" for Bowen's team. In the past, comparing the value of a defensive ace and that of a scorer would have been comparing apples and oranges, but it really helps advance the understanding of the game for fans who crave a deeper level of analysis.

(I know sports is a trite example, but I've found it a fairly compelling topic among my friends who would rather not discuss how data mining is leading to a corporate-run dystopia)
posted by antonymous at 12:25 PM on March 21, 2013


Which is amazing considering how stupid the polling can be.

It's not necessarily about polling, though. At least, not exclusively. They're pulling data sets from all over the place: demographics, consumer data, spending habits, past voting patterns, etc.

Phone polls play a part in that, but they're increasingly becoming a less important part because no one has a home phone and most phone-bank lists don't have many cell phones included (since you can't just pull them out of the phone book).

This is one of the many reasons why the pollsters were so very wrong in their predictions in a few recent elections here in Canada. They're using methods that no longer give them accurate data. Sure...maybe everyone with a home phone is going to vote the way their polls predicted, but they missed the increasingly large number of households that don't have a home phone (like mine, for example). This is especially true in Alberta (where the polls were almost comically inaccurate), because we've got so many transient workers and recent arrivals who never bothered to get a home phone installed when they moved here.
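A back-of-the-envelope version of that coverage problem, as a sketch with invented numbers (none of these figures come from any actual Alberta poll, and the function names are hypothetical): if a poll only reaches landline households, its error depends on how many households it misses and how differently they vote.

```python
def true_support(coverage, p_covered, p_uncovered):
    """Population-wide support, mixing households the poll can and cannot reach."""
    return coverage * p_covered + (1.0 - coverage) * p_uncovered


def coverage_bias(coverage, p_covered, p_uncovered):
    """How far a landline-only estimate sits from the true population value."""
    return p_covered - true_support(coverage, p_covered, p_uncovered)


# Hypothetical: 60% of households have a landline; a party has 50% support
# among them but only 35% among cell-only households.
bias = coverage_bias(coverage=0.60, p_covered=0.50, p_uncovered=0.35)
print(f"Landline-only poll overstates support by {100 * bias:.0f} points")
```

Calling more landlines does nothing to shrink that gap, since the missing households stay missing no matter how large the sample gets.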
posted by asnider at 12:35 PM on March 21, 2013


If the only concrete example of an actual instance of successful use of 'big data' the author can come up with is picking a baseball team then I find it very hard to believe that these big data decisions are going to have an impact in many other fields.

I worked in the grocery industry for a long time, and retail supply chain management definitely qualifies as a "big data" industry at this point. Looking at cost, pricing, and sales trends down to the individual store level, and sussing out associative patterns between user demographics, product types, and so on are all big business at this point. It's not quite as sexy, and I'm a bit conflicted because it's also about replacing humans with algorithms, but there you have it.
posted by verb at 12:38 PM on March 21, 2013 [1 favorite]


See also The Philosophy of Data.
posted by mark7570 at 12:47 PM on March 21, 2013


As a professional data wrangler I would like to point out that there is never enough data. This is a false panic.
posted by fshgrl at 2:49 PM on March 21, 2013 [2 favorites]


Data are just data. Without a human to create the framework of assumptions around the analysis and interpretation of that analysis, a dataset is a useless lump.

It is also important to understand the difference between causal inference and prediction. The latter is relatively easy and when predicting what will happen is all you want to do, you don't care about causality. When you care about causal relationships, it gets very complicated and easy to fool yourself. That's when the house comes crashing down around you.
posted by Mental Wimp at 3:41 PM on March 21, 2013


Like aren't there useful pieces of information from before 2011?

When there's failure to account for, blaming it on the computer (or "bad data") has become very popular. On the other hand, credit for success accrues to the Nate Silvers. Same old game, new clothes.
posted by Twang at 3:42 PM on March 21, 2013


bonehead: "more physicists and chemists than biologists
Geologists too: GIS systems are nothing if not Big Data.
Dealing with Big Data has been one of, if not the, largest change and challenge for the current generation of scientists
"

Archaeologists, too! Geospatial data are quite dear to our hearts, and we're starting to acquire a huge amount of data on past behaviors and actions, and where those occurred.

From a social science standpoint, one of the things I'm loving about some of the massive datasets we're accruing (and here, I am thinking primarily of massive geospatial datasets of archaeological site locations, site components, etc.) is that, contrary to entrenched old schoolers' fears, it's not that simple solutions are falling out of the mass of data. If anything, having a huge amount of data actually complicates and recomplicates the picture to the point that the data can seem to support multiple hypotheses and viewpoints. I love it when things get complicated! Of all things "big" data do, at least from my point of view, simplifying answers is not one of them.

In a sense, I kinda feel like it's reminiscent of the Philosophers bit towards the end of Hitchhiker's Guide to the Galaxy (edited a bit for brevity, and apologies for the small text so I don't fill too much space, but):
"Under law the Quest for Ultimate Truth is quite clearly the inalienable prerogative of your working thinkers. Any bloody machine goes and actually finds it and we're straight out of a job aren't we? I mean what's the use of our sitting up half the night arguing that there may or may not be a God if this machine only goes and gives us his bleeding phone number the next morning?" (...)

"Might I make an observation at this point?" inquired Deep Thought. (...) "it occurs to me that running a programme like this is bound to create an enormous amount of popular publicity for the whole area of philosophy in general. Everyone's going to have their own theories about what answer I'm eventually to come up with, and who better to capitalize on that media market than you yourself? So long as you can keep disagreeing with each other violently enough and slagging each other off in the popular press, you can keep yourself on the gravy train for life. How does that sound?"

The two philosophers gaped at him.

"Bloody hell," said Majikthise, "now that is what I call thinking. Here Vroomfondel, why do we never think of things like that?"

"Dunno," said Vroomfondel in an awed whisper, "think our brains must be too highly trained Majikthise."

So saying, they turned on their heels and walked out of the door and into a lifestyle beyond their wildest dreams.
posted by barnacles at 10:57 PM on March 21, 2013 [3 favorites]


Isn't the bulk of (successful) human innovation based on ignoring our often wrong intuition and actually interpreting the data?

Secondly, the more data there is, the more need there will be for people to interpret it and sift through it intelligently.
posted by gjc at 6:36 AM on March 22, 2013


Isn't the bulk of (successful) human innovation based on ignoring our often wrong intuition and actually interpreting the data?

I don't know about this. I mean, if people are wrong about some fact that there's data about, presumably analyzing the data will reveal that people are wrong, but I'm not sure that's how progress has generally been made.

The dream that quantification will solve our problems has been around for hundreds of years. It always seems to present itself as something brand new that no one's ever thought of before.

"Big Data" is an interesting term -- to me it should be restricted to things like data about the entire internet, or the datasets that particle physicists and gene researchers accumulate. People often seem to use the term when they just mean "Data."
posted by leopard at 7:27 AM on March 22, 2013




Everything We Know About What Data Brokers Know About You

Discussion: How Do Data Brokers Impact You?
posted by homunculus at 10:45 AM on March 23, 2013


Schneier On Security: Our Internet Surveillance State
The Internet is a surveillance state. Whether we admit it to ourselves or not, and whether we like it or not, we're being tracked all the time. Google tracks us, both on its pages and on other pages it has access to. Facebook does the same; it even tracks non-Facebook users. Apple tracks us on our iPhones and iPads. One reporter used a tool called Collusion to track who was tracking him; 105 companies tracked his Internet use during one 36-hour period.

Increasingly, what we do on the Internet is being combined with other data about us. Unmasking Broadwell's identity involved correlating her Internet activity with her hotel stays. Everything we do now involves computers, and computers produce data as a natural by-product. Everything is now being saved and correlated, and many big-data companies make money by building up intimate profiles of our lives from a variety of sources.

Facebook, for example, correlates your online behavior with your purchasing habits offline. And there's more. There's location data from your cell phone, there's a record of your movements from closed-circuit TVs.
posted by the man of twists and turns at 9:08 AM on March 25, 2013 [1 favorite]


And welcome to a world where all of this, and everything else that you do or is done on a computer, is saved, correlated, studied, passed around from company to company without your knowledge or consent; and where the government accesses it at will without a warrant.
Actually, that last part about NSLs may not be true for much longer, depending on what happens with the appeal:

Federal Judge Finds National Security Letters Unconstitutional, Bans Them
posted by homunculus at 9:53 AM on March 25, 2013 [1 favorite]


obligatory
posted by jeffburdges at 6:44 PM on March 26, 2013


An Interview With Writer Jaron Lanier
In the case of a market place, yes. But this is why it is so critical that market places can’t be corrupt and need to be honest. The problem with our cloud software right now is that it does tend to be run by the person with the biggest computer on the network, and serve certain interests more than others. It’s not an honest broker. We are constantly running into a situation where a company like Google is saying: we are being the honest broker. Of course that is ridiculous because they are a commercial concern. So in order for us to be rationally ready to cede control to some cloud software, it really does have to achieve some state of honesty. I believe that should look more like a real market place.
posted by the man of twists and turns at 9:22 PM on March 27, 2013




What Big Data Will Never Explain
The mathematization of subjectivity will founder upon the resplendent fact that we are ambiguous beings. We frequently have mixed feelings, and are divided against ourselves. We use different words to communicate similar thoughts, but those words are not synonyms. Though we dream of exactitude and transparency, our meanings are often approximate and obscure. What algorithm will capture “the feel of not to feel it / when there is none to heal it,” or “half in love with easeful Death”?
posted by the man of twists and turns at 9:49 PM on April 2, 2013 [1 favorite]


What algorithm will capture “the feel of not to feel it / when there is none to heal it,” or “half in love with easeful Death”?

Well, our algorithms do, at least for some of us...
posted by Mental Wimp at 12:01 PM on April 3, 2013




The Real Wisdom Of Crowds - "To Iain Couzin from Princeton University, these stories are a little boring. Everyone is trying to solve a problem, and they do it more accurately together than alone. Whoop-de-doo. By contrast, Couzin has found an example of a more exciting type of collective intelligence—where a group solves a problem that none of its members are even aware of. Simply by moving together, the group gains new abilities that its members lack as individuals."
posted by the man of twists and turns at 6:17 AM on April 5, 2013 [1 favorite]


The Hidden Biases In Big Data, via
Sadly, they can't. Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.
posted by the man of twists and turns at 10:47 AM on April 9, 2013








This thread has been archived and is closed to new comments


