"About $43,000 a year."
October 13, 2015 5:20 PM

What's the Difference Between Data Science and Statistics? Not long ago, the term "data science" meant nothing to most people, even to those who worked with data. A likely response to the term was: "Isn't that just statistics?" These days, data science is hot. The Harvard Business Review called data scientist the "Sexiest Job of the 21st Century." So what changed? Why did data science become a distinct term? And what distinguishes data science from statistics?
posted by tonycpsu (37 comments total) 44 users marked this as a favorite
 
My guess was going to be that statistics is all about applying probability theory and data science is "screw probability, we have ALL THE DATA."
posted by If only I had a penguin... at 5:42 PM on October 13, 2015 [3 favorites]


Statistics is a field of study; data science is a job description. Data science combines statistics with programming and business knowledge. In that sense, a data scientist is like a full-stack developer.
posted by I-Write-Essays at 5:43 PM on October 13, 2015 [13 favorites]


A data scientist is like a full-stack developer.

Well at least they can write R and SQL code probably.
posted by GuyZero at 5:44 PM on October 13, 2015 [11 favorites]


The point being that rather than specializing, you're expected to fill many roles, and thereby add more value than a person who "only does one thing."
posted by I-Write-Essays at 5:48 PM on October 13, 2015 [1 favorite]


Where does informatics fit in all of this?
posted by ArbitraryAndCapricious at 5:50 PM on October 13, 2015 [1 favorite]


In my experience, data scientists don't build information systems. They are expected to be the information system, and build whatever (ad hoc) tools they need to answer the questions put to them by business.
posted by I-Write-Essays at 5:54 PM on October 13, 2015 [7 favorites]


Upon being told he's a data scientist, Statistician Jourdain sez "I have been speaking prose all my life, and didn't even know it!".
posted by lalochezia at 5:54 PM on October 13, 2015 [1 favorite]


Relevant link from Stanford statistics professor Rob Tibshirani on the terminology differences between statistics and machine learning.

Honestly, I think statistics' main problem as a field is some combination of AP Statistics and the required statistical methods course most science majors have to take. Between those two experiences it seems most quantitatively minded people convince themselves they hate statistics, and it takes years to convince them otherwise.
posted by town of cats at 6:02 PM on October 13, 2015 [7 favorites]


But seriously, I think the difference is all in the titles - a data scientist conducts experiments, a statistician just analyzes data. Assuming the job responsibilities and the job title have any correlation at all, which is not guaranteed.

But for sure the idea behind data scientists, as far as I understand it, is that they design experiments, get all the pieces in place to conduct the experiment, gather the data and analyze the results. Statistics is just part of it. And they may do a lot of the coding or very little. Logging is probably still handled by dedicated logging people and suddenly I'm having MeFi deja vu - I think we've had this discussion before?
posted by GuyZero at 6:04 PM on October 13, 2015 [1 favorite]


OK let me try:

Statistics: Kanye West
Data Science: Rick Rubin

Statistics: What tea should taste like
Data Science: Improbability Drive

Statistics: Dubya
Data Science: Rumsfeld

AM I DOING IT RIGHT.
posted by jimmythefish at 6:16 PM on October 13, 2015 [5 favorites]


Do they really "conduct experiments"? I'm having trouble, given everything I've read about the field, seeing exactly where the "science" is getting done. It really does just seem like data analysis from the outside.
posted by Steely-eyed Missile Man at 6:20 PM on October 13, 2015


I think statistics' main problem as a field is some combination of AP Statistics and the required statistical methods course most science majors have to take. Between those two experiences it seems most quantitatively minded people convince themselves they hate statistics, and it takes years to convince them otherwise.

I may, uh, know someone who had this experience. Tell me about the process of convincing them otherwise.

Seriously, there were very few of the classes I took in the course of working through a math undergrad that defeated me as repeatedly as stats did. I tried the mathematical stats class and dropped out, I tried the statistical methods class for engineers and dropped out, finally I made it through on the stats class tailored specifically for econ majors (even though I wasn't one). And even there I never felt like I understood what was going on underneath the surface manipulation -- as opposed to a lot of math classes where I could re-derive most of what I might need to remember for a long time afterward. I've wondered if this means I need to get down and dirty with underlying probability math, or try a bayesian tack... or if the old maxim that says "you don't understand this, you just get used to it" applies here.
posted by weston at 6:24 PM on October 13, 2015 [3 favorites]


A statistician can help you figure out the correlations between 100,000 pictures of dogs. A data scientist writes a program that takes 100,000 pictures of dogs and results in a) a program that can tell you if there is probably a dog in a picture and b) a program that turns any image into an amazing and or horrifying mixture of features of dogs.
posted by idiopath at 6:25 PM on October 13, 2015 [17 favorites]


I'm trying to hire a data scientist right now, and just finished a first pass on 80-odd resumés.

The tweet in that post "Person who is better at statistics than any software engineer and better at software engineering than any statistician." is pretty accurate.

The vast majority of my resumés were engineers or comp sci people who've done some more math, or math people that have done a bunch of comp. sci stuff.

My expectation for all applicants is that their stats is wicked good, and to me the differentiating factor is use of and familiarity with the variety of 'Data Science' tools. Machine learning, natural language processing, SQL & DB stuff, hopefully a bit of 'big data' tools like Hadoop, MapReduce etc. too.
posted by Sleddog_Afterburn at 6:29 PM on October 13, 2015 [11 favorites]


I think this Venn diagram nicely shows the difference.
posted by I-Write-Essays at 6:50 PM on October 13, 2015 [7 favorites]


A statistician can help you figure out the correlations between 100,000 pictures of dogs. A data scientist writes a program that takes 100,000 pictures of dogs and results in a) a program that can tell you if there is probably a dog in a picture and b) a program that turns any image into an amazing and or horrifying mixture of features of dogs.

No, a data scientist will take a million pictures of dogs, will clean and structure the data into a useable database, apply the correct methodology (text mining? sentiment analysis? who knows??), and tell you where the dogs have been shopping, what sort of kibble they bought and if that correlated to sales or advertising or one of a thousand other variables, and will generate usable reports with an optimised run routine to get the report done in minutes rather than hours.

And they will likely do it all on proprietary software that is not R because no multimillion dogfood company is going to trust its corporate strategy and billions of dollars of revenue to R code that a loose team of devs put together over a couple of years of uni courses and several rounds of beer. They might have an in-house team that develops stuff, but the liability insurance would be huge.
posted by ninazer0 at 6:57 PM on October 13, 2015 [9 favorites]


> no multimillion dogfood company is going to trust its corporate strategy and billions of dollars of revenue to R code that a loose team of devs put together over a couple of years of uni courses and several rounds of beer.

Maybe not a dogfood company, but how about the financial industry?
posted by I-Write-Essays at 7:02 PM on October 13, 2015 [1 favorite]


And what distinguishes data science from statistics?

About $120k/yr.
posted by mhoye at 7:03 PM on October 13, 2015 [1 favorite]


ninazer0: "and tell you where the dogs have been shopping, what sort of kibble they bought and if that correlated to sales or advertising or one of a thousand other variables, and will generate usable reports…"

…that recommend the exact wrong kind of kibble to suggest they try. You know, like Facebook, Amazon and Spotify with apps, books and music, respectively.
posted by signal at 7:10 PM on October 13, 2015 [1 favorite]


Steely-eyed Missile Man: Do they really "conduct experiments"? I'm having trouble, given everything I've read about the field, seeing exactly where the "science" is getting done. It really does just seem like data analysis from the outside.

This is a sort of semantic question that I think lags a bit behind the modern practice of science in some ways. As someone who switched fields from Biochemistry to Computer Science, I've remarked that these days the thing I do that feels most like The Scientific Method is debugging: from observations of my code's behavior, formulate hypotheses of why it's doing that, design tests that will distinguish which of my hypotheses, if any, are correctly describing its internal tangly bits, execute those tests, and interpret their results. Which might require repeating because, say, none of my potential causes are exactly why it's doing that thing but at least I know it can't be because of a dangling pointer over in this data structure, so perhaps… and I'm off with a new round of hypotheses.

But is that science? Not really, it's developing. Codewriting. Whereas, well, I recently learned that to particle physicists, each detector at the LHC is one "experiment"—the ATLAS detector, the ALICE detector, those are two experiments only. But that's not a good count of the number of experimental things being tested by the LHC: it spins its particles around and generates this massive (the world's largest, last I heard) database of results. And only then, physicists trawl through those data looking for connections, interesting statistical anomalies, stuff like that. The LHC is out of the (fast) loop of generating hypotheses and testing them—all the questions physicists come up with are tested by asking the massive data blob.

It's on a second, slower, loop that they start to run into more and more questions that can't be answered by the LHC data blob, and start talking about designing new detectors or upgraded magnets or maybe even building a new accelerator, and make specific choices about what data they want to get that affect how they design the instruments.

But these days, "instruments" made entirely out of math that act as microscopes into a slide made out of database slices are one way the scientific method really is performed.

(Now I want to rename my own work's data analysis code "microtome".)
posted by traveler_ at 7:17 PM on October 13, 2015 [13 favorites]


... If you're paying liability insurance to someone, actually (or if liability is involved at all), there's a much better chance you're a statistician rather than a data scientist.

In reality there is no bright line between the fields commonly considered computer science and statistics, and this causes no end of trouble when computer scientists don't know anything about sampling or when statisticians can't code their way out of a paper hat. Determinism in computation is a funny illusion that only comes out of the drowning in inherent randomness in really fast digital computation and exceedingly simple systems, and p << n statistics is also a funny illusion that only comes out of the sheer inability to do computation quickly, a definite bed of Procrustes.

My friend D---- doing R simulations for Google would probably have a small beef with R not being production-ready for large corporations. But Python is now basically the language of choice, because you can do actual computations with it and it wasn't designed by people who didn't want to design a language.

weston: I recommend you poke at the philosophical foundations of probability and of probability theory. I think it's a limiting factor in our understanding of data, along with the general inability to compute.
posted by curuinor at 7:19 PM on October 13, 2015 [5 favorites]


Well at least they can write R and SQL code probably.

Node and Mongo when I accidentally ended up interviewing for a "Data scientist" position. Mostly I got the impression they were looking for someone who could translate their big dataset into smaller datasets that could make pretty charts.

...Which I could do, but I would have felt pretty weird taking that title, TBH.
posted by Artw at 7:23 PM on October 13, 2015 [1 favorite]


My friend D---- doing R simulations for Google would probably have a small beef with R not being production-ready for large corporations. But Python is now basically the language of choice, because you can do actual computations with it and it wasn't designed by people who didn't want to design a language.

But R isn't ready for production systems. No one deploys R in production for real work, and the reason for that is abundantly clear to me, as I spent a terrible month this year bootstrapping an offline prediction service (done in R) to an actual API (still... done in R). Then we rewrote the thing in Node of all things (although models are still trained in R) and it's about 10 times faster and less fault-prone.

Everyone uses R to noodle around in offline work, but if you're pushing data through R as part of a larger automated system, you'll generally have a bad time.
posted by TypographicalError at 8:21 PM on October 13, 2015 [3 favorites]


I was told one way to tell when an academic scientist is thinking of making the jump into the real world is that they update their LinkedIn profile to include "data scientist" (I refuse to confirm or deny whether my LinkedIn profile currently contains the words data scientist).
posted by drnick at 8:27 PM on October 13, 2015 [8 favorites]


TypographicalError: just because you have a bad time doesn't mean it isn't in production, unfortunately. Even in Google. He hates it, too, if that's any consolation.

Even working on quite normative systems in production, IO is often still the problem because CPUs are so damn fast, so Node may be an OK idea just because of its eventedness, which may overwhelm the advantages lots of other languages have in places where the limit is CPU.

An interesting pattern I've been poking at (not saying that it's true, I've just been poking at it) is that most people who get paid for programming deal with IO limits as their day-to-day computational limit, and most people who get paid for doing computer science research deal with CPU limits as their day-to-day computational limit. The difference between professional and grad student code?
posted by curuinor at 8:48 PM on October 13, 2015 [6 favorites]


Haha, yes, drnick. To me "data scientist" means "someone with a PhD in physics or electrical engineering or some other math-heavy science field who's tired of working for peanuts and/or did a couple postdocs and doesn't see any TT prospects ahead."
posted by town of cats at 8:51 PM on October 13, 2015 [4 favorites]


The kibble was an analogy. A red herring, if you will. :)

Look, if you want a real world example, what about a bank tracking possible fraudulent transactions across thousands and thousands of credit cards. Millions of transactions that need to be filtered and classified with rules written that identify fraudulent, non-typical spending without triggering false-positives. What constitutes typical? How much customer history is useful? How much can you jettison and still have a viable product that doesn't cost the company a serious amount of money in processing time or dollar loss? Can you predict where likely frauds will occur from someone's spending habits (are they shopping online? where? who?) and how badly will your rule perform in the midst of non-typical seasonal shopping?

That's where data science starts to move away from statistics per se.
posted by ninazer0 at 9:00 PM on October 13, 2015 [3 favorites]


I always considered it the difference between theoretical and applied sciences: one enhances our capability to understand the world by developing new models and ideas (or enhancing, changing or replacing current ones) while the other applies these models and ideas to that world to solve a specific problem.
posted by linux at 10:35 PM on October 13, 2015 [1 favorite]


For years upon years of recorded history, mankind lived according to pure market principles. Goods were obtained, needs were met or neglected, and natural law ruled.

After millions of such years, the smartest and most-abled of our illustrious ancestors undertook to invent the insurance company. This invention allowed the most-entitled to place a maximum lower bound on the risk of their investment in exchange for a small penalty on the upper bound of their investment's potential return.

This proved a great boon to those already possessed of wealth, and became an even greater one as the bracket was tightened through statistical means by those they called "actuarial scientists". Much wealth accrued to the investors, and only a small, single-digit percentage accrued to those who ran these companies. Such was the desert of their ingenuity, but compounded in its sheer volume to the makings of great personal and institutional wealth.

Such a system prevailed for hundreds of years, up until the invention of so-called "instruments of personal finance". Credit cards extended to the population as a whole the convenience of a line of credit, at the mere expense to the end-creditor of a month of returns. The instrument was larcenous in its ingenuity, yet unassailable as the de-facto creditor risked alienating custom against their new-found convenience. The interest accruing to this month of returns, as well as fees to the custom, went directly to the issuer of these credit agents.

This proved also a great boon. Card issuers retained their own staff of statisticians to fine-tune fees and rates of interest and default to maximize their own returns. Their risk was distributed to the insurance companies who satisfied themselves with rent-taking and considered themselves too busy to bother getting into such a comparatively high-risk market.

Such a system prevailed for about twenty years, up until credit card companies were too satisfied with extracting a few percent on every transaction ever performed to bother getting into any other market either. There were plenty of opportunities: plenty of extra dollars and cents which the consumer was, very figuratively, "leaving on the table".

The credit card companies were not interested. Rent-taking was a full-time position, a very lucrative one, and not one worth leaving. The extra value left on the table, those few dollars and cents not yet wrung out of the consumer, those were there for the taking of a different, of a new kind of company.

So began the age of the data scientist.
posted by 7segment at 10:54 PM on October 13, 2015 [9 favorites]


I think the difference is that the term data scientist covers a much broader range of job responsibilities than statistician. I could reasonably be termed a data scientist, but performing statistical analyses is a relatively small part of my job. I'm responsible for understanding our company data from input to analysis. This means understanding how several hundred different database tables spread across several different databases all relate to each other, figuring out how exactly to extract and combine data from all those sources into a dataset that can actually be used for the analysis, performing the analysis, and finally presenting the results and making recommendations to company executives. Alongside all of that, I'm building tools, ranging from reports, to complex Excel books with a ton of macros, to help front line and low-level management understand their operations better by exposing them to data that can help them make better business decisions.

The good part is our data is highly structured, and almost completely stored in enterprise databases. The bad part is our data is far more wide than it is tall, which just causes a ton of complexity when it comes to answering the question of the week, where two similar-sounding questions could require two completely different processes to be built to extract the data. I don't have much support from our IT data warehouse team, mostly because there are so many different ways to combine the data, and the questions I get are so varied that our EDW just doesn't cut it.

The biggest key to success that I've noticed is having a deep understanding of the business and having good relationships with business leaders (in addition to phenomenal data skills). I've seen a few people come through the doors that don't get out and learn the business and try to just suck in data and spit out analyses without truly understanding the context of the data, and it just doesn't work out well in the end. Don't get me wrong, they're some of the best statisticians we've had, but there's more needed than just knowing the statistics.

All that seems more than just what a statistician would do (to me at least), but I also understand that it's also not the Silicon Valley picture of data scientist that many people have (e.g., predictive/prescriptive analytics, machine learning, etc.). If anyone has a better title, I'd love to hear it.
posted by noneuclidean at 6:01 AM on October 14, 2015 [1 favorite]


From my biased perspective:

Statisticians have the luxury of spending a ton of time up front considering how to collect the perfect data set for their analysis. They wouldn't be doing their job if they skimped on this part. The stats required to analyze the data can (and probably should) be chosen and implemented before any samples are collected.

The real world is not like that. Data sets are haphazard, biased, and possibly incomplete. From that, the data scientist must come up with something that will be useful for the company (in its analytical or predictive capabilities) regardless of what academia would say about it. Now.
posted by mantecol at 7:41 AM on October 14, 2015 [2 favorites]


Statistics arose when the main challenge was not having enough data and how to collect it and leverage it to make estimates and evaluate hypotheses, while also measuring the uncertainty of the conclusions.

Data Science arose when the main challenge was Big (really big, humongous) Data that continues to pour in like a tsunami and to discover what surprising things can be learned from it.
posted by Sir Rinse at 7:59 AM on October 14, 2015 [4 favorites]


A statistician is someone willing to spend five years figuring out the minuscule ways in which a quantitative procedure is unreliable.

A data scientist is someone willing to use a procedure they discovered three weeks ago in order to compile a report by the end of the quarter.

It should surprise no one that the former make much less money while the latter draw many more spurious conclusions.
posted by belarius at 7:59 AM on October 14, 2015 [5 favorites]


Lies, damned lies, and data science.
posted by cynical pinnacle at 8:12 AM on October 14, 2015


A statistician tells you what the numbers are.
A data scientist tells you why they are what they are.
posted by Burn_IT at 1:41 PM on October 14, 2015


noneuclidean, that's the type of Data Scientist I'm familiar with. Maybe it's an East-coast vs West-coast thing. There are huge regional differences in IT culture.
posted by I-Write-Essays at 5:03 PM on October 14, 2015


Statistics arose when the main challenge was not having enough data and how to collect it and leverage it to make estimates and evaluate hypotheses, while also measuring the uncertainty of the conclusions.

Data Science arose when the main challenge was Big (really big, humongous) Data that continues to pour in like a tsunami and to discover what surprising things can be learned from it.


I keep hearing all these BIG DATA!!1! pronouncements in Monster Truck Rally announcer voices.

"TUESDAY NIGHT AT THE COLO SITE ... WATCH THE REALTIME LOGS *CRUSH* 2000 CORES"

"WATCH DATASAURUS RIIIIP THROUGH 50 PETABYTES OF CAT VIDEOS!"

Please refrain from having more than one slide in your presentation which salivates over OMG HOW MUCH DATA you have. Especially at a conference of stats/data people.
posted by benzenedream at 11:09 AM on October 15, 2015 [5 favorites]

