Papers and More on Data Mining
April 22, 2011 7:15 PM

Data mining has applications in health care, pharmaceuticals, facial recognition, economics and related areas, and of course much, much more. Previously, MeFi discussed controversial homeland security applications, and the nexus between social networking and mobile devices that further contributes to the pool of data. With plenty to dig into, let's talk Data Mining in more detail.

First, a high-school primer on AI previously shown on MeFi.

Second, and more important, introductions to key concepts:

Dimension Reduction: Principal Components Analysis and, for distance-based (not just geographical) dimension reduction, Multidimensional Scaling. These are algorithms and methods that take multivariate datasets and attempt to find what's most important/influential within a dataset.
Classification: Linear / Quadratic / Bayesian Quadratic Discriminant Analysis (LDA, QDA, BQDA). Add a paper on PCA vs. LDA. The Naive Bayes Classifier for a 'feature'-based approach. Lastly, but certainly not leastly, Support Vector Machines.
Clustering: when you have no labels, make them, with, for example, K-means Clustering.
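For a feel of what "no labels, make them" means in practice, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not taken from any of the linked papers; the function name k_means, the toy points, and the choice to pass the starting centroids in by hand are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, initial_centroids, max_iters=100):
    """Lloyd's algorithm: alternate assignment and centroid update.

    initial_centroids is supplied by the caller (an assumption of this
    sketch; real implementations usually pick them randomly).
    """
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point gets the label of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the labels no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated blobs, no labels given.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[-1]])
print(labels)
print(centroids)
```

Passing the starting centroids explicitly keeps the sketch deterministic. The result of K-means depends on where it starts, which is one reason practical implementations run it from several random initializations.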

To complete this, a good book on the topic: Pattern Recognition and Machine Learning by Chris M. Bishop.

posted by JoeXIII007 (14 comments total) 72 users marked this as a favorite

It's considered good form on MetaFilter to warn about PDF download links.
posted by hippybear at 7:37 PM on April 22, 2011 [4 favorites]

Principal components analysis, not principle.

I wrote a piece in Slate about the tension between dimension reduction and clustering.

Also, Cosma Shalizi's course notes are a superb introduction to many of the ideas discussed in this post.
posted by escabeche at 7:40 PM on April 22, 2011 [11 favorites]

Hippybear: that was the one thought I had after hitting the post button. I apologize for the lack of one. Mods: please slip in a small thing like "(PDF links mostly)." Thanks.

Escabeche: ... mods, you know... please, thanks.

Both of you, thanks for the pointers.

And back to our regularly scheduled enjoyment of data.
posted by JoeXIII007 at 7:58 PM on April 22, 2011

Oh god I spent six months of my life reading that damn book. You'd get more practical knowledge with Wikipedia and a copy of Weka.
posted by miyabo at 8:16 PM on April 22, 2011

The Witten and Frank book on Data Mining is also a good introduction, especially for those of us who are more programmer than statistician.
posted by RobotVoodooPower at 8:35 PM on April 22, 2011 [1 favorite]

Many of these techniques consist of fitting the same linear factor model under different assumptions. Mardia, Kent, and Bibby's Multivariate Analysis presents a unifying approach that cuts through the buzzwords.
posted by drdanger at 8:36 PM on April 22, 2011 [2 favorites]

The Bishop book is great... If you have the background in math and statistics.

Recently O'Reilly came out with a wonderful book, Mining the Social Web, which covers data mining at an introductory level. It's got great examples with code for social network analysis, information visualization and NLP / Information Retrieval. For those without a math/stats background, I highly recommend it.

Understanding data mining is crucial to understanding the privacy implications of so many of our new technologies. And also, it's going to be key to handling the huge amounts of information we're facing.
posted by formless at 8:51 PM on April 22, 2011 [1 favorite]

Elements of Statistical machine learning can be downloaded for free in PDF format, off the author's site.
posted by delmoi at 8:56 PM on April 22, 2011 [2 favorites]

Nobody better make Data Minecraft or I'm doomed.
posted by Aquaman at 9:43 PM on April 22, 2011 [1 favorite]

Another good (readable and enthusiasm-inducing) introductory book from O'Reilly is Programming Collective Intelligence; it even got me into Python (not 'in to a', that's another story entirely).
posted by titus-g at 4:59 AM on April 23, 2011

Great post! Sigh, now where to start...
posted by stratastar at 2:46 PM on April 23, 2011

The dark side of data mining (I sometimes suspect the only side).
posted by TedW at 3:32 PM on April 23, 2011 [1 favorite]

If you consider data dredging to be the dark side of data mining, you really aren't thinking.

Consider face recognition. Consider why one of the brightest electrical engineers I ever met refused to work on it, no ifs, ands, or buts.

Personally I find dependency models and tensor decompositions interesting, and I could not agree more with the notion that modern biology, medicine, and information retrieval will soon be inextricably wedded to the sorts of models previously associated with AI and machine learning. My recommendation for anyone with a reasonable background in probability and statistics would be to start with Hastie, Tibshirani, and Friedman (Elements of Statistical Learning, NOT Machine Learning... there is an interesting history here, where many of the initially exciting results in machine learning were shown to be simple overfitting, and a tension arose between statisticians and ML researchers, later resolved by the two fields slowly merging, as is currently happening). Like all books, it slowly grows outdated; one of the members of my dissertation committee has been threatening to write an "introductory" version for a while, and I do hope he will. The first edition of the book broke ground in the academic publishing industry by showing that a technical, four-color printing at $80+ could still make money... and then the authors did one better and showed that it could make money even as they gave away the PDF of the second edition for free. Personally I find it endlessly readable.

Another fine resource is Andrew Moore's Statistical Data Mining Tutorials, where he shows that by eliding the "hard" mathematical probability and statistics background required for reproducible, robust models, many beginners shortchange themselves into building clones of whatever fad is hottest.

At the end of the day, data mining is mathematics. Ignore this at your peril. Some of the most beautiful ideas ever loosed on the world stem from seemingly useless mathematical trivia; there are worse fates than to appreciate why this is so, in the midst of trying to learn something "useful" like data mining.
posted by apathy at 4:44 PM on April 24, 2011 [2 favorites]

Oh, and one other thing. PCA and LDA are both special cases of canonical correlation analysis, which is itself a special case of generalized eigendecomposition. You can push this as far as you might like; before he died in 2007, Gene Golub, whose work on the SVD laid the foundation for much of modern computational statistics, was extending the same techniques to tensors, which is to say, stacks of matrices (at least for tensors of order 3; higher-order tensors are harder to comprehend but obey certain mathematical rules that allow the same tools to work on them).
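To make the generalized-eigenproblem point concrete, here is a rough sketch in plain Python, restricted to 2D data and two classes. It treats both methods as the leading solution of A w = lambda B w: for PCA, A is the total scatter matrix and B the identity; for two-class Fisher LDA, A is the between-class scatter and B the within-class scatter. The helper names, the power-iteration solver, and the toy data are all assumptions made for illustration, and CCA itself is not shown.

```python
import math


def mean(points):
    """Componentwise mean of a list of 2D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, center):
    """2x2 scatter matrix: sum of outer products of (p - center)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - center[0], p[1] - center[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def mat_add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def top_generalized_eigvec(a, b, iters=200):
    """Leading w in a w = lambda b w, by power iteration on inverse(b) a."""
    b_inv = inverse_2x2(b)
    w = [1.0, 1.0]
    for _ in range(iters):
        w = mat_vec(b_inv, mat_vec(a, w))
        norm = math.hypot(w[0], w[1])
        w = [w[0] / norm, w[1] / norm]
    return w


# Hypothetical toy data: both classes are stretched along x,
# but the classes are separated along y.
class0 = [(-3.0, 0.0), (-1.0, 0.2), (1.0, -0.2), (3.0, 0.0)]
class1 = [(-3.0, 1.0), (-1.0, 1.2), (1.0, 0.8), (3.0, 1.0)]
everything = class0 + class1
grand_mean = mean(everything)

identity = [[1.0, 0.0], [0.0, 1.0]]
total_scatter = scatter(everything, grand_mean)
within = mat_add(scatter(class0, mean(class0)), scatter(class1, mean(class1)))
# Between-class scatter: each class mean, weighted by its class size.
between = mat_add(scatter([mean(class0)] * len(class0), grand_mean),
                  scatter([mean(class1)] * len(class1), grand_mean))

pca_dir = top_generalized_eigvec(total_scatter, identity)  # max variance
lda_dir = top_generalized_eigvec(between, within)          # max class separation
print("PCA direction:", pca_dir)
print("LDA direction:", lda_dir)
```

On this data the same solver returns a direction close to the x axis for PCA and close to the y axis for LDA, which is the usual PCA vs. LDA contrast: only the pair of matrices changes, not the machinery.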

Again, mathematical trivia used only by physicists and psychometricians until the last few years, whereupon the applications have exploded. Worse fates exist than to understand why.
posted by apathy at 4:50 PM on April 24, 2011


This thread has been archived and is closed to new comments
