GaMuSo applies BioGraph-based data mining to music to recommend artists similar to one you name. It draws on 140K user-defined tags from last.fm, collected for over 400K artists, and sorts results by the "nearest" or most probable matches for your artist of interest (algorithm described here).
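For a rough feel of what "nearest" can mean when artists are described by tags, here is a minimal sketch. It is not GaMuSo's BioGraph-based algorithm; it swaps in plain cosine similarity over tag counts as a simpler stand-in. The artist names, tags, and counts are invented for illustration, and `cosine` and `nearest` are hypothetical helper names.

```python
import math
from collections import Counter

# Hypothetical tag counts per artist (invented for illustration, not last.fm data).
ARTIST_TAGS = {
    "Artist A": Counter({"shoegaze": 5, "dream pop": 3, "indie": 2}),
    "Artist B": Counter({"shoegaze": 4, "indie": 3}),
    "Artist C": Counter({"death metal": 6, "metal": 4}),
}


def cosine(a, b):
    """Cosine similarity between two tag-count vectors stored as Counters."""
    dot = sum(a[tag] * b[tag] for tag in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(artist, tags=ARTIST_TAGS, k=5):
    """Return up to k (artist, similarity) pairs, most similar first."""
    query = tags[artist]
    scored = [
        (other, cosine(query, vec))
        for other, vec in tags.items()
        if other != artist
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for name, score in nearest("Artist A"):
        print(f"{name}: {score:.3f}")
```

Artists that share heavily used tags score close to 1, and artists with no tags in common score 0. A graph-based ranking like the one GaMuSo describes can also surface matches that share no tags directly, which this stand-in cannot.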
Big Data On Campus (NYTimes): “We don’t want to turn into just eHarmony,” says Michael Zimmer, assistant professor in the School of Information Studies at the University of Wisconsin, Milwaukee, where he studies ethical dimensions of new technology. “I’m worried that we’re taking both the richness and the serendipitous aspect of courses and professors and majors — and all the things that are supposed to be university life — and instead translating it into 18 variables that spit out, ‘This is your best fit. So go over here.’ ”
In The Geographic Flow of Music (arxiv), researchers Conrad Lee and Pádraig Cunningham propose a method to use data from the last.fm API to track the world's listening habits by location and time, showing where shifts in musical tastes have originated and subsequently migrated. Results show music trends originating in smaller cities and flowing outward in unexpected ways, contradicting some assumptions in social science about larger cities being more efficient engines of (cultural) invention.
It has applications in health care, pharmaceuticals, facial recognition, economics and related areas, and, of course, much, much more. Previously, MeFi discussed controversial homeland security applications, and the nexus between social networking and mobile devices that further contributes to the pool. With plenty to dig into, let's talk Data Mining in more detail.
MeFi's own Elizabeth Pisani, of The Wisdom of Whores, on Big Data and the End of the Scientific Method (PDF).
Kaggle hosts competitions to glean information from massive data sets, à la the Netflix Prize. Competitors can enter for free, while companies with vast stores of impenetrable data pay Kaggle to outsource their difficulties to the world population of freelance data miners. Kaggle contestants have already developed dozens of chess rating systems that outperform the Elo rating currently in use, and have identified genetic markers in HIV associated with a rise in viral load. Right now, you can compete to forecast tourism statistics or predict unknown edges in a social network. Teachers who want to pit their students against each other can host a Kaggle contest free of charge.
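For reference, the Elo baseline those chess entries were measured against is short enough to sketch. This uses the standard Elo expected-score formula; the K-factor of 20 is an assumed value (federations use different ones), and the function names are made up for illustration. It has nothing to do with the Kaggle entries themselves.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B (0 to 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=20):
    """Return both players' new ratings after one game.

    score_a is 1 for an A win, 0.5 for a draw, 0 for an A loss.
    k=20 is an assumed K-factor, not a value taken from the post.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated players; A wins.
    print(update(1500, 1500, 1))
```

Each game only moves the two ratings by how far the result differed from the expectation.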
According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Data, data everywhere, and possibly too much to drink?
AOL releases three months of queries from 500k users. AOL, fairly or unfairly, is sometimes considered the internet with training wheels, so keep that in mind while parsing this data. Some of these queries seem like spam email subjects, don't they? Don't forget, this is the same demographic that brought you the September that didn't end. AOL tried to retract the data, but it's no use - it's out there, on the web.