♫ "The first and only freely-available, industrial-scale dataset for research on popular music and audio analysis" ♬
October 2, 2011 5:55 PM

"The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks." It's about 288 GB but you can download a smaller subset of 10,000 songs selected at random to get a taste. Curious what you'll get? Check out this example track description.

Their FAQ has links to other large music metadata datasets and you can read about other music data mining projects using other datasets. What have people used it for? You may want to check out

- The Natural Language of Playlists [a mashup with Art of the Mix data]
- The SecondHandSongs dataset, a list of cover songs within the original dataset.
- The MusiXmatch dataset with lyrics provided for songs [in the copyright-nonproblematic bag-of-words format] which Last.fm used to do this. (more info)

And here's the paper by its creators (pdf) explaining why and how they put it together.

posted by jessamyn (27 comments total) 68 users marked this as a favorite

I checked out the example track description and I feel I've just been metarickrolled.
posted by Loudmax at 6:12 PM on October 2, 2011 [10 favorites]

That 'example track description' had to be the most INDIRECT Rick Rolling I have ever been subjected to. I like.
posted by oneswellfoop at 6:13 PM on October 2, 2011 [3 favorites]

D'oh.
posted by oneswellfoop at 6:13 PM on October 2, 2011 [1 favorite]

Yay! It is an awesome project; there is already interesting stuff being done with it (aside from the above, see for example the recent post about the Slow Build experiment from Echo Nest's Paul Lamere), but there's just so much crazy potential for future projects that it just makes me happy knowing it's out there.
posted by cortex at 6:25 PM on October 2, 2011

Holy crap.....
posted by lazaruslong at 6:57 PM on October 2, 2011

If this is only the metadata why would it be so big? 288 gig seems like a lot of data for (only) a million songs.
posted by cjorgensen at 7:03 PM on October 2, 2011

man i love logging onto metafilter late at night and seeing my latex linked up

cortex pointed out slow build, also check out his later search for music by drawing a picture of it (at the handy url searchformusicbydrawingapictureofit.com). Also, for a brief tutorial on using AWS EMR to do (almost free) bulk analysis on the dataset, see How to process a million songs in 20 minutes. (Amazon is kindly hosting the MSD for AWS users to mount for free.)

Thierry's got a pretty awesome add-on to the MSD he'll be announcing at ISMIR later this month, which reminds me, I have to go do some work for that....
posted by brianwhitman at 7:03 PM on October 2, 2011 [4 favorites]

cjorgensen: Echo Nest has a ton of data for each song. We've got the pitch and timbre and loudness envelope for every discrete sound within each song, as well as a ton of "contextual" data like blog posts, tags/terms, IDs/URLs on different services etc.
posted by brianwhitman at 7:05 PM on October 2, 2011
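
As a back-of-envelope check on the size question above, here is a minimal sketch. It assumes decimal gigabytes and exactly one million tracks, neither of which the thread confirms. Under those assumptions the dataset averages a few hundred kilobytes per song, which seems plausible for per-sound pitch, timbre and loudness data plus the contextual metadata.

```python
# Rough per-track size, assuming decimal units (1 GB = 10**9 bytes).
TOTAL_BYTES = 288 * 10**9   # "about 288 GB" from the post
TRACKS = 1_000_000          # "a million songs" (assumed exact here)

bytes_per_track = TOTAL_BYTES / TRACKS
kb_per_track = bytes_per_track / 1000

print(f"{kb_per_track:.0f} KB per track")
```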

Great to see this, as I'm just embarking on a project using the Million Song Dataset as we speak!
posted by escabeche at 7:10 PM on October 2, 2011

Wow - that's a lot of song minutiae. I just need to know if it's got a good beat and I can dance to it.
posted by Slack-a-gogo at 7:15 PM on October 2, 2011 [1 favorite]

This is some really neat stuff. Has anyone mentioned this site to fishlike yet?

I'm not sure I get what all is going on with any of the links, but this is the kind of stuff I like to explore. It makes me wish I was more into datawankery.
posted by cjorgensen at 7:20 PM on October 2, 2011 [1 favorite]

Slack-a-gogo, got you covered

http://developer.echonest.com/api/v4/song/search?api_key=N6E4NIOVYMTHNDM8J&format=json&results=10&sort=danceability-desc&max_tempo=142&min_tempo=138&bucket=audio_summary
posted by brianwhitman at 7:22 PM on October 2, 2011
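
To see what that query is asking for, the URL can be pulled apart with Python's standard library. This is only a sketch that parses the string quoted above; it makes no network request, and the reading of the parameters (ten results, sorted by danceability descending, tempo between 138 and 142, presumably beats per minute) is inferred from their names rather than from any API documentation.

```python
from urllib.parse import urlsplit, parse_qs

# The URL exactly as posted in the comment above.
URL = ("http://developer.echonest.com/api/v4/song/search"
       "?api_key=N6E4NIOVYMTHNDM8J&format=json&results=10"
       "&sort=danceability-desc&max_tempo=142&min_tempo=138"
       "&bucket=audio_summary")

parts = urlsplit(URL)
# parse_qs returns a list per key; each key appears once here, so keep the first.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.path)
for name, value in params.items():
    print(f"{name} = {value}")
```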

I wish I knew how to munge musical data as well as I can munge text in Python.
posted by mccarty.tim at 7:29 PM on October 2, 2011

By which I mean "Make a crappy markov of every piece of dialogue in My Little Pony: Friendship is Magic."
posted by mccarty.tim at 7:29 PM on October 2, 2011 [1 favorite]

mccarty.tim, at the risk of sounding like i am moderating a thread i didn't even make, you should definitely check out EN remix-- it loads pieces of audio up in a python list intelligently segmented on the beat or onset, and you can natively do stuff to it, including markovian messes of fun. just pretend the items in the segment list are words. surely cortex has tried this on scenes from The Adventures of Pete and Pete.....
posted by brianwhitman at 7:39 PM on October 2, 2011
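
The "pretend the items in the segment list are words" idea can be sketched without any audio library. The code below is an illustration only: the `segments` list holds made-up string labels standing in for audio segments, and `build_chain` and `walk` are hypothetical helper names, not part of EN remix.

```python
import random
from collections import defaultdict

def build_chain(items):
    """Map each item to the list of items that followed it in the sequence."""
    chain = defaultdict(list)
    for current, following in zip(items, items[1:]):
        chain[current].append(following)
    return chain

def walk(chain, start, length, rng):
    """Generate up to `length` items by repeatedly sampling a recorded successor."""
    out = [start]
    while len(out) < length:
        choices = chain.get(out[-1])
        if not choices:      # dead end: this item never had a successor
            break
        out.append(rng.choice(choices))
    return out

# Stand-in labels; in a real remix these would be audio segment objects.
segments = ["kick", "snare", "kick", "hat", "snare", "kick", "snare", "hat"]
chain = build_chain(segments)
remixed = walk(chain, "kick", 8, random.Random(0))
print(remixed)
```

The same two functions work on a list of words from a transcript, which is the text-munging case mentioned above.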

The data is interesting, I suppose, but (easily) the most interesting piece of data that they could include, but will never, ever include are the lyrics. It would be so utterly trivial to include lyrics in the metadata for every damned song in existence. The extra storage overhead would be virtually nil, and the correlating data you could amalgamate would be outrageously fascinating. Alas, no.

Thanks a lot, copyright.
posted by Civil_Disobedient at 7:46 PM on October 2, 2011 [4 favorites]

I'm sorry to report that, while a very cool concept, "search for music by drawing a picture of it" is broken. I drew a "loud, quiet, loud" profile, and did not get one Pixies song returned in the top 10.
posted by benito.strauss at 7:48 PM on October 2, 2011 [1 favorite]

Check out this example track description.

How does this work exactly?
posted by ovvl at 7:53 PM on October 2, 2011

surely cortex has tried this on scenes from The Adventures of Pete and Pete...

Heh. Actually, I tried at one point to resynthesize one of MLK's speeches using footage of Obama speaking, but my code was pretty hacky and the source material I had kind of sucked so it really didn't go anywhere interesting. But the potential for higher-fidelity resynthesis using a way larger data set (and a million songs is safely in "way larger" territory even if you aggressively reduce it to a subset matching the sort of timbre and content needed) is really interesting and I need to get around to playing with it at some point.

Step one will be setting aside a few minutes at some point to suck down the 288 GB.
posted by cortex at 8:32 PM on October 2, 2011

Dear Jessamyn:

Tim here. Jesus Fucking H. Christ on a stick.

You do this to me on a Sunday night? This is the way you are?

Thanks for that.

Seriously.
posted by timsteil at 8:49 PM on October 2, 2011

This is the way I am. You are welcome.
posted by jessamyn at 8:55 PM on October 2, 2011 [2 favorites]

I wonder if Apple will use any of this for their iCloud foofaraw. I saw it mentioned somewhere that they've got 28 million songs checksummed so far.
posted by furtive at 9:19 PM on October 2, 2011

44,745 unique artists

Amateurs.
posted by meehawl at 9:29 PM on October 2, 2011

I wonder if Apple will use any of this for their iCloud foofaraw. I saw it mentioned somewhere that they've got 28 million songs checksummed so far.

IIRC iTunes uses Rovi. Though wiki claims they only have 13 million tracks.
posted by Artw at 11:46 PM on October 2, 2011

Artw: iTunes does not use Rovi for iTunes match.

Civil_Disobedient: there's lyrics in there, not sure what you're referring to. It's mentioned in the post.
posted by brianwhitman at 6:45 AM on October 3, 2011
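
For anyone wondering what the bag-of-words format mentioned in the post looks like, here is a minimal sketch. It keeps only how often each word occurs and throws away word order, which is why the original lyric cannot be reconstructed from it. The sample line is made up, and `bag_of_words` is a hypothetical helper; the actual dataset's tokenization and file layout are not described in this thread.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding order and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

line = "la la la, sing it again, la la"   # made-up text, not a real lyric
bag = bag_of_words(line)
print(bag.most_common(2))
```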

I'd like to report a bug:

"Never Gonna Give You Up": danceability: 0.0

Helllllllloooooooooo????
posted by victors at 6:46 AM on October 3, 2011

victors: think that might be a MSD display error: we have it as 0.73:

http://developer.echonest.com/api/v4/song/search?api_key=N6E4NIOVYMTHNDM8J&format=json&results=1&artist=rick%20astley&title=never%20gonna&bucket=audio_summary
posted by brianwhitman at 6:50 AM on October 3, 2011
