The data totaled about 20 gigabytes, compressed.
February 19, 2016 10:58 AM

I have a copy of Amazon. Meaning that, on my hard drive there is a massive chunk of Amazon’s product and reviews database—a listing of nine million or so products and 80 million or so reviews taken from 1996 to 2014. The names of all the books in that chunk, their sales ranks, their categories. Every pair of pants for kids, every sock. All the books about Hitler; all the books about snakes. All the different Lego sets. Whatever.
posted by Chrysostom (50 comments total) 18 users marked this as a favorite
 
I never thought I'd see a commercial review dataset as complete as this sounds. This will put a skip in my step this weekend once I've got some free time!
posted by stirred for a bird at 11:03 AM on February 19, 2016 [1 favorite]


the world of the careful sentences

I would read that sci-fi novel.
posted by Greg_Ace at 11:10 AM on February 19, 2016 [7 favorites]


many prices are missing

Aren't some of those algorithmically generated?
posted by ChurchHatesTucker at 11:11 AM on February 19, 2016 [1 favorite]


every pair of pants for hitler. every snake in every pair of pants for hitler. all the books about snakes in Hitler's pants. All the different Lego models of books about snakes in every pair of Hitler's pants. Every different Lego Hitler set of snake pants for kids....
posted by Wolfdog at 11:11 AM on February 19, 2016 [10 favorites]


Hitler's Snake Pants is going to be my new band's name.
posted by briank at 11:15 AM on February 19, 2016 [2 favorites]


Pantsing Snake Hitler, on the other hand, is a story that cries out to be written.
posted by prize bull octorok at 11:19 AM on February 19, 2016 [7 favorites]


Hitler's Snake Pants is going to be my new band's name.

You really don't want the audience that would draw.
posted by dances_with_sneetches at 11:24 AM on February 19, 2016 [2 favorites]


It’s not a perfect copy by any means, but neither is it a pirated one. Rather, it is “spidered” data, culled by automatically visiting Amazon’s web site and copying what is found, adding it up, aggregating it.

Hmmm... I'm not sure this person understands what 'pirated' actually means. I'm not judging him for being a pirate; I'm judging him for not knowing that he is a pirate.
posted by el io at 11:26 AM on February 19, 2016 [2 favorites]


Some things this dataset doesn't have:

per-customer and per-product purchase data (beyond what you can piece together from product reviews and sales rank)
customer paths (what do people look at before ultimately buying something or walking away? what do they search for?)
purchase details such as shipping addresses (you can learn a lot from who buys what for whom)
price history and associated purchase data (what price produces the best margin at what times of the year?)

That kind of stuff is what makes Amazon tick.
posted by jedicus at 11:27 AM on February 19, 2016 [3 favorites]


Screenscraping for academic purposes walks the line of what would be considered piracy. I don't think it's productive to get into a discussion of that one very specific nuance.
posted by schmod at 11:31 AM on February 19, 2016 [1 favorite]


I kept looking and looking but finally I had to admit: I can’t climb this particular mountain. There’s no obvious path through this data.

lucille-bluth-eyeroll.gif

I can't imagine why these 80 million rows of thematically-similar-but-otherwise-completely-unrelated-opinions-by-god-only-knows-who don't reveal an obvious purpose and/or divine intent, once you roll them up in PostgreSql and type "select *".
posted by Mayor West at 11:33 AM on February 19, 2016 [12 favorites]


TLDR: "I downloaded a copy of Amazon and you won't believe what I found out! I found out I'm no good at analyzing data so I have nothing to say about it that 10 minutes browsing Amazon wouldn't get you."
posted by mmoncur at 11:41 AM on February 19, 2016 [24 favorites]


I think there are a ton of interesting things you could do with this data.

Can you relate how long someone has owned an item to how likely their review is to be positive? Can you determine the educational level of positive vs. negative reviewers (vocabulary, etc.)? Does Amazon allow profanities in reviews? Can you figure out a way to algorithmically determine if a review is sarcastic or not?
posted by el io at 11:44 AM on February 19, 2016


Reading this article, no surprise that Chris Hughes was the Facebook money guy, rather than the ideas guy.
posted by My Dad at 11:54 AM on February 19, 2016


Can you figure out a way to algorithmically determine if a review is sarcastic or not?

Given that even many humans can't discern the difference, I'm going to go with "no" on that one.
posted by Greg_Ace at 11:54 AM on February 19, 2016 [5 favorites]


Screenscraping for academic purposes walks the line of what would be considered piracy.

Yar! If screenscraping websites be piracy, then it's a pirate's life for me. Now where's me rum? Some sodomy and the lash be promised me.
posted by Nelson at 12:22 PM on February 19, 2016 [9 favorites]


Screenscraping for academic purposes walks the line of what would be considered piracy.

robots.txt, IP banning / rate-limiting, and terms of service exist for a reason. If a publicly accessible website doesn't want to be scraped, there are technical and legal mechanisms to prevent it.

Whether that applies in this case, I don't know. But a lot of my research involves scraping data from websites, and we are careful to obey robots.txt and check the terms of service.
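
As a rough illustration of the robots.txt check described above, here is a minimal sketch using Python's standard urllib.robotparser. The sample rules, the bot names (ResearchBot, BadBot), and the example.com URLs are all made up for illustration. A real crawler would load the site's own robots.txt, and it would still have to read the terms of service separately, since no parser can do that part.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not any real site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A generic bot may fetch ordinary pages but not the disallowed prefix.
    print(allowed(SAMPLE_ROBOTS_TXT, "ResearchBot", "https://example.com/products/1"))
    print(allowed(SAMPLE_ROBOTS_TXT, "ResearchBot", "https://example.com/private/x"))
    # A bot that is singled out with "Disallow: /" may fetch nothing.
    print(allowed(SAMPLE_ROBOTS_TXT, "BadBot", "https://example.com/products/1"))
```

Note that this only tells a well-behaved crawler what the site has asked for; nothing in it stops a crawler that chooses to ignore the answer.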
posted by jedicus at 12:30 PM on February 19, 2016 [2 favorites]


Screenscraping for academic purposes walks the line of what would be considered piracy.

Yar! If screenscraping websites be piracy, then it's a pirate's life for me. Now where's me rum? Some sodomy and the lash be promised me.


I call it the Internet Arrrrrchive for a reason
posted by numaner at 12:36 PM on February 19, 2016 [3 favorites]


the world of the careful sentences

I would read that sci-fi novel.


Wouldn't that really be in the realm of the Fantasy genre though?
posted by BigHeartedGuy at 12:42 PM on February 19, 2016


robots.txt, IP banning / rate-limiting, and terms of service exist for a reason. If a publicly accessible website doesn't want to be scraped, there are technical and legal mechanisms to prevent it.

So if you leave your door unlocked, that means you're cool with me just kind of wandering into your house and taking a few photographs, right?
posted by dersins at 12:49 PM on February 19, 2016


Oh look, terrible metaphors
posted by RustyBrooks at 12:51 PM on February 19, 2016 [20 favorites]


metaphor ≠ analogy
posted by dersins at 12:54 PM on February 19, 2016


Oh look, terrible analogies
posted by 0xFCAF at 1:00 PM on February 19, 2016 [32 favorites]


HITLER'S PANTS SNAKE
posted by quonsar II: smock fishpants and the temple of foon at 1:09 PM on February 19, 2016 [1 favorite]


I'm not sure "leaving the door (to their private residence) unlocked" is a good encapsulation of Amazon's business strategy.

A better door analogy would be assuming Walmart leaving their doors unlocked is an invitation for you to come in and catalog their inventory. And as it turns out I'm willing to bet their corporate policy would be opposed to that, for various reasons.
posted by Mr.Encyclopedia at 1:10 PM on February 19, 2016 [2 favorites]


A better analogy would be if Hitler left his pants open and you wandered in and took a few snakes.
posted by Wolfdog at 1:14 PM on February 19, 2016 [24 favorites]


So if you leave your door unlocked, that means you're cool with me just kind of wandering into your house and taking a few photographs, right?

There is a vast difference in the expectation of privacy in someone's home versus a publicly accessible, searchable website that uses none of the commonly accepted technical or legal mechanisms to indicate that it is less than fully public. There is no need for an analogy or metaphor. We can address the medium of the web directly.

A better door analogy would be assuming Walmart leaving their doors unlocked is an invitation for you to come in and catalog their inventory. And as it turns out I'm willing to bet their corporate policy would be opposed to that, for various reasons.

If someone walked into a Walmart and started cataloging the inventory, Walmart would be fully within their rights to require that person to leave. But up to that point, the person would not have done anything wrong. Similarly, by default, someone is free to scrape data from Amazon's site (subject to limitations imposed by copyright, etc), unless Amazon has used technical or legal measures to indicate that scraping is prohibited.
posted by jedicus at 1:15 PM on February 19, 2016 [9 favorites]


nine million or so products

Not to diminish this entirely, but what he has isn't even close to a "massive chunk" of Amazon's catalog.

Currently there are ~ 503,054,245 products listed on Amazon.com. That includes digital items, but not products that are currently out of stock. It's pretty mind boggling, especially considering the catalog has more than doubled in < 3 years.

Source: I used to write about this sort of thing.
posted by paulcole at 1:17 PM on February 19, 2016 [7 favorites]


The scraping is expressly prohibited by their website terms of use (and likely always has been, as such terms have been common for a long time). This presents both a copyright and a contractual legal problem for anyone scraping and, because it implicates contract law, arguably also prohibits what might otherwise be copyright fair use. The fact that this author felt the need to talk about the issue of piracy yet completely failed to talk to anyone about the true legal issues just makes me hate on "journalists." Sloppy.


Subject to your compliance with these Conditions of Use and your payment of any applicable fees, Amazon or its content providers grant you a limited, non-exclusive, non-transferable, non-sublicensable license to access and make personal and non-commercial use of the Amazon Services. This license does not include any resale or commercial use of any Amazon Service, or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of any Amazon Service or its contents; any downloading, copying, or other use of account information for the benefit of any third party; or any use of data mining, robots, or similar data gathering and extraction tools. All rights not expressly granted to you in these Conditions of Use or any Service Terms are reserved and retained by Amazon or its licensors, suppliers, publishers, rightsholders, or other content providers. No Amazon Service, nor any part of any Amazon Service, may be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of Amazon.

posted by Muddler at 1:18 PM on February 19, 2016 [2 favorites]


I feel like this is possibly the least-interesting argument we could possibly have about this topic, but I'll throw in a "yeah, if somebody catalogs Wal-Mart's inventory and makes the data available, good on 'em". Also, Amazon is a corporation predicated in large part on consuming oceans of data about its users and people who sell stuff by way of Amazon as a mediator/marketplace and then eating swaths of the economy like some kind of fucking algorithmic plague monster, so fuck them and I hope it all burns, and I guess I am not super into worrying about their intellectual property.
posted by brennen at 1:21 PM on February 19, 2016 [9 favorites]


Well, brennen, to you it is not interesting, but to a ton of e-commerce companies out there, this is hugely important. On one side you have companies like Amazon that really don't want you digging for a wide range of reasons including their personal benefit in being the one with all that wealth of data, and on the other a huge number of startups that are scraping data to build their own product and make money. The point isn't so much worrying about their IP, but worrying about whether they have any right to stop people from doing just what is happening here. If you like digging through such databases, the best thing that could happen would be a series of cases that allow it, and then you get such data crawled from all over the place and a bunch of startups basing their new product off of other existing works. Otherwise, you see this driven underground. We saw a similar ebb and flow with news aggregators, and many social media apps and the like are scraping lots of places for event information. I personally would have found an FPP on scraping in general more interesting.
posted by Muddler at 1:48 PM on February 19, 2016


Yeah, I mean, I have spent most of my career (such as it is) working in e-commerce, and questions about data scraping have been sort of central for me more than a few times, from both ends of the experience.

There are thorny ethical questions, sure. Because databases. I guess I just can't muster a lot of moral dudgeon on this one. The train of thought in the actual essay is more interesting than the notion of "piracy" here. Like, Ford is right: We really, really don't know how to read databases. And databases, like the organizations that exist on top of them, are kind of a massive determining factor in the environment we now occupy.
posted by brennen at 2:08 PM on February 19, 2016 [5 favorites]


Snake pants wouldn't have any leg holes, would they? WOULD THEY?!?
posted by Chitownfats at 2:23 PM on February 19, 2016 [1 favorite]


If you like huge datasets, you might like the StackExchange data dump.
posted by grobstein at 2:23 PM on February 19, 2016


Snake pants wouldn't have any leg holes, would they?

In fact they'd be all one leg, period.
posted by Greg_Ace at 2:42 PM on February 19, 2016 [1 favorite]


I would think Amazon has a technical method which could keep people from systematically scraping the site. A scraper reads a lot of pages, and unless those requests were being spread out among a wide range of IP addresses or over time, wouldn't they just be able to shut it down at the firewall? I know it is a huge cloud organization with a complex network, but if they really cared, they could stop it. Correct me if I'm wrong. The only experience I have with web scraping is a program I wrote for a university course, which was basic, at best. We were required to process robots.txt.
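
To make the rate-limiting idea above concrete, here is a minimal sketch of a per-IP sliding-window limiter. It is purely hypothetical: the class name, the thresholds, and the sample IP addresses (taken from the documentation ranges) are assumptions for illustration, and nothing here reflects how Amazon actually handles traffic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client IP within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP to the timestamps of its recently allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the limit: this request would be refused.
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    # One client sending requests at t = 0, 1, 2, 3 and 11 seconds.
    decisions = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3, 11)]
    print(decisions)  # [True, True, True, False, True]
    # A second IP is counted separately, which is why spreading requests
    # across many addresses (or over time) slips past a limiter like this.
    print(limiter.allow("198.51.100.2", 3))  # True
```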

I don't have a lot of sympathy for Amazon, even though I buy a lot of stuff through them. If the data was originally downloaded for analysis to create a paper for a conference, that seems like a reasonable enough use to me.

And, how do you think Amazon is coming up with the algorithmic pricing? I would bet they've got plenty of scrapers looking at the competition's websites for pricing regardless of what the competition's TOS says.
posted by Roger Dodger at 3:05 PM on February 19, 2016 [1 favorite]


Screenscraper Pirates II: Hitler's Pants Snake vs. The Fucking Algorithmic Plague Monster.

A Novel of the World of the Careful Sentences Cycle.
posted by nubs at 3:18 PM on February 19, 2016 [7 favorites]


If Amazon allowed Google to crawl and index, but not Bing or Wolfram Alpha, would that present legal issues like monopolistic collusion?
posted by anotherpanacea at 3:29 PM on February 19, 2016 [1 favorite]


anotherpanacea: "If Amazon allowed Google to crawl and index, but not Bing or Wolfram Alpha, would that present legal issues like monopolistic collusion?"

You don't have to wonder what Amazon's robots.txt contains. It's publicly available as part of the website. Here it is.

It has a long list of what a generic bot is allowed and (much more frequently) not allowed to index. The list is even more restrictive for googlebot:
$ diff generic_robot googlebot 
23a24
> Disallow: */sim/B001132UEE
43a45
> Disallow: /gp/aw/cr/
47a50
> Disallow: /gp/cdp/member-reviews/
72a76,77
> Disallow: /gp/pdp/profile/
> Disallow: /gp/pdp/rss/*/reviews
107a113
> Disallow: /rss/people/*/reviews

Here's the entire entry for bingbot:
User-agent: Bingbot
Disallow: /gp/socialmedia/giveaways
So they are treated differently.

eTao's spider gets no love at all:
User-agent: EtaoSpider
Disallow: /
Which is no surprise since eTao is a Chinese competitor.
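
For anyone who wants to repeat this kind of comparison without splitting the file by hand, here is a minimal sketch that groups Disallow rules by user agent and reports what one agent is denied beyond the generic entry. The sample text is hypothetical (the /example/ paths are invented; /gp/aw/cr/ and the EtaoSpider entry are borrowed from the output above) and is not Amazon's actual file. The parser is deliberately simplified and ignores Allow lines and wildcard semantics.

```python
def disallow_rules(robots_txt: str) -> dict[str, set[str]]:
    """Group Disallow paths by lowercased user agent (simplified parser)."""
    rules: dict[str, set[str]] = {}
    current_agents: list[str] = []
    reading_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # Strip comments and whitespace.
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:
                # A User-agent line after rules starts a new group.
                current_agents = []
                reading_rules = False
            agent = value.lower()
            current_agents.append(agent)
            rules.setdefault(agent, set())
        elif field == "disallow":
            reading_rules = True
            for agent in current_agents:
                rules[agent].add(value)
    return rules


# Hypothetical sample for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /example/cart
Disallow: /example/wishlist

User-agent: Googlebot
Disallow: /example/cart
Disallow: /example/wishlist
Disallow: /gp/aw/cr/

User-agent: EtaoSpider
Disallow: /
"""

if __name__ == "__main__":
    rules = disallow_rules(SAMPLE)
    # Paths denied to Googlebot on top of the generic entry.
    print(sorted(rules["googlebot"] - rules["*"]))  # ['/gp/aw/cr/']
    print(sorted(rules["etaospider"]))  # ['/']
```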
posted by double block and bleed at 5:14 PM on February 19, 2016 [1 favorite]


As far as I can tell, that link to the UCSD researcher's site with the data has never been indexed by Google, so I don't know that the researcher appreciates that. Or maybe, whatever. 2014 is ancient history as far as market data is concerned, as is 1996. This data will provide a larf or two for the alien archaeologists that exhume our civilization.
posted by RobotVoodooPower at 5:17 PM on February 19, 2016


bah, this is why we have camelcamelcamel
posted by dorian at 5:59 PM on February 19, 2016 [4 favorites]


Also, Amazon is a corporation predicated in large part on consuming oceans of data about its users and people who sell stuff by way of Amazon as a mediator/marketplace and then eating swaths of the economy like some kind of fucking algorithmic plague monster, so fuck them and I hope it all burns, and I guess I am not super into worrying about their intellectual property.

Hey, I'm all about 'the music industry fucks the artist over, so feel free to pirate the major labels' works' as an argument, but let's not pretend that position is a legal one. Maybe one can gauge the moral character of a content creator before deciding whether to honor their copyrights, but don't pretend that you are fully compliant with the law. And maybe this is fair use, but waving it away as 'it was spidered, so it's fine' is just absurd.

Amazon certainly can create technical measures to prevent people from *easily* scraping their site (just as Netflix can engage DRM to prevent people from *easily* pirating stuff from them), but blaming them for not creating these measures doesn't seem to be a reasonable position to take.

If the data was originally downloaded for analysis to create a paper for a conference, that seems like a reasonable enough use to me.

Yeah, that sounds reasonable to me as well. But wait, that was the original use... What is the current use?

It's frustrating that a tech author would take the position that because the data was grabbed via spidering, it is magically not piracy, and fully legal (for any use?).

I had a friend that wrote a book on cyberethics... And yet he had another book that contained a copyrighted image taken by a professional photographer. When I asked him if he got permission to use that photo, his response was "I found it on the internet".

"I found it on the internet" doesn't make it legal or ethical. Other things may (fair use, for example), but there are plenty of examples of web sites that have been ripped on piratebay that are obviously piracy by anyone's definition.
posted by el io at 9:48 PM on February 19, 2016 [1 favorite]


Amazon is a corporation predicated in large part on consuming oceans of data...and then eating swaths of the economy like some kind of fucking algorithmic plague monster, so fuck them

Amen. I wish people who "don't have a lot of sympathy for Amazon" would try to find other sources for the stuff they buy instead of continuing to throw their money into Amazon's insatiable maw.
posted by Johnny Wallflower at 10:09 PM on February 19, 2016 [1 favorite]


robots.txt files advise automated crawlers that certain URLs may not be useful to index. They sure as hell aren't a security measure.
posted by save alive nothing that breatheth at 10:43 PM on February 19, 2016 [3 favorites]


I like huge datasets and I cannot lie.
posted by I_Love_Bananas at 3:49 AM on February 20, 2016 [1 favorite]


I really hope this guy used AWS to spin up the resources he used to hit all the pages and store the data.
posted by DigDoug at 5:20 AM on February 20, 2016 [1 favorite]


I should have been clearer about how robots.txt provides no security at all, but I think it's safe to assume Google and Bing program their bots to honor them.
posted by double block and bleed at 8:15 AM on February 20, 2016


MeFi's own
posted by one weird trick at 2:28 AM on February 21, 2016 [1 favorite]


I think it's safe to assume Google and Bing program their bots to honor them

...Amazon's Chinese competitors and Amazon's robots.txt, perhaps not so much?
posted by flabdablet at 6:03 AM on February 21, 2016


Hitler's Snake Pants is going to be my new band's name.

Maybe you can open for these guys.
posted by homunculus at 8:10 PM on February 22, 2016



