Earlier this year, Chris Whong made a FOIL request to the New York City Taxi and Limousine Commission, receiving fare and trip data for all licensed cabs in New York in 2013. (previously) The data was anonymised, but as Vijay Pandurangan realised, only partially. [more inside]
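A minimal sketch of why hashing alone can leave data like this only partially anonymised. It assumes, which the summary above does not state, that identifiers were replaced with an unsalted MD5 hash and that valid identifiers follow a short, known pattern. The pattern used here (digit, letter, digit, digit) and the sample ID `7B42` are illustrative assumptions, not values taken from the released data.

```python
import hashlib
import string
from itertools import product


def md5_hex(text: str) -> str:
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup() -> dict[str, str]:
    """Hash every ID of the assumed form digit-letter-digit-digit.

    The keyspace is only 10 * 26 * 10 * 10 = 26,000 values, so the
    whole table can be precomputed almost instantly.
    """
    table = {}
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        plain = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table


lookup = build_lookup()

# A "pseudonymised" ID as it might appear in a released file (hypothetical).
hashed = md5_hex("7B42")

# Reversing it is a single dictionary lookup.
recovered = lookup[hashed]
print(len(lookup), recovered)
```

The point of the sketch is that when the set of possible inputs is small, a hash behaves like a lookup key rather than a secret. A random per-record token would not have this weakness.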
Despite the comment collecting engine crashing on the last day to submit comments on the very popular topic of Network Neutrality, the system worked well enough to collect 1.1 million comments, which the FCC has made available to the general public as six XML files, totaling over 1.4 gigs of raw data. Mailed comments postmarked prior to July 18 are still being scanned and entered, so this isn't everything, but it's a lot of data. TechCrunch graphed the frequency of certain words, with the high score going to Comcast, with 4,613 mentions. NPR shared the visualized results of Quid's analysis of a sample of 250,000 comments, and Quid's analysis of a sample of 317,000 comments to map geographic sources of the public comments and adjusted them based on state populations to depict which states care more about net neutrality, while The Verge dug deeper, mapping comments by zip code.
Burritos provide a good way to experiment precisely because they represent a relatively narrow range of experience. There are different burrito styles across the country — more than you might gather if your burrito-eating ambitions have never ventured beyond Taco Bell. But there are fewer parameters to control for when rating burritos than when comparing movies, or doctors, or colleges.
Nathaniel Read ("Nate") Silver is launching a national, 64-restaurant Burrito Bracket
From a small town in Romania, Guccifer skewered and glorified the power elite.
If Snowden perfectly fit the profile of geek crusader, Lehel, a stone-faced, disheveled man in a tight leather jacket, seemed an odd candidate for one of the world’s most notorious hackers. But Guccifer is to hacking what the Beatles are to rock and roll. He had predecessors, 4Chan cowboys like Anonymous and Sabu of LulzSec, but he’s changed the nature of hacking fame. Guccifer rose by exploiting the connections people make online to infiltrate the private lives of some of the most powerful people on Earth. He served up the results to the media, irresistible high-low raw material for an online news cycle driven by leaks and voyeurism and racked by anxiety over privacy. What Is A Guccifer? [more inside]
GaMuSo is an application of BioGraph-based data mining to music, which helps you get recommendations for other musicians. Based on 140K user-defined tags from last.fm that are collected for over 400K artists, results are sorted by the "nearest" or most probable matches for your artist of interest (algorithm described here). [more inside]
The ACLU reports that the IRS claims in an internal document that it has the authority to access citizens' online communications without a warrant. The IRS claimed in a 2009 document that "the Fourth Amendment does not protect communications held in electronic storage, such as email messages stored on a server, because internet users do not have a reasonable expectation of privacy in such communications." It still retains that position even after the 2010 case of US v Warshak which determined that citizens have a reasonable expectation of privacy in such communications. [more inside]
Why Privacy Matters, Even If You Have Nothing To Hide, by Daniel J. Solove
The nothing-to-hide argument pervades discussions about privacy. The data-security expert Bruce Schneier calls it the "most common retort against privacy advocates." ... To evaluate the nothing-to-hide argument, we should begin by looking at how its adherents understand privacy. Nearly every law or policy involving privacy depends upon a particular understanding of what privacy is. The way problems are conceived has a tremendous impact on the legal and policy solutions used to solve them. [more inside]
Schools in Missouri, Maryland, and other states are using fingerprint scans and RFID chips to track students as a means to speed up service in the cafeteria and to track student whereabouts in and around school. [more inside]
The Justice Department, after a legal battle with the ACLU to avoid having to admit it, recently released documents showing that the federal government’s use of warrantless “pen register” and “trap and trace” surveillance has multiplied over the past decade. But the Justice Department is small potatoes. Every day, the NSA intercepts and stores 1.7 billion emails, phone calls, texts, and other electronic communications. [more inside]
Supermarkets are attempting to customize prices for different shoppers. At a Safeway in Denver, a 24-pack of Refreshe bottled water costs $2.71 for Jennie Sanford, a project manager. For Emily Vanek, a blogger, the price is $3.69. [more inside]
The Touch-point Collective: Crowd Contouring on the Casino Floor - 'Historically, casinos have been eager adopters of technologies that help them to gather knowledge about their customers. The knowledge-gathering repertoire of the modern casino has shifted from telephone surveys, focus groups, and rudimentary datasets to complex feats of reconnaissance and analysis enabled by player tracking systems, data visualization tools, and behavioral intelligence software suites. Many surveillance techniques first applied in casinos were only later adapted to other domains—airports, financial trading floors, shopping malls, banks, and government agencies.' There are some large, embedded .avi files in the page, be careful. [more inside]
Big Data On Campus (NYTimes) “We don’t want to turn into just eHarmony,” says Michael Zimmer, assistant professor in the School of Information Studies at the University of Wisconsin, Milwaukee, where he studies ethical dimensions of new technology. “I’m worried that we’re taking both the richness and the serendipitous aspect of courses and professors and majors — and all the things that are supposed to be university life — and instead translating it into 18 variables that spit out, ‘This is your best fit. So go over here.’ ”
In 2011 Malaysia Airlines introduced what is believed to be the world's first airline integration with Facebook. In February Air France KLM announced its Meet And Seat program, allowing customers to scan other passengers' social media profiles to select or reject seatmates. (Previously). It prompted safety and privacy concerns, while others said it showed how a company "gets" social media. In June airBaltic announced it would trial SeatBuddy to make trips more pleasant by seating like-minded people next to each other. Now, British Airways has decided to use the Internet to create dossiers on its customers, including using Google images to find pictures of passengers so that staff can approach them as they arrive at the terminal or plane. The Know Me service will initially be limited to first class passengers and other 'captains of industry'. So-called 'social seating' is part of an emerging trend to marry data-mining with customer service.
In The Geographic Flow of Music (arxiv), researchers Conrad Lee and Pádraig Cunningham propose a method to use data from the last.fm API to track the world's listening habits by location and time, showing where shifts in musical tastes have originated and subsequently migrated. Results show music trends originating in smaller cities and flowing outward in unexpected ways, contradicting some assumptions in social science about larger cities being more efficient engines of (cultural) invention.
"And with millions of chicks checking in daily, there's never been a better time to be on the hunt...."
A column by John Brownlee over at Cult of Mac yesterday highlighted his privacy concerns about the app Girls Around Me -- which used a mashup of Foursquare check-ins, Google Maps and Facebook public profile information to show the user women who were nearby. In response to the story, Foursquare cut off the app's API access to its data, effectively knocking it out of commission. CNET: How to prevent friends checking you into locations at Facebook Places. [more inside]
With a “chief scientist” specializing in consumer behavior, an “analytics department” monitoring voter trends, and a squad of dozens huddled at computer screens editing video or writing code, the sprawling office complex inside One Prudential Plaza looks like a corporate research and development lab — Ping-Pong table and all. But it is home to the largely secret engine of President Obama’s re-election campaign, where scores of political strategists, data analysts, corporate marketers and Web producers are sifting through information gleaned from Facebook, voter logs and hundreds of thousands of telephone or in-person conversations to reassemble and re-energize the scattered coalition of supporters who swept Mr. Obama into the White House four years ago.
How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did (excerpt from How Companies Learn Your Secrets (single page))
Datamining Shakespeare --- Othello is a Shakespearean tragedy: when the hero makes a terrible mistake of judgment, his once promising world is led into ruin. Computer analysis of the play, however, suggests that the play is a comedy or, at least, that it does the same things with words that comedies usually do. On October 26, 2011, Folger Shakespeare Library Director Michael Witmore discussed his recent work in Shakespeare studies which combines computer analysis of texts, linguistics, and traditional literary history. Taking the case of Shakespeare's genres as a starting point, Witmore shows how subtle human judgments about the kinds of plays Shakespeare wrote — were they comedies, histories or tragedies? — are connected to frequent, widely distributed features in the playwright's syntax, vocabulary, and diction. (approx. 30 minute lecture.) [more inside]
Oren Etzioni is a renowned data mining expert who sold Farecast, his airline-ticket price predictor, to Microsoft for $115 million. Now he's turned his focus to the general problem of finding when the best shopping bargains occur. Punch in a consumer electronics item and his website will tell you whether to buy now or to wait. Over time he'll be adding more product categories. In any case, he can tell you right now the best prices for most things aren't on Black Friday or Cyber Monday.
YouTube Insult Generator. Enter a keyword or phrase, and the Insult Generator will trawl YouTube for relevant videos, and pull insults from those videos. Wired write-up. [via]
By processing a million songs in twenty minutes and using the Stairway detector, Paul discovered many songs that Slow Build "more" (up to 29) than Stairway to Heaven (which gets only a 9). [via] [more inside]
The logical conclusion of our relationship to computers: expectantly to type “what is the meaning of my life” into Google.
It’s for your own good—that is Google’s cherished belief. If we want the best possible search results, and if we want advertisements suited to our needs and desires, we must let them into our souls. James Gleick writes about 'How Google Dominates Us' for the New York Review of Books. [more inside]
On not reading books. Franco Moretti, author of the controversial Graphs, Maps, Trees: Abstract Models for a Literary History, proposes that literary study needs to abandon "close reading" for "distant reading": "understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data." He is co-founder of the Stanford Literary Lab, where he and like-minded colleagues have published studies on programming computers to use statistical analysis to identify a novel's genre (PDF) and analyzing plots as networks (PDF). Similar projects are on the way.
It has applications in health care, pharmaceuticals, facial recognition, economics/related areas, and of course, much much more. Previously, MeFi discussed controversial homeland security applications, and the nexus between social networking and mobile devices that further contributes to the pool. With plenty to dig into, let's talk Data Mining in more detail. [more inside]
"The results were astounding. In a six-month period — from Aug 31, 2009, to Feb. 28, 2010, Deutsche Telekom had recorded and saved his longitude and latitude coordinates more than 35,000 times. It traced him from a train on the way to Erlangen at the start through to that last night, when he was home in Berlin. Mr. Spitz has provided a rare glimpse — an unprecedented one, privacy experts say — of what is being collected as we walk around with our phones."
"In many places the concentration [of convicted residents] is so dense that states are spending in excess of a million dollars a year to incarcerate the residents of single city blocks."
Using rarely accessible data from the criminal justice system, the Spatial Information Design Lab and the Justice Mapping Center have created maps of these “million dollar blocks” and of the city-prison-city-prison migration flow for five of the nation’s cities. The maps suggest that the criminal justice system has become the predominant government institution in these communities and that public investment in this system has resulted in significant costs to other elements of our civic infrastructure — education, housing, health, and family. Prisons and jails form the distant exostructure of many American cities today. See the several linked PDFs.
MeFi's own Elizabeth Pisani, of The Wisdom of Whores, on Big Data and the End of the Scientific Method (PDF).
Kaggle hosts competitions to glean information from massive data sets, a la the Netflix Prize. Competitors can enter free, while companies with vast stores of impenetrable data pay Kaggle to outsource their difficulties to the world population of freelance data-miners. Kaggle contestants have already developed dozens of chess rating systems which outperform the Elo rating currently in use, and identified genetic markers in HIV associated with a rise in viral load. Right now, you can compete to forecast tourism statistics or predict unknown edges in a social network. Teachers who want to pit their students against each other can host a Kaggle contest free of charge.
Social Networks and Data Mining: Where it is and Where it's Going
Telecoms operators naturally prize mobile-phone subscribers who spend a lot, but some thriftier customers, it turns out, are actually more valuable. Known as “influencers”, these subscribers frequently persuade their friends, family and colleagues to follow them when they switch to a rival operator. The trick, then, is to identify such trendsetting subscribers and keep them on board with special discounts and promotions. People at the top of the office or social pecking order often receive quick callbacks, do not worry about calling other people late at night and tend to get more calls at times when social events are most often organised, such as Friday afternoons. Influential customers also reveal their clout by making long calls, while the calls they receive are generally short. Companies can spot these influencers, and work out all sorts of other things about their customers, by crunching vast quantities of calling data with sophisticated “network analysis” software. Instead of looking at the call records of a single customer at a time, it looks at customers within the context of their social network.
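A toy sketch of one signal described above: influential subscribers tend to make long calls while the calls they receive are short. The record layout `(caller, callee, seconds)`, the subscriber names, and the scoring ratio are all hypothetical choices made for illustration. This is not the operators' actual "network analysis" software, which the text does not describe in detail.

```python
from collections import defaultdict
from statistics import mean


def influence_scores(calls):
    """Score each subscriber by mean outgoing / mean incoming call length.

    `calls` is an iterable of (caller, callee, seconds) tuples with
    positive durations. Only subscribers who both made and received
    at least one call get a score.
    """
    out_durations = defaultdict(list)
    in_durations = defaultdict(list)
    for caller, callee, seconds in calls:
        out_durations[caller].append(seconds)
        in_durations[callee].append(seconds)

    scores = {}
    for person in out_durations.keys() & in_durations.keys():
        scores[person] = mean(out_durations[person]) / mean(in_durations[person])
    return scores


# Hypothetical call log: "ana" makes long calls and receives short ones.
calls = [
    ("ana", "ben", 600),
    ("ana", "cat", 900),
    ("ben", "ana", 60),
    ("cat", "ana", 120),
    ("ben", "cat", 200),
    ("cat", "ben", 200),
]

scores = influence_scores(calls)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A real system would combine this with the other cues mentioned (callback speed, late-night calling, Friday-afternoon volume) and with each subscriber's position in the call graph, rather than relying on a single ratio.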
According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Data data everywhere and possibly too much to drink?
How (not) to write an online-dating message, based on a sample of 500,000 "first contact" messages. [more inside]
The National Security Agency is building a data center in San Antonio that’s the size of the Alamodome. Microsoft has opened an 11-acre data center a few miles away. Coincidence? Not according to author James Bamford, who probably knows more about the NSA than any outsider. Bamford's new book reports that the biggest U.S. spy agency wanted assurances that Microsoft would be in San Antonio before it moved ahead with the Texas Cryptology Center. Bamford notes that under current law, the NSA could legally tap into Microsoft’s data without a court order. Whatever you do, don't take pictures of the spy building unless you want to be taken in for questioning.
Worried about social-network data mining? Facebook hires Ted Ullyot, former right-hand man to former Attorney General Alberto Gonzales, as its general counsel. Tapping Ullyot, who worked on the infamous torture memo and other illustrious projects, is a sign that the burgeoning Scrabble platform "is a little more grown-up," says Facebook public-policy VP Elliot Schrage.
The idea was that a spike in, say, falafel sales, combined with other data, would lead to Iranian secret agents in the south San Francisco-San Jose area. I've read this article twice now because I was laughing too hard the first time. If I were more paranoid I might actually seriously ask what sort of data mining the FBI is doing, but... falafel sales! via. [more inside]
Arguing Against Datamining MySpace in search of Pedophiles. In certain circles, MySpace has become the villain du jour for all sorts of debauchery (threatening the President, phishing, dismembered women, etc.), as well as being a fertile hunting ground for the pedophile. Given the huge size of MySpace, reported as 100 million accounts (although estimates of active accounts are far lower, at approximately 43 million), and a hypothetical and absurdly low natural incidence of pedophiles and pederasts (let's say just 1%), one could assume that there could be as many as 430,000 to 1,000,000 of them out there. Wired contributor and reformed hacker Kevin Poulsen has developed a script to weed out the bad seeds [via]. His script was effective, although it took several months of sifting and refining, as well as numerous false positives - 744 registered sex offenders, 497 with convictions for crimes against children. While such an experiment has merit, how much time, resources, and law enforcement manpower will be wasted chasing down the "high-cost" false positives, and what will be neglected and sacrificed for that effort?
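The arithmetic behind the post's range, plus a rough illustration of why false positives become costly at this scale. The account totals and the 1% incidence are the post's own figures. The 0.1% false-positive rate is a purely hypothetical assumption added here and does not describe the real script.

```python
def expected_count(accounts: int, incidence: float) -> int:
    """Expected number of matching accounts at a given incidence."""
    return round(accounts * incidence)


def false_positives(accounts: int, incidence: float, fp_rate: float) -> int:
    """Non-matching accounts that a screen with `fp_rate` would still flag."""
    innocent = accounts - expected_count(accounts, incidence)
    return round(innocent * fp_rate)


ACTIVE_ACCOUNTS = 43_000_000      # estimate of active accounts, from the post
REPORTED_ACCOUNTS = 100_000_000   # reported total accounts, from the post
INCIDENCE = 0.01                  # the post's hypothetical 1%

low = expected_count(ACTIVE_ACCOUNTS, INCIDENCE)
high = expected_count(REPORTED_ACCOUNTS, INCIDENCE)
print(low, high)  # 430000 1000000

# Hypothetical: a screen that wrongly flags 0.1% of innocent accounts.
flagged_in_error = false_positives(ACTIVE_ACCOUNTS, INCIDENCE, 0.001)
print(flagged_in_error)  # 42570
```

Under that assumed error rate, tens of thousands of innocent accounts would need manual review, which is the cost the closing question is asking about.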
AOL releases 3-months of queries from 500k users. AOL, either fairly or unfairly, is sometimes considered the internet with training wheels. So while parsing this data, keep that in mind. Some of these queries seem like spam email subjects, don't they? Don't forget, this is the same demographic that brought you the September that didn't end. AOL tried to retract the data, but it's of no use - it's out there, on the web.
The Secret History of Able Danger. The WP may have the goods on Able Danger. The Pentagon and Intel officials are mum on the data mining project because it could have been illegal.
Exploring enron -- A breathtaking web of conspiratorial email messages. How often did Jeff Skilling email Ken Lay? How often were those emails about company business? Internal alliances? The company's allegiance? The California energy crisis? Who else was talking about it? Who wasn't? Temptingly complete with software download and MySQL tables for your own tinfoil hat explorations.
Docusearch settles claim for 75K with family whose daughter was killed by a stalker who purchased her personal information from them -- a killer whose intentions were described on a Googleable website. The NH Supreme Court determined last year that Docusearch, the company that sold Amy Boyer's work address and SSN to her killer, could be held liable for her death, even though some of that information was publicly available. An "Amy Boyer's Law" intended to increase privacy by restricting the display, sale or use of SSNs received negative reviews from privacy organizations and ultimately was removed from an appropriations bill. In a statement, Amy's parents encourage others to use the Internet to keep track of who may be keeping track of their kids. "If only we had typed our daughter's name into any search engine, the Amy Boyer Web site that was posted by her killer would have come up, and we could have called the police...This may never have happened."
That U.S. intelligence agencies confuse terrorists with children on passenger jets is a reminder that data collection is easy, but data analysis is hard. That must be why the six-year-old daughter of one of Boing Boing's co-founders is on the CAPPS list as a security risk. All this is also a reminder that we need privacy safeguards for these data mining programs.
The Patriots didn't win; Britney did. TiVo analyzed its viewers' behavior during the Super Bowl and came up with some pretty interesting results. How soon till TV programming adapts to viewer behavior?