Despite the comment-collecting engine crashing on the last day to submit comments on the very popular topic of Network Neutrality, the system worked well enough to collect 1.1 million comments, which the FCC has made available to the general public as six XML files, totaling over 1.4 gigs of raw data. Mailed comments postmarked prior to July 18 are still being scanned and entered, so this isn't everything, but it's a lot of data. TechCrunch graphed the frequency of certain words, with the high score going to Comcast, at 4,613 mentions. NPR shared the visualized results of Quid's analysis of a sample of 250,000 comments, and of Quid's analysis of a sample of 317,000 comments, which mapped the geographic sources of the public comments and adjusted them for state population to depict which states care more about net neutrality, while The Verge dug deeper, mapping comments by zip code.
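For anyone who wants to poke at the comment files themselves, here is a minimal sketch of the two kinds of analysis mentioned above: counting mentions of a term, and adjusting raw comment totals for population. It is not the method TechCrunch, Quid, or The Verge used. The function names, the sample comments, and the state figures are all made up for illustration.

```python
import re
from collections import Counter


def term_counts(comments, terms):
    """Count case-insensitive whole-word mentions of each term across comments."""
    counts = Counter()
    for text in comments:
        # Split each comment into lowercase word tokens.
        words = Counter(re.findall(r"[a-z0-9']+", text.lower()))
        for term in terms:
            counts[term] += words[term.lower()]
    return counts


def per_100k(comment_totals, populations):
    """Comments per 100,000 residents, so small and large states compare fairly."""
    return {
        state: comment_totals[state] * 100_000 / populations[state]
        for state in comment_totals
    }


# Hypothetical sample data, not taken from the FCC files.
sample_comments = [
    "Comcast should not get a fast lane.",
    "Please keep the internet open. Comcast and Verizon already charge enough.",
    "Net neutrality matters to small businesses.",
]
mentions = term_counts(sample_comments, ["Comcast", "Verizon"])

# Two hypothetical states with equal comment totals but different populations.
rates = per_100k({"A": 500, "B": 500}, {"A": 1_000_000, "B": 5_000_000})

print(mentions)
print(rates)
```

The per-capita step is the reason a small state can rank above a large one even when it sent fewer comments in total.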
Burritos provide a good way to experiment precisely because they represent a relatively narrow range of experience. There are different burrito styles across the country — more than you might gather if your burrito-eating ambitions have never ventured beyond Taco Bell. But there are fewer parameters to control for when rating burritos than when comparing movies, or doctors, or colleges.
Nathaniel Read ("Nate") Silver is launching a national, 64-restaurant Burrito Bracket.
From a small town in Romania, Guccifer skewered and glorified the power elite.
If Snowden perfectly fit the profile of geek crusader, Lehel, a stone-faced, disheveled man in a tight leather jacket, seemed an odd candidate for one of the world’s most notorious hackers. But Guccifer is to hacking what the Beatles are to rock and roll. He had predecessors, 4Chan cowboys like Anonymous and Sabu of LulzSec, but he’s changed the nature of hacking fame. Guccifer rose by exploiting the connections people make online to infiltrate the private lives of some of the most powerful people on Earth. He served up the results to the media, irresistible high-low raw material for an online news cycle driven by leaks and voyeurism and racked by anxiety over privacy.
What Is A Guccifer? [more inside]
is an application of BioGraph-based data mining to music, which helps you get recommendations for other musicians. Based on 140K user-defined tags from last.fm that are collected for over 400K artists, results are sorted by the "nearest" or most probable matches for your artist of interest (algorithm described here). [more inside]
The ACLU reports that the IRS claims in an internal document that it has the authority to access citizens' online communications without a warrant.
The IRS claimed in a 2009 document that "the Fourth Amendment does not protect communications held in electronic storage, such as email messages stored on a server, because internet users do not have a reasonable expectation of privacy in such communications." It still retains that position even after the 2010 case of US v. Warshak, which determined that citizens have a reasonable expectation of privacy in such communications. [more inside]
Why Privacy Matters, Even If You Have Nothing To Hide, by Daniel J. Solove
The nothing-to-hide argument pervades discussions about privacy. The data-security expert Bruce Schneier calls it the "most common retort against privacy advocates." ... To evaluate the nothing-to-hide argument, we should begin by looking at how its adherents understand privacy. Nearly every law or policy involving privacy depends upon a particular understanding of what privacy is. The way problems are conceived has a tremendous impact on the legal and policy solutions used to solve them. [more inside]
Schools in Missouri and other states are using fingerprint scans and RFID chips to speed up service in the cafeteria and to track student whereabouts in and around school. [more inside]
The Justice Department, after a legal battle with the ACLU to avoid having to admit it, recently released documents showing that the federal government’s use of warrantless “pen register” and “tap and trace” surveillance has multiplied over the past decade. But the Justice Department is small potatoes. Every day, the NSA intercepts and stores 1.7 billion emails, phone calls, texts, and other electronic communications. [more inside]
Supermarkets are attempting to customize prices for different shoppers. At a Safeway in Denver, a 24-pack of Refreshe bottled water costs $2.71 for Jennie Sanford, a project manager. For Emily Vanek, a blogger, the price is $3.69. [more inside]
The Touch-point Collective: Crowd Contouring on the Casino Floor -
'Historically, casinos have been eager adopters of technologies that help them to gather knowledge about their customers. The knowledge-gathering repertoire of the modern casino has shifted from telephone surveys, focus groups, and rudimentary datasets to complex feats of reconnaissance and analysis enabled by player tracking systems, data visualization tools, and behavioral intelligence software suites. Many surveillance techniques first applied in casinos were only later adapted to other domains—airports, financial trading floors, shopping malls, banks, and government agencies.' There are some large, embedded .avi files in the page, be careful. [more inside]
Big Data On Campus (NYTimes) “We don’t want to turn into just eHarmony,” says Michael Zimmer, assistant professor in the School of Information Studies at the University of Wisconsin, Milwaukee, where he studies ethical dimensions of new technology. “I’m worried that we’re taking both the richness and the serendipitous aspect of courses and professors and majors — and all the things that are supposed to be university life — and instead translating it into 18 variables that spit out, ‘This is your best fit. So go over here.’ ”
In 2011 Malaysia Airlines introduced what is believed to be the world's first airline integration with Facebook.
In February Air France KLM announced its Meet And Seat program, allowing customers to scan other passengers' social media profiles to select or reject seatmates. (Previously)
It prompted safety and privacy concerns, while others said it showed how a company "gets" social media.
In June airBaltic announced it would trial SeatBuddy to make trips more pleasant by seating like-minded people next to each other.
Now, British Airways has decided to use the Internet to create dossiers on its customers, including using Google images to find pictures of passengers so that staff can approach them as they arrive at the terminal or plane. The Know Me service will initially be limited to first class passengers and other 'captains of industry'.
So-called 'social seating' is part of an emerging trend to marry data-mining with customer service.
In The Geographic Flow of Music, researchers Conrad Lee and Pádraig Cunningham propose a method to use data from the last.fm API to track the world's listening habits by location and time, showing where shifts in musical tastes have originated and subsequently migrated. Results show music trends originating in smaller cities and flowing outward in unexpected ways, contradicting some assumptions in social science about larger cities being more efficient engines of (cultural) invention.
A column by John Brownlee over at Cult of Mac yesterday highlighted his privacy concerns about the app Girls Around Me -- which used a mashup of Foursquare check-ins, Google Maps and Facebook public profile information to show the user women who were nearby. In response to the story, Foursquare cut off the app's API access to their data, effectively knocking it out of commission. CNET: How to prevent friends checking you into locations at Facebook Places. [more inside]
With a “chief scientist” specializing in consumer behavior, an “analytics department” monitoring voter trends, and a squad of dozens huddled at computer screens editing video or writing code, the sprawling office complex inside One Prudential Plaza looks like a corporate research and development lab — Ping-Pong table and all. But it is home to the largely secret engine of President Obama’s re-election campaign, where scores of political strategists, data analysts, corporate marketers and Web producers are sifting through information gleaned from Facebook, voter logs and hundreds of thousands of telephone or in-person conversations to reassemble and re-energize the scattered coalition of supporters who swept Mr. Obama into the White House four years ago.
--- Othello is a Shakespearean tragedy: when the hero makes a terrible mistake of judgment, his once promising world is led into ruin. Computer analysis of the play, however, suggests that the play is a comedy or, at least, that it does the same things with words that comedies usually do.
On October 26, 2011, Folger Shakespeare Library Director Michael Witmore discussed his recent work in Shakespeare studies, which combines computer analysis of texts, linguistics, and traditional literary history. Taking the case of Shakespeare's genres as a starting point, Witmore shows how subtle human judgments about the kinds of plays Shakespeare wrote — were they comedies? — are connected to frequent, widely distributed features in the playwright's syntax, vocabulary, and diction. (approx. 30 minute lecture.) [more inside]
Oren Etzioni is a renowned data mining expert who sold Farecast, his airline-ticket price predictor, to Microsoft for $115 million. Now he's turned his focus to the general problem of finding when the best shopping bargains occur. Punch in a consumer electronics item and his website will tell you whether to buy now or to wait. Over time he'll be adding more product categories. In any case, he can tell you right now the best prices for most things aren't on Black Friday or Cyber Monday.
YouTube Insult Generator.
Enter a keyword or phrase, and the Insult Generator will trawl YouTube for relevant videos, and pull insults from those videos. Wired write-up
By processing a million songs in twenty minutes
, and using the Stairway detector
that Slow Build "more" (up to 29) than Stairway to Heaven (which gets only a 9). [via] [more inside]
It’s for your own good—that is Google’s cherished belief. If we want the best possible search results, and if we want advertisements suited to our needs and desires, we must let them into our souls. James Gleick writes about 'How Google Dominates Us' for the New York Review of Books. [more inside]
On not reading books. Franco Moretti, author of the controversial Graphs, Maps, Trees: Abstract Models for a Literary History, proposes that literary study needs to abandon "close reading" for "distant reading": "understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data." He is co-founder of the Stanford Literary Lab, where he and like-minded colleagues have published studies on programming computers to use statistical analysis to identify a novel's genre (PDF) and analyzing plots as networks (PDF). Similar projects are on the way.
"The results were astounding. In a six-month period — from Aug 31, 2009, to Feb. 28, 2010, Deutsche Telekom had recorded and saved his longitude and latitude coordinates more than 35,000 times. It traced him from a train on the way to Erlangen at the start through to that last night, when he was home in Berlin. Mr. Spitz has provided a rare glimpse — an unprecedented one, privacy experts say — of what is being collected as we walk around with our phones
"In many places the concentration [of convicted residents] is so dense that states are spending in excess of a million dollars a year to incarcerate the residents of single city blocks."
Using rarely accessible data from the criminal justice system, the Spatial Information Design Lab and the Justice Mapping Center have created maps of these “million dollar blocks” and of the city-prison-city-prison migration flow for five of the nation’s cities. The maps suggest that the criminal justice system has become the predominant government institution in these communities and that public investment in this system has resulted in significant costs to other elements of our civic infrastructure — education, housing, health, and family. Prisons and jails form the distant exostructure of many American cities today.
See the several linked PDFs.
Kaggle hosts competitions to glean information from massive data sets, a la the Netflix Prize. Competitors can enter free, while companies with vast stores of impenetrable data pay Kaggle to outsource their difficulties to the world population of freelance data-miners. Kaggle contestants have already developed dozens of chess rating systems which outperform the Elo rating currently in use, and identified genetic markers in HIV associated with a rise in viral load. Right now, you can compete to forecast tourism statistics or predict unknown edges in a social network. Teachers who want to pit their students against each other can host a Kaggle contest free of charge.
Social Networks and Data Mining: Where it is and Where it's Going
Telecoms operators naturally prize mobile-phone subscribers who spend a lot, but some thriftier customers, it turns out, are actually more valuable. Known as “influencers”, these subscribers frequently persuade their friends, family and colleagues to follow them when they switch to a rival operator. The trick, then, is to identify such trendsetting subscribers and keep them on board with special discounts and promotions. People at the top of the office or social pecking order often receive quick callbacks, do not worry about calling other people late at night and tend to get more calls at times when social events are most often organised, such as Friday afternoons. Influential customers also reveal their clout by making long calls, while the calls they receive are generally short. Companies can spot these influencers, and work out all sorts of other things about their customers, by crunching vast quantities of calling data with sophisticated “network analysis” software. Instead of looking at the call records of a single customer at a time, it looks at customers within the context of their social network.
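One of the signals described above (long calls made, short calls received) is easy to sketch on a toy call log. This is a rough illustration only, not what any operator's "network analysis" software actually does: the record layout `(caller, callee, seconds)`, the function name, and the sample subscribers are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def call_length_ratio(calls):
    """Ratio of mean outgoing to mean incoming call length per subscriber.

    `calls` is an iterable of (caller, callee, seconds) tuples, a hypothetical
    record layout. Subscribers lacking either outgoing or incoming calls are skipped.
    """
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for caller, callee, seconds in calls:
        outgoing[caller].append(seconds)
        incoming[callee].append(seconds)
    return {
        person: mean(outgoing[person]) / mean(incoming[person])
        for person in outgoing.keys() & incoming.keys()
    }


# Hypothetical call log: ana makes long calls and receives short ones.
call_log = [
    ("ana", "bo", 600),
    ("ana", "cy", 900),
    ("bo", "ana", 60),
    ("cy", "ana", 120),
    ("bo", "cy", 200),
]
ratios = call_length_ratio(call_log)
likely_influencer = max(ratios, key=ratios.get)
print(likely_influencer, ratios)
```

A ratio well above 1 matches the pattern the article attributes to influencers. A real system would presumably combine it with the other cues mentioned, such as callback speed and call timing, across the whole social graph.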
According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Data data everywhere
and possibly too much to drink
How (not) to write an online-dating message, based on a sample of 500,000 "first contact" messages. [more inside]
The National Security Agency is building a data center in San Antonio that’s the size of the Alamodome. Microsoft has opened an 11-acre data center a few miles away. Coincidence? Not according to author James Bamford, who probably knows more about the NSA than any outsider. Bamford's new book reports that the biggest U.S. spy agency wanted assurances that Microsoft would be in San Antonio before it moved ahead with the Texas Cryptology Center. Bamford notes that under current law, the NSA could legally tap into Microsoft’s data without a court order. Whatever you do, don't take pictures of the spy building unless you want to be taken in for questioning.
Worried about social-network data mining? Facebook hires Ted Ullyot, former right-hand man to former Attorney General Alberto Gonzales, as its general counsel. Tapping Ullyot, who worked on the infamous torture memo and other illustrious projects, is a sign that the burgeoning Scrabble platform "is a little more grown-up," says Facebook public-policy VP Elliot Schrage.
is a way-cool Firefox extension that automagically summarises Amazon product reviews.
The idea was that a spike in, say, falafel sales, combined with other data, would lead to Iranian secret agents in the south San Francisco-San Jose area.
I've read this article twice now because I was laughing too hard the first time. If I were more paranoid I might actually seriously ask what sort of data mining the FBI is doing, but... falafel sales! [more inside]
Arguing Against Datamining MySpace in search of Pedophiles.
In certain circles, MySpace
has become the villain du jour
for all sorts of debauchery
, etc.), as well as being fertile hunting grounds for the
pedophile. Given the
size of MySpace, reported as 100 million accounts (estimates
of active accounts are far lower, at approximately 43 million), and a
hypothetical and absurdly low natural incidence of pedophiles and pederasts
(let's say just 1%), one could assume that there could be as many as 430,000
to 1,000,000 of them out there. Wired
contributor and reformed hacker Kevin Poulsen has developed a script to weed
out the bad seeds.
His script was effective, although it took several months of sifting and
refining, as well as numerous false positives - 744 registered sex offenders,
497 with convictions for crimes against children. While such an
experiment has merit, how much time, resources, and law enforcement manpower
will be wasted chasing down the
, and what will be neglected and sacrificed for that
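The back-of-the-envelope numbers in this post are easy to check. The sketch below just redoes the arithmetic: the account totals, the 1% incidence, and the 744 matches are the post's own figures, the 1% is the post's stated hypothetical rather than a measured rate, and the function and variable names are made up for illustration.

```python
def expected_accounts(total_accounts, incidence_percent):
    """Accounts expected at a whole-number percentage incidence (integer math)."""
    return total_accounts * incidence_percent // 100


ACTIVE_ACCOUNTS = 43_000_000      # lower estimate of active accounts, from the post
REPORTED_ACCOUNTS = 100_000_000   # reported total accounts, from the post
ASSUMED_INCIDENCE_PERCENT = 1     # the post's hypothetical rate, not a measurement
SCRIPT_MATCHES = 744              # registered sex offenders the script turned up

low = expected_accounts(ACTIVE_ACCOUNTS, ASSUMED_INCIDENCE_PERCENT)
high = expected_accounts(REPORTED_ACCOUNTS, ASSUMED_INCIDENCE_PERCENT)

# Share of the low-end hypothetical population that the script's matches represent.
share_found = SCRIPT_MATCHES / low

print(f"Hypothetical range: {low:,} to {high:,}")
print(f"Script matches as a share of the low estimate: {share_found:.2%}")
```

Put that way, months of sifting surfaced well under one percent of even the low-end hypothetical figure, which is the gap the post is pointing at.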
AOL releases three months of queries from 500k users.
AOL, either fairly or unfairly, is sometimes considered the internet with training wheels. So while parsing this data, keep that in mind. Some of these queries seem like spam email subjects, don't they? Don't forget, this is the same demographic that brought you the September that didn't end. AOL tried to retract the data, but it's of no use - it's out there, on the web.
The Secret History of Able Danger
may have the goods on Able Danger. The Pentagon and Intel officials are mum on the data mining project because it could have been illegal.
-- A breathtaking web of conspiratorial email messages. How often did Jeff Skilling email Ken Lay? How often were those emails about company business? Internal alliances? The company's allegiance? The California energy crisis? Who else was talking about it? Who wasn't?
Temptingly complete with software download and MySQL tables for your own tinfoil hat explorations.
Docusearch settles claim for 75K with family whose daughter was killed by a stalker who obtained her personal information from them -- a killer whose intentions were described on a Googleable website. The NH Supreme Court determined last year that the company that sold Amy Boyer's work address and SSN to her killer could be held liable for her death, even though some of that information was publicly available. An "Amy Boyer's Law" intended to increase privacy by restricting the display, sale or use of SSNs received negative reviews from privacy organizations and ultimately was removed from an appropriations bill. In a statement, Amy's parents encourage others to use the Internet to keep track of who may be keeping track of their kids. "If only we had typed our daughter's name into any search engine, the Amy Boyer Web site that was posted by her killer would have come up, and we could have called the police...This may never have happened."
That U.S. intelligence agencies confuse terrorists with children on passenger jets is a reminder that data collection is easy, but data analysis is hard. That must be why the six-year-old daughter of one of Boing Boing's co-founders is on the CAPPS list as a security risk. All this is also a reminder that we need privacy safeguards for these data mining programs.
The Patriots didn't win; Britney did.
TiVo analyzed its viewers' behavior during the Super Bowl and came up with some pretty interesting results. How soon till TV programming adapts to viewer behavior?