Wrangler
February 4, 2011 8:18 AM

Stanford's Visualization Group has produced a data cleanup web app called Wrangler that works like straight up magic.
posted by chunking express (32 comments total) 109 users marked this as a favorite
 
The collective hours I spent cleaning up data for my lab reports back in college are now crying. THANKS A LOT, STANFORD.
posted by phunniemee at 8:24 AM on February 4, 2011 [2 favorites]


Finally, someone invented awk, sed, sort and the rest!
posted by DU at 8:26 AM on February 4, 2011 [10 favorites]


I'M A REAL DJ DON'T NEED NO TRANSFORM SCRIPT
posted by mkb at 8:31 AM on February 4, 2011 [3 favorites]


the insightful part is that this is a script generation tool - the ability to drop to script to tweak the action that the visuals don't cover makes it actually useful. well done.
posted by victors at 8:38 AM on February 4, 2011 [1 favorite]


Finally, someone invented awk, sed, sort and the rest!

Indeed. If only there was some language...a practical language. For extracting, reporting and listing. That'd be seriously useful.

(Still, it's a nifty as hell demo. The generated script at the very end is pretty cool stuff).
posted by jquinby at 8:40 AM on February 4, 2011 [7 favorites]


Google Refine (formerly Freebase GridWorks) is a similar tool.

And don't be snarky: these kinds of tools are useful even if you know awk/sed/sort/perl/ruby etc.
posted by willF at 8:45 AM on February 4, 2011 [13 favorites]


Damn, Stanford. Way to go!
posted by blucevalo at 8:46 AM on February 4, 2011


willF, to be sure, you're correct. The learning curve for the shell text-utils can be steep. This is Good Stuff.
posted by jquinby at 8:51 AM on February 4, 2011 [2 favorites]


Yes, we should only allow programmers and those who know already how to use shell utilities to work with data! This will ensure that programmers can stay grumpy doing simple tasks for people who may have interesting insights based on their substantive knowledge of a task.
posted by proj at 9:07 AM on February 4, 2011 [25 favorites]


The learning curve for the shell text-utils can be steep.

No kidding. Many, many people who do datacrunching are visual thinkers. They don't want to spend a year or more becoming comfortable with a bunch of idiosyncratic interfaces which are at a remove from their data. This isn't true for everyone---there is a significant minority of scientific programmers---but the majority don't want to have anything to do with programming.

Excel, for all its warts, inaccuracies and limitations, continues to be the primary tool for data crunching because its basic organizational structure is visual and immediate. I've known many people who would prefer to write elaborate Excel macros rather than learn perl/python/matlab/R and solve their problem "properly". They don't want to give up that friendly interface.

Data formatting is one of the most unpleasant joe-jobs in science or other research. This group has come up with a tool which allows those visual thinkers who love Excel so much to have it both ways. Wrangler uses the immediate, visual interface of Excel to "record" into a scripting language.

They haven't reinvented awk, sed, sort and the rest; they've made them useful for the majority of analysts who find procedural, descriptive programming impossible.
posted by bonehead at 9:08 AM on February 4, 2011 [13 favorites]


I do most of my data cleanup in vim using the macro and regex capabilities. That's always been efficient enough (and I deal with sloppy data seldom enough) that I've been too lazy to learn sed/awk/etc. Plus there's the undo command for when you mess up....

But this is faster, and it automagically does stuff I'd have to script. I like.
posted by richyoung at 9:12 AM on February 4, 2011


Even if you know text processing languages, this allows for emergent analysis in a way that those don't. You have to be deliberate with them; you have to know what you're going to do, and then make it happen. This allows you to "play with" the data in a way that programming languages don't, and is therefore valuable.
posted by sonic meat machine at 9:39 AM on February 4, 2011 [2 favorites]


I do my data cleanup in MS Paint.

Amateurs.
posted by zippy at 9:42 AM on February 4, 2011 [8 favorites]


Good heavens, I've been writing sed/awk/perl/python scripts for 20 years to do this kind of thing. And I'd absolutely kill for an interactive visual tool like this to do the work instead. Working with free text is a huge pain in the ass. Scripting is tedious, error prone, and always involves some by-hand cleanup to handle the special cases you didn't feel like typing four extra lines of code to solve. Doing this kind of task in a GUI with live preview is amazing.

At one time I knew Wordstar formatting codes, too. Who needs a WYSIWYG editor?

BTW, the Stanford Visualization Group is also behind Protovis, a really fantastic JavaScript library for data visualization.
posted by Nelson at 9:42 AM on February 4, 2011 [4 favorites]


Yeah I've been doing awk/sed so long, but it's a battle to get the esoteric syntax right and it quickly becomes a big job if the data is complex.
posted by stbalbach at 9:46 AM on February 4, 2011


In much the same way as a clever, unexpected solution to a problem is called a "hack," cleverly-made tools like this ought to be referred to as "spells".

It's magic because it's awesome.
posted by LogicalDash at 9:48 AM on February 4, 2011 [4 favorites]


I think this is awesome. Most people (from the receptionist hired by a temp agency to mid-level managers) aren't going to learn awk/sed (whatever those are - seriously, I don't even know.) but they will be given a crappy Word doc of data that their boss wants turned into a pie chart! And this looks easy to use/learn.
posted by vespabelle at 10:10 AM on February 4, 2011


The following line is provided for your "fixed that for you" needs:

I understand that smart people go to Stanford.

(One entry per household. Void where prohibited.)
posted by spock at 10:12 AM on February 4, 2011


Why oh why is it a web browser tool that they host? That means I can't use it on confidential data, boo.
posted by mkb at 10:27 AM on February 4, 2011 [1 favorite]


It's a locally-run server, mkb. The data never leaves your computer.
posted by Xoder at 10:43 AM on February 4, 2011


Actually, I take that back! I thought so by watching the demo with the classic http://localhost/ but it was just the demo... no download link anywhere.
posted by Xoder at 11:41 AM on February 4, 2011


I think this is awesome. Most people (from the receptionist hired by a temp agency to mid-level managers) aren't going to learn awk/sed (whatever those are - seriously, I don't even know.

They are very similar open source command-line text manipulation utilities that have about 90% syntax overlap and 10% critical fuck you up differences, which means that only people who actively maintain the syntax in their working memory find them useful for quick data analysis.

I've used Google Refine, and for getting a quick handle on your data - particularly the feature that gives grouping totals for similar open-ended text entries - it is pure gold. It can optionally steamroll right over spelling mistakes, regional differences, umlauts, etc. in precisely the ways I want and need. Doing that in sed or awk was a nightmare that I have lived. The joke about thinking that the solution to your problem was a regular expression, meaning you now had two problems, underestimated the complications of shell and command-line utility syntax agony by at least a factor of 3.

If you have to deal with messy user-entered data, do yourself a solid and check out Google Refine. It is worth the time you will invest figuring it out (maybe 20 minutes), even if you are going to then just dump it into Excel. Trust me. It is like a drive-through car wash for your data.
posted by srboisvert at 12:05 PM on February 4, 2011 [3 favorites]
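
For anyone curious what "grouping totals for similar text entries" can look like in code, here is a minimal Python sketch of a key-collision approach. It is an illustration only, not Google Refine's actual implementation: the `fingerprint` and `cluster` functions and the sample entries are made up for this example. It handles case, stray whitespace, punctuation, word order and accents such as umlauts, but it does not catch real spelling mistakes.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Hypothetical normalizer: reduce an entry to a comparison key."""
    # Decompose accented letters, then drop the combining marks (ü -> u).
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation, then sort the unique tokens so that
    # word order does not matter.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group raw entries that share the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)

# Made-up sample data for illustration.
entries = ["Zürich", "zurich", " Zurich ", "New York", "york, new", "Boston"]
clusters = cluster(entries)
totals = {key: len(members) for key, members in clusters.items()}
```

Here `totals` gives one count per cluster, which is the kind of summary the comment above describes.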


MKB > Think the solution would be to create dummy datasets...do the transformations you need and then use the JavaScript that Wrangler gives you at the end?
posted by lslelel at 12:15 PM on February 4, 2011


I haven't checked this out fully yet.. but is it like dabbledb.com (now acquired and killed off by Twitter, I think)...?
posted by mhh5 at 12:16 PM on February 4, 2011


Finally, someone invented awk, sed, sort and the rest!
posted by DU at 8:26 AM on February 4


I was writing awk scripts when a lot of people here were still shitt'n yellow, and I can't wait to try it (thinking back in horror about trying to extract atom coordinates from an X-ray crystal structure file - space delimited until a coordinate is both negative and in the hundreds, at which point it runs into the column before it).
posted by 445supermag at 12:48 PM on February 4, 2011
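
The column problem described above can be sketched in a few lines of Python. This assumes the file follows a PDB-style fixed-width layout, with x, y and z in columns 31-38, 39-46 and 47-54; the comment does not name the format, so that is a guess. The `make_atom_line` helper is hypothetical and only builds a simplified record for the demonstration.

```python
# Sketch only: assumes PDB-style fixed-width ATOM records, where x, y and z
# occupy columns 31-38, 39-46 and 47-54 (1-indexed), each written as %8.3f.

def make_atom_line(x, y, z):
    """Hypothetical helper that builds a simplified ATOM-style record."""
    prefix = "ATOM      1  CA  ALA A   1".ljust(30)  # pad to 30 characters
    return f"{prefix}{x:8.3f}{y:8.3f}{z:8.3f}"

def parse_coords(line):
    """Read coordinates by column position instead of splitting on spaces."""
    return (float(line[30:38]), float(line[38:46]), float(line[46:54]))

line = make_atom_line(12.345, -101.25, -123.456)

# Whitespace splitting sees one fused token where three numbers should be.
fused = line.split()[-1]

# Fixed-width slicing still recovers all three values.
coords = parse_coords(line)
```

A negative value in the hundreds fills its whole 8-character field, so no space is left between it and its neighbor. Slicing by position sidesteps that, which a default whitespace split in awk cannot do.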


The dabbledb team is at Twitter, yes. I think Twitter was interested in their site analytics tool more than their super-cool database.
posted by zippy at 1:11 PM on February 4, 2011


Finally, someone invented awk, sed, sort and the rest!

My horse buggy gets 48 hogsheads to the yard, but maybe that car over there is an easier form of transportation.
posted by Blazecock Pileon at 1:17 PM on February 4, 2011 [4 favorites]


Yeah, um sorry - this is not awk, sed, and sort (or in my case
DATA {OMG!};
INPUT {GAH!};
LABEL {WTF!};
PROC FORMAT {CHUCKLEHEAD!};
RUN;
). If every dataset that I worked with was 75 lines, awk, sed and sort would be valid and easy to use. Difficult formatted data from the US Census or IPUMS routinely drives me crazy, requiring at least three handstands, a skippty jump, and some jazz hands before it is useful. If this is truly as easy to use as it seemed to indicate, and - if it can scale up to a good sized data set - well then - how much $? I have a purchase order in hand - because, this is awesome.
posted by Nanukthedog at 2:24 PM on February 4, 2011 [4 favorites]


I would love to play with this, but can't seem to get it to work in Firefox :( Anyone else have this problem and find a solution?
posted by batou_ at 3:54 PM on February 4, 2011


Metafilter: three handstands, a skippty jump, and some jazz hands.
posted by jquinby at 4:53 PM on February 4, 2011 [2 favorites]


If it can do decent date recognition and transforms, I'm in.
posted by underflow at 8:19 PM on February 4, 2011


I like it, but I've discovered a bug already: Wrangler can't recognize letters with unusual accent marks, such as Ł or ą, as letters, and therefore it doesn't see them as contiguous with the rest of the word they're a part of.
posted by Asparagirl at 1:26 AM on February 5, 2011
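
One plausible cause of a bug like this is a word pattern that only knows the ASCII letters. That is a guess, not something the thread confirms about Wrangler. The Python sketch below contrasts an ASCII-only letter class with a Unicode-aware one; the pattern names and the sample phrase are made up for illustration.

```python
import re
import unicodedata

# ASCII-only letters: anything outside A-Z and a-z ends the match.
ASCII_WORD = re.compile(r"[A-Za-z]+")
# Unicode-aware letters: word characters minus digits and underscore.
UNICODE_WORD = re.compile(r"[^\W\d_]+")

def words(text, pattern):
    """Return the runs of letters found by the given pattern."""
    # NFC keeps each accented letter as one precomposed character.
    return pattern.findall(unicodedata.normalize("NFC", text))

sample = "Mała Łąka"
ascii_words = words(sample, ASCII_WORD)      # words get chopped into pieces
unicode_words = words(sample, UNICODE_WORD)  # words stay whole
```

With the ASCII class each accented letter acts as a word break, which matches the symptom described above.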



