Coding for Journalists 101
August 29, 2014 5:29 AM

So a little while ago, I set out to write some tutorials that would guide the non-coding-but-computer-savvy journalist through enough programming fundamentals so that he/she could write a web scraper to collect data from public websites.
posted by postcommunism (40 comments total) 138 users marked this as a favorite
 
(Forgot to add: via AskMe)
posted by postcommunism at 5:33 AM on August 29, 2014 [24 favorites]




That's rather nicely done; thank you. I do some details differently, and it could use a few updates as tools have changed (poppler's fork of pdftotext, f'rinstance, has some very powerful features; tesseract is now quite usable) or moved (Google Refine → OpenRefine), but it's quite clever and well paced.
posted by scruss at 6:20 AM on August 29, 2014 [3 favorites]


I've done some web scraping for non-journalism work previously using Perl. That's the language I'm already familiar with, and of course Perl has modules already available for web-scraping, completing online forms, etc.

Can anyone comment on how Ruby compares to Perl in terms of ease of programming for this type of thing? I've never touched Ruby, but I get that it's used pretty widely now.
posted by blakewest at 6:28 AM on August 29, 2014


My first thought on this when seeing it was "huh, Ruby and not Perl?," mainly because Perl has done 100% of this pretty well for 15+ years. But one thing Ruby does perhaps slightly better than Perl is that the code reads more like written word, so in the context of teaching journalists, it's fairly appropriate.

For example, in Perl you might iterate like this:
   for my $i (0..4) { some_code(); }
And in Ruby it would be:
   5.times do some_code end

Also Ruby is what all the cool kids are doing, and Perl is relegated to grumpy curmudgeons like myself.

Ruby was a fine choice though, and I think this is a cool article in that journalists, while perhaps not interested in code specifically, should always be interested in something that, say, pushes them interesting criminal logs buried on public websites. Because that's part of journalism.
posted by mcstayinskool at 6:38 AM on August 29, 2014 [3 favorites]


Grumpy curmudgeons roll-call!

I would say, w.r.t. languages, they're the same, but different.
posted by mikelieman at 6:40 AM on August 29, 2014


Re Ruby vs Perl: I think readability/accessibility is huge here. Even among people who sling code for a living, Perl has a reputation for being obtuse and hard to work with (see: "read-only language", "Swiss army chainsaw"). When you're trying to reach out to people with no experience coding and who aren't looking to become professional programmers, one look at Perl syntax could wind up sending them running for the hills. Ruby seems like a much wiser choice.
posted by Itaxpica at 6:45 AM on August 29, 2014 [1 favorite]


Also Ruby is what all the cool kids are doing, and Perl is relegated to grumpy curmudgeons like myself.

I've witnessed three middle aged sysadmins get laid off because they stuck to Perl.
posted by ocschwar at 6:52 AM on August 29, 2014 [2 favorites]


This is awesome. Awesome awesome awesome. I'm going to learn this shit ASAP.
posted by Potomac Avenue at 6:53 AM on August 29, 2014


Related, I produced this image as a key for some (rather good) attorneys to understand what specifically we meant when we say something is patched.
posted by atbash at 6:54 AM on August 29, 2014 [2 favorites]


I've witnessed three middle aged sysadmins get laid off because they stuck to Perl.

I've witnessed three middle aged sysadmins get laid off because they were middle aged.
posted by mcstayinskool at 6:59 AM on August 29, 2014 [21 favorites]


I use Beautiful Soup for this kind of thing in Python. It may just be that the author of these tutorials loves using xpath searches for everything, but the usual Beautiful Soup way of doing things like "for link in soup.find_all('a'): print(link.get('href'))" seems more intuitive to me. The xpath searches seem more messy and harder to read, kind of like using regexes all the time.
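
A minimal sketch of that contrast, written in Ruby to match the tutorials rather than Python. It assumes a small, well-formed sample page and uses REXML (shipped with Ruby, as a bundled gem from 3.0 on) as a stand-in; Nokogiri is the usual choice for real-world HTML, which is rarely this tidy. The sample markup and variable names are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample page; it is well-formed XML so REXML can parse it.
SAMPLE_PAGE = <<~HTML
  <html>
    <body>
      <p>Reports: <a href="/reports/2013.pdf">2013</a> and
      <a href="/reports/2014.pdf">2014</a></p>
      <a name="footer">No link target here</a>
    </body>
  </html>
HTML

doc = REXML::Document.new(SAMPLE_PAGE)

# Style 1: a single XPath expression that selects the href attributes directly.
xpath_links = REXML::XPath.match(doc, '//a/@href').map(&:value)

# Style 2: collect every <a> element, then ask each one for its href,
# closer in spirit to the find_all('a') loop quoted above.
loop_links = []
doc.get_elements('//a').each do |anchor|
  href = anchor.attributes['href']
  loop_links << href unless href.nil?
end

puts xpath_links
puts loop_links
```

Both styles end up with the same two links; the difference is mostly whether the selection logic lives in the query string or in ordinary loop code.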
posted by burnmp3s at 7:01 AM on August 29, 2014 [6 favorites]


Rad! I've been looking for a Ruby project to play around with -- think I'll try the Pfizer Dollars-to-Doctors scrape.

These examples use public websites and (theoretically) publicly available data. Are there any protections for journalists and others who want to scrape data that is freely accessible but copyright protected? I'm thinking if somebody wanted to write a story about, say, therapists in Psychology Today's Therapist Finder tool, would they have any protections for scraping the data to generate statistics? What about releasing a publicly available tool for users to scrape data and generate visualizations themselves on the fly?
posted by elephantsvanish at 7:14 AM on August 29, 2014


The best part is that whenever a web page changes its branding, you get to rewrite the scraper!
posted by blue_beetle at 7:41 AM on August 29, 2014 [3 favorites]


For example, in Perl you might iterate like this:
for my $i (0..4) { some_code(); }
And in Ruby it would be:
5.times do some_code end


Wow, really? I may have to learn Ruby. That's terrific.
posted by Elementary Penguin at 7:51 AM on August 29, 2014


I would use Awk and some shell (I like tcsh). It's 1980s technology and proud of it. I don't worry about getting laid off, just having fun (Awk is a rocking language).
posted by stbalbach at 8:05 AM on August 29, 2014 [2 favorites]


You people with your fancy technovelties. I just photocopy my iPad screen and hand the intern some scissors.
posted by oulipian at 8:35 AM on August 29, 2014 [9 favorites]


blue_beetle, you're very right about that, and any web scraping endeavor should be done with the knowledge that web scraping is a fool's errand for frequently redesigned websites.

However, most of this deals with scraping government entities' websites, and they never have any money to do anything but barely maintain their data, so their websites rarely change. So, at least for this application, web scraping is not so bad.

An aside: about 15 years ago I wrote a web scraper for publicradiofan.com, basically to pull out URLs for favorite shows and create an OPML file where my Squeezebox radio could access them for "on-demand"(ish) NPR shows. The scraper still works 15 years later; it actually outlasted my use for it (now that most of those shows are available via podcast).
posted by mcstayinskool at 8:40 AM on August 29, 2014


Why though? This 'hey everyone could/should be a coder' thing really gets on my tits. It's no more sensible than everyone being a car mechanic or concert pianist.
posted by GallonOfAlan at 8:54 AM on August 29, 2014


I stayed away from mentioning the language used, because you end up using what you know. I use an unholy mess of almost everything mentioned here (except Ruby, as I haven't found a library for it that does anything better than BeautifulSoup), plus my own set of horrors (tr '<' '\012' gets more use than I should admit).
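
For anyone puzzled by that last one: tr '<' '\012' replaces every "<" with a newline (octal 012), so each tag starts on its own line and line-oriented tools can pick through the result. A rough Ruby equivalent, using a made-up HTML fragment; it is a quick hack rather than a parser, and it will trip on any literal "<" inside scripts or comments.

```ruby
# Hypothetical fragment standing in for a downloaded page.
html = '<table><tr><td>Smith</td><td>$1,200</td></tr><tr><td>Jones</td><td>$950</td></tr></table>'

# Same idea as `tr '<' '\012'`: turn every "<" into a newline.
lines = html.tr('<', "\n").lines.map(&:chomp)

# Lines now look like "td>Smith"; keep only those and drop the tag remnant.
cells = lines.select { |line| line.start_with?('td>') }
             .map { |line| line.sub('td>', '') }

puts cells
```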

I like the ideas that PDF Liberation has for dealing with complex PDFs: deal with them as a series of georeferenced objects, so you can use spatial queries to pull out nearby terms. I don't know if they've ever quite got it all working, but the concept is compelling.

> What about releasing a publicly available tool for users to scrape data and generate visualizations themselves on the fly?

A general tool would be hard, because all web pages are structured differently. I kind of thought that ScraperWiki would become that tool, but it scurried off in its own commercial direction. OpenRefine is kind of that tool, but it shows that the task is never trivial.
posted by scruss at 8:57 AM on August 29, 2014 [1 favorite]


> Why though? This 'hey everyone could/should be a coder' thing really gets on my tits. It's no more sensible than everyone being a car mechanic or concert pianist.

I don't think everyone should be a chef, but being able to cook a little bit is a lot cheaper and easier than eating out every night.
posted by svenx at 9:10 AM on August 29, 2014 [7 favorites]


This is the future of journalism. Take notes.
posted by oceanjesse at 9:17 AM on August 29, 2014 [3 favorites]


This 'hey everyone could/should be a coder' thing really gets on my tits. It's no more sensible than everyone being a car mechanic or concert pianist.

I agree with that sentiment in general, because not everyone is really interested in learning how to code anything beyond very simple projects. But I think this is one of those cases where it's a very basic but tedious task where knowing the bare minimum of coding skills can help you automate that task. In that kind of a situation, learning how to scrape a simple web page is not really that different than learning how to use Excel to automate some complex calculations, it just happens to involve using a programming language instead of some other method.
posted by burnmp3s at 9:26 AM on August 29, 2014 [2 favorites]


Not everyone needs to be a car mechanic, but it sure is handy to be able to change a flat tire.

Not everyone needs to be a concert pianist, but it sure is nice to be able to screw around with a ukulele or a harmonica with friends.
posted by jenkinsEar at 9:30 AM on August 29, 2014


Can anyone comment on how Ruby compares to Perl in terms of ease of programming for this type of thing?

With Ruby you can understand what you did one week later.
posted by srboisvert at 9:38 AM on August 29, 2014 [10 favorites]


Why though? This 'hey everyone could/should be a coder' thing really gets on my tits.

Because a journalist typically has exactly zero professional programmers available to them to do this kind of work? And, if they are doing an investigative article on racial demographics of incarcerations or the like, learning to do this will help them get information quicker (and in many cases in volume unattainable through manual means) and thus do better journalism. Which is their job.
posted by mcstayinskool at 9:48 AM on August 29, 2014 [1 favorite]


I would hope journalists would learn to scrape sites for data-driven stories. Similarly I would hope that even non-photojournalists would learn how to focus and expose a photo.
posted by Monochrome at 9:48 AM on August 29, 2014


>> Can anyone comment on how Ruby compares to Perl in terms of ease of programming for this type of thing?

> With Ruby you can understand what you did one week later.


(Wow, what's with all the Perl hate? I still have Perl code from - let's see - 1998, and I understand what it does. I think. No, wait... yes, I see what it did.)

For my most recent foray into writing what was effectively a web scraper, I used Python with Beautiful Soup, and yes, it was much nicer than parsing each element by hand. Ironically(?), I was parsing a web page I had written myself as a grad student, using Perl to format a tediously long table...

I agree with all the comments above - one should know how to cook a simple meal, change a tire, and read simple HTML. For modern journalists, this is an essential skillset.
posted by RedOrGreen at 10:08 AM on August 29, 2014 [4 favorites]


With Ruby you can understand what you did one week later.

Yeah, this statement was pretty much bullshit, but any time you start talking about Perl someone is going to pipe up in some way like this. Perl gives you the freedom to write shitty code, which is what flavors these opinions. But really, you can write shitty code in most languages, including Ruby.

I use Perl I wrote 10+ years ago quite frequently, and while my coding style has changed, it's absolutely understandable.
posted by mcstayinskool at 10:12 AM on August 29, 2014 [2 favorites]


And don't forget Hacks and Hackers which I highly recommend.
posted by fallingbadgers at 11:06 AM on August 29, 2014 [1 favorite]


I learned web-scraping and quite a bit of useful code from Dan's "Bastard's Book of Ruby" last year over the course of about 2 months. I was coming from years of *nix but very little actual programming experience beyond shell-scripting. After looking at several other tutorials in various languages, I stuck with Ruby largely because BBofR made it seem like the quickest and most intuitive way to move beyond wget to actually scraping complex data from intricate sites.

For anyone interested, this is a great way to gain a lot more coding/web-wrangling experience with concrete examples; a novice will come away from these projects with a much better understanding of how websites work, and dip toes into regular expressions, database management, and all sorts of fun codemonkey goodness.

And yeah, although Dan's guide makes the language/process appealing, there's no reason it has to be Ruby. Most of the major programming languages have web/data-scraping libraries (R and Rawler is a good option for you econ/stem/social science types). Ruby has another superbly useful gem called "Watir" (based on Selenium IIRC) that's great for web automation; I often use it to scrape data from sites with tons of javascript or modals, which can be tricky with mechanize.
posted by aspersioncast at 11:21 AM on August 29, 2014 [1 favorite]


No experience with Watir but I use Selenium for web scraping all the time. The main data that I scrape is fetched from the backend DB with javascript and I need the browser in order for that to run.

Perl is my go-to language for anything really because I've been writing code in it for so long (and I too have 10+ year old code of my own and others that I have no problem reading) but Ruby is lovely and works well with Selenium.

If Perl falls easily out of your head then Ruby will too. Personally I'd have been very happy if Ruby had been called Perl 6. It's not the same but it's not a million miles away either.
posted by vbfg at 11:40 AM on August 29, 2014 [1 favorite]


Why though? This 'hey everyone could/should be a coder' thing really gets on my tits. It's no more sensible than everyone being a car mechanic or concert pianist.

It's not about should, it's about being empowered in the modern world. I'm not a coder, but I know how to code a little bit, the same way that I am not a mechanic, but I have tools and can fix most things, and I am not a writer, but I can communicate with the written word when I need to.

Being empowered in the world regularly saves my ass. (It also saves me a shit ton of time and money, raising my standard of living)

I see people who go the other way, and I don't like the results. In a world in which you are far less empowered, figuratively (or literally) illiterate, you become so dependent on so many weak links, life is more stressful, so many more ways for things to go wrong, when things do go wrong they more often compound and spiral into serious problems. It doesn't look like a good way to live.

If you reliably earn mid-6-figures or more, then fuck it, you can afford to let the money step in and take care of you. But otherwise, it's good to be able to take care of things yourself.
posted by anonymisc at 3:15 PM on August 29, 2014 [4 favorites]


Personally I'd have been very happy if Ruby had been called Perl 6.

Can't we still just do this? PLEASE!
posted by mikelieman at 7:08 PM on August 29, 2014


I coded in Perl at various times between 1995 and 2001; I've been coding mostly in Ruby since 2010. I much prefer Ruby, although I know Perl has gone through some changes since I left it.

Ruby is a perfectly good language for scraping; the other one I would recommend for beginners is Python but I'm not familiar with its facilities for downloading and parsing HTML. I have written scrapers using Mechanize and Nokogiri in Ruby and it's been good for that, and I agree that as long as you limit your xpath use the resulting code will be sane looking.

I prefer Ruby to Python but I think it's slightly harder to pick up. The syntax is a bit odd. You say "my_array.each do |x|" instead of "for x in my_array" in Python. But the thing that Ruby has is blocks, which is why it uses ".each" instead of the for loop syntax that nearly every other language has. Blocks are a very powerful thing. Python has list comprehensions (eg "[x for x in my_list if x > 10]"), but Ruby accomplishes the same thing with blocks (the corresponding example would be "my_list.select { |x| x > 10 }"). Hopefully my parenthetic examples show how Ruby's syntax is a bit terser but just as effective.
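
A small runnable version of those two Ruby idioms, using a made-up list; the Python equivalents appear only in the comments.

```ruby
my_list = [4, 15, 8, 23, 42] # made-up sample data

# Iteration: Python's "for x in my_list" becomes a block passed to each.
doubled = []
my_list.each do |x|
  doubled << x * 2
end

# Filtering: Python's "[x for x in my_list if x > 10]" becomes select.
big = my_list.select { |x| x > 10 }

p doubled
p big
```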

The thing about blocks is that they offer a way to extend the language a bit. Python offers map() and reduce() functions for lists, and Ruby has equivalent methods, but even if it didn't, you could add your own versions and use them fairly naturally. For instance, if you wanted to write a method that took a list and applied some arbitrary code to each element, you could do something like:
def map(input_array, &block)
  output_array = []
  input_array.each do |x|
    output_array.push(yield x)
  end
  return output_array
end
Then you can use your hand-rolled map to square elements of a list thusly:
def squared(x)
  return x * x
end

map([1,2,3], &lambda { |x| squared(x) })
(which would return [1,4,9] if I've written all this correctly.)
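
If the &lambda call looks heavy, a method written with yield also accepts an ordinary block. Here is a sketch of that, assuming the method is renamed my_map (a hypothetical name, chosen only to avoid confusion with the built-in Array#map), plus a hand-rolled my_reduce in the same spirit:

```ruby
# Hypothetical names; Ruby already ships Array#map and Array#reduce.
def my_map(input_array)
  output_array = []
  input_array.each do |x|
    output_array.push(yield(x)) # run the caller's block on each element
  end
  output_array
end

def my_reduce(input_array, initial)
  accumulator = initial
  input_array.each do |x|
    accumulator = yield(accumulator, x) # fold each element into the running value
  end
  accumulator
end

squares = my_map([1, 2, 3]) { |x| x * x }
total   = my_reduce(squares, 0) { |sum, x| sum + x }

p squares
p total
```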

This ability to write methods which will run arbitrary code inside, while a bit scary, is a limited version of #8 in Paul Graham's list of 9 things that made Lisp different. But even though limited, it was powerful enough to make one hacker state that Ruby is an acceptable Lisp.

One other random nice thing about Ruby is that it was fairly eclectic in what it picked up from other languages. People will tell you that it's Smalltalk-based, and there is truth to that, but one thing it apparently took from Perl is the "if/unless" constructions. So not only can you say
unless foo
  do_this_other_thing()
end
but you can also say
do_this_one_thing() if foo

do_this_other_thing() unless bar
which is straight out of Perl and one reason I like Ruby.
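
A runnable sketch of both forms side by side; the variable names and values are invented for illustration.

```ruby
# Made-up values for illustration.
stories_filed = 3
deadline_passed = false

messages = []

# Block form of unless: the body runs only when the condition is false.
unless deadline_passed
  messages << 'still time to scrape'
end

# Modifier forms: the statement comes first, the condition trails it.
messages << 'filed something' if stories_filed > 0
messages << 'nothing filed' unless stories_filed > 0

puts messages
```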

OK, done with the Ruby proselytizing for the night. :-)
posted by A dead Quaker at 7:10 PM on August 29, 2014 [1 favorite]


Why though? This 'hey everyone could/should be a coder' thing really gets on my tits. It's no more sensible than everyone being a car mechanic or concert pianist.

I agree that in general it's stupid - there's no reason at all for everyone to learn. But journalists as a class are collectively panicking about how they're going to be able to do good work while also being able to pay the bills while most of the traditional businesses who pay for journalism die.

With a bit of coding competence they can expand their demonstrable list of things-what-they-can-do and distinguish themselves from their non-tech minded peers in the eyes of the people with the cash money. See also: Librarians.
posted by coleboptera at 7:40 PM on August 29, 2014


My husband is teaching a journalism course now that includes a very brief exploration into data scraping, and his current advice is to find a coder friend and get them to help you do this, but learn statistics and infographics yourself. I've just sent him this. Super helpful, thanks! It is becoming an interesting new skill for journalists looking at big trends.
posted by viggorlijah at 8:50 PM on August 29, 2014 [1 favorite]


There are some great tools you can use to scrape without needing to write code. See:

Kimono

import.io
posted by victory_laser at 9:36 PM on August 29, 2014


Why though? This 'hey everyone could/should be a coder' thing really gets on my tits.

The computer is the universal machine: it is everywhere and it can do just about anything. Everyone (and especially knowledge workers) should know how to command the universal machine, at least to the point that they understand what programming means and what a computer can and cannot easily do. If the computer is still a magic box to you, you should learn a little more.
posted by pracowity at 1:46 AM on August 30, 2014


More on the "Why bother learning...?" aspect: this is about journalists, presumably those reporting on computers and technology. It shouldn't take much imagination to see how that could be beneficial to all parties involved: the reporter, the subjects, the readers. Journalists are not generic. The good ones tend to have good domain knowledge.

As far as everyone else goes, sure, you don't *need* to know how to program a computer to use it, just as you don't *need* to know how to change the oil in your car. On the other hand, if you do know basic auto maintenance and know how your car works, you can recognize small problems before they become big ones, and will probably be a better driver and owner for it. Similarly, knowing how a computer works, how the internet works, and how computers execute instructions (programming), will undoubtedly make you a more effective user of a very powerful communications and analytical tool.
posted by scelerat at 3:17 AM on August 30, 2014 [2 favorites]




This thread has been archived and is closed to new comments