Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty.
... Even Jon Skeet cannot parse HTML using regular expressions ...
awesome
While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.
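To make the "limited, known set" caveat concrete, here is a minimal Ruby sketch. It assumes a made-up input where every link is written exactly the same way, then runs the same pattern over a slightly messier variant. The HTML snippets, paths, and the `extract_hrefs` helper are all hypothetical and exist only for illustration.

```ruby
# Hypothetical fixed-format input: every link is written exactly as
# <a href="..."> with double quotes and href as the first attribute.
KNOWN_HTML = <<~HTML
  <ul>
    <li><a href="/docs/intro.html">Intro</a></li>
    <li><a href="/docs/api.html">API</a></li>
  </ul>
HTML

# The same kind of content, written the way HTML in the wild often is.
WILD_HTML = <<~HTML
  <a class="nav" href='/docs/guide.html'>Guide</a>
  <!-- <a href="/docs/old.html">commented out</a> -->
HTML

# Only matches the one spelling of a link described above.
HREF_PATTERN = /<a href="([^"]+)"/

# Return every href the pattern can see in the given string.
def extract_hrefs(html)
  html.scan(HREF_PATTERN).flatten
end

p extract_hrefs(KNOWN_HTML) # => ["/docs/intro.html", "/docs/api.html"]
p extract_hrefs(WILD_HTML)  # => ["/docs/old.html"]
```

On the known format the pattern returns both links. On the second input it misses the live link (extra attribute, single quotes) and returns the one inside the HTML comment, which is the kind of failure the warnings above are about.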
The problem is that these days when most folks say regex they mean Perl (or PCRE) regex.
True. Perl's regexps haven't been regexps for a long while. And it is, in fact, possible to parse HTML fairly reliably using a Perl regexp, as in the following example:
$foo =~ m/(?{ require HTML::Parser; HTML::Parser->new()->parse($_)->eof; })/;
<([a-z]+) *[^/]*?> then you are probably ill equipped to decide for yourself when one of those times is.
Pure programmer Not-Invented-Here syndrome. It's ridiculously easy to use a library that can take in XPATH, people just think they have big dicks when they can write their own code.
For data extraction, there's only one right way to do things: Run an actual browser, and then interrogate its DOM.
////////////////////////////////////////////////////
// Available macro substitutions:
//
// Simple substitutions:
// $$ Literal dollar sign
//
// ${name} Vars[name] if defined, else "".
// $a Same, but for single letter names only.
//
// ${name[ix]} Vars[name].split( '|')[ix]
// ${name<jx>} Vars[name].split('\n')[jx]
//
// ${name#func} func(Vars[name])
// ${name##func} Same
//
//
// Regex substitutions:
// ${name/pat/repl/flags} Vars[name].replace(/pat/flags,repl)
// ${name/pat/repl/sep/flags} Like gnu grep -o. Non-matching text is replaced with sep.
//
// ${name/pat/repl/flags##func} func(regex result)
// ${name/pat/repl/sep/flags##func} func(regex result)
//
// A literal 'str' can be used in place of name in the above substitutions.
//
//
// Conditional selection:
// ${cond?clause1:clause2} If cond is true, use clause1, else clause2.
// ${cond?clause1:clause2##func} func(conditional result)
//
// Cond:
// 'str' Literal string
// name True if Vars[name] is not one of { 0, "", null, undefined, etc. }
// name#func True if func(Vars[name]) is not zero, etc.
// name/pat/flags True if Vars[name] matches regex (actually, searches)
//
// Clause:
// 'str' Literal string
// name Vars[name]
// name#func func(clause); compare to ##func which applies to whole expression
// name/pat/repl/flags clause.replace(regex,repl)
//
//
// Shorthand notation:
// ${cond?clause1} Same as ${cond?clause1:''}
// ${cond?:clause2} Same as ${cond?'':clause2}
// ${clause1:clause2} Same as ${clause1?clause1:clause2}
//
//
// Binary operators:
// ${op1+op2} String concatenation.
//
// Operands:
// 'str' Literal string
// 'str'#func func('str')
//
// name Vars[name]
// name#func func(Vars[name])
//
//
// Home/search conditionals:
// @{substitution} if (!Vars.q) then substitution, else ''
// ?{substitution} if ( Vars.q) then substitution, else ''
//
// Substitution can be a simple substitution, regex substitution,
// conditional selection, or binary operation.
//
//
// var MacroPattern =
// esc /(@@|\?\?|\$\$)|
// gvar \$([a-z])|
// home, search (?:(@)|(\?)|\$)
// \{
// (?:
// (?:
// cstr,cvar,cix,cjx (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)
// cpat,cflag,cfnc (?:\/((?:[^\\\/]|\\(?:.|\s))+)\/([i]*)|#([_a-z]+))?
// qmark (\?)
// )?
// (?:
// astr,avar,aix,ajx (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)
// apat,asub,asep,aflag,afnc (?:\/((?:[^\\\/]|\\(?:.|\s))+)\/((?:[^\\\/]|\\(?:.|\s))*)\/((?:[^\\\/]|\\(?:.|\s))*\/)?([gi]*)|#([_a-z]+))?
// )?
// (?:
// colon (:)
// bstr,bvar,bix,bjx (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)
// bpat,bsub,bsep,bflag,bfnc (?:\/((?:[^\\\/]|\\(?:.|\s))+)\/((?:[^\\\/]|\\(?:.|\s))*)\/((?:[^\\\/]|\\(?:.|\s))*\/)?([gi]*)|#([_a-z]+))?
// )?
// |
// NOTE: only '+' is implemented (?:
// xstr,xvar,xix,xjx,xfnc (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:#([_a-z]+))?
// binop (\+|%%?|##?)
// ystr,yvar,yix,yjx,yfnc (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:#([_a-z]+))?
// )
// )
// gfnc (?:##([_a-z]+))?
// \}
// /g;
//
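The simple substitutions at the top of that comment block can be sketched in a few lines. The following is a rough Ruby approximation of a small subset only (`$$`, `$a`, `${name}`, `${name[ix]}`, `${name<jx>}`), not the original implementation, which appears to be JavaScript. The names `SIMPLE_MACRO` and `expand`, and the sample variables, are assumptions made for this sketch.

```ruby
# Sketch of a subset of the macro syntax described above.
# Regex substitutions, conditionals, functions and @{...} / ?{...}
# are deliberately not handled and pass through untouched.
SIMPLE_MACRO = /
    (\$\$)                                        # 1: literal dollar sign
  | \$([a-z])                                     # 2: single-letter variable
  | \$\{([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?\}   # 3: name, 4: [ix], 5: <jx>
/x

# Expand the supported macros in template using the vars hash (string keys).
# Undefined names and out-of-range indexes expand to the empty string.
def expand(template, vars)
  template.gsub(SIMPLE_MACRO) do
    m = Regexp.last_match
    if m[1]
      '$'
    elsif m[2]
      vars.fetch(m[2], '').to_s
    else
      value = vars.fetch(m[3], '').to_s
      if m[4]
        value.split('|')[m[4].to_i].to_s   # ${name[ix]}
      elsif m[5]
        value.split("\n")[m[5].to_i].to_s  # ${name<jx>}
      else
        value                              # ${name}
      end
    end
  end
end

vars = { 'q' => 'ruby', 'tags' => 'a|b|c', 'lines' => "one\ntwo" }
puts expand('Cost: $$5, q=$q, second=${tags[1]}, line=${lines<0>}, none=[${missing}]', vars)
# prints: Cost: $5, q=ruby, second=b, line=one, none=[]
```

Keeping the whole grammar in one alternation, as the original pattern does, means a single `gsub` pass handles every macro. The cost is the capture-group bookkeeping visible in the numbered comments.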
require 'rubygems'
require 'nokogiri'
require 'open-uri'
###Load Google homepage (URI.open is required on Ruby 3.0+, where open-uri no longer patches Kernel#open)
doc = Nokogiri::HTML(URI.open('http://google.com/'))
###Get all the links and output the text
doc.xpath("//a").each { |data| puts data.content }
###Which will output the following...
#Images
#Videos
#Maps
#....
###Get all the "P" tags
doc.xpath("//p").each { |data| puts data.content }
###Which will output the following...
#©2009 - Privacy
How friggin hard could it possibly be? It's like 6 lines of code, and retard easy to use, dead obvious to maintain, with documentation on the homepage.
Meow. So let's try round 'n', where 'n' is becoming a large number.
Once more unto the breach, dear friends, once more;
Or close the wall up with our English dead.
In peace there's nothing so becomes a man
As modest stillness and humility:
But when the blast of war blows in our ears,
Then imitate the action of the tiger;
for x in $( pbpaste | egrep -o '<a href="/tarballs/[^"]+' | cut -c 10- ); do
  echo $x
  curl -O http://www.opensource.apple.com$x 2>/dev/null
done
It worked great. It took less than five minutes to write. I didn't even need to open an editor, I just hammered it out on the command line. The first time I ran it, I substituted echo for curl. The commands looked fine, so I ran it for real.
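The echo-before-curl habit described there carries over to any language. As a rough sketch, the same loop in Ruby might build the commands first and only run them when asked. The helper names, the `dry_run:` keyword, and the sample input are assumptions for illustration. Only the base URL and the `/tarballs/` pattern come from the loop above.

```ruby
BASE_URL = 'http://www.opensource.apple.com' # taken from the shell loop above

# Pull the /tarballs/... paths out of pasted page source, mirroring
# egrep -o '<a href="/tarballs/[^"]+' | cut -c 10-
def tarball_paths(page_source)
  page_source.scan(%r{<a href="(/tarballs/[^"]+)}).flatten
end

# Build one curl command per path. With dry_run (the default) the
# commands are only returned for inspection, never executed.
def fetch_commands(page_source, dry_run: true)
  commands = tarball_paths(page_source).map do |path|
    ['curl', '-O', "#{BASE_URL}#{path}"]
  end
  commands.each { |cmd| system(*cmd) } unless dry_run
  commands
end

# Made-up sample input; a real run would read the copied page source.
sample = '<a href="/tarballs/example/example-1.tar.gz">example-1</a>'
fetch_commands(sample).each { |cmd| puts cmd.join(' ') }
# prints: curl -O http://www.opensource.apple.com/tarballs/example/example-1.tar.gz
```

Calling `fetch_commands(sample, dry_run: false)` would run the downloads for real, which matches the second pass in the story.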
posted by spiderskull at 1:32 PM on November 15, 2009