xpathtool - powerful xpath queries on the commandline

What is xpathtool?

Short version: xpath query tool for xml and html.

Long version: swanky frontend to xsltproc which takes an xpath query and content and spits out the results.

Dependencies: xsltproc (comes with libxslt), xmllint (comes with libxml2).

Download

xpathtool-20071102.tar.gz

Usage

--ihtml
Set input format as html.
--otext
Output should be text. Implemented as <xsl:value-of select="." />
--oxml
Output should be xml.
--ohtml
Output should be html.
--indent (default) or --noindent
Set whether xml or html output should be indented according to nesting depth.
--stripspace=XXX
Define elements whose content should be space-stripped. Implemented with <xsl:strip-space>.
--pretty (default) or --nopretty
Pretty print xml and html output by filtering through 'xmllint --format'

Example: Technorati WTF RSS

% GET feeds.technorati.com/wtf | ./xpathtool.sh '//link' | tail -3
http://technorati.com/wtf/we-can-take-our-country-back/2007/05/16/ron-paul-is-standing-up-tot-the-establishment-1
http://technorati.com/wtf/giuliani-is-deluded/2007/05/16/delusional-and-out-of-touch-with-reality-1
http://technorati.com/wtf/macbook/2007/05/16/apples-rule-1

Example: Slashdot article links

Slashdot is worthless. The article writeups are worthless. The comments are worthless. The users are worthless.

Sometimes, the linked content is not. Let's pull out all the links in all the articles on the frontpage:
# slashdot articles are inside the following html element
% xbase="//div[@class='article']//div[@class='intro']/i"
% GET www.slashdot.org | ./xpathtool.sh --ihtml "$xbase//a/@href|$xbase//a/text()" | paste -d" " - -
http://www.foreignpolicy.com/story/cms.php?story_id=3807 the world's biggest digital dump
http://googleblog.blogspot.com/2007/05/google-apps-partner-edition.html turn over their entire email operation to Google
http://apcmag.com/6138/the_dark_side_of_google_apps_for_isps the dark side of Google's offer
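
The paste at the end of that pipeline is what puts each URL and its link text on one line: with two "-" arguments it reads standard input twice per output line, so consecutive lines get joined in pairs. A minimal sketch with made-up input (the example.com URLs and link texts are placeholders, not real output):

```shell
# Four input lines, alternating URL and link text, joined two at a time.
pairs=$(printf '%s\n' \
  'http://example.com/a' 'first link text' \
  'http://example.com/b' 'second link text' |
  paste -d" " - -)
printf '%s\n' "$pairs"
```

This only lines up when the query yields exactly one URL and one text node per link, in that order. A link with no text, or with text split across several nodes, would shift every pair after it.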