Can anybody recommend good tools or services (hosted tools) for scraping data from websites?

Tools should be open source, and being well documented is a big plus.

I prefer open services (i.e. the code is open source and the data is open) that are free to use (at least up to some level of usage).

asked 19 Jan '11, 09:02

rgrp ♦♦

edited 19 Jan '11, 09:47

Does this include hosted tools, or just standalone code libraries? Any particular language preferences?

(19 Jan '11, 09:30) psychemedia ♦♦

@psychemedia: by "service" I meant hosted tools (I have edited the question to reflect this)

(19 Jan '11, 09:51) rgrp ♦♦

It's Scraperwiki jeopardy then, right? The non-open http://www.80legs.com/ might also be worth a go; I haven't tested it.

(20 Jan '11, 09:19) pudo ♦♦


Scraperwiki is a hosted site providing editable, in-browser code pages that let you write screen scrapers in a variety of languages (PHP, Python, Ruby). Data can be scraped into a hosted data store and then accessed via an API (XML, JSON, PHP, YAML or CSV).

Scraperwiki provides a variety of utility libraries to support screenscraping activities, including:

  • Beautiful Soup (py)

Examples are provided for scraping data from:

  • HTML pages
  • PDF docs

If you want to host your own version of Scraperwiki, the code is also available for download under the GNU Affero General Public License.
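Scraperwiki's own helper library only runs on the site, but the underlying pattern of a scraper written there (fetch a page, parse out the cells you want, collect rows for the data store) can be sketched with Python's standard library alone. The HTML fragment here is illustrative, and nothing below uses Scraperwiki's actual API:

```python
from html.parser import HTMLParser

# Minimal parser that collects the text of every <td> cell,
# the kind of row extraction a screen scraper typically does.
class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

# In a real scraper the HTML would come from urllib.request.urlopen(url).
parser = CellCollector()
parser.feed("<table><tr><td>LHR</td><td>Heathrow</td></tr></table>")
print(parser.cells)  # ['LHR', 'Heathrow']
```

In a hosted scraper the collected cells would then be saved to the site's data store instead of printed.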

This answer is marked "community wiki".

answered 19 Jan '11, 09:40

psychemedia ♦♦

edited 19 Jan '11, 09:40

XQuery, the XML query language, is perhaps an unusual and even unlikely candidate for scraping, but as implemented in the eXist-db open source XML database (a Java application), this functional language is very powerful for this kind of work.

For example, to get the list of airports beginning with a given letter from Wikipedia (e.g. the A's) as XML:

declare namespace h = "http://www.w3.org/1999/xhtml";
let $letter := request:get-parameter("letter", "A")
let $uri := concat("http://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_", $letter)
let $table := doc($uri)//h:table[2]
return
  <codes>
    { for $row in $table/h:tr[empty(h:th)]  (: ignore separator rows :)
      let $icao := $row/h:td[2]/string()
      return
        if ($icao ne "")
        then <ICAO>{$icao}</ICAO>
        else ()
    }
  </codes>

and execute it as:

The script could be extended to iterate over the letters and compile a full listing.

Modules in eXist-db also handle plain-text and malformed XML documents, and can export as JSON. The database provides storage for the scripts themselves, as well as for cached files, reference data and configuration files.


answered 19 Jan '11, 19:55

kitwallace

edited 19 Jan '11, 20:55

I use eXist for screen scraping too, and think it's a great tool!

(27 Jan '11, 20:20) Joe Wicentowski

Another tool you might like to use is Web-Harvest. "Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content."

The WSO2 Mashup Server embeds Web-Harvest and adds a lot of other functionality that lets you build web services from your web scrapers.


answered 20 Jan '11, 01:48

rgardler

Other techniques I've seen include using Google Refine; see the wiki:



This requires a fixed list of URLs and a bit of coding to extract the relevant data, but does the retrieval for you.

For an even more user-friendly approach that can also automate the crawling process, it might be worth looking at Needlebase, which doesn't require any developer chops at all, as far as I can see.


answered 23 Jan '11, 08:04

JeniT

There are so many services that I created a flow chart to help me and others work out quickly which is best to use, depending on the data you're scraping. It's here: http://onlinejournalismblog.com/2011/09/06/gathering-data-a-flow-chart-for-data-journalists/


answered 18 Nov '11, 09:50

paulbradshaw

@paulbradshaw: really useful, Paul. Would you be up for contributing this to http://datapatterns.org/ (either by allowing someone else to include it, or by adding a pattern yourself, e.g. on "how to find and gather data")?

(11 Dec '11, 16:43) rgrp ♦♦

Thanks, happy for anyone to add it. It's not clear how I add it myself, though?

(13 Dec '11, 09:16) paulbradshaw

The SIMILE project at MIT has a great tool called Solvent. It's a Firefox extension that lets you build and run screen scrapers. The scraped data is then added to their Piggy Bank Firefox extension, which turns your browser into a mashup platform.

Brilliant stuff.


answered 19 Jan '11, 11:11

rgardler

edited 20 Jan '11, 09:43

rgrp ♦♦

Is Solvent working on the latest stable release of Firefox? It was broken on recent versions the last time I checked.

(20 Jan '11, 10:08) IainSproat

http://open.dapper.net/ is an interactive GUI for marking the interesting, changing parts of arbitrary web pages and making a feed from them. I don't currently have a running case where it worked, but I keep trying it because it sometimes gets close to working :)


answered 13 Feb '11, 18:56

drewp

My current tool of choice is the Python lxml library: it's extremely fast, can parse directly from URLs, and allows easy access using both generic XPath queries and its own variants. In general, I've found it superior to BeautifulSoup and Python's built-in utilities.
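As a minimal sketch of that XPath-driven style, assuming lxml is installed (the HTML fragment here is illustrative; lxml.html.parse() would accept a live URL in its place):

```python
from lxml import html

# Parse an HTML fragment into an element tree.
doc = html.fromstring("""
<table>
  <tr><th>Code</th><th>Name</th></tr>
  <tr><td>LHR</td><td>Heathrow</td></tr>
  <tr><td>CDG</td><td>Charles de Gaulle</td></tr>
</table>
""")

# Generic XPath: the first cell of every data row, skipping the header row.
codes = doc.xpath("//tr[not(th)]/td[1]/text()")
print(codes)  # ['LHR', 'CDG']
```

The same query works unchanged whether the tree came from a string, a file, or a fetched page.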

Another important tool for me has been MongoDB. While one can debate its value as a general-purpose DB, its ability to hold semi-random JSON data fresh from the web is simply unbeatable ;-)


answered 19 Jan '11, 21:46

pudo ♦♦


That's all you need to know.


answered 23 Jan '11, 18:47

krees

Google has announced that it will retire Needlebase on June 1, 2012 (see http://needlebase.com/).

(15 May, 03:03) re-search-er ♦

Here are a few more great open-source projects that are fairly easy to use:





answered 23 Mar, 11:00

Data9er

edited 23 Mar, 11:04

Asked: 19 Jan '11, 09:02

Seen: 5,848 times

Last updated: 26 Jul, 18:40

powered by OSQA