
Can anybody recommend good tools or services (hosted tools) for scraping data from websites?

Tools should be open source, and being well documented is a big plus.

I prefer open services (i.e. the code is open source and the data is open) that are free to use (at least up to some level of usage).

asked 19 Jan '11, 09:02 by rgrp ♦♦ (edited 19 Jan '11, 09:47)

Does this include hosted tools, or just standalone code libraries? Any particular language preferences?

(19 Jan '11, 09:30) psychemedia ♦♦

@psychemedia: by "service" I meant hosted tools (I have edited the question to reflect this)

(19 Jan '11, 09:51) rgrp ♦♦

It's ScraperWiki territory then, right? The non-open http://www.80legs.com/ might also be worth a go; I haven't tested it.

(20 Jan '11, 09:19) pudo ♦♦


ScraperWiki is a hosted site that provides editable, in-browser code pages that let you write screen scrapers in a variety of languages (PHP, Python, Ruby). Data can be scraped into a hosted data store and then accessed via an API (XML, JSON, PHP, YAML or CSV).

ScraperWiki provides a variety of utility libraries to support screen-scraping activities, including:

  • Beautiful Soup (py)

Examples are provided for scraping data from:

  • HTML pages
  • PDF docs

If you want to host your own version of Scraperwiki, the code is also available for download under the GNU Affero General Public License.
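
As a rough sketch of the workflow (the target URL, table layout and field names here are invented, and the helper calls follow ScraperWiki's documented Python library, so the details may differ), a minimal scraper might look like this:

import scraperwiki                 # ScraperWiki's helper library (documented scrape/save helpers)
from bs4 import BeautifulSoup      # import path differs for older Beautiful Soup releases

# Fetch a page (hypothetical example URL)
html = scraperwiki.scrape("http://example.com/members.html")
soup = BeautifulSoup(html)

# Pull one record per table row and save it into the hosted data store,
# keyed on "name" so re-runs update rather than duplicate rows.
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) >= 2:
        record = {"name": cells[0].get_text(strip=True),
                  "value": cells[1].get_text(strip=True)}
        scraperwiki.sqlite.save(unique_keys=["name"], data=record)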

answered 19 Jan '11, 09:40 by psychemedia ♦♦ (community wiki; edited 19 Jan '11, 09:40)

XQuery, the XML query language, is perhaps an unusual and even unlikely candidate for scraping, but as implemented in the eXist-db open source XML database (a Java application), this functional language is very powerful for this kind of work.

For example, to get the list of airports beginning with a given letter from Wikipedia (e.g. the A's) as XML:

declare namespace h = "http://www.w3.org/1999/xhtml";
let $letter := request:get-parameter("letter","A")
let $uri := concat("http://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_",$letter)
let $table := doc($uri)//h:table[2]
return 
<airports>
  {for $row in $table/h:tr[empty(h:th)]  (: ignore separator rows :)
   return
     <airport>
      <IATA>{$row/h:td[1]/string()}</IATA>
      {let $icao := $row/h:td[2]/string()
       return 
         if ($icao ne "")
         then <ICAO>{$icao}</ICAO>
         else ()
      }
      <name>{$row/h:td[3]/string()}</name>
      <location>{$row/h:td[4]/string()}</location>
     </airport>
  }
</airports>

and execute it as: http://184.73.216.20/exist/rest/db/apps/airports/wikidata.xq?letter=A

The script could be extended to iterate over the letters to compile a full listing.

Modules in eXist-db also handle plain-text and malformed XML documents and can export as JSON. The database provides storage for the scripts themselves as well as for cached files, reference data and configuration files.

answered 19 Jan '11, 19:55 by kitwallace (edited 19 Jan '11, 20:55)

I use eXist for screen scraping too, and think it's a great tool!

(27 Jan '11, 20:20) Joe Wicentowski

Another tool you might like to use is WebHarvest. "Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content."

The WSO2 Mashup Server embeds Web-Harvest and adds lots of other functionality that allows you to build web services from your web scrapers.

answered 20 Jan '11, 01:48 by rgardler

Other techniques I've seen include using Google Refine; see the wiki:

http://code.google.com/p/google-refine/wiki/FetchingURLsFromWebServices

http://code.google.com/p/google-refine/wiki/StrippingHTML

This requires a fixed list of URLs and a bit of coding to extract the relevant data, but it does the retrieval for you.

For an even more user-friendly approach that can also automate the crawling process, it might also be worth looking at Needle, which doesn't require any developer chops at all from what I can see.

answered 23 Jan '11, 08:04 by JeniT

There are so many services that I created a flow chart to help me and others work out quickly which is the best to use depending on the data you're scraping. It's here: http://onlinejournalismblog.com/2011/09/06/gathering-data-a-flow-chart-for-data-journalists/

answered 18 Nov '11, 09:50 by paulbradshaw

@paulbradshaw: really useful, Paul. Would you be up for contributing this to http://datapatterns.org/ (either by allowing someone else to include it, or by adding a pattern yourself, e.g. on "how to find and gather data")?

(11 Dec '11, 16:43) rgrp ♦♦

Thanks - I'm happy for anyone to add it, but it's not clear how I would add it myself?

(13 Dec '11, 09:16) paulbradshaw

The Simile project at MIT has a great tool called Solvent. It's a Firefox extension that allows you to build and run screen scrapers. The scraped data is then added to their Piggy Bank Firefox extension, which turns your browser into a mashup platform.

Brilliant stuff.

answered 19 Jan '11, 11:11 by rgardler (edited 20 Jan '11, 09:43 by rgrp ♦♦)

Does Solvent work on the latest stable release of Firefox? It was broken on the latest versions the last time I checked.

(20 Jan '11, 10:08) IainSproat

http://open.dapper.net/ is an interactive GUI for indicating the interesting, changing parts of arbitrary web pages and making a feed from them. I don't currently have a running case where it worked, but I keep trying it because it sometimes gets close to working :)

answered 13 Feb '11, 18:56 by drewp

My current tool of choice is the Python lxml library: it's extremely fast, can parse directly from URLs, and allows for easy access using both generic XPath queries and its own variants. In general, I've found it to be superior to BeautifulSoup and the standard library's built-in utilities.

Another important tool for me has been MongoDB. While one can debate its value as a general-purpose DB, its capability to hold semi-structured JSON data fresh from the web is simply unbeatable ;-)
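
As a rough illustration of that combination (the URL, XPath expressions, and database/collection names below are placeholders, not part of the original answer), an lxml scrape pushed straight into MongoDB might look like this:

from lxml import html            # lxml's HTML parser
from pymongo import MongoClient

# lxml can parse straight from a URL (hypothetical example page)
tree = html.parse("http://example.com/listings.html")

# Extract one dict per listing using plain XPath
records = []
for item in tree.xpath('//div[@class="listing"]'):
    records.append({
        "title": item.findtext("h2"),              # ElementPath shortcut for the first <h2> child
        "link": item.xpath("string(.//a/@href)"),  # XPath string() collapses to a plain value
    })

# Dump the loosely structured results straight into MongoDB
client = MongoClient()           # assumes a local mongod on the default port
db = client["scraping"]
if records:
    db["listings"].insert_many(records)

Because the collection is schemaless, the XPath extraction can change between runs without any migration step.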

answered 19 Jan '11, 21:46 by pudo ♦♦

Needlebase.

That's all you need to know.

answered 23 Jan '11, 18:47 by krees

Google has announced that it will retire Needlebase on June 1, 2012 (see http://needlebase.com/).

(15 May, 03:03) re-search-er ♦

Here are a few more great open source projects that are fairly easy to use:

http://irobotsoft.com/download.htm

http://deixto.com/download.php

http://scrapy.org/
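
Of these, Scrapy is probably the most widely used. A minimal spider sketch follows; the site, selectors, and field names are made up for illustration, and the exact API depends on the Scrapy version installed:

import scrapy

class ExampleSpider(scrapy.Spider):
    """Toy spider: crawls a hypothetical listing page and yields one item per entry."""
    name = "example"
    start_urls = ["http://example.com/page1.html"]

    def parse(self, response):
        # The CSS selectors are placeholders; adjust them to the target site's markup
        for entry in response.css("div.entry"):
            yield {
                "title": entry.css("h2::text").get(),
                "url": entry.css("a::attr(href)").get(),
            }
        # Follow a "next page" link, if one exists
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved to a file, a spider like this can be run without a full project via scrapy runspider spider.py -o items.json.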

answered 23 Mar, 11:00 by Data9er (edited 23 Mar, 11:04)
