I am a UK journalist, writing for the Mail on Sunday. I work on infographics, and I'm looking to do something on the history and evolution of Wikipedia. What I'm wondering is, is it possible to build a scraper or similar tool to go over every wikipedia article and note when it was first made and the category it belongs to. This way I'm hoping to come up with a dataset that shows not only how quickly wikipedia expanded, but which subject areas grew first and quickest, and which have the most articles.

I'm not code-literate; I'm looking for someone who knows a reliable way to do this, and who is prepared to sit and carry out the data-gathering - for which we are prepared to pay. For more info, email me at [email protected]


asked 07 Feb '11, 17:09

Chris Hall

I've started a database dump containing the data you are looking for on the Wikimedia Toolserver. It should be done tomorrow, unless the server gods smite it for using too many resources. I'll post the URL if it goes through. You'll have to normalize the categories yourself, though.


answered 07 Feb '11, 22:33

Magnus Manske

edited 07 Feb '11, 22:33

Amazing! Any pointers to a README on how to play around with that dump (field headings, SQL queries that would list page creation dates, etc.)?

(07 Feb '11, 22:45) rgrp ♦♦

OK, data is here: http://toolserver.org/~magnus/data/en_created_cats.tab.gz


The SQL query was:

    SELECT page_title,
           MIN(rev_timestamp) AS created,
           GROUP_CONCAT(DISTINCT cl_to)
    FROM page, revision, categorylinks
    WHERE page_id = rev_page
      AND page_namespace = 0
      AND page_is_redirect = 0
      AND cl_from = page_id
    GROUP BY page_id

(08 Feb '11, 09:14) Magnus Manske


Thanks for helping with my request. I'm not sure I fully understand the database you've created, but it looks like it's on the right lines. How do you read the 'date created' column, e.g. 20010803163502? Also, this file contains details for some 65,000 pages; what does that set represent? Recently updated pages, a random subset, or something else?

Finally, the categories data is useful, but there are just too many categories. Is it possible to categorise pages according to Wiki's contents divisions? (http://en.wikipedia.org/wiki/Portal:Contents)

Thanks again,


(08 Feb '11, 11:29) Chris Hall

The date format is year, month, day, hour, minute, second (two digits each, except the four-digit year).
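A minimal sketch of parsing that timestamp format with Python's standard library (the variable names are illustrative):

```python
from datetime import datetime

# MediaWiki stores revision timestamps as YYYYMMDDHHMMSS strings.
ts = "20010803163502"
created = datetime.strptime(ts, "%Y%m%d%H%M%S")
print(created.isoformat())  # 2001-08-03T16:35:02
```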

The file I linked to contains data on all 3.5 million articles on en.wikipedia. If you only see 65,000, your software (Excel?) has truncated them; older Excel versions cap worksheets at 65,536 rows.
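Rather than opening the dump in a spreadsheet, the gzipped tab-separated file can be streamed row by row. A hedged sketch, assuming the three columns are title, creation timestamp, and categories (the filename comes from the link above; the column layout is an assumption):

```python
import csv
import gzip

def count_rows(path):
    """Stream a gzipped tab-separated dump and count its rows."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # Assumed layout: title, earliest revision timestamp, categories.
        for title, created, categories in csv.reader(f, delimiter="\t"):
            count += 1
    return count

# count_rows("en_created_cats.tab.gz") should report millions of rows,
# not the 65,536-row ceiling that old Excel imposes.
```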

Backtracking categories to the "divisions" is not technically trivial; also, there is no guarantee that an article will fall into any of the divisions, and it might fall into more than one. I might have a look at that later today, but no promises (/me eyes stack of urgent to-do things on desk).

(08 Feb '11, 11:46) Magnus Manske

Ah, thanks, I thought the date format was something like that. You're right, Excel is truncating the data. Re: categories, any help is much appreciated, but I understand you've got more important things to do.


(08 Feb '11, 14:33) Chris Hall

Chris: if Magnus is busy (he has already done an amazing job for you!) I can help out with the data-wrangling here (Excel really isn't going to cut it given the size of the data).

(08 Feb '11, 15:04) rgrp ♦♦

There you go: http://toolserver.org/~magnus/data/New_articles_by_topic.xlsx

I have summarized articles by day, so the whole thing is only 1 MB and won't be eaten by Excel. Each article is counted only once, even if it tracks to several categories (in those cases, I counted it for the "majority root" of the individual category trees). Most articles (>3.4M) could be tracked to such a "root" topic, but a few had no discernible category associated.
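The per-day summary described above can be reproduced in outline. A hedged sketch, where the rows and the article-to-root-topic mapping are purely illustrative, not the actual pipeline used:

```python
from collections import Counter

# Illustrative rows: (title, MediaWiki timestamp, resolved root topic).
articles = [
    ("History of computing", "20010803163502", "Technology"),
    ("Battle of Hastings",   "20010915120000", "History"),
    ("Python (language)",    "20010915143000", "Technology"),
]

# Count each article exactly once, keyed by creation day and root topic;
# the first 8 characters of the timestamp are YYYYMMDD.
daily = Counter((ts[:8], root) for _, ts, root in articles)
print(daily[("20010915", "Technology")])  # 1
```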

(08 Feb '11, 21:08) Magnus Manske


Thanks, but I can't download the file. Your link just displays random characters. As it's not a large file, you could try emailing it to me if that's easier? Really appreciate your help here.

@rgrp thanks, will let you know if I need help with the data once I get it.

(09 Feb '11, 10:36) Chris Hall

@Chris, I was talking about processing the raw data (e.g. the 600 MB tab-separated dump) -- I don't tend to use Excel ;)

(09 Feb '11, 16:13) rgrp ♦♦

Hi Christopher,

To answer your first question, yes, it is possible, with a few exceptions.

The data dumps required to answer the question can be found at http://dumps.wikimedia.org/enwiki/20110115/

One thing to consider, however: the dumps do not contain the version histories of articles that were created and have since been deleted.


answered 07 Feb '11, 22:02


Mathias Schindler

edited 07 Feb '11, 22:13

rgrp ♦♦


Thanks for your help. So, just to be clear: if I wanted the article title, date of creation and category info for every English Wikipedia page, I would need to download from here? http://dumps.wikimedia.org/enwiki/20110115/

I'm only interested in what exists on wikipedia now, so if a page was created and deleted, that's not a problem if it doesn't show up in the data.


(08 Feb '11, 11:33) Chris Hall

As detailed on the CKAN Wikipedia entry, Wikipedia does provide database dumps, and you would definitely want to use these rather than scraping the Wikipedia site (which is forbidden!). The dumps are very large, but you only need the metadata (when a page was first created, when edits were made, its categories, etc.), so the problem is rather more tractable.
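As a sketch of what that metadata extraction might look like: the history dumps use the MediaWiki XML export format, in which each `<page>` element holds a `<title>` and one `<revision>` per edit, each with an ISO-format `<timestamp>`. Assuming that structure (the filename below is illustrative), the earliest revision per page can be streamed out without loading the file into memory:

```python
import bz2
import xml.etree.ElementTree as ET

def localname(tag):
    """Strip the XML namespace from an element's tag name."""
    return tag.rsplit("}", 1)[-1]

def first_revisions(path):
    """Yield (title, earliest revision timestamp) for each page in a
    bzip2-compressed MediaWiki XML export."""
    title, earliest = None, None
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            tag = localname(elem.tag)
            if tag == "title":
                title = elem.text
            elif tag == "timestamp":
                # ISO timestamps sort correctly as plain strings.
                if earliest is None or elem.text < earliest:
                    earliest = elem.text
            elif tag == "page":
                yield title, earliest
                title, earliest = None, None
                elem.clear()  # free memory while streaming

# Usage (filename illustrative):
# for title, created in first_revisions("enwiki-stub-meta-history.xml.bz2"):
#     print(title, created)
```

Because the timestamps are ISO 8601 strings, a plain string comparison is enough to find the earliest revision, so no date parsing is needed during the scan.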

I'll try to update later with some more details about how you would get that data out ...

I'm just going to fold my efforts into Magnus and Mathias' great contributions.


answered 07 Feb '11, 22:11

rgrp ♦♦

edited 08 Feb '11, 15:01




Asked: 07 Feb '11, 17:09

Seen: 1,613 times

Last updated: 20 Jul, 10:52

powered by OSQA