I am looking for datasets, mainly natural-language text corpora, that have been redacted by experts or an authority. So far I have found only the Enron dataset, but it appears that only a few specific things (such as employee email IDs) were redacted, which will not be useful for feature extraction. Something like declassified government records or company data would be awesome; medical records would probably be comparatively easier to get, but are quite unrelated to the problem. Any idea whether such a dataset is available?

[crossposted with stats:http://stats.stackexchange.com/q/9148/2192]

This question is marked "community wiki".

asked 04 Apr '11, 05:15

tathagata


edited 17 Apr '11, 19:44

lucychambers ♦♦


Copying in your text from crosspost, hopefully that will help solve your query:

Q. How do you want to use the data? A. @John Salvatier the goal is redacted :D ... actually I want to do some analysis of what kind of information generally gets redacted in free-form text documents. In other words, what sorts of information do people consider sensitive, and is it possible to use machine learning algorithms to aid automatic redaction? I read some time back that Xerox had something like this, but there is no more info on how that can be done. I have some ideas I want to try out, but no data ...

(17 Apr '11, 19:47) lucychambers ♦♦


We do not have redacted data (if I understand what you are looking for), but we have a couple of corpora available that might serve another purpose. They include some Enron data, gov't docs, etc.:

(1) Open American National Corpus (OANC): 15 million words of text across a variety of genres, including government docs and medical records etc. These come with lots of linguistic annotations but you could ignore these and pull out only the files with a txt extension. http://www.anc.org/OANC

(2) A smaller 500K corpus (MASC) across 20 genres (25K each), also partly with lots of linguistic annotation. The text-only data can be downloaded at http://www.anc.org/MASC/Download.html (look for Full MASC data only).
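
For anyone who wants to ignore the annotations and pull out only the files with a txt extension, here is a minimal sketch in Python. It assumes the corpus archive has already been downloaded and unpacked into a local directory; the directory name `OANC` in the usage comment and the function names `collect_text_files` and `read_corpus` are hypothetical, and the UTF-8 encoding is an assumption rather than something stated about either corpus.

```python
from pathlib import Path


def collect_text_files(corpus_root):
    """Return a sorted list of all .txt files found anywhere under corpus_root.

    Files with any other extension (for example annotation files) are skipped.
    """
    root = Path(corpus_root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".txt"
    )


def read_corpus(corpus_root, encoding="utf-8"):
    """Yield (path, text) pairs for every plain-text file under corpus_root.

    The encoding is an assumption; undecodable bytes are replaced, not fatal.
    """
    for path in collect_text_files(corpus_root):
        yield path, path.read_text(encoding=encoding, errors="replace")


# Usage (hypothetical path to the unpacked download):
# for path, text in read_corpus("OANC"):
#     print(path, len(text.split()))
```

Matching on the extension alone keeps the sketch independent of how the archive is laid out internally, which I have not checked.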

I also have a small 100K corpus of declassified documents from the FDR Library, consisting of internal administration correspondence concerning Japanese-American relations in the 6 months prior to Pearl Harbor. I do not believe that anything has been redacted in these, though.

Best, Nancy Ide


answered 20 Apr '11, 16:29

nancyide






Last updated: 28 May, 10:16
