I am looking for datasets, mainly natural language text corpora, that have been redacted by experts or an authority. So far I have found only the Enron dataset, but it appears that only a few specific things (such as employee email IDs) were redacted, which will not be much use for feature extraction. Something like declassified government records or company data would be awesome; medical records would probably be easier to get, but they are quite unrelated to the problem. Any idea whether such a dataset is available?

Copying in your text from the crosspost; hopefully that will help answer your query:

Q. How do you want to use the data? A. @John Salvatier the goal is redacted :D ... actually, I want to do some analysis of what kind of information generally gets redacted in free-form text documents. In other words, what sorts of information do people consider sensitive, and is it possible to use machine learning algorithms to aid automatic redaction? I read some time back that Xerox had something like this, but there's no more info on how that can be done. I have an idea I want to try out, but no data ...

We do not have redacted data (if I understand what you are looking for), but we have a couple of corpora available that might serve another purpose. They include some Enron data, gov't docs, etc.:

(1) Open American National Corpus (OANC): 15 million words of text across a variety of genres, including government docs, medical records, etc. These come with lots of linguistic annotations, but you could ignore these and pull out only the files with a .txt extension. http://www.anc.org/OANC

(2) A smaller 500K-word corpus (MASC) across 20 genres (25K words each), also partly with lots of linguistic annotation. The text-only data can be downloaded at http://www.anc.org/MASC/Download.html (look for "Full MASC data only").

I also have a small 100K corpus of declassified documents from the FDR Library, consisting of internal administration correspondence concerning Japanese-American relations in the six months prior to Pearl Harbor. I do not believe that anything has been redacted in these, though.
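
For the suggestion in (1) of ignoring the annotations and pulling out only the .txt files, a minimal Python sketch might look like the following. It assumes the corpus archive has already been downloaded and unpacked into a local directory; the function names and the UTF-8 default are illustrative assumptions, not part of any OANC or MASC tooling, so check the encoding against the files you actually get.

```python
from pathlib import Path


def collect_text_files(corpus_root):
    """Return a sorted list of every .txt file under corpus_root.

    Files with any other extension (for example annotation files)
    are skipped.
    """
    root = Path(corpus_root)
    return sorted(p for p in root.rglob("*.txt") if p.is_file())


def read_corpus(corpus_root, encoding="utf-8"):
    """Yield (path, text) pairs for each plain-text file in the corpus.

    The encoding is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    for path in collect_text_files(corpus_root):
        yield path, path.read_text(encoding=encoding, errors="replace")
```

Calling read_corpus on the unpacked directory then gives the raw text of each document, which can be fed to whatever feature extraction you have in mind.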

Best, Nancy Ide


