For the moment, you can ignore the details and just concentrate on the output.

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words.

Some languages have no established writing system, or are endangered. (See 7 for suggestions on how to locate language resources.)

We have seen a variety of corpus structures so far; these are summarized in 1.3.

The first handful of words in each of these texts are the titles, which by convention are stored as upper case.

In 1, we looked at the Inaugural Address Corpus, but treated it as a single text.

Don't worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and, if you're game, modify it by substituting some part of the code with a different text or word. This way you will associate a task with a programming idiom, and learn the hows and whys later.

This corpus contains text from 500 sources, and the sources have been categorized by genre.

Next, we need to obtain counts for each genre of interest.
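As a rough illustration of obtaining counts for each genre, the sketch below tallies a few target words per genre using only the standard library. The genre names, the token lists, and the helper `counts_by_genre` are invented stand-ins for this example; they are not real corpus text and not NLTK's own API.

```python
from collections import Counter

# Hypothetical toy data: genre -> list of tokens (not real corpus text).
toy_corpus = {
    "news": "the jury said it could not say more".split(),
    "romance": "she could not say what she would do".split(),
}

def counts_by_genre(corpus, targets):
    """Return {genre: Counter} counting only the given target words."""
    targets = {t.lower() for t in targets}
    result = {}
    for genre, tokens in corpus.items():
        # Normalize case so that "Could" and "could" are counted together.
        result[genre] = Counter(
            t.lower() for t in tokens if t.lower() in targets
        )
    return result

genre_counts = counts_by_genre(toy_corpus, ["could", "would", "say"])
for genre in sorted(genre_counts):
    print(genre, dict(genre_counts[genre]))
```

With a real categorized corpus, the token lists would come from the corpus reader instead of hand-written strings, but the counting step would look much the same.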

As just mentioned, a text corpus is a large body of text.

Many corpora are designed to contain a careful balance of material in one or more genres.

Common Structures for Text Corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories like genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus).
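One way to picture overlapping categories is as a mapping from each document to several labels, which can then be inverted to find every document under a given label. The sketch below assumes made-up file identifiers and topic names, and the helper `invert` is hypothetical; it only illustrates the structure, not how any particular corpus stores it.

```python
from collections import defaultdict

# Hypothetical fileids and topic labels, invented for illustration.
doc_topics = {
    "doc1": ["grain", "wheat"],
    "doc2": ["trade"],
    "doc3": ["grain", "trade"],
}

def invert(doc_to_cats):
    """Turn {fileid: [categories]} into {category: [fileids]}."""
    cat_to_docs = defaultdict(list)
    for fileid in sorted(doc_to_cats):
        for cat in doc_to_cats[fileid]:
            cat_to_docs[cat].append(fileid)
    return dict(cat_to_docs)

topic_docs = invert(doc_topics)
print(topic_docs["grain"])  # documents labeled with the "grain" topic
print(topic_docs["trade"])  # "doc3" appears under both topics
```

In a non-overlapping scheme such as genre, each document would carry exactly one label, so the inner lists in `doc_topics` would all have length one.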

NLTK's corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora.

For convenience, the corpus methods accept a single fileid or a list of fileids.
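The convenience of accepting either a single fileid or a list of fileids can be sketched with a small stand-in function. Everything here is an assumption made for illustration: `TOY_FILES` is an invented in-memory corpus, and this `words` function is a toy, not NLTK's implementation.

```python
# Hypothetical toy corpus: fileid -> raw text (invented for illustration).
TOY_FILES = {
    "a.txt": "the cat sat",
    "b.txt": "the dog ran",
    "c.txt": "a bird sang",
}

def words(fileids=None):
    """Return tokens from one fileid, a list of fileids, or every file."""
    if fileids is None:
        # No argument: use the whole corpus in a stable order.
        fileids = sorted(TOY_FILES)
    elif isinstance(fileids, str):
        # A single fileid is wrapped so both cases share one code path.
        fileids = [fileids]
    tokens = []
    for fileid in fileids:
        tokens.extend(TOY_FILES[fileid].split())
    return tokens

print(words("a.txt"))
print(words(["a.txt", "b.txt"]))
print(len(words()))
```

Wrapping the single string in a list early keeps the rest of the function free of special cases, which is the main appeal of this calling convention.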

