Tuesday, October 08, 2013

Language stuff - corpus (pl. corpora)

A new(ish) tool for the systematic examination of the English language is the language corpus. A corpus is a collection of samples of the language gathered into a "body", which can then be examined using computer programs.

According to Wikipedia, the first corpus used for language investigation was the Brown Corpus. It consisted of around a million words, gathered from about 500 samples of American English, and was used in the preparation of the benchmark work Computational Analysis of Present-Day American English by Kucera and Francis. This was as recently as 1967.

The size of corpora has increased with increasing computer power. The British National Corpus currently contains 100 million words. The Oxford English Corpus - used by the makers of Oxford Dictionaries amongst others - contains 2 billion words of English. The Cambridge English Corpus is a "multi-billion word" corpus.

As well as holding the texts themselves, a corpus can have each word tagged for its part of speech - for example, whether "love" as it appears in a text is being used as a noun ("His love was so great...") or a verb ("I love you").

A corpus is examined with concordancing software. This searches for specific words, phrases or grammatical constructions, and can do things like highlight words that are frequently collocated (that is, words that tend to turn up next to each other). This can be used to identify patterns in the language that might otherwise go unnoticed.
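To give a feel for what a concordancer does, here is a minimal sketch in Python. It is not how AntConc or any other real tool is implemented; the function names (tokenize, concordance, collocates) and the sample sentence are made up for illustration, and the tokenising is deliberately crude.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

def collocates(tokens, node, span=2):
    """Count the words that occur within `span` tokens of each occurrence of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, standing in for a real corpus.
sample = "His love was so great. I love you. You love me and I love you."
tokens = tokenize(sample)

for line in concordance(tokens, "love", width=2):
    print(line)

# Words found immediately next to "love", most frequent first.
print(collocates(tokens, "love", span=1).most_common(3))
```

Real concordancers add a great deal on top of this (sorting the context lines, statistical measures of collocation strength, searching on part-of-speech tags), but the basic idea is the same: find every hit, then look at what surrounds it.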

Corpora can be produced from particular classes of text - for example, transcribed conversations, newspaper articles, academic journals, fiction. For E303, the Open University undergraduate course that introduces corpus linguistics, we were provided with a 4-million-word corpus, with a million words drawn from each of these classes.

It's also possible to produce your own corpus. I created a corpus of pop song lyrics - only 33,000 words or so, but still enough to look for trends and patterns of language use. There is software available that can tag a text with parts of speech - for example, CLAWS4. And for analysing the corpus, the AntConc concordancing software is freely available.
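When a home-made corpus is much smaller than the one you want to compare it with, raw word counts are not comparable, so frequencies are usually normalised (per thousand or per million words). A rough sketch of the idea in Python follows; the function names and the one-line "lyrics" text are invented for illustration, and this is not the method used on E303.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter of word tokens and the total number of tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens), len(tokens)

def per_thousand(count, total):
    """Normalise a raw count to a rate per thousand words."""
    return 1000 * count / total

# Invented stand-in for a small corpus of song lyrics.
lyrics = "love me do you know i love you"
counts, total = word_frequencies(lyrics)

for word, count in counts.most_common(3):
    print(word, count, per_thousand(count, total))
```

With normalised figures, a word's rate in a 33,000-word corpus can be set beside its rate in a corpus of millions of words, bearing in mind that rates from a small corpus are much less reliable for rarer words.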
