Part 1: Document Description

Citation
Title: Corpus of Contemporary American English (COCA)
Identification Number: doi:10.7910/DVN/AMUDUW
Distributor: Harvard Dataverse
Date of Distribution: 2015-11-09
Version: 2
Bibliographic Citation: Davies, Mark, 2015, "Corpus of Contemporary American English (COCA)", https://doi.org/10.7910/DVN/AMUDUW, Harvard Dataverse, V2
Citation

Title: Corpus of Contemporary American English (COCA)
Identification Number: doi:10.7910/DVN/AMUDUW
Authoring Entity: Davies, Mark (Brigham Young University)
Producer: Davies, Mark
Distributor: Harvard Dataverse
Access Authority: Jennie Murack
Depositor: McNeill, Katherine
Date of Deposit: 2015-10-06
Holdings Information: https://doi.org/10.7910/DVN/AMUDUW
Study Scope

Keywords: Arts and Humanities, Other, English language, Corpora (Linguistics), Computational linguistics
Abstract: The largest structured corpus of American English, composed of more than 450 million words in 189,431 texts, including 20 million words for each year from 1990-2012. The corpus is divided equally among spoken, fiction, popular magazine, newspaper, and academic texts.
Time Period: 1990-2012
Kind of Data: linguistic corpora
Notes: MIT affiliates should access this dataset by logging into Dataverse and selecting Massachusetts Institute of Technology. The various file formats contain the following types of data:
• Database files: a balanced collection of words pulled from fiction, popular magazines, newspapers, non-fiction books, and spoken-word sources.
• Lexicon: information on each wordID: word (e.g., walked), lemma (e.g., walk), and part of speech (e.g., vvd).
• Sources: genre or country, source, and title of each text.
• Text: provides a textID for each text, followed by the entire text on the same line, with no annotations. Note: in this format, words are not annotated for part of speech or lemma. In addition, contracted words like "can't" are separated into two parts (ca|n't), and punctuation is separated from words (eye level . As her).
• Word Lemma PoS (Part of Speech): tables that list each word, lemma, and part of speech in vertical format; these can be imported into a database (a loading sketch follows this list). Note: word, lemma, and PoS are the three parts included in the lexicon. "Word" is the actual word pulled from the text. "Lemma" is the basic core form that would be included as the headword in a dictionary; for example, if the word is "running", the lemma is "run". The PoS is the part of speech.
• subgenreCodes: a file explaining the codes used to identify sub-genres, or sub-categories, within each of the major categories of text. See http://corpus.byu.edu/coca/?f=texts_e.
More information on the files is available at:
• http://corpus.byu.edu/full-text/formats.asp
• http://corpus.byu.edu/full-text/database.asp
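Because the wlp tables are designed for database import, a rough Python sketch of such a load is given below. This is not part of the dataset documentation: the tab-separated column layout (last three columns being word, lemma, and PoS) and the latin-1 encoding are assumptions to verify against your copy of the files.

import csv
import sqlite3

def load_wlp(path, db_path="coca.db"):
    # Minimal sketch: load one extracted wlp file into SQLite.
    # ASSUMPTION: tab-separated lines whose last three columns are
    # word, lemma, and PoS; adjust if your files carry extra ID columns.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS wlp (word TEXT, lemma TEXT, pos TEXT)")
    n = 0
    with open(path, encoding="latin-1", newline="") as fh:  # encoding is an assumption
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) >= 3:
                con.execute("INSERT INTO wlp VALUES (?, ?, ?)", tuple(row[-3:]))
                n += 1
    con.commit()
    con.close()
    return n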
Methodology and Processing

Sources Statement

Data Sources: The corpus is composed of more than 450 million words in 189,431 texts, including 20 million words for each year from 1990-2012. Detailed information on sources is available at: http://corpus.byu.edu/coca/?f=texts_e. The main sources for each file type are as follows (a word-count check appears after this list):
• Spoken (95 million words [95,385,672]): transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). (See the notes on the naturalness and authenticity of the language from these transcripts.)
• Fiction (90 million words [90,344,134]): short stories and plays from literary magazines, children's magazines, and popular magazines; first chapters of first-edition books from 1990 to the present; and movie scripts.
• Popular Magazines (95 million words [95,564,706]): nearly 100 different magazines, with a good mix (overall, and by year) across specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.
• Newspapers (92 million words [91,680,966]): ten newspapers from across the US, including USA Today, New York Times, Atlanta Journal Constitution, and San Francisco Chronicle. In most cases there is a good mix between different sections of the newspaper, such as local news, opinion, sports, and financial.
• Academic Journals (91 million words [91,044,778]): nearly 100 different peer-reviewed journals, selected to cover the entire range of the Library of Congress classification system (e.g., a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year.
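As a quick arithmetic check (not part of the original record), the five per-genre totals quoted above sum to roughly 464 million words, consistent with the "more than 450 million words" figure:

# Sum the per-genre word counts quoted above.
genre_words = {
    "spoken": 95_385_672,
    "fiction": 90_344_134,
    "magazine": 95_564_706,
    "newspaper": 91_680_966,
    "academic": 91_044_778,
}
print(f"{sum(genre_words.values()):,}")  # 464,020,256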
Data Access

Notes: Licensed electronic resources are restricted to members of the MIT community and to the purposes of research, education, and scholarship. Under MIT's licenses for electronic resources, users generally may not:
- redistribute the materials or permit anyone other than a member of the MIT community to use them
- remove, obscure, or modify any copyright or other notices included in the materials
- use the materials for commercial purposes
Users are individually responsible for compliance with these terms. This data is restricted to members of the MIT community for educational, scholarly, and research purposes. In no case can the data be distributed beyond the MIT community, even in joint research with individuals at other institutions.
1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams and then distribute it to others, and you cannot copy 70,000 words from different texts and then place them on a website where users from outside your organization would have access to the data.
2. The data cannot be placed on a network (including the Internet) unless access to the data is limited (via restricted login, password, etc.) to those from the MIT community. In addition to the full-text data itself, this restriction also applies to derived frequency, collocate, n-gram, concordance, and similar data based on the corpus.
3. If portions of the derived data are made available to others, they cannot include substantial portions of the raw frequency of words (e.g., the word occurs 3,403 times in the corpus) or the rank order (e.g., it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in "frequency bands", e.g., words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application; see the sketch at the end of this section.)
4. Any publications or products that are based on the data should contain a reference to the source of the data: http://corpus.byu.edu/full-text.
5. Note that a small, unique change will be made to each set of data, and this will serve as a "fingerprint" to identify you as the source of the data. Automated Google searches are run daily to find copies of the data on the Web. If the data that was sent to you is found outside of your organization, you will make a reasonable effort to contact the administrators of that web page or website to have the data removed.
MIT affiliates should access this dataset by logging into Dataverse and selecting Massachusetts Institute of Technology.
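To illustrate the frequency-band allowance in item 3 above, here is a hedged Python sketch that exposes only a coarse band label rather than a raw count or exact rank. The band edges chosen below are illustrative, not prescribed by the license.

import bisect

# Illustrative band edges; keep the total number of bands well under ~20.
BAND_EDGES = [1_000, 3_000, 10_000, 30_000, 100_000]

def frequency_band(rank):
    """Return a coarse band label such as '1001-3000' for a frequency rank."""
    i = bisect.bisect_left(BAND_EDGES, rank)
    lo = BAND_EDGES[i - 1] + 1 if i > 0 else 1
    return f"{lo}-{BAND_EDGES[i]}" if i < len(BAND_EDGES) else f"{lo}+"

print(frequency_band(304))    # '1-1000'
print(frequency_band(2_500))  # '1001-3000'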
Other Study Description Materials

Related Studies

Corpus of Historical American English (COHA)
Label: coca-sources.txt
Text: Genre or country, source, and title for each text.
Notes: text/plain

Label: db_academic_rpe.zip
Text: Set of database files for academic sources; one file per year.
Notes: application/zip

Label: db_fiction_awq.zip
Text: Set of database files for fiction sources; one file per year.
Notes: application/zip

Label: db_magazine_qjg.zip
Text: Set of database files for magazine sources; one file per year.
Notes: application/zip

Label: db_newspaper_lsp.zip
Text: Set of database files for newspaper sources; one file per year.
Notes: application/zip

Label: db_spoken_kde.zip
Text: Set of database files for spoken sources; one file per year.
Notes: application/zip

Label: lexicon.txt
Text: Contains lexicon information for the data (see notes).
Notes: text/plain

Label: subgenreCodes.txt
Text: Explains the codes used to identify sub-genres, or sub-categories, within each of the major categories of text.
Notes: text/plain

Label: text_academic_rpe.zip
Text: Original text from academic sources; one file per year.
Notes: application/zip

Label: text_fiction_awq.zip
Text: Original text from fiction sources; one file per year.
Notes: application/zip

Label: text_magazine_qjg.zip
Text: Original text from magazine sources; one file per year.
Notes: application/zip

Label: text_newspaper_lsp.zip
Text: Original text from newspaper sources; one file per year.
Notes: application/zip

Label: text_spoken_kde.zip
Text: Original text from spoken sources; one file per year.
Notes: application/zip

Label: wlp_academic_rpe.zip
Text: Tables that list each word, lemma, and part of speech for academic sources.
Notes: application/zip

Label: wlp_fiction_awq.zip
Text: Tables that list each word, lemma, and part of speech for fiction sources.
Notes: application/zip

Label: wlp_magazine_qjg.zip
Text: Tables that list each word, lemma, and part of speech for magazine sources.
Notes: application/zip

Label: wlp_newspaper_lsp.zip
Text: Tables that list each word, lemma, and part of speech for newspaper sources.
Notes: application/zip

Label: wlp_spoken_kde.zip
Text: Tables that list each word, lemma, and part of speech for spoken sources.
Notes: application/zip
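Each zip archive above bundles one file per year (1990-2012) for its genre. As a hedged convenience sketch (the inner member names and encoding are assumptions; inspect namelist() on your copy first), the following Python iterates the members of one text_*.zip and counts its lines, each of which holds one textID plus the full text of that document.

import zipfile

def count_texts(zip_path):
    # Count lines across all per-year members of a text_*.zip archive;
    # in the text format, one line corresponds to one text.
    total = 0
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():  # typically one member per year
            with zf.open(name) as fh:
                total += sum(1 for _ in fh)
    return total

# Example usage (hypothetical local path):
# print(count_texts("text_academic_rpe.zip"))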