CSR-II (WSJ1) Sennheiser Discs 1 - 3 (doi:10.7910/DVN/OVXSNR)

Document Description

Citation

Title:

CSR-II (WSJ1) Sennheiser Discs 1 - 3

Identification Number:

doi:10.7910/DVN/OVXSNR

Distributor:

Harvard Dataverse

Date of Distribution:

2016-08-02

Version:

1

Bibliographic Citation:

LDC, 2016, "CSR-II (WSJ1) Sennheiser Discs 1 - 3", https://doi.org/10.7910/DVN/OVXSNR, Harvard Dataverse, V1

Study Description

Citation

Title:

CSR-II (WSJ1) Sennheiser Discs 1 - 3

Identification Number:

doi:10.7910/DVN/OVXSNR

Authoring Entity:

LDC

Distributor:

Harvard Dataverse

Depositor:

Cabanas, Jordi

Date of Deposit:

2016-06-27

Holdings Information:

https://doi.org/10.7910/DVN/OVXSNR

Study Scope

Keywords:

Social Sciences, LDC Catalog No.: LDC94S13B, ISBN: 1-58563-031-4

Abstract:

The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 conventional development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours (twice the roughly 81 hours of training and development test speech).

In early 1993, a Hub and Spoke test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or hub condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded Shorten compression algorithm developed at Cambridge University.
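Because the waveforms are stored as SPHERE files with embedded Shorten compression, it can help to inspect a file's header before trying to decode its audio. The following is a minimal Python sketch, not an official LDC or NIST tool. It assumes the usual NIST SPHERE layout: an ASCII header that starts with the line NIST_1A, followed by the header size in bytes, then "name -type value" lines, ending with end_head. The function name parse_sphere_header and the synthetic demo header are illustrative assumptions, not taken from the corpus. The sketch reads header fields only and does not decompress Shorten-encoded samples.

```python
import io


def parse_sphere_header(stream):
    """Return the header fields of a NIST SPHERE file as a dict.

    `stream` must be a seekable binary stream positioned at the file start.
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line holds the total header size in bytes (commonly 1024).
    header_size = int(stream.readline().strip())
    stream.seek(0)
    raw = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in raw.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_tag, value = parts
        if type_tag == "-i":      # integer field
            fields[name] = int(value)
        elif type_tag == "-r":    # real-valued field
            fields[name] = float(value)
        else:                     # string field, e.g. -s26
            fields[name] = value
    return fields


# Synthetic header for demonstration only (hypothetical values).
demo_header = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 1\n"
    b"sample_coding -s26 pcm,embedded-shorten-v1.09\n"
    b"end_head\n"
).ljust(1024, b"\n")

demo_fields = parse_sphere_header(io.BytesIO(demo_header))
print(demo_fields)
```

A sample_coding value that mentions "shorten" indicates that the audio samples must be decompressed with a SPHERE-aware tool before they can be read as plain PCM.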

Notes:

The CD-ROM labeled Evaluation Test Data, Part 1 (NIST Speech Disk 13-32.1) contains the file wsj1/doc/lng_modl/base_lm/tcb20onp.z (WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z on a Windows OS). Although this file has the .z extension, it is not a compressed file. To use the file, simply ignore the .z extension.
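One way to confirm that a file with a .z extension is not actually compressed is to look at its first two bytes, since the common Unix compression formats begin with well-known signatures. The Python sketch below is illustrative only. The helper names detect_compression and sniff_file are assumptions, and the check covers just the gzip, compress, and pack signatures, so a result of None means "no known signature found" rather than proof that the file is plain data.

```python
# Two-byte signatures of common Unix compression formats.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"\x1f\x9d": "compress",  # LZW, usually .Z
    b"\x1f\x1e": "pack",      # Huffman, usually .z
}


def detect_compression(first_bytes):
    """Return the format name for a known signature, or None."""
    return SIGNATURES.get(first_bytes[:2])


def sniff_file(path):
    """Read the first two bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return detect_compression(handle.read(2))
```

For example, calling sniff_file on a local copy of tcb20onp.z (the path depends on where the disc contents were copied) would be expected to return None if the file is uncompressed, as the note above states.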

Methodology and Processing

Sources Statement

Documentation and Access to Sources:

The files are too large to be provided directly on Dataverse. To access this data, please bring a Harvard University ID and a flash drive with 16 GB capacity to CGIS Knafel, Room 350, 1737 Cambridge St., Cambridge, MA 02138.

Data Access

Notes:

Datasets are restricted for use to Harvard University affiliates.

The files are too large to be provided directly on Dataverse. To access this data, please bring a Harvard University ID and a flash drive with 16 GB capacity to CGIS Knafel, Room 350, 1737 Cambridge St., Cambridge, MA 02138.

Other Study Description Materials