UBC Abacus Harvested Dataverse

BOLT Arabic Discussion Forums Aug 30, 2023 Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie; Ismael, Safa, 2018, "BOLT Arabic Discussion Forums", https://hdl.handle.net/11272.1/AB2/DP4INP, Linguistic Data Consortium BOLT Arabic Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retri... This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data.
2010 NIST Speaker Recognition Evaluation Test Set Aug 30, 2023 Greenberg, Craig; Martin, Alvin; Graff, David; Brandschain, Linda; Walker, Kevin, 2017, "2010 NIST Speaker Recognition Evaluation Test Set", https://hdl.handle.net/11272.1/AB2/2CPM3O, Linguistic Data Consortium Introduction 2010 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and speech recorded over a microphone channel involving an interview scenario used as test data in the NIST-spons...
AISHELL-1 Aug 30, 2023 Bu, Hui, 2018, "AISHELL-1", https://hdl.handle.net/11272.1/AB2/2WMDTT, Linguistic Data Consortium AISHELL-1 was developed by Beijing Shell Shell Technology Co., Ltd. It contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts. The goal of the collection was to support speech recognition system development in 11 domains, including smart homes, aut...
RATS Speech Activity Detection Aug 26, 2023 Walker, Kevin; Ma, Xiaoyi; Graff, David; Strassel, Stephanie; Sessa, Stephanie; Jones, Karen, 2015, "RATS Speech Activity Detection", https://hdl.handle.net/11272.1/AB2/1UISJ7, Linguistic Data Consortium Introduction RATS Speech Activity Detection was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and ini...
The Subglottal Resonances Database Aug 26, 2023 Alwan, Abeer; Lulich, Steven; Sommers, Mitchell, 2015, "The Subglottal Resonances Database", https://hdl.handle.net/11272.1/AB2/R82KKG, Linguistic Data Consortium Introduction The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age. The subglottal system is compose...
MyST Children's Conversational Speech Aug 19, 2023 Pradhan, Sameer; Cole, Ronald Allan; Ward, Wayne, 2021, "MyST Children's Conversational Speech", https://doi.org/10.35111/CYXY-P432 Abstract Introduction MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It is comprised of approximately 470 hours of English speech from 1371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data...
MASRI Synthetic Aug 19, 2023 Hernández Mena, Carlos Daniel; Gatt, Albert; Borg, Claudia; DeMarco, Andrea; van der Plas, Lonneke, 2022, "MASRI Synthetic", https://doi.org/10.35111/WC8H-H752 Abstract Introduction MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and consists of approximately 99 hours of synthesized Maltese speech. Data Source sentences were extracted from the Maltese Language Resource Server (MLRS) corpus, comprised of written or transcribed Maltese cove...
Samrómur Icelandic Speech 1.0 Aug 18, 2023 Mollberg, David; Jónsson, Ólafur Helgi; Þorsteinsdóttir, Sunneva; Guðmundsdóttir, Jóhanna Vigdís; Steingrimsson, Steinthor; Magnusdottir, Eydis Huld; Fong, Judy; Borsky, Michal; Gudnason, Jon, 2022, "Samrómur Icelandic Speech 1.0", https://doi.org/10.35111/THX3-F170 Abstract Introduction Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances. This version 1.0 is equivalent to "Samrómur Icelandic...
Althingi Parliamentary Speech Aug 18, 2023 Helgadóttir, Inga Rún; Kjaran, Róbert; Nikulásdóttir, Anna Björk; Gudnason, Jon, 2021, "Althingi Parliamentary Speech", https://doi.org/10.35111/695B-6697 Abstract Introduction Althingi Parliamentary Speech consists of approximately 542 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary and two language models. Speeches date from 2005-2016. This dataset was collected in 2016 by the ASR for Althingi project at Reykjavik Un...
LORELEI Indonesian Representative Language Pack Aug 18, 2023 Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Griffitt, Kira; Delgado, Dana; Arrigo, Michael, 2023, "LORELEI Indonesian Representative Language Pack", https://doi.org/10.35111/6GWW-XC16 Abstract Introduction LORELEI Indonesian Representative Language Pack consists of Indonesian monolingual text, Indonesian-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium (LDC) for the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) p...