


To train a machine to recognize speech, you need a lot of audio samples. First, researchers have to collect thousands of voices, speaking on a range of topics. They then manually transcribe the audio clips. This combination of data - audio clips and written transcriptions - allows machines to make associations between sound and words. The phrases that occur most frequently become a pattern for an algorithm to learn how a human speaks.

But an AI can only recognize what it’s been trained to hear. Its flexibility depends on the diversity of the accents to which it’s been introduced.
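To make that pairing concrete, here is a minimal Python sketch of the idea: each audio clip is stored alongside its manual transcription, and a simple frequency count over the transcripts surfaces the phrases a learning algorithm would encounter most often. The `Utterance` class, file names, and example sentences are illustrative assumptions, not part of any real corpus or toolkit.

```python
# A minimal sketch (not any particular toolkit's API) of how a speech corpus
# pairs audio clips with human-made transcriptions, and how counting the
# phrases that occur most often yields the patterns an algorithm learns from.
# File names and utterances below are made up for illustration.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str   # path to a recorded clip (hypothetical files)
    transcript: str   # the manual transcription paired with that clip

# A toy "corpus": each audio clip is paired with its written transcription.
corpus = [
    Utterance("clips/spk01_001.wav", "i think the weather has been nice"),
    Utterance("clips/spk02_001.wav", "i think the game was really close"),
    Utterance("clips/spk03_001.wav", "the weather has been cold up here"),
]

# Count word bigrams across the transcripts: the most frequent phrases are
# the patterns a model would see most often during training.
bigram_counts = Counter()
for utt in corpus:
    words = utt.transcript.split()
    bigram_counts.update(zip(words, words[1:]))

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

A real corpus works the same way at a much larger scale, with thousands of speakers, a wide range of topics, and many hours of audio behind each transcription.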

Governments, academics, and smaller startups rely on collections of audio and transcriptions, called speech corpora, to bypass doing labor-intensive transcriptions themselves. The University of Pennsylvania’s Linguistic Data Consortium (LDC) is a powerhouse of these data sets, making them available under licensed agreements for companies and researchers. One of its most famous corpora is Switchboard.

Texas Instruments launched Switchboard in the early 1990s to build up a repository of voice data, which was then distributed by the LDC for machine learning programs.
