A Brief History of ASR: Automatic Speech Recognition Part 2

SnatchBot Team, 28/08/2018

As SnatchBot charges through the milestones in our drive towards intelligent, conversational AI by adding text-to-voice to our chatbots, we take a brief look back over the journey towards automatic speech recognition.

Part 2: How did we get from faltering beginnings to today’s rapid progress?

A key turning point came with the popularization of Hidden Markov Models (HMMs) in the mid-1980s. This approach represented a significant shift from simple pattern recognition methods, based on templates and a spectral distance measure, to a statistical method for speech processing, which translated to a leap forward in accuracy.

A large part of the improvement in speech recognition systems since the late 1960s is due to the power of this statistical approach, coupled with the advances in computer technology necessary to implement HMMs.
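
To make the statistical shift concrete, here is a minimal sketch (in Python with NumPy) of the HMM forward algorithm, the core computation an HMM-based recognizer uses to score how likely an observed acoustic sequence is under a given model. The two-state model and toy observation symbols below are invented purely for illustration; they are not drawn from any real acoustic model.

```python
import numpy as np

# Illustrative toy HMM (assumed values, not a real acoustic model).
# Transition probabilities: A[i, j] = P(state j at t+1 | state i at t)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Emission probabilities: B[i, k] = P(observing symbol k | state i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

# Initial state distribution
pi = np.array([0.6, 0.4])

# A toy sequence of discrete acoustic symbols (e.g. quantized spectral frames)
observations = [0, 2, 1, 2]

def forward_likelihood(A, B, pi, obs):
    """Return P(obs | model) via the forward algorithm."""
    # Probability of starting in each state and emitting the first symbol
    alpha = pi * B[:, obs[0]]
    for symbol in obs[1:]:
        # Propagate probability mass through the transitions, then emit
        alpha = (alpha @ A) * B[:, symbol]
    return alpha.sum()

print(forward_likelihood(A, B, pi, observations))
```

In a real recognizer the model whose forward likelihood scores highest for the incoming audio wins, which is exactly the kind of purely statistical comparison, with no notion of meaning, that replaced template matching.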

HMMs took the industry by storm, but they were no overnight success. Jim Baker first applied them to speech recognition in the early 1970s at CMU, and the models themselves had been described by Leonard E. Baum in the ‘60s. It wasn’t until 1980, when Jack Ferguson gave a set of illuminating lectures at the Institute for Defense Analyses, that the technique began to disseminate more widely.

The success of HMMs validated the work of Frederick Jelinek at IBM’s Watson Research Center, who since the early 1970s had advocated for the use of statistical models to interpret speech, rather than trying to get computers to mimic the way humans digest language: through meaning, syntax, and grammar (a common approach at the time). As Jelinek later put it: ‘Airplanes don’t flap their wings.’

These data-driven approaches also facilitated progress that had as much to do with industry collaboration and accountability as individual eureka moments. With the increasing popularity of statistical models, the ASR field began coalescing around a suite of tests that would provide a standardized benchmark to compare against. This was further encouraged by the release of shared data sets: large corpora of data that researchers could use to train and test their models on.

In other words: finally, there was an (imperfect) way to measure and compare success.

November 1990, Infoworld

Consumer Availability — The ‘90s

For better and worse, the 90s introduced consumers to automatic speech recognition in a form we’d recognize today. Dragon Dictate launched in 1990 for a staggering $9,000, touting a dictionary of 80,000 words and features like natural language processing (see the Infoworld article above).

These tools were time-consuming (the article claims otherwise, but Dragon became known for prompting users to ‘train’ the dictation software to their own voice). They also required users to speak in a stilted manner: Dragon could initially recognize only 30–40 words a minute; people typically talk around four times faster than that.

But it worked well enough for Dragon to grow into a business with hundreds of employees and customers spanning healthcare, law, and more. By 1997 the company introduced Dragon NaturallySpeaking, which could capture words at a more fluid pace and, at $150, came with a much lower price tag.

Even so, there may have been as many grumbles as squeals of delight: to the degree that there is consumer skepticism around ASR today, some of the credit should go to the over-enthusiastic marketing of these early products. But without the efforts of industry pioneers James and Janet Baker (who founded Dragon Systems in 1982), the productization of ASR may have taken much longer.

November 1993, IEEE Communications Magazine

Whither Speech Recognition: The Sequel

Twenty-five years after the publication of J.R. Pierce’s letter criticizing ASR research, the IEEE published a follow-up titled Whither Speech Recognition: the Next 25 Years, authored by two senior employees of Bell Laboratories (the same institution where Pierce worked).

The follow-up surveys the state of the industry circa 1993 and serves as a sort of rebuttal to the pessimism of the original. Among its takeaways:

  • The key issue with Pierce’s letter was his assumption that in order for speech recognition to become useful, computers would need to comprehend what words mean. Given the technology of the time, this was completely infeasible.

  • In a sense, Pierce was right: by 1993, computers had only a meager understanding of language, and in 2018 they’re still notoriously bad at discerning meaning.

  • Pierce’s mistake lay in his failure to anticipate the myriad ways speech recognition can be useful, even when the computer doesn’t know what the words actually mean.

The Whither sequel ends with a prognosis, forecasting where ASR would head in the years after 1993. The section is couched in cheeky hedges (‘We confidently predict that at least one of these eight predictions will turn out to have been incorrect’), but it’s intriguing all the same. Among their eight predictions:

  • By the year 2000, more people will get remote information via voice dialogues than by typing commands on computer keyboards to access remote databases.

  • People will learn to modify their speech habits to use speech recognition devices, just as they have changed their speaking behavior to leave messages on answering machines. Even though they will learn how to use this technology, people will always complain about speech recognizers.

More recent developments in automatic speech recognition have seen neural networks play a starring role.

But neural networks are actually as old as most of the approaches described in this two-part blog: they were first introduced in the 1950s. It wasn’t until the computational power of the modern era (along with much larger data sets) became available that they changed the landscape.

Timeline via Juang & Rabiner

One major challenge in developing commercially successful ASR devices was that the models being used required gigabytes of memory, meaning they couldn’t simply be downloaded to your mobile device. With the advent of cloud-based solutions and broadband connection speeds, software platforms like ours can do all the hard work away from your device.

In the last three or four years, the rapid spread of ASR has come from the ability of millions of users with only modestly powered devices, but with good connections to the cloud, to interact with state-of-the-art software. It’s the same development that has allowed us at SnatchBot to provide NLP, Machine Learning, Text-to-Speech and all the features you need from a chatbot, without you having to know coding or download massive files.

Understanding the past reminds us how important these latest developments are and helps us appreciate that the era 2015–2025 is going to be remembered for bringing AI communications into everyday life across the entire planet.

This article is based on a collaboration with Descript.