Automatic Speech Recognition — In 2 mins Read

Gal Hever
2 min read · Apr 23, 2021

This blog post explains the concept of ASR and the main difficulties of the task. If you want to learn how to work with an ASR algorithm in practice, you can also read my previous blog post, Getting Started with NVIDIA NeMo ASR.

Conversational AI

So, if we want to understand what ASR is, let’s first understand what Conversational AI is. This term refers to technologies that humans can interact and communicate with. These technologies use algorithms that try to imitate human interactions; for example, they recognize speech and text inputs and translate their meanings across various languages.

ASR Task

Automatic Speech Recognition (ASR), also known as speech recognition or speech-to-text (STT), is a subtopic of Conversational AI that refers specifically to technologies that enable computers to recognize spoken language and convert it into text. The main task is to transcribe a sound file into a text file.

ASR-Complete — Why?

There are a few reasons why ASR is considered a hard task. We will go over them one by one.

The input is undefined

The first problem is that the input size can change, i.e., sound clips that carry the same information can have different lengths. For example, you can say the name "Daniel" in one second, or you can say "Da-ni-el" slowly and take five seconds. Compare this with images: in a typical image dataset, two different pictures of dogs are stored at (or can easily be resized to) the same number of pixels.
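To make the length problem concrete, here is a minimal sketch in plain Python. The 16 kHz sample rate and the helper names `num_samples` and `pad_to_same_length` are assumptions chosen for illustration, not part of any specific ASR toolkit. It shows that the same word spoken at two speeds yields very different numbers of samples, and that one simple (though wasteful) workaround is to zero-pad the shorter clip.

```python
SAMPLE_RATE = 16_000  # samples per second; an assumed rate, common for speech


def num_samples(duration_seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of raw audio samples in a clip of the given duration."""
    return int(round(duration_seconds * sample_rate))


def pad_to_same_length(clips: list[list[float]]) -> list[list[float]]:
    """Zero-pad every clip on the right to the length of the longest one."""
    longest = max(len(clip) for clip in clips)
    return [clip + [0.0] * (longest - len(clip)) for clip in clips]


# Placeholder clips (silence); only their lengths matter for this illustration.
fast = [0.0] * num_samples(1.0)  # "Daniel" said in one second
slow = [0.0] * num_samples(5.0)  # "Da-ni-el" said in five seconds

print(len(fast), len(slow))  # 16000 80000

batch = pad_to_same_length([fast, slow])
print([len(clip) for clip in batch])  # [80000, 80000]
```

Padding only makes the lengths equal; the model still has to learn that both clips map to the same text.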

Noisy Data

If we record a speech segment in which several people talk at the same time, cars pass in the background and a baby is crying, we will have to separate the speech from the noise. This separation is not easy and can require a lot of effort and processing time.
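One way to get a feel for noisy data is to mix a clean signal with noise at a chosen signal-to-noise ratio (SNR). The sketch below is only an illustration under simple assumptions: the "speech" is a synthetic sine tone, the noise is Gaussian, and the helpers `power`, `mix_at_snr` and `measure_snr_db` are hypothetical names written for this example.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add it."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, mixture):
    """SNR in dB, computed from the clean signal and the noisy mixture."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(power(speech) / power(residual))


random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16_000) for t in range(16_000)]  # stand-in for speech
noise = [random.gauss(0.0, 1.0) for _ in range(16_000)]                     # stand-in for background noise

noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(round(measure_snr_db(speech, noisy), 2))  # 5.0
```

Here the SNR is easy to measure because the clean signal is known. In a real recording only the mixture is available, which is what makes the separation hard.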

The sampling system affects the sound

Changing the microphone can also be critical. Once we replace a microphone, we will have to recalibrate everything, as the new microphone may act as a filter that picks up the acoustics differently (it emphasizes certain frequencies and attenuates others).
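The filtering effect can be sketched with a toy model. Below, a hypothetical "microphone" is modeled as a simple one-pole low-pass filter; real microphones have far more complex frequency responses, so the filter, the `alpha` value and the function names are assumptions for illustration only. The same filter barely changes a low tone but strongly attenuates a high one.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def tone(frequency_hz, seconds=1.0):
    """A pure sine tone of the given frequency."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE) for t in range(n)]


def toy_microphone(signal, alpha=0.3):
    """One-pole low-pass filter: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


def rms(signal):
    """Root-mean-square amplitude."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


for freq in (200, 4000):
    clean = tone(freq)
    recorded = toy_microphone(clean)
    gain = rms(recorded) / rms(clean)
    print(f"{freq} Hz tone keeps {gain:.0%} of its amplitude")
```

A model trained on audio from one microphone sees a differently shaped spectrum when the microphone changes, even if the speaker and the words are identical.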

The distance from the sampling system affects the data

The way we speak into the microphone is also critical. Whether we talk directly into the microphone, move away from it, or tilt our head to the side changes the recorded signal and therefore how the data will be processed.
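As a rough sketch of the distance effect only, the snippet below applies the inverse distance law, under which sound pressure amplitude in a free field falls off in proportion to 1/distance. It deliberately ignores room reverberation and the microphone's directionality, and the 0.1 m reference distance and function names are assumptions for illustration.

```python
import math


def attenuate(signal, distance_m, reference_m=0.1):
    """Scale amplitudes by reference_m / distance_m (free-field inverse distance law)."""
    gain = reference_m / distance_m
    return [gain * x for x in signal]


def level_change_db(distance_m, reference_m=0.1):
    """Change in level, in dB, relative to speaking at the reference distance."""
    return 20 * math.log10(reference_m / distance_m)


print(round(level_change_db(0.2), 2))  # -6.02  (doubling the distance)
print(round(level_change_db(1.0), 2))  # -20.0  (ten times the distance)

close_up = [0.5, -0.5, 0.25]
print(attenuate(close_up, distance_m=0.2))  # [0.25, -0.25, 0.125]
```

As the speech gets quieter with distance while the background noise stays the same, the signal-to-noise ratio drops as well.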

End Notes

In the future I will write a few more blog posts that explain in more detail how ASR works in practice. Meanwhile, you can continue reading about the NeMo Toolkit in Different Languages.
