Some context

"Tring, tring",... your phone is ringing. You pick it, and hear - "Hello, what's up?". Often, this sound, pouring into your ears, will inform you about the gender of the talker, the identity, and the emotion. Isn't this fascinating? With only 2-3 secs of sound signal, the human brain has estimated so much about the talker. As the conversation progresses, you will be able to even estimate the personality, and the health status of the talker. Speech signal is immensely rich in information, and our brain is trained, since our childhood, to extract a lot from it.

Speech machinery

Let's see how this fascinating sound signal, speech, is produced. You inhale, and air fills your lungs. When you speak, a sequence of coordinated mechanical processes is initiated: the lungs push the air out, the vocal tract muscles flex, and the result is the release of air pressure from the mouth and nose. If you don't believe this, just pause, take a deep breath, and read the previous line aloud. Didn't you exhale while reading it? The coordinated mechanical processes involved in speaking are a miracle. To give some context, human beings are the only species on earth that can produce the diverse range of sounds making up our vocal communication repertoire. Biology suggests that part of the reason may be linked to the FOXP2 gene. The gene itself is found in many animals, but the human version carries changes that have been associated with speech and language. Every human is unique, and this uniqueness is also reflected in the speech signal, which is why we can easily recognize the voices of so many people.

Recording speech

Satisfying the human curiosity to record speech signals was a challenge. How do you store speech sounds? Speak into a jar, close the lid, open it again, and bingo, you hear it! Sorry, physics won't allow this to happen. In the late 19th century, the phonograph was invented, and the beautiful mind behind it was Thomas Edison. As time progressed, it became popular for recording music, and led to the gramophone, magnetic tape cassettes, and compact discs. Now we just store sound on solid-state semiconductor drives without worrying about how! Technology has been a blessing when you consider how seamlessly we capture, store, process, and play back speech, music, and images.

Processing speech

Cool! We are able to record sound signals. Let's move on to processing them. Can we design machines that extract information from sound signals? For instance, can machines perform speech recognition, speaker recognition, emotion recognition, and more? The list goes on. Our brains do all this, so it shouldn't be impossible to design machines that do the same, and go beyond it too! Well, automatic speech recognition is now a reality, accessible on our mobile phones and laptops. Challenges remain when the speech recording is noisy, accented, not in English, or has multiple talkers. For a single talker speaking English with an American or British accent in a clean recording, these systems work quite well. Talker recognition performs similarly. What is the key behind this technology? Welcome to the world of signal processing and machine learning!