About the database

We have released a web application for data collection via crowdsourcing, so anyone with a mobile phone and internet connectivity can contribute to the dataset. The website can be accessed here. We target participants from the following three populations:

healthy: these are individuals with no respiratory illness
unhealthy: these are individuals with respiratory illness
COVID-19 positive: these are individuals identified as COVID-19 positive after an RT-PCR test

Metadata description

When a user opens the website, s/he is asked to fill out a short questionnaire, which helps us collect the metadata needed to categorize the user into one of the above three populations. The complete metadata comprises age, gender, location (country, state/province), current health status (healthy / exposed / cured / infected), and the presence of comorbidities (pre-existing medical conditions). We do not collect any personally identifiable information. Each user is assigned a unique anonymized ID during data storage. A screenshot of the Coswara webpage is provided below. Here, pages 1 and 2 correspond to the metadata collection.

Sound sample description

In the screenshot shown above, page 3 corresponds to the "audio sample collection" step. We collect audio samples corresponding to the nine categories shown in the figure below. These categories were chosen to capture sound signals that carry most of the attributes of the respiratory system associated with speech production. To understand how, let's take a small detour to the speech production system.

Art of speaking

The human speech production system draws contributions from the diaphragm, lungs, trachea, larynx, pharynx, tongue, nasal cavity, and lips. You may note that many of these organs are not solely dedicated to speech production. For example, we use the mouth for eating and the lungs for breathing! Speech and vocal sound production is an extra feat achieved, thanks to evolution, by these organs.

Lungs have elastic properties, as they are in some sense repurposed swim bladders. During normal respiration, the diaphragm and the intercostal muscles between the ribs work together to expand the lungs. The elastic recoil of the lungs then provides the force that expels air during expiration. This means that the alveolar air pressure falls slightly below atmospheric pressure when you inhale, drawing air in, and rises slightly above it when you exhale, pushing air out. Something different happens when you speak.

Speaking happens while exhaling. The alveolar air pressure is released gradually, in a coordinated manner, through the opening and closing of the vocal folds (also called vocal cords; the opening between them is the glottis). Something interesting happens here. For voiced sounds, such as vowels, the vocal folds open and close periodically, and the rate of this opening and closing imparts periodicity to the output sound pressure wave. This periodicity is one of the most easily perceived attributes of speech and is referred to as the pitch of the speaker. You may have noticed that male speakers usually have a lower pitch than female speakers, and female speakers a lower pitch than children. Why so? This is related to the mass of the vocal folds: heavier folds vibrate more slowly, giving a lower pitch, and male vocal folds are typically more massive. That does not mean you cannot change your pitch. You can, by altering the tension of the vocal folds, and we often do this when we want to emphasize something in our speech.

Another attribute of speech that we perceive quite easily is loudness. An increase in airflow from the lungs blows the vocal folds wider apart, increasing the strength of the output pressure wave and making the sound louder. Voiced sounds are just one category of speech sounds. For unvoiced sounds, such as fricatives, the vocal folds remain open, and for stop consonants the vocal folds remain closed. These sounds do not have any perceived pitch associated with them. The figure below shows a schematic of the human speech production system.
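To make the link between vocal fold periodicity and pitch concrete, here is a minimal sketch that estimates pitch from a periodic waveform by finding the lag at which the signal best matches a shifted copy of itself (autocorrelation). This is only an illustration and not the method used for this dataset: the function names, the 8 kHz sampling rate, the 60-500 Hz search range, and the synthetic 100 Hz and 200 Hz "voices" are all assumptions chosen for the example.

```python
import math

def estimate_pitch_hz(signal, sample_rate, fmin=60.0, fmax=500.0):
    """Estimate pitch as sample_rate / (lag with the largest autocorrelation)."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def synthetic_voiced(f0, sample_rate, duration=0.1):
    """Toy periodic signal: a fundamental at f0 plus a weaker second harmonic."""
    n_samples = int(sample_rate * duration)
    return [
        math.sin(2 * math.pi * f0 * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
        for n in range(n_samples)
    ]

sample_rate = 8000  # Hz, assumed for this toy example
low_pitch = estimate_pitch_hz(synthetic_voiced(100.0, sample_rate), sample_rate)
high_pitch = estimate_pitch_hz(synthetic_voiced(200.0, sample_rate), sample_rate)
print(f"lower-pitched voice:  {low_pitch:.1f} Hz")
print(f"higher-pitched voice: {high_pitch:.1f} Hz")
```

Real speech is only approximately periodic, so practical pitch trackers work on short frames and add a voicing decision; unvoiced sounds such as fricatives would give no clear autocorrelation peak.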

What happens during coughing? The textbook explanation is that a cough is a reflex action. The diaphragm contracts, creating negative pressure around the lungs, and the glottis opens, so air rushes into the lungs to equalise the pressure. The vocal cords then contract to shut the glottis. The abdominal muscles contract to accentuate the action of the relaxing diaphragm while, simultaneously, the other expiratory muscles contract. These actions increase the air pressure within the lungs. The vocal cords then relax and the glottis opens, releasing air at over 100 mph. The bronchi and the non-cartilaginous portions of the trachea collapse to form slits through which the air is forced, which clears out any irritants attached to the respiratory lining. A single cough therefore involves no periodic opening and closing of the vocal folds, unlike a vowel. However, natural coughing often produces a sequence of 3-4 coughs, and the opening and closing of the glottis over such a sequence is harder to describe physiologically. An attempt to understand this is made here.

What happens during breathing? The glottis largely remains open to allow free flow of air into and out of the lungs, coordinated by the movement of the diaphragm and the elasticity of the lungs. A nice video is shown here.

Why nine sound categories

As discussed above, we ask every user to record and upload nine sound samples. These can be grouped as follows:

breathing (two kinds; shallow and deep)
coughing (two kinds; shallow and heavy)

The choice of the above two is driven by reports from the WHO and the CDC, which list dry cough, difficulty in breathing, and chest pain (or pressure) as key symptoms of this viral infection, appearing 2 to 14 days after exposure to the virus. A recent modeling study of symptom data collected from a pool of 7178 COVID-19 positive individuals also validated the presence of these symptoms and proposed a real-time prediction and tracking approach. Repeated coughing can adversely affect the mass and tension of the vocal folds, which can in turn alter the patient's speaking style. You might have noticed that you can often guess whether a friend has a cold from his/her speaking style over the phone.

sustained vowel phonation (three kinds; /ey/ as in made, /i/ as in beet, /u:/ as in cool)

The chosen vowels have a special place in the quantal theory of speech. These vowels are easy to produce and appear in almost every spoken language. Further, these vowel sounds are perceived as the most distinct among all vowels, and have been argued to capture the vocal tract attributes effectively. For more details see here and here.

one to twenty digit counting (two kinds; normal and fast paced)

Counting a sequence of digits corresponds to speaking continuously for close to 20 seconds. Any breathing difficulty will make this task harder, and we expect this to be reflected in the speaking style, for example in loudness, stress and pause patterns, and pace of speaking.
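The grouping above can be written down as a small lookup table, which also makes it easy to check that the four groups add up to nine recordings per user. This is only a sketch: the dictionary keys and label strings are hypothetical and are not the file names or identifiers used in the actual dataset.

```python
# Hypothetical labels for the four groups of recordings described above.
SOUND_CATEGORIES = {
    "breathing": ["shallow", "deep"],
    "coughing": ["shallow", "heavy"],
    "sustained vowel phonation": ["/ey/", "/i/", "/u:/"],
    "digit counting (1 to 20)": ["normal", "fast"],
}

# Flatten into one label per recording that a user is asked to upload.
labels = [
    f"{group}: {kind}"
    for group, kinds in SOUND_CATEGORIES.items()
    for kind in kinds
]
total = len(labels)

for label in labels:
    print(label)
print("recordings per user:", total)
```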

Visualizing the waveforms

In the figure below we show the waveforms and the corresponding spectrograms of a few sound samples. The waveforms represent the recorded time-domain signal, and the spectrograms depict the spectral content of the signal in every 10 msec short-time window. We can make some observations from the plots.

The breathing samples are wideband: the spectral energy is distributed over all frequencies. The inhale is lower in energy than the exhale; however, both last for a similar time span, close to 1 sec. The exhale (the center burst) also shows some formant-like structure in the spectrogram. This can be expected as the air travels through the vocal tract.
For the coughing samples, we can see that the coughs repeat, and the first cough is a little longer in duration. This often happens because we usually take a deep breath and release more air in the first cough. We can also see some formant structure in the spectrogram.
For sustained vowel phonation, we can see a clear, distinct formant structure in the spectrogram, particularly for the second formant.
For the digit counting, we can see the fluctuating formant structure in the spectrogram.
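For readers who want to see what "spectral content in every 10 msec window" means in practice, here is a minimal sketch of a short-time magnitude spectrum written with the Python standard library only. It is not the code used to produce the figure: the function name, the Hann window, the non-overlapping frames, and the 8 kHz test tone are assumptions made for the example. At the 48 kHz sampling rate of the recordings, a 10 msec window would span 480 samples.

```python
import cmath
import math

def spectrogram(signal, sample_rate, window_ms=10.0):
    """Return (frames, bin_hz): one magnitude spectrum per non-overlapping window."""
    win = int(sample_rate * window_ms / 1000)  # samples per window
    # Hann window to reduce leakage at the frame edges.
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, win):
        frame = [signal[start + n] * hann[n] for n in range(win)]
        # Naive DFT, keeping only the non-negative frequency bins.
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win)
                    for n in range(win)))
            for k in range(win // 2 + 1)
        ]
        frames.append(spectrum)
    return frames, sample_rate / win  # bin spacing in Hz

# Toy input: a 1 kHz tone sampled at 8 kHz for 50 msec (assumed values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(400)]
frames, bin_hz = spectrogram(tone, sample_rate)

# Frequency of the strongest bin in each 10 msec frame.
peak_hz = [max(range(len(s)), key=s.__getitem__) * bin_hz for s in frames]
print(len(frames), "frames, peak frequencies:", peak_hz)
```

In practice one would use an FFT routine and overlapping windows; the naive DFT here is only meant to keep the example dependency-free. A wideband sound such as breathing would spread energy across most bins, whereas the tone above concentrates it in one.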

It should be noted that these recordings are obtained via crowdsourcing and recorded through web browsers. All sound samples are recorded at 48 kHz in WAV file format. Some of these recordings may have ambient noise which cannot be filtered while recording. In another post we will try quantifying different artifacts we observe in these files. We manually listen to every uploaded file, and will share our opinion on the quality and curation procedure.
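As a quick sanity check on a downloaded file, the sampling rate and duration can be read from the WAV header. The sketch below uses Python's standard `wave` module and assumes the file is PCM-encoded; the helper names and the temporary test tone are hypothetical and only there so the example runs without any dataset file.

```python
import math
import os
import struct
import tempfile
import wave

def describe_wav(path):
    """Read basic header information from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": 8 * wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }

def write_test_tone(path, sample_rate=48000, duration_s=0.5, freq_hz=440.0):
    """Write a mono 16-bit sine tone, standing in for a real recording."""
    n = int(sample_rate * duration_s)
    samples = [
        int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    ]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{n}h", *samples))

tone_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(tone_path)
info = describe_wav(tone_path)
print(info)
if info["sample_rate_hz"] != 48000:
    print("warning: file is not sampled at 48 kHz")
```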