The Hard Part About Speech

The rise of Large Language Models (LLMs) in recent years has certainly made a lot of people see the world differently. I believe that many people, myself included, are more interested in statistics, machine learning, computer science, and linguistics because of these fantastic advancements. Studying intensely at the University of Edinburgh over the last year, I have come to understand more about what these kinds of models do and how they relate to the bigger picture of modeling functions in general.

All of these models are doing some form of function approximation/prediction task. Language models like GPT or BERT are simply predicting a word token given its context; Automatic Speech Recognizers (ASR) like Whisper are predicting the text output given a waveform; Text-to-Speech (TTS) models like Amazon Polly are trying to predict the spectrogram or waveform given the textual input and who it should sound like. There is always a task, a dependency, data, and a metric to measure that task.

GPT is a text-to-text model.

  • Task: predict the next word. 
  • Dependency: all previous words. 
  • Data: We find a bunch of paragraphs and divide them into context and what comes next (a small sketch follows this list). 
  • Metric: Does our model predict what comes next given the context?
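To make that framing concrete, here is a minimal sketch of how a passage can be sliced into (context, next word) training pairs. The whitespace tokenization and tiny context window are simplifying assumptions for illustration only; real models use subword tokenizers and far longer contexts.

```python
# Minimal sketch: turn raw text into (context, next_word) training pairs.
# Naive whitespace tokenization is an assumption for illustration only.

def make_next_word_pairs(text, context_size=4):
    """Slice a passage into (context, next_word) examples."""
    tokens = text.split()
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the previous words
        target = tokens[i]                    # what comes next
        pairs.append((context, target))
    return pairs

pairs = make_next_word_pairs("the cat sat on the mat and then the cat slept")
for context, target in pairs[:3]:
    print(context, "->", target)
# The metric is then simply: how often (or how confidently) does the model
# recover `target` when shown `context`?
```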

ASR is a speech-to-text model.

  • Task: what does this utterance say?
  • Dependency: the audio.
  • Data: We find a bunch of transcribed audio.
  • Metric: Does our model predict the transcription given the audio? (A word-error-rate sketch follows this list.)
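The usual way to score that metric is word error rate (WER): the word-level edit distance between the reference transcript and the model's hypothesis, divided by the reference length. Here is a rough, self-contained sketch; it is a toy implementation, not any particular toolkit's scorer.

```python
# Toy word error rate (WER): edit distance over words, normalized by reference length.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # one substitution -> ~0.17
```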

TTS is a text-to-speech model.

  • Task: Speak an utterance.
  • Dependency: what to say.
  • Data: We find a bunch of transcribed audio (a sketch of one such training example follows this list).
  • Metric: Is our model able to say the text?
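To show what "a bunch of transcribed audio" looks like as training data, here is a hedged sketch of a single TTS example. The field names and array shapes are illustrative assumptions, not any particular toolkit's format: the text is the input, the mel spectrogram is the target, and the speaker id is the kind of extra conditioning discussed below.

```python
# Sketch of one TTS training example; names and shapes are illustrative assumptions.

from dataclasses import dataclass
import numpy as np

@dataclass
class TTSExample:
    text: str                     # what to say
    speaker_id: int               # who should say it (extra conditioning)
    mel_spectrogram: np.ndarray   # target acoustics, shape (time_frames, n_mels)

example = TTSExample(
    text="the next train departs at nine",
    speaker_id=3,
    mel_spectrogram=np.zeros((120, 80)),  # placeholder target
)
print(example.text, example.speaker_id, example.mel_spectrogram.shape)
```
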
I have learned a ton about these types of models and what makes each of them interesting and unique. One that has been particularly fascinating to me is the problem of text-to-speech. There are several reasons that TTS is a hard problem.

The main difficulty is a classic problem of information loss. There is so much information in an audio clip: who the speaker is, where they come from, sometimes what environment they are in, and how they are feeling while speaking. How much of that information comes through on a bare transcript? The answer: none. This reminds me of a hash function. You have a bunch of utterances that can be very different, yet they all map to the same transcript. However, it's much harder to go the other way: to start from a transcript and recover all the right details of how to say it, where to say it, and who should say it. 

This problem reminds me of my MATH 290 class at BYU, where we discussed injective, surjective, and bijective functions. I needed a math refresher, but Wikipedia is great. An injection maps distinct points of space A to distinct points of space B (a one-to-one mapping). A surjection is the opposite idea: every point of B is covered by some point of A (many-to-one mappings are allowed). A bijection is both, and neither is where our good friend TTS lives. TTS has the wonderful problem of one-to-many. Mathematically speaking, a one-to-many mapping is not even a function. That makes the problem of learning a function (deep learning) to find the audio given the transcript very difficult. We have to add information to the transcript (like speaker identity) to enlarge the input space and make the mapping as close to an injective function as possible.
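To put that refresher in symbols, here are the standard textbook definitions for a map between two spaces, plus the way I think about the TTS mapping; the notation is mine and nothing here is speech-specific.

```latex
% Standard definitions for a map $f : A \to B$.
\begin{itemize}
  \item Injective (one-to-one): $f(a_1) = f(a_2) \implies a_1 = a_2$, i.e.\ distinct inputs never collide.
  \item Surjective (onto): $\forall b \in B,\ \exists a \in A$ with $f(a) = b$, i.e.\ every point of $B$ is reached.
  \item Bijective: both at once.
\end{itemize}
% The bare transcript-to-audio relation is one-to-many: a single transcript $t$
% has a whole set of valid audio realizations, so ``$f(t)$'' is not well defined
% and is not a function. Conditioning on extra information $s$ (speaker, style,
% environment) gives something closer to
\[
  f : (t, s) \mapsto \text{audio},
\]
% which can at least be treated as a learnable function, ideally close to injective.
```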

[Figure: Mapping between audio and transcript space, an illustration of the one-to-many problem of TTS and the many-to-one problem of ASR.]

In a closed domain this isn't as big a problem, because you don't want variability in how things are said; you don't want the train station voice announcing train times sarcastically. Those building conversational agents and more open-ended systems, however, do have to deal with this problem. 

In my experience at various jobs, when people and companies process audio for ML systems and decisions, they often go through transcripts. Why? Because it's easier. What is the tradeoff? Massive information loss. There is a lot of work to be done before systems understand the best way to say something given the content as well as the context. I am excited to follow the field and see the progress that happens in the near future.
