Phonetic Posteriorgrams

For my MSc dissertation, I am trying to synthesize speech with variable L2 accent strength. There are a few ways to approach this. The "fastest" and "easiest" way would be to have a perfect dataset with reliable accent-strength labels on each utterance. There are two problems with this approach. First, such a dataset would be incredibly difficult and expensive to create. Second, it is hard to get a trusted label for "accent strength" that is also consistent across the dataset. What does accent strength even mean, and how would it be measured non-subjectively?

I was talking about this problem with one of the PhD candidates in the CSTR group, and she told me to look into phonetic posteriorgrams. Since then, I have been reading about them, and I wanted to explain them in this post.

What is a Phonetic Posteriorgram?

A phonetic posteriorgram (PPG) is well described by its name. Phonetic indicates that we are talking about phonemes: speech sound categories like /m/, /t/, or /ae/. Posteriorgram indicates that we are generating posterior probabilities over those phonetic categories. So, the basic goal of PPGs is to map each frame of speech to a probability distribution over the phoneme categories. The result is a map showing the most probable phonemes across the speech signal.
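As a toy illustration (not from any of the papers below), here is how a frame-level posteriorgram can be built from raw per-frame scores with a softmax, so that each frame's values form a probability distribution over a tiny made-up phoneme set:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

phonemes = ["m", "t", "ae"]  # tiny hypothetical phoneme set
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(phonemes), 5))  # (phonemes, frames) raw scores
ppg = softmax(logits, axis=0)                 # normalise each frame (column)

# Each column is now a distribution over the phoneme set
assert np.allclose(ppg.sum(axis=0), 1.0)
```

A real PPG works the same way, just with a full phoneme inventory and one column per analysis frame of the audio.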

Figure 1: Example PPG for the word tomato. The y-axis is the phoneme category and the x-axis is time. This image came from Churchwell et al. (2024) [1] at Northwestern University in IL, USA.

Using the model from Northwestern, I was able to produce PPGs from my own audio and plot them using the matplotlib library in Python. In Figure 2, you can see the time alignment of different phones. One thing to keep in mind is the data the model was trained on: in this case, audio from the Common Voice, CMU Arctic, and TIMIT datasets, covering a wide range of speakers and recording environments, all speaking English.

Figure 2: A PPG of a full sentence from the ARCTIC prompt set, recorded by me. You can see the model is fairly confident in the phonemes it identifies.
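A plot like Figure 2 can be reproduced with a short matplotlib sketch. This assumes you already have a PPG as a (phonemes × frames) NumPy array from whatever model you use; here I substitute a random stand-in array so the snippet runs on its own:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Stand-in PPG: replace with the real (phonemes x frames) array from your model
rng = np.random.default_rng(1)
ppg = rng.random((40, 200))
ppg /= ppg.sum(axis=0, keepdims=True)  # normalise columns to distributions

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(ppg, aspect="auto", origin="lower", interpolation="nearest")
ax.set_xlabel("Frame")
ax.set_ylabel("Phoneme index")
fig.colorbar(im, ax=ax, label="Posterior probability")
fig.savefig("ppg.png")
```

With real model output you would also relabel the y-axis ticks with the phoneme symbols themselves, as in Figures 1 and 2.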

Why are PPGs interesting?

1. Language independent

A PPG can be created from a speech signal in any language. A PPG represents the acoustic space at each time step of a speech signal, and because it is a purely acoustic representation, it can be language- and accent-independent, provided one uses a large, inclusive phoneme set (different languages have different phoneme inventories). In that sense, one could say phonetic content is universal rather than language-specific. However, the realization of a phoneme is not the same in every language. In Figure 3, we can see that voice onset time (one property of plosive phones) differs for the same phonemes across three languages.
Figure 3: Voice onset times for Polish, German, and English, showing that the acoustic realization of the plosives /p/, /t/, and /k/ differs across languages. This has serious implications for multilingual PPG generation (Nelson, 2022) [2].

I'm interested in exploring how robust PPGs are across languages. Can PPGs capture differences in L2 accent? Cho and Nam (2021) seem to suggest we can get information about L2 accent from a PPG [3].

2. Fine-grained information

Because a PPG gives a probability distribution over candidate phonemes at each frame of speech, it provides a lot of fine-grained information about the phonetic content of an utterance. This could be used for other tasks such as phoneme recognition, speech recognition, accent conversion, and in my case, hopefully synthesis.
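As a small example of how fine-grained this information is, the most probable phoneme at each frame can be read off a PPG with an argmax, and consecutive repeats collapsed into a rough phone sequence. This sketch uses a made-up three-phoneme PPG:

```python
import numpy as np

phonemes = ["t", "ah", "m"]  # hypothetical labels for the rows
# Columns are frames; each column is a distribution over the phonemes
ppg = np.array([
    [0.8, 0.7, 0.1, 0.1, 0.2],
    [0.1, 0.2, 0.8, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.7],
])

# Most probable phoneme per frame
frame_labels = [phonemes[i] for i in ppg.argmax(axis=0)]
# Collapse consecutive repeats to get a rough phone sequence
sequence = [p for i, p in enumerate(frame_labels)
            if i == 0 or p != frame_labels[i - 1]]
print(frame_labels)  # ['t', 't', 'ah', 'ah', 'm']
print(sequence)      # ['t', 'ah', 'm']
```

Of course, a real system would use something smarter than a per-frame argmax, but this shows how much timing and identity information sits in the raw posteriors.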

3. Many ways to generate them

There are many ways to generate a PPG. All you need is an input representation of speech and a final linear layer that projects each frame to a vector whose size equals the size of the defined phoneme set. Churchwell et al. (2024) found that Wav2Vec2 was the best speech input representation for accurate PPGs [1].
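To make that "final linear layer" concrete, here is a minimal NumPy sketch. The feature dimension, phoneme-set size, and weights are all made up; in a real system the features would come from something like Wav2Vec2 and the weights would be learned:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_frames, feat_dim, n_phonemes = 100, 768, 40  # e.g. wav2vec2-sized features
rng = np.random.default_rng(0)
features = rng.normal(size=(n_frames, feat_dim))    # per-frame representations
W = rng.normal(size=(feat_dim, n_phonemes)) * 0.01  # the linear projection
b = np.zeros(n_phonemes)

# Project each frame to phoneme logits, then normalise to posteriors
ppg = softmax(features @ W + b, axis=-1)  # (frames, phonemes)
print(ppg.shape)  # (100, 40)
```

Training then just amounts to fitting W and b (and possibly the feature extractor) with a frame-level classification loss against phoneme labels.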

I am looking into the best ways to generate PPGs right now. So far I have experimented a little with the Hugging Face model wav2vec2-xls-r-300m-timit-phoneme by vitouphy. I'll try to write more when I have a "best" solution for PPG generation, as I'm not convinced this is it.

One problem with PPGs is that the neural network's objective is to map every realization of an /a/ (whether L1 or L2, depending on the training data) to /a/. This means the model is not explicitly capturing variability, but potentially collapsing the representations too aggressively. One potential way to surface variability is to train the PPG generation system only on L1 speech, and then observe how L2 speech is perceived [3]. This makes some intuitive sense, as humans are (usually) trained in an L1 language environment.

Conclusion

In conclusion, PPGs are pretty cool. PPGs have many different applications in speech technology. I just hope they will be able to help me on my journey to discover how to synthesize speech with variable L2 accent.

References

[1] Churchwell, C., Morrison, M., & Pardo, B. (2024). High-Fidelity Neural Phonetic Posteriorgrams. arXiv preprint arXiv:2402.17735.
[2] Nelson, C. (2022). Do a Learner's Background Languages Change with Increasing Exposure to L3? Comparing the Multilingual Phonological Development of Adolescents and Adults. Languages, 7(2), 78. https://doi.org/10.3390/languages7020078
[3] Cho, Y., & Nam, H. (2021). A Comparison of L1 and L2 Speech Phonetic Posteriorgrams for Applications in Pronunciation Training. Foreign Language Education Research (외국어교육연구), 35(1), 293-304.
