Robotic Voices and Speech Synthesis: Experiments with Unit Selection

This is the paper that I wrote for my Speech Synthesis class in semester 2 at the University of Edinburgh. Building a unit selection voice was a lot of fun, especially getting to record our own voices. Learning unit selection helped me understand many of the fundamentals behind the state-of-the-art technologies, because they are all trying to solve similar problems with different approaches.

Here are some sample audios from my synthesized voice.

"She sells seashells by the seashore."

"I scream. You scream. We all scream for ice cream."

These are very robotic, but this was the state of the art not that long ago, and it is cool that the voice is built from a database of my own recordings.

Abstract

Unit selection in speech synthesis, an aging technique dating back to 1996, remains a fundamental method for synthesizing speech. This paper explores the theoretical principles of unit selection, its practical implementation, and its implications for modern speech synthesis methods. We demonstrate, through several experiments, the importance of data, prosody, and context in achieving natural speech synthesis.

Introduction

Synthesizing speech has intrigued researchers for decades. From the earliest primitive mechanical voice boxes [1][2] to modern digital voices [3][4][5], many techniques have been used to replicate the phenomenon of human speech. One popular and effective method, which for many years was the state of the art, is called unit selection. This paper presents several experiments and discussions that highlight the advantages and disadvantages of unit selection as a strategy for speech synthesis. It also emphasizes the importance of unit selection in our understanding of speech synthesis as a larger discipline.

Background: Unit Selection in Speech Synthesis

What is referred to in this paper as unit selection was first proposed for speech synthesis by Black and Taylor in 1996 [6]. The backbone of unit selection is a basic assumption that a full speech utterance is composed of smaller speech units. This is reasonable because, linguistically, we compose sentences of words and words of letters and sounds. Before Black and Taylor's paper, there was already a group of selection-based speech synthesis techniques that operated under this assumption: a collection of sub-units can be selected, rearranged, and assembled to synthesize entirely new full-length utterances. Unit selection improved on naive concatenation of smaller units by introducing a then-novel selection algorithm based on the idea of a target cost and a join cost.

Target Cost

Naive concatenation of phones was sufficient to create intelligible speech; however, these voices were very unnatural. Unit selection makes an interesting claim: not all instances of a phoneme are the same. For example, an /m/ at the beginning of a sentence is fundamentally different from an /m/ in the middle of a word because of the acoustic context in which it exists. This is a reasonable claim, because the vocal tract and mouth shape move continuously through time, and the sound properties of any given /m/ depend on where the mouth and tongue were positioned before and where they are going after. Acoustic context is therefore important for creating a voice that is not just intelligible but also natural.

The target cost measures how close a candidate unit is to the target, and the selection algorithm tries to minimize this difference. The difference can be defined in terms of linguistic context (the independent feature formulation, IFF) or in terms of acoustic space (the acoustic space formulation, ASF).

In order for unit selection to match a candidate to a target, the candidate needs to be properly labelled with linguistic contextual information. This information can include prosodic context (like pitch and stress) and position in the syllable or phrase. For example, the /m/ in the word "make" should be more similar to the /m/ in "mood" than to the /m/ in "bomb". When we synthesize a word that we don't have recorded, for example "manly", we want to use the sub-word recording of the /m/ from "mood" rather than "bomb", because an /m/ that follows silence should be a better fit. This is what makes it the best fit under IFF.

Zipf [7] observed that natural language follows a heavily skewed frequency distribution, which means that without a massive amount of data we are unlikely to observe every phoneme in every combination of labelled context. Because of this law, some targets will have very few data points, so using IFF we are likely to run into data sparsity issues.

Using ASF instead of IFF is one solution, since it uses the acoustic representation of each candidate signal: we ask how likely it is that the signal came from the average of a group of similar phones.

With any database of speech units we will have sparsity, so we need to group targets into similar contexts so that candidates can be shared. Classification and regression trees (CART) are helpful for this. For example, it is reasonable to assume that the phoneme /a/ is similar in the words "van" and "fan" because /v/ and /f/ are quite similar sounds: both are fricatives and thus have a similar effect on the following /a/.

We can use a CART to group these contexts together so candidates in both contexts can be shared. The CART asks the questions that best group similar contexts. After we define a target, we can descend the tree and get n candidates, \(U = \{u_1, u_2, ..., u_n\}\), from which we can choose the closest match.
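As a rough illustration (not Festival's actual implementation), the sketch below shows how a tree of yes/no context questions could route a target to a shared pool of candidates. The node structure, question names, and candidate labels are all hypothetical.

```python
# Minimal sketch of CART-style candidate clustering (hypothetical data structures).
from dataclasses import dataclass, field

@dataclass
class Node:
    question: str = None          # e.g. "is the previous phone a fricative?"
    yes: "Node" = None
    no: "Node" = None
    candidates: list = field(default_factory=list)  # leaf: shared pool of units

def descend(node, context):
    """Walk the tree by answering yes/no questions about the target's context."""
    while node.question is not None:
        node = node.yes if context.get(node.question, False) else node.no
    return node.candidates  # U = {u_1, ..., u_n}

# Example: /a/ preceded by a fricative ends up in the same leaf for "van" and "fan".
leaf = Node(candidates=["a_from_van", "a_from_fan"])
root = Node(question="prev_is_fricative", yes=leaf, no=Node(candidates=["a_other"]))
print(descend(root, {"prev_is_fricative": True}))
```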

Join Cost

To join speech units into a single utterance, we need to calculate the cost of concatenating consecutive units, \(J(u_i, u_{i-1})\). At imperfect joins, there are discontinuities in the acoustic features. These discontinuities manifest in the output as clicks or perceptually unnatural changes in the audio. One way to minimize the perception of joins is to concatenate phones in the middle of the phone rather than at its edges, because acoustic movement is typically most stable and predictable around the middle of a phone. Unit selection, as proposed by Black and Taylor, defines the join cost as a simple weighted Euclidean distance between acoustic features at the join point (Equation 1) [6]. The distance is weighted because some acoustic features, like \(F_0\), may be more important to minimize. Local discontinuities in prosodic features--like \(F_0\), energy, or duration--are particularly hurtful to the naturalness of synthesized speech [8].
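To make this concrete, here is a minimal sketch of a weighted Euclidean distance over boundary acoustic features. The feature layout and weights are illustrative, not the values Festival actually uses.

```python
import numpy as np

def join_cost(feats_left, feats_right, weights):
    """Weighted Euclidean distance between acoustic features (e.g. F0, energy,
    spectral coefficients) at the boundary frames of two consecutive units."""
    diff = np.asarray(feats_left) - np.asarray(feats_right)
    return float(np.sqrt(np.sum(weights * diff ** 2)))

# Example: F0 mismatch is penalized more heavily than the other features.
w = np.array([2.0, 1.0, 1.0])    # [F0, energy, spectral] weights (illustrative)
print(join_cost([120.0, 0.6, 0.3], [150.0, 0.5, 0.35], w))
```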

Total Cost

The basic algorithm combines the target costs and weighted join costs and finds the arrangement of sub-word units that minimizes their sum (Equation 1). A naive approach would be to take the most similar candidate at every linguistic time step, but this doesn't guarantee a global minimum. Instead, we use dynamic programming--the Viterbi algorithm--to find the sequence of candidates that minimizes the total cost over the entire utterance.

$$\sum_{i=1}^{N} T(u_i) + W J(u_i, u_{i-1})$$

Equation 1
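Below is a small sketch of the dynamic-programming search that minimizes Equation 1. The `target_cost` and `join_cost` callables are placeholders for whatever cost functions the system defines; candidates here are just hashable labels.

```python
def viterbi_select(candidate_lists, target_cost, join_cost, W=1.0):
    """Choose one candidate per position so that the total cost
    sum_i [ T(u_i) + W * J(u_i, u_{i-1}) ] (Equation 1) is minimized."""
    # best[i][u] = (lowest cost of any path ending in candidate u at position i, backpointer)
    best = [{u: (target_cost(0, u), None) for u in candidate_lists[0]}]
    for i in range(1, len(candidate_lists)):
        layer = {}
        for u in candidate_lists[i]:
            prev, cost = min(
                ((p, c + W * join_cost(p, u)) for p, (c, _) in best[i - 1].items()),
                key=lambda x: x[1],
            )
            layer[u] = (cost + target_cost(i, u), prev)
        best.append(layer)
    # Pick the cheapest final candidate and trace the backpointers.
    u, (total, _) = min(best[-1].items(), key=lambda kv: kv[1][0])
    path = [u]
    for i in range(len(best) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return list(reversed(path)), total

# Toy usage with made-up costs: candidates are labelled strings.
cands = [["m_1", "m_2"], ["a_1", "a_2"], ["n_1"]]
t = lambda i, u: 0.0 if u.endswith("_1") else 1.0
j = lambda p, u: 0.5
print(viterbi_select(cands, t, j))
```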

Implications For Today

One of the fundamental benefits of this formulation of unit selection is that the current unit of speech being synthesized is dependent on the surrounding linguistic context. Current methods still use the same idea that a phone is acoustically dependent on its neighbouring phones. Some neural systems like FastPitch [5] or Tacotron [4] use convolution layers to gather context from around the input. Convolution layers are learnable filters applied to the input data in a sliding-window manner. By convolving the input, we can capture spatial relationships (in this case across the time domain) within the data, allowing for more robust feature learning and a better understanding of the context relevant to the output. So, instead of hard-coding which contexts are similar, those similarities can be learned in the convolution layers of neural networks.
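As a toy illustration of how a convolution gathers context across time (not the actual FastPitch or Tacotron layers), the following slides a fixed filter over a sequence of per-phone feature vectors; real systems learn the filter weights.

```python
import numpy as np

def conv1d_context(features, kernel):
    """Slide a filter over the time axis so each output frame mixes
    information from its neighbours."""
    T, D = features.shape
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))
    return np.stack([(padded[t:t + k] * kernel[:, None]).sum(axis=0) for t in range(T)])

feats = np.random.randn(10, 4)        # 10 time steps of 4-dim phone features (toy data)
kernel = np.array([0.25, 0.5, 0.25])  # a fixed smoothing filter; real models learn these
print(conv1d_context(feats, kernel).shape)   # (10, 4)
```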

Another lesson modern speech synthesis takes from unit selection is that richer linguistic information as input can benefit the output. If we provide a rich linguistic specification, we can get a richer acoustic representation as output. Mapping between these two spaces is a complex regression task, and many models today continue to seek more efficient and accurate ways to regress from the linguistic space to the acoustic space.

As we grow the richness of the input in unit selection, we are increasing the number of defined unit types, and consequently we need more data to achieve acceptable coverage of those units. Given any reasonably sized dataset, we cannot possibly have examples of every unit type in the training data. Statistical speech synthesis gives us a way to share acoustic representations between types. Parameter sharing helps satiate the hunger for data, because we can model units that have few or no examples in our training set. A statistical speech model doesn't actually keep a database of speech, meaning that synthesis doesn't require searching through all the data. Figure 1 shows how the leaves of the statistical tree store just a mean (\(\mu\)) and covariance (\(\Sigma\)) rather than actual recordings. This saves space and time at synthesis time; however, we still need large amounts of data to train accurate parameters. The same is true for other complex models: very large neural architectures require large amounts of data to learn accurate parameters.
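The toy sketch below illustrates the parameter-sharing idea from Figure 1: a leaf stores only a mean and covariance, and acoustic features are generated from that distribution rather than looked up in a database. The 13-dimensional feature size is arbitrary.

```python
import numpy as np

# Sketch: a statistical model's leaf stores only distribution parameters,
# not recordings. Generating acoustic features for a context means drawing
# from (or taking the mean of) that leaf's Gaussian.
leaf = {"mu": np.zeros(13), "Sigma": np.eye(13)}          # toy 13-dim acoustic features
frame = np.random.multivariate_normal(leaf["mu"], leaf["Sigma"])
print(frame.shape)    # (13,) -- no database lookup required at synthesis time
```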

Figure 1: Diagram showing the similarities and differences between the use of CART decision trees in unit selection versus statistical HMM-based speech synthesis.

Unit Selection with Festival

In order to explore unit selection and build our own voices, we experimented with software called Festival, which implements unit selection. At a high level, we use a few additional speech tools (HTK, htk.eng.cam.ac.uk; Festvox, festvox.org; and the Edinburgh Speech Tools, cstr.ed.ac.uk/projects/speech_tools) to create a database of candidates, and then use Festival to synthesize speech.

Festival contains voices which are lexicon-based. This means that each voice is calibrated with a pronunciation dictionary and rule-based letter-to-sound rules for unknown pronunciations. After selecting a voice, a user can synthesize text. Festival implements a basic frontend to tokenize the text and disambiguate pronunciations, for example in the case of homographs. With the normalized text, Festival looks up pronunciation information based on the selected voice's lexicon and then assembles a linguistic specification of the target utterance. With this specification, Festival minimizes the cost of concatenating speech units as previously discussed, and can then play or save the synthesized speech. This process assumes we have a dictionary and pronunciation rules, a frontend, and a built voice (which includes the labelled speech data).
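As a hedged example of driving this pipeline, Festival ships a text2wave utility that runs the full frontend and synthesis. The sketch below calls it from Python, assuming text2wave reads text from standard input (as it typically does) and that the built voice is installed under the placeholder name used here.

```python
# Illustration only: synthesizing a sentence with a built Festival voice via text2wave.
import subprocess

def synthesize(text, out_wav, voice="voice_my_unitsel"):   # placeholder voice name
    subprocess.run(
        ["text2wave", "-o", out_wav, "-eval", f"({voice})"],
        input=text.encode("utf-8"),
        check=True,
    )

synthesize("She sells seashells by the seashore.", "seashells.wav")
```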

Data

The backbone of good synthesis is having a sufficient amount of high-quality data. This entails having clean recordings with accurate transcriptions and alignments.

We need sentences to read in order to build a voice. Where should these sentences come from? We chose Ted Talk transcripts to serve as our list of possible sentences. Ted Talk transcripts provide several advantages. Firstly, they are licensed under Creative Commons, which allows us to use the data while acknowledging TED as the creator. They are also highly available. We collected thousands of sentences, keeping those that were neither too long nor too short and excluding any containing non-ASCII characters (approximately 5,000 sentences). With this pre-processed corpus of text, we could then implement an algorithm to select the most useful sentences for our script.
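A minimal sketch of this kind of pre-processing is shown below. The exact length thresholds we used are not reproduced here, so the bounds are illustrative.

```python
# Keep transcript sentences that are ASCII-only and neither too short nor too long.
def preprocess(sentences, min_words=5, max_words=20):   # bounds are illustrative
    keep = []
    for s in sentences:
        if not s.isascii():
            continue
        if not (min_words <= len(s.split()) <= max_words):
            continue
        keep.append(s.strip())
    return keep
```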

Script

As discussed previously, our unit selection system needs enough data to have wide unit type coverage. This optimization of unit coverage is a fundamental aspect of our script design.

Our basic algorithm was to score each sentence in the corpus, select the highest-scoring one, and then repeat the process to find the sentence with the next best score. Because the score depends on which sentences have already been selected, we need to re-score the corpus after each selection.

The scoring function primarily rewarded the diphone coverage of the script. It compared the diphone counts of a candidate sentence (\(f_s\)) with the counts over the already selected sentences (\(f\)). Missing diphones were awarded the highest score; if a diphone was already present in our data, it received points according to a decaying function that rewarded diphones seen fewer times.

$$\mathrm{score}(f, f_s) = \sum_{d \in f_s} s(d), \quad \text{where } s(d) = \begin{cases}100, & f[d] = 0 \\ \dfrac{10}{1.3^{\,f[d] - 1}}, & f[d] > 0\end{cases}$$
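The sketch below implements this greedy selection under the scoring function above. The `diphones_of` helper is hypothetical; in practice it would come from a frontend (for example Festival's) that phonetizes each sentence.

```python
from collections import Counter

def sentence_score(corpus_counts, sentence_diphones):
    """Unseen diphones are worth 100 points; covered ones decay as 10 / 1.3**(count - 1)."""
    score = 0.0
    for d in sentence_diphones:
        seen = corpus_counts[d]
        score += 100.0 if seen == 0 else 10.0 / (1.3 ** (seen - 1))
    return score

def greedy_select(sentences, diphones_of, n=600):
    """Repeatedly pick the highest-scoring sentence, updating coverage counts."""
    counts, chosen, pool = Counter(), [], list(sentences)
    for _ in range(n):
        if not pool:
            break
        best = max(pool, key=lambda s: sentence_score(counts, diphones_of(s)))
        chosen.append(best)
        counts.update(diphones_of(best))
        pool.remove(best)
    return chosen
```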

Using this selection algorithm, we chose the 600 best sentences. We chose 600 to match the ARCTIC A script, which serves as a baseline dataset for experimentation. Table 1 shows the comparability of the scripts in terms of word and phone counts. The ARCTIC scripts were curated by experts at Carnegie Mellon University as US English scripts that are phonetically balanced for unit selection speech synthesis (http://festvox.org/cmu_arctic/). To showcase the benefit of the scoring algorithm, we also randomly selected 600 sentences and compared their diphone coverage to that of our selected sentences. This algorithm, although simple, gives better diphone coverage than both the ARCTIC A script and the random script (see Figure 2). The scoring algorithm could be further improved by accounting for more linguistic context, such as stress.

Figure 2: Shown is the diphone coverage of a randomly selected script, the ARCTIC A script, and the proposed selection script.

Recording

Next, we needed to record the selected sentences. These sentences are the utterances from which we will build new utterances, so we want the recordings to be as high quality as possible. High-quality recordings make the snippets of speech easier to understand and make the labelling process more accurate. The recordings were collected in a professional sound studio under controlled conditions and with quality equipment.

When recording the script collected from the Ted Talk data, we recorded slightly differently. Instead of recording in a very monotone style like the ARCTIC voice, we decided to add emotion and intonation to the speech. To do this, the speaker tried to engage more with the text, which resulted in more variety in pitch, tone, and timing.

Dictionary

Another important part of calibrating our voices is making sure that the dictionary contains the correct pronunciations of words. Correcting a pronunciation mistake was straightforward: I either added the word and its pronunciation to the dictionary or adjusted the existing entry.

After listening to a number of samples, I noticed a handful of pronunciation errors. Firstly, "Edinburgh" was in the dictionary, but not with the pronunciation I wanted, so I simply updated it. Proper nouns often don't appear in the dictionary, so I needed to add several words that were in my sentences. Another example I came across was the acronym "TTS". It wasn't in the default dictionary, so I added it so that Festival wouldn't fall back on the default letter-to-sound pronunciation.

Example Pronunciation Entry:
("edinburgh" nnp (((e)1)((t* i m)0)((b @@r)2)((r ou)0)))

Example Correction:
before -> ("tts" nil (((t s) 0)))
after -> ("tts" ((t ii)1)((t ii)0)((e s)0))

After a simple AB expert listening test, the change in pronunciation of the string "tts" (as well as several other out-of-dictionary words) was satisfactory.

Labeling the data

In order to select the right units from our database, we need to have correct labels on the data. If the labelling is wrong then the algorithm will select mis-aligned, or even wrong, units of speech. This results in poor speech synthesis.

The speech tools depend on Festival's frontend to provide labels for the phones in a master label file. With the phone labels, we need to align/timestamp them against the actual utterance so we can identify the speech unit boundaries. This "forced" alignment is borrowed from automatic speech recognition, where we want the most likely alignment given the speech and the phone sequence. This is accomplished by parameterizing the speech as frames of MFCCs (Mel-frequency cepstral coefficients, i.e. acoustic features) and training HMMs (Hidden Markov Models). The HMMs are necessary because they can probabilistically align all of our frames of speech with the "hidden states", the linguistic phones. After forced alignment, Festival creates utterance structures that assign these timestamps to speech candidates.
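For illustration only (the voice-building recipe itself uses HTK for this), the snippet below shows the parameterization step: turning a recording into the MFCC frames that the alignment HMMs operate on, here with librosa and a placeholder filename.

```python
import librosa

y, sr = librosa.load("arctic_a0001.wav", sr=16000)          # placeholder filename
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=int(0.010 * sr),     # 10 ms frame shift
                            n_fft=int(0.025 * sr))          # 25 ms analysis window
print(mfcc.shape)   # (13, number_of_frames); these frames are what the HMMs align
```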

Another important task is to label the pitchmarks of the speech. When we concatenate speech (with overlap-and-add), we want the pitch periods of the two units to line up so there is no sudden change in pitch. Poor alignment of pitch is perceived as an audio artifact and sounds very unnatural. This plays a critical role in the join cost between speech units at synthesis time: building a voice without marking pitch results in a perceived increase in audio artifacts.

Having well-labelled data is crucial to ensure that the model receives accurate and meaningful information. This is also true for all modern machine learning tasks, including state-of-the-art speech synthesis. Without good labels, the model may learn incorrect patterns or make unreliable predictions, which hurts performance and produces potentially biased or misleading results. High-quality labels enable the model to generalize well to unseen data. In our unit selection task, if the labels are not accurate, the speech units selected by the algorithm will not be accurate either. State-of-the-art methods attempt to be robust to imperfect labels by using large amounts of data. Large amounts of data are something unit selection handles poorly, because storage and computational complexity grow with the number of candidates we have.

Further Signal Processing in Festival

To reduce the time and cost of synthesizing speech with Festival, we do some additional signal processing and optimization on our database. First, we measure the changes in F0 so that the join cost can measure and compare segments across a join. Although related, this is different from pitch marking: pitch marking lets us align waveforms, while pitch tracking gives us an easy measure of F0 similarity. Pitch tracking will be important for optimized join-cost estimation.
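The toolchain uses its own pitch tracker, but as an illustration of what pitch tracking produces, a per-frame F0 contour could be estimated as below (placeholder filename; librosa's pyin is used purely as a stand-in).

```python
import librosa
import numpy as np

y, sr = librosa.load("arctic_a0001.wav", sr=16000)                 # placeholder file
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C6"), sr=sr)
print(np.nanmean(f0))    # average F0 over voiced frames (unvoiced frames are NaN)
```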

We also calculate join-cost coefficients and combine them with the F0 measurements to precompute the values used for joining one snippet to another. This is possible because the acoustic features at each snippet boundary are fixed and independent of one another, so precomputing them removes a lot of computation at synthesis time.

One problem we might experience with unit selection is that our database of speech is large. To reduce its size, we store the speech not as waveforms but as the parameters of a source-filter model, called linear prediction coefficients (LPCs). This parameterization also allows for easier waveform manipulation if necessary.
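As a small, hedged illustration of LPC parameterization (Festival and the Edinburgh Speech Tools do this internally), librosa can fit linear prediction coefficients to a single frame:

```python
import numpy as np
import librosa

frame = np.random.randn(400).astype(np.float32)   # stand-in for one 25 ms speech frame
a = librosa.lpc(frame, order=16)                  # the "filter" part of a source-filter model
print(a.shape)                                    # 17 coefficients (order + 1)
```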

Experiment 1

As discussed previously, one method for improving unit selection results is to include more candidates in the database. So, we recorded two scripts of speech: first, ARCTIC A from CMU, and second, our selected Ted Talk script as previously discussed.

We built three unit selection voices and synthesized sentences with each. The first voice used only the ARCTIC speech and serves as our baseline. The second voice used the speech from Ted Talks, which we expected to perform similarly to the baseline. The third voice was a combination of the two datasets. We refer to these as the Arctic voice, the Ted voice, and the combined voice.


Table 1: The script sizes of ARCTIC A and our selected Ted Talk script in terms of word and phone counts.

Because our combined dataset has approximately twice the number of candidates to choose from (see Table 1), we expected smaller join and target costs at synthesis time. With nearly twice the number of unique words, the probability of having longer contiguous stretches of the target speech already in our database is much higher than with either script on its own. We hypothesized that this property should translate into more natural-sounding utterances.

User Study

To understand the naturalness of our synthetic speech, we conducted a listening test using Qualtrics. We presented the listeners with a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) test. We chose MUSHRA because it is an international standard for speech evaluation that lets listeners rate speech produced by different systems; in rating the speech, listeners also implicitly rank the systems. We wanted our listeners to compare all three of our voices. The MUSHRA test has the additional advantage that listeners give their judgements relative to natural speech, which acts as an anchor. This anchor combats bias and helps ensure that the listener is paying attention during the test and giving us useful survey results.

We constructed 2 sets of 8 items for a total of 16 items. Eight items were tongue twisters that are difficult for native English speakers to produce. The thought behind this was to provide sentences that are traditionally hard to say, or feel a bit unnatural, in order to give unit selection the best chance at sounding natural. This was loosely corroborated by the fact that, on average, the naturalness rating for tongue twisters was higher for two of the three voices than when all sentence types were taken into account (Figure 4). For the remaining 8 items, we synthesized sentences taken from our domain of Ted Talk sentences, to give a random sample of naturally occurring speech from the same domain the Ted voice was built from. At test time, we presented a random 50% of the items to each listener.

When synthesizing the test items, the Ted and Arctic voices had 4 and 5 missing diphones, respectively, while the combined voice had only 1, which suggests greater diphone coverage in the combined voice's database.

Results

The survey ran for about 5 days, and we found 23 listeners among my classmates, friends, and family who were willing to rate the speech. Unsurprisingly, the reference speech was identified correctly as the reference 100% of the time. This first key finding indicated that the perceived difference in naturalness between the reference and the unit selection voices was large enough that it never tripped up any of the participants.



Figure 3: The initial results from the MUSHRA test show that the synthetic unit selection voices are very distinguishable from the reference audio. The combined voice is rated the most natural of all the synthetic voices tested.

According to the MUSHRA results (see Figure 3), the combined voice was perceived--on average--as the most natural of the unit selection voices. The distribution of scores showed that people tended to agree--based on the interquartile range (IQR) and the mean--that the Arctic voice was slightly better than the Ted voice. The IQR for the Arctic voice ratings was 33, compared with 40 for the Ted voice and 42 for the combined voice. Using the bottom and top quartiles is difficult, because all voices had similar minimum and maximum ratings.

We also tested whether the type of sentence being synthesized mattered. As mentioned previously, the sentences were either from the Ted Talks corpus or tongue twisters. We separated the results from Figure 3 into the two sentence types, as shown in Figure 4. In both cases the combined voice outperforms the other two voices, giving us confidence that more units means better synthesis. Comparing just the ARCTIC and Ted voices, Ted seems to outperform ARCTIC, but only on sentences taken from the same Ted domain (where performance is measured by average MUSHRA rating). The Ted voice seemed more natural on average (approximate average ratings of 33 compared to 41), which is particularly interesting considering that its median and upper quartile decreased. This indicates that a few very high naturalness ratings skewed the average upward even though the bulk of the data shifted downward. Because of the concatenative nature of unit selection, you can get lucky and have just the right phones to make one particular utterance sound much better, especially when many units come from the same source utterance. We suspect this variability is one reason for the large ranges observed in the MUSHRA ratings.


Figure 4: This figure shows an interesting distinction between sentences that were from Ted Talk transcriptions, and tongue twisters.

Even when partitioning by sentence type, there is still a clear advantage for a system with more data. This is evident from the fact that the average score for the combined voice is higher than for the other two voices.

Experiment 2

In our pursuit of a more natural-sounding unit selection voice, we noticed that the ARCTIC script was recorded with a very monotone voice, which may concatenate well at synthesis time but is fairly unnatural, unprosodic speech. When recording the second script, we varied the prosody and intonation of the speech to capture more energy, tone, and acoustic movement. The intuition behind this decision was that in order to have intonation in synthesized speech, we need intonation in the database. We suspected this would contribute to more expressive synthetic speech, which we predicted would sound more natural to listeners. In order not to conflate these results with our finding that more data gives a more natural voice, we imposed the constraint that the two voices should have roughly the same amount of data. Given this constraint, we hypothesized that our Ted voice would actually perform worse than our Arctic voice.

To gather information on voice preferences, we used the same listeners and the same Qualtrics survey as in Experiment 1.

Results

The preference--in terms of naturalness--for Arctic over Ted and for combined over Ted is significant under a one-sided significance test at the 95% confidence level (p-values of 0.0226 and 0.0291, respectively). The preference for combined over Arctic was not statistically significant (p-value of 0.1045). Given our fairly small sample size, we would want more participants to increase confidence in and understanding of these preferences.
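The exact test is not reproduced here; as a hedged illustration, a one-sided sign test over per-item A/B preferences could be computed with SciPy as below, where the win counts are placeholders rather than our actual tallies.

```python
# Illustration only: a one-sided sign test on per-item A/B preferences.
from scipy.stats import binomtest

wins_for_arctic, total_items = 12, 16      # hypothetical: Arctic preferred on 12 of 16 items
result = binomtest(wins_for_arctic, total_items, p=0.5, alternative="greater")
print(result.pvalue)
```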

Figure 5: This figure represents the preferences of the listeners in an AB style comparison. The black dotted line represents 50-50 preference, or non-preference.

The results (shown in Figure 5) are particularly interesting because we can quickly see that the combined voice is always preferred and the Ted voice is never preferred. Note that for each sentence, the voice with the higher MUSHRA score was counted as the "preferred" voice.

Further Discussion

Comparative results give us a good objective measure of which solution is preferred over another, but they do not tell us whether either solution is good in absolute terms. In our case, showing a preference between the Ted voice and the Arctic voice doesn't tell us whether either is a natural-sounding voice. It may have been valuable to include a few other text-to-speech systems to give clearer levels of naturalness; this might also have reduced the variability we saw in the MUSHRA results. The perceived distance in naturalness between the unit selection voices and the reference recording is somewhat subjective, which is why it was important to include an anchor audio as an option.

We used the implicit ranking in the MUSHRA results to compare pairs of voices. This is not as effective as a simpler, dedicated AB test. Another potential drawback of MUSHRA is that it is relative rather than absolute in nature. There are known biases that can appear in a MUSHRA test based on the quality and distribution of items within a question [9]. We sought to mitigate these biases and other anchoring problems by randomizing item order and question order during the administration of the test.

We would have hoped for more participants in order to make our claims statistically stronger; the small sample size of 23 limited our conclusions. The variability of the data is also somewhat concerning: some outliers look like rushed or unreliable test results and could arguably be trimmed.

Overall, the results clearly identify a large gap between the perceived naturalness of unit-selection-synthesized speech and recordings of speech. They also indicate that more data is one way to increase the naturalness of a unit selection synthesizer, and that recording the audio expressively does not directly make the voice sound more natural.

Discussion of Speech Synthesis Today

Although unit selection is--by today's standards--an outdated form of speech synthesis, many of its fundamental principles are still critical for understanding modern speech synthesis methods.

One of the most important lessons from unit selection is that the quality of the input still matters enormously. If you have an airy voice in the recordings database, unit selection will synthesize an airy voice. The same is true for a self-supervised machine learning algorithm: if you only show it airy speech, it will only learn airy speech. The quality of the input is just as important for modern neural architectures as it is for unit selection.

Other modern methods build on the idea that speech is made up of smaller units which can be rearranged in new ways to generate new speech. One excellent example of this is an application to voice conversion [10]. The researchers observed that speech can be segmented into units of individual mel-scale spectrogram frames. Thus, similar to unit selection, given a database of mel-scale spectrogram frames, you can synthesize speech from those frames. In this recent work, voice conversion is accomplished by defining a target (the source speech) and minimizing a target cost (via k-nearest-neighbour selection) between each frame and the database of frames from the new voice. These ideas from unit selection have applications even outside the domain of text-to-speech.
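A minimal sketch of that idea, under the simplifying assumption that units are plain mel-spectrogram frames compared with Euclidean distance (the cited work uses learned speech features), might look like this:

```python
import numpy as np

def knn_convert(source_frames, target_pool, k=4):
    """For each source frame, substitute the mean of its k nearest neighbours
    from the target speaker's pool of frames -- a unit-selection-like target
    cost minimized by nearest-neighbour search."""
    out = []
    for f in source_frames:
        d = np.linalg.norm(target_pool - f, axis=1)       # distance to every pool frame
        nearest = target_pool[np.argsort(d)[:k]]          # k closest candidate units
        out.append(nearest.mean(axis=0))
    return np.stack(out)

# Toy example: 50 source frames, a pool of 500 target-speaker frames, 80-dim features.
src = np.random.randn(50, 80)
pool = np.random.randn(500, 80)
print(knn_convert(src, pool).shape)    # (50, 80)
```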

Many modern methods like HuBERT [11] or SoundStream [12] propose ways to unitize/discretize speech. This discretization can be seen as learning units from a large corpus of speech: instead of identifying and labelling speech units by hand, we can simply learn them from data. Discrete units of speech can be seen as building blocks of speech. In the same way that unit selection selects the right combination of speech snippets to make an utterance, a family of methods uses language models to select sequences of discrete units and build full utterances. Examples of these speech LMs are VALL-E [13] and SpeechX [14]; both use codec-based language modelling and originate from Microsoft. In this case, the discrete speech units replace the recorded snippets of audio, and the language model minimizes the negative log likelihood of the next speech unit instead of minimizing the total cost function of unit selection. In many ways, modern methods are built on the same principles as unit selection, just with different tools, more computation, and more data.

Conclusion

Unit selection--although outdated in practice--still provides valuable insight into the fundamentals of speech synthesis. Unit selection teaches us that speech is simply a series of smaller units of speech that can, if selected accurately from a good database, be concatenated to create brand new utterances.

Through experimentation, we show that more data is better in unit selection speech synthesis. We also find that more variability in the database speech leads to less natural joins and utterances overall. Further work should determine the limits of these claims, including at what point more data stops helping the voice, and whether there is a point at which a voice recorded with intonation helps the synthesized speech sound more natural than a monotone voice.

Acknowledgments

We recognize that many of the sentence transcriptions used were taken from online sources (https://www.ted.com and https://www.kaggle.com), but all are from the TED organization.

References

[1] H. Grassegger, “Von Kempelen and the physiology of speech production,” Grazer Linguistische Studien, 2004.

[2] P. Feaster, “Framing the mechanical voice: Generic conventions of early phonograph recording,” 2001.

[3] M. R. Hasanabadi, “An overview of text-to-speech systems and media applications,” 2023.

[4] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” 2017.

[5] A. Łańcucki, “FastPitch: Parallel text-to-speech with pitch prediction,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6588–6592.

[6] A. W. Black and P. A. Taylor, “Automatically clustering similar units for unit selection in speech synthesis.” 1997.

[7] G. K. Zipf, “The psychobiology of language,” 1935.

[8] M. Vainio, J. Jarvikivi, S. Werner, N. Volk, and J. Valikangas, “Effect of prosodic naturalness on segmental acceptability in synthetic speech,” pp. 143–146, 2002.

[9] S. Zielinski, P. Hardisty, C. Hummersone, and F. Rumsey, “Potential biases in MUSHRA listening tests,” Audio Engineering Society, 2007.

[10] M. Baas, B. van Niekerk, and H. Kamper, “Voice conversion with just nearest neighbors,” 2023.

[11] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” 2021.

[12] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022.

[13] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text to speech synthesizers,” 2023.

[14] X. Wang, M. Thakker, Z. Chen, N. Kanda, S. E. Eskimez, S. Chen, M. Tang, S. Liu, J. Li, and T. Yoshioka, “SpeechX: Neural codec language model as a versatile speech transformer,” 2023.
