Posts

2024 Reflections

2024 will always be an exceptional and unforgettable year for me. It stands out in large part because we spent nearly the entire year living in Edinburgh, Scotland. It would be impossible to compress our year into a post of highlights, but here we go. About a year ago, I was flying back to Scotland to start my second semester at the University of Edinburgh. The second semester was a smooth continuation of the first great semester, and I learned so much about machine learning, linguistics, and speech. Another part of 2024 was coming to understand academia in a deeper and more meaningful way. Sometime in the second semester, I started to seriously consider a PhD at KTH in Stockholm this year. The experiences and people I met in my program, my professors, everyone in the CSTR group at Edinburgh, and the very interesting topics I was learning about got me quite interested in pursuing more education. During t...
Have you ever seen a .EPUB file? EPUB stands for electronic publication, and the format contains all the information required for an electronic document. I was looking into EPUB files because I am working on a personal project that involves reading and language learning. I wanted to create a basic e-reader that works with the EPUB standard.

The EPUB Standard

The EPUB format became the standard for creating and distributing digital documents, especially books. The standard is currently maintained by the W3C (World Wide Web Consortium) community. There have been several revisions of the standard, each improving on the previous version; the current version is 3.3, as of 2023. If you look at an EPUB file with a text editor you get this: This garbled encoding is a good indication that the file is a compressed version of what you want. When you unzip the EPUB file, you get the following files and folders: What makes EPUB particularly powerful is its foundation in web technologies. At its core, an E...
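As a minimal sketch of the idea that an EPUB is just a zip archive of web-style files, here is a Python snippet that builds a tiny EPUB-like archive in memory and reads it back. The file names and contents are illustrative stand-ins, not a complete, valid EPUB:

```python
import io
import zipfile

# Build a tiny EPUB-like archive in memory (paths and contents are
# illustrative; a real EPUB has a full OCF container and package document).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("META-INF/container.xml", "<container/>")
    z.writestr("OEBPS/chapter1.xhtml", "<html><body>Hello</body></html>")

# Reading it back shows the EPUB is just a zip of web-style files.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    mimetype = z.read("mimetype").decode()

print(names)
print(mimetype)  # → application/epub+zip
```

Any standard unzip tool (or Python's `zipfile`) can open a real EPUB the same way, which is exactly why the raw bytes look like compressed noise in a text editor.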
The rise of Large Language Models (LLMs) in recent years has certainly made a lot of people see the world differently. I believe that many people are more interested in statistics, machine learning, computer science, and linguistics because of these fantastic advancements, myself included. Having studied intensely at the University of Edinburgh over the last year, I have come to understand more about what these kinds of models do, and how they relate to the bigger picture of modeling functions in general. All of these models perform some form of function approximation or prediction task. Language models like GPT predict the next token given the context (BERT, by contrast, predicts masked tokens), Automatic Speech Recognition (ASR) models like Whisper predict the text output given a waveform, and Text-to-Speech (TTS) models like Amazon Polly try to predict the spectrogram or waveform given the textual input and who it should sound like. There is always a task, a dependency, data, and a metric to measure that ...
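The task/data/metric framing above can be sketched with the simplest possible function approximator: a least-squares line fit, where the task is predicting y from x, the data are a handful of noisy points, and the metric is mean squared error. The numbers here are made up for illustration:

```python
# Task: predict y from x. Model family: y ≈ w*x + b. Metric: mean squared error.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]  # roughly y = 2x + 1 with a little noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for a line.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

# Metric: mean squared error of the fitted function on the data.
mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

print(round(w, 2), round(b, 2))  # w ≈ 2, b ≈ 1
```

An LLM or TTS model swaps in a vastly bigger function family and dataset, but the same four ingredients are always present.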
For my MSc dissertation, I am trying to synthesize speech with variable L2 accent strength. There are a few ways to approach this. The "fastest" and "easiest" way is to have the perfect dataset, with reliable labels for accent strength on each utterance. There are two problems with this approach. First, such a dataset would be incredibly difficult and expensive to create. Second, it is difficult to get a trusted label for "accent strength" that is also consistent across the dataset. What does accent strength even mean, and how could it be measured non-subjectively? I was talking about this problem with one of the PhD candidates in the CSTR group, and she told me to look into phonetic posteriorgrams. Since then, I have tried to read a bit about them, and I wanted to explain them in this post.

What is a Phonetic Posteriorgram?

A phonetic posteriorgram (PPG) is well described by its name. Phonetic indicates that we are talking about phonemes, or speech cate...
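To make the shape of a PPG concrete, here is a small Python sketch: a (frames × phonemes) matrix where each row is a probability distribution over a phoneme inventory. The inventory, frame count, and logits here are invented stand-ins; in practice the per-frame posteriors come from an acoustic model trained for phoneme classification:

```python
import math
import random

random.seed(0)

# Hypothetical phoneme inventory and frame count (stand-ins for illustration).
phonemes = ["sil", "ae", "k", "s", "eh", "n", "t"]
num_frames = 5

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# A PPG is a (frames x phonemes) matrix; each row is the posterior
# distribution over the phoneme inventory for one short analysis frame.
ppg = [softmax([random.gauss(0, 1) for _ in phonemes])
       for _ in range(num_frames)]

# The most likely phoneme per frame is the argmax of each row.
best = [phonemes[row.index(max(row))] for row in ppg]
print(best)
```

The appeal for accent work is that the PPG is a speaker-independent-ish description of *what* was said frame by frame, so deviations in it can hint at *how* it was pronounced.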
This is the paper that I wrote for my Speech Synthesis class in semester 2 at the University of Edinburgh. Building a unit selection voice was a lot of fun, especially getting to record our own voices. Learning unit selection helped me understand many of the basics and fundamentals behind state-of-the-art technologies, because they are all trying to solve similar problems with different approaches. Here are some audio samples from my synthesized voice. "She sells seashells by the seashore." "I scream. You scream. We all scream for ice cream." These are very robotic, but this was state of the art not that long ago, and it is cool that the voice is built from a database of my own recordings.

Abstract

Unit selection in speech synthesis, an aging technique dating back to 1996, remains a fundamental method for synthesizing speech. This paper explores the theoretical principles of unit selection, its practical implementation, and its implications for modern speech syn...
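As a rough sketch of what a unit selection engine optimizes, here is a toy dynamic-programming search over candidate units: it picks one recorded unit per target phone, minimizing the sum of target costs (mismatch with the desired phone context) and join costs (discontinuity between adjacent units). All unit names and costs are invented for illustration:

```python
# Toy unit selection: choose one candidate unit per target phone,
# minimizing total (target cost + join cost) with a Viterbi-style search.
targets = ["s", "iy", "z"]  # desired phone sequence

# Candidate units per target phone: (unit_id, target_cost). In a real
# system these come from a large database of labeled recordings.
candidates = {
    "s":  [("s_01", 0.2), ("s_02", 0.5)],
    "iy": [("iy_01", 0.4), ("iy_02", 0.1)],
    "z":  [("z_01", 0.3)],
}

def join_cost(prev_unit, unit):
    # Stand-in for an acoustic join cost (e.g. spectral distance at the
    # boundary); here, units from the same "recording" (same suffix) join free.
    return 0.0 if prev_unit.split("_")[1] == unit.split("_")[1] else 0.25

# Dynamic programming over the candidate lattice, keeping the best path
# ending in each candidate of the current phone.
best = [(tc, [uid]) for uid, tc in candidates[targets[0]]]
for phone in targets[1:]:
    new_best = []
    for uid, tc in candidates[phone]:
        cost, path = min(
            (prev_cost + join_cost(prev_path[-1], uid), prev_path)
            for prev_cost, prev_path in best
        )
        new_best.append((cost + tc, path + [uid]))
    best = new_best

total_cost, path = min(best)
print(path, round(total_cost, 2))  # → ['s_01', 'iy_01', 'z_01'] 0.9
```

Note how the cheapest "iy" unit in isolation (iy_02) loses to iy_01 once the join cost is counted, which is exactly the trade-off unit selection is designed to balance.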