Posts

2024 Reflections

2024 will always be an exceptional and unforgettable year for me. It stands out in large part because we spent nearly the entire year living in Edinburgh, Scotland. It would be impossible to compress our year into a post of highlights, but here we go. About a year ago, I was flying back to Scotland to start my second semester at the University of Edinburgh. The second semester was a smooth continuation of the first great semester, and I learned so much about machine learning, linguistics, and speech. Another part of 2024 was coming to understand academia in a deeper and more meaningful way. Sometime in the second semester, I started to seriously consider a PhD at KTH in Stockholm this year. The experiences and people I met in my program, my professors, everyone in the CSTR group at Edinburgh, and the very interesting topics I was learning about got me quite interested in pursuing more education. During t...
Have you ever seen a .EPUB file? EPUB stands for electronic publication, and the format contains all the information required for an electronic document. I was looking into EPUB files because I am working on a personal project that involves reading and language learning. I wanted to create a basic e-reader that works with the EPUB standard.

The EPUB Standard

The EPUB format became the standard for creating and distributing digital documents, especially books. The standard is currently maintained by the W3C (World Wide Web Consortium) community. There have been several revisions of the standard, each improving on the previous version; the current version is 3.3, as of 2023. If you look at an EPUB file with a text editor you get this: This garbled encoding is a good indication that the file is a compressed version of what you want. When you unzip the EPUB file, you get the following files and folders: What makes EPUB particularly powerful is its foundation in web technologies. At its core, an E...
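As a minimal sketch of the idea that an EPUB is just a zip archive of web-style files, here is a Python snippet that builds a tiny EPUB-like archive in memory and reads it back. The file names and contents are illustrative stand-ins, not a complete, valid EPUB:

```python
import io
import zipfile

# Build a tiny EPUB-like archive in memory (paths and contents are
# illustrative; a real EPUB has a full OCF container and package document).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("META-INF/container.xml", "<container/>")
    z.writestr("OEBPS/chapter1.xhtml", "<html><body>Hello</body></html>")

# Reading it back shows the EPUB is just a zip of web-style files.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    mimetype = z.read("mimetype").decode()

print(names)
print(mimetype)  # → application/epub+zip
```

Any standard unzip tool (or Python's `zipfile`) can open a real EPUB the same way, which is exactly why the raw bytes look like compressed noise in a text editor.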
The rise of Large Language Models (LLMs) in recent years has certainly made a lot of people see the world differently. I believe that many people are more interested in statistics, machine learning, computer science, and linguistics because of these fantastic advancements, myself included. Having studied intensely at the University of Edinburgh over the last year, I have come to understand more about what these kinds of models do, and how they relate to the bigger picture of modeling functions in general. All of these models perform some form of function approximation or prediction task. Language models like GPT predict the next token given the context (BERT, by contrast, predicts masked tokens), Automatic Speech Recognition (ASR) models like Whisper predict the text output given a waveform, and Text-to-Speech (TTS) models like Amazon Polly try to predict the spectrogram or waveform given the textual input and who it should sound like. There is always a task, a dependency, data, and a metric to measure that ...
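The task/data/metric framing above can be sketched with the simplest possible function approximator: a least-squares line fit, where the task is predicting y from x, the data are a handful of noisy points, and the metric is mean squared error. The numbers here are made up for illustration:

```python
# Task: predict y from x. Model family: y ≈ w*x + b. Metric: mean squared error.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]  # roughly y = 2x + 1 with a little noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for a line.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

# Metric: mean squared error of the fitted function on the data.
mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

print(round(w, 2), round(b, 2))  # w ≈ 2, b ≈ 1
```

An LLM or TTS model swaps in a vastly bigger function family and dataset, but the same four ingredients are always present.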
For my MSc dissertation, I am trying to synthesize speech with variable L2 accent strength. There are a few ways to approach this. The "fastest" and "easiest" way is to have the perfect dataset, with reliable labels for accent strength on each utterance. There are two problems with this approach. First, such a dataset would be incredibly difficult and expensive to create. Second, it is difficult to get a trusted label for "accent strength" that is also consistent across the dataset. What does accent strength even mean, and how could it be measured non-subjectively? I was talking about this problem with one of the PhD candidates in the CSTR group, and she told me to look into phonetic posteriorgrams. Since then, I have tried to read a bit about them, and I wanted to explain them in this post.

What is a Phonetic Posteriorgram?

A phonetic posteriorgram (PPG) is well described by its name. Phonetic indicates that we are talking about phonemes, or speech cate...
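To make the shape of a PPG concrete, here is a small Python sketch: a (frames × phonemes) matrix where each row is a probability distribution over a phoneme inventory. The inventory, frame count, and logits here are invented stand-ins; in practice the per-frame posteriors come from an acoustic model trained for phoneme classification:

```python
import math
import random

random.seed(0)

# Hypothetical phoneme inventory and frame count (stand-ins for illustration).
phonemes = ["sil", "ae", "k", "s", "eh", "n", "t"]
num_frames = 5

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# A PPG is a (frames x phonemes) matrix; each row is the posterior
# distribution over the phoneme inventory for one short analysis frame.
ppg = [softmax([random.gauss(0, 1) for _ in phonemes])
       for _ in range(num_frames)]

# The most likely phoneme per frame is the argmax of each row.
best = [phonemes[row.index(max(row))] for row in ppg]
print(best)
```

The appeal for accent work is that the PPG is a speaker-independent-ish description of *what* was said frame by frame, so deviations in it can hint at *how* it was pronounced.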
This is the paper that I wrote for my Speech Synthesis class in semester 2 at the University of Edinburgh. Building a unit selection voice was a lot of fun, especially getting to record our own voices. Learning unit selection helped me understand many of the basics and fundamentals behind state-of-the-art technologies, because they are all trying to solve similar problems with different approaches. Here are some audio samples from my synthesized voice. "She sells seashells by the seashore." "I scream. You scream. We all scream for ice cream." These are very robotic, but this was state of the art not that long ago, and it is cool that the voice is built from a database of my own recordings.

Abstract

Unit selection in speech synthesis, an aging technique dating back to 1996, remains a fundamental method for synthesizing speech. This paper explores the theoretical principles of unit selection, its practical implementation, and its implications for modern speech syn...
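As a rough sketch of what a unit selection engine optimizes, here is a toy dynamic-programming search over candidate units: it picks one recorded unit per target phone, minimizing the sum of target costs (mismatch with the desired phone context) and join costs (discontinuity between adjacent units). All unit names and costs are invented for illustration:

```python
# Toy unit selection: choose one candidate unit per target phone,
# minimizing total (target cost + join cost) with a Viterbi-style search.
targets = ["s", "iy", "z"]  # desired phone sequence

# Candidate units per target phone: (unit_id, target_cost). In a real
# system these come from a large database of labeled recordings.
candidates = {
    "s":  [("s_01", 0.2), ("s_02", 0.5)],
    "iy": [("iy_01", 0.4), ("iy_02", 0.1)],
    "z":  [("z_01", 0.3)],
}

def join_cost(prev_unit, unit):
    # Stand-in for an acoustic join cost (e.g. spectral distance at the
    # boundary); here, units from the same "recording" (same suffix) join free.
    return 0.0 if prev_unit.split("_")[1] == unit.split("_")[1] else 0.25

# Dynamic programming over the candidate lattice, keeping the best path
# ending in each candidate of the current phone.
best = [(tc, [uid]) for uid, tc in candidates[targets[0]]]
for phone in targets[1:]:
    new_best = []
    for uid, tc in candidates[phone]:
        cost, path = min(
            (prev_cost + join_cost(prev_path[-1], uid), prev_path)
            for prev_cost, prev_path in best
        )
        new_best.append((cost + tc, path + [uid]))
    best = new_best

total_cost, path = min(best)
print(path, round(total_cost, 2))  # → ['s_01', 'iy_01', 'z_01'] 0.9
```

Note how the cheapest "iy" unit in isolation (iy_02) loses to iy_01 once the join cost is counted, which is exactly the trade-off unit selection is designed to balance.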