I wanted to work through some practice problems and build a small portfolio of tech projects that showcase my recent learning at the University of Edinburgh, where I have taken several courses on natural language processing (NLP).
NLP is a broad term covering a variety of tasks: text classification, named entity recognition, machine translation, language modeling, and more. This project is a simple text classification task: input some text and have a model predict which language it is written in. You can find the code on my GitHub, and the steps I took to create the project below. I hope this helps with your learning too.
Step 1: Find data
Data is foundational to any project involving prediction, and preferably a lot of it. I wanted to predict language, and one corpus of language data I already knew about from school was the Europarl dataset: transcripts of European Parliament proceedings translated into the languages of the EU member states. This gave me a large set of sentences in 21 languages, a great starting point for a project like mine.
Step 2: Build a basic frontend
I didn't want to make things too complex, so I created a frontend.html and a frontend.css file with a simple form that, on submit, calls a /detect endpoint.
Step 3: Build a basic backend
I wanted the easiest possible backend setup, since I was starting with only one endpoint, /detect. My only other constraint was to keep everything in Python so I could practice my Python skills. I decided to use FastAPI because it is just as you'd imagine: the setup is FAST.
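Here is a minimal sketch of what that setup might look like: a single POST endpoint that accepts JSON with a "text" field. The file layout and names (app.py, DetectRequest) are my own placeholders, not necessarily what the project used; the model gets wired in during Step 5.

```python
# app.py -- minimal FastAPI skeleton (illustrative names, not the
# project's actual code). Run with: uvicorn app:app --reload
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class DetectRequest(BaseModel):
    text: str

@app.post("/detect")
def detect(request: DetectRequest):
    # Model inference is wired in later (see Step 5).
    return {"language": "placeholder", "confidence": 0.0}
```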
Step 4: Preprocess data
The dataset comes as one folder per language, each holding many text files of sentences.
I split this process into a few substeps. First, I combined all the texts from the dataset into one large .txt file per language.
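A sketch of that merge step, assuming the folder layout described above; the directory names ("europarl/", "combined/") are my guesses at the structure, not the project's actual paths.

```python
# Merge every .txt file under each language folder into one
# combined file per language (paths are illustrative).
from pathlib import Path

RAW_DIR = Path("europarl")
OUT_DIR = Path("combined")
OUT_DIR.mkdir(exist_ok=True)

for lang_dir in RAW_DIR.iterdir():
    if not lang_dir.is_dir():
        continue
    out_path = OUT_DIR / f"{lang_dir.name}.txt"
    with out_path.open("w", encoding="utf-8") as out:
        for txt_file in sorted(lang_dir.glob("*.txt")):
            out.write(txt_file.read_text(encoding="utf-8"))
            out.write("\n")
```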
Next, I took each large text file, created a cleaned version, and converted it into a CSV so it would load into a pandas DataFrame more easily.
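A sketch of one way to do that cleaning pass: Europarl files contain markup lines (speaker and chapter tags starting with "<"), so dropping those and writing the rest to a two-column CSV is a reasonable approach. The column names here are my choice.

```python
# Drop markup lines and write (text, language) rows to a CSV.
import csv
from pathlib import Path

def to_csv(txt_path: Path, lang: str, csv_path: Path) -> None:
    with txt_path.open(encoding="utf-8") as src, \
         csv_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["text", "language"])
        for line in src:
            line = line.strip()
            # Skip empty lines and Europarl markup like <SPEAKER ...>
            if line and not line.startswith("<"):
                writer.writerow([line, lang])
```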
Glancing at the cleaned data, I was disappointed to still see rows containing only a few characters or very short phrases like "(applause)", so I dropped all lines that short. After this last bit of cleaning, I randomly shuffled each language file and truncated it to 50,000 entries, mostly to keep things quick and simple.
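With pandas, that filtering, shuffling, and truncating takes only a few lines. A sketch below; the 20-character cutoff is my assumption, not the post's exact threshold.

```python
# Filter out very short rows, shuffle, and keep 50,000 per language.
import pandas as pd

def trim(csv_path: str, n: int = 50_000, min_len: int = 20) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    # Drop rows shorter than the cutoff (e.g. "(applause)")
    df = df[df["text"].str.len() >= min_len]
    # Shuffle, then truncate to n rows
    return df.sample(frac=1, random_state=42).head(n)
```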
The final step was to combine everything into one giant CSV with all the languages represented, and then split it roughly 80-20 into training and test sets for the model.
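A sketch of the combine-and-split step, reusing trim() from the previous snippet. Using scikit-learn's train_test_split is my choice here; stratifying by language keeps the 80-20 split balanced across all 21 languages. LANGS stands in for whatever list of language codes the dataset provides.

```python
# Concatenate per-language frames, then split ~80/20.
import pandas as pd
from sklearn.model_selection import train_test_split

frames = [trim(f"combined/{lang}.csv") for lang in LANGS]  # LANGS: language codes
full = pd.concat(frames, ignore_index=True)

train_df, test_df = train_test_split(
    full, test_size=0.2, stratify=full["language"], random_state=42
)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```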
Step 5: Train the fastText model
I needed a quick way to classify the input sentence, and came across fastText from Facebook Research. To quote their GitHub readme, "fastText is a library for efficient learning of word representations and sentence classification." The main appeal to me was how quickly I could get it running: I was able to clone the fastText repo, train a model, and use it for inference within 10 minutes (with half that time being actual training).
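I used the cloned repo's command-line tools, but the same workflow is available through fastText's Python bindings, sketched below. fastText's supervised format is one example per line, with the label prefixed as "__label__<lang>"; the conversion from my CSV is an illustrative helper of my own.

```python
# Convert the training CSV to fastText's supervised format and train.
import fasttext
import pandas as pd

def write_fasttext_file(df: pd.DataFrame, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for _, row in df.iterrows():
            # fastText expects: __label__<label> <text>
            f.write(f"__label__{row['language']} {row['text']}\n")

write_fasttext_file(pd.read_csv("train.csv"), "train.txt")
model = fasttext.train_supervised(input="train.txt")
model.save_model("langdetect.bin")
```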
After training, I needed just a few lines of Python to perform inference. All I had left to do was wrap those lines in a function and call it from my API endpoint, which takes the text to classify and returns a predicted language label and a confidence score.
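A sketch of that inference side, completing the /detect placeholder from Step 3 (model filename and response shape are my assumptions). fastText's predict returns parallel tuples of labels and probabilities, with the label carrying a "__label__" prefix to strip; it also rejects newlines in the input, hence the replace.

```python
# Load the trained model once and serve predictions from /detect.
import fasttext

model = fasttext.load_model("langdetect.bin")

@app.post("/detect")
def detect(request: DetectRequest):
    labels, probs = model.predict(request.text.replace("\n", " "))
    return {
        "language": labels[0].removeprefix("__label__"),
        "confidence": round(float(probs[0]), 4),
    }
```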
Testing
I speak Swedish, Czech, and my native English, so let's see how the same sentence, translated into each, holds up in the system.
Looks like it works well enough!
It was a fun project that didn't take too long to complete. In future iterations, I would love to try different text classification models and compare their accuracy.