Céline Leuzinger

NLP engineer | Teacher | Data Enthusiast

About

Hi there!

My name is Céline and I am a recent graduate in Natural Language Processing. I am a language lover, and by that I mean any kind of language, be it a programming or a natural language.

I also love teaching, even more so when it comes to teaching programming languages. This portfolio gives you a quick overview of my projects, research and experience. If you would like to chat about any of these, do not hesitate to reach out!

Projects and research

Current projects:

Fine-tuning m-DeBERTa-V3 for Grammatical Error Detection (started June 2023)
Keywords: m-DeBERTa-V3, Deep Learning, Grammatical Error Detection, PyTorch
In short: since its release, m-DeBERTa-V3 has proven to be one of the best-performing LLMs available (see He et al., 2023). One area in which this model has not yet been tested is Grammatical Error Detection (GED). We can hypothesise that m-DeBERTa-V3 should perform exceptionally well on GED, since it combines ideas from two SOTA LLMs for grammatical error checking: XLM-RoBERTa (Conneau et al., 2020) and ELECTRA (Clark et al., 2020). I am currently fine-tuning m-DeBERTa-V3 for Grammatical Error Detection, using a Swedish dataset provided by Språkbanken.
Current tasks: testing different loss functions (column-wise MSE and cross-entropy loss) and different padding lengths. Preliminary result for precision: 0.36.
Wanna know more? Check it out on GitHub.
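To give a flavour of the loss-function comparison: below is a minimal PyTorch sketch of the two token-level losses under test, with padded positions masked out. The function names and the reading of "column-wise MSE" as a squared error between per-label probabilities and one-hot targets, averaged per label column, are my illustrative assumptions, not the project code.

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, labels, attention_mask):
    # logits: (batch, seq_len, num_labels); labels: (batch, seq_len)
    # Cross-entropy computed only on non-padded token positions.
    active = attention_mask.view(-1).bool()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1))[active],
        labels.view(-1)[active],
    )

def masked_mse_loss(logits, labels, attention_mask):
    # Squared error between per-label probabilities and one-hot
    # targets, averaged over the label "columns", padding zeroed out.
    probs = logits.softmax(dim=-1)
    targets = F.one_hot(labels, num_classes=logits.size(-1)).float()
    mask = attention_mask.unsqueeze(-1).float()
    sq_err = (probs - targets) ** 2 * mask
    return sq_err.sum() / (mask.sum() * logits.size(-1))
```

Because both losses ignore padded positions, the choice of padding length should only affect memory and speed, not the loss value itself.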

The Alphabet Challenge (started October 2023)
Keywords: Python, Pygame, quiz, alphabets
In short: a fun side-project where I build a quiz using Pygame. And of course, it is about languages - and more precisely, about alphabets. Can you recognize them all?
Current tasks: working on the game interaction to make the interface more dynamic. A score counter is also coming soon!
Wanna play? Keep an eye on the progress on GitHub.
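The upcoming score counter boils down to a small piece of Pygame-independent logic: check each guessed alphabet against the answer key and tally the hits. A toy sketch, where the question pool and function names are illustrative assumptions rather than the actual game code:

```python
# Sample answer key: symbol -> alphabet it belongs to (illustrative data).
QUESTIONS = {
    "Ω": "Greek",
    "Я": "Cyrillic",
    "א": "Hebrew",
}

def check_answer(symbol, guess):
    # Case-insensitive match of the guessed alphabet name.
    return QUESTIONS[symbol].lower() == guess.strip().lower()

def play_round(answers):
    # Tally one point per correctly identified symbol.
    score = 0
    for symbol in QUESTIONS:
        if check_answer(symbol, answers.get(symbol, "")):
            score += 1
    return score
```

In the real game, the Pygame event loop would collect the player's guesses and hand them to logic like this, keeping rendering and scoring cleanly separated.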

Past projects:

Master's thesis
Untangling LLMs: how good is XLM-RoBERTa when it comes to error checking? (May 2023 - September 2023)
Keywords: LLMs, Deep Learning, Grammatical Error Detection, Linguistics, Data Analysis
In short: the LLM XLM-RoBERTa is the state-of-the-art of multilingual grammatical error detection. But in which areas of the grammar does the model struggle the most?
Results: the model missed more errors related to syntax, morphology and lexicon than punctuation and orthography. More specifically, the model seems to struggle with tense coordination, agreement, missing words and prepositions.
Wanna know more? Check it out on GitHub.

Comparing LLMs (May 2023)
Keywords: LLMs, BERT, ELECTRA
In short: it is hard to believe that LLMs did not exist a decade ago, given the number of language models released in the last few years. In fact, it is sometimes hard to pick the right language model for a given task, since the differences and similarities between them often reside in complex architectural properties. But as an LLM enthusiast, I happily delved into numerous academic articles to present an overview table of LLMs, from BERT to m-DeBERTa-V3, without forgetting non-BERT models such as ELECTRA.
Wanna check it out? The table is available on GitHub.

A dialogue-based language game (March 2023 - April 2023)
Keywords: Azure, XState, JavaScript, Dialogue System
In short: what is even better than being passionate about languages? Sharing this passion with others. Here is a Swedish vocabulary game, coded in JavaScript using Microsoft Azure and the XState library. Choose a topic, and tell me what you see in the images - in Swedish, please! Note that the game requires basic knowledge of Swedish, but a demo will be uploaded soon for those who don't speak it!
Wanna know more? Code available on GitHub. Wanna play? Click! Note that you may need to clear the cache to see the images properly.
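The core of the game is a state machine, which XState expresses as a statechart. The same flow can be sketched language-agnostically; here is a toy Python version where the states, topics and vocabulary are illustrative assumptions, not the actual XState chart:

```python
class VocabGame:
    """Toy state machine: choose_topic -> ask -> done."""

    def __init__(self, topics):
        self.topics = topics  # topic -> {image: Swedish word}
        self.state = "choose_topic"
        self.queue = []
        self.score = 0

    def choose(self, topic):
        # Transition: choose_topic -> ask, loading the topic's prompts.
        assert self.state == "choose_topic"
        self.queue = list(self.topics[topic].items())
        self.state = "ask"

    def answer(self, guess):
        # Score the guess for the current image; transition to done
        # once every prompt has been answered.
        assert self.state == "ask"
        _image, word = self.queue.pop(0)
        if guess.strip().lower() == word:
            self.score += 1
        if not self.queue:
            self.state = "done"
```

In the actual game, each transition is driven by speech events from the Azure speech services rather than typed input, but the statechart shape is the same.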

Thesis by research
The Morphosyntax of Luwo (September 2021 - September 2022)
Keywords: Fieldwork, Data Collection, Excel, Morphosyntax
In short: the Luwo language, spoken in South Sudan, has remained mostly unknown to Western linguistics. With more languages disappearing each year, it is of crucial importance to collect linguistic data on small languages, so that they can remain documented even if the number of speakers decreases. This is a quest I embarked on as part of the Nilotic Research group at the University of Edinburgh.

Master's thesis
Language complexity in the era of large typological databases (June 2021 - September 2021)
Keywords: Database, WALS, Language Complexity, Complexity Metrics
In short: what are the hardest languages on the planet? This question has kept linguists awake at night for decades. Over the years, researchers have attempted to create metrics to assess linguistic complexity, based on large typological databases such as WALS. But are these metrics reliable? How can we use large databases to measure complexity in the most objective way? These are some of the questions I attempted to answer in my Master's thesis on linguistic complexity.

Skills

Programming languages
Main programming languages:

Python Advanced
Used in: MSc in Natural Language Processing, MSc in Linguistics, and nearly all of my side projects
Libraries: PyTorch, pandas, scikit-learn, transformers, etc.

JavaScript Intermediate
Used in: MSc in Natural Language Processing, dialogue-based language game
Libraries: XState

Other languages: SQL, R, C#, C, HTML

Development tools: Git, Excel, VS Code, Jupyter Notebook, PyCharm

Cloud: Azure

Natural languages

French Native

English Native-like, C2
Certificate: Cambridge Advanced Certificate (CAE), obtained June 2016

German Advanced, C1
Certificate: Language certificate from the University of Heidelberg, level C1, obtained March 2019

Swedish Intermediate, B2
Certificate: Language certificate from the University of Gothenburg, level B1, obtained January 2023

Italian Intermediate, B2
Certificate: Language certificate from the University of Siena, level B1, obtained January 2019