Exploring multilingual NLP & Speech. PhDing @osunlp. Organizer and member @masakhanenlp, @mrl2024_emnlp, ex: Researcher @Intronhealth

Roaming
Joined April 2014
I highlighted some wins and ongoing work toward building AI that reflects African realities, and closed with open research questions we’re exploring. If this aligns with your interests, I’d love to connect and collaborate. Slides: docs.google.com/presentation…
Last week, I gave a talk at the Center for African Studies at OSU on **Going Beyond Text with AI for African Languages**. In particular, I shared how effective communication is multimodal and why AI systems in this space should support it too. Slides: 👇
Abraham Owodunni retweeted
ML research is an engineering discipline, not a philosophy seminar. You build, you test, you learn. Untested ideas are just speculation.
Abraham Owodunni retweeted
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training, and infra. 200+ pages of what worked, what didn’t, and how to make it run reliably huggingface.co/spaces/Huggin…
Abraham Owodunni retweeted
What's the secret to finding impactful work? Good judgment. How do you get good judgment? Experience. How do you get experience? Bad judgment
Join my lab! I’m currently recruiting 1-2 PhD students for admission in the fall of 2026 at @Mila_Quebec mila.quebec/en/prospective-s… Are you interested in multilingual NLP + VLMs / AI Safety? I would encourage you to apply. Deadline: December 1
Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit mila.quebec/en/prospective-s…
Come check out our poster on Multilingual Continual Learning at #COLM's MELT workshop today!!!
Abraham Owodunni retweeted
🚨 Attention aspiring PhD students 🚨 Meta / FAIR is looking for candidates for a joint academic/industry PhD! Keywords: AI for Math & Code, LLMs, RL, formal and informal reasoning. You will be co-advised by Prof. @Amaury_Hayat from École des Ponts and yours truly. You'll have the opportunity to collaborate with the excellent FAIR codegen team and fellow FAIR PhD students in Paris, and have access to state-of-the-art pre- and post-training infra and significant amounts of compute. A joint industry/academic PhD gives you the best of both worlds: academic freedom, open science & open source, a ton of compute, and talented colleagues working as a team. Ideal candidates should have strong engineering & experimentation skills, strong math skills, and a solid understanding of ML & RL foundations. We want to move fast, so apply ASAP at the link below!
Abraham Owodunni retweeted
Introducing AfriMed-QA – the first large-scale pan-African dataset designed to help evaluate & develop optimized and effective LLMs for African healthcare.
Do check out our work in collaboration with Google Research!!!
Ensuring generalization of LLMs in response to distribution shifts is especially important for medical and health-related models. Here we describe AfriMed-QA, an open-source benchmark question–answer dataset sourced from countries across Africa. More at goo.gle/4mRQCfv
I'm taking a computer vision class this semester, should be fun!
Completed the first chapter of my PhD! I'm eagerly looking forward to the many beautiful chapters ahead 💪🏽
Abraham Owodunni retweeted
🚨 Participants wanted! 🚨 💬 We're looking for feedback on our new multi-domain research proposal evaluator. Be first to test it using your own research ideas! 🎁 ~$25/hour task, repeatable 4 times (total $100) 📄 You must have 1+ published papers 👉 Sign up below!
Abraham Owodunni retweeted
I am really excited to share that my first research paper, under @ml_collective too, has been accepted to a workshop at #MICCAI, the world's largest medical imaging AI conference. Notably, it has also been selected for an oral presentation during the conference!
Abraham Owodunni retweeted
This paper introduces FlexiTokens, a language model that learns its own boundaries and shifts them during finetuning.

Subword tokenizers break when text looks different, so models waste compute on endless tiny pieces. FlexiTokens works at byte level, runs a lightweight transformer that marks possible split points, then pools bytes into variable segments before the usual layers.

Instead of forcing a fixed compression ratio, the authors add a hinge-style loss that only cares if a sequence gets too many splits, leaving extra freedom in the other direction. During adaptation the loss lets the boundary predictor loosen or tighten, so medical notes, Turkish verbs, or code get the chunk sizes they deserve.

Across 6 languages and 7 tasks the model cuts token counts by up to 2x and still lifts accuracy by about 10%. A 1B-parameter version even beats a larger static BPE setup while staying faster because input gets shorter. The same model handles unseen Urdu script without retraining a tokenizer, showing the approach is truly language agnostic.

----
Paper – arxiv.org/abs/2507.12720
Paper Title: "FLEXITOKENS: Flexible Tokenization for Evolving Language Models"
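To make the mechanism concrete, here is a minimal PyTorch sketch of the two pieces the summary describes: the lightweight byte-level boundary predictor and the one-sided hinge loss. This is my reading of the summary above, not the authors' code; all names (`BoundaryPredictor`, `target_ratio`, `pool_segments`) and the mean-pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    """Scores each byte position; high scores mark the start of a new segment."""
    def __init__(self, d_model: int = 128, nhead: int = 4):
        super().__init__()
        # Lightweight transformer over byte embeddings, as described above.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.score = nn.Linear(d_model, 1)

    def forward(self, byte_embeds):                      # (B, T, d)
        h = self.encoder(byte_embeds)
        return torch.sigmoid(self.score(h)).squeeze(-1)  # (B, T) split probs

def hinge_compression_loss(split_probs, target_ratio=0.25):
    """One-sided penalty: fires only when a sequence splits too often.
    target_ratio is an assumed knob (here: at most ~1 boundary per 4 bytes);
    splitting less is never penalized, leaving freedom in that direction."""
    split_rate = split_probs.mean(dim=-1)                # per-sequence rate
    return torch.clamp(split_rate - target_ratio, min=0).mean()

def pool_segments(byte_embeds, split_probs, threshold=0.5):
    """Mean-pool bytes between predicted boundaries into variable-length
    segments that would feed the usual transformer layers."""
    pooled = []
    for embeds, probs in zip(byte_embeds, split_probs):
        starts = (probs > threshold).nonzero(as_tuple=True)[0].tolist()
        bounds = [0] + [s for s in starts if s > 0] + [embeds.size(0)]
        segs = [embeds[a:b].mean(dim=0)
                for a, b in zip(bounds, bounds[1:]) if b > a]
        pooled.append(torch.stack(segs))
    return pooled  # ragged: one (num_segments_i, d) tensor per sequence

if __name__ == "__main__":
    x = torch.randn(2, 16, 128)      # 2 toy sequences of 16 byte embeddings
    predictor = BoundaryPredictor()
    probs = predictor(x)
    print(hinge_compression_loss(probs).item(),
          [s.shape for s in pool_segments(x, probs)])
```

The one-sided clamp is the key detail: it penalizes over-segmentation but never under-segmentation, which is what would let the boundary predictor drift toward longer chunks during finetuning.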
Abraham Owodunni retweeted
In order to celebrate the release of the print version of the Ultra-Scale Playbook (with which I have no affiliation and which I love deeply), I'm going to be giving away 5 copies! To enter, simply like + retweet this tweet. Winners will be selected at random at 10AM EST on the 13th.
Abraham Owodunni retweeted
This talk will explore the current progress and persistent challenges in developing Natural Language Processing (NLP) technologies for African languages. With over 2,000 languages spoken across the continent, most remain underrepresented in the AI ecosystem due to limited data,
This!!!
His prompt can’t be the same as yours bro. We all used calculators in maths tests but we all didn’t get 100%.
I’ll disagree with this. When transformers came out, people said they were too expensive to train, but guess what: 14B models are the new “small” LMs. That expense also inspired several efficiency works, including LoRA. Likewise, newer methods for byte-level token interpretation will emerge!
Killing tokenizers is a bad idea. You're just replacing the input features with something less interpretable (chunked bytes). So if you have trouble interpreting token sequences right now, I don't know what you think will happen once you switch to bytes.
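To make that argument concrete, a toy contrast; the subword split here is illustrative, not the output of any real tokenizer:

```python
text = "unbelievable"

# Subword view: pieces that roughly align with morphemes a human can read.
subwords = ["un", "believ", "able"]

# Byte view: the same string as raw byte values, which is what a byte-level
# model sees before any learned chunking.
byte_ids = list(text.encode("utf-8"))
print(subwords)  # ['un', 'believ', 'able']
print(byte_ids)  # [117, 110, 98, 101, 108, 105, 101, 118, 97, 98, 108, 101]
```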
I'm super proud of the amazing work @paul_okewunmi, @FavourJhay and other community members at @ml_collective NG did on this project! You guys deserve this award!!
🏆 We won the best paper award at AfricaNLP!! Huge shoutout to @AbrahamOwos and the @ml_collective (Nigeria) community, this idea was first shared on our discord. Grateful to see it grow into something impactful.✨