
Millions on the African continent can’t fully benefit from the AI revolution. This Princeton course aims to change that.

Princeton postdoc Happy Buzaaba has devised a new Freshman Seminar based on his research focused on introducing more African languages into LLMs.

As the AI revolution transforms the digital world, millions of people on the African continent cannot tap its full promise because the languages they speak aren’t built into the large language models (LLMs) that drive services like ChatGPT. A Princeton postdoc and a new course he devised are focused on changing that.

Happy Buzaaba’s research as a data engineer is centered on introducing more African languages into LLMs. There are over 7,000 languages in the world today, he said, and around 2,000 of these are spoken on the African continent. Yet no more than 20 African languages are currently represented in commercial LLMs, though efforts in both academia and industry aim to increase this number.


Co-instructor Mahiri Mwita, a senior lecturer at the Princeton Institute for International and Regional Studies, mentored Buzaaba in his first teaching experience in the classroom.

“People shouldn’t need to switch to another language for them to interact with technology,” Buzaaba said. “For these African communities, it’s not just about the lack of daily convenience. It’s also a barrier for them to enter and interact with the digital world.”

Buzaaba is also part of a grassroots organization called Masakhane, where he has been working since 2020 with scholars on various projects aimed at advancing African languages in LLMs.

He came to Princeton in fall 2023, hosted by the Center for Digital Humanities and the Princeton African Humanities Colloquium. He is also affiliated with the Africa World Initiative, a multifaceted University initiative to cultivate partnerships with wide-ranging impact in and beyond Africa.

This fall he brought his work into the classroom with a new Freshman Seminar, “Teaching Computers to Understand African Languages,” co-taught with Mahiri Mwita, a senior lecturer at the Princeton Institute for International and Regional Studies.

The new course focuses on the technical, ethical, and logistical issues and challenges around increasing the representation of African languages in LLMs, and on increasing access to the benefits of AI through smartphones and other devices.

Mwita mentored Buzaaba in his first semester as a teacher, and taught the students basic Swahili along with an overview of the history of African languages.

Complexities and collaboration

“Creating new LLMs with African languages is inherently difficult because African languages are generally spoken only and do not have a good textual presence on the internet,” Buzaaba said. Most LLMs are created through a painstaking process that involves training machines on massive amounts of text and data downloaded from the internet — everything from social media posts to news articles.

During one class session in early November, Buzaaba introduced the students to an element of LLM development called information extraction and named-entity recognition. This process identifies entities and their semantic relationships within text, organizing the information into a structured form, such as a “knowledge graph,” that enables further inference.
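As a rough illustration of what "a structured form that enables further inference" can mean, the Python sketch below stores a few hand-written entities and relations as a tiny knowledge graph and follows its edges to reach a conclusion no single annotation states. The entity labels, relation names and the `locations_of` helper are hypothetical and chosen only for illustration; real systems extract such annotations from text with trained models rather than listing them by hand.

```python
from collections import defaultdict

# Hypothetical hand-written annotations (illustrative only).
# Entity -> type label
entities = {"Sunbird AI": "ORG", "Uganda": "LOC", "Africa": "LOC"}

# (subject, relation, object) triples linking the entities
triples = [
    ("Sunbird AI", "based_in", "Uganda"),
    ("Uganda", "located_in", "Africa"),
]

# Build an adjacency list: subject -> [(relation, object), ...]
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))


def locations_of(entity):
    """Follow based_in / located_in edges to collect every containing place."""
    found = []
    frontier = [entity]
    while frontier:
        node = frontier.pop()
        for rel, obj in graph[node]:
            if rel in ("based_in", "located_in") and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


# Inference: the graph never says "Sunbird AI is in Africa" directly,
# but chaining the two edges yields it.
print(locations_of("Sunbird AI"))
```

Here the structure, not any single sentence, supplies the inferred fact that the organization is in Africa.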

“Once you have this annotation,” Buzaaba said, “you’re basically taking the knowledge of a linguist and giving it to a computer.”

During an in-class exercise, Olamide Falayi and Emilia Reay worked together annotating sentences in English and in Yoruba. For example, in the sentence “Today, I cried,” they highlighted the word “today” and gave it the label “date,” in both the English and Yoruba sentences.
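The students' annotation can be written down in a form a computer can read. The sketch below assumes the BIO tagging convention, which is common in named-entity recognition but is not confirmed as the scheme used in class; the `tokenize` and `bio_tags` helpers are hypothetical. Only the English sentence is shown, and the same function would apply to the Yoruba sentence once its tokens and span are supplied.

```python
import re


def tokenize(text):
    # Split into word tokens and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def bio_tags(tokens, spans):
    """Convert labeled spans into one tag per token.

    spans: list of (start_token, end_token_exclusive, label).
    B- marks the first token of an entity, I- any following
    tokens of the same entity, and O everything else.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = tokenize("Today, I cried")
# The annotators highlighted the first token and labeled it DATE.
tags = bio_tags(tokens, [(0, 1, "DATE")])

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```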

The purpose of the annotation exercise was not so much for the students to master the task but to experience the challenges that professional annotators face. “In my project with Masakhane, we have people annotating separately and then coming together to look at their annotation disagreements because there are always disagreements,” Buzaaba said. “So, we do this kind of practice in class.”
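One common way to quantify such disagreements is Cohen's kappa, which measures how often two annotators agree beyond what chance alone would produce. The sketch below is an assumption about how this could be measured, not a description of Masakhane's actual workflow, and the two label sequences are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Fraction of tokens where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented per-token labels from two hypothetical annotators.
annotator_a = ["DATE", "O", "O", "O", "PER", "O"]
annotator_b = ["DATE", "O", "O", "O", "O", "O"]

# Positions the annotators would need to discuss together.
disagreements = [
    i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b
]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}, disagreements at positions {disagreements}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so listing the positions where labels differ gives the annotators a concrete agenda for their review session.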

In a different class session, the students learned “how certain properties of some African languages — for example, agglutination — make it difficult for computers to recognize them,” said Reay, who plans to study Spanish, linguistics and journalism.
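Swahili, which the students studied in the course, gives a small picture of the problem. In an agglutinative language a single written word can pack subject, tense, object and verb root together, so a program that splits text on spaces sees many unrelated "words" where a speaker sees one verb. The sketch below is a toy: the pattern is hypothetical, covers only the root "pend" (love) and a handful of markers, and is not how production systems segment text.

```python
import re

# Six Swahili verb forms built on the same root, "pend" (love):
# I love you, I loved you, I will love you,
# he/she loves you, we love you, they loved you.
words = [
    "ninakupenda", "nilikupenda", "nitakupenda",
    "anakupenda", "tunakupenda", "walikupenda",
]

# Splitting on whitespace treats every form as a separate vocabulary item.
print("distinct surface forms:", len(set(words)))

# Toy pattern: subject marker, tense marker, optional object marker,
# root, final vowel. Hypothetical and far from complete.
VERB = re.compile(r"(ni|u|a|tu|m|wa)(na|li|ta)(ni|ku|m|tu|wa)?(pend)(a)")


def segment(word):
    """Split a word into morphemes if it fits the toy pattern."""
    match = VERB.fullmatch(word)
    if not match:
        return [word]
    return [group for group in match.groups() if group]


for word in words:
    print(word, "->", "-".join(segment(word)))

# After segmentation, all six forms visibly share one root.
roots = {segment(word)[-2] for word in words}
print("distinct roots:", len(roots))
```

With little text available to learn from, a model may see each of these forms only rarely, which is one reason such languages are harder to support.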


Emilia Reay (left) and Olamide Falayi annotate sentences in English and in Yoruba during an in-class exercise focused on information extraction and named-entity recognition.

Learning from experts

At the start of the class session on annotation, Ernest Mwebaze, executive director of the Uganda-based startup Sunbird AI and one of several guest speakers throughout the semester, told the students about Sunbird’s mission to develop customized, practical AI tools for everyday Ugandans — from street vendors to farmers.


A student takes notes during a presentation by guest speaker Ernest Mwebaze, executive director of the Uganda-based startup Sunbird AI, which develops customized, practical AI tools for everyday Ugandans — from street vendors to farmers.

For example, these tools might help users get a text summary of a radio program on their phone in their local language, or upload a photo of a crop pest and get information about it.

“Language is about people, people are important,” Mwebaze said. “Uganda has the 10th highest linguistic diversity in the world, about 45 languages.” Sunbird’s goal is to build language technology resources in Uganda that can be used throughout Africa.

Later that afternoon, the students attended Mwebaze’s public talk in the Center for Digital Humanities’ African Languages in the Age of AI speaker series, part of the “Humanities for AI” initiative taking place this year to mark the center’s 10th anniversary.

Buzaaba’s course is part of the Freshman Seminars, designed to give first-year students the opportunity to participate in small seminars focused on special interest topics.

Reay said she learned about these “incredibly niche” courses at the University in 2023 while participating in the Princeton Summer Journalism Program for students from lower-income backgrounds, which clinched her interest in applying to Princeton.

Falayi, a prospective computer science major whose parents immigrated to the U.S. from Nigeria and speak Yoruba, said she chose this course to explore “two subjects I care deeply about: Africa and machine learning.”

In the classroom and beyond

Buzaaba said he hopes the course will encourage students to take a more discerning view of the digital landscape.

“Whether they pursue anthropology, physics or another field, I hope they will continue to consider these questions of where data on the internet comes from, the quality of the data, the ethics of gathering data,” he said. “In class, we talked about the fact that they don’t necessarily have to be computer scientists to contribute.”

This summer, Buzaaba and Mwita will expand the Freshman Seminar into a six-week Global Seminar, offered through the Princeton Institute for International and Regional Studies, at Maseno University in Kenya.

In “Technology for African Languages in the Digital Age,” Princeton students will collaborate with Kenyan students on projects that enhance the representation of African languages in emerging language technologies.


Buzaaba (center) hopes the course will encourage students to take a more discerning view of the digital landscape, no matter what field they pursue. “In class, we talked about the fact that they don’t necessarily have to be computer scientists to contribute,” he said.