African researchers release 9,000-hour speech dataset for AI in 18 languages

African researchers have released what is believed to be the largest dataset of African languages for AI development, capturing 9,000 hours of speech across 18 languages from Kenya, Nigeria, and South Africa. The $2.2 million initiative, funded by the Gates Foundation, addresses a critical gap in AI accessibility: most current AI tools, such as ChatGPT, are trained primarily on English and other European languages, leaving millions of Africans excluded from the AI revolution.

Why this matters: With Africa home to over a quarter of the world’s languages—more than 2,000 in total—the lack of African language representation in AI creates barriers to essential services and economic opportunities for hundreds of millions of people.

The challenge: Most African languages are primarily spoken rather than written, creating a data scarcity problem for AI training.

  • AI systems require vast quantities of text data to function effectively, but African languages lack the extensive online written content available for English, Chinese, and European languages.
  • “We think in our own languages, dream in them and interpret the world through them. If technology doesn’t reflect that, a whole group risks being left behind,” explains Prof Vukosi Marivate of the University of Pretoria.

What the project accomplished: The African Next Voices initiative brought together linguists and computer scientists to create AI-ready datasets capturing everyday scenarios in farming, health, and education.

  • Languages recorded include Kikuyu and Dholuo in Kenya, Hausa and Yoruba in Nigeria, and isiZulu and Tshivenda in South Africa—some spoken by millions of people.
  • The team gathered voices from different regions, ages, and backgrounds to ensure inclusivity, according to computational linguist Lilian Wanzare.
  • The data will be open access, allowing developers to build tools that translate, transcribe, and respond in African languages.

Real-world applications: Indigenous language AI tools are already solving practical challenges across the continent.

  • Farmer Kelebogile Mosime uses the AI-Farmer app, which recognizes several South African languages including Sesotho, isiZulu, and Afrikaans, to troubleshoot farming problems on her 21-hectare vegetable operation.
  • “Daily, I see the benefits of being able to use my home language Setswana on the app when I run into problems on the farm, I ask anything and get a useful answer,” Mosime explains.

The business case: South African company Lelapa AI is building AI tools in African languages for banks and telecoms, highlighting the economic barriers created by language exclusion.

  • “English is the language of opportunity. For many South Africans who don’t speak it, it’s not just inconvenient—it can mean missing out on essential services like healthcare, banking or even government support,” says CEO Pelonomi Moiloa.

Cultural preservation concerns: Beyond practical applications, researchers warn that excluding indigenous languages from AI development risks losing cultural knowledge and worldviews.

  • “Language is access to imagination,” Prof Marivate notes. “It’s not just words, it’s history, culture, knowledge. If indigenous languages aren’t included, we lose more than data; we lose ways of seeing and understanding the world.”