What happens to the promise of an inclusive and multilingual knowledge society when just ten languages make up 82 percent of all online content? Natural Language Processing researcher Kathleen Siminyu found an answer in the words of the Kenyan writer Ngũgĩ wa Thiong'o. In his book Decolonising the Mind: The Politics of Language in African Literature, Ngũgĩ wa Thiong'o wrote, “The effect of a cultural bomb is to annihilate a people's belief in their names, in their languages, in their environment, in their heritage of struggle, in their unity, in their capacities and ultimately in themselves.”
Ideally, the ability to deal with human language should be an essential attribute of all information and communication technologies. Yet although more than 7,000 languages are spoken today, very few flourish in the digital world or have associated language technologies.
A shared concern about the digital language divide brought together researchers, youth, representatives of indigenous communities, civil society, governments and international organisations to discuss ideas for strengthening access to information in low-resourced languages through the use of Artificial Intelligence technologies and approaches.
Dorothy Gordon, the Chairperson of the UNESCO intergovernmental Information for All Programme, underlined that “language is the main vector of communication and the transmission of heritage and knowledge and its use in new technology determines the degree of access and participation in knowledge societies.”
One of the main challenges in realising greater access to information in low-resourced languages is the lack of datasets with which to develop language technology tools in these languages. Going to the heart of this problem for low-resourced African languages, Prof. Joyce Nakatumba-Nabende at Makerere University in Uganda is leading a research group developing machine transliteration, machine translation, grammatical frameworks and spell checkers in Luganda, Acholi, Lumasaaba and Runyankore-Rukiga. She discussed how some of her research group's efforts in voice data processing are yielding results in spotting crop pests and diseases, and in understanding perceptions, trends and mentions of the COVID-19 pandemic through the analysis of discussions on local radio stations in Uganda.
Developing datasets when content in many of these languages is scarce is a daunting task that requires innovative approaches. Kathleen Siminyu, Regional Coordinator for the AI4D network in Africa, emphasised the potential of participatory approaches to dataset development in low-resourced languages. She highlighted Masakhane and the UNESCO-supported African language dataset development project as examples of participatory approaches in this field. As a tool for participatory language technology development, Kathleen presented a framework with entry points for content creators, translators, curators, language technologists and evaluators to work together on some of the challenges of language development.
Framework for participatory development of language datasets presented by Kathleen Siminyu
Roy Boney Jr., Programme Manager at the Language Department of the Cherokee Nation, shared his experience in leveraging technology tools and community engagement practices in Cherokee language revitalisation efforts. Thanks to the efforts of the community, the Cherokee language, with only about 2,000 fluent speakers, is available on different technology platforms and has its own newspaper, radio shows, television shows and animations for engaging young speakers. His experience in promoting the Cherokee language demonstrates the importance of organised advocacy and policy support for the revitalisation of indigenous languages.
Subhashish Panigrahi, a documentary filmmaker and proponent of openness, discussed his work with indigenous communities in India. He had four key messages: the need for greater understanding of digital rights among indigenous communities, the need for media and information literacy, the protection of the intellectual property of indigenous communities, and the need to ensure that content developed by researchers working with indigenous communities is openly available to those communities for their benefit.
The open discussions built a shared understanding that cooperation between civil society and research institutes to solve problems facing local communities needs to be strengthened. Further, novel data collection models based on participatory and multistakeholder approaches, capable of creating datasets for AI that respect international norms for privacy and data protection, need to be fostered. The workshop was a stepping stone towards greater cooperation between the different stakeholders working on questions of access to information and multilingualism. As a follow-up to the discussion, concrete projects and exchanges of ideas will be anchored in the work of the Open for Good Alliance, to be launched on 25 November 2020.
The workshop was jointly organised by UNESCO, GIZ, and youth and civil society partners from Australia, South Korea and Pakistan at the Internet Governance Forum on 11 November 2020.
- Video recording of the presentation is available on the IGF YouTube Channel
- UNESCO’s project on Development of Datasets for Low Resourced African Languages