Methods to computerize "little equipped" languages and groups of languages
In 2004, fewer than 1% of the world's roughly 6,000 languages benefit from the opportunities that computerization offers, which range from text processing to machine translation. This thesis, which focuses on the lesser-used languages - the "pi-languages" - proposes solutions to remedy their digital underdevelopment.
In the first part, intended to show the complexity of the problem, we present the diversity of languages, the technologies used, and the approaches of the various actors: linguistic populations, software publishers, the United Nations, States... A technique for measuring a language's degree of computerization - the sigma-index - is proposed, along with several optimization methods.
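To make the idea of a score out of 20 concrete, the sketch below computes an index as a weighted mean of per-service grades. This is only one plausible reading, not the definition given in the thesis: the `ServiceScore` type, the `sigma_index` function, the weights, and the grades are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceScore:
    """One service or resource assessed for a language (hypothetical structure)."""
    name: str
    weight: float  # assumed importance of the service for the users
    grade: float   # assumed satisfaction grade, from 0 to 20


def sigma_index(scores):
    """Weighted mean of the grades, on a 0-20 scale (illustrative assumption)."""
    for s in scores:
        if not 0 <= s.grade <= 20:
            raise ValueError(f"grade out of range for {s.name!r}")
    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        raise ValueError("at least one service must have a positive weight")
    return sum(s.weight * s.grade for s in scores) / total_weight


# Invented figures, not measurements from the thesis.
example = [
    ServiceScore("text processing", weight=10, grade=12),
    ServiceScore("machine translation", weight=5, grade=0),
    ServiceScore("spell checking", weight=5, grade=8),
]
print(sigma_index(example))
```

Under this reading, improving the grade of a heavily weighted service raises the index more than improving a marginal one, which is the kind of trade-off an optimization method would exploit.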
The second part deals with the computerization of the Laotian language and presents the concrete results obtained for this language by applying the methods described previously. These achievements helped improve the sigma-index of the Laotian language by approximately 4 points; the index is currently evaluated at 8.7/20.
In the third part, we show that an approach by groups of languages can reduce computerization costs through a modular architecture that combines existing general-purpose software with language-specific complements. For the most language-dependent parts, complementary generic lingware tools enable the populations to computerize their languages themselves. We validated this method by applying it to the syllabic segmentation of Southeast Asian languages with unsegmented writing systems, such as Burmese, Khmer, Laotian and Siamese (Thai).
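The split between a generic engine and a language-specific complement can be illustrated with a small sketch. It assumes a greedy longest-match strategy over a syllable inventory, which is a stand-in technique rather than the method of the thesis; the `segment` function and the romanized toy inventory are hypothetical.

```python
def segment(text, syllables):
    """Generic engine: greedy longest-match segmentation.

    `syllables` is the language-specific complement: a set of valid
    syllable strings supplied for each language.
    """
    if not syllables:
        raise ValueError("the syllable inventory must not be empty")
    max_len = max(len(s) for s in syllables)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in syllables:
                result.append(candidate)
                i += n
                break
        else:
            # Unknown character: emit it alone and move on.
            result.append(text[i])
            i += 1
    return result


# Toy romanized inventory, invented for illustration only.
toy_inventory = {"sa", "bai", "di"}
print(segment("sabaidi", toy_inventory))
```

Only the inventory changes from one language to another in this sketch; the engine itself is shared, which is where the cost reduction of a group-based approach would come from.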
Details | |
---|---|
Collation | 277 |
French | these_Berment.pdf |
French | |
Author(s) | Vincent Berment |
Publication year | 2004 |