Publication:

Enhancing Low-Resource Language Modeling Through Synthetic Text Generation: A Case Study on Swahili, Haitian Creole, and Yoruba

Files

ds7743_written_final_report-3.pdf (1.22 MB)

Date

2025

Abstract

Despite the impressive capabilities of large language models, low-resource languages (LRLs) such as Swahili, Haitian Creole, and Yoruba remain significantly underserved due to a lack of training data. This thesis explores a text-only approach to addressing this gap by leveraging back-translation to generate synthetic data. With pre-trained multilingual models such as mT5, original sentences in each target language are translated into English and then back into the source language, producing varied and contextually rich text pairs. These pairs are used to fine-tune LLMs, enhancing their fluency and generalization in low-resource settings. The results show measurable improvements in output diversity and translation quality, demonstrating that synthetic data augmentation can play a key role in advancing equitable language technology.
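For readers who want to prototype the back-translation step the abstract describes, the following is a minimal sketch using the Hugging Face transformers library. The checkpoint names (OPUS-MT Swahili/English models), the beam-search settings, and the example sentence are illustrative assumptions rather than the thesis's actual configuration, which fine-tunes mT5; the subsequent fine-tuning stage is not shown.

```python
# Hedged sketch of round-trip (back-)translation for synthetic data generation.
# Checkpoint names below are assumed, not taken from the thesis; replace them
# with the forward/backward translation models used in the actual experiments.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

FWD_MODEL = "Helsinki-NLP/opus-mt-sw-en"  # Swahili -> English (assumed checkpoint)
BWD_MODEL = "Helsinki-NLP/opus-mt-en-sw"  # English -> Swahili (assumed checkpoint)


def load(model_name):
    """Load a seq2seq translation model and its tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def translate(sentences, tokenizer, model, max_length=128):
    """Translate a batch of sentences with beam search."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch, max_length=max_length, num_beams=4)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def back_translate(sentences):
    """Source -> English -> source, yielding (original, paraphrase) pairs."""
    fwd_tok, fwd_model = load(FWD_MODEL)
    bwd_tok, bwd_model = load(BWD_MODEL)
    english = translate(sentences, fwd_tok, fwd_model)
    paraphrases = translate(english, bwd_tok, bwd_model)
    return list(zip(sentences, paraphrases))


if __name__ == "__main__":
    # Illustrative input only; real runs would iterate over a monolingual corpus.
    originals = ["Habari ya asubuhi, rafiki yangu."]
    for source, paraphrase in back_translate(originals):
        print(source, "->", paraphrase)
```

The resulting (original, paraphrase) pairs would then serve as the augmented training data for the fine-tuning stage described above.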
