Publication:

Enhancing Low-Resource Language Modeling Through Synthetic Text Generation: A Case Study on Swahili, Haitian Creole, and Yoruba

datacite.rights: restricted
dc.contributor.advisor: Petras, Iasonas
dc.contributor.author: Singh, Divraj
dc.date.accessioned: 2026-01-05T22:08:55Z
dc.date.available: 2026-01-05T22:08:55Z
dc.date.issued: 2025
dc.description.abstract: Despite the impressive capabilities of large language models, low-resource languages (LRLs) such as Swahili, Haitian Creole, and Yoruba remain significantly underserved due to a lack of training data. This thesis explores a text-only approach to addressing this gap by leveraging back-translation to generate synthetic data. Using pre-trained multilingual models like mT5, original sentences in each target language are translated into English and then back into their original form to produce varied and contextually rich text pairs. These pairs are used to fine-tune LLMs, enhancing their fluency and generalization in low-resource settings. The results show measurable improvements in output diversity and translation quality, demonstrating that synthetic data augmentation can play a key role in advancing equitable language technology.
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01r781wk52p
dc.language.iso: en_US
dc.title: Enhancing Low-Resource Language Modeling Through Synthetic Text Generation: A Case Study on Swahili, Haitian Creole, and Yoruba
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-12-15T16:01:26.158Z
pu.contributor.authorid: 920287761
pu.date.classyear: 2025
pu.department: Computer Science
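
The back-translation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name back_translate_pairs and the two translator callables are hypothetical, and any real translation model (the abstract mentions mT5) would need to be wrapped to fit them. Dropping round trips that come back unchanged is also an assumption, made here because an identical pair adds no variation.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: any function mapping one sentence to its translation.
Translator = Callable[[str], str]


def back_translate_pairs(
    sentences: Iterable[str],
    to_english: Translator,
    from_english: Translator,
) -> List[Tuple[str, str]]:
    """Build (original, round-trip) pairs via an English pivot.

    Assumption: pairs whose round trip is empty or identical to the
    original are discarded, since they contribute no new variation.
    """
    pairs: List[Tuple[str, str]] = []
    for sentence in sentences:
        source = sentence.strip()
        if not source:
            continue  # skip blank input lines
        pivot = to_english(source)                # target language -> English
        round_trip = from_english(pivot).strip()  # English -> target language
        if round_trip and round_trip != source:
            pairs.append((source, round_trip))
    return pairs
```

The translators are passed in as plain callables so the same loop could be reused for each target language, and so the pairing logic can be checked with simple stand-in functions before a model is attached.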

Files

Original bundle

Name: ds7743_written_final_report-3.pdf
Size: 1.22 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission