Abstract
Large Language Models (LLMs) such as GPT-3.5 are trained on data from multiple sources, such as web text, which is predominantly English. LLMs are therefore commonly hypothesized to exhibit significant variation in response accuracy across languages. This research investigates the hypothesis that the primary language of the training data affects the accuracy of responses to multilingual prompts. Experiments were conducted to evaluate the performance of LLMs in English and in several other supported and unsupported languages, with questions structured so that accuracy could be measured quantitatively. The study covers diverse tasks, including mathematical operations, word manipulation, and linguistic analysis. The results show a clear advantage for English prompts, which achieve 80% to 100% accuracy, while the same prompts translated into other languages exhibit a significant degradation in accuracy. These findings underscore the limitations of English-dominated LLM architectures in handling prompts across diverse languages and highlight the need for broader multilingual support to enable equitable access to AI-powered applications worldwide.
The source code is available at github.com/Pro-GenAI/PromptLang.
Keywords: Large Language Models (LLMs), multilingual prompts, cross-language NLP, response accuracy, Natural Language Processing (NLP)