The SEA-LION Can Roar: Enhancing Large Language Models’ Inclusivity
Large Language Models have taken the world by storm, but they suffer from inherent biases and an inability to capture contextual nuances.
Large Language Models (LLMs) have taken the world by storm in the short span of a year. While the significance of LLMs is profound, the discussion around them often gets lost amid the dominant focus on big-power technology competition in the semiconductor and electric vehicle industries. This is particularly concerning given that the current LLM market, dominated by the US and China, poses several challenges.
For one, LLMs produced by US-based companies have been criticised for inherent biases and their inability to accurately capture contextual nuances specific to non-Western cultures. These LLMs scrape training data from the Internet, which is largely shaped by “Western, Educated, Industrialised, Rich and Democratic” (WEIRD) societies. Meanwhile, AI models in China undergo mandatory government review by the Cyberspace Administration of China (CAC): LLMs’ responses to politically sensitive topics must embody the “core values of socialism” and are prohibited from generating “illegal” content. Chinese LLMs thus avoid answering contentious topics like the 1989 Tiananmen Square protests.
Given such limitations and constraints, regional capacity for LLM innovation is crucial to alleviating concerns about representation and cultural inaccuracy. There is an opportunity for Southeast Asian countries to leverage their technological expertise to develop LLMs that are more attuned to the nuances of their own languages and cultures. In this context, the Southeast Asian Languages in One Network (SEA-LION), the region’s first LLM, launched in December 2023, serves as an effective complement to existing LLMs, enhancing representation of the region.
Trained on content produced in Southeast Asian languages like Bahasa Indonesia, Thai and Vietnamese, SEA-LION distinguishes itself from other LLMs, which are trained primarily on a corpus of English Internet content. By employing SEABPETokenizer, a tokeniser specifically tailored for Southeast Asian languages, SEA-LION is highly adept at understanding Southeast Asian languages and capturing local cultural nuances.
Language tokenisation is a crucial step in converting text into a format that machine-learning models can interpret. Because the tokenisers of existing LLMs are English-centric, those LLMs tend to misread the semantics of regional languages and thus produce results that are either inaccurate or fail to fully appreciate local cultural contexts.
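The effect of an English-centric vocabulary can be illustrated with a toy greedy longest-match subword tokeniser. This is a simplification for illustration only: the actual SEABPETokenizer uses byte-pair encoding trained on regional text, and the two vocabularies below are hypothetical. The sketch shows how a regional vocabulary keeps a Bahasa Indonesia word such as *transportasi* (“transport”) intact, while an English-centric one fragments it:

```python
# Toy sketch (NOT the actual SEABPETokenizer): greedy longest-match
# subword tokenisation against two hypothetical vocabularies.

def tokenize(text, vocab):
    """At each position, match the longest vocabulary entry,
    falling back to a single character if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

# Hypothetical vocabularies, chosen only for illustration.
english_centric = {"trans", "port", "asi", "uda"}
regional = {"transportasi", "mudah"}

word = "transportasi"
print(tokenize(word, english_centric))  # ['trans', 'port', 'asi']
print(tokenize(word, regional))         # ['transportasi']
```

More fragments per word means the model works with less coherent units of meaning and consumes more of its context window per sentence, which is one reason a regionally trained tokeniser improves handling of Southeast Asian languages.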
In a demonstration shown to the author by AI Singapore, SEA-LION, Meta’s Llama 2, OpenAI’s GPT-4 and Alibaba’s SEA-LLM were given a prompt that enquired about the benefits of ojek (Figure 1). In Indonesia, an ojek refers to a motorcycle taxi, and comes not only in the form of traditional street-hailing but also the modern online form pioneered by Gojek. SEA-LION accurately responded to the prompt by highlighting the role of ojek in providing faster transportation. In contrast, Meta’s Llama 2 and Alibaba’s SEA-LLM primarily focused on Gojek and excluded references to the traditional offline taxis that are still in operation.
Figure 1. LLMs Describing Ojek

(Author’s Note: This figure is intended for illustrative purposes only. Models depicted in the video may have undergone further development since the time of this assessment.)
In another assessment, the LLMs were given a prompt in Thai asking them to describe ASEAN in Bahasa Indonesia. SEA-LION gave an accurate output in the right language. In contrast, Llama 2 failed to understand the Thai prompt, while SEA-LLM answered in Thai instead of Bahasa Indonesia and even “hallucinated”, stating that ASEAN comprises 11 member nations, including countries such as Venezuela (Figure 2).
Figure 2. LLMs Describing ASEAN

(Author’s Note: This figure is intended for illustrative purposes only. Models depicted in the video may have undergone further development since the time of this assessment.)
While LLMs exclude “low-quality” content during model training, the criteria for determining what constitutes “quality content” differ across developers. Operating from a Southeast Asian perspective, SEA-LION’s training data therefore contains 26 times more Southeast Asian content than that of LLMs like Llama 2. Additionally, SEA-LION collects data contributed by Southeast Asian partners, such as Cambodian and Thai media outlets. This approach ensures that the model is not limited to Western-centric material and incorporates regional perspectives on different issues, enhancing its capacity to provide contextually accurate answers and promote a more authentic representation of the region.
When the LLMs were asked about the leadership style of President Suharto of Indonesia, SEA-LION provided a more balanced and nuanced response, replying that while his governing style could be characterised as authoritarian, President Suharto was also a prominent military leader with a vision to modernise Indonesia. This view will most likely resonate well with the Indonesian people. In contrast, SEA-LLM and GPT-4 offered a skewed and more negative characterisation of Suharto, emphasising his corruption and questionable record on human rights. Their responses reflect how the international community often portrays and views the late Indonesian president (Figure 3).
Figure 3. LLMs Describing President Suharto

(Author’s Note: This figure is intended for illustrative purposes only. Models depicted in the video may have undergone further development since the time of this assessment.)
While demonstrating significant strengths, SEA-LION is continually being refined and fine-tuned to enhance its overall performance. Presently, the model supports the more commonly used languages of Southeast Asia, namely Bahasa Indonesia, Vietnamese, Thai and Malay, with plans to extend its capabilities to other languages such as Lao and Burmese. While SEA-LION was assessed to be better than other LLMs like GPT-4 at understanding regional sentiments, it ranked behind GPT-4 in reasoning tasks.
As the AI race continues to intensify, the region ought to leverage its collective expertise to advance more holistic and culturally informed technologies. It is therefore reassuring that the open-source nature of SEA-LION enables other countries to build upon it. An open-source LLM is one whose code and data are made publicly accessible, allowing people to use, modify and build upon the model’s internal workings. This collaborative potential fosters further innovation and refinement, allowing the AI community to advance the region’s LLM technology beyond its initial capabilities.
2024/254
Cha Hae Won is a Research Officer in the Regional Strategic and Political Studies Programme, ISEAS - Yusof Ishak Institute.