We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭. It outperforms comparable baselines across diverse multilingual tasks, from world knowledge and math reasoning to instruction following. It also surpasses ChatGPT-3.5 on various knowledge and reasoning benchmarks in multiple non-Latin-script languages (Thai, Khmer, Lao, and Burmese), while remaining lightweight and open-source.
SeaLLMs is a continuously iterated and improved series of language models focused specifically on Southeast Asian (SEA) languages. SeaLLMs are continue-pretrained and fine-tuned from strong English base models to build outstanding capabilities in SEA languages without degrading performance in high-resource languages. SeaLLMs are built with a focus on local cultural and legal norms, customs, and stylistic preferences, as well as cost-effectiveness.
We evaluate models on three benchmarks using their recommended default setups: 5-shot MMLU for Eng, 3-shot M3Exam for Eng, Zho, Vie, Ind, and Tha, and zero-shot VMLU for Vie.
M3Exam was evaluated using the standard prompting implementation, while zero-shot VMLU was run with vmlu_run.py for SeaLLMs.
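As an illustration of the few-shot setup above (not the official evaluation harness), a k-shot multiple-choice prompt is typically assembled by concatenating k solved exemplars before the target question; the field names below are assumptions:

```python
# Illustrative sketch of k-shot multiple-choice prompting, as used for
# benchmarks such as MMLU (5-shot) and M3Exam (3-shot). Field names
# ("question", "choices", "answer") are assumptions for this sketch.

def format_question(question, choices):
    """Render one question with lettered options and an 'Answer:' cue."""
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"{l}. {c}" for l, c in zip(letters, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def build_mcq_prompt(examples, question, choices, k=3):
    """Concatenate k solved exemplars followed by the target question."""
    parts = []
    for ex in examples[:k]:
        parts.append(format_question(ex["question"], ex["choices"]) + f" {ex['answer']}")
    parts.append(format_question(question, choices))
    return "\n\n".join(parts)
```

The model's next-token continuation after the final "Answer:" is then compared against the gold letter.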
| Model | Langs | Eng MMLU (5 shots) | Eng M3Exam (3 shots) | Zho M3Exam (3 shots) | Vie M3Exam (3 shots) | Vie VMLU (0 shots) | Ind M3Exam (3 shots) | Tha M3Exam (3 shots) |
|---|---|---|---|---|---|---|---|---|
| ChatGPT-3.5 | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 46.32 | 49.27 | 37.41 |
| Vistral-7B-chat | Mono | 56.86 | 67.00 | 44.56 | 54.33 | 50.03 | 36.49 | 25.27 |
| Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 45.02 | 24.29 | 20.25 |
| SailorLM-7B | Multi | 52.72 | 59.76 | 67.74 | 50.14 | --- | 39.53 | 37.73 |
| SeaLLM-7B-v2 | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 45.74 | 42.25 | 35.52 |
| SeaLLM-7B-v2.5 | Multi | 64.05 | 76.87 | 62.54 | 63.11 | 53.30 | 48.64 | 46.86 |
According to the SeaExam leaderboard, which evaluates model performance on human-exam-style questions in Southeast Asian languages, the latest SeaLLM-7B-v2.5 ranks at the top among open-source models of similar size.
SeaLLM-7B-v2.5 achieves 78.5 on GSM8K and 34.9 on MATH with zero-shot CoT reasoning, outperforming GPT-3.5 on MATH. It also outperforms GPT-3.5 on both GSM8K and MATH when the benchmarks are translated into 4 SEA languages (🇨🇳 🇻🇳 🇮🇩 🇹🇭).
| Model | Eng GSM8K | Eng MATH | Zho GSM8K | Zho MATH | Vie GSM8K | Vie MATH | Ind GSM8K | Ind MATH | Tha GSM8K | Tha MATH |
|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT-3.5 | 80.8 | 34.1 | 48.2 | 21.5 | 55.0 | 26.5 | 64.3 | 26.4 | 35.8 | 18.1 |
| Vistral-7B-Chat | 48.2 | 12.5 | | | 48.7 | 3.1 | | | | |
| Qwen1.5-7B-chat | 56.8 | 15.3 | 40.0 | 2.7 | 37.7 | 9.0 | 36.9 | 7.7 | 21.9 | 4.7 |
| SeaLLM-7B-v2 | 78.2 | 27.5 | 53.7 | 17.6 | 69.9 | 23.8 | 71.5 | 24.4 | 59.6 | 22.4 |
| SeaLLM-7B-v2.5 | 78.5 | 34.9 | 51.3 | 22.1 | 72.3 | 30.2 | 71.5 | 30.1 | 62.0 | 28.4 |
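Zero-shot CoT math benchmarks like the ones above are usually scored by extracting a final numeric answer from the model's free-form reasoning. A hedged sketch of one common extraction rule, taking the last number in the generation (this mirrors typical practice but is not necessarily the exact rule used here):

```python
import re

# Hedged sketch: score zero-shot CoT outputs on GSM8K-style problems by
# taking the LAST number in the model's reasoning as its final answer.
# This is a common convention, assumed here, not a confirmed detail of
# the SeaLLM evaluation.

def extract_final_number(generation: str):
    """Return the last numeric value in the text, or None if absent."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(generation: str, gold: float, tol: float = 1e-4) -> bool:
    """Compare the extracted answer to the gold value within a tolerance."""
    pred = extract_final_number(generation)
    return pred is not None and abs(pred - gold) < tol
```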
Sea-Bench is a set of categorized instruction test sets that measures models' abilities as an assistant, specifically focused on 9 SEA languages,
including non-Latin low-resource languages. Model responses on Sea-Bench are rated by GPT-4 following the MT-Bench LLM-as-judge procedure.
As shown, SeaLLM-7B-v2.5 reaches GPT-3.5-level performance in many common SEA languages (Eng, Zho, Vie, Ind, Tha, Msa)
and far surpasses it in low-resource non-Latin languages (Mya, Lao, Khm).
We compare SeaLLM-7B-v2.5 with ChatGPT and Mistral-7B-Instruct on various zero-shot commonsense benchmarks (Arc-Challenge, Winogrande, and Hellaswag). We use the 2-stage technique of (Kojima et al., 2023) to extract the answer. Note that we DID NOT use "Let's think step-by-step" to invoke explicit CoT.
Model | Arc-Challenge | Winogrande | Hellaswag |
---|---|---|---|
ChatGPT (Reported) | 84.6* | 66.8* | 72.0* |
ChatGPT (Reproduced) | 84.1 | 63.1 | 79.5 |
Mistral-7B-Instruct | 68.1 | 56.4 | 45.6 |
Qwen1.5-7B-Chat | 79.3 | 59.4 | 69.3 |
SeaLLM-7B-v2 | 82.5 | 68.3 | 80.9 |
SeaLLM-7B-v2.5 | 86.5 | 75.4 | 91.6 |
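The 2-stage technique of Kojima et al. can be sketched as follows: stage 1 lets the model answer freely (here without a CoT trigger phrase, matching the note above), and stage 2 re-prompts with the model's own output plus an extraction cue so the final choice can be parsed reliably. `generate` below stands in for any text-generation callable; it is an assumption of this sketch, not a specific API:

```python
# Minimal sketch of two-stage answer extraction (Kojima et al.):
# stage 1 generates a free-form response; stage 2 appends that response
# and an extraction cue, then generates again to obtain a parseable
# final answer. `generate` is a placeholder for any model call.

def two_stage_answer(generate, question: str,
                     cue: str = "Therefore, the answer is"):
    # Stage 1: free-form response (WITHOUT a CoT trigger phrase).
    stage1 = generate(question)
    # Stage 2: re-prompt with the model's own output plus the cue.
    stage2 = generate(f"{question}\n{stage1}\n{cue}")
    return stage2.strip()
```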
All SeaLLM models underwent continue-pretraining, instruction tuning, and alignment tuning to ensure not only competitive performance in SEA languages but also a high level of safety and legal compliance. All models are trained on 32 A800 GPUs.
| Model | Backbone | Context Length | Vocab Size | Chat format |
|---|---|---|---|---|
| SeaLLM-7B-v2.5 | gemma-7b | 8192 | 256000 | Add `<bos>` at start if your tokenizer does not do so! |
| SeaLLM-7B-v2 | Mistral-7B-v0.1 | 8192 | 48384 | Add `<bos>` at start if your tokenizer does not do so! |
| SeaLLM-7B-v1 | Llama-2-7b | 4096 | 48512 | Same as Llama-2 |
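The `<bos>` note in the table can be handled with a small guard before encoding; `"<bos>"` as the literal token string is taken from the table above, so verify it against your tokenizer's actual `bos_token`:

```python
# Sketch of the guard suggested by the table's "Add <bos>" note: some
# tokenizers do not prepend the beginning-of-sequence token themselves.
# The default token string "<bos>" is an assumption from the table;
# check your tokenizer's bos_token attribute.

def ensure_bos(prompt: str, bos_token: str = "<bos>") -> str:
    """Prepend the BOS token if the prompt does not already start with it."""
    if not prompt.startswith(bos_token):
        return bos_token + prompt
    return prompt
```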
SeaLLM-7B-v2.5 was released in April 2024. It possesses outstanding abilities in world knowledge and math reasoning in both English and SEA languages.
SeaLLM-7B-v2 was released in Feb 2024. It possesses outstanding abilities in math and commonsense reasoning in SEA languages.
SeaLLM-7B-v1 was released in Nov 2023. It was the first release of the SeaLLMs model family, and the first LLM built specifically for Southeast Asia.
We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT dataset as well as evaluating our models across different aspects, especially safety.
If you find our project useful, we hope you would kindly star our repo and cite our work as follows.
Corresponding Author: l.bing@alibaba-inc.com
@article{damonlpsg2023seallm,
author = {Xuan-Phi Nguyen*, Wenxuan Zhang*, Xin Li*, Mahani Aljunied*, Weiwen Xu, Hou Pong Chan,
Zhiqiang Hu, Chenhui Shen^, Yew Ken Chia^, Xingxuan Li, Jianyu Wang,
Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang,
Chaoqun Liu, Hang Zhang, Lidong Bing},
title = {SeaLLMs - Large Language Models for Southeast Asia},
year = 2023,
Eprint = {arXiv:2312.00738},
}