🔥[NEW!] SeaLLM-7B-v2.5 is released with SoTA performance in world knowledge and math reasoning.
🔥[HOT!] SeaLMMM-7B-v0.1 is introduced with multimodal multilingual capabilities in SEA languages.

Abstract

We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭. It outperforms comparable baselines across diverse multilingual tasks, from world knowledge and math reasoning to instruction following. It also surpasses ChatGPT-3.5 on various knowledge and reasoning benchmarks in multiple non-Latin-script languages (Thai, Khmer, Lao, and Burmese), while remaining lightweight and open-source.

SeaLLMs is a continuously iterated and improved series of language models that focuses specifically on Southeast Asian (SEA) languages. SeaLLMs are typically continually pre-trained and fine-tuned from strong English models to build outstanding capabilities in SEA languages without degrading performance in high-resource languages. SeaLLMs are built with a focus on local cultural and legal norms, customs, and stylistic preferences, as well as cost-effectiveness.

World Knowledge

We evaluate models on 3 benchmarks following the recommended default setups: 5-shot MMLU for Eng, 3-shot M3Exam for Eng, Zho, Vie, Ind, and Tha, and zero-shot VMLU for Vie.

M3Exam was evaluated using the standard prompting implementation, while zero-shot VMLU for SeaLLMs was run with vmlu_run.py.
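
For reference, here is a minimal sketch of how n-shot multiple-choice prompts of this kind are typically assembled (illustrative only; not necessarily the exact harness used here):

```python
# Illustrative sketch of assembling an n-shot multiple-choice prompt for
# MMLU/M3Exam-style questions: n solved dev examples, then the unsolved
# test question. The exact formatting of the real harness may differ.
def format_example(question, choices, answer=None):
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_nshot_prompt(dev_examples, test_question, test_choices, n=5):
    # dev_examples: list of (question, choices, answer) tuples
    shots = [format_example(q, c, a) for q, c, a in dev_examples[:n]]
    return "\n\n".join(shots + [format_example(test_question, test_choices)])
```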

| Model | Langs | Eng MMLU (5-shot) | Eng M3Exam (3-shot) | Zho M3Exam (3-shot) | Vie M3Exam (3-shot) | Vie VMLU (0-shot) | Ind M3Exam (3-shot) | Tha M3Exam (3-shot) |
|-------|-------|-------------------|---------------------|---------------------|---------------------|-------------------|---------------------|---------------------|
| ChatGPT-3.5 | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 46.32 | 49.27 | 37.41 |
| Vistral-7B-chat | Mono | 56.86 | 67.00 | 44.56 | 54.33 | 50.03 | 36.49 | 25.27 |
| Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 45.02 | 24.29 | 20.25 |
| SailorLM-7B | Multi | 52.72 | 59.76 | 67.74 | 50.14 | --- | 39.53 | 37.73 |
| SeaLLM-7B-v2 | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 45.74 | 42.25 | 35.52 |
| SeaLLM-7B-v2.5 | Multi | 64.05 | 76.87 | 62.54 | 63.11 | 53.30 | 48.64 | 46.86 |

SeaExam Leaderboard

According to the SeaExam leaderboard, which evaluates model performance on human-exam-style questions in Southeast Asian languages, SeaLLM-7B-v2.5 ranks at the top among open-source models of similar size.

Multilingual Math Reasoning

SeaLLM-7B-v2.5 achieves 78.5 on GSM8K and 34.9 on MATH with zero-shot CoT reasoning, outperforming GPT-3.5 on MATH. It also outperforms GPT-3.5 on both GSM8K and MATH translated into Chinese and three SEA languages (🇨🇳 🇻🇳 🇮🇩 🇹🇭); a sketch of the zero-shot CoT scoring setup follows the table below.

| Model | Eng GSM8K | Eng MATH | Zho GSM8K | Zho MATH | Vie GSM8K | Vie MATH | Ind GSM8K | Ind MATH | Tha GSM8K | Tha MATH |
|-------|-----------|----------|-----------|----------|-----------|----------|-----------|----------|-----------|----------|
| ChatGPT-3.5 | 80.8 | 34.1 | 48.2 | 21.5 | 55.0 | 26.5 | 64.3 | 26.4 | 35.8 | 18.1 |
| Vistral-7B-Chat | 48.2 | 12.5 | --- | --- | 48.7 | 3.1 | --- | --- | --- | --- |
| Qwen1.5-7B-chat | 56.8 | 15.3 | 40.0 | 2.7 | 37.7 | 9.0 | 36.9 | 7.7 | 21.9 | 4.7 |
| SeaLLM-7B-v2 | 78.2 | 27.5 | 53.7 | 17.6 | 69.9 | 23.8 | 71.5 | 24.4 | 59.6 | 22.4 |
| SeaLLM-7B-v2.5 | 78.5 | 34.9 | 51.3 | 22.1 | 72.3 | 30.2 | 71.5 | 30.1 | 62.0 | 28.4 |
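
As a companion to the numbers above, here is a minimal sketch of zero-shot CoT scoring on a GSM8K-style item, assuming answers are graded by matching the final number in the model's output; `generate` is a hypothetical stand-in for any model call:

```python
import re

# Minimal sketch of zero-shot CoT answering on a GSM8K-style question.
# The prompt wording is illustrative; "generate" is a hypothetical stand-in
# for whatever function calls the model and returns its text output.
def zero_shot_cot_answer(question: str, generate) -> str:
    prompt = f"Question: {question}\nAnswer: Let's think step by step."
    reasoning = generate(prompt)
    # Take the last number the model produced as its final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reasoning.replace(",", ""))
    return numbers[-1] if numbers else ""
```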

Multilingual Instruction Following

Sea-Bench is a set of categorized instruction test sets that measures a model's ability as an assistant, focused specifically on 9 SEA languages, including non-Latin low-resource languages. Model responses on Sea-Bench are rated by GPT-4 following the MT-bench LLM-as-judge procedure (a simplified sketch of this judging setup follows).
As shown, SeaLLM-7B-v2.5 reaches GPT-3.5-level performance in many common SEA languages (Eng, Zho, Vie, Ind, Tha, Msa) and far surpasses it in low-resource non-Latin-script languages (Mya, Lao, Khm).
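
The sketch below illustrates an MT-bench-style LLM-as-judge call; the judging prompt is a paraphrase for illustration only (the exact template lives in the MT-bench/FastChat repository), and it assumes the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()

# Simplified, illustrative LLM-as-judge rating call; not the verbatim
# MT-bench template.
def judge_response(instruction: str, response: str) -> str:
    prompt = (
        "Please rate the quality of the assistant's response to the user "
        "instruction below on a scale of 1 to 10, then briefly justify.\n\n"
        f"[Instruction]\n{instruction}\n\n[Response]\n{response}\n\nRating:"
    )
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```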

Zero-shot Commonsense Reasoning

We compare SeaLLM-7B-v2.5 with ChatGPT and Mistral-7B-Instruct on various zero-shot commonsense benchmarks (Arc-Challenge, Winogrande, and Hellaswag). We use the two-stage technique of Kojima et al. (2023) to extract the answer, as sketched below. Note that we did NOT use "Let's think step-by-step" to invoke explicit CoT.
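
A minimal sketch of the two-stage answer extraction: stage 1 lets the model produce free-form reasoning (here without the "Let's think step-by-step" trigger, matching the setup above), and stage 2 appends an extraction cue so the chosen option can be parsed reliably. `generate` is a hypothetical stand-in for any model call, and the cue wording is illustrative:

```python
# Two-stage answer extraction in the style of Kojima et al.
def two_stage_answer(question_block: str, generate) -> str:
    # Stage 1: elicit free-form reasoning (no explicit CoT trigger here).
    stage1_prompt = f"{question_block}\nAnswer:"
    reasoning = generate(stage1_prompt)
    # Stage 2: append an answer-extraction cue and re-query the model.
    stage2_prompt = (
        f"{stage1_prompt} {reasoning}\n"
        "Therefore, among A through D, the answer is"
    )
    return generate(stage2_prompt).strip()
```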

| Model | Arc-Challenge | Winogrande | Hellaswag |
|-------|---------------|------------|-----------|
| ChatGPT (Reported) | 84.6* | 66.8* | 72.0* |
| ChatGPT (Reproduced) | 84.1 | 63.1 | 79.5 |
| Mistral-7B-Instruct | 68.1 | 56.4 | 45.6 |
| Qwen1.5-7B-Chat | 79.3 | 59.4 | 69.3 |
| SeaLLM-7B-v2 | 82.5 | 68.3 | 80.9 |
| SeaLLM-7B-v2.5 | 86.5 | 75.4 | 91.6 |

Model Information

All SeaLLM models underwent continued pre-training, instruction tuning, and alignment tuning to ensure not only competitive performance in SEA languages but also a high level of safety and legal compliance. All models were trained with 32 A800 GPUs.

| Model | Backbone | Context Length | Vocab Size | Chat format |
|-------|----------|----------------|------------|-------------|
| SeaLLM-7B-v2.5 | gemma-7b | 8192 | 256000 | Add `<bos>` at the start if your tokenizer does not do so!<br>`<\|im_start\|>user`<br>`{content}<eos>`<br>`<\|im_start\|>assistant`<br>`{content}<eos>` |
| SeaLLM-7B-v2 | Mistral-7B-v0.1 | 8192 | 48384 | Add `<bos>` at the start if your tokenizer does not do so!<br>`<\|im_start\|>user`<br>`{content}</s><\|im_start\|>assistant`<br>`{content}</s>` |
| SeaLLM-7B-v1 | Llama-2-7b | 4096 | 48512 | Same as Llama-2 |
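
A minimal sketch of running SeaLLM-7B-v2.5 with the chat format above via Hugging Face transformers; the repo id `SeaLLMs/SeaLLM-7B-v2.5` is an assumption, so substitute the actual hub id or a local path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SeaLLMs/SeaLLM-7B-v2.5"  # assumed hub id; substitute as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def build_prompt(user_message: str) -> str:
    # Per the table above: prepend <bos> (some tokenizers do this themselves),
    # then alternate <|im_start|> turns, each terminated by <eos>.
    bos = tokenizer.bos_token or "<bos>"
    eos = tokenizer.eos_token or "<eos>"
    return f"{bos}<|im_start|>user\n{user_message}{eos}\n<|im_start|>assistant\n"

inputs = tokenizer(
    build_prompt("Xin chào! Bạn là ai?"),
    add_special_tokens=False,  # <bos> is already in the string
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```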

Related Links

SeaLLM-7B-v2.5 was released in April 2024. It possesses outstanding abilities in world knowledge and math reasoning in both English and SEA languages.

SeaLLM-7B-v2 was released in Feb 2024. It possesses outstanding abilities in math and commonsense reasoning in SEA languages.

SeaLLM-7B-v1 was released in Nov 2023. It was the first release of the SeaLLMs model family, and the first LLM built specifically for Southeast Asia.

Acknowledgement

We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi, and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT datasets, as well as evaluate our models across different aspects, especially safety.

BibTeX

If you find our project useful, we hope you would kindly star our repo and cite our work as follows.
Corresponding Author: l.bing@alibaba-inc.com

@article{damonlpsg2023seallm,
  author = {Xuan-Phi Nguyen* and Wenxuan Zhang* and Xin Li* and Mahani Aljunied* and
            Weiwen Xu and Hou Pong Chan and Zhiqiang Hu and Chenhui Shen^ and
            Yew Ken Chia^ and Xingxuan Li and Jianyu Wang and Qingyu Tan and
            Liying Cheng and Guanzheng Chen and Yue Deng and Sen Yang and
            Chaoqun Liu and Hang Zhang and Lidong Bing},
  title = {SeaLLMs - Large Language Models for Southeast Asia},
  year = {2023},
  eprint = {arXiv:2312.00738},
}