We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭. It outperforms comparable baselines across diverse multilingual tasks, from world knowledge and math reasoning to instruction following. It also surpasses ChatGPT-3.5 on various knowledge and reasoning benchmarks in multiple non-Latin-script languages (Thai, Khmer, Lao, and Burmese), while remaining lightweight and open-source.
SeaLLMs is a continuously iterated and improved series of language models focused specifically on Southeast Asian (SEA) languages. SeaLLMs are continue-pretrained and fine-tuned from strong English base models to build outstanding capabilities in SEA languages without degrading performance in high-resource languages. SeaLLMs are built with a focus on local cultural and legal norms, customs, and stylistic preferences, as well as cost-effectiveness.
We evaluate models on three benchmarks using their recommended default setups: 5-shot MMLU for Eng, 3-shot M3Exam for Eng, Zho, Vie, Ind, and Tha, and zero-shot VMLU for Vie.
M3Exam was evaluated using the standard prompting implementation, while zero-shot VMLU was run with vmlu_run.py for SeaLLMs.
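As an illustration of the few-shot setup above (not the official evaluation harness), a k-shot multiple-choice prompt is typically assembled by concatenating k solved exemplars before the target question; the field names below are assumptions:

```python
# Illustrative sketch of k-shot multiple-choice prompting, as used for
# benchmarks such as MMLU (5-shot) and M3Exam (3-shot). Field names
# ("question", "choices", "answer") are assumptions for this sketch.

def format_question(question, choices):
    """Render one question with lettered options and an 'Answer:' cue."""
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"{l}. {c}" for l, c in zip(letters, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def build_mcq_prompt(examples, question, choices, k=3):
    """Concatenate k solved exemplars followed by the target question."""
    parts = []
    for ex in examples[:k]:
        parts.append(format_question(ex["question"], ex["choices"]) + f" {ex['answer']}")
    parts.append(format_question(question, choices))
    return "\n\n".join(parts)
```

The model's next-token continuation after the final "Answer:" is then compared against the gold letter.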
| Model | Langs | Eng MMLU (5 shots) | Eng M3Exam (3 shots) | Zho M3Exam (3 shots) | Vie M3Exam (3 shots) | Vie VMLU (0 shots) | Ind M3Exam (3 shots) | Tha M3Exam (3 shots) |
|---|---|---|---|---|---|---|---|---|
| ChatGPT-3.5 | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 46.32 | 49.27 | 37.41 |
| Vistral-7B-chat | Mono | 56.86 | 67.00 | 44.56 | 54.33 | 50.03 | 36.49 | 25.27 |
| Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 45.02 | 24.29 | 20.25 |
| SailorLM-7B | Multi | 52.72 | 59.76 | 67.74 | 50.14 | --- | 39.53 | 37.73 |
| SeaLLM-7B-v2 | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 45.74 | 42.25 | 35.52 |
| SeaLLM-7B-v2.5 | Multi | 64.05 | 76.87 | 62.54 | 63.11 | 53.30 | 48.64 | 46.86 |
According to the SeaExam leaderboard, which evaluates model performance on human-exam-style questions in Southeast Asian languages, the latest SeaLLM-7B-v2.5 ranks at the top among open-source models of similar size.
SeaLLM-7B-v2.5 achieves 78.5 on GSM8K and 34.9 on MATH with zero-shot CoT reasoning, outperforming GPT-3.5 on MATH. It also outperforms GPT-3.5 on both GSM8K and MATH when the benchmarks are translated into 4 SEA languages (🇨🇳 🇻🇳 🇮🇩 🇹🇭).
| Model | Eng GSM8K | Eng MATH | Zho GSM8K | Zho MATH | Vie GSM8K | Vie MATH | Ind GSM8K | Ind MATH | Tha GSM8K | Tha MATH |
|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT-3.5 | 80.8 | 34.1 | 48.2 | 21.5 | 55.0 | 26.5 | 64.3 | 26.4 | 35.8 | 18.1 |
| Vistral-7B-Chat | 48.2 | 12.5 | | | 48.7 | 3.1 | | | | |
| Qwen1.5-7B-chat | 56.8 | 15.3 | 40.0 | 2.7 | 37.7 | 9.0 | 36.9 | 7.7 | 21.9 | 4.7 |
| SeaLLM-7B-v2 | 78.2 | 27.5 | 53.7 | 17.6 | 69.9 | 23.8 | 71.5 | 24.4 | 59.6 | 22.4 |
| SeaLLM-7B-v2.5 | 78.5 | 34.9 | 51.3 | 22.1 | 72.3 | 30.2 | 71.5 | 30.1 | 62.0 | 28.4 |
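Zero-shot CoT math benchmarks like the ones above are usually scored by extracting a final numeric answer from the model's free-form reasoning. A hedged sketch of one common extraction rule, taking the last number in the generation (this mirrors typical practice but is not necessarily the exact rule used here):

```python
import re

# Hedged sketch: score zero-shot CoT outputs on GSM8K-style problems by
# taking the LAST number in the model's reasoning as its final answer.
# This is a common convention, assumed here, not a confirmed detail of
# the SeaLLM evaluation.

def extract_final_number(generation: str):
    """Return the last numeric value in the text, or None if absent."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(generation: str, gold: float, tol: float = 1e-4) -> bool:
    """Compare the extracted answer to the gold value within a tolerance."""
    pred = extract_final_number(generation)
    return pred is not None and abs(pred - gold) < tol
```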
Sea-Bench is a set of categorized instruction test sets that measures models' abilities as an assistant, specifically focused on 9 SEA languages,
including non-Latin low-resource languages. Model responses on Sea-Bench are rated by GPT-4 following the MT-Bench LLM-as-judge procedure.
As shown, SeaLLM-7B-v2.5 reaches GPT-3.5-level performance in many common SEA languages (Eng, Zho, Vie, Ind, Tha, Msa)
and far surpasses it in low-resource non-Latin languages (Mya, Lao, Khm).
We compare SeaLLM-7B-v2.5 with ChatGPT and Mistral-7B-Instruct on various zero-shot commonsense benchmarks (Arc-Challenge, Winogrande, and Hellaswag). We use the 2-stage technique of (Kojima et al., 2023) to extract the answer. Note that we DID NOT use "Let's think step-by-step" to invoke explicit CoT.
Model | Arc-Challenge | Winogrande | Hellaswag |
---|---|---|---|
ChatGPT (Reported) | 84.6* | 66.8* | 72.0* |
ChatGPT (Reproduced) | 84.1 | 63.1 | 79.5 |
Mistral-7B-Instruct | 68.1 | 56.4 | 45.6 |
Qwen1.5-7B-Chat | 79.3 | 59.4 | 69.3 |
SeaLLM-7B-v2 | 82.5 | 68.3 | 80.9 |
SeaLLM-7B-v2.5 | 86.5 | 75.4 | 91.6 |
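The 2-stage technique of Kojima et al. can be sketched as follows: stage 1 lets the model answer freely (here without a CoT trigger phrase, matching the note above), and stage 2 re-prompts with the model's own output plus an extraction cue so the final choice can be parsed reliably. `generate` below stands in for any text-generation callable; it is an assumption of this sketch, not a specific API:

```python
# Minimal sketch of two-stage answer extraction (Kojima et al.):
# stage 1 generates a free-form response; stage 2 appends that response
# and an extraction cue, then generates again to obtain a parseable
# final answer. `generate` is a placeholder for any model call.

def two_stage_answer(generate, question: str,
                     cue: str = "Therefore, the answer is"):
    # Stage 1: free-form response (WITHOUT a CoT trigger phrase).
    stage1 = generate(question)
    # Stage 2: re-prompt with the model's own output plus the cue.
    stage2 = generate(f"{question}\n{stage1}\n{cue}")
    return stage2.strip()
```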
All SeaLLM models underwent continue-pretraining, instruction tuning, and alignment tuning to ensure not only competitive performance in SEA languages but also a high level of safety and legal compliance. All models are trained on 32 A800 GPUs.
| Model | Backbone | Context Length | Vocab Size | Chat format |
|---|---|---|---|---|
| SeaLLM-7B-v2.5 | gemma-7b | 8192 | 256000 | Add `<bos>` at start if your tokenizer does not do so! |
| SeaLLM-7B-v2 | Mistral-7B-v0.1 | 8192 | 48384 | Add `<bos>` at start if your tokenizer does not do so! |
| SeaLLM-7B-v1 | Llama-2-7b | 4096 | 48512 | Same as Llama-2 |
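The `<bos>` note in the table can be handled with a small guard before encoding; `"<bos>"` as the literal token string is taken from the table above, so verify it against your tokenizer's actual `bos_token`:

```python
# Sketch of the guard suggested by the table's "Add <bos>" note: some
# tokenizers do not prepend the beginning-of-sequence token themselves.
# The default token string "<bos>" is an assumption from the table;
# check your tokenizer's bos_token attribute.

def ensure_bos(prompt: str, bos_token: str = "<bos>") -> str:
    """Prepend the BOS token if the prompt does not already start with it."""
    if not prompt.startswith(bos_token):
        return bos_token + prompt
    return prompt
```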
SeaLLM-7B-v2.5 was released in April 2024. It possesses outstanding abilities in world knowledge and math reasoning in both English and SEA languages.
SeaLLM-7B-v2 was released in Feb 2024. It possesses outstanding abilities in math and commonsense reasoning in SEA languages.
SeaLLM-7B-v1 was released in Nov 2023. It was the first release of the SeaLLMs model family, and the first LLM built specifically for Southeast Asia.
We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT dataset as well as evaluating our models across different aspects, especially safety.
If you find our project useful, we hope you would kindly star our repo and cite our work as follows.
Corresponding Author: l.bing@alibaba-inc.com
@article{damonlpsg2023seallm,
author = {Xuan-Phi Nguyen*, Wenxuan Zhang*, Xin Li*, Mahani Aljunied*, Weiwen Xu, Hou Pong Chan,
Zhiqiang Hu, Chenhui Shen^, Yew Ken Chia^, Xingxuan Li, Jianyu Wang,
Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang,
Chaoqun Liu, Hang Zhang, Lidong Bing},
title = {SeaLLMs - Large Language Models for Southeast Asia},
year = 2023,
Eprint = {arXiv:2312.00738},
}