We introduce SeaLLMs-v3, the latest series of the SeaLLMs (Large Language Models for Southeast Asian languages) family. It achieves state-of-the-art performance among models of similar size, excelling across a diverse array of tasks such as world knowledge, mathematical reasoning, translation, and instruction following. It was also specifically enhanced to be more trustworthy, exhibiting reduced hallucination and providing safe responses, particularly for queries closely related to Southeast Asian culture.
SeaLLMs is a continuously iterated and improved series of language models that focuses on Southeast Asian (SEA) languages. The models are typically built by continued pretraining and fine-tuning of strong English models, developing strong capabilities in SEA languages without degrading performance in high-resource languages. SeaLLMs are built with a focus on local cultural and legal norms, customs, and stylistic preferences, as well as cost-effectiveness.
M3Exam consists of local exam questions collected from each country. It reflects the model's world knowledge (e.g., with language or social science subjects) and reasoning abilities (e.g., with mathematics or natural science subjects).
Model | en | zh | id | th | vi | avg | avg_sea |
---|---|---|---|---|---|---|---|
Sailor-7B-Chat | 0.66 | 0.652 | 0.475 | 0.462 | 0.513 | 0.552 | 0.483 |
gemma-7b | 0.732 | 0.519 | 0.475 | 0.46 | 0.594 | 0.556 | 0.510 |
SeaLLM-7B-v2.5 | 0.758 | 0.581 | 0.499 | 0.502 | 0.622 | 0.592 | 0.541 |
Qwen2-7B | 0.815 | 0.874 | 0.53 | 0.479 | 0.628 | 0.665 | 0.546 |
Qwen2-7B-Instruct | 0.809 | 0.88 | 0.558 | 0.555 | 0.624 | 0.685 | 0.579 |
Sailor-14B | 0.748 | 0.84 | 0.536 | 0.528 | 0.621 | 0.655 | 0.562 |
Sailor-14B-Chat | 0.749 | 0.843 | 0.553 | 0.566 | 0.637 | 0.67 | 0.585 |
SeaLLMs-v3-7B | 0.814 | 0.866 | 0.549 | 0.52 | 0.628 | 0.675 | 0.566 |
SeaLLMs-v3-7B-Chat | 0.809 | 0.874 | 0.558 | 0.569 | 0.649 | 0.692 | 0.592 |
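
The two summary columns can be reproduced from the per-language scores. The snippet below is a minimal sketch that assumes `avg` is the unweighted mean over all five languages and `avg_sea` is the unweighted mean over `id`, `th`, and `vi`; the variable names are illustrative and not taken from any released evaluation code.

```python
from statistics import mean

# Per-language M3Exam scores copied from the SeaLLMs-v3-7B-Chat row above.
scores = {"en": 0.809, "zh": 0.874, "id": 0.558, "th": 0.569, "vi": 0.649}

# Assumption: avg_sea covers only the three SEA-language columns.
SEA_LANGS = ("id", "th", "vi")

avg = mean(scores.values())
avg_sea = mean(scores[lang] for lang in SEA_LANGS)

print(f"avg={avg:.3f} avg_sea={avg_sea:.3f}")
```

Under these assumptions the result rounds to the 0.692 and 0.592 reported in the table.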
SeaBench consists of multi-turn human instructions spanning various task types. It evaluates chat-based models on their ability to follow human instructions in both single and multi-turn settings and assesses their performance across different task types. The dataset and corresponding evaluation code will be released soon!
Model | id turn1 | id turn2 | id avg | th turn1 | th turn2 | th avg | vi turn1 | vi turn2 | vi avg | avg |
---|---|---|---|---|---|---|---|---|---|---|
Qwen2-7B-Instruct | 5.93 | 5.84 | 5.89 | 5.47 | 5.20 | 5.34 | 6.17 | 5.60 | 5.89 | 5.70 |
SeaLLM-7B-v2.5 | 6.27 | 4.96 | 5.62 | 5.79 | 3.82 | 4.81 | 6.02 | 4.02 | 5.02 | 5.15 |
Sailor-14B-Chat | 5.26 | 5.53 | 5.40 | 4.62 | 4.36 | 4.49 | 5.31 | 4.74 | 5.03 | 4.97 |
Sailor-7B-Chat | 4.60 | 4.04 | 4.32 | 3.94 | 3.17 | 3.56 | 4.82 | 3.62 | 4.22 | 4.03 |
SeaLLMs-v3-7B-Chat | 6.73 | 6.59 | 6.66 | 6.48 | 5.90 | 6.19 | 6.34 | 5.79 | 6.07 | 6.31 |
We evaluate multilingual math capability using the MGSM dataset. MGSM originally contains only Chinese and Thai test sets, so we use Google Translate to translate the same English questions into the other SEA languages. Note that we follow each country's convention for writing numbers; e.g., in Indonesian and Vietnamese, dots are used as thousands separators and commas as decimal separators, the opposite of the English system.
MGSM | en | id | ms | th | vi | zh | avg |
---|---|---|---|---|---|---|---|
Sailor-7B-Chat | 33.6 | 22.4 | 22.4 | 21.6 | 25.2 | 29.2 | 25.7 |
Meta-Llama-3-8B-Instruct | 77.6 | 48 | 57.6 | 56 | 46.8 | 58.8 | 57.5 |
glm-4-9b-chat | 72.8 | 53.6 | 53.6 | 34.8 | 52.4 | 70.8 | 56.3 |
Qwen1.5-7B-Chat | 64 | 34.4 | 38.4 | 25.2 | 36 | 53.6 | 41.9 |
Qwen2-7B-Instruct | 82 | 66.4 | 62.4 | 58.4 | 64.4 | 76.8 | 68.4 |
aya-23-8B | 28.8 | 16.4 | 14.4 | 2 | 16 | 12.8 | 15.1 |
gemma-1.1-7b-it | 58.8 | 32.4 | 34.8 | 31.2 | 39.6 | 35.2 | 38.7 |
SeaLLM-7B-v2.5 | 79.6 | 69.2 | 70.8 | 61.2 | 66.8 | 62.4 | 68.3 |
SeaLLMs-v3-7B-Chat | 74.8 | 71.2 | 70.8 | 71.2 | 71.2 | 79.6 | 73.1 |
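
The number-formatting convention mentioned for the translated MGSM questions (dots for thousands, commas for decimals in Indonesian and Vietnamese) can be illustrated as follows. This is only a sketch of the convention; `to_dot_thousands` is a hypothetical helper, not the preprocessing used to build the test sets.

```python
def to_dot_thousands(value: float, decimals: int = 2) -> str:
    """Format a number with '.' as thousands separator and ',' as decimal separator."""
    # Start from the English-style rendering, e.g. 1,234,567.50
    english = f"{value:,.{decimals}f}"
    # Swap the two separators in a single pass.
    return english.translate(str.maketrans(",.", ".,"))


print(to_dot_thousands(1234567.5))  # 1.234.567,50
print(to_dot_thousands(1500, 0))    # 1.500
```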
We use the test sets from Flores-200 for evaluation and report the zero-shot chrF scores for translations between every pair of languages. Each row in the table below presents the average results of translating from various source languages into the target languages. The last column displays the overall average results of translating from any language to any other language for each model.
Model | en | id | jv | km | lo | ms | my | ta | th | tl | vi | zh | avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Meta-Llama-3-8B-Instruct | 51.54 | 49.03 | 22.46 | 15.34 | 5.42 | 46.72 | 21.24 | 32.09 | 35.75 | 40.80 | 39.31 | 14.87 | 31.22 |
Qwen2-7B-Instruct | 50.36 | 47.55 | 29.36 | 19.26 | 11.06 | 42.43 | 19.33 | 20.04 | 36.07 | 37.91 | 39.63 | 22.87 | 31.32 |
Sailor-7B-Chat | 49.40 | 49.78 | 28.33 | 2.68 | 6.85 | 47.75 | 5.35 | 18.23 | 38.92 | 29.00 | 41.76 | 20.87 | 28.24 |
SeaLLM-7B-v2.5 | 55.09 | 53.71 | 18.13 | 18.09 | 15.53 | 51.33 | 19.71 | 26.10 | 40.55 | 45.58 | 44.56 | 24.18 | 34.38 |
SeaLLMs-v3-7B-Chat | 54.68 | 52.52 | 29.86 | 27.30 | 26.34 | 45.04 | 21.54 | 31.93 | 41.52 | 38.51 | 43.78 | 26.10 | 36.52 |
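
To make the aggregation concrete, the sketch below averages pairwise scores per target language and over all translation directions. It assumes plain unweighted means; the `summarize` function and the toy scores are made up for illustration and do not reproduce the numbers in the table.

```python
from collections import defaultdict
from statistics import mean


def summarize(pair_scores):
    """pair_scores maps (source, target) -> chrF for one model."""
    by_target = defaultdict(list)
    for (src, tgt), score in pair_scores.items():
        if src != tgt:  # only translation between different languages
            by_target[tgt].append(score)
    # Average over all source languages for each target language.
    per_target = {tgt: mean(vals) for tgt, vals in by_target.items()}
    # Average over every (source, target) direction.
    overall = mean(s for (src, tgt), s in pair_scores.items() if src != tgt)
    return per_target, overall


# Toy, made-up scores for three languages.
toy = {
    ("en", "id"): 60.0, ("vi", "id"): 50.0,
    ("id", "en"): 62.0, ("vi", "en"): 58.0,
    ("en", "vi"): 40.0, ("id", "vi"): 44.0,
}
per_target, overall = summarize(toy)
print(per_target, round(overall, 2))
```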
This benchmark measures whether a model can refuse questions about non-existent entities. We report the F1 score, with refusal as the positive label. Our test set consists of ~1k samples per language. Each unanswerable question is generated by GPT-4o, and the ratio of answerable to unanswerable questions is 1:1. We define keywords to automatically detect whether a model-generated response is a refusal.
Refusal-F1 Scores | en | zh | vi | th | id | avg |
---|---|---|---|---|---|---|
Qwen1.5-7B-Instruct | 53.85 | 51.70 | 52.85 | 35.50 | 58.40 | 50.46 |
Qwen2-7B-Instruct | 58.79 | 33.08 | 56.21 | 44.60 | 55.98 | 49.73 |
SeaLLM-7B-v2.5 | 12.90 | 0.77 | 2.45 | 19.42 | 0.78 | 7.26 |
Sailor-7B-Chat | 33.49 | 18.82 | 5.19 | 9.68 | 16.42 | 16.72 |
glm-4-9b-chat | 44.48 | 37.89 | 18.66 | 4.27 | 1.97 | 21.45 |
aya-23-8B | 6.38 | 0.79 | 2.83 | 1.98 | 14.80 | 5.36 |
Llama-3-8B-Instruct | 72.08 | 0.00 | 1.23 | 0.80 | 3.91 | 15.60 |
gemma-1.1-7b-it | 52.39 | 27.74 | 23.96 | 22.97 | 31.72 | 31.76 |
SeaLLMs-v3-7B-Chat | 71.36 | 78.39 | 77.93 | 61.31 | 68.95 | 71.59 |
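
The scoring described for this benchmark can be sketched as keyword-based refusal detection followed by F1 with refusal as the positive class. The keyword list, function names, and sample responses below are hypothetical; the actual keywords used for the evaluation are not given here.

```python
# Hypothetical keyword list; the real one is not specified in this document.
REFUSAL_KEYWORDS = ("i don't know", "i'm not sure", "cannot find", "no information")


def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any refusal keyword."""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)


def refusal_f1(responses, unanswerable):
    """F1 with refusal as the positive label; unanswerable[i] is the gold label."""
    tp = fp = fn = 0
    for response, gold in zip(responses, unanswerable):
        pred = is_refusal(response)
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif gold and not pred:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


responses = [
    "I don't know of any such person.",
    "Paris is the capital of France.",
    "He was born in 1950.",
    "Sorry, I cannot find that entity.",
]
unanswerable = [True, False, True, True]
f1 = refusal_f1(responses, unanswerable)
print(f"{100 * f1:.2f}")
```

In this toy example the third response answers an unanswerable question, so recall drops to 2/3 while precision stays at 1.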
The MultiJail dataset consists of harmful prompts in multiple languages. We take the relevant prompts in SEA languages and report the safe rate (higher is better).
Model | en | jv | th | vi | zh | avg |
---|---|---|---|---|---|---|
Qwen2-7B-Instruct | 0.8857 | 0.4381 | 0.6381 | 0.7302 | 0.873 | 0.713 |
Sailor-7B-Chat | 0.7873 | 0.5492 | 0.6222 | 0.6762 | 0.7619 | 0.6794 |
Meta-Llama-3-8B-Instruct | 0.8825 | 0.2635 | 0.7111 | 0.6984 | 0.7714 | 0.6654 |
Sailor-14B-Chat | 0.8698 | 0.3048 | 0.5365 | 0.6095 | 0.727 | 0.6095 |
glm-4-9b-chat | 0.7714 | 0.2127 | 0.3016 | 0.6063 | 0.7492 | 0.5282 |
SeaLLMs-v3-7B-Chat | 0.8889 | 0.6000 | 0.7333 | 0.8381 | 0.927 | 0.7975 |
SeaLLMs-v3 was released in July 2024. It achieves SOTA performance on diverse tasks and was specifically enhanced to be more trustworthy, exhibiting reduced hallucination and providing safe responses.
SeaLLM-7B-v2.5 was released in April 2024. It possesses outstanding abilities in world knowledge and math reasoning in both English and SEA languages.
SeaLLM-7B-v2 was released in Feb 2024. It possesses outstanding abilities in math and commonsense reasoning in SEA languages.
SeaLLM-7B-v1 was released in Nov 2023. It was the first release of the SeaLLMs model family, and the first LLM built specifically for Southeast Asia.
Contributors of SeaLLMs (of all versions), ranked alphabetically by last name: Mahani Aljunied, Lidong Bing, Guanzheng Chen, Liying Cheng, Yue Deng, Zhiqiang Hu, Yew Ken Chia, Xingxuan Li, Xin Li, Chaoqun Liu, Xuan-Phi Nguyen, Hou Pong Chan, Chenhui Shen, Qingyu Tan, Jianyu Wang, Weiwen Xu, Sen Yang, Wenxuan Zhang, Hang Zhang, Yiran Zhao.
We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT datasets and evaluated our models across different aspects, especially safety.
If you find our project useful, we hope you would kindly star our repo and cite our work as follows.
Corresponding Author: l.bing@alibaba-inc.com
@article{damonlp2024seallm3,
author = {Wenxuan Zhang*, Hou Pong Chan*, Yiran Zhao*, Mahani Aljunied*,
Jianyu Wang*, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu,
Yew Ken Chia, Xin Li, Lidong Bing},
title = {SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages},
year = {2024},
url = {https://arxiv.org/abs/2407.19672}
}
@article{damonlpsg2023seallm,
author = {Xuan-Phi Nguyen*, Wenxuan Zhang*, Xin Li*, Mahani Aljunied*,
Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang,
Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang,
Chaoqun Liu, Hang Zhang, Lidong Bing},
title = {SeaLLMs - Large Language Models for Southeast Asia},
year = {2024},
booktitle = {ACL 2024 System Demonstrations},
url = {https://arxiv.org/pdf/2312.00738},