SeaLLMs-Audio

Voice Chat: Answering questions about factuality Next

Example 1

User

SeaLLMs-Audio

The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume.

Example 2

User

SeaLLMs-Audio

Satelit alami Bumi adalah Bulan. Bulan adalah satu-satunya satelit alami Bumi dan merupakan objek terbesar di orbit Bumi.

Voice Chat: Answering mathematical questions Next

Example 1

User

SeaLLMs-Audio

If 60% of the cars in the parking lot are four-door cars, and there are 75 four-door cars, you can find the total number of cars by dividing the number of four-door cars by the percentage they represent. So, if 75 is 60% of the total, you divide 75 by 0.6 to get the total number of cars. This gives you 125 cars in the parking lot altogether.

Example 2

User

SeaLLMs-Audio

设矩形的宽为 \( w \) 厘米，那么矩形的长是 \( 2w \) 厘米。

矩形的周长公式为：

\[ P = 2(\text{长} + \text{宽}) \]

根据题意，周长为36厘米，因此可以列出方程：

\[ 2(2w + w) = 36 \]

简化方程：

\[ 2(3w) = 36 \] \[ 6w = 36 \]

解这个方程，得到：

\[ w = \frac{36}{6} = 6 \]

所以矩形的宽是6厘米。矩形的长是宽的两倍，因此：

\[ \text{长} = 2w = 2 \times 6 = 12 \]

矩形的长是12厘米。

Voice Chat: Answering general questions Next

Example 1

User

SeaLLMs-Audio

Breakfast is important because it provides your body with the energy and nutrients needed to start the day. It helps to boost your metabolism, improve concentration, and increase energy levels. Additionally, eating breakfast can help you maintain a healthy weight by preventing overeating later in the day. It also sets a positive tone for the rest of the meals you eat throughout the day.

Example 2

User

SeaLLMs-Audio

การลดหย่อนภาษีสำหรับคนไทยสามารถทำได้หลายวิธี โดยเฉพาะอย่างยิ่งในส่วนของภาษีเงินได้บุคคลธรรมดา ดังนี้:

1. **การบริจาคเพื่อการกุศล**: การบริจาคเงินให้กับองค์กรการกุศลที่ได้รับการรับรองจากกรมสรรพากรสามารถนำมาลดหย่อนภาษีได้ ตามกฎหมาย ผู้บริจาคสามารถนำจำนวนเงินที่บริจาคมาหักออกจากฐานภาษีได้ แต่ต้องมีหลักฐานการบริจาค เช่น ใบเสร็จหรือใบกำกับภาษี

2. **การลงทุนในกองทุนรวม**: การลงทุนในกองทุนรวมที่ได้รับการรับรองจากสำนักงานคณะกรรมการกำกับหลักทรัพย์และตลาดหลักทรัพย์ (ก.ล.ต.) และกองทุนรวมที่ได้รับการรับรองจากกรมสรรพากร สามารถนำมาลดหย่อนภาษีได้ ตามกฎหมาย ผู้ลงทุนสามารถนำผลตอบแทนจากการลงทุนมาหักออกจากฐานภาษีได้

3. **การประกันชีวิต**: การซื้อประกันชีวิตที่ได้รับการรับรองจากกรมสรรพากรสามารถนำมาลดหย่อนภาษีได้ ตามกฎหมาย ผู้ที่ซื้อประกันชีวิตสามารถนำเบี้ยประกันมาหักออกจากฐานภาษีได้ แต่ต้องมีหลักฐานการชำระเบี้ยประกัน เช่น ใบเสร็จหรือใบกำกับภาษี

การใช้วิธีเหล่านี้ในการลดหย่อนภาษีจะช่วยลดภาระภาษีของคุณได้อย่างมีประสิทธิภาพ หากมีข้อสงสัยเพิ่มเติม ควรปรึกษากับผู้เชี่ยวชาญด้านภาษีหรือที่ปรึกษาทางการเงินเพื่อรับคำแนะนำที่เหมาะสมกับสถานการณ์ของคุณ

Audio Analysis: Audio Captioning Next

Example 1

User

Write a description of the audio file.

SeaLLMs-Audio

A dog pants and then a car horn blows followed by a dog barking

Example 2

User

Diễn tả nội dung đoạn ghi âm.

SeaLLMs-Audio

Mưa rơi và sấm sét vang xa

Audio Analysis: Automatic Speech Recognition Next

Example 1

User

Please write down what is spoken in the audio file.

SeaLLMs-Audio

Ah you know i saw you by the way, ah i was gonna say on broadway but it wasn't it was shakespeare in the park.

Example 2

User

ช่วยพิมพ์ข้อความที

SeaLLMs-Audio

เพื่อความปลอดภัยของท่าน โปรดให้ความสนใจชมการสาธิตความปลอดภัยบนรถเมย์ด้วยค่ะ

Audio Analysis: Speech Question Answering Next

Example 1

User

Who has more activity in the day?

SeaLLMs-Audio

Neil has more activity, spending a day in meetings, shopping, and collecting kids from school.

Example 2

User

Nhân vật chính trong câu chuyện này sinh ra ở đâu?

SeaLLMs-Audio

Nhân vật chính trong câu chuyện này được sinh ra ở thủ đô Nam Vang.

Due to the absence of standard audio benchmarks for evaluating audio LLMs in Southeast Asia, we have manually created a benchmark called SeaBench-Audio. It comprises nine tasks:

Tasks with both audio and text inputs: Audio Captioning (AC), Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Recognition (SER), Speech Question Answering (SQA), and Speech Summarization (SS).
Tasks with only audio inputs: Factuality, Math, and General.

We manually annotated 15 questions per task per language. For evaluation, qualified native speakers rated each response on a scale of 1 to 5, with 5 representing the highest quality.

Due to the lack of LALMs for all the three Southeast Asian languages, we compare the performance of SeaLLMs-Audio with relevant audio LLMs with similar sizes, including Qwen2-Audio-7B-Instruct (Qwen2-Audio), MERaLiON-AudioLLM-Whisper-SEA-LION (MERaLiON), llama3.1-typhoon2-audio-8b-instruct (typhoon2-audio), and DiVA-llama-3-v0-8b (DiVA). All the audio LLMs can accept audio with text as input. The results are shown in the figure below.

Our model is available on Hugging Face. You can easily use it with the transformers library or vllm library. Below are some examples to get you started.

Get started with transformers

from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import librosa
import os

model = Qwen2AudioForConditionalGeneration.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B", device_map="auto")
processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")

def response_to_audio(conversation, model=None, processor=None):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    if ele['audio_url'] != None:
                        audios.append(librosa.load(
                            ele['audio_url'], 
                            sr=processor.feature_extractor.sampling_rate)[0]
                        )
    if audios != []:
        inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True,sampling_rate=16000)
    else: 
        inputs = processor(text=text, return_tensors="pt", padding=True)
    inputs.input_ids = inputs.input_ids.to("cuda")
    inputs = {k: v.to("cuda") for k, v in inputs.items() if v is not None}
    generate_ids = model.generate(**inputs, max_new_tokens=2048, temperature = 0, do_sample=False)
    generate_ids = generate_ids[:, inputs["input_ids"].size(1):]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return response

# Voice Chat
os.system(f"wget -O fact_en.wav https://damo-nlp-sg.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system(f"wget -O general_en.wav https://damo-nlp-sg.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)

# Audio Analysis
os.system(f"wget -O ASR_en.wav https://damo-nlp-sg.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)

Get started with vllm

from vllm import LLM, SamplingParams
import librosa, os
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")
llm = LLM(
    model="SeaLLMs/SeaLLMs-Audio-7B", trust_remote_code=True, gpu_memory_utilization=0.5,  
    enforce_eager=True,  device = "cuda",
    limit_mm_per_prompt={"audio": 5},
)

def response_to_audio(conversation, model=None, processor=None, temperature = 0.1,repetition_penalty=1.1, top_p = 0.9,max_new_tokens = 4096):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    if ele['audio_url'] != None:
                        audios.append(librosa.load(
                            ele['audio_url'], 
                            sr=processor.feature_extractor.sampling_rate)[0]
                        )

    sampling_params = SamplingParams(
        temperature=temperature, max_tokens=max_new_tokens, repetition_penalty=repetition_penalty, top_p=top_p, top_k=20,
        stop_token_ids=[],
    )

    input = {
            'prompt': text,
            'multi_modal_data': {
                'audio': [(audio, 16000) for audio in audios]
            }
            }

    output = model.generate([input], sampling_params=sampling_params)[0]
    response = output.outputs[0].text
    return response

# Voice Chat
os.system(f"wget -O fact_en.wav https://damo-nlp-sg.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system(f"wget -O general_en.wav https://damo-nlp-sg.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)

# Audio Analysis
os.system(f"wget -O ASR_en.wav https://damo-nlp-sg.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)

SeaLLMs-Audio: Large Audio-Language Models
for Southeast Asia

Voice Chat

Audio Analysis

Performance: SeaBench-Audio

Quick Start

Get started with transformers

Get started with vllm

Acknowledgement

BibTeX

SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

Voice Chat

Audio Analysis

Performance: SeaBench-Audio

Quick Start

Get started with transformers

Get started with vllm

Acknowledgement

BibTeX

SeaLLMs-Audio: Large Audio-Language Models
for Southeast Asia