FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Published: arXiv: 2509.16195v1
Authors

Luca Della Libera, Cem Subakan, Mirco Ravanelli

Abstract

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.

Paper Summary

Problem
The paper addresses the challenge of building a neural audio codec (NAC) that compresses speech into a compact discrete representation at low bitrates while supporting real-time streaming inference. Most current NACs are non-streamable, which limits their use in applications such as speech assistants, interactive dialogue, and low-latency generation.
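To make "streamable" concrete, here is a minimal sketch of the general idea, not the codec's architecture: a toy causal filter whose output at each step depends only on current and past samples. Because of that, processing the signal chunk by chunk with a small carried-over state gives the same result as processing it all at once. The function name `causal_filter`, the kernel, and the chunk size are all hypothetical and chosen only for illustration.

```python
def causal_filter(chunk, kernel, state=None):
    """Toy causal FIR filter (illustrative only, not FocalCodec-Stream).

    Each output sample is a weighted sum of the current sample and the
    len(kernel) - 1 samples before it. `state` holds the trailing samples
    of the previous chunk so that chunks can be processed one at a time.
    """
    k = len(kernel)
    if state is None:
        state = [0.0] * (k - 1)  # silence before the start of the stream
    padded = list(state) + list(chunk)
    out = [
        sum(w * v for w, v in zip(kernel, padded[i:i + k]))
        for i in range(len(chunk))
    ]
    # Keep the last k - 1 samples as context for the next chunk.
    new_state = padded[len(padded) - (k - 1):]
    return out, new_state


def run_offline(signal, kernel):
    """Process the whole signal in one call."""
    out, _ = causal_filter(signal, kernel)
    return out


def run_streaming(signal, kernel, chunk_size):
    """Process the signal chunk by chunk, carrying state between calls."""
    out, state = [], None
    for start in range(0, len(signal), chunk_size):
        y, state = causal_filter(signal[start:start + chunk_size], kernel, state)
        out.extend(y)
    return out


signal = [float(i % 7) for i in range(23)]
kernel = [0.25, 0.25, 0.25, 0.25]
offline = run_offline(signal, kernel)
streamed = run_streaming(signal, kernel, chunk_size=5)
print(all(abs(a - b) < 1e-12 for a, b in zip(offline, streamed)))
```

A non-causal model, by contrast, needs future samples to compute each output, so it cannot emit anything until the full input (or a long look-ahead window) is available.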
Key Innovation
FocalCodec-Stream is a hybrid codec based on focal modulation. It combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. This lets the codec compress speech into a single binary codebook at low bitrates (0.55 - 0.80 kbps) while supporting streaming inference with a theoretical latency of 80 ms.
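The bitrate and latency figures can be related with simple arithmetic. The sketch below assumes a token rate of 50 Hz, codebook widths of 11 and 16 bits, and a 16 kHz sample rate. None of these values is stated in this summary; they are illustrative assumptions chosen because they reproduce the quoted 0.55 - 0.80 kbps range. For a single codebook, bitrate is the token rate times the number of bits per token.

```python
def bitrate_kbps(token_rate_hz, bits_per_token):
    """Bitrate of a single-codebook codec: tokens/s times bits/token."""
    return token_rate_hz * bits_per_token / 1000.0


def latency_in_samples(latency_ms, sample_rate_hz):
    """Number of audio samples spanned by a given latency."""
    return int(round(latency_ms * sample_rate_hz / 1000))


# Assumed values, not taken from the summary above.
TOKEN_RATE_HZ = 50        # assumption: one token every 20 ms
SAMPLE_RATE_HZ = 16000    # assumption: 16 kHz speech input
LATENCY_MS = 80           # theoretical latency quoted in the abstract

for bits in (11, 16):     # assumption: codebook sizes 2**11 and 2**16
    print(bits, "bits/token ->", bitrate_kbps(TOKEN_RATE_HZ, bits), "kbps")

print("latency in samples:", latency_in_samples(LATENCY_MS, SAMPLE_RATE_HZ))
print("latency in token periods:", LATENCY_MS * TOKEN_RATE_HZ / 1000)
```

Under these assumptions, a binary codebook with b bits per token has 2**b entries, so widening the code by one bit doubles the codebook size but adds only 50 bits per second to the bitrate.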
Practical Impact
FocalCodec-Stream targets real-time speech systems such as speech assistants, interactive dialogue, and low-latency generation, where audio must be processed as it arrives. According to the paper's experiments, it outperforms existing streamable codecs at comparable bitrates. Because the codec preserves both semantic and acoustic information, its tokens are also suitable as a representation for downstream models such as speech language models (SLMs).
Analogy / Intuitive Explanation
Imagine watching a video online. A non-streamable codec is like a file that must be fully downloaded before playback can start: the file may be small, but you still wait for all of it. FocalCodec-Stream is like a stream that stays small and starts playing almost immediately, because each piece of speech is encoded without waiting for the rest of the recording, with a theoretical latency of 80 ms. The lightweight refiner module helps recover quality that would otherwise be lost under this latency constraint.
Paper Information
Categories: cs.SD cs.AI cs.LG eess.AS
arXiv ID: 2509.16195v1
