> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.
I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.
AFAIK, ChatGPT Voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn’t also sound Indian), and assuming ethnicity / biasing based on accents.
It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.
Author here. I think it's more of a capability issue than a safety issue. Since learning audio is still harder than learning text, audio models don't generalize as well. To fix that, audio models rely on combining information from text and audio (having a single model that consumes/produces both text and audio tokens) and the audio tokens basically end up being an integrated speech-to-text/text-to-speech. This reflects my colleagues' experience working on Moshi, and it seems to be the case for other models too, see the Conclusion section.
Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech, the tone of the voice doesn't have any information, so the model learns to ignore it.
Audio models for speech not understanding pitch seems similar to how text LLMs often don't understand spelling: it's not what they were trying to recognize.
There was an example on the OpenAI blog of ChatGPT copying the speaker's voice and responding in it mid-conversation. This was presented as an example of non-alignment.
Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.
I recently finetuned a TTS* to be able to emit laughter, and hunting for transcriptions which include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sighs, sniffs, laughs, etc. and can't detect mispronunciations etc.
* = https://github.com/coezbek/PlayDiffusion
IIRC -- the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specifically because they included more emotive clues in the transcription, and supported a syntax to control the emotive aspect of the speech.
From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...
> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]
> accent matching (if you sound Indian, it shouldn't also sound Indian)
Why not? I've found that it helps greatly with mutual intelligibility when both sides are speaking a similar dialect, and the one who can do this switching, switches to that of the one who can't.
(I wish I could also use an Indian accent bidirectionally; would definitely come in handy for those aggravating times I've had to talk to an outsourced customer service department.)
Exactly this. It fits that worldview to think a literal computer was “being racist” and mocking the user, even just by copying their speech patterns accurately.
Did they respond differently depending on what race they thought you were? I'm surprised they would even do that honestly. I thought they were trained on text conversations which presumably wouldn't have any of that to learn from.
You can often tell where someone is from from text alone! There are plenty of idiosyncrasies even in how different English speaking countries use the language.
All my Indian colleagues say "I agree with the same"; this "the same" turn of phrase was so strange to me that I had to ask (I'm French, so I have my own silly quirks, like I forget non-vocal plural(s<-- see, often I don't write that s)). They told me it was like that in Hindi, so they just reproduce the pattern and it's grammatically acceptable.
For French people like me, false friends are immediately noticeable: for instance, "actually" to mean "now" instead of "in fact".
Pre-nerf the 4o voice model had a wide range of expressivity, and it would match affect (still tries to do this) and idiolect of listeners if asked. Nowadays there's a list of accents that are considered "hate-ish" and a list that aren't.
I will elide the rant inside me that west coast 20 somethings get to decide if speaking in a certain accent is racist or "bad". But it's a heartfelt rant.
There are subtle differences in language where two groups can be speaking English and one is having a completely different conversation without saying much.
No, I'm saying that it is more meaningful to use what is directly derived rather than what is an indirect assumption. There are already issues with people erroneously considering whatever LLMs output as truth; the last thing anyone needs is an LLM claiming someone like Idris Elba is a white Briton because of his accent. We don't need automated phrenology machines, and that's what "determined your race from your voice" is pretty close to.
I don't think it's just safeguards; they really don't seem to understand pitch at all. I tried asking ChatGPT's advanced voice mode to recognize a tune I was humming, and it insisted it was Beethoven's 5th -- multiple times. I think it must have basically tokenized my humming to "dun dun dun duuun".
advanced voice mode operates on audio tokens directly; it doesn't transcribe the audio into "text tokens" as an intermediate step like the original version of voice mode did.
right, but either whatever audio tokenization it's doing doesn't seem to encode pitch, or there's ~nothing where pitch is relevant in the training set.
we don't know if that's due to inherent limitations of the tokenisation of audio, or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time from when they initially released advanced voice mode. The model used to be able to sing, whisper, imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.
I don't find the article's argument that this is due to tokenisation convincing.
> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.
Like others I noticed this capability was interfered with in some way. I had fun getting it to speak to me in a cheesy over-the-top Bostonian accent early on, then one day when I tried to demonstrate for a friend it interrupted itself mid-sentence, literally one voice speaking over the other truncated voice, saying something like "I'm sorry I can't mimic voices".
It seemed like they had one model monitoring the output of another model and then cutting it off when it crossed some line.
I wonder if a linear-time, constant-space model like RWKV or S4 would work better here. For audio, I wouldn't think you'd need long range context, and all-to-all mapping seems like overkill.
Maybe a transformer could be running in parallel, but much lower frequency, where the linear model feeds it "summary" tokens once per second, whose information would mostly be "text", but also some hint of emotion and other cues. Then the output of this could be fed back to the linear model so that it would know what it was saying and with what emotion. Basically the transformer would be the low frequency long range context thinker (and feeler), and the linear model would translate that to and from phonetics.
They'd be trained in parallel, so those transformer tokens would attain meaning at training time, not something that would have to be pre-defined. So it'd still be purely phonetic e2e, no direct translation to text. It could even end up being a good way to compress text for LLMs, since low-value words might have smaller representation in the token.
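As a rough toy sketch of what I mean (all module names, sizes, and the 12-frame stride are made up; nothing here is tuned or claimed to work, it's just the shape of the idea):

    import torch
    import torch.nn as nn

    class TwoRateSpeechLM(nn.Module):
        """Toy sketch: a fast GRU runs at the codec frame rate, a slow transformer
        runs once per chunk on pooled summary tokens, and the latest summary is
        fed back to condition the GRU."""
        def __init__(self, frame_dim=32, hidden=256, stride=12):
            super().__init__()
            self.stride = stride
            self.fast = nn.GRU(frame_dim + hidden, hidden, batch_first=True)
            layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
            self.slow = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(hidden, frame_dim)  # predict the next frame embedding

        def forward(self, frames):                    # frames: (B, T, frame_dim)
            B, T, _ = frames.shape
            context = torch.zeros(B, 1, self.fast.hidden_size)  # slow-model feedback
            summaries, outputs, h = [], [], None
            for t0 in range(0, T, self.stride):
                chunk = frames[:, t0:t0 + self.stride]          # (B, <=stride, frame_dim)
                ctx = context.expand(-1, chunk.size(1), -1)
                out, h = self.fast(torch.cat([chunk, ctx], dim=-1), h)
                outputs.append(self.head(out))
                summaries.append(out[:, -1:, :])                # one summary token per chunk
                context = self.slow(torch.cat(summaries, dim=1))[:, -1:, :]
            return torch.cat(outputs, dim=1)

    x = torch.randn(2, 120, 32)        # 2 sequences of 120 codec frames
    print(TwoRateSpeechLM()(x).shape)  # torch.Size([2, 120, 32])

The slow model only ever sees one summary token per chunk, so its sequence length stays tiny compared to the frame rate.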
Probably would never reach the level of text based LLMs for logic and code and such, but that somewhat parallels humans anyway; it's pretty hard to explain an algorithm in detail in plain conversation.
I don't know about linear models, but this kind of hierarchical modelling is quite a common idea in speech research. For example, OpenAI's Jukebox (2020) [1], which uses a proto-neural audio codec, has three levels of encoding that get coarser and coarser. They use a language model to predict continuations in the coarsest level and then have models to upscale to the finer levels and finally back to audio.
The recent MiMo-audio bunches tokens into "patches" of four timesteps and has the model predict those. [2]
[1] https://arxiv.org/abs/2005.00341
[2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audi...
If anyone wants to buy me some GPU time I'd be happy to try it out! Fair warning: my only experience in deep learning thus far was training a CNN to count dots on an image, which worked semi reliably up to 8, when the image was perfectly square black "dots" on a perfectly white background.
Off-topic, but it would be great if everyone who voiced their opinion on something would add a small disclaimer with their actual knowledge about the subject. Thanks for sharing :)
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to go about tokenizing spatial and time domain signals for these kinds of models?
Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of original audio. It does not require other frames to do this. I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
I don't know if it's possible to use 400 byte tokens in a transformer model, but I would be very compelled to try.
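For what it's worth, the arithmetic above checks out, assuming 44.1 kHz 16-bit stereo PCM as the baseline:

    # Back-of-the-envelope check of the MP3 numbers above (44.1 kHz assumed).
    samples_per_frame = 1152
    frame_ms = samples_per_frame / 44100 * 1000               # ~26.1 ms per frame
    frame_bytes = 128_000 / 8 * (samples_per_frame / 44100)   # ~418 bytes at 128 kbps CBR
    pcm_bytes = samples_per_frame * 2 * 2                     # 16-bit stereo PCM for the same span
    print(round(frame_ms, 1), round(frame_bytes), pcm_bytes, round(pcm_bytes / frame_bytes, 1))
    # -> 26.1 418 4608 11.0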
Author here. There are a few reasons, but the biggest one is simply the compression ratio.
The OG neural audio codec SoundStream (whose first author is Neil, now at Kyutai) can sound decent at 3kbps, whereas MP3 typically has around 128kbps, as you say. Interestingly, it was originally developed for audio compression for Google Meet, not for LLMs. Today's neural codecs have even better compression.
The more modern MP3 alternative is Opus, which can work ok at 12kbps, but it's still less efficient than neural audio codecs. However, these traditional codecs are a lot less CPU-hungry, so they have that going for them.
Why RVQ though, rather than using the raw VAE embedding?
If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors, (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?
I had a part about this but I took it out: for compression, you could keep the embeddings unquantized and it would still compress quite well, depending on the embedding dimension and the number of quantization levels.
But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want.
Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759
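A toy numeric version of that intuition (made-up data, just to show the mean-vs-mode problem):

    import numpy as np

    # Suppose the "next embedding" is either 0.0 or 1.0 with equal probability.
    targets = np.random.default_rng(0).choice([0.0, 1.0], size=10_000)

    # Regressing the embedding with squared error gives you the mean...
    mse_optimal = targets.mean()         # ~0.5, a value that never actually occurs

    # ...while a categorical model over {0.0, 1.0} can keep both modes.
    probs = np.array([np.mean(targets == 0.0), np.mean(targets == 1.0)])
    print(mse_optimal, probs)            # ~0.5 vs ~[0.5, 0.5]

The squared-error-optimal prediction is a value the data never contains, while the categorical model is free to put half its mass on each mode.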
At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926
Hmm, I think a mixture of beta distributions could work just as well as categorical here. I'm going to train it for PixelRNN, but it's going to take hours or days to train (it's a very inefficient and unparallelizable architecture). I'll report back tomorrow.
The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? In this paper we modify \libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.
https://proceedings.neurips.cc/paper_files/paper/2018/file/7...
I suspect mp3 is also a good idea
Human audio perception is based on detecting the frequency components, which we detect via what amounts to a filter bank in the inner ear (different length hairs with different resonant frequencies).
Speech perception builds upon frequencies and is based on "formants" - the frequency bands that are emphasized by the vocal tract resonances created by articulation when the speech was generated. More specifically, most speech information is contained in formant changes since these correspond to articulatory changes. There are also other articulatory artifacts in speech such as the onsets of speech energy corresponding to plosives ("puh", "buh"), and the high frequencies generated by fricatives like "sss".
One problem with embedding MP3 frames as audio tokens would be that although MP3 compression is based on frequency representation, you've then got quantization, Huffman encoding and the MP3 frame structure all on top of that, so the frame as a whole is going to be more of a black box. Presumably a transformer could still use MP3 frames to predict the text transcription, or any arbitrary encoding of speech audio for that matter (similar to how an LLM can predict text from Base64 representation, or vice versa), but it's certainly not making it easier if the input is obfuscating the frequency components and formants etc that correspond to the generating process.
Not having direct access to the frequency/formant information is also going to make generalization more difficult since that is based around formant structure and changes. When articulating the same word, the specific formant frequencies will differ between individuals, primarily based on vocal tract length, but humans have no problem generalizing across these and understanding speech from different individuals. I'm not sure if an LLM only trained to predict MP3 speech from, say, male adults, would necessarily have generalized enough to also be able to recognize child speech or that from a speech synthesizer.
You can try to train an adapter from a raw 400-byte MP3 frame to an embedding for a given LLM (4096+ floating point numbers, exact precision varies).
But you'd need that information to be digestible for a neural network. Otherwise, you'll have a very hard time getting that adapter to work.
As a rule: neural networks love highly redundant data, and hate highly compressed data at their inputs. Tokenized text good, GZIP compressed bytestream bad. But who knows, really. It's a rule of thumb, not a mathematical law. So you could have some success getting that MP3-based adapter to work. I've seen weirder shit work.
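A minimal sketch of what such an adapter could look like (all sizes invented; per the rule of thumb above, raw MP3 frame bytes may simply be too compressed for this to ever train well):

    import torch
    import torch.nn as nn

    class Mp3FrameAdapter(nn.Module):
        """Map one ~418-byte MP3 frame to a single d_model-sized "soft token"
        for some frozen LLM (hypothetical setup, not a real integration)."""
        def __init__(self, frame_bytes=418, d_model=4096):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(frame_bytes, 1024), nn.GELU(),
                nn.Linear(1024, 1024), nn.GELU(),
                nn.Linear(1024, d_model),
            )

        def forward(self, frames):                    # (B, T, frame_bytes), raw bytes 0..255
            return self.net(frames.float() / 255.0)   # (B, T, d_model) pseudo-embeddings

    frames = torch.randint(0, 256, (1, 50, 418))      # ~1.3 s of audio at ~26 ms per frame
    print(Mp3FrameAdapter()(frames).shape)            # torch.Size([1, 50, 4096])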
if you were able to normalize and quantokenize the distinct DCT values in a consistent way, it could be an interesting approach. so yeah, undo the bit packing but keep the front end signal processing and compressed DCT representation and voilà! something quite weird that might actually work. :)
The approach in TFA encodes into a 32 dimensional space. I suspect this is significantly more dimensions than any psycho-acoustic compression algorithm uses. Also, throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.
> throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.
I'm not sure I follow. If there is a set of tokens that the average human cannot perceive, why wouldn't we want to eliminate them from the search space? Who is the target audience for this model?
Humans that read (at least) Indo-European languages can read texts in their native language with all the vowels removed. Does that suggest that it would be a good idea to remove the vowels from text before using it for training text-based LLMs ?
Presumably you want to train on as rich a set of data as possible, even if some of that data is redundant or irrelevant when it comes to human perception.
I imagine it would be like if there were Rosetta Stones of text, written with a language you could read and a language you couldn't. For your purposes, discarding the text you can't read would be fine and you wouldn't lose anything. But if you were ingesting a bunch into an LLM, the additional text would give the LLM more context and help it make connections and relate words more accurately, even if you never were going to have it output anything in the language you don't understand.
The inaudible sounds add context and additional datapoints on how the audible sounds are related.
I believe language models usually use 2-byte (16 bit) tokens, which corresponds to a vocabulary of 2^16=65536 possible tokens. With 400 bytes per token this would be a vocabulary of 2^(400*8), which is an astronomically large number. Way too large to be practical, I assume.
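Just to put numbers on the mismatch (Python computes the big one exactly):

    vocab_2_bytes = 2 ** 16                  # 65,536 possible token ids
    vocab_400_bytes = 2 ** (400 * 8)         # one id per possible 400-byte frame
    print(vocab_2_bytes)                     # 65536
    print(len(str(vocab_400_bytes)))         # 964 -- a 964-digit count of possible "tokens"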
This has got to be one of the most visually pleasing explanations I have seen of these concepts. Congrats!
I attempted some similar VQ-VAE work instead trying to tokenize rendered text. I was curious if I could make a visual llm working on 10 pt rendered font, but I also tried using PDF sources. The basic idea was to do what more advanced diffusion image models can do where they generate images of text. Make a specific image text diffusion model to do completions. Further I wondered if I could embed things like document type and language so you could have a latent representation of text more abstracted than current dictionary tokenizers. Learned a lot and thought it was all beautifully displayed in this post.
An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
The article is talking about doing exactly that. The key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens. A single window of audio is usually somewhere between 10ms and 100ms. It's difficult to squish all that information down to a single "token" that represents the semantic and acoustic content for that window.
That's why residual vector quantization is a useful technique - using multiple dictionaries to quantize a single timeslice, each conditioned on the previous residual level. You can also quantize a signal at different frequencies.
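A bare-bones sketch of that residual cascade (random, untrained codebooks, so the reconstruction error won't actually be small; it just shows the mechanism):

    import numpy as np

    def rvq_encode(x, codebooks):
        """Quantize with codebook 0, then quantize what's left with codebook 1, etc."""
        codes, residual = [], x.copy()
        for cb in codebooks:                                               # cb: (K, D)
            idx = int(np.argmin(((residual[None, :] - cb) ** 2).sum(-1)))  # nearest codeword
            codes.append(idx)
            residual = residual - cb[idx]
        return codes

    def rvq_decode(codes, codebooks):
        return sum(cb[i] for cb, i in zip(codebooks, codes))               # sum of chosen codewords

    rng = np.random.default_rng(0)
    codebooks = [rng.normal(size=(32, 8)) for _ in range(4)]               # 4 levels, 32 entries, dim 8
    frame = rng.normal(size=8)                                             # one "timeslice" embedding
    codes = rvq_encode(frame, codebooks)
    print(codes, np.linalg.norm(frame - rvq_decode(codes, codebooks)))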
There are samples towards the end of the post of their LLM trained on their Mimi audio codec.
> The key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens. A single window of audio is usually somewhere between 10ms and 100ms. It's difficult to squish all that information down to a single "token" that represents the semantic and acoustic content for that window.
I read the article and confess some of the modeling parts were above my comprehension. But I would like to add that as an audio engineer, the "key question" you describe is solved, just not applied to transformer models (?).
An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently. And with tools like Melodyne - which already quantize audio semantically - they can identify (and manipulate) pitch and formants as well, turning an O vowel into an E vowel, or changing the inflection of a phrase (up-speak vs down-speak, for example).
I don't know how to apply this to a neural codec, but it seems like it shouldn't be that hard (that's my naivete coming through)
> An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently.
As an experienced DAW author, I very, very much doubt this.
What can be done relatively easily is to "see" or rather "follow along" in the waveform when listening to the audio. But I read your claim as being that someone could look at the waveform (which is already decimated from the original) and identify words or phonemes without hearing the associated audio. I am extremely skeptical that there is anyone anywhere in the world who can do this.
I started in music but have since edited thousands of hours of podcasts. I cannot transcribe a track by looking at the waveform, except the word "um" haha. But without playing the audio I can tell you where words start and end, whether a peak is a B or a T or an A or an I sound... And melodyne can add layers to that and tell me the pitch, formants (vowels), quantize the syllables etc. If I can do all this, a computer ought to be able to do the same and more
Hundreds of hours here, and I can't even always reliably spot my own ums. I edit as many out as I possibly can for myself, my co-host and guest, as well as eliminating continuation signaling phrases like "you know" and "like". I also remove uninteresting asides and bits of dead air. This is boring and tedious work but it makes the end result considerably better I think.
I feel like there should be a model that can do much of this for me but I haven't really looked into it, ironically due to laziness, but also because I edit across multiple tracks at this stage, and I'm afraid to feed the model an already mixed stereo track. I'm curious why you still do it manually, if you still do and if you've looked into alternatives.
> I edit as many out as I possibly can for myself, my co-host and guest, as well as eliminating continuation signaling phrases like "you know" and "like". I also remove uninteresting asides and bits of dead air.
Hopefully using Ardour's "Ripple - Interview" mode :))
I use Descript to edit videos/podcasts and it works great for this kind of thing! It transcribes your audio and then you can edit it as if you were editing text.
Yeah, that stuff is just freaking amazing. I don't know what the transcription quality is like, but if I was doing this as a job, and it was good at transcription, I'd definitely be using that all the time.
> An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently.
DAWs' rendered waveforms have so little information that such identification is likely impossible even in theory. Telling apart plosives and vowels maybe, but not much more than that.
I work with phoneticians and they can (sometimes) read even words from suitably scaled spectrograms, but that's a lot more information than in waveforms.
One of the popular speech-to-text models is Whisper, which starts with the conventional spectral analysis of the speech signal, and then feeds the data into a Transformer model. It works quite well.
https://openai.com/index/whisper/
Such an approach dates back to the 1940s, when people were trained to read speech from spectrograms. There is a 1947 book "Visible Speech" by Potter, Kopp, and Green describing these experiments. Here is a slightly more recent 1988 review of the subject: "Formalizing Knowledge Used in Spectrogram Reading"
https://apps.dtic.mil/sti/tr/pdf/ADA206826.pdf
> the key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens
Did Claude Shannon not answer this question in 1948? You need at least 1 bit per 6dB of dynamic range for each symbol and 2B symbols per second where B is the bandwidth of the signal.
Compression techniques are all about getting below that fundamental limit but it's not like this is an unsolved problem. Or is 1kbaud too much for LLMs?
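Putting rough numbers on that for telephone-quality speech (assuming 4 kHz bandwidth and ~96 dB of dynamic range; the ~1.1 kbps figure for Mimi comes up below):

    bandwidth_hz = 4_000
    symbols_per_s = 2 * bandwidth_hz           # Nyquist: 8,000 symbols/s
    bits_per_symbol = 96 / 6                   # ~16 bits for ~96 dB of dynamic range
    raw_bps = symbols_per_s * bits_per_symbol  # 128,000 bit/s, i.e. roughly 16-bit / 8 kHz PCM
    print(raw_bps, raw_bps / 1100)             # neural codecs like Mimi sit ~100x below that raw rate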
Yes, quantization isn't anything new, nor are audio codecs. As you point out, though, it's not just about designing a quantization scheme to reconstruct an analog signal. The scheme itself needs to be "easy" for current model architectures to learn and decode autoregressively (ideally, in realtime on standard hardware).
The blog post addresses this directly with samples from their own baseline (an autoregressive mu-law vocoder), and from WaveNet (which used a similar architecture). The sound is mostly recognizable as a human voice, but it's unintelligible. The sequence length is too long and the SNR for the encoding scheme is too low for a generative/autoregressive model to learn.
This is what the neural codec is intended to address. Decoupling semantic from acoustic modelling is an important step ("how our ears interpret a sound" vs. "what we need to reconstruct the exact acoustic signal"). Mimi works at 1.1kbps, and others work at low bitrates (descript, semanticodec, etc). Encodec runs at a higher bitrate, so it generally delivers better audio quality.
Now - why are neural codecs easier to model than conventional parametric codecs? I don't know. Maybe they're not, maybe it's just an artifact of the transformer architecture (since semantic tokens are generally extracted from self-supervised models like WavLM). It's definitely an interesting question.
There is data, but nowhere near the amount of written language, which is fairly normalized and doesn't require accounting for additional features such as language, dialect, intonation, facial expression, and hand gestures. Speech-to-text is used as the translation layer: it throws many of those other features away and contextualizes the speech into a set of tokens that are much more efficient to map between languages.
It costs more to train on audio tokens but I'm sure we will get there. Training a model on transcript of a lecture on YouTube vs. training on audio of it will make a difference.
Audio tokenization consumes at least 4x tokens versus text. So there is an efficiency problem to start with. Then is there enough audio data to train a LLM from scratch?
Don't we have tens of thousands of hours (hundreds of thousands?) of closed captioned tv shows and movies? How many hours of news broadcasts with transcripts do we have? Maybe I just don't understand what is needed, but it seems like we have a lot of data to work with.
Correct me if I’m wrong but you need more than just closed captions. You need precise timing too. I’d think you’d need the text to line up exactly with the audio so when the voice makes an “A” sound the text it aligns with is “A” as well.
So while having the closed captions saves some of the work, there is probably much more needed to get everything lined up.
But I’m absolutely not an expert at all. In fact this is the first I’ve ever even thought about it!
Author here. Speech-to-text is more or less solved; it's easy to automatically get captions including precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio.
See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037
There are big libraries of old speeches.
Simply capture all current radio/tv transmissions and train on that (we've already established copyright doesn't apply to LLM training, right?)
q: What is 2+2?
A: The warranty for your car has expired...
It mostly uses the UN reports as a source of parallel translated texts, so the language is quite a bit stilted. But it's a good start.
I recall someone telling me once up to 90% of communication can be non-verbal, so when an LLM sticks to just text, it's only getting 10% of the data.
Obviously working directly with audio is vastly more complex than with text.
But it is very exciting to see how part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.
I even have to wonder if, at some point, we ultimately create a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on some kind of set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.
I can imagine such a model being arrived at statistically (determining the necessary number of parameters), and then almost becoming "hard-coded" as a standard since human anatomy doesn't change much there, beyond certain ranges.
I think it's called formant speech encoding, and it would be interesting if LLMs wind up massively advancing that field. Since I think historically it's had to do more with speech synthesis than audio compression.
Author here, thanks for the kind words! I think such a physics-based codec is unlikely to happen: in general, machine learning is always moving from handcrafted domain-specific assumptions to leaving as much as possible to the model. The more assumptions you bake in, the smaller the space of sounds you can model, so the quality is capped. Basically, modern ML is just about putting the right data into transformers.
That being said, having a more constrained model can also lead to some really cool stuff. The DDSP paper learns how to control a synthesizer to mimic instruments: https://arxiv.org/abs/2001.04643
You could probably do something similar for a speech model. The result would not sound as good but you could get away with far fewer parameters, because much of the modelling work is done by the assumptions you put in.
Compare also KokoroTTS, a tiny TTS that's so tiny because it uses a handcrafted system to turn text into phonemes, and then just synthesizes from those phonemes: https://huggingface.co/spaces/hexgrad/Kokoro-TTS
There’s a long history of attempts at artificial speech that take this approach, recreating mouth parts and vibrating air. They are all pretty silly, like this work, which fails to understand how writing isn’t just a derivative of speech.
Huh? How?
> like this work which fails to understand how writing isn’t just a derivative of speech.
The whole point of the article is that writing isn't just a derivative of speech. It's in the introduction.
In speech coding/synthesis this is called a "source-filter" model (decompose speech production into a sound generator in the vocal folds and a filter in the vocal tract, and parameterize them) and it's actually older than Tukey and Cooley's rediscovery of the FFT.
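A toy source-filter sketch (formant values invented, roughly /a/-like; a real vocoder would estimate the filter from the signal, e.g. via LPC):

    import numpy as np
    from scipy.signal import lfilter

    fs, f0 = 16_000, 120                                  # sample rate, pitch in Hz
    t = np.arange(int(0.5 * fs))
    source = (t % (fs // f0) == 0).astype(float)          # glottal-ish impulse train (the "source")

    signal = source
    for formant_hz, bw_hz in [(700, 100), (1200, 120)]:   # two resonances standing in for formants
        r = np.exp(-np.pi * bw_hz / fs)                   # pole radius from bandwidth
        theta = 2 * np.pi * formant_hz / fs
        a = [1, -2 * r * np.cos(theta), r ** 2]           # 2nd-order all-pole resonator (the "filter")
        signal = lfilter([1.0], a, signal)

    signal /= np.abs(signal).max()                        # crude vowel-ish buzz
    # e.g. scipy.io.wavfile.write("vowel.wav", fs, (signal * 32767).astype("int16")) to listen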
the OP is quite an interesting team to watch regarding open-weights* voice-related efforts. this is a nice read to understand the core of their approach.
quite unfortunate, however, is their approach to accessibility. unmute [1], which uses the approach discussed in this post, runs quite well with the claimed feature of adapting to any voice provided you have a 10 second recording. this is not made available to the public at all, despite an issue raised since july. [2]
given the pace of the industry, it is a shame that we need to look elsewhere instead of using an otherwise well-designed tool.
[1] https://news.ycombinator.com/item?id=44109610 [2] https://github.com/kyutai-labs/unmute/issues/99
Out of curiosity, would it be possible to attach pitch, emotion, tone info as text-based metadata to each word during ASR, so that the asr output retains these metadata?
Thanks for sharing this well written post that I will share with my team; we just recently started using audio/voice in our AI suite and the details herein will be helpful and informative.
I've been messing around with Higgs Audio that actually uses the delay pattern. It has to apply it and then unapply it after the generation. I noticed it's actually really hard to chunk and stream audio correctly when you need to apply and reapply these patterns essentially to the "entire" output.
I wouldn't mind so much if they cheat on the way back but listen in earnest. There are use cases like teaching language where having the AI understand the sounds carefully matters a ton.
I train for 1M steps (batch size 64, block size 2048), which is enough for the model to more-or-less converge.
It's also a tiny model for LLM standards, with 150M parameters. The goal wasn't really to reach state of the art but to show how the performance of a single language model architecture can be vastly different when you just change the tokenizer.
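For scale, that training run works out to (using the numbers above):

    steps, batch_size, block_size = 1_000_000, 64, 2048
    tokens_seen = steps * batch_size * block_size
    print(f"{tokens_seen:.2e}")   # ~1.31e+11, i.e. roughly 131B tokens through a 150M-parameter model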
> Many LLMs have voice interfaces, but they usually work by transcribing your speech, generating the answer in text, and using text-to-speech to read the response out loud. That’s perfectly fine in many cases (...), but it’s a wrapper, not real speech understanding.
But I can say the same about tokenization. LLMs first convert groups of characters to tokens, then use that to generate tokens, and then convert the tokens back to characters. That's not real understanding! If LLMs are so smart, we should be able to skip the tokenization step.
There's a great blog post from Sander Dieleman about exactly this - why do we need a two step pipeline, in particular for images and audio?
https://sander.ai/2025/04/15/latents.html
For text, there are a few papers that train the tokenization and language model end-to-end, see: https://arxiv.org/abs/2305.07185
Nothing is real understanding because we have no benchmark for understanding because we don't mechanistically know what understanding is. The best we have is people "vibe knowing" a benchmark that they made up on the spot.
Another interesting thing here is that the model presumably has some understanding of the passage of time. That's one thing that can be odd about chat models, in that they will respond the same no matter whether you respond a second later or a month later.
I think even for text models, "streams" could be useful. Perhaps if the LLM sees too long of a pause after explaining something and asking a question, they could interject a "do you need help?" or something. Pure chat GPTs don't have that ability.
I can't wait for LLMs to actually understand how they and you are speaking. It's going to be so cool when an AI can correct your second language pronunciation or laugh at you for making a silly sound. The usecases and value will explode when that happens 100%
Indeed, the title undersells it and I'm glad I didn't skip over it, the article is basically an information-dense but approachable summary of audio generation.
Man, one of the best uses of all those AI algorithms based around finding similarities between stuff, would be to give you actually relevant recommendations for music.
All the streaming services are shit at it. They can't do much beyond shallow similarities or hardcoded recommendations that are probably just based on manually-entered keywords like the genre etc.
Has that already been done?
Or is it yet another of those what-could-have-been utopian things that got crippled before it was born because of corporate overcontrolling/overcautiousness (not being able to train on copyrighted music)
Maybe some open-source project could do it?
(I don't even feel confident in asking AI if a music-recc AI exists because ChatGPT 5 didn't know ChatGPT 5 was out, and Claude still thinks iOS 26 isn't out yet..sigh)
Y’all need to learn about the history and development of spoken language and writing. Writing isn’t just a copy or derivation of speech. LLMs work because of the conceptual characteristics of writing (consider the distinctions between ideographic, logographic, alphabetical…). What a sloppy mess!
Read some Wittgenstein and Goodman, but especially Derrida who calls this logocentrism.