Audio Quality vs. Amount of Compression

Mel-spectrogram loss (top) and Fréchet Audio Distance (bottom) for VampNet samples at varying bitrates. A noisy baseline is provided by replacing tokens in the input sequence with random tokens, according to a given noise ratio.
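The noisy baseline described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `add_token_noise`, the vocabulary size of 1024, the array sizes, and the choice to corrupt each position independently with probability equal to the noise ratio are all assumptions made for the example.

```python
import numpy as np

def add_token_noise(tokens, noise_ratio, vocab_size, rng):
    """Replace a random subset of tokens with uniformly sampled token ids.

    Assumption: each position is replaced independently with probability
    `noise_ratio`; the original work may select positions differently.
    """
    replace = rng.random(tokens.shape) < noise_ratio
    random_tokens = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(replace, random_tokens, tokens)

rng = np.random.default_rng(0)
vocab_size = 1024  # hypothetical codebook size, for illustration only
# Hypothetical token grid: 4 codebook levels by 32 timesteps.
tokens = rng.integers(0, vocab_size, size=(4, 32))

clean = add_token_noise(tokens, 0.0, vocab_size, rng)  # no positions replaced
noisy = add_token_noise(tokens, 0.5, vocab_size, rng)  # each position replaced with probability 0.5
```

A noise ratio of 0 returns the input unchanged, while a ratio of 1 replaces every position with a random draw (which may occasionally coincide with the original token).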

Here, we listen to examples along the compression-generation continuum, obtained by prompting VampNet to decompress music at progressively lower bitrates. Nk denotes the number of codebook levels kept after compression, and P denotes the timestep downsampling factor used during compression.
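One way to picture how Nk and P select the tokens that survive compression is a boolean mask over a (codebook level, timestep) grid. The sketch below is illustrative only: the function name `keep_mask` and the grid sizes are hypothetical, and it assumes that kept timesteps start at index 0 and recur every P steps.

```python
import numpy as np

def keep_mask(n_codebooks, n_timesteps, n_keep, period):
    """Return a boolean mask that is True where an input token is kept.

    n_keep plays the role of Nk (codebook levels kept) and period the
    role of P (keep 1 out of every P timesteps). Tokens where the mask
    is False would be left for the model to generate.
    """
    mask = np.zeros((n_codebooks, n_timesteps), dtype=bool)
    mask[:n_keep, ::period] = True
    return mask

# Hypothetical grid: 4 coarse codebook levels, 32 timesteps.
mask_p8 = keep_mask(n_codebooks=4, n_timesteps=32, n_keep=4, period=8)
mask_n1 = keep_mask(n_codebooks=4, n_timesteps=32, n_keep=1, period=1)

kept_fraction_p8 = mask_p8.mean()  # share of coarse tokens retained
kept_fraction_n1 = mask_n1.mean()
```

With Nk=4 and P=8 the mask keeps all four levels at timesteps 0, 8, 16, and 24; with Nk=1 and P=1 it keeps only the first level at every timestep.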

Input

in.wav.mp3

Coarse-to-Fine (800 bps)

Output after keeping all of the timesteps for the coarse codebook levels of the input token sequence, and using the coarse-to-fine model to generate the remaining codebook levels of the token sequence.

c2f.wav

Nk=1, P=1 (600 bps)

Output after keeping the first coarse codebook level of the input tokens, using the VampNet (coarse) model to infer the remaining 3 coarse codebook levels, and subsequently using coarse-to-fine to generate the fine codebook levels.

n1p1.wav.mp3

Nk=4, P=8 (100 bps)

Output after keeping all coarse codebook levels of the input tokens, but only keeping the tokens for 1 out of every 8 timesteps in the sequence.

p8.wav

Nk=4, P=16 (50 bps)

Output after keeping all coarse codebook levels of the input tokens, but only keeping the tokens for 1 out of every 16 timesteps in the sequence.

50bps.wav

Conditions compared: Input; Coarse-to-Fine (800 bps); Nk=1, P=1 (600 bps); Nk=4, P=8 (100 bps); Nk=4, P=16 (50 bps); Nk=4, P=32 (25 bps).