Mel-spectrogram loss (top) and Fréchet Audio Distance (bottom) for VampNet samples at varying bitrates. A noisy baseline is provided by replacing tokens in the input sequence with random tokens, according to a given noise ratio r.
Here, we listen to some examples of the compression-generation continuum, observed by prompting VampNet to decompress music at progressively lower bitrates. N_k denotes the number of codebook levels kept after compression, and P denotes the timestep downsampling factor used during compression.
Output after keeping all of the timesteps for the coarse codebook levels of the input token sequence, and using the coarse-to-fine model to generate the remaining codebook levels of the token sequence.
Output after keeping the first coarse codebook level of the input tokens, using the VampNet (coarse) model to infer the remaining 3 coarse codebook levels, and subsequently using coarse-to-fine to generate the fine codebook levels.
Output after keeping all coarse codebook levels of the input tokens, but only keeping the tokens for 1 out of every 8 timesteps in the sequence.
Output after keeping all coarse codebook levels of the input tokens, but only keeping the tokens for 1 out of every 16 timesteps in the sequence.
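The compression settings in the examples above can be read as a mask over the coarse token grid: keep the first few codebook levels, and within those keep one timestep out of every P. The following is a minimal sketch of that idea, not the VampNet implementation. The function names, the `-1` mask value, the toy grid of 4 coarse levels by 32 timesteps, and the choice to keep timesteps 0, P, 2P, ... are all assumptions made for illustration.

```python
import numpy as np


def keep_mask(n_levels, n_timesteps, n_keep_levels, period):
    """Boolean mask of shape (n_levels, n_timesteps); True marks tokens kept as the prompt.

    Assumption: the kept timesteps are 0, period, 2 * period, ...
    """
    mask = np.zeros((n_levels, n_timesteps), dtype=bool)
    mask[:n_keep_levels, ::period] = True
    return mask


def compress(tokens, n_keep_levels, period, mask_token=-1):
    """Replace every token that is not kept with a hypothetical mask value."""
    mask = keep_mask(tokens.shape[0], tokens.shape[1], n_keep_levels, period)
    return np.where(mask, tokens, mask_token), mask


# Toy coarse token grid: 4 coarse codebook levels, 32 timesteps (made-up values).
tokens = np.arange(4 * 32).reshape(4, 32)

# All 4 coarse levels, 1 out of every 8 timesteps.
sparse_tokens, sparse_mask = compress(tokens, n_keep_levels=4, period=8)

# Only the first coarse level, every timestep.
first_level_tokens, first_level_mask = compress(tokens, n_keep_levels=1, period=1)

print(sparse_mask.mean(), first_level_mask.mean())
```

In this toy setting the first configuration keeps one eighth of the coarse tokens and the second keeps one quarter; the generative model is then responsible for filling in every masked position.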