Multiscale Mel-spectrogram loss (top) and Fréchet Audio Distance (FAD, bottom) for VampNet samples taken with different types of prompts.

Here, we examine how different token prompting techniques affect the outputs generated by VampNet.

Input

Input used as prompt to the model.

input.wav.mp3

Periodic (P=16)

Here, a periodic prompt with period P=16 is used to condition the model. P=16 means that one in every 16 tokens in a sequence is left unmasked, so about 6% of the tokens (1/16 = 6.25%) are unmasked, while the remaining 94% are masked.
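The arithmetic behind a periodic prompt can be illustrated with a minimal sketch. This is not VampNet's actual implementation: the function names `periodic_mask` and `unmasked_fraction`, the sequence length of 512, and the choice to keep the token at position 0 of each period are all assumptions made for illustration. The sketch also treats the sequence as a flat list of time steps and ignores how masking is applied across codebooks.

```python
def periodic_mask(seq_len, period):
    """Hypothetical helper: True marks a masked position.

    Assumption: the kept (unmasked) token is the first one in each
    period, i.e. positions 0, period, 2 * period, ...
    """
    return [i % period != 0 for i in range(seq_len)]


def unmasked_fraction(mask):
    """Fraction of positions left unmasked (False entries)."""
    return mask.count(False) / len(mask)


# Illustrative sequence length; not taken from the source.
SEQ_LEN = 512

for period in (16, 32):
    mask = periodic_mask(SEQ_LEN, period)
    kept = mask.count(False)
    print(f"P={period}: {kept} of {SEQ_LEN} tokens kept "
          f"({unmasked_fraction(mask):.2%} unmasked)")
```

Under these assumptions, the unmasked fraction is simply 1/P, which is where the approximate percentages quoted for each period come from.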

Masked Prompt

p=16.wav.mp3

Output

p16.wav

Periodic (P=32)

Here, a periodic prompt with period P=32 is used to condition the model. P=32 means that one in every 32 tokens in a sequence is left unmasked, so about 3% of the tokens (1/32 ≈ 3.1%) are unmasked, while the remaining 97% are masked.

Masked Prompt

p=32.wav.mp3

Output

p32.wav