Facebook’s Universal Music Translation and Google’s Looking to Listen at a Cocktail Party

This week, we take a look at recent research from Facebook that translates music across instruments, styles and genres. Given a clip of music audio, the neural network can output that same clip rendered in another instrument, style or genre.

Full paper: https://arxiv.org/pdf/1805.07848.pdf

Video: https://youtu.be/vdxCqNWTpUs

Google researchers present a joint audio-visual model for isolating a single speech signal from a mixture of sounds, such as other speakers or background noise (the cocktail party effect). The input to their model is an audio clip containing a mixture of sounds, together with its corresponding video. The visual component of the input is used to “focus” the model on the audio of the desired speakers. They also introduce a new dataset called AVSpeech, which contains thousands of hours of video segments from the web. Their method outperforms state-of-the-art audio-only speech separation in cases of mixed speech. Their model is also speaker-independent, and it produces better results than recent audio-visual speech separation models that are speaker-dependent.

Full paper: https://arxiv.org/pdf/1804.03619.pdf

Video: https://youtu.be/rVQVAPiJWKU