Drone view of waves crashing against the rugged cliffs along Big Sur's Garay Point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff's edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff's edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.

Borneo wildlife on the Kinabatangan River

A Chinese Lunar New Year celebration video with Chinese Dragon.

Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care.

Reflections in the window of a train traveling through the Tokyo suburbs.

A cartoon kangaroo disco dances.

Step-printing scene of a person running, cinematic film shot in 35mm.

A giant, towering cloud in the shape of a man looms over the earth. The cloud man shoots lightning bolts down to the earth.

3D animation of a small, round, fluffy creature with big, expressive eyes exploring a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bushy, striped tail. It hops along a sparkling stream, its eyes wide with wonder. The forest is alive with magical elements: flowers that glow and change colors, trees with leaves in shades of purple and silver, and small floating lights that resemble fireflies. The creature stops to interact playfully with a group of tiny, fairy-like beings dancing around a mushroom ring. The creature looks up in awe at a large, glowing tree that seems to be the heart of the forest.

Joint video-audio generation and audio-steered visual generation will come soon.

Stay tuned.

Abstract

Video and audio content creation is a core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of the technique from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross visual-audio and joint visual-audio generation. We observe that off-the-shelf video and audio generation models already have powerful generation ability. Thus, instead of training giant models from scratch, we propose to bridge existing strong models through a shared latent representation space. Specifically, we propose a multimodality latent aligner built on the pre-trained ImageBind model. The latent aligner shares its core idea with classifier guidance: it guides the diffusion denoising process at inference time. With a carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks.
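
The guidance idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: `toy_encoder` is a hypothetical fixed linear map standing in for a frozen shared-space encoder such as ImageBind, `toy_denoise` is a hypothetical placeholder for one step of a pretrained generator, and the loss (cosine distance), step size, and step count are all assumptions chosen for illustration. The sketch only shows the general pattern of nudging a latent at inference time so that its embedding moves toward the embedding of the conditioning signal, without retraining either model.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def toy_encoder(z):
    """Hypothetical frozen encoder into the shared space (stand-in, not ImageBind)."""
    weights = [[1.0, 0.5, 0.0],
               [0.0, 1.0, 0.5]]
    return [sum(w * x for w, x in zip(row, z)) for row in weights]

def toy_denoise(z, shrink=0.99):
    """Hypothetical placeholder for one denoising step of a pretrained model."""
    return [shrink * x for x in z]

def alignment_grad(z, cond_emb, eps=1e-5):
    """Central finite-difference gradient of the alignment loss w.r.t. the latent."""
    grad = []
    for i in range(len(z)):
        plus = list(z)
        plus[i] += eps
        minus = list(z)
        minus[i] -= eps
        d_plus = cosine_distance(toy_encoder(plus), cond_emb)
        d_minus = cosine_distance(toy_encoder(minus), cond_emb)
        grad.append((d_plus - d_minus) / (2 * eps))
    return grad

def guided_sampling(z, cond_emb, steps=40, lr=0.2):
    """Alternate a denoising step with a gradient step on the alignment loss."""
    for _ in range(steps):
        z = toy_denoise(z)                        # generator update (frozen model)
        g = alignment_grad(z, cond_emb)           # guidance signal from the shared space
        z = [x - lr * gi for x, gi in zip(z, g)]  # pull the latent toward the condition
    return z

z0 = [1.0, -1.0, 0.5]   # initial latent (arbitrary toy values)
cond = [0.2, 1.0]       # embedding of the conditioning signal (arbitrary toy values)
before = cosine_distance(toy_encoder(z0), cond)
z_final = guided_sampling(z0, cond)
after = cosine_distance(toy_encoder(z_final), cond)
print(f"alignment loss before: {before:.3f}, after: {after:.5f}")
```

In a real setting the gradient would come from automatic differentiation through the encoder rather than finite differences, and the latent would belong to an actual video or audio diffusion model; the finite-difference version is used here only to keep the sketch dependency-free.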

Research Paper

Seeing and Hearing

Open-domain Visual-Audio Generation with Diffusion Latent Aligners

¹The Hong Kong University of Science and Technology ²ARC Lab, Tencent PCG

*Equal contribution


Overview

We propose a flexible framework for joint video-audio generation, visual-steered audio generation, and audio-steered visual (image/video) generation tasks.

Video-to-audio generation

Video credit to OpenAI Sora.

Supplement Video

Capabilities of our framework and comparisons with baselines.
