We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, …
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model. The matching score is used to steer the language model toward generating a sentence that has a high …
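The steering mechanism this abstract names can be illustrated compactly. Below is a minimal sketch, assuming the Hugging Face `transformers` checkpoints for GPT-2 and CLIP: the frozen GPT-2 proposes fluent next tokens, and the CLIP image-text matching score greedily re-ranks them. The greedy re-ranking, the prompt, and the hyperparameters are illustrative assumptions, not the paper's actual procedure (which the truncated sentence does not spell out).

```python
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          GPT2LMHeadModel, GPT2Tokenizer)

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def caption(image: Image.Image, prompt: str = "Image of",
            steps: int = 10, k: int = 20) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        logits = gpt2(ids).logits[0, -1]          # next-token scores from frozen GPT-2
        top = logits.topk(k).indices              # k fluent candidate tokens
        texts = [tok.decode(torch.cat([ids[0], t.view(1)])) for t in top]
        inputs = proc(text=texts, images=image,
                      return_tensors="pt", padding=True)
        sim = clip(**inputs).logits_per_image[0]  # CLIP match score per candidate
        best = top[sim.argmax()].view(1, 1)       # greedily keep the best-matching token
        ids = torch.cat([ids, best], dim=1)
    return tok.decode(ids[0])
```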
In recent years, image generation has shown a great leap in performance, with diffusion models playing a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This raises the question: how can …
Recent advances in text-to-image diffusion models have enabled the generation of diverse and high-quality images. However, generated images often fall short of depicting subtle details and are susceptible to errors due to ambiguity in the input text. …
We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set. Our procedure is analogous to Principal Component Analysis, in which the role of projection vectors is replaced with …
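The PCA analogy can be made concrete with a toy sketch. Below, images are embedded with CLIP (an assumption), principal directions of the centered embedding set are computed by SVD, and each direction is "named" by the best-aligned candidate phrase; the phrase list and the alignment rule are illustrative stand-ins for whatever replaces the projection vectors in the paper's procedure.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def textual_pca(images, candidate_phrases, n_components=3):
    img = proc(images=images, return_tensors="pt")
    X = clip.get_image_features(**img)
    X = X - X.mean(dim=0)                        # center the image-embedding set
    _, _, Vt = torch.linalg.svd(X, full_matrices=False)
    txt = proc(text=candidate_phrases, return_tensors="pt", padding=True)
    T = clip.get_text_features(**txt)
    T = torch.nn.functional.normalize(T, dim=-1)
    # each principal direction is described by its best-aligned phrase
    # (sign of a principal direction is arbitrary, hence the abs)
    scores = (T @ Vt[:n_components].t()).abs()   # (num_phrases, n_components)
    return [candidate_phrases[int(i)] for i in scores.argmax(dim=0)]
```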
It has been observed that visual classification models often rely mostly on spurious cues such as the image background, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy …
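The abstract cuts off before specifying how relevancy is computed; as a hedged stand-in, the sketch below uses a plain input-gradient saliency map, one common way to obtain a per-pixel relevancy signal for a classifier, so that, for example, relevancy mass falling on the background can be monitored.

```python
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor,
                 target_class: int) -> torch.Tensor:
    """Per-pixel relevancy via input gradients; `image` is a (C, H, W) tensor."""
    image = image.clone().requires_grad_(True)
    score = model(image.unsqueeze(0))[0, target_class]  # logit of the class of interest
    score.backward()                                    # d(logit) / d(pixel)
    return image.grad.abs().max(dim=0).values           # collapse channels -> (H, W)
```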
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of …
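The training objective the first sentence refers to is the symmetric contrastive (InfoNCE) loss over a batch of image-text pairs. A minimal sketch, assuming `img_emb` and `txt_emb` are L2-normalized batch embeddings from any image/text encoder pair:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (image, text) pairs; inputs are (B, d)."""
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) pairwise similarities
    targets = torch.arange(img_emb.size(0),
                           device=img_emb.device)           # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```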
We present a method for matching a text sentence from a given corpus to a given video clip and vice versa. Traditionally, video and text matching is done by learning a shared embedding space, where the encoding of one modality is independent of the …
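The "traditional" shared-embedding setup referred to here reduces matching to cosine similarity between independently computed encodings. A minimal sketch, with placeholder embedding tensors standing in for the outputs of separate video and text encoders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def match(video_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Retrieve the best sentence per clip; each modality is encoded independently."""
    v = F.normalize(video_embs, dim=-1)   # (num_videos, d)
    t = F.normalize(text_embs, dim=-1)    # (num_texts, d)
    return (v @ t.t()).argmax(dim=1)      # index of the best-matching sentence per clip
```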
We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. While each story sentence should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past …
The success of deep neural nets heavily relies on their ability to encode complex relations between their input and their output. While this property enables them to fit the training data well, it also obscures the mechanism that drives prediction. This …