Idan Schwartz

Assistant Professor

Bar-Ilan University

Research & Bio

I'm an Assistant Professor at Bar-Ilan University, leading the Multimodal Lab.

My research focuses on multimodal problems, primarily generative ones, with challenges including cross-modal alignment and efficient inference-time methods.

I also have a strong interest in connecting ideas from the cognitive science of decision making to deep learning concepts, such as model perceptiveness, representation as comprehension, attention, and programming as System 2 problem solving.

I completed my postdoc with Prof. Lior Wolf at Tel Aviv University. Before that, I earned my PhD in Computer Science from the Technion, under the supervision of Prof. Tamir Hazan and Prof. Alexander G. Schwing (UIUC). My thesis focused on Cognitive Models in Deep Learning.

My industry research experience includes work on eBay's catalog (vision and language), Microsoft's Assistant (transcript-based meeting insights), and Spot's cloud workload optimization platform (time-series prediction). I currently serve as Chief Scientist at Aigency.ai, where I'm helping revolutionize the internet in the age of intelligent agents.

Interests

Artificial Intelligence & Cognition
Attention Models
Multimodal Learning
Computer Vision
Natural Language Processing

Education

Postdoc in Computer Science, 2023
Tel Aviv University

PhD in Computer Science, 2022
Technion

BSc in Computer Science, 2015
Technion

The Multimodal Lab

PhD Students

Ben Fishman (joint with Gal Chechik)

MSc Students

Gilad Carmel
Mark Vexler
Uriel Dolev (joint with Yoav Goldberg)
Amit Ronen
Aviv Weidenfeld
Omri Keren
Binyamin Ramati
Yona Orunov
Yuval Cohen

Alumni

Shira Schiber (joint with Ofir Lindenbaum) — TempoControl, CVPR 2026
Yair Shpitzer (joint with Gal Chechik) — SISO, CVPR 2026 Workshop

News

Apr 2026 LaMI: Augmenting Large Language Models via Late Multi-Image Fusion accepted to ACL 2026 (Main Conference).
Apr 2026 Single Image Iterative Subject-driven Generation and Editing accepted to the P13N Workshop, CVPR 2026.
Apr 2026 TempoControl: Temporal Attention Guidance for Text-to-Video Models accepted to CVPR 2026.
Mar 2026 Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models presented at WACV 2026.
2024 Started as Assistant Professor at Bar-Ilan University and founded the Multimodal Lab.

Research Highlights

CVPR 2026

TempoControl: Temporal Attention Guidance for Text-to-Video Models

S. Schiber, O. Lindenbaum, I. Schwartz

Steers cross-attention so concepts appear at the right moment in a generated video — without retraining.

WACV 2026

Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models

O. Zafar, Y. Cohen, L. Wolf, I. Schwartz

Optimizes a reusable counting token using detector feedback so generated images contain the requested number of objects.

ACL 2026

LaMI: Augmenting Large Language Models via Late Multi-Image Fusion

G. Yariv, I. Schwartz, Y. Adi, S. Benaim

Generates multiple images from the prompt and fuses them with a frozen LLM at the last layer — visual commonsense without retraining the model.

AAAI 2024

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

G. Yariv, I. Gat, S. Benaim, L. Wolf, I. Schwartz, Y. Adi

Maps audio into the input space of a frozen text-to-video model — generates videos that stay temporally aligned with the sound.

ICCV 2023

Discriminative Class Tokens for Text-to-Image Diffusion Models

I. Schwartz, V. Snæbjarnarson, S. Benaim, H. Chefer, R. Cotterell, L. Wolf, S. Belongie

Learns a single token from a classifier’s gradient to disambiguate fine-grained classes — sharper, more accurate generations.

CVPR 2022

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Y. Tewel, Y. Shalev, I. Schwartz, L. Wolf

Combines a frozen LM with CLIP at inference time to caption images — even doing visual arithmetic like image – word + image.

NeurIPS 2022

Optimizing Relevance Maps of Vision Transformers Improves Robustness

H. Chefer, I. Schwartz, L. Wolf

Fine-tunes ViTs by directly shaping their relevance maps to focus on the foreground, yielding large gains in distribution-shift robustness.

CVPR 2019

Factor Graph Attention

I. Schwartz, S. Yu, T. Hazan, A. G. Schwing

A general attention mechanism that fuses any number of utilities (image, history, question, …) for visual dialog.

Publications

Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim (2026). LaMI: Augmenting Large Language Models via Late Multi-Image Fusion. ACL'26.

PDF OpenReview Project Page

Shira Schiber, Ofir Lindenbaum, Idan Schwartz (2026). TempoControl: Temporal Attention Guidance for Text-to-Video Models. CVPR'26.

PDF arXiv Project Page

Oz Zafar, Yuval Cohen, Lior Wolf, Idan Schwartz (2026). Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models. WACV'26.

PDF arXiv Project Page

Yair Shpitzer, Gal Chechik, Idan Schwartz (2026). Single Image Iterative Subject-driven Generation and Editing. P13N Workshop, CVPR'26.

PDF Project Page

G. Yariv, I. Gat, S. Benaim, L. Wolf, I. Schwartz, Y. Adi (2023). Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation. AAAI'24.

PDF Code

I. Schwartz, V. Snæbjarnarson, S. Benaim, H. Chefer, R. Cotterell, L. Wolf, S. Belongie (2023). Discriminative Class Tokens for Text-to-Image Diffusion Models. ICCV'23.

PDF arXiv Code Project Page

Y. Tewel, Y. Shalev, R. Nadler, I. Schwartz, L. Wolf (2023). Zero-shot video captioning with evolving pseudo-tokens. BMVC'23.

PDF Code

G. Yariv, I. Gat, S. Benaim, L. Wolf, I. Schwartz (2023). AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation. INTERSPEECH'23.

PDF Code

H. Chefer, I. Schwartz, L. Wolf (2022). Optimizing Relevance Maps of Vision Transformers Improves Robustness. NeurIPS'22.

PDF Code

Y. Tewel, Y. Shalev, I. Schwartz, L. Wolf (2022). ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic. CVPR'22.

PDF Code

I. Gat, G. Lorberbom, I. Schwartz, T. Hazan (2022). Latent Space Explanation by Intervention. AAAI'22.

PDF arXiv

A. Ali, I. Schwartz, T. Hazan, L. Wolf (2022). Video and Text Matching with Conditioned Embeddings. WACV'22.

PDF Code

T. Braude, I. Schwartz, A. G. Schwing, A. Shamir (2022). Ordered attention for coherent visual storytelling. ACM-MM'22.

PDF

O. Hupert, I. Schwartz, L. Wolf (2022). Describing Sets of Images with Textual-PCA. Findings of EMNLP'22.

PDF arXiv

I. Gat, I. Schwartz, A. G. Schwing (2021). Perceptual Score: Measuring Perceptiveness of Multi-Modal Classifiers. NeurIPS'21.

PDF Code

I. Schwartz (2021). Ensemble of MRR and NDCG models for Visual Dialog. NAACL'21.

PDF Code

I. Gat, I. Schwartz, A. G. Schwing (2020). Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies. NeurIPS'20.

PDF Code

I. Schwartz, A. G. Schwing, T. Hazan (2019). A Simple Baseline for Audio-Visual Scene-Aware Dialog. CVPR'19.

PDF Code

I. Schwartz, S. Yu, T. Hazan, A. G. Schwing (2019). Factor Graph Attention. CVPR'19.

PDF arXiv Code

I. Schwartz, A. G. Schwing, T. Hazan (2017). High-Order Attention Models for Visual Question Answering. NIPS'17.

PDF Code

Patents

Cloud Instance Type Recommendations. I. Schwartz, G. Yariv, T. Ohayon · 2025 · US App. 18/924,802
Resource Distribution Engine(s) for Allocating and Securing Reclaimable Resources Within a Cloud Environment. I. Schwartz, T. Ohayon, G. Yariv, O. Gurfinkel, M. Goldberg, R. Vladimirsky, et al. · 2025 · US App. 18/634,377
Interruption Predictions for Cloud Compute Instances. I. Schwartz, O. Muchnik, J. Cohen, K. McGrath, A. Shachar · Granted 2024 · US 11,915,053
Identifying Anomalous Activities in a Cloud Computing Environment. Y. Shen, A. Benameur, A. X. Ough, I. Schwartz · 2024 · US App. 18/344,664
Spare Resource Availability Prediction with Limited Historical Data. T. Ohayon, I. Schwartz · 2024 · US App. 18/046,970
Search System for Providing Web Crawling Query Prioritization Based on Classification Operation Performance. I. Guy, I. Schwartz, K. Radinsky · Granted 2023 · US 11,636,164

Talks

2025

Contact

idanschwartz at gmail dot com
Computer Science Department, Room 213
Building 503
Bar-Ilan University
Ramat Gan, Israel


Talks

Discriminative models can make generative models better (in Hebrew)

Multimodal Attention, Perception, Comprehension

Attention Models for Vision and Language
