Yuta Oshima

I’m a PhD student at The University of Tokyo, advised by Professor Yutaka Matsuo.

My research goal is to develop interactive vision generation systems that translate human imagination into reality, enabling anyone to create and shape visual worlds with intuitive, flexible, and precise control. Toward this goal, I currently focus on improving the controllability of vision foundation models, particularly diffusion models, through alignment and instruction following for fine-grained visual generation.

selected publications

  1. CVPR 2026 Main
    MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
    Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta
    In the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
  2. NeurIPS 2025
    Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
    Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta
    In Neural Information Processing Systems (NeurIPS), 2025
  3. NeurIPS 2024
    ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal Rate
    Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo
    In Neural Information Processing Systems (NeurIPS), 2024