Yuta Oshima

I’m a first-year PhD student at The University of Tokyo, mentored by Professor Yutaka Matsuo.

My ultimate goal is to build real-time, interactive world models that seamlessly translate human imagination into reality—empowering anyone to craft and mold their envisioned worlds with intuitive and precise control.

To move toward this vision, my research focuses on video diffusion models and world models. This includes recent work on alignment and subject-driven generation to improve fine-grained controllability, as well as work on memory mechanisms, a fundamental challenge for ensuring long-horizon consistency in generative environments.

selected publications

  1. Preprint
    MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
    Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta
    2025
  2. NeurIPS 2025
    Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
    Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta
    In Advances in Neural Information Processing Systems (NeurIPS), 2025
  3. NeurIPS 2024
    ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate
    Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo
    In Advances in Neural Information Processing Systems (NeurIPS), 2024
  4. ICLR 2024 Workshop
    SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
    Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, and Yutaka Matsuo
    In 5th Workshop on Practical ML for Limited/Low Resource Settings (ICLR Workshop), 2024