Tyler Zhu

I am a second-year PhD student in CS at Princeton advised by Olga Russakovsky. I received my B.S. in EECS from Berkeley in 2022 and my M.S. in 2023, advised by Jitendra Malik. I am a recipient of the Princeton President's Fellowship. I am broadly interested in creating computer vision systems which can learn from and interpret visual data as humans do.

While at Berkeley, I had the great fortune to collaborate with and be mentored by a number of wonderful people, including Karttikeya Mangalam, Alvin Wan, and Dan Hendrycks. I was also heavily involved in teaching and outreach, serving on CS 70 course staff multiple times and previously leading Machine Learning @ Berkeley. You can find out more from my main website here.

If you are interested in collaborating, or just want to chat about research or advice, feel free to reach out to me at [first][last][at]cs[dot]princeton[dot]edu.

CV  /  Google Scholar  /  Twitter  /  GitHub

News
  • [Jan 2025] Our preprint on a multi-encoder representation of videos, MERV, is now available on arXiv, and the models are available on Hugging Face!
  • [Mar 2024] Our preprint on large image modeling, xT, is now available on arXiv (update, May 2024: it was accepted to ICML)! I will also be co-organizing the Transformers for Vision Workshop @ CVPR 2024 after the great experience I had attending last year.
  • [May 2023] Our paper on fast reversible transformers, PaReprop, was accepted as a spotlight at the Transformers for Vision Workshop @ CVPR 2023!
  • [Apr 2023] I am starting my PhD at Princeton in Fall 2023, advised by Professor Olga Russakovsky!
Research

I'm broadly interested in using computer vision to create visual systems which can effectively reason about and interact with the real world.

Currently, I am most interested in the intersection of video and language. I believe that understanding how to promote videos as a fundamental unit of vision can be key to unlocking the next generation of visual systems. Real-world interaction is also governed by language, so I am interested in how to bridge the gap between the two modalities. This includes understanding how to learn from videos efficiently, how to reason about them, and how to represent them.

I am also interested in large models, particularly vision models, and how to make them more efficient and broadly useful. Much of my previous work has been on general purpose memory-efficient techniques to make this possible.

Unifying Specialized Visual Encoders for Video Language Models
Jihoon Chung*, Tyler Zhu*, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky
arXiv 2024
code / arXiv / tweet

We propose a framework for using many visual encoders, covering broad visual categories like action recognition and spatial understanding, as a unified visual encoder for video LLMs. This direction is exciting because it could allow our model to scale visual processing with the number of GPUs, running all encoders in parallel (sharding one expert per device) while retaining a runtime similar to that of a single expert.

xT: Nested Tokenization for Larger Context in Large Images
Ritwik Gupta*, Shufan Li*, Tyler Zhu*, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam
ICML 2024
code / arXiv / tweet

A simple yet effective framework for adapting vision models trained on small, 224x224 images to larger images with larger context by using an LLM-style encoder to integrate context over larger regions than otherwise possible. We also propose a set of effective benchmarks for reflecting such improvements on larger images.

PaReprop: Fast Parallelized Reversible Backpropagation
Tyler Zhu*, Karttikeya Mangalam*
Transformers for Vision Workshop @ CVPR 2023 (Spotlight Paper)
code / arXiv / tweet

We overcome the extra overhead of reversible transformers by parallelizing the backward pass using CUDA streams. This speeds up training for models in both vision and language, making them nearly as fast as the base models with incredible memory savings to boot.

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer
ICCV 2021
code / arXiv

Four new datasets measuring real-world distribution shifts, the best known of which is ImageNet-R(enditions), as well as a new state-of-the-art data augmentation method that outperforms models pretrained with 1000x more labeled data.

Thesis
Making Reversible Transformers Accurate, Efficient, and Fast
Tyler Zhu
Master's Thesis

In this work, we present an in-depth analysis of reversible transformers and demonstrate that they can be more accurate, more efficient, and faster than their vanilla counterparts. We introduce a new method of reversible backpropagation which is faster and scales better with memory than previous techniques, and also demonstrate new results which show that reversible transformers transfer better to downstream visual tasks.

Misc
Preference Learning for Text-to-Image Prompt Tuning with RL
Arnav Gudibande*, Tyler Zhu*
Fall 2022 CS 285 Deep RL Final Project

We propose a framework for automating prompt tuning to learn preferences in text-to-image synthesis using reinforcement learning with human feedback.

Service and Outreach
Guided Resource and Education Program: High School Workshop Initiative
Advisor (behind the scenes)
Machine Learning at Berkeley, Fall 2023

We piloted a free two-day workshop for local Bay Area high school students with little access to coding resources to teach them the basics of machine learning. Our goal was to be inclusive and representative of all backgrounds and experiences, and we were able to reach over 40 students evenly split between male and female participants.

Broadening Research Collaborations Workshop
Co-organizer
NeurIPS 2022

We organized a workshop at NeurIPS 2022 to bring together researchers from different backgrounds and experiences to discuss the challenges and opportunities in non-traditional collaborations beyond the standard academic and industry models.


Source code taken from Jon Barron's lovely website.