Tyler Zhu

DeepSeek V4: Much Ado About Context

Fri, 15 May 2026 11:24:42 -0400

Thanks to Will Hwang for his helpful thoughts and references on recent long context architectures.

A few weeks ago, DeepSeek released their V4 model, headlined by the tag: “Towards Highly Efficient Million-Token Context Intelligence”. Other models like Gemini and Claude have claimed million-token context before, but in practice they were only effective at much shorter contexts on real tasks. DeepSeek V4 requires 27% of the single-token inference FLOPs and 10% of the KV cache that DeepSeek-V3.2 uses, and V3.2 was itself already optimized for long-context tasks. These improvements have astonished the community (see below).

LeThoughts on JEPA: The Return of SSL

Thu, 09 Apr 2026 22:06:29 -0400

I used to be very up to date on self-supervised learning, but fell out of it as the field slowly died down in favor of VLMs and whatnot after SigLIP/DINO/V-JEPA became the dominant paradigms. This means I haven’t seriously read any SSL papers since 2023.

However, that doesn’t mean I’ve been living under a rock. I’m still well aware of Yann LeCun’s anti-pixel-prediction tirade, and in that time, nothing came out that convinced me we could move away from pixel-level supervision. It’s simply such a strong prior to impose for self-supervision: you get multi-view consistency and true spatial grounding at the slight cost of having to model high-frequency pixel details.

Have we scaled vision like language yet?

Sat, 14 Feb 2026 23:32:00 +0800

A few years ago at our CVPR 2023 Transformers for Vision workshop, Lucas Beyer said something that took me by surprise. I’ve been trying to piece it together ever since.

The interaction went something like this:

Audience: “Why aren’t we scaling vision models as large as we do LLMs?”

Lucas: “You know, actually, the largest vision models are on par with the largest language models if you look at [X].”

I can never quite remember what X was: FLOPs, parameters, or token budget. Obviously now it’s not parameters. The largest recorded ViTs still top out in the 22B regime, with the most consistently used scales being 1B–7B as in DINOv3 [10].

FSDP for Dummies

Mon, 02 Feb 2026 00:00:00 -0800

I’ve always struggled to understand the intuitions behind Fully Sharded Data Parallel beyond the high level idea of “shard everything.” Without a systems background, the fundamental primitives like “all-reduce” and “reduce-scatter” aren’t in my vocabulary. But FSDP conceptually is not complicated, especially once you state what the goals are (the rest is nearly necessitated by the engineering).

This post is an attempt to deconstruct the algorithm from first principles as a non-systems person. I will bring up the primitives in the specific context where they are needed, which I think helps reinforce the intuition much better. Most ML researchers have a stronger understanding of the models, params, and optimizer processes than the systems jargon anyways.

Remarks on Spatial Localization in VLMs

Sun, 17 Dec 2023 00:00:00 -0800

Prelude

This all started when I came across this tweet from Timothee Darcet (co-first author on DINOv2)

https://x.com/TimDarcet/status/1726320282028360131?s=20

This was in response to people overreacting to the idea that the final problem in computer vision was for AI to tell the difference between a blueberry muffin and a chihuahua, which, to be fair, is a rather funny joke. It turns out that AI models can do this quite well though, and have been able to ever since CLIP came out! So what’s the big deal?

My PhD Interview Experience

Sun, 22 Jan 2023 12:08:43 -0800

Over the last few months, I’ve been fully immersed in the CS PhD application process. I’ll make a later blog post detailing the overall process, but I thought I’d write up a quick post about my recent (and hopefully future!) experiences with the interview portion of the process.

New Year Resolutions for 2023

Sun, 01 Jan 2023 12:29:38 -0800

Seeing as it’s the new year, I took some time to think about my 2023 resolutions like most people do.