<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Tyler Zhu</title>
    <link>https://tylerzhu.com/blog/</link>
    <description>Recent content on Tyler Zhu</description>
    <generator>Hugo -- 0.154.3</generator>
    <language>en-us</language>
    <lastBuildDate>Fri, 15 May 2026 11:24:42 -0400</lastBuildDate>
    <atom:link href="https://tylerzhu.com/blog/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>DeepSeek V4: Much Ado About Context</title>
      <link>https://tylerzhu.com/blog/2026/05/deepseek-v4-much-ado-about-context/</link>
      <pubDate>Fri, 15 May 2026 11:24:42 -0400</pubDate>
      <guid>https://tylerzhu.com/blog/2026/05/deepseek-v4-much-ado-about-context/</guid>
      <description>&lt;p&gt;Thanks to Will Hwang for his helpful thoughts and references on recent long context architectures.&lt;/p&gt;
&lt;p&gt;A few weeks ago, DeepSeek released their V4 model, headlined by the tag: &amp;ldquo;Towards Highly Efficient Million-Token Context Intelligence&amp;rdquo;.
Other models like Gemini and Claude have claimed million-token context before, but in practice they were effective only at much shorter contexts on real tasks.
DeepSeek V4 requires only 27% of the single-token inference FLOPs and 10% of the KV cache that DeepSeek-V3.2 uses, even though V3.2 is already optimized for long-context tasks.
These improvements have astonished the community (see below).&lt;/p&gt;</description>
    </item>
    <item>
      <title>LeThoughts on JEPA: The Return of SSL</title>
      <link>https://tylerzhu.com/blog/2026/04/lethoughts-on-jepa-the-return-of-ssl/</link>
      <pubDate>Thu, 09 Apr 2026 22:06:29 -0400</pubDate>
      <guid>https://tylerzhu.com/blog/2026/04/lethoughts-on-jepa-the-return-of-ssl/</guid>
      <description>&lt;p&gt;I used to be very up to date on self-supervised learning, but fell out of it as the field itself slowly died down in favor of VLMs and whatnot after SigLIP/DINO/V-JEPA became the dominant paradigms.
This means I haven&amp;rsquo;t read any SSL papers seriously since 2023.&lt;/p&gt;
&lt;p&gt;However, that doesn&amp;rsquo;t mean I&amp;rsquo;ve been living under a rock.
I&amp;rsquo;m still well aware of Yann LeCun&amp;rsquo;s anti-pixel prediction tirade, and in that time, nothing came out that convinced me we could move away from pixel-level supervision.
It&amp;rsquo;s simply such a strong prior to enact for self-supervision: you get multi-view consistency and true spatial grounding at the slight cost of having to model high-frequency pixel details.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Have we scaled vision like language yet?</title>
      <link>https://tylerzhu.com/blog/2026/02/have-we-scaled-vision-like-language-yet/</link>
      <pubDate>Sat, 14 Feb 2026 23:32:00 +0800</pubDate>
      <guid>https://tylerzhu.com/blog/2026/02/have-we-scaled-vision-like-language-yet/</guid>
      <description>&lt;p&gt;A few years ago at our &lt;a href=&#34;https://sites.google.com/view/t4v-cvpr23?pli=1&#34;&gt;CVPR 2023 Transformers for Vision workshop&lt;/a&gt;, Lucas Beyer said something that took me by surprise. I&amp;rsquo;ve been trying to piece it together ever since.&lt;/p&gt;
&lt;p&gt;The interaction went something like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; &amp;ldquo;Why aren&amp;rsquo;t we scaling vision models as large as we do LLMs?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lucas:&lt;/strong&gt; &amp;ldquo;You know, actually, the largest vision models are on par with the largest language models if you look at [X].&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can never quite remember what X was: FLOPs, parameters, or token budget. Obviously now it&amp;rsquo;s not parameters. The largest recorded ViTs still top out in the 22B regime, with the most consistent scaling happening in the 1B–7B range, as in DINOv3 &lt;a href=&#34;#ref-10&#34;&gt;[10]&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>FSDP for Dummies</title>
      <link>https://tylerzhu.com/blog/2026/02/fsdp-for-dummies/</link>
      <pubDate>Mon, 02 Feb 2026 00:00:00 -0800</pubDate>
      <guid>https://tylerzhu.com/blog/2026/02/fsdp-for-dummies/</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve always struggled to understand the intuitions behind Fully Sharded Data Parallel beyond the high-level idea of &amp;ldquo;shard everything.&amp;rdquo; Without a systems background, the fundamental primitives like &amp;ldquo;all-reduce&amp;rdquo; and &amp;ldquo;reduce-scatter&amp;rdquo; aren&amp;rsquo;t in my vocabulary. But FSDP is not conceptually complicated, especially once you state what the goals are (the rest is nearly necessitated by the engineering).&lt;/p&gt;
&lt;p&gt;This post is an attempt to deconstruct the algorithm from first principles as a non-systems person. I will bring up the primitives in their specific context, which I think helps reinforce the intuition much better. Most ML researchers have a stronger understanding of the models, params, and optimizer processes than of the systems jargon anyway.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Remarks on Spatial Localization in VLMs</title>
      <link>https://tylerzhu.com/blog/2023/12/remarks-on-spatial-localization-in-vlms/</link>
      <pubDate>Sun, 17 Dec 2023 00:00:00 -0800</pubDate>
      <guid>https://tylerzhu.com/blog/2023/12/remarks-on-spatial-localization-in-vlms/</guid>
      <description>&lt;h2 id=&#34;prelude&#34;&gt;Prelude&lt;/h2&gt;
&lt;p&gt;This all started when I came across this tweet from Timothee Darcet (co-first author on DINOv2):&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Tweet from Timothee Darcet&#34; loading=&#34;lazy&#34; src=&#34;https://tylerzhu.com/blog/images/spatial-localization/Screenshot_2023-12-16_at_12.51.07_AM.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://x.com/TimDarcet/status/1726320282028360131?s=20&#34;&gt;https://x.com/TimDarcet/status/1726320282028360131?s=20&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This was in response to people overreacting to the idea that the &lt;em&gt;final&lt;/em&gt; problem in computer vision was for AI to tell the difference between a blueberry muffin and a chihuahua, which, to be fair, is a rather funny joke. It turns out that AI models can do this quite well though, and have been able to ever since CLIP came out! So what&amp;rsquo;s the big deal?&lt;/p&gt;</description>
    </item>
    <item>
      <title>My PhD Interview Experience</title>
      <link>https://tylerzhu.com/blog/2023/01/my-phd-interview-experience/</link>
      <pubDate>Sun, 22 Jan 2023 12:08:43 -0800</pubDate>
      <guid>https://tylerzhu.com/blog/2023/01/my-phd-interview-experience/</guid>
      <description>&lt;p&gt;Over the last few months, I’ve been fully immersed in the CS PhD application process. I’ll make a later blog post detailing the overall process, but I thought I’d write up a quick post about my recent (and hopefully future!) experiences with the interview portion of the process.&lt;/p&gt;</description>
    </item>
    <item>
      <title>New Year Resolutions for 2023</title>
      <link>https://tylerzhu.com/blog/2023/01/new-year-resolutions-for-2023/</link>
      <pubDate>Sun, 01 Jan 2023 12:29:38 -0800</pubDate>
      <guid>https://tylerzhu.com/blog/2023/01/new-year-resolutions-for-2023/</guid>
      <description>&lt;p&gt;Seeing as it&amp;rsquo;s the new year, I took some time to think about my 2023 resolutions, like most people do.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
