[{"content":"Thanks to Will Hwang for his helpful thoughts and references on recent long context architectures.\nA few weeks ago, DeepSeek released their V4 model, headlined by the tag: \u0026ldquo;Towards Highly Efficient Million-Token Context Intelligence\u0026rdquo;. Other models like Gemini and Claude have claimed million-token context before, but in practice were only effective for much shorter contexts on any real tasks. DeepSeek V4 requires 27% of the single-token inference FLOPs and 10% of the KV cache that DeepSeek-V3.2 uses, even though V3.2 is already optimized for long-context tasks. These improvements have the community in astonishment (see below).\nKeep getting rate-limited by Claude, so I tried out DeepSeek V4 for the first time.\nAfter 10M+ tokens, holy crap the cost is ... 🤯 pic.twitter.com/u5KfOCxv0X\n\u0026mdash; Jia-Bin Huang (@jbhuang0604) May 2, 2026 I did a deep dive into the core architectural improvements which make such efficient processing of long contexts possible. There aren\u0026rsquo;t any detailed ablations in the paper, so we can only guess why they make these choices, but in my opinion, they are well motivated.\nSurveying the land Previous long-context architectures The gold standard for long-context architectures, as far as I understand it, is hybrid architectures: interleaving layers of local attention, such as sliding window attention (SWA), or linear attention, like Gated DeltaNet, with layers of full attention, or global attention, at different ratios. Recency bias is a useful property in language as recent tokens are usually more relevant to the current query than older tokens, so we\u0026rsquo;re okay with this inductive bias. Fun fact: SWA was first proposed in 2020 with BERT models [1Iz Beltagy, Matthew E. Peters, and Arman Cohan \u0026ldquo;Longformer: The Long-Document Transformer.\u0026rdquo; arXiv:2004.05150. 2020.]! Talk about old.\nFor example, Gemma-3 [2Team, G., Kamath, A., Ferret, J., et al. 
\u0026ldquo;Gemma 3 Technical Report.\u0026rdquo; 2025.] has a 5:1 local-to-global ratio with SWA, meaning that there are 5 blocks of local attention followed by 1 block of global attention. With a local window of only 1024 tokens, this results in about $20\\%$ of the KV cache that full attention in every layer would require. Kimi Linear [3Zhang, Y., Lin, Z., Yao, X., et al. \u0026ldquo;Kimi Linear: An Expressive, Efficient Attention Architecture.\u0026rdquo; 2025.] instead features a 3:1 ratio of Gated DeltaNet style linear attention with full attention.\nOne of the first mentions of Sliding Window Attention (from Longformer).\nThis achieves a balance between speed/memory and explicit long context access in the KV cache. Note that SWA primarily tackles the memory explosion of the KV cache, and not any form of attention sharpening. Hybrid models still have a quadratic scaling factor, but the coefficient is much smaller, which tends to be good enough! Attempts to completely replace the quadratic factor with something sub-quadratic have not been very successful (see below).\nOne other issue with long context is that people tend to get cute with their attention mechanisms, which leads to theoretically interesting propositions. However, these patterns need to be hardware efficient to be useful in practice, so forms of sparse attention or global+sliding window are often slower despite their theoretical benefits [1Iz Beltagy, Matthew E. Peters, and Arman Cohan \u0026ldquo;Longformer: The Long-Document Transformer.\u0026rdquo; arXiv:2004.05150. 2020.]. This is less of an issue in the modern day though, as people (AI) have gotten better at writing custom kernels.\nComparison of different long-context architectures (from the E2E-TTT paper).\nTo prove my point above, this is a comparison of different long-context architectures from the E2E-TTT paper [4Arnuv Tandon, Karan Dalal, Xinhao Li, et al. \u0026ldquo;End-to-End Test-Time Training for Long Context.\u0026rdquo; 2025.]. 
Ignoring TTT-E2E (which is still quite new), Hybrid SWA in the Gemma-3 ratio achieves the best loss at 128K context while also achieving a similar prefill latency to the other purported \u0026ldquo;long-context\u0026rdquo; methods. It\u0026rsquo;s a very strong baseline!\nA wishlist for million-token context Before we dive into DeepSeek V4\u0026rsquo;s specific architecture, let\u0026rsquo;s think about what the main inefficiencies with long context currently are.\nTake a concrete use case: I\u0026rsquo;ve pasted an entire codebase into context and I want to know \u0026ldquo;where does the menu get repositioned on resize?\u0026rdquo; The repo has dozens of files spanning UI, backend, auth, and tests, but only two or three actually touch menu layout. The rest is dead weight that the model still has to attend over, token by token.\nThree inefficiencies in long-context processing, illustrated on a single coding query. Thank you Claude.\nThis single example surfaces three distinct inefficiencies:\n1. Most context is irrelevant. My UI code lives in a different directory than my auth or database code, and each makes up a small fraction of the total context. Spending compute attending over api/auth.py when the query is about menu positioning is wasted work. Worse, it dilutes attention away from the tokens that actually matter. Wishlist item #1: select a sparse set of the most relevant context.\n2. Tokens are smaller than concepts. A real BPE tokenizer splits def update_menu_position(event, items): into roughly nine pieces — something like def, update, _menu, _position, (, event, ,, items, ):. The model sees update and _menu as separate units even though update_menu_position is the single concept that matches my query. _menu also shows up again in _menu_color, _menu_hidden, context_menu, so attention has to do extra work just to disambiguate the same subword across different uses. Wishlist item #2: merge tokens into higher-level concepts.\n3. 
Noise now can be critical later. While we want a sparse context for each query, we also need a high level view of our entire context for our future queries. If I follow up with \u0026ldquo;and how does login interact with the resize handler?\u0026rdquo;, api/auth.py suddenly matters and I need it back. Re-reading the full file from scratch every time the relevance landscape shifts is wasteful, so it would be much cheaper to keep a coarse representation of our entire context. Wishlist item #3: keep a coarse sketch of what you skipped.\nOverview Overview of the DeepSeek-V4 Architecture.\nThe goal of DeepSeek-V4 [5DeepSeek-AI \u0026ldquo;DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.\u0026rdquo; 2026.] is to achieve million-token context while maintaining the same level of performance as prior generations of models. In this regime, the costly portion of the architecture is the attention mechanism, which scales quadratically with the context length, so the focus is entirely on reducing the KV cache size.\nAs we can see in the diagram above, DeepSeek features a hybrid architecture, interleaving a local Compressed Sparse Attention (CSA) with a global Heavily Compressed Attention (HCA) in a 1:1 ratio. There are other improvements, but most are either present in previous iterations of DeepSeek or aren\u0026rsquo;t as critical for long context.\nCompressed Sparse Attention (CSA) with DeepSeek Sparse Attention (DSA).\nCSA tackles wishlist item #1: selecting a sparse set of the most relevant context. Each query token attends to only the most relevant $k$ compressed KV entries ($k = 128$). This is done with DeepSeek Sparse Attention (DSA) from DeepSeek-V3.2, using a Lightning Indexer to avoid the heavy $QK^T$ computation.\nThe \u0026ldquo;Compressed\u0026rdquo; part of CSA is done with a token-level compressor, which tackles wishlist item #2: merging tokens into higher-level concepts. 
It compresses every $m$ KV entries into $1$, so the model only attends over $\\frac 1m$ as many entries ($m = 4$).\nFinally, we can reuse the token-level compressor to perform Heavily Compressed Attention (HCA), which tackles wishlist item #3: keeping a coarse sketch of what you skipped. This compresses every $m' \\gg m$ KV entries into $1$, a much more aggressive compression which aims to only keep the high level structure of the context ($m' = 128$).\nWe will now go through each of these components in detail.\nToken-level compression Our goal is to take an input hidden state $H\\in \\mathbb{R}^{n \\times d}$ and reduce its sequence length by a factor of $m$, and optionally reduce its dimensionality to $c \\ll d$. In other words, we want to bucket every $m$ tokens and compress them into a single token.\nWe\u0026rsquo;ll start with the simplest possible ideas and gradually fix their issues to get to the final solution.\nFirst, let\u0026rsquo;s project our hidden state into a lower dimensional space:\n$$ C = H W^{KV} \\quad \\text{where }\\, W^{KV} \\in \\mathbb{R}^{d \\times c} $$1. Average pooling with stride $(m, m)$\nThis is the simplest way to get from $m$ tokens to $1$ \u0026mdash; average them together! Our compressed KV entries are then just $$ C^{\\text{comp}}_ i = \\frac 1m \\sum_{j=mi}^{m(i+1)-1} C_{j}.$$As expected, this sucks.\nWe\u0026rsquo;re losing so much information by blindly averaging info together as informative tokens get drowned out by noise. There\u0026rsquo;s a simple fix: learned weights.\n2. Learnable weighted average\nWe should learn input-dependent weights for combining our tokens together. 
Let\u0026rsquo;s introduce a separate term for our compression weights: $$ Z = HW^{Z} \\quad \\text{where }\\, W^{Z} \\in \\mathbb{R}^{d \\times c}.$$Now, we can compute weights using a softmax over the $m$ positions in each window, just as we do in attention, so\n$$ S_{mi: m(i+1) - 1} := \\text{softmax}([Z_{mi: m(i+1) - 1}]) \\implies C_i^{\\text{comp}} = \\sum_{j = mi}^{m(i+1)-1} S_j \\odot C_j.$$However, now we have a slight issue: if a piece of content spans a window boundary, we\u0026rsquo;ll be splitting its information across two compressed entries. So we should learn to integrate information across boundaries.\n3. Boundary conditions\nThe fix is simple: integrate information from both the previous and current windows. We broaden our compression weights accordingly and sum across both windows.\n$$ S_{{\\color{red}m(i-1)}: m(i+1) - 1} := \\text{softmax}([Z_{{\\color{red}m(i-1)}: m(i+1) - 1}]) \\implies C_i^{\\text{comp}} = \\sum_{j = {\\color{red}m(i-1)}}^{m(i+1)-1} S_j \\odot C_j.$$This has a subtler issue, however: every window is now used twice, once as the current window to be compressed and once as the context for the next window, and nothing distinguishes the two roles. Without that distinction, we\u0026rsquo;ll be compressing the same information twice. This leads to our final solution.\n4. Dual-role compression\nThank you Claude for the visualization again.\nTo tackle this, we introduce a new set of weights for the previous window. Let $a$ refer to our current window, and $b$ refer to the previous window. Then we have four sets of states:\n$$ \\begin{aligned} \\text{KV values:} \u0026\\quad C^a = H W^{aKV}, \\quad C^b = HW^{bKV} \\\\ \\text{KV compression weights:} \u0026\\quad Z^a = HW^{aZ}, \\quad\\,\\;\\; Z^b = HW^{bZ} \\end{aligned} $$Let\u0026rsquo;s also introduce a learnable positional bias $\\mathcal{B}^a, \\mathcal{B}^b \\in \\mathbb{R}^{m \\times c}$, which we\u0026rsquo;ll add to our KV compression weights to help further disambiguate context from current. 
This leaves us with\n\\[ [S^a_{mi: m(i+1) - 1}, S^b_{m(i-1): mi - 1}] = \\text{softmax}([Z^a_{mi: m(i+1) - 1} + \\mathcal{B}^a; Z^b_{m(i-1): mi - 1} + \\mathcal{B}^b]). \\]Then it remains to integrate this over our windows:\n\\[ C_i^{\\text{comp}} = \\sum_{j = mi}^{m(i+1)-1} S_j^a \\odot C_j^a + \\sum_{j = m(i-1)}^{mi-1} S_j^b \\odot C_j^b, \\quad \\text{so that } C^{\\text{comp}} \\in \\mathbb{R}^{\\frac nm \\times c}. \\]And we are done.\nRemarks. I\u0026rsquo;m not personally sure of how important handling the boundary conditions is, but I\u0026rsquo;m assuming this is the primary motivation for DeepSeek\u0026rsquo;s implementation. There are a lot of moving parts as well, and this softmax is slightly cheaper than attention (since we avoid the heavy $QK^T$ computation), so presumably this is worth the overhead for better representational power.\nCompressed Sparse Attention (CSA) Compressed Sparse Attention (CSA) uses the token-level compressor and the earlier DeepSeek Sparse Attention (DSA) to select top-$k$ entries for the context. Thus, all we have left to explain is how the Lightning Indexer selects these entries.\nFirst, we perform the same token-level compression as before to get compressed indexer keys $K^{\\text{IComp}} \\in \\mathbb{R}^{\\frac nm \\times c^I}$, where $c^I \\ll c$ is an even more compressed dimensionality than before $(c^I = 128)$. All we need from this is some idea of the relative importance of each entry.\nMulti-query attention (MQA) illustration from the GQA paper.\nTo calculate the importance of each KV entry, we\u0026rsquo;re going to use multi-query attention (MQA) [6Joshua Ainslie, James Lee-Thorp, Michiel de Jong, et al. \u0026ldquo;GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.\u0026rdquo; 2023.]. Unlike standard multi-head attention where we project our query, key, and value into $n_h$ different heads, we only project our query into different heads and share a single key head across all of them. 
This obviously comes at a cost of expressivity, but we don\u0026rsquo;t need much for index selection. We get the search power of 64 query heads, but we only have to store and fetch one key head. This cuts key traffic by a factor of 64, easing the memory bandwidth bottleneck that normally plagues long-context decoding.\nFor an input token $t$ with hidden state $\\mathbf{h}_t \\in \\mathbb{R}^d$, we produce multiple indexer queries $\\{\\mathbf{q}_{t,1}^I, \\mathbf{q}_{t,2}^I, \\ldots, \\mathbf{q}_{t,n^I_h}^I\\} \\in \\mathbb{R}^{n^I_h \\times c^I}$ in a low-rank manner to save on compute.\n$$ \\underset{\\substack{\\\\ \\mathbb{R}^d}}{\\mathbf{h}_t} \\xrightarrow[\\text{down-projection}]{W^{DQ}} \\underset{\\substack{\\\\ \\mathbb{R}^{d_c} \\\\ \\text{compressed latent} \\\\ \\text{query vector}}}{\\mathbf{c}_t^Q} \\xrightarrow[\\text{up-projection}]{W^{IUQ}} \\underset{\\substack{\\\\ \\mathbb{R}^{c^I n_h^I}}}{\\mathbf{q}_t^I} = \\underbrace{\\big[ \\mathbf{q}_{t,1}^I ; \\mathbf{q}_{t,2}^I ; \\dots ; \\overset{\\in \\mathbb{R}^{c^I}}{\\mathbf{q}_{t,n_h^I}^I} \\big]}_{n_h^I \\text{ indexer query heads.}}. $$Here, we first project down from $d=4096$ to $d_c=1024$, then up to $n^I_h c^I=64\\cdot 128 = 8192$ dimensions split between $n^I_h=64$ indexer query heads. The two projections cost $d \\cdot d_c + d_c \\cdot n^I_h c^I \\approx 12.6$M multiply-accumulates per token, versus $d \\cdot n^I_h c^I \\approx 33.6$M for a single direct projection, which saves us roughly 60% of the compute. 
We also save a factor of $n^I_h$ on the key side of our attention compared to standard multi-head attention, which includes the VRAM savings from having a much smaller KV cache.\nFinally, now that we have these queries and keys, we obtain the per-head weights, which play the role of the \u0026ldquo;values\u0026rdquo; in our multi-query attention (MQA), as\n\\[ [\\mathbf{w}_{t,1}^I, \\mathbf{w}_{t,2}^I, \\ldots, \\mathbf{w}_{t,n^I_h}^I] = \\mathbf{w}^I_t = \\mathbf{h}_t W^w \\quad\\quad W^w \\in \\mathbb{R}^{d \\times n^I_h}.\\]We use a cheap form of attention with ReLU as our nonlinearity instead of a softmax and sum over our query heads to obtain our final indexer scores:\n\\[ I_{t,s} = \\sum_{h=1}^{n^I_h} \\mathbf{w}_{t,h}^I \\cdot \\mathrm{ReLU}(\\mathbf{q}_{t,h}^I\\cdot K_{s}^{\\text{IComp}}) \\]where $s$ indexes a preceding compressed block $(s < \\lfloor \\frac tm \\rfloor)$. We take the top-$k$ compressed KV entries based on these scores, where $k=512$ for the Flash model.\nFinally, while having compressed entries is good, we also maintain a sliding window of the most recent $n_{\\text{win}}=128$ raw entries, as recent tokens are more relevant and still important to keep around.\nHeavily Compressed Attention (HCA) Heavily Compressed Attention is virtually the same as CSA, except we use a much more aggressive compression factor $m' \\gg m$ to compress every $m'$ KV entries into $1$ ($m'=128$). This means we can also forgo the lightning indexer, as the compressed context is usually short enough to attend over without needing sparsity (a million tokens becomes roughly 7.8K entries at $m'=128$).\nConclusion DeepSeek V4\u0026rsquo;s architecture is a strong step towards million-token context. 
It keeps the hybrid design of its contemporaries, but is not adventurous enough to adopt linear attention layers like Gated DeltaNet.\nI personally am excited to understand better why each choice was necessary (as there are no ablations), but the results clearly speak for themselves.\nThere are some minor details about how RoPE needs to be applied in CSA that matter, but maybe I\u0026rsquo;ll save that for another day.\nReferences [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan \"Longformer: The Long-Document Transformer.\" arXiv:2004.05150 2020. [2] Team, G., Kamath, A., Ferret, J., et al. \"Gemma 3 Technical Report.\" 2025. [3] Zhang, Y., Lin, Z., Yao, X., et al. \"Kimi Linear: An Expressive, Efficient Attention Architecture.\" 2025. [4] Arnuv Tandon, Karan Dalal, Xinhao Li, et al. \"End-to-End Test-Time Training for Long Context.\" 2025. [5] DeepSeek-AI \"DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.\" 2026. [6] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, et al. \"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.\" 2023. 
","permalink":"https://tylerzhu.com/blog/2026/05/deepseek-v4-much-ado-about-context/","summary":"\u003cp\u003eThanks to Will Hwang for his helpful thoughts and references on recent long context architectures.\u003c/p\u003e\n\u003cp\u003eA few weeks ago, DeepSeek released their V4 model, headlined by the tag: \u0026ldquo;Towards Highly Efficient Million-Token Context Intelligence\u0026rdquo;.\nOther models like Gemini and Claude have claimed million-token context before, but in practice were only effective for much shorter contexts on any real tasks.\nDeepseek V4 requires 27% of single-token inference FLOPs and 10% of KV cache that DeepSeek-V3.2 uses, which is already optimized for long-context tasks.\nThese improvements have the community in astonishment (see below).\u003c/p\u003e","title":"DeepSeek V4: Much Ado About Context"},{"content":"I used to be very up to date on self-supervised learning, but fell out of it as the field itself slowly died down in favor of VLMs and what not after SigLIP/DINO/V-JEPA became the dominant paradigms. This means I haven\u0026rsquo;t read any SSL papers seriously since 2023.\nHowever, that doesn\u0026rsquo;t mean I\u0026rsquo;ve been living under a rock. I\u0026rsquo;m still well aware of Yann LeCun\u0026rsquo;s anti-pixel prediction tirade, and in that time, nothing came out that convinced me we could move away from pixel-level supervision. It\u0026rsquo;s simply such a strong prior to enact for self-supervision: you get multi-view consistency and true spatial grounding at the slight cost of having to model high-frequency pixel details.\nYann dropped by Princeton this week to give a talk, and I was shocked by how much JEPA work had happened since my hiatus from SSL. All sorts of information-theoretical regularization methods and applications of JEPA had come out, which piqued my interest. I even got to ask him why he was against pixel prediction. 
His main point was that Kaiming had tried for years before giving up and going to MIT. But I needn\u0026rsquo;t feel bad\u0026mdash;it took Yann 10 years to figure out pixels were not it, so it was okay if it took me 3 days. Funny guy honestly.\nI also happened to just give a reading group on LeWorldModel [1Maes, L., Le Lidec, Q., Scieur, D., et al. \u0026ldquo;LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels.\u0026rdquo; arXiv preprint arXiv:2603.19312. 2026.] this afternoon (link to slides here if you\u0026rsquo;re interested). I\u0026rsquo;m in a very JEPA time of my life, if you can\u0026rsquo;t tell.\nTitle slide for my reading group presentation. I got a bit carried away and forgot which LeGOAT I was supposed to be talking about\u0026hellip;\nIn the wake of that, I\u0026rsquo;ve gotten a chance to revisit the literature from the beginning and understand what JEPA argues for, marketing and Twitter drama aside. This post (and hopefully a series) is my documentation of interesting discoveries I make, some more obvious than others. Spoiler alert: I think JEPA has some legs to it!\nJEPA is an objective, not a prior Based on how Yann markets it, I always assumed JEPA was about forgoing pixel-level supervision and thus its priors. I also figured that it was its own type of self-supervised learning, similar to the other host of contrastive/distillation approaches with its minor tweaks. But they\u0026rsquo;re actually more similar than I had thought.\nHigh level comparison of JEPA to other SSL methods, from [9].\nThis is a schematic for comparison of different self-supervised methods. Under the first category of joint-embedding architectures, you have methods which pull together similar concepts and push away dissimilar ones. This relies heavily on data augmentation to obtain \u0026ldquo;self-similar\u0026rdquo; views of the data to do this without labels. The canonical example is SimCLR [2Chen, T., Kornblith, S., Norouzi, M., et al. 
\u0026ldquo;A Simple Framework for Contrastive Learning of Visual Representations.\u0026rdquo; ICML. 2020.], but pretty much everything falls under this category: MoCo [3He, K., Fan, H., Wu, Y., et al. \u0026ldquo;Momentum Contrast for Unsupervised Visual Representation Learning.\u0026rdquo; CVPR. 2020.], BYOL [4Grill, J., Strub, F., Altché, F., et al. \u0026ldquo;Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.\u0026rdquo; NeurIPS. 2020.] (i.e. non-contrastive methods), DINO [5Caron, M., Touvron, H., Misra, I., et al. \u0026ldquo;Emerging Properties in Self-Supervised Vision Transformers.\u0026rdquo; ICCV. 2021.], and even multimodal examples like CLIP [6Radford, A., Kim, J. W., Hallacy, C., et al. \u0026ldquo;Learning Transferable Visual Models From Natural Language Supervision.\u0026rdquo; ICML. 2021.] or SigLIP [7Zhai, X., Mustafa, B., Kolesnikov, A., et al. \u0026ldquo;Sigmoid Loss for Language Image Pre-Training.\u0026rdquo; ICCV. 2023.] where $x,y$ are text and image encoders and $D(s_x, s_y)$ is a contrastive loss (i.e., a softmax or sigmoid over each modality per class). The central concern is representation collapse, i.e., preventing everything from becoming the same embedding.\nThe second category is generative architectures, or reconstruction-based methods. These predict directly in pixel space, which prevents collapse issues as long as $z$ has less information capacity than the signal $y$. Masked Autoencoders [8He, K., Chen, X., Xie, S., et al. \u0026ldquo;Masked Autoencoders Are Scalable Vision Learners.\u0026rdquo; CVPR. 2022.] are the canonical reference, where $z$ consists of the mask tokens that the decoder learns to predict from unmasked context $x$.\nThe final category is joint-embedding predictive architectures. The key difference to generative methods is that they predict in latent space, not pixel space. 
As a result, they also suffer from collapse issues, which can be addressed through techniques similar to those above, like asymmetric architectures.\nA slight difference between joint-embedding architectures and others is that the first generally seeks \u0026ldquo;invariance\u0026rdquo; to data augmentations, whereas the latter two seek representations that are \u0026ldquo;predictive\u0026rdquo; of each other when given some conditioning signal $z$.\nI-JEPA is surprisingly DINO (to me) Let\u0026rsquo;s take the Image-JEPA (I-JEPA) [9Assran, M., Duval, Q., Misra, I., et al. \u0026ldquo;Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.\u0026rdquo; CVPR. 2023.] architecture as an example below. We have a context encoder $f_\\theta$ whose job is to encode a partial crop of the input into enough information for the predictor $g_\\phi$, conditioned on positional tokens (colored), to predict a target representation. In true JEPA fashion, we supervise with an $L_2$ loss defined on the latent representations, where the target encoder is an exponential moving average (EMA) of the context encoder for asymmetry.\nOverview of the I-JEPA architecture, where the goal is to predict target region representations based on a specific context.\nLook familiar? If you squint, it\u0026rsquo;s just the DINO prior! [5Caron, M., Touvron, H., Misra, I., et al. \u0026ldquo;Emerging Properties in Self-Supervised Vision Transformers.\u0026rdquo; ICCV. 2021.]\nLet me make this explicit, as it\u0026rsquo;s not immediately clear. In DINO, we also have a student and teacher model. To distinguish their input signals, the student receives local crops of $96\\times 96$ while the teacher receives global crops of $224\\times 224$ (in practice, the student also receives global crops, but we\u0026rsquo;re keeping things simple!). 
This is the heart of DINO\u0026rsquo;s local-to-global correspondence prior, and is similar to the I-JEPA prior above: we want the student representation to be informative enough that it can match the teacher\u0026rsquo;s global view (with the same tricks used to prevent collapse).\nOverview of the DINO architecture.\nThe only difference, obviously, is in how we supervise the representations. Here, we adopt a knowledge distillation framework (hence the name, distillation with no labels, or DINO) and match the distributions by a cross-entropy loss, i.e.\n$$ \\min_{\\theta_s} -p_t \\log p_s. $$Simplicity is key\u0026mdash;all that\u0026rsquo;s needed to prevent representation collapse is an EMA with a centering and sharpening of the teacher outputs.\nAllow me to indulge in a brief tangent that the DINO paper opened my eyes to. Interestingly, DINO is like the ultimate evolution of BYOL (Bootstrap Your Own Latent). BYOL was one of the first SSL works that obtained good results without contrasting between images, instead using MSE to match features from a student to a momentum-updated teacher. However, having a predictor was crucial so that the representations did not collapse (as well as the batch norm\u0026hellip;). DINO adopted virtually the same architecture, but used cross entropy (i.e., distillation), and found that now the predictor (marginally) hurt performance (see Table 7 [5Caron, M., Touvron, H., Misra, I., et al. \u0026ldquo;Emerging Properties in Self-Supervised Vision Transformers.\u0026rdquo; ICCV. 2021.])! So much for JEPA motivation; empiricism wins out in the end. The entire Appendix B of the DINO paper is a great read, honestly.\nTable 14 from Appendix B, with detailed ablations of DINO compared to MoCo and BYOL.\nIn fact, I-JEPA also uses MSE on its representations, and requires a predictor (again, in a slightly different manner). 
They do an ablation where they use MSE on pixel-loss and it does much worse, but I\u0026rsquo;m not sure I trust how hard they tried for that experiment.\nBack to JEPA business. Now you might say, gee Tyler, I actually think this also sounds a lot like MAE, to which I\u0026rsquo;d say, you\u0026rsquo;re also correct. The paper itself mentions:\nOur encoder/predictor architecture is reminiscent of the generative masked autoencoders (MAE) [36] method. However, one key difference is that the I-JEPA method is non-generative and the predictions are made in representation space.\nIt\u0026rsquo;s more similar in that there is an explicit predictor conditioned on some latent variable $z$, so the predictor is \u0026ldquo;controllable\u0026rdquo; or \u0026ldquo;targeted\u0026rdquo;. But I find the specific architecture is more similar to DINO since generative methods don\u0026rsquo;t have the same pressure to prevent collapse. In other words, to be honest, all these methods have similar priors, which actually are grounded in our data. We\u0026rsquo;re just doing context-prediction, which is like a bi-directional next token prediction, although just not as dense.\nV-JEPA is unsurprisingly MAE Now that I\u0026rsquo;ve primed you, this next part shouldn\u0026rsquo;t be surprising. Although I will say that this point was lost on me the first (and second, and third, and \u0026hellip;) time I tried understanding V-JEPA. [10Bardes, A., Garrido, Q., Ponce, J., et al. \u0026ldquo;Revisiting Feature Prediction for Learning Visual Representations from Video.\u0026rdquo; arXiv preprint arXiv:2404.08471. 2024.]\nOverview of the V-JEPA architecture.\nFor extending to videos, V-JEPA full on adopts the MAE pipeline. This means that they generate masks starting with a few contiguous spatial segments expanded through time (which forms the target), and removing those to get the context. 
The remaining tokens are fed through the encoder, and then the mask tokens $z$ are added back for a predictor/decoder to predict what the masked tokens should be. These are supervised using $L_1$ against the target representations, i.e., the outputs of the $y$-encoder, which is just an EMA of the $x$-encoder.\nThe diagram is not terribly clear. In particular, the non-mask tokens are dropped from the predictor in the diagram, even though it clearly predicts everything (just like MAE). The only difference is that we\u0026rsquo;re supervising in latent space, not pixel space. This makes sense\u0026mdash;there are too many high frequency details for learning great representations, so even MAE does some post-processing like normalizing the pixels to use as the target instead. Also, inputs are provided both on the left and the right, which should be a cardinal sin of diagram creation.\nHere\u0026rsquo;s a picture of the MAE architecture to remind you.\nOverview of the MAE architecture.\nIn this sense, I don\u0026rsquo;t actually disagree with V-JEPA at all. It does away with many of the high frequency details which plague MAE, but in turn needs to deal with supervising latent representations. This requires borrowing from the standard SSL literature, for pretty good results.\nParting Thoughts I have two main thoughts. The first is that when I first interpreted Yann\u0026rsquo;s \u0026ldquo;no-pixel-supervision\u0026rdquo; claim, I thought he meant we need to do away with using priors about pixel data, i.e., images. It turns out that he really just means no pixel supervision, literally. Amidst all the marketing that point got lost, but seeing as he\u0026rsquo;s just getting around MAE\u0026rsquo;s weakest spot of high frequency pixel noise (even keeping the dense prediction you get compared to I-JEPA), I like it a lot. 
We\u0026rsquo;re not throwing away structure that we can take advantage of, but actually leveraging it in our prior.\nThe only weird thing is that the latents for some reason aren\u0026rsquo;t as immediately usable as, say, DINO\u0026rsquo;s. Anecdotally, in a few of my projects, I\u0026rsquo;ve tried using frozen V-JEPA 2 representations for VLMs and just for representation analysis (primarily in multimodal settings), and they\u0026rsquo;ve been much worse than other frozen representations. From chatting with others, I\u0026rsquo;m not the only one with this problem, but maybe they\u0026rsquo;re just not semantic at all.\nThe second is that I\u0026rsquo;m not convinced that this specific framework has to be it. I\u0026rsquo;m sure that goes without saying, but JEPA was not the pioneer of either SSL or latent prediction in any sense. It built on a whole host of works that carved the landscape out, and it will continue to be one of many works which approach SSL with its own angle. Only time will tell which approach survives. DINO is already on version 3 and alive and well, as are Perception Encoder (which is a Hiera model, a descendant of MAE actually!) and SigLIP as alternative approaches.\nThere are also data flywheel approaches like SAM which one could consider as active or curriculum learning, but which have been immensely popular while still being \u0026ldquo;semi-supervised\u0026rdquo; in a sense. Recurrent models like Recurrent Masked Auto-Encoders are also refreshing takes on a field that has felt mostly stagnant for new ideas, revisiting our assumptions on an input signal level.\nAll of this is to say: representation learning is far from dead\u0026mdash;it is more alive than ever!\nReferences [1] Maes, L., Le Lidec, Q., Scieur, D., et al. \"LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels.\" arXiv preprint arXiv:2603.19312 2026. [2] Chen, T., Kornblith, S., Norouzi, M., et al. 
\"A Simple Framework for Contrastive Learning of Visual Representations.\" ICML 2020. [3] He, K., Fan, H., Wu, Y., et al. \"Momentum Contrast for Unsupervised Visual Representation Learning.\" CVPR 2020. [4] Grill, J., Strub, F., Altché, F., et al. \"Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.\" NeurIPS 2020. [5] Caron, M., Touvron, H., Misra, I., et al. \"Emerging Properties in Self-Supervised Vision Transformers.\" ICCV 2021. [6] Radford, A., Kim, J. W., Hallacy, C., et al. \"Learning Transferable Visual Models From Natural Language Supervision.\" ICML 2021. [7] Zhai, X., Mustafa, B., Kolesnikov, A., et al. \"Sigmoid Loss for Language Image Pre-Training.\" ICCV 2023. [8] He, K., Chen, X., Xie, S., et al. \"Masked Autoencoders Are Scalable Vision Learners.\" CVPR 2022. [9] Assran, M., Duval, Q., Misra, I., et al. \"Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.\" CVPR 2023. [10] Bardes, A., Garrido, Q., Ponce, J., et al. \"Revisiting Feature Prediction for Learning Visual Representations from Video.\" arXiv preprint arXiv:2404.08471 2024. 
","permalink":"https://tylerzhu.com/blog/2026/04/lethoughts-on-jepa-the-return-of-ssl/","summary":"\u003cp\u003eI used to be very up to date on self-supervised learning, but fell out of it as the field itself slowly died down in favor of VLMs and what not after SigLIP/DINO/V-JEPA became the dominant paradigms.\nThis means I haven\u0026rsquo;t read any SSL papers seriously since 2023.\u003c/p\u003e\n\u003cp\u003eHowever, that doesn\u0026rsquo;t mean I\u0026rsquo;ve been living under a rock.\nI\u0026rsquo;m still well aware of Yann LeCun\u0026rsquo;s anti-pixel prediction tirade, and in that time, nothing came out that convinced me we could move away from pixel-level supervision.\nIt\u0026rsquo;s simply such a strong prior to enact for self-supervision: you get multi-view consistency and true spatial grounding at the slight cost of having to model high-frequency pixel details.\u003c/p\u003e","title":"LeThoughts on JEPA: The Return of SSL"},{"content":"A few years ago at our CVPR 2023 Transformers for Vision workshop, Lucas Beyer said something that struck me by surprise. I\u0026rsquo;ve been trying to piece it together ever since.\nThe interaction went something like this:\nAudience: \u0026ldquo;Why aren\u0026rsquo;t we scaling vision models as large as we do LLMs?\u0026rdquo;\nLucas: \u0026ldquo;You know, actually, the largest vision models are on par with the largest language models if you look at [X].\u0026rdquo;\nI can never quite remember what X was — FLOPs, parameters, or token budget. Obviously now it\u0026rsquo;s not parameters. The largest recorded ViTs still tap out in the 22B regime, with the most consistent scaling amounts being 1B–7B as in DINOv3 [10].\n@giffmana was saying at CVPR that large vision models (like VIT-22B) are actually on par with LLMs when controlling for compute or tokens or something? i don’t remember actually but it was an interesting point.\n\u0026mdash; Tyler Zhu (at ICLR!) 
(@tyleryzhu) August 29, 2023 This mystery still somewhat holds true. I think about it often enough that I\u0026rsquo;ve decided to try and trace down the original point, with Gemini as my somewhat-reliable research assistant.\nX marks the treasure, and potentially AGI Among the possibilities, parameters is clearly not it. If anything, vision models are surprisingly parameter efficient for their capabilities.\nI would argue that vision models haven\u0026rsquo;t found enough signal to benefit from scaling. Despite the best efforts shown in ViT-22B [4] and 4DS [2], works like DINOv3 [10] have found large vision models to be terribly unstable to train. Anecdotally, I found these larger models to have worse performance than their smaller 1B variants — but YMMV.\nA better candidate is FLOPs, i.e. compute invested during training. To compare, we need to compute tokens, since:\n$$\\text{FLOPs} = \\text{tokens} \\times \\text{FLOPs per token} \\approx \\text{tokens} \\times \\text{params}$$For this comparison, I\u0026rsquo;ll sample models released before the interaction (June 2023):\nLlama-2 [12] as the LLM DINOv2-g [9], SigLIP-1 [13], and ViT-22B [4] as the vision encoders Large Language Model FLOPs Llama-2 [12] comes in three sizes identical to its predecessor: 7B, 13B, and 70B. Each is trained on 2.0T tokens (de-duplicated) with a context length of 4k.\nWe can estimate FLOPs as a function of the sequence length, layers, hidden dimension, and heads ($N, L, D, H$).\nPer attention block:\nQKV projections: Each is $[N \\times D] \\times [D \\times D] \\to 2ND^2$. Three of them gives $6ND^2$. Attention scores: $QK^T$ is $[N \\times D] \\times [D \\times N] \\to 2N^2D$. Heads don\u0026rsquo;t affect FLOPs. Value multiplication: $[N \\times N] \\times [N \\times D] \\to 2N^2D$. Output projection: $[N \\times D] \\times [D \\times D] \\to 2ND^2$. 
Total per attention block: $8ND^2 + 4N^2D$.\nPer feed-forward block: A $[D \\to 4D \\to D]$ connection gives $16ND^2$.\nPer transformer block: $24ND^2 + 4N^2D$. Multiply by $3BL$ for the full model (forward + backward), where $B$ is the number of training sequences, so $BN$ equals total tokens.\nOur parameter count is $12D^2$ per block ($4D^2$ for Q,K,V,O and $8D^2$ for FFN), i.e. $12LD^2$ in total. When $D \\gg N$, the $N^2D$ term is negligible and we get a common approximation for total FLOPs:\n$$ \\text{FLOPs} \\approx 72BNLD^2 = 6 \\times \\text{params} \\times \\text{tokens} $$To stay consistent with how vision researchers calculate FLOPs, we\u0026rsquo;ll track just forward FLOPs (a third of the total):\n$$ \\text{fwd FLOPs} \\approx BL(24ND^2 + 4N^2D) \\approx 2 \\times \\text{params} \\times \\text{tokens} $$Below are the model stats, largely from the Llama-1 paper [11]. Note that $D \\approx N$ here, so the approximation doesn\u0026rsquo;t hold perfectly.\nOne detail: Llama-2 70B uses Grouped-Query Attention, where multiple queries share the same KV vectors. With 64 query heads but only 8 KV heads, the K and V projections shrink by $8\\times$, so the per-block attention FLOPs reduce from $8ND^2 + 4N^2D$ to $4.5ND^2 + 4N^2D$.\nModel | $N$ | $L$ | $D$ | $H$ | tokens | FLOPs/tkn | total fwd FLOPs\nLlama-2 7B | 4096 | 32 | 4096 | 32 | 2.0 T | 14 G | 28T GFLOPs\nLlama-2 13B | 4096 | 40 | 5120 | 40 | 2.0 T | 26.56 G | 53T GFLOPs\nLlama-2 70B | 4096 | 80 | 8192 | 64 | 2.0 T | 130 G | 260T GFLOPs\nVision Transformer FLOPs\nNow the Vision Transformers. I couldn\u0026rsquo;t find enough details about SigLIP\u0026rsquo;s exact compute budget, but the \u0026ldquo;Getting ViT in Shape\u0026rdquo; paper [1] is a great reference. I\u0026rsquo;ll substitute in the SoViT-400m/14 model.\nThat paper has extensive discussion of compute budgets, and shows that FLOPs correlate well with TPU core-hours, so it\u0026rsquo;s a decent proxy.\nHere are the specs for the largest size of each architecture, on a constant 14×14 patch size. 
These are actual traced numbers, so the comparison with LLMs won\u0026rsquo;t be entirely fair, but close enough.\nModel | params | patches | width | depth | dim | FLOPs/tkn | pretrain | tokens | total fwd FLOPs\nSoViT-400m/14 | 428M | 256 | 1152 | 27 | 4304 | 0.86 G | 40B (~13e) | 10.2 T | 9T GFLOPs\nSoViT-400m/14 | 428M | 1369 | 1152 | 27 | 4304 | 1.00 G | 6.5B (~2e) | 8.9 T | 9T GFLOPs\nDINOv2-g | 1.01B | 256 | 1408 | 40 | 6144 | 2.08 G | 7.1B (~50e) | 1.8 T | 3.8T GFLOPs\nViT-22B | 22B | 256 | 6144 | 48 | 24576 | 40.78 G | 11.5B (~3e) | 2.9 T | 120T GFLOPs\nDINOv2-g uses a true ViT-g shape (slightly modified), not counting the teacher + student. ViT-22B total estimated using the formula above. DINOv2-g pretrain tokens estimated from the only training recipe of 500 epochs of ImageNet-22k (14M images).\nInterestingly, our LLM estimate isn\u0026rsquo;t far off for ViTs. The MLP expansion factor for larger ViTs is usually ~3.7 rather than 4.0. If we account for the actual expansion factor $\\alpha$, the ratio of per-token forward FLOPs to parameters in each block is:\n$$(8 + 4\\alpha) / (4 + 2\\alpha) = 2$$So it turns out not to matter much. The remaining error resolves with the full form above.\nMusings\nThe first thing that surprised me was how total FLOPs is the right baseline for cross-modal comparison. It\u0026rsquo;s somewhat obvious in hindsight, but coming from epoch-land in vision made it less clear at first.\nUnsurprisingly, vision researchers were far from hitting the same total FLOP count as language researchers until we scaled up both model and dataset size. ViT-22B is the closest, at 120T GFLOPs, squarely in Llama territory. This makes me believe our X is indeed total FLOP count.\nHowever, the gains weren\u0026rsquo;t as groundbreaking as what we\u0026rsquo;d expect from an LLM of similar scale. A few observations:\nNot all FLOPs are created equal. ViTs will find it hard to go beyond short sequence lengths to great success. Larger images help, but with diminishing returns (see Figure 7 in [1]). 
The signals from higher resolutions largely overlap with those present at smaller sizes — the same goes for multi-epoch training, which vision is stuck in.\nWe need more sources of signal, and more tokens of it. This likely means video, or joint text-image data as already done in multimodal models.\nVision tokens are less information-dense. I first heard this from Kaiming in a talk: BERT\u0026rsquo;s optimal masking ratio is 15% [5], compared to MAE\u0026rsquo;s 75% [7] and ST-MAE\u0026rsquo;s 90% [6]. That\u0026rsquo;s a huge gap, suggesting vision tokens are largely redundant.\nThis idea resurfaced in Transfusion [14] and Chameleon [3], where training mixed-modality transformers was difficult due to loss imbalance across modalities. It became a focal point of their follow-up work on MoE-Sparsity [8].\nTraining objectives aren\u0026rsquo;t comparable. LLMs use next-token prediction (conditional soft classification), whereas ViTs use SSL and one-hot classification. It\u0026rsquo;s unclear how these objectives affect scaling, or if there\u0026rsquo;s a unifying factor.\nWe need smarter compute allocation. Given that model performance doesn\u0026rsquo;t differ heavily with FLOPs/token, either our tokens need to be more informative, or we need to invest compute more wisely. Adaptive compute is one direction — but I think we need to first figure out how to use more compute effectively before optimizing how we use it.\nReferences [1] Alabdulmohsin, I., Zhai, X., Kolesnikov, A., and Beyer, L. \u0026ldquo;Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design.\u0026rdquo; NeurIPS 2023.\n[2] Carreira, J., Gokay, D., King, M., et al. \u0026ldquo;Scaling 4D Representations.\u0026rdquo; arXiv preprint 2024.\n[3] Chameleon Team. \u0026ldquo;Chameleon: Mixed-Modal Early-Fusion Foundation Models.\u0026rdquo; arXiv preprint 2024.\n[4] Dehghani, M., Djolonga, J., Mustafa, B., et al. 
\u0026ldquo;Scaling Vision Transformers to 22 Billion Parameters.\u0026rdquo; ICML 2023.\n[5] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. \u0026ldquo;BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.\u0026rdquo; NAACL 2019.\n[6] Feichtenhofer, C., Fan, H., Li, Y., and He, K. \u0026ldquo;Masked Autoencoders As Spatiotemporal Learners.\u0026rdquo; NeurIPS 2022.\n[7] He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. \u0026ldquo;Masked Autoencoders Are Scalable Vision Learners.\u0026rdquo; CVPR 2022.\n[8] Kilian, M., Mkrtchyan, O., Zettlemoyer, L., et al. \u0026ldquo;Improving MoE Compute Efficiency by Composing Weight and Data Sparsity.\u0026rdquo; arXiv preprint 2026.\n[9] Oquab, M., Darcet, T., Moutakanni, T., et al. \u0026ldquo;DINOv2: Learning Robust Visual Features without Supervision.\u0026rdquo; TMLR 2024.\n[10] Simeoni, O., Vo, H. V., Seitzer, M., et al. \u0026ldquo;DINOv3.\u0026rdquo; arXiv preprint 2025.\n[11] Touvron, H., Lavril, T., Izacard, G., et al. \u0026ldquo;LLaMA: Open and Efficient Foundation Language Models.\u0026rdquo; arXiv preprint 2023.\n[12] Touvron, H., Martin, L., Stone, K., et al. \u0026ldquo;Llama 2: Open Foundation and Fine-Tuned Chat Models.\u0026rdquo; arXiv preprint 2023.\n[13] Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. \u0026ldquo;Sigmoid Loss for Language Image Pre-Training.\u0026rdquo; ICCV 2023.\n[14] Zhou, C., Yu, L., Babu, A., et al. \u0026ldquo;Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model.\u0026rdquo; ICLR 2025.\n","permalink":"https://tylerzhu.com/blog/2026/02/have-we-scaled-vision-like-language-yet/","summary":"\u003cp\u003eA few years ago at our \u003ca href=\"https://sites.google.com/view/t4v-cvpr23?pli=1\"\u003eCVPR 2023 Transformers for Vision workshop\u003c/a\u003e, Lucas Beyer said something that struck me by surprise. 
I\u0026rsquo;ve been trying to piece it together ever since.\u003c/p\u003e\n\u003cp\u003eThe interaction went something like this:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003e\u003cstrong\u003eAudience:\u003c/strong\u003e \u0026ldquo;Why aren\u0026rsquo;t we scaling vision models as large as we do LLMs?\u0026rdquo;\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cblockquote\u003e\n\u003cp\u003e\u003cstrong\u003eLucas:\u003c/strong\u003e \u0026ldquo;You know, actually, the largest vision models are on par with the largest language models if you look at [X].\u0026rdquo;\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eI can never quite remember what X was — FLOPs, parameters, or token budget. Obviously now it\u0026rsquo;s not parameters. The largest recorded ViTs still tap out in the 22B regime, with the most consistent scaling amounts being 1B–7B as in DINOv3 \u003ca href=\"#ref-10\"\u003e[10]\u003c/a\u003e.\u003c/p\u003e","title":"Have we scaled vision like language yet?"},{"content":"I\u0026rsquo;ve always struggled to understand the intuitions behind Fully Sharded Data Parallel beyond the high level idea of \u0026ldquo;shard everything.\u0026rdquo; Without a systems background, the fundamental primitives like \u0026ldquo;all-reduce\u0026rdquo; and \u0026ldquo;reduce-scatter\u0026rdquo; aren\u0026rsquo;t in my vocabulary. But FSDP conceptually is not complicated, especially once you state what the goals are (the rest is nearly necessitated by the engineering).\nThis post is an attempt to deconstruct the algorithm from first principles as a non-systems person. I will bring up the primitives in their specified context, which I think helps reinforces the intuition much better. 
Most ML researchers have a stronger understanding of the models, params, and optimizer processes than the systems jargon anyways.\nThanks to Gemini and Tim Darcet\u0026rsquo;s off-hand comment about FSDP for clarifying my intuitions [1].\nHow to Train your Machine Learning Model How does one train a machine learning model? In the simplest case, such as doing a Coursera Intro to Machine Learning course, you\u0026rsquo;ll be training on your personal laptop, which likely only has a CPU. This will be very slow, as CPUs aren\u0026rsquo;t optimized for computing, but at least you\u0026rsquo;re training a model and doing machine learning!\nWhat is the limiting factor here? You\u0026rsquo;re capped by your CPU\u0026rsquo;s memory, i.e. the Random Access Memory (RAM). For example, my Macbook Pro has 32GB of Unified Memory. Quite the uphill battle here, since Google Chrome has likely already consumed most of it!\nYou\u0026rsquo;ll need to store almost everything in RAM for training to not be prohibitively slower than it already is. This includes\nthe dataset, which can be 200MB if you\u0026rsquo;re using CIFAR-10, 150GB if you\u0026rsquo;re using ImageNet, or 5TB+ for LLM pretraining sets like RedPajama (if loaded in memory). the model weights, i.e. the model itself. the model activations, for computing the gradients later on. the model gradients (and optimizer states), for updating the model with backpropagation. Every systems optimization which makes training faster and less memory-intensive revolves around trading off and improving these factors.\nNow in practice, to get anywhere with machine learning, you\u0026rsquo;ll need a hardware accelerator. So let\u0026rsquo;s say we were graciously gifted a single GPU for Christmas, which we\u0026rsquo;ll use to speed up our training. GPUs are good for one thing: speeding up matrix multiplications.\nThat\u0026rsquo;s handy, as those are most of our model\u0026rsquo;s computations. 
However, if we want to take advantage of these speedups, we\u0026rsquo;ll need to move our data and model to the GPU, and the cost of that is set by the CPU\u0026lt;-\u0026gt;GPU transfer rate, i.e. the PCIe bandwidth. For the L40 GPUs that I generally use, Gemini tells me that they support PCIe 4.0 with a bandwidth of 64 GB/s bi-directional (32 GB/s per direction), so moving a 10GB dataset from system RAM onto the GPU should take ~0.3s at best.\nnvidia-smi output for a 1-GPU instance\nFor a long time in computer vision, our training loads were small enough that virtually all of the model-related bytes could fit on GPU, so we just had to max out the amount of data we used (but that also has a speedup cap, i.e. the roofline model; see [2]). Oftentimes you\u0026rsquo;ll see jobs with batch size of 256+, even up to 2048, although generally most of the throughput gains were already hit at batch size 128.\nExample of the compute-bound roofline hitting around 64-128 batch size for a ViT-Small, from an old project of mine\nNow let\u0026rsquo;s say you\u0026rsquo;ve been extra nice this year, so Santa dropped off a whole node of 8 GPUs. How can we speed up our training using more GPUs?\nDDP: Suffering from Success\nThe key goal is to keep the GPUs warm as long as possible. In our case, our models fit on one GPU, so we\u0026rsquo;ll place a copy of our model on each device. Data remains the main knob we control. A good rule of thumb is to parallelize only when and what is necessary (we\u0026rsquo;ll quantify this later).\nIn effect, what we\u0026rsquo;re doing is running our model with a much larger batch size, split over all of the devices. This requires us to now manage the following:\nDistributing the data among all of the devices and running a forward pass\nComputing the gradients from each device on their specific mini-batch\nSynchronizing the gradients across all devices before the optimizer step\nThis scheme is called Distributed Data Parallel, or DDP, as we parallelize our data and nothing else. 
We shard (split) our data across devices, and the only thing we need to manage are the gradients.\nDDP illustration, taken from [3].\nNote that we don\u0026rsquo;t need to sync beyond the gradients. From the optimizer\u0026rsquo;s perspective, it\u0026rsquo;s simply optimizing a local model. Each replica starts from the same state and gets the same averaged gradients, so they naturally stay in sync [4].\nHow does this happen mechanically? This is powered by communication primitives which are the backbone of distributed programming. In our case, we want to combine (i.e. reductions like sum, min, avg) data across devices and store that result on each device, specifically the gradients. This is an all_reduce operation: it reduces data across all of the devices.\nThe all_reduce is actually a combination of two other primitives: the reduction step, and the gathering step. The first is a reduce_scatter, which reduces across all devices and scatters the result in equal-sized blocks across devices. You may wonder, why do we bother scattering after reducing? The answer is to reduce the amount of communication of course (and avoid unnecessary data transfer)!\nThe second is the all_gather, which gathers scattered results across devices into each device. Visually it should be quite clear now why all_reduce = reduce_scatter -\u0026gt; all_gather. And that\u0026rsquo;s DDP in a nutshell!\nWhat is the tradeoff now compared to the 1 GPU setting? Now we have to worry about communication costs for the gradients. Typically GPU\u0026lt;-\u0026gt;GPU transfer is supported by NVLink, but my L40 GPUs don\u0026rsquo;t support it, so we\u0026rsquo;re still using the PCIe 4.0 x16 bus at 32 GB/s per direction speeds (10-20x slower than NVLink :|). 
In our case of training small models, this isn\u0026rsquo;t much of an issue yet, but this can be prohibitive for 7B parameter LLMs (14GB of gradients -\u0026gt; 0.2-0.4s per step!).\nIn summary, we perform the following:\nForward pass (local)\nBackward pass (local)\nAll-reduce: Gradients are synchronized across GPUs\nOptimizer step (local, using synchronized gradients)\nTradeoff: Somewhat redundant memory usage for storing copies of the model.\nSome low-level details\nReading [4] sheds some light on low-level implementation details of DDP. One is the practical issue of determining when to trigger an all_reduce, i.e. knowing when all devices are done computing a specific gradient so we can sync. This is done with backward hooks (since backward() is out of DDP\u0026rsquo;s control).\nWhen one gradient is ready, its corresponding DDP hook on that grad accumulator will fire, and DDP will mark that parameter gradient as ready for reduction. \u0026hellip; When all [gradients] are ready, the Reducer will block waiting for all all_reduce operations to finish. When this is done, averaged gradients are written to the param.grad field of all parameters. So after the backward pass, the grad field on the same corresponding parameter across different DDP processes should be the same.\nHowever, naively calling an all_reduce for every single parameter would be catastrophic for your latency because each call has overhead. There is a fixed GPU launch time (~20 µs) to initiate a kernel or copy operation, which means many small calls will incur high latency.\nOne natural optimization is bucketing. To minimize communication costs, we group our parameters into buckets which are sent together. The buckets are assigned in approximately reverse order of the model.parameters(), since DDP expects gradients to become ready during the backward pass in approximately that order. 
(Actually, in recent versions, DDP tracks the param order in the first pass, then rebuilds buckets optimally [3]). This lets DDP pipeline the communication, transferring gradients for the last layers while the gradients for earlier layers are still being computed. All in all, our overall throughput shouldn\u0026rsquo;t take much of a hit compared to normal training, although in practice Nx devices will lead to ~(N-0.5)x speedups.\nBy default, PyTorch DDP uses a bucket size of 25MB. If your GPU is not hitting high utilization in the backward pass, it means you are latency bound, so it could be useful to set this limit higher to utilize more compute power.\nFSDP: BIG models need BIG sharding\nDing dong, it\u0026rsquo;s Most Surely Language at your door. Now they\u0026rsquo;re asking you to train LLMs, which are huge parameter models that barely fit on one GPU. What do you do?\nLet\u0026rsquo;s walk through the lifecycle of a step in the pipeline to see what\u0026rsquo;s needed now.\nDataset: While the overall dataset is large, with prefetching we only ever need a couple of batches in memory at a time. So this is hardly a prohibitive factor. We\u0026rsquo;ll distribute the data, just as in DDP.\nModel weights: Now we need to shard our weights across our devices to make our model fit into memory. For example, we could have Layer 1 on GPU 0, Layer 2 on GPU 1, etc. No single GPU holds the full model. In practice it will be more granular than this though.\nForward pass: Let\u0026rsquo;s say we\u0026rsquo;re at Layer 1. Each device needs to process its own mini-batch, but the weights for Layer 1 are sharded across all of the devices. Therefore:\nEach device will broadcast its shard of Layer 1 (i.e. an all_gather).\nTemporarily, every device holds the full weights for Layer 1, so they do a forward pass on their local data.\nImmediately afterwards, every device drops the full Layer 1 weights, keeping only its original shard.\nBackward pass: Just like the forward pass. 
To compute gradients for e.g. Layer 1, we need to recollect all shards of Layer 1, then compute the gradients before dropping the weights again.\nGradient update: This is where we differ from DDP. Remember that each device has a local gradient for Layer 1 from its own mini-batch, so we first need to average our gradients over all mini-batches. However, now the weights are scattered across our devices, so we don\u0026rsquo;t want to synchronize all gradients to every device, just to where the corresponding shards live. In fact, we have a primitive just for this: reduce_scatter!\nIn this way, we simply use each device as storage for the parameters, as GPU storage is much faster than loading from disk (CPU). We\u0026rsquo;re trading network bandwidth (sending the weights around constantly with all_gather) for memory capacity (storing a fraction of the model per device).\nCompared to DDP, we incur an extra all_gather per parameter, but because we\u0026rsquo;ve reduced the memory needed, we might be able to run with a larger batch size and still be faster.\nThe following summary from Tim Darcet distills this very clearly:\nTim Darcet\u0026rsquo;s summary of FSDP vs DDP\nDeeper Details\nI actually looked at the topology of the GPUs I use with nvidia-smi topo -m. It checks out with the spec, and offers some additional details which are interesting.\nEach cluster node is a Lenovo ThinkSystem SR670 V2 containing:\n2 x Intel Xeon Gold 5320 26-core CPUs\n16 x 32GB DDR4 3200MHz RDIMMs (512GB total)\n8 x NVIDIA L40 GPUs\n3.5TB of SSD for local scratch\nOne 10Gbps Ethernet uplink (Note: NOT InfiniBand)\nSo this makes sense, since each CPU is connected to 4 GPUs. 
The GPUs within each group have fast interconnect, while the GPUs across groups have slightly slower interconnect.\nReferences\n[1] https://x.com/TimDarcet/status/1914965730488955334\n[2] https://horace.io/brrr_intro.html\n[3] https://www.youtube.com/watch?v=RQfK_ViGzH0\n[4] https://docs.pytorch.org/docs/main/notes/ddp.html\n[5] https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486\n[6] CAPI implementation of SimpleFSDP: https://github.com/facebookresearch/capi/blob/main/fsdp.py\n[7] SimpleFSDP: https://arxiv.org/abs/2411.00284 ","permalink":"https://tylerzhu.com/blog/2026/02/fsdp-for-dummies/","summary":"\u003cp\u003eI\u0026rsquo;ve always struggled to understand the intuitions behind Fully Sharded Data Parallel beyond the high level idea of \u0026ldquo;shard everything.\u0026rdquo; Without a systems background, the fundamental primitives like \u0026ldquo;all-reduce\u0026rdquo; and \u0026ldquo;reduce-scatter\u0026rdquo; aren\u0026rsquo;t in my vocabulary. But FSDP conceptually is not complicated, especially once you state what the goals are (the rest is nearly necessitated by the engineering).\u003c/p\u003e\n\u003cp\u003eThis post is an attempt to deconstruct the algorithm from first principles as a non-systems person. I will bring up the primitives in their specified context, which I think helps reinforces the intuition much better. Most ML researchers have a stronger understanding of the models, params, and optimizer processes than the systems jargon anyways.\u003c/p\u003e","title":"FSDP for Dummies"},{"content":"Prelude\nThis all started when I came across this tweet from Timothee Darcet (co-first author on DINOv2)\nhttps://x.com/TimDarcet/status/1726320282028360131?s=20\nThis was in response to people overreacting to how the final problem in computer vision was for AI to tell the difference between a blueberry muffin and a chihuahua, which, to be fair, is a rather funny joke. 
It turns out that AI models can do this quite well though, and have been able to ever since CLIP came out! So what\u0026rsquo;s the big deal?\nA set of rather confusing images for an AI model, you might assume. Obtained from an Instagram meme account.\nIf you ask GPT-4V for the answer to this even now (as of December 2023), it will sometimes give you the wrong answer like below. Here, you can see that GPT almost gets them all, a handsome 12/16 correct. However, these mistakes only happen if you keep the images in this grid-like arrangement. In fact, if you feed these images into CLIP one by one, CLIP gets them correct with very high certainty (see Timothee\u0026rsquo;s tweet above).\nNote: Interestingly, the other pictures are quite easy for GPT-4V though, which honestly may just mean they\u0026rsquo;ve been incorporated into its training data already (but the other three aren\u0026rsquo;t nearly as difficult as this one).\nGPT-4V\u0026rsquo;s attempt at the chihuahua/muffin challenge, getting 12/16 correct\nSo what\u0026rsquo;s the deal then? It seems like Large Vision-Language Models (LVLMs) can easily do recognition if it is clear where to look, but they struggle to figure out where on their own. In other words, they have trouble with localization. This is the point of this simple test. Rather than asking it to identify challenging images in a collage, we ask any LVLM to simply identify the color of each square in a 4x4 grid like below on the left without using code (this is important, we\u0026rsquo;ll get back to this later). It turns out that it struggles significantly with this, which supports our hypothesis from above that the difficulty is not with the recognition.\nA simple 4x4 grid of colored squares - trivial for humans, challenging for LVLMs\nGPT-4V\u0026rsquo;s failed attempt at identifying the color grid without using code\nI have a few hypotheses as to why, but nothing concrete yet.\nOne of them is related to tokenization. 
Assuming that there is a Vision Transformer (ViT) backbone, it must be blurring the clearly demarcated boundaries between regions as it operates on a patch level, which also has no guarantee of preserving the structured information present. I wonder if this is a fundamental limitation of patchification, or perhaps simply a tradeoff that can be resolved by taking our patches to be smaller (say $P=4$, or even $P=2$?). Another is that there is generally very little spatial guidance from language as to where things in images are located (a guess, no concrete statistics here). There aren\u0026rsquo;t many examples of where a caption explicitly says \u0026ldquo;a person in the top right corner of the image running to the left\u0026rdquo;, or with explicit coordinates relative to the image. This also might be completely wrong, maybe Flickr/Pinterest have lots of these types of captions. Fugue Let\u0026rsquo;s change the game we\u0026rsquo;re playing now. I\u0026rsquo;m going to try and see how we can get these models to successfully solve this question in a reasonable* manner, where reasonable is a definition guided by vibes.\nFirst, we\u0026rsquo;re going to ask it to solve the question allowing it to use code, a la ViperGPT/VisProg. 
At first, I get a cryptic response which is definitely correct, and it even mentions how it made a mistake previously when it wasn\u0026rsquo;t allowed to use code!\nGPT-4V successfully solves the problem when allowed to use code\nI had to prompt it to give me the code it used, which produced the following.\nfrom PIL import Image\nimport numpy as np\n# Load the image\nimg = Image.open(\u0026#34;/mnt/data/random_4x4.png\u0026#34;)\n# Convert the image to RGB if it\u0026#39;s not\nif img.mode != \u0026#39;RGB\u0026#39;:\n    img = img.convert(\u0026#39;RGB\u0026#39;)\n# Resize the image to 4x4 pixels\nimg_resized = img.resize((4,4))\n# Convert the image to a numpy array\nimg_array = np.array(img_resized)\n# Initialize a 4x4 text map\ntext_map = np.full((4, 4), \u0026#39; \u0026#39;)\n# Define a simple function to identify the color\ndef identify_color(rgb):\n    # Set a threshold for color detection\n    threshold = 120\n    if rgb[0] \u0026gt; threshold and rgb[1] \u0026lt; threshold and rgb[2] \u0026lt; threshold:\n        return \u0026#39;R\u0026#39;\n    elif rgb[0] \u0026lt; threshold and rgb[1] \u0026gt; threshold and rgb[2] \u0026lt; threshold:\n        return \u0026#39;G\u0026#39;\n    elif rgb[0] \u0026lt; threshold and rgb[1] \u0026lt; threshold and rgb[2] \u0026gt; threshold:\n        return \u0026#39;B\u0026#39;\n    else:\n        return \u0026#39;U\u0026#39; # Unknown color\n# Loop over the array and assign the colors to the text map\nfor i in range(img_array.shape[0]):\n    for j in range(img_array.shape[1]):\n        text_map[i][j] = identify_color(img_array[i][j])\n# The text_map contains the 4x4 grid with the identified colors\nHey! This code is cheating! It\u0026rsquo;s basically reducing the image naively to a 4x4 grid based on the information I told it, which doesn\u0026rsquo;t amount to any real understanding of the problem. 
In fact, it also gives some insight into when the model is wrong, it likely is trying to do something of the above but fails to.\nLet\u0026rsquo;s not tell it that the input is of size 4x4 and re-prompt it again to see how it solves it now. Opening up a fresh session, I give it the following prompt.\nUsing code, recognize each image in the collage, then print out a grid text map with the convention R = red, G = green, B = blue.\nUnexpectedly, it still gets it correct! Here\u0026rsquo;s the output code.\nfrom PIL import Image # Load the image img = Image.open(\u0026#34;/mnt/data/random_4x4.png\u0026#34;) # Convert the image to RGB img_rgb = img.convert(\u0026#34;RGB\u0026#34;) # Get the size of the image width, height = img.size # Initialize a grid map grid_map = [] # Define a function to identify the color def get_color_name(rgb): if rgb == (255, 0, 0): return \u0026#39;R\u0026#39; # Red elif rgb == (0, 255, 0): return \u0026#39;G\u0026#39; # Green elif rgb == (0, 0, 255): return \u0026#39;B\u0026#39; # Blue else: return \u0026#39;U\u0026#39; # Unknown # Process the image by dividing it into a 4x4 grid cell_width = width // 4 cell_height = height // 4 for y in range(0, height, cell_height): row = [] for x in range(0, width, cell_width): # Get the color of the center pixel of this cell pixel_rgb = img_rgb.getpixel((x + cell_width // 2, y + cell_height // 2)) # Append the color abbreviation to the row row.append(get_color_name(pixel_rgb)) # Append the row to the grid map grid_map.append(row) # Print the grid text map for row in grid_map: print(\u0026#39; \u0026#39;.join(row)) Now this is more reasonable. It\u0026rsquo;s getting the center pixel of each cell and using that as the color, which seems more robust. But wait, how did ChatGPT realize that it was a 4x4 grid with no prior context about it?\nI realized then that it had to be from the name of the image! It was leaking information to the model. 
When I tried to get the model to tell me how it knew, it claims that it was done \u0026ldquo;without inspecting the actual image content, just its dimensions and the instructions provided.\u0026rdquo; Very clever from the model…\nI once again prompted the model, but this time with an unassuming image title of just random.png.\nfrom PIL import Image import numpy as np # Load the image image_path = \u0026#39;/mnt/data/random.png\u0026#39; image = Image.open(image_path) # Convert the image to RGB if it is not image = image.convert(\u0026#39;RGB\u0026#39;) # Resize the image to a smaller size to analyze colors more easily # Since it\u0026#39;s a grid, we expect uniform colors in large enough areas, so resizing won\u0026#39;t affect the outcome image = image.resize((3, 3)) # Convert the image to a numpy array image_array = np.array(image) # Define a function to classify the color of a pixel def classify_color(rgb): # Set a threshold to differentiate between colors threshold = 128 # Check if the color is closer to red, green, blue or black if rgb[0] \u0026gt; threshold and rgb[1] \u0026lt; threshold and rgb[2] \u0026lt; threshold: return \u0026#39;R\u0026#39; elif rgb[0] \u0026lt; threshold and rgb[1] \u0026gt; threshold and rgb[2] \u0026lt; threshold: return \u0026#39;G\u0026#39; elif rgb[0] \u0026lt; threshold and rgb[1] \u0026lt; threshold and rgb[2] \u0026gt; threshold: return \u0026#39;B\u0026#39; else: return \u0026#39;U\u0026#39; # U for undefined or mixed color # Analyze each pixel in the 3x3 grid and classify its color grid_text_map = [[\u0026#39;\u0026#39; for _ in range(3)] for _ in range(3)] for i in range(3): for j in range(3): grid_text_map[i][j] = classify_color(image_array[i, j]) grid_text_map Aha! I just caught ChatGPT red-handed now. It\u0026rsquo;s trying to do the same trick it did at first, but assumes that the grid is a 3x3 since it doesn\u0026rsquo;t have any other cue to go off. 
Accordingly, the answer is off significantly; the dimensions don\u0026rsquo;t even match!\nChatGPT\u0026rsquo;s incorrect output when the image filename doesn\u0026rsquo;t leak information about grid size\nI also tried the other obvious approach, which is to rename this 4x4 grid of squares to random_6x6.png and feed that in. Doing this actually completely short-circuited ChatGPT. It tried to generate code which failed/wouldn\u0026rsquo;t run three times in a row before stopping itself and failing to generate a response. Similarly, feeding in a 10x10 grid with no resolution hint in the filename, or with the wrong one, leads it to give up, but telling it that it is a 10x10 grid lets it get the answer immediately as long as it can use code. Impressive at least!\nTrying this with Google\u0026rsquo;s Gemini through Bard yields poor results. Even when prompting with knowledge that the result is a 4x4 grid and with the image name as a cue, it fails to generate anything sensible with code. Part of the reason is that Gemini actually renames all of its images to image.jpg, so there\u0026rsquo;s no leakage there. Interestingly, it also takes a different approach, instead finding the closest color among red, green, and blue by $\\mathcal{L}_1$ distance for each pixel.\nWhile code-based generations are very promising, the LVLM fails to even get to the code-based part correctly without picking up on shortcuts in our prompt. Are there better ways to prompt it to get this information on its own? 
In any case, this is a promising shortcoming that needs to be solved in these models before we can think about more difficult tasks.\nCitation If you found this post useful, you can cite it as:\n@misc{zhu2023spatial, author = {Zhu, Tyler}, title = {Remarks on Spatial Localization in VLMs}, year = {2023}, howpublished = {\\url{https://tylerzhu.com/blog/2023/12/spatial-localization-vlms/}}, note = {Blog post} } ","permalink":"https://tylerzhu.com/blog/2023/12/remarks-on-spatial-localization-in-vlms/","summary":"\u003ch2 id=\"prelude\"\u003ePrelude\u003c/h2\u003e\n\u003cp\u003eThis all started when I oversaw this tweet from Timothee Darcet (co-first author on DINOv2)\u003c/p\u003e\n\u003cp\u003e\u003cimg alt=\"Tweet from Timothee Darcet\" loading=\"lazy\" src=\"/images/spatial-localization/Screenshot_2023-12-16_at_12.51.07_AM.png\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://x.com/TimDarcet/status/1726320282028360131?s=20\"\u003ehttps://x.com/TimDarcet/status/1726320282028360131?s=20\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eThis was in response to people overreacting to how the \u003cem\u003efinal\u003c/em\u003e problem in computer vision was for AI to tell the difference between a blueberry muffin and a chihuahua, which, to be fair, is a rather funny joke. It turns out that AI models can do this quite well though, and have been able to already even since CLIP came out! So what\u0026rsquo;s the big deal?\u003c/p\u003e","title":"Remarks on Spatial Localization in VLMs"},{"content":"Over the last few months, I’ve been fully immersed with the CS PhD application process. I’ll make a later blog post detailing the overall process, but I thought I’d write up a quick post about my recent experiences (and hopefully future!) with the interview portion of the process.\nProcess Overview The interview is just another step in the overall process of the PhD application cycle (see this post for an overview). 
Whether or not schools do interviews and when they send out interviews varies heavily between schools, departments, and even faculty. I’ve heard of many EE departments sending out requests much earlier than CS departments, and personally have received requests at varying times of the month (from early January to other days). Some schools don’t send out interviews at all, which I have been told is quite normal too. There is a great variation between applications, so I wouldn’t be worried if you haven’t or don’t get interviews! I definitely was regardless.\nIf you do get an interview however, this is a great sign! It generally means that you’ve been selected to a shortlist of candidates that faculty are interested in learning more about. For example, for LTI at CMU, Yonatan Bisk has a fantastic overview of the rounds that admissions committees go through, and interviews are among the latest rounds of the process (right before decisions). I want to go over what the interview is about and should be like (from what I’ve gathered online), and then also give some perspective from my interview experiences.\nWhat the Interview is About As I said above, the interview is generally a chance for faculty to get to know you better. Most of the time it is not a technical interview in the way of testing your knowledge (but can be, as I mention below), and rather checking if you are who you claim to be in your SoP.\nThe first main part is a background check. Expect questions about the work you mentioned in your SoP like research projects. You should be able to explain your research starting from a surface level pitch to in-depth technical explanations, and be able to defend the claims you made or the reasons you investigated your hypotheses in the first place. If I’m an interviewer, I want to see that you understand the research you were doing, and also see how much of it you actually did (it’s fine if you didn’t do everything! as long as you’re clear about it). 
This also gives a good sense of how research-ready you would be in a PhD. Some people suggested making slides, but I haven’t been told to by faculty yet, so I erred on the side of having them but not needing to use them.\n\u0026ldquo;Your interviewer should not know more about your application (excluding recommendations and other confidential info) than you do! That would certainly not look good.\u0026rdquo;\nYou’ll also probably get general background questions, from “Tell me about yourself” to “Why are you pursuing a PhD?”. These help paint a fuller picture of who you are and what your goals are for pursuing the PhD. It is a 5-6 year commitment after all, so I sure hope you know why you want to! Alvin has a great post for deciding if a PhD is even right for you here.\nThe second main part is understanding research alignment. You want to be able to demonstrate that you understand what research ideas you’re interested in pursuing during your PhD, and that there’s some overlap of interest with your potential PIs/advisor. This is ok to be less fully fleshed out, but demonstrating that you understand how to synthesize possible directions is a critical part to get right. Some advice online has been to look at previous papers of your potential advisors, but this hasn’t been useful for me personally.\nIt’s fine if your topic is broad or if you’re still settling on something specific from what I have understood. After all, most PhDs or even professors don’t even know what specifically they want to work in the next 5-6 years (or they don’t stick to what they say). Your interviewers should understand this point, so I wouldn’t stress too much about the exact topic. Being able to talk about it with some concrete evidence/relevant papers is most important.\nThe last part generally is your chance to ask questions to your interviewer. This is just as important to demonstrate interest for the school, as well as to gain a better understanding of the lab and the group as a whole. 
Ask questions about how the lab is structured, what projects are of current interest, freedom, expectations, advising style, etc. Hopefully there are things you want to know before committing to schools, so this is a good chance to learn. However, if you are later admitted, you will still get a chance to talk to current grad students who may be better to ask, so don’t go overboard here!\nSome concrete tips I’ve gotten that helped me and may help you:\nUnderstand what your 5 second, 30 second, and 2 minute explanations of projects are. You should always be prepared to explain your research at these three levels of detail, from a one sentence overview to a detailed paragraph-length explanation. This helps immensely with giving a high level overview before diving into technical details. Know your SoP inside out. Someone else remarked that your interviewer should not know more about your application (excluding recommendations and other confidential info) than you do! That would certainly not look good. Practice talking about what you want to work on to people to flesh out your ideas better. It was hard for me to give feedback to myself on my ideas, so presenting to research mentors and friends helped a lot. My Interview Experiences For myself, I received a few interviews from different schools, and multiple from the same school! Some of them I received pretty early on in January, others towards the end. Even at the same school there were large variations in the kinds of interviews I had.\nOverall, I would say the standard interview experience is 30-45 minutes following the structure above. I was able to pick which of my projects I wanted to discuss in detail, and often spent a good chunk of the time discussing some of the technical aspects as well as the surprises or setbacks I came across in my projects. I’m glad I chose projects where I had a heavy hand in the research, which helped me communicate my understanding and role. 
I was surprised by some other left field questions about my background, but they were pleasant and helped me talk more about other aspects of myself that weren’t directly related to research but still were a big part of my identity and future focus. I also had a technical question in one of my interviews which was surprising, but nothing that required extra prep or that was too difficult.\nDiscussing my research interests went less well as I had a shaky description of the topics I was interested in, but throughout my interviews I was able to refine it further. It’s why I recommend discussing it with friends as a lot of ideas make sense in our heads but become difficult to express to others! However, I still had a great time in my interviews and had fun talking about research.\nI’m writing this post kind of early, so I have yet to see what the feedback from schools is, but when I do I’ll update this post or make a followup post later. If you’re going through interviews soon, congratulations and best of luck! It’s overall a fun experience to talk about research in depth with potential faculty advisors, so try your best to enjoy it.\n","permalink":"https://tylerzhu.com/blog/2023/01/my-phd-interview-experience/","summary":"\u003cp\u003eOver the last few months, I’ve been fully immersed with the CS PhD application process. I’ll make a later blog post detailing the overall process, but I thought I’d write up a quick post about my recent experiences (and hopefully future!) with the interview portion of the process.\u003c/p\u003e","title":"My PhD Interview Experience"},{"content":"Seeing as its the new year, I took some time to think about my 2023 resolutions like most people are.\nIt’s always funny that we dedicate the beginning of every new year for this, because it’s not like I shoved away the idea of resolutions into a locked chest, not to be disturbed until we blow our kazoos and set off our fireworks to celebrate the new year. 
But having a recognized time for us to stop and slow down to ponder on what has happened is nice. I get so caught up with the day-to-day that it’s a nice chance to recollect my disparate thoughts into one place. And what better place to start than with last year’s attempted resolutions.\nMy Attempted 2022 Resolutions My 2022 Resolutions, briefly jotted down and subsequently forgotten about My resolutions last year were simple. Health, work, and my social life were my most important goals at the end of my senior year. I had always wanted to focus on them, but with school in the way, it was always difficult to make them a priority, much less to improve upon how I already approached them.\nThe first was to be healthier. I wanted to both lose weight and put on muscle, ideally developing a consistent exercise routine. I went on a crazy cut before in my freshman year which was entirely successful with intermittent fasting, but it’s harder now that I have to buy my own groceries and cook. I needed to make it effortful. Plus, I wanted to pick up running after spending the first half of 2021 doing that pretty consistently.\nThe next one was to have a stronger work ethic. I’ve never been good at this, and I’ve wanted (for a while now) to have an approach that was better than “work really hard on what most drives you right now.” That way I’m not dependent on my day-to-day motivation, so I can be consistent even on hard days, and so that I can work reliably.\nFinally, the last two are related in improving my social life. I wanted to focus more on friends that would soon be moving across the country and keeping in touch with them, and also going out of my way to find new people and communities to be involved in (a very common post-grad hobby). 
In the same vein, there have been hobbies I want to focus on outside of school which I’ve put off, so I wanted to continue reading and playing guitar (and getting better at it!).\nWhen I made them, I tried to come up with quantifiable metrics to measure my success for each goal. As it would turn out, it was difficult to come up with good ones, and I only ended up hitting the running target of 100 miles (not thanks to anything I developed myself; more on this later). But even though I made these resolutions, I honestly forgot I even made them until I looked back at them this year for comparison.\nWhat I\u0026rsquo;ve Learned Over the last year, I finally picked up running as a hobby with Amanda, developed a personal routine for work which is especially needed in a self-led area like research, and went through an extremely stressful period of PhD apps. Through these experiences I\u0026rsquo;ve learned a few things.\nConsistency is King I\u0026rsquo;ve always had the view that showing up with 110% was more important, i.e. giving everything your all, and that it could make up for a lack of presence. But I had it backwards all along. Showing up, as many say, is half the battle. It matters more that I’m at least going out on a run even if it’s only 2 miles or a 12 minute pace than only doing 1.5 mile sprints every week or so.\nDoing something consistently, even if you work from 40% to 80% effort, is often more effective than spurts of working at 100% (plus it feels better too). It helps in maintaining realistic schedules which can accommodate other habits. Compounding is also a powerful concept; getting 1% better every day for a year leads to being about 37 times better at the end, since $1.01^{365} \\approx 37.8$ (cc Atomic Habits for that concept), and it’s a lot easier than doing it over one day. 
Our bodies also naturally need time to rest and rebuild, so from a physical standpoint, compounding is almost necessary.\nAccountability is Powerful Speaking of showing up, having other people who keep you accountable for delivering on your tasks works wonders for picking up new (or old) habits. I got into running because Amanda and I went through an explicit 10 week program (a couch to 5K system) which was helpful in preventing my lazy self from getting the best of me. The usual friction I’d have if I was doing this myself was overcome by the commitment I’d made to going through this together with someone else.\nThe same accountability also helped me get through my apps cycle and writing my daunting-seeming essays and papers. Likewise, starting this blog post and reviving my blog probably wouldn’t have happened without Alvin keeping me accountable, so thanks for that (and we’ll see if it sticks this time!). Sometimes all it takes is a small push to get the ball rolling.\nMy 2023 Resolutions Once again, I set up quantifiable metrics for my resolutions, and split them into three main categories: Health, Personal, and Work.\nMy 2023 Resolutions Intentions I wanted to start my resolutions with a set of intentions that would guide me in 2023. As James Clear says in Atomic Habits, we should think about who we want to be, and use that to guide the things we’ll do now. For example, I’m a serial procrastinator and late to most arrangements, and although I could chalk it up to simply “who I am”, I want to change it as a step toward becoming a more punctual and organized person.\nThese intentions are a way of doing that, and summarize how I want to become a better person. Most of these revolve around setting up systems for myself to become a more consistent, habitual person. 
It leads to healthier relationships with friends, work, and life for me, and I’m excited to see how it plays out this year.\nHealth These are pretty similar to last year’s, but I’ve increased my goal of running now that I’ve gotten more serious about it (and running a 5K and a 10K at the end of the year!). It’s nothing crazy, since 400 miles works out to about 8 miles a week, which is pretty digestible over 2-3 runs. The rest of the days should be filled in with some muscle-building workouts, while also improving overall health with better diets. (Un)fortunately, that means boba is out again.\nPersonal One thing I’ve always flirted with is doing weekly reflections so it doesn’t take me a year to figure out what things I should do differently. It always feels like it’s too much effort though, so maybe I should find easier ways to do it. I currently use Bear (a slightly fancier Apple Notes) to jot down general notes that are persistent (about life and for research logs), which has been helpful.\nAside from that, I want to consistently write again. I’ve always been a fan of blogs (like Evan’s) and enjoyed technical writing (which may be obvious from my homepage…), but have never really gotten into the groove of it. Hopefully I’ll publish a mix of longer technical writings and shorter musings of life and ideas related to research so that I’m consistent.\nFinally, I decided to drop reading as a hobby since I don’t think it’s a priority for me (not everyone can be Lex Fridman), but cooking and guitar are two things I want to actively be better about, as well as spending less time on games (which are fun social activities but not a skill I want to actively improve on).\nWork Last but not least, I want to improve my overall approach to work by consistently reading state of the art papers to gather knowledge on the frontier of AI work while also refreshing my fundamentals. 
This is a mix of reading consistently in a new Zotero + Obsidian setup I worked on over break, while also implementing new papers to get practice with coding new results. This hopefully will keep me exploring and learning new things! I thought about adding in some textbooks to learn from, but I’m sure my coursework is enough for that. Another fun idea I toyed with is setting up a weekly casual reading group with some friends to discuss new papers, which could be fun.\nThat’s it from me for my resolutions. Now that I\u0026rsquo;ve posted them online for accountability, let\u0026rsquo;s see how many I’ll actually accomplish in 2023. :)\n","permalink":"https://tylerzhu.com/blog/2023/01/new-year-resolutions-for-2023/","summary":"\u003cp\u003eSeeing as its the new year, I took some time to think about my 2023 resolutions like most people are.\u003c/p\u003e","title":"New Year Resolutions for 2023"}]