Exploring Diffusion Transformer Designs via Grafting

Keshigeyan Chandrasegaran ∗‡1,2 Michael Poli ∗1,2 Daniel Y. Fu 3,4 Dongjun Kim 1
Lea M. Hadzic 1  Manling Li 1,5  Agrim Gupta 6  Stefano Massaroli 2
Azalia Mirhoseini 1  Juan Carlos Niebles 1,7†  Stefano Ermon 1†  Li Fei-Fei 1†
1 Stanford University  2 Liquid AI 3 Together AI 4 UC San Diego
5 Northwestern University 6 Google DeepMind 7 Salesforce Research
grafting.stanford.edu
Abstract

Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38–2.64 vs. 2.27 for DiT-XL/2) using <2% of pretraining compute. We then graft a text-to-image model (PixArt-Σ), achieving a 1.43× speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2× and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: grafting.stanford.edu.

∗ Equal contribution.  † Equal senior authorship.  ‡ Part of this work was done at Liquid AI.  Correspondence to {keshik,poli}@stanford.edu

1 Introduction

Model architecture design plays a central role in machine learning, alongside data, algorithms, compute, and benchmarks. It defines a learnable function and entails key decisions, including the choice of operators (e.g., attention, convolution) and configurations (e.g., model depth, width). Despite this, insight into architectures—what works and what doesn’t—is difficult to obtain due to the prohibitive costs of training models from scratch, especially in today’s foundation model era. As a result, studying new architectures remains a challenge, particularly for generative models. Much like how new software is built on existing code rather than written from scratch, can pretrained models serve as scaffolds for studying new architectures? In this work, we investigate architectural editing of pretrained models to study new architecture designs. We focus on diffusion transformers (DiTs), a class of generative transformers widely used for image and video generation  peebles2023scalable ; videoworldsimulators2024 ; gupta2023photorealistic .

A pretrained model implements a computational graph to perform tasks such as image or video generation. Given a new architectural idea and a pretrained model, we investigate whether the idea can be materialized by modifying its computational graph under small compute budgets. For example, one might hypothesize that a convolutional design could replace Multi-Head Attention (MHA) or Multi-Layer Perceptron (MLP) in a DiT. A simple way to materialize this idea is to replace MHA or MLP operators with a convolutional operator, while preserving model quality. This raises two key questions: (Q1) operator initialization: How to initialize a new operator before integrating it into the computational graph? (Q2) error accumulation: How to mitigate error propagation as multiple operators are integrated into the computational graph?

To address these questions, we present grafting, a simple two-stage approach to architecture editing (Fig. 1). (The term draws inspiration from horticultural grafting, where efficient components (scions) are integrated into established systems (rootstock) to enhance functionality, such as yield and disease resistance eliezer_grafting .) Grafting proceeds as follows: (i) activation distillation: this stage transfers the functionality of the original operator to the new one by distilling its activations using a regression objective. (ii) lightweight finetuning: this stage mitigates error propagation caused by integrating multiple new operators by finetuning with limited data. Architectural editing spans multiple strategies—adding, removing, and replacing zhanglolcats ; wang2024the ; bick2024transformers operators. We focus on operator replacement as the core strategy: swapping one operator for another. Other strategies can be viewed as special cases of replacement.

Figure 1: Grafting overview. (a, b) Model architecture design via grafting. Studying new model architecture designs requires costly pretraining. Grafting materializes new architectures by editing pretrained models under small compute budgets (Sec. 3). (c) Class-conditional image generation. Samples generated by hybrid architectures obtained via grafting (Sec. 4). (d) High-resolution text-to-image generation. 2048×2048 samples generated using a grafted model (Sec. 5). (e) Depth → width case study. Samples generated using a model restructured via grafting (depth: 28 → 14) (Sec. 6).

The space of architecture editing is vast, raising a practical question: what types of replacements should we study? To ground this investigation, we first establish a self-grafting baseline that replaces existing operators with randomly initialized weights. We show that our two-stage grafting procedure recovers near-baseline model quality, validating the approach. Building on this, we focus on replacing existing operators with efficient alternatives, aiming to reduce model FLOPs while preserving quality. In some cases, we explore replacements that increase model FLOPs to examine broader design choices. To study this systematically, we construct a testbed based on DiT-XL/2. We define a set of architectural edits that enable us to evaluate how different grafting schemes affect model quality. We highlight our design axes: (1) which operator to replace (e.g., MHA, MLP); (2) what to replace it with (e.g., convolutions); (3) how to select layers to edit (e.g., all layers); (4) replacement ratio (full vs. partial). The motivation for replacing MHA and MLP operators stems from both empirical evidence and architectural considerations: for MHA, our attention-locality analysis motivates the use of local operators, while for MLP, we adopt prior architecture ideas fu2023monarch ; komatsuzakisparse ; kaplan2020scaling .

We validate our grafting approach in increasingly challenging generative modeling setups:

Result I: Grafting yields hybrid architecture designs with good quality for class-conditional image generation (Sec. 4.2). We validate grafting using our testbed. For MHA (softmax attention), we explore several alternatives: local gated convolution (Hyena-SE, and our proposed Hyena-X/Hyena-Y), local attention (sliding window), and linear attention (Mamba-2). For MLPs, alternatives include MLPs with variable expansion ratio (ratios = 3, 6) and a convolutional variant (Hyena-X). Interestingly, several interleaved hybrid architecture designs achieve FID scores between 2.38 and 2.64 (DiT-XL/2 256×256 baseline: 2.27), showing that grafting can construct good-quality hybrids (Tab. 4). Strictly speaking, variable expansion ratio MLPs constitute a heterogeneous design rather than a hybrid (i.e., they do not introduce a new operator class); we use ‘hybrid’ throughout the paper for simplicity. Grafting is simple and lightweight: each experiment completes in under 24 hours on 8×H100 GPUs, using less than 2% of pretraining compute.

Result II: We construct efficient hybrid architectures for high-resolution text-to-image (T2I) generation via grafting (Sec. 5). We validate grafting in a challenging, real-world setting: 2048×2048 resolution T2I generation using PixArt-Σ (DiT) chen2024pixartsigma . This setting reflects key challenges: it operates on long sequences (16,384 tokens), involves a multimodal setup with text conditioning, and lacks training data. We target MHA operators for grafting, as they account for over 62% of generation latency. Using 12k synthetic samples, our grafted model achieves a 1.43× speedup with a <2% drop in GenEval score (47.78 vs. 49.75), showing that grafting scales to high-resolution T2I generation.

Case Study: Converting model depth to width via grafting (Sec. 6). Motivated by our MLP grafting results, we rewire DiT-XL/2 by parallelizing every pair of transformer blocks, as modern GPUs favor parallel over sequential computation. This reduces model depth by 2× (28 → 14). The grafted model achieves FID = 2.77, outperforming other models of comparable depth. To our knowledge, this is the first attempt to convert sequential transformer blocks into parallel ones in pretrained DiTs, enabling architectures to be restructured.

2 Prerequisites

Diffusion models (DMs). DMs generate data samples by iteratively denoising random noise. This sampling process inversely mirrors the forward data corruption mechanism: $\mathbf{z}_t = \alpha_t \mathbf{z} + \sigma_t \bm{\epsilon}$, where $\mathbf{z} = E(\mathbf{x}) \sim q(\mathbf{z})$, with $E$ a pretrained encoder and $\mathbf{x}$ the data variable. The noise term $\bm{\epsilon}$ follows the prior distribution $\mathcal{N}(0, I)$. The transition kernel from time $0$ to $t$ is given by $q_t(\mathbf{z}_t|\mathbf{z}) = \mathcal{N}(\mathbf{z}_t; \alpha_t \mathbf{z}, \sigma_t^2 I)$. The choice of $\alpha_t$ and $\sigma_t$ defines the diffusion variant, such as variance-preserving (ho2020denoising) or flow matching (lipman2022flow). The training objective (ho2020denoising) is:

\[\mathcal{L}_{DM}(\bm{\phi}) = \mathbb{E}_{q(t)\,q(\mathbf{z},\mathbf{c})\,\mathcal{N}(\bm{\epsilon};0,I)}\big[\|\bm{\epsilon} - \bm{\epsilon}_{\bm{\phi}}(\mathbf{z}_t, t, \mathbf{c})\|_2^2\big], \tag{1}\]

where $q(t)$ is the time sampling distribution and $q(\mathbf{z},\mathbf{c})$ is the joint distribution of latent $\mathbf{z}$ and condition $\mathbf{c}$.
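To make the objective concrete, below is a minimal epsilon-prediction training step in PyTorch. The model signature and the discrete noise schedules (`alphas`, `sigmas` indexed by timestep) are illustrative assumptions, not the paper's exact training code.

```python
import torch

def diffusion_loss(model, z, c, alphas, sigmas):
    """Minimal sketch of Eq. (1): sample t, corrupt the latent, regress the noise.
    z: clean latents [B, ...], c: conditioning, alphas/sigmas: schedules of shape [T]."""
    B = z.shape[0]
    t = torch.randint(0, alphas.shape[0], (B,), device=z.device)      # t ~ q(t)
    shape = (B,) + (1,) * (z.dim() - 1)
    eps = torch.randn_like(z)                                          # epsilon ~ N(0, I)
    z_t = alphas[t].view(shape) * z + sigmas[t].view(shape) * eps      # forward corruption
    return ((eps - model(z_t, t, c)) ** 2).mean()                      # L2 epsilon-prediction loss
```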

Diffusion transformers (DiTs). DiTs model the diffusion process by patchifying the input—noised images or latents—into a sequence of 1D tokens with positional embeddings. These tokens are processed through transformer blocks comprising self-attention, feedforward layers, residual connections, and normalization layers. DiTs also incorporate conditioning signals, such as the noise timestep ($t$), class labels ($c$), or natural language prompts, enabling controllable generation peebles2023scalable ; chen2023pixart .
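For reference, a stripped-down DiT block is sketched below; it keeps only the two operators this paper grafts (MHA and MLP) and omits the adaLN conditioning and patchification for brevity, so it is a simplification rather than the actual DiT implementation.

```python
import torch.nn as nn

class DiTBlock(nn.Module):
    """Simplified DiT block: pre-norm self-attention and MLP with residuals.
    Timestep/class conditioning (adaLN) is omitted in this sketch."""
    def __init__(self, dim=1152, heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # grafting target 1 (MHA)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                                         # grafting target 2 (MLP)
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                      # x: [B, N, D] token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```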

Datasets and evaluation metrics. For class-conditional image generation, we use ImageNet-1K deng2009imagenet . We follow peebles2023scalable and report Inception Score (IS), FID, sFID, Precision, and Recall using 50k generated samples (250 steps DDPM, cfg=1.5). For text-to-image generation, we report GenEval score ghosh2023geneval .

3 Grafting Diffusion Transformers

3.1 Two-Stage Grafting Approach

Grafting aims to materialize new architectures by editing a pretrained model’s computational graph. Given that we focus on replacing existing operators with alternatives, this raises two questions:

(Q1) How should a new operator be initialized before being integrated into the computational graph? Stage 1: Activation distillation. We cast initialization as a regression task. Operators in a DiT block process $[B, N, D]$ inputs (batch, sequence, hidden) and output tensors of the same shape. Given a pretrained operator $f_{\phi}^{l}$ at layer $l$, we learn a new operator $g_{\theta}^{l}$ that approximates $f_{\phi}^{l}$ hinton_kd . Since DiT activations are continuous and smooth, this can be posed as a regression problem:

\[\mathcal{L}(\bm{\theta}) = \mathbb{E}_{q(t)\,q(\mathbf{z},\mathbf{c})\,q_t(\mathbf{z}_t|\mathbf{z})}\big[\mathcal{L}_{\text{reg}}\big(g_{\bm{\theta}}^{l}(\mathbf{z}_t, t, \mathbf{c}),\, f_{\bm{\phi}}^{l}(\mathbf{z}_t, t, \mathbf{c})\big)\big] \tag{2}\]

where $q(\mathbf{z},\mathbf{c})$ is the joint distribution of latent representation $\mathbf{z}$ and condition $\mathbf{c}$, $q(t)$ is the time sampling distribution, and $q_t(\mathbf{z}_t|\mathbf{z})$ is the transition kernel from time $0$ to $t$. $\mathcal{L}_{\text{reg}}$ is a regression objective such as $L_2$. In practice, a good initialization requires as few as 8k samples.
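A minimal Stage 1 sketch in PyTorch is shown below: the new operator is regressed onto the pretrained operator's outputs on cached inputs. Treating the operator as a function of its [B, N, D] input (rather than of z_t, t, c directly) and the optimizer settings are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distill_operator(new_op, old_op, activation_loader, objective="l1",
                     epochs=200, lr=1e-4):
    """Stage 1 sketch: fit g_theta^l to f_phi^l by regressing activations.
    activation_loader yields the [B, N, D] inputs seen by the operator at layer l
    (in practice, regression features are pre-extracted and stored)."""
    loss_fn = {"l1": F.l1_loss, "l2": F.mse_loss, "huber": F.huber_loss}[objective]
    old_op.eval()
    opt = torch.optim.AdamW(new_op.parameters(), lr=lr)
    for _ in range(epochs):
        for x in activation_loader:
            with torch.no_grad():
                target = old_op(x)               # teacher activation f_phi^l(x)
            loss = loss_fn(new_op(x), target)    # student activation g_theta^l(x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return new_op
```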

(Q2) How can we mitigate error propagation as multiple operators are integrated into the computational graph? Stage 2: Lightweight finetuning. As more operators are replaced, initialization errors propagate, leading to deviations from the pretrained model’s behavior. We apply end-to-end finetuning with limited data to mitigate cumulative errors from stage 1. The fine-tuning objective is given in Equation 1. In practice, we find that competitive performance can be recovered using only 10% of the training data, even when replacing all MHA or MLP layers in DiT-XL/2.


(iii) MHA initialization

Method          IS ↑     FID ↓    sFID ↓   Prec. ↑   Rec. ↑
Baseline        278.20   2.27     4.60     0.83      0.57
Random Init.    40.86    76.27    10.33    0.33      0.52
L2              269.31   2.58     5.75     0.82      0.58
Huber (δ=1.0)   271.30   2.55     5.44     0.82      0.57
L1              273.03   2.51     5.48     0.83      0.58

(iv) MLP initialization

Method          IS ↑     FID ↓    sFID ↓   Prec. ↑   Rec. ↑
Baseline        278.20   2.27     4.60     0.83      0.57
Random Init.    2.18     297.34   161.76   0.01      0.00
L2              265.34   2.33     4.38     0.81      0.59
Huber (δ=1.0)   262.93   2.38     4.49     0.81      0.59
L1              235.54   2.83     4.69     0.77      0.61

Table 1: Activation statistics and self-grafting (Stage 1) results (DiT-XL/2). (i) Activation ranges (max–min) across all 28 MHA and MLP operators, computed using 1,000 samples. Deeper layers exhibit higher variance in activation values. (ii) Activation distributions (log-scale histograms) for five selected layers (1, 8, 17, 27, 28), used in our initialization study. MLP layers show higher variance in activations than MHA, especially in deeper layers. (iii, iv) Stage 1 results for these layers using L2, Huber, and L1 regression. L1 yields the best FID for MHA (2.51), while L2 performs best for MLP (2.33), which contains 2× more parameters than MHA (10.6M vs. 5.3M). This study shows that high-quality initialization can be achieved by choosing operator-specific regression objectives.

3.2 Self-grafting Baseline

Prior to studying new architectural designs, we introduce self-grafting, a simple control setup where existing operators (e.g., MHA, MLP) are replaced with identical operators whose weights are randomly initialized. This preserves the computational graph’s structure—operator types, receptive fields, and parameter count—while altering the computation performed. Self-grafting serves three purposes: (1) to assess the grafting procedure without architectural changes, (2) to provide a baseline for comparing replacements, and (3) to study factors affecting performance, such as data scale, regression objectives, and hyperparameters.

3.3 Activation Behavior Analysis and Self-grafting Results

We begin by analyzing the activation behavior of MHA and MLP operators across all layers in DiT-XL/2. In both cases, we observe large variance in activation values, particularly in deeper layers (Tab. 1 (i, ii)). When using regression-based distillation for Stage 1, these outliers affect optimization, particularly under the commonly used $L_2$ objective, which penalizes all errors quadratically. This motivates a closer look at regression objectives. We study three regression objectives with different levels of sensitivity to outliers—$L_2$, $L_1$, and Huber 10.1214/aoms/1177703732 —using a self-grafting setup. We select five representative layers (l = 1, 8, 17, 27, 28) for both MHA and MLP, spanning a range of activation values. Each operator is trained on 8k ImageNet-1K deng2009imagenet samples for 200 epochs with batch size 64 and learning rate 1e-4. We use δ = 1.0 for the Huber objective. We then integrate the initialized operators into the pretrained DiT-XL/2 and evaluate quality without any finetuning.

High-quality initialization can be achieved by choosing operator-specific regression objectives.

Stage 1 data   Stage 2 data   IS ↑     FID ↓    sFID ↓   Prec. ↑   Rec. ↑
Baseline                      278.20   2.27     4.60     0.83      0.57
MHA (Full Self-grafting)
Random Init.                  1.66     289.23   154.00   0.00      0.00
0.63%          –              117.68   16.78    13.69    0.60      0.61
0.63%          0.63%          148.56   11.26    11.10    0.66      0.60
0.63%          5.0%           270.39   2.70     5.46     0.81      0.57
0.63%          10.0%          287.81   2.49     4.71     0.83      0.56
MLP (Full Self-grafting)
Random Init.                  1.27     314.72   204.99   0.00      0.00
0.63%          10.0%          277.72   2.54     4.52     0.83      0.57

Table 2: Full self-grafting (Stage 2) results (DiT-XL/2). We report results after replacing all 28 MHA and MLP operators using different amounts of training data. As we increase the training data from 0.63% (8k) to 10.0% (128k), FID improves consistently. Using only 10% of the training data, near-baseline performance is achieved: FID 2.49 for MHA and 2.54 for MLP.

As shown in Tab. 1 (iii, iv), the choice of regression objective affects performance. For MHA, $L_1$ achieves the best FID (2.51), followed by Huber (2.55) and $L_2$ (2.58). For MLPs, $L_2$ performs best (2.33), while $L_1$ underperforms (2.83); notably, MLPs have 2× more parameters than MHA, which may explain their robustness to outliers bartlett2020benign . This shows that high-quality initialization requires tailored, activation-aware strategies. Further, we evaluate validation loss on held-out samples: for MHA, $L_1$ achieves the lowest loss; for MLPs, $L_2$ achieves the lowest loss across all blocks (see Sec. B.3).

Full self-grafting with 10% data achieves near-baseline performance. We extend our study to replace all MHA and MLP operators in DiT-XL/2 under the self-grafting setup and evaluate the effect of data on recovery (Tab. 2). For MHA, replacing all 28 layers without adaptation results in a noticeable performance drop, but Stage 2 (lightweight finetuning) is highly effective: using just 10% of the training data (128k samples), we achieve an FID of 2.49 vs. 2.27 for the baseline. Similarly, full MLP self-grafting with 10% data yields an FID of 2.54. We use batch size 256, learning rate 1e-4, and 30k iterations. In both cases, quality is within 0.3 FID of the baseline, showing that full self-grafting is feasible under modest data and compute budgets.

3.4 Locality Analysis of Self-attention


Figure 2: Locality of self-attention in DiT-XL/2. We plot $L_k$ values for all 28 MHA operators, averaged over timesteps and samples. At k = 32, 15 out of 28 layers exhibit values exceeding 0.5, indicating that several MHA operators model local interactions.

MHA scales quadratically with sequence length, making it a computational bottleneck. A natural idea is to replace it with local operators, such as convolution or local attention. However, this will fail if the model relies on long-range dependencies: for example, replacing all MHA operators in DiT-XL/2 with sliding window attention degrades FID from 2.27 to 53.9. To guide grafting, we quantify MHA locality using a simple band-$k$ metric. Given an attention matrix $A \in \mathbb{R}^{N \times N}$ and a band size $k$, we define a bi-directional band indicator matrix $B_k \in \mathbb{R}^{N \times N}$ as:

\[(B_k)_{i,j} = \begin{cases} 1, & \text{if } |i - j| \le k \\ 0, & \text{otherwise} \end{cases}\]

Then, locality within a band of size $k$ is computed as:

\[L_k = \frac{1}{N} \sum_{i,j} A_{i,j} (B_k)_{i,j} \tag{3}\]

We compute $L_k$ for all 28 MHA operators in DiT-XL/2 using 50-step DDIM sampling (250 ImageNet samples, sequence length 256, cfg scale 1.5), averaging across timesteps and samples. As shown in Fig. 2, MHA is largely local: on average, at k = 32, 15 out of 28 layers place more than 50% of their attention mass within the band. The first few layers (l = 1, 2) display non-local attention patterns. Our analysis provides guidance for replacing MHA operators with efficient local operators.
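A direct implementation of the band-k score for a single post-softmax attention map might look as follows (a sketch; the paper additionally averages over heads, timesteps, and samples).

```python
import torch

def band_k_locality(attn, k):
    """L_k of Eq. (3): fraction of attention mass within |i - j| <= k.
    attn: post-softmax attention matrix [N, N] whose rows sum to 1."""
    N = attn.shape[-1]
    idx = torch.arange(N, device=attn.device)
    band = (idx[:, None] - idx[None, :]).abs() <= k   # band indicator B_k
    return (attn * band).sum() / N                    # value in [0, 1]
```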

4 Experiments I: Hybrid Architectures via Grafting

4.1 Testbed and Experiment setup

Building on our self-grafting results, we now ask: can we maintain model quality when existing operators are replaced with efficient alternatives? To investigate this, we study the grafting procedure along four design axes:

  1. operator type to replace – MHA or MLP
  2. replacement operator type – such as convolutions
  3. layer selection strategy – replace operators in all layers or use heuristic-based selection
  4. replacement ratio – full or partial

We construct a testbed to systematically evaluate how design decisions affect generative quality under grafting. We focus on efficient replacements that reduce FLOPs, but also include higher-FLOP variants to explore a broader range of architectural edits. We target MHA and MLP operators, which account for a significant portion of FLOPs in DiTs compared to other operators (e.g., normalization, activation, residuals). The rationale for replacing MHA or MLP operators is grounded in both empirical and architectural considerations: for MHA, our attention locality analysis motivates the use of local operators; for MLP, we leverage prior architecture ideas fu2023monarch ; kim2024solar ; qiu2024compute ; petty2024impact ; komatsuzakisparse ; kaplan2020scaling . Given a replacement operator, the decision to graft it into a model with $L$ transformer layers spans a space of $2^L$ configurations. To make this tractable, we study two layer selection strategies: full (replace all operators) and interleaved (replace operators in a repeating pattern). The latter is inspired by striped transformer designs poli2024mechanistic ; ku2025systems ; Brixi2025.02.18.638918 . Our testbed is detailed in Tab. 3.
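As a small illustration, the interleaved strategy can be written as a mask over layer indices; the specific pattern below (replace the first few layers of every group of four) is one plausible choice, not necessarily the exact schedule used in the experiments.

```python
def interleaved_layers(num_layers=28, replace_per_group=2, group_size=4):
    """Layers to graft under an interleaved pattern, e.g. 2 of every 4 -> 50% ratio."""
    return [l for l in range(num_layers) if l % group_size < replace_per_group]

print(interleaved_layers())                       # 50%: [0, 1, 4, 5, 8, 9, ...]
print(interleaved_layers(replace_per_group=3))    # 75%: 3 of every 4 layers
```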


Figure 3: Our proposed Hyena-X and Hyena-Y, efficient local gated convolution operators used as drop-in replacements for MHA.

We introduce Hyena-X and Hyena-Y—two new efficient gated convolution operators designed as drop-in replacements for MHA. While our study includes several off-the-shelf efficient alternatives, we also contribute new operator designs motivated by our MHA locality analysis. This allows us to test novel architectural ideas via grafting, broadening our study. Both Hyena-X and Hyena-Y are local gated convolutions composed of dense, short causal depth-wise 1D convolutions. Fig. 3 (left) illustrates their structure. We also adapt Hyena-X as an MLP alternative by applying it along the channel dimension. Hyena-X and Hyena-Y scale linearly with sequence length, compared to the quadratic scaling of MHA. Operator details are provided in Sec. E. We provide FLOP calculation for both operators in Sec. F.1.
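The sketch below shows one plausible realization of a Hyena-X-style operator as a drop-in MHA replacement: dense input/output projections, short causal depthwise 1D convolutions as featurizers, and elementwise gating. The exact arrangement of convolutions and gates in Hyena-X/Hyena-Y is specified in Sec. E; this code is illustrative rather than the reference implementation.

```python
import torch.nn as nn

class GatedShortConv(nn.Module):
    """Hyena-X-style local gated convolution (illustrative drop-in for MHA)."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)                    # three featurized branches
        self.conv = nn.Conv1d(3 * dim, 3 * dim, kernel_size,
                              groups=3 * dim, padding=kernel_size - 1)  # short depthwise conv
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                         # x: [B, N, D]
        N = x.shape[1]
        u = self.in_proj(x).transpose(1, 2)                       # [B, 3D, N]
        u = self.conv(u)[..., :N]                                 # trim right pad -> causal
        a, b, c = u.transpose(1, 2).chunk(3, dim=-1)              # each [B, N, D]
        return self.out_proj(a * b * c)                           # elementwise gating
```

Because the kernel is short (K=4) and depthwise, the operator's cost grows linearly with sequence length, the property exploited later in the long-sequence PixArt-Σ setting (Sec. 5).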

Operator Type: Which operator types are we replacing?
  MHA, MLP

Efficient Alternative: What do we replace it with?
  MHA
    Convolutions: Hyena-SE ku2025systems , Hyena-X / Hyena-Y (Ours)   K=4, causal
    Local Attention: Sliding Window Attention (SWA) Beltagy2020Longformer ; child2019generating   w=4, bidirectional
    Linear Attention: Mamba-2 daotransformers   d_s=64, E=2
  MLP
    Variable expansion ratio   r=3, 6
    Hyena-X (Ours)   r=2, K=4, causal, mix channels

Layer Selection: In which layers is the operator replaced?
  Full: Replace the operator in all layers
  Interleaved: Replace the operator in a repeating pattern (e.g., every 2 or 3 out of 4)

Replacement Ratio: What percentage of operators are replaced?
  50%, 75%, 100%

Table 3: Grafting testbed with configurations. This table defines the core design axes used in our study: operator type, efficient alternatives, layer selection strategy, and replacement ratio. For each alternative, we report configurations, including kernel size K, window size w, state size d_s, expand factor E, and MLP expansion ratio r. The baseline DiT-XL/2 operator uses H=16 attention heads and MLP expansion ratio r=4. Operator variants marked with (Ours) are proposed in this work.

Experiment setup. For our hybrid experiments, we mostly use the hyperparameters determined from our self-grafting studies.

Stage 1: Operator initialization. For each new operator, we perform activation distillation using 8k ImageNet-1K samples. Each operator is trained for 200 epochs with a batch size of 64 and an initial learning rate of 1e-4. We pre-extract and store all regression features. All operators can be initialized in parallel. Each operator's training completes in under 30 minutes on a single H100 GPU. Experiment details for Stage 1 are included in Sec. B.1.

Stage 2: Lightweight finetuning. For all experiments in Table 3, we use 10% of the ImageNet-1K training data and train for 50k steps. We use a batch size of 256, linearly warming up the learning rate to 1e-4 over 1,000 steps. Experiments typically complete in under 10 hours on 8×H100 GPUs. For specific ablations that increase data, such as those using 20% data or 100k steps, runtimes extend up to 24 hours (<2% of pretraining compute). We provide experiment details in Sec. B.1.
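Putting the two stages together, a grafting run might be orchestrated as below; the `blocks`/`attn` attribute names and the reuse of the `diffusion_loss` sketch from Sec. 2 are assumptions about the surrounding codebase, not the paper's actual training script.

```python
import itertools
import torch

def graft_and_finetune(dit, new_ops, layers, loader, alphas, sigmas,
                       steps=50_000, lr=1e-4, warmup=1_000):
    """Stage 2 sketch: swap in Stage-1-initialized operators, then finetune end-to-end."""
    for l in layers:                            # graft: edit the computational graph
        dit.blocks[l].attn = new_ops[l]         # attribute names are assumptions
    opt = torch.optim.AdamW(dit.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.LinearLR(opt, start_factor=1e-3, total_iters=warmup)
    data = itertools.cycle(loader)
    for _ in range(steps):
        z, c = next(data)                       # latents and class labels
        loss = diffusion_loss(dit, z, c, alphas, sigmas)   # Eq. (1) sketch from Sec. 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    return dit
```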

4.2 Results and Insights

[Figure: (a) Ablation 1: Data scaling (MHA); (b) Ablation 2: Layer selection heuristics (MHA / Hyena-X)]

Figure 4: Ablation studies. (a) Data scale: Increasing fine-tuning data from 10% to 20% improves FID. (b) Layer selection strategies: Interleaved replacement outperforms other heuristics.

MHA results. Replacing MHA operators in DiT-XL/2 via grafting yields strong quality-efficiency tradeoffs. We discuss our key insights below:

• Surprising effectiveness of operators with smaller receptive fields under interleaved grafting. Our findings highlight that at 50% interleaved replacement, several alternatives—including SWA, Hyena-X/Y, and Mamba-2—consistently achieve FID scores within 0.5 of the baseline (2.27). The minimal FID drop observed with the SWA and Hyena variants in particular, despite their limited receptive fields (K=4, w=4), aligns with our locality analysis (Sec. 3.4).

  • Replacement strategy: Interleaved vs. Full. Performance generally declines when increasing interleaved replacement from 50% to 75%. However, SWA remains effective at 75% interleaved replacement (FID=3.09). At 100% replacement, performance sharply degrades (all FIDs > 75). This trend aligns with our locality analysis, indicating that only a subset of layers are local and amenable to grafting.

• Ablations on data scale and layer selection. We study two factors under 50% MHA replacement. (i) Increasing fine-tuning data from 10% to 20% improves FID across all variants (e.g., Hyena-X: 2.74 → 2.61; SWA: 2.67 → 2.62; Mamba-2: 2.65 → 2.55) (Fig. 4 (a)). (ii) Under 50% replacement, we compare Hyena-X (interleaved) to three targeted heuristics: top-local (layers with highest band-k values), low-local (layers with lowest band-k values), and deep (last 14 layers). Interleaved yields the best FID (2.74), followed by top-local (3.02), low-local (3.18), and deep (4.00). These results confirm that interleaving is effective, and that our band-k metric identifies layers that are more amenable to grafting (Fig. 4 (b)).

MLP results. Replacing MLP operators via grafting is effective. We discuss our key insights below:

• Variable expansion ratio MLPs are effective under full replacement. MLP alternatives with expansion ratios r=3 and r=6 demonstrate good quality under all replacement ratios. Even under full (100%) replacement, both variants maintain good performance, with r=3 achieving FID = 2.66. This highlights that MLP width is a robust dimension for grafting.

• Convolutional alternatives. Hyena-X, which combines dense and local channel mixing, performs competitively at 50% replacement (FID = 2.64) but degrades at higher ratios, suggesting that such operators are only effective at moderate replacement ratios.

Method (config)          Ratio   IS ↑     FID ↓    sFID ↓   Prec. ↑  Rec. ↑   ΔFLOPs_op ↓   ΔFLOPs_ft ↓   ΔParam ↓
Baseline                 –       278.20   2.27     4.60     0.83     0.57

MHA Grafting
Random Init.             100%    1.66     289.23   154.00   0.00     0.00
Self-grafting            100%    287.81   2.49     4.71     0.83     0.56
Hyena-SE (K=4)           50%     274.73   2.73     5.05     0.82     0.56     -49.52%       +0.13%        +0.22%
                         75%     231.15   3.62     6.04     0.81     0.54     -74.27%       +0.20%        +0.33%
                         100%†   –        –        –        –        –        -99.03%       +0.26%        +0.43%
Hyena-X (K=4)            50%     273.30   2.74     5.03     0.83     0.56     -49.90%       +0.13%        +0.16%
                         75%     229.11   3.69     6.10     0.81     0.53     -74.85%       +0.20%        +0.24%
                         100%†   –        –        –        –        –        -99.81%       +0.26%        +0.33%
Hyena-Y (K=4)            50%     273.37   2.72     5.02     0.83     0.55     -49.52%       0.00%         +0.05%
                         75%     228.99   3.66     5.95     0.81     0.53     -74.27%       0.00%         +0.08%
                         100%†   –        –        –        –        –        -99.03%       0.00%         +0.11%
SWA (w=4)                50%     280.62   2.67     4.90     0.83     0.56     -48.24%       0.00%         0.00%
                         75%     249.99   3.09     5.54     0.82     0.55     -72.36%       0.00%         0.00%
                         100%†   –        –        –        –        –        -96.48%       0.00%         0.00%
Mamba-2 (d_s=64, E=2)    50%     285.08   2.65     4.84     0.83     0.55     -37.59%       +77.89%       +28.02%
                         75%     257.66   3.02     5.48     0.82     0.53     -56.38%       +116.83%      +42.04%
                         100%†   –        –        –        –        –        -75.17%       +155.77%      +56.05%

MLP Grafting
Random Init.             100%    1.27     314.72   204.99   0.00     0.00
Self-grafting            100%    277.72   2.54     4.52     0.83     0.57
Exp. ratio ↓ (r=3)       50%     272.14   2.53     4.51     0.83     0.57     -12.50%       0.00%         -12.50%
                         75%     279.72   2.61     4.61     0.83     0.56     -18.75%       0.00%         -18.75%
                         100%    252.11   2.66     4.57     0.81     0.57     -25.00%       0.00%         -25.00%
Exp. ratio ↑ (r=6)       50%     278.00   2.38     4.50     0.83     0.58     +25.00%       0.00%         +25.00%
                         75%     277.94   2.37     4.48     0.82     0.58     +37.50%       0.00%         +37.50%
                         100%    276.86   2.42     4.50     0.82     0.58     +50.00%       0.00%         +50.00%
Hyena-X (r=2, K=4)       50%     265.60   2.64     4.66     0.83     0.56     +0.01%        0.00%         +0.02%
                         75%     226.13   3.26     4.79     0.81     0.55     +0.02%        0.00%         +0.03%
                         100%†   –        –        –        –        –        +0.02%        0.00%         +0.03%

Table 4: Generation quality and efficiency of grafted hybrid architectures in DiT-XL/2. We report quality (IS, FID, sFID, Precision, Recall) and efficiency (ΔFLOPs and ΔParam) results for MHA and MLP grafting experiments. For each alternative, setups that maintain FID within 0.5 of the baseline and offer the largest FLOPs reduction (or smallest FLOPs increase) are highlighted. † denotes setups with poor generation (FID > 50). ΔFLOPs and ΔParam denote the percentage change in FLOPs and parameters, respectively (for the MHA and MLP operators). For MHA, the total cost is split into ΔFLOPs_op (softmax attention, gating) and ΔFLOPs_ft (QKV/output projections, featurizers). We do not use this decomposition for MLP variants. Mamba-2 incurs higher ΔFLOPs_ft due to additional projections. Exact FLOP calculations are provided in Sec. F.1. Key result: many interleaved designs achieve good-quality generation (FID within 0.5 of baseline).
Takeaway 1: Grafting is effective for constructing efficient hybrid architectures with good generative quality under small compute budgets. Interleaved designs are particularly effective.

5 Experiments II: Grafting Text-to-Image Diffusion Transformers

We apply grafting to a more challenging setting: high-resolution text-to-image generation with PixArt-Σ chen2024pixartsigma . This presents three challenges: (1) long sequences (16,384 tokens at 2048×2048 resolution), (2) a multimodal setup with text conditioning, and (3) a lack of publicly available training data. These factors make PixArt-Σ a representative setting for evaluating grafting under real-world constraints. PixArt-Σ contains 28 transformer layers, similar to DiT-XL/2.

Experiment setup. We replace MHA operators in PixArt-Σ with Hyena-X via grafting, as MHA accounts for over 62% of generation latency. Hyena-X was chosen based on its good quality-efficiency tradeoff in the ImageNet-1K setup, achieving FID 2.61 with 20% data (see Fig. 4 (a)). Interleaved grafting is applied to layers 8, 10, 12, 14, 16, 18, and 20–27; empirically, we found that layers 20–27 can be replaced without a significant quality drop. For grafting, we created a small, uncurated synthetic dataset of 12k image-text pairs. The text prompts for this dataset were sampled from the 30k publicly released evaluation set. Stage 1 (activation distillation): 8k uncurated synthetic image-text pairs are used to initialize the Hyena-X blocks. We use the L1 regression objective, as we observe similar MHA activation behavior in PixArt-Σ (Fig. D.1). Stage 2 (finetuning): We use LoRA (rank=64) hulora for finetuning. LoRA enables efficient finetuning by managing the high memory demands associated with long sequences (16,384 tokens). The full 12k synthetic dataset is used in this stage. We use a 20-step DPM-Solver lu2022dpm for generation. Experiment details are provided in Sec. D.2.
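Since Stage 2 here uses LoRA rather than full finetuning, a self-contained low-rank adapter for a linear projection is sketched below; the rank/scaling values are illustrative, and this is not the exact adapter configuration or library used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W x + scale * (B A) x."""
    def __init__(self, base: nn.Linear, rank=64, alpha=64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # freeze pretrained weights
        self.A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```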

Results. The grafted model achieves a 1.43× speedup in wall-clock time, with a small drop in GenEval score (47.78 vs. 49.75). Attribute-specific metrics remain comparable, and qualitative samples show good alignment and quality. Some localized artifacts are observed in textured regions, likely due to LoRA's limited adaptation capacity and the low-quality synthetic data (see failure cases in Fig. D.3, D.4).

Takeaway 2: We graft high-resolution text-to-image DiTs, constructing hybrid architectures with meaningful speedups and minimal quality drop.

Model      Ratio   Obj(1)   Obj(2)   Count   Colors   Pos     Color Attr.   Overall ↑   Latency (ms) ↓
Baseline   –       81.45    61.62    46.25   77.13    10.75   21.50         49.75       235.46
Hyena-X    29%     80.31    59.34    49.69   68.62    11.50   18.75         48.04       194.95 (1.21×)
Hyena-X    50%     80.00    57.07    48.13   70.74    11.25   19.50         47.78       164.58 (1.43×)

Table 5: GenEval results and latency for PixArt-ΣΣ\Sigmaroman_Σ and the grafted variants. The 50% grafted model achieves a 1.43× speedup while retaining strong text-image alignment (GenEval overall score: 47.78 vs. 49.75). Attribute-specific scores remain comparable across models. Latency is measured for a single forward pass on an Nvidia H100 (batch size=2).

6 Case Study: Converting Model Depth to Width via Grafting


Figure 5: Convert model depth \rightarrow width via grafting: (a) Two sequential transformer layers. (b) Rewiring in parallel via grafting (includes skip connections).

Can we rewire two sequential transformer blocks to run in parallel? Our MLP grafting results showed that MLPs are amenable to grafting, even at 100% replacement with an expansion ratio of r=6, demonstrating that wider computation within an operator is feasible. This success, combined with the fact that modern GPUs favor parallel over sequential computation, motivates a broader question: can we convert deeper, sequential DiT computations into wider, parallel ones via grafting while maintaining quality? To explore this, we rewire DiT-XL/2 by parallelizing every pair of sequential transformer blocks—each pair receives the same input, and their outputs are merged via a linear projection. This reduces model depth by 2× (28 → 14) with a 6% increase in parameters.
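A sketch of this rewiring is shown below: both pretrained blocks in a pair see the same input, and their outputs are merged by a linear projection. Whether the merge concatenates or sums the two outputs, and how the skip connections in Fig. 5 are wired, are simplifications here; see Sec. C for the exact setup.

```python
import torch
import torch.nn as nn

class ParallelPair(nn.Module):
    """Run two pretrained transformer blocks in parallel and merge their outputs."""
    def __init__(self, block_a, block_b, dim=1152):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b   # initialized from pretrained weights
        self.merge = nn.Linear(2 * dim, dim)            # learned merge of the two branches

    def forward(self, x, cond):
        ya = self.block_a(x, cond)                      # both branches receive the same input
        yb = self.block_b(x, cond)
        return self.merge(torch.cat([ya, yb], dim=-1))  # width replaces depth

# Illustrative rewiring of DiT-XL/2: 28 sequential blocks -> 14 parallel pairs.
# pairs = [ParallelPair(dit.blocks[2 * i], dit.blocks[2 * i + 1]) for i in range(14)]
```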

Experiment Setup. The rewiring schematic is shown in Fig. 5. Stage 1: Activation distillation. Each parallel pair was initialized via activation distillation using L1 regression. The weights for each block in the parallel pair were initialized from their corresponding pre-trained weights, rather than random initialization. Similar to our previous experiments, 8k ImageNet-1K samples were used for this stage. Stage 2: Lightweight finetuning. Given the architectural restructuring, finetuning was performed using 25% of the training data. The learning rate was linearly warmed up to 1e-4 and halved at 75k and 150k iterations. Experiment details can be found in Sec. C.

Results. The goal of this study is to evaluate generative quality (FID) vs. model depth. We report results in Tab. 6. To contextualize our findings, we compare against two categories: (i) DiTs trained from scratch at lower depth, and (ii) pruning methods kim2024bk ; fang2024tinyfusion . Our 14-layer grafted model achieves an FID of 2.77—surpassing DiT variants trained from scratch with similar or increased depth, including DiT-L/2 (depth 24, FID 3.73) and U-ViT-L (depth 21, FID 3.44). It also outperforms pruning baselines such as TinyDiT-D14 with masked knowledge distillation (depth 14, FID 2.86) and BK-SDM (depth 14, FID 7.43), though these baselines have fewer parameters (340M) compared to the grafted variants (712M). To our knowledge, this is the first attempt to restructure DiTs by converting sequential computation into parallel at the transformer block level.

Takeaway 3: Grafting enables architectural restructuring at the transformer block level, allowing depth to be traded for width.

Method                                  Depth   A.R.    Iters     IS ↑     FID ↓    sFID ↓   Prec. ↑  Recall ↑  Speedup ↑  Params ↓
DiT-L/2 peebles2023scalable             24      42.7    1,000K    196.26   3.73     4.62     0.82     0.54      –          458M
U-ViT-L bao2023all                      21      48.8    300K      221.29   3.44     6.58     0.83     0.52      –          287M
DiT-B/2 peebles2023scalable             12      64.0    1,000K    119.63   10.12    5.39     0.73     0.55      –          130M
BK-SDM kim2024bk                        14      82.3    100K      141.18   7.43     6.09     0.75     0.55      2×         340M
TinyDiT-D14 fang2024tinyfusion          14      82.3    500K      198.85   3.92     5.69     0.78     0.58      2×         340M
TinyDiT-D14 w/ MKD fang2024tinyfusion   14      82.3    500K      234.50   2.86     4.75     0.82     0.55      2×         340M
DiT-XL/2 peebles2023scalable            28      41.4    7,000K    278.20   2.27     4.60     0.83     0.57      1×         675M
Grafting (Ours)                         14      164.6   100K      231.91   3.12     4.71     0.82     0.55      2׶        712M
Grafting (Ours)                         14      164.6   230K      251.77   2.77     4.87     0.82     0.56      2׶        712M

Table 6: Generative quality vs. model depth. Generative quality metrics are reported: FID, IS, sFID, Precision, and Recall. A.R. (Aspect Ratio) is defined as model width divided by depth (e.g., 1152/14 = 82.3). Parameters (Params) are reported in millions. For pruning and grafting setups, we report speedup with respect to DiT-XL/2 (depth=28). Off-the-shelf DiT-L/2, U-ViT-L, and DiT-B/2 scores, along with pruning baselines (BK-SDM, TinyDiT-D14, and TinyDiT-D14 w/ MKD), are sourced from  fang2024tinyfusion . MKD refers to Masked Knowledge Distillation, a recovery method used in TinyDiT fang2024tinyfusion . Our grafted models achieve better generative quality at depth=14, surpassing baselines in FID, IS, Precision, and Recall. Speedup is measured for a single forward pass on an Nvidia H100 (batch size=2). More implementation details are provided in Sec. C.

7 Related Work

Diffusion model architectures. Recently, many architectural innovations have been proposed for diffusion models for image and video generation xie2024sana ; yan2024diffusion ; fei2024scalable ; hu2024zigma ; teng2024dim ; zhu2024dig ; Seawead2025Seaweed7BCT ; survey_vdm ; gao2024matten ; wang2024lingen . Many recent works focus on improving the attention mechanism in diffusion models to enhance efficiency and scalability. One major direction is the use of modern linear attention variants, such as DiffuSSM yan2024diffusion , DiS fei2024scalable , Zigma hu2024zigma , DiM teng2024dim , and DIG zhu2024dig . Recently, text-to-image diffusion models such as SANA xie2024sana have also adapted linear attention variants to support high-resolution generation. Another recent direction explores the mixture-of-experts (MoE) idea. DiT-MoE fei2024scaling introduces sparse diffusion transformers with shared expert routing and expert-level balance loss, enabling efficient scaling to 16.5B parameters while achieving competitive performance. We note that methods like STAR thomas2024star have also successfully discovered architectures via evolutionary methods for autoregressive language modeling. While effective, these approaches require training from scratch, making such studies expensive and inaccessible to practitioners. In contrast, grafting focuses on architecture editing of pretrained models to materialize new architectures under small compute budgets.

Architectural editing of pretrained models. Another line of work focuses on linearizing large language models by replacing softmax attention with efficient operators, such as linear attention zhanglolcats ; wang2024the ; bick2024transformers . Similar ideas have also been adopted for diffusion models in  liu2024clear ; liu2024linfusion ; becker2025edit , though these works focus only on ultra-high-resolution settings. These prior efforts typically focus on replacing a single operator type (primarily attention) or are limited to specific application domains. Grafting presents a more general and comprehensive approach for architectural editing. It extends beyond single-operator replacement to enable modifying multiple operator types, exploring diverse architectural alternatives (e.g., both MHA and MLP replacements), and restructuring entire architectures (e.g., converting depth to width). Recently, FFN Fusion bercovich2025ffn explored parallelizing transformer blocks in LLMs, aiming to reduce sequential computation.

8 Conclusion and Discussion

In this work, we introduced grafting, a simple approach to architecture editing. We constructed hybrid models by replacing self-attention and MLPs with efficient alternatives, achieving competitive quality (FID 2.38–2.64 vs. 2.27 baseline). We then applied grafting to a high-resolution text-to-image model (PixArt-Σ), yielding a 1.43× speedup with less than a 2% drop in GenEval score. Finally, we used grafting to restructure DiT-XL/2, converting every pair of sequential transformer blocks into parallel blocks, halving model depth, and yielding the best quality (FID 2.77) among the 14-layer DiTs we compare against. These results demonstrate grafting's utility in both short- and long-context settings (ImageNet-1K and PixArt-Σ, respectively), and for architecture restructuring. Overall, grafting is a lightweight approach for exploring diffusion transformer designs under small compute budgets.

Limitations. This work primarily focuses on architectural editing of pretrained Diffusion Transformers (DiTs), specifically targeting Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. Other architectural components, such as normalization layers and activation functions, will be explored in future work. We note that these are latent diffusion models, and grafting components in their corresponding VAEs remains an area for future study. Our experiments primarily focus on DiTs, and generalizing grafting to other model families, such as autoregressive models, is a direction for future research. The PixArt-ΣΣ\Sigmaroman_Σ setup used synthetic data for grafting, which may propagate artifacts and biases into the grafted models. While this work focuses on architectural editing, it remains an open question whether architectures that perform well under grafting also perform well when trained from scratch. Finally, grafting requires access to a pretrained model.

Applications and future work. Grafting holds promise for diverse applications where efficiency is important. This includes adapting models from low-resolution to high-resolution generation, extending capabilities from short-form video understanding/generation to long-form chandrasegaran2024hourvideo ; chen2025eagle , or improving user experience in interactive applications like image editing where even modest speedups (e.g., 10%) are highly valued. We hope that our testbed, insights, and results will encourage the community to actively explore new architecture designs.

Acknowledgments

We thank Liquid AI for sponsoring compute for this project. We also thank Armin W. Thomas, Garyk Brixi, Kyle Sargent, Karthik Dharmarajan, Stephen Tian, Cristobal Eyzaguirre, and Aryaman Arora for their feedback on the manuscript.

References

  • [1] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  • [2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024.
  • [3] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023.
  • [4] Eliezer Goldschmidt. Plant grafting: new mechanisms, evolutionary implications. Frontiers in Plant Science, 5:727, 12 2014.
  • [5] Michael Zhang, Simran Arora, Rahul Chalamala, Benjamin Frederick Spector, Alan Wu, Krithik Ramesh, Aaryan Singhal, and Christopher Re. Lolcats: On low-rank linearizing of large language models. In The Thirteenth International Conference on Learning Representations, 2025.
  • [6] Junxiong Wang, Daniele Paliotta, Avner May, Alexander M Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  • [7] Aviv Bick, Kevin Li, Eric Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models. Advances in Neural Information Processing Systems, 37:31788–31812, 2024.
  • [8] Dan Fu, Simran Arora, Jessica Grogan, Isys Johnson, Evan Sabri Eyuboglu, Armin Thomas, Benjamin Spector, Michael Poli, Atri Rudra, and Christopher Ré. Monarch mixer: A simple sub-quadratic gemm-based architecture. Advances in Neural Information Processing Systems, 36:77546–77603, 2023.
  • [9] Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. In The Eleventh International Conference on Learning Representations, 2023.
  • [10] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • [11] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation, 2024.
  • [12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [13] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  • [14] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2023.
  • [15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009.
  • [16] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.
  • [17] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning and Representation Learning Workshop, 2015.
  • [18] Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73 – 101, 1964.
  • [19] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
  • [20] Sanghoon Kim, Dahyun Kim, Chanjun Park, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 23–35, 2024.
  • [21] Shikai Qiu, Andres Potapczynski, Marc Anton Finzi, Micah Goldblum, and Andrew Gordon Wilson. Compute better spent: Replacing dense layers with structured matrices. In International Conference on Machine Learning, pages 41698–41716. PMLR, 2024.
  • [22] Jackson Petty, Sjoerd Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette, and Tal Linzen. The impact of depth on compositional generalization in transformer language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7232–7245, 2024.
  • [23] Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, et al. Mechanistic design and scaling of hybrid architectures. arXiv preprint arXiv:2403.17844, 2024.
  • [24] Jerome Ku, Eric Nguyen, David W Romero, Garyk Brixi, Brandon Yang, Anton Vorontsov, Ali Taghibakhshi, Amy X Lu, Dave P Burke, Greg Brockman, et al. Systems and algorithms for convolutional multi-hybrid language models at scale. arXiv preprint arXiv:2503.01868, 2025.
  • [25] Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, Mohammad R.K. Mofrad, Madelena Y. Ng, Jaspreet Pannu, Christopher Ré, Jonathan C. Schmok, John St. John, Jeremy Sullivan, Kevin Zhu, Greg Zynda, Daniel Balsam, Patrick Collison, Anthony B. Costa, Tina Hernandez-Boussard, Eric Ho, Ming-Yu Liu, Thomas McGrath, Kimberly Powell, Dave P. Burke, Hani Goodarzi, Patrick D. Hsu, and Brian L. Hie. Genome modeling and design across all domains of life with evo 2. bioRxiv, 2025.
  • [26] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.
  • [27] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  • [28] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning, 2024.
  • [29] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • [30] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • [31] Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. Bk-sdm: A lightweight, fast, and cheap version of stable diffusion. In European Conference on Computer Vision, pages 381–399. Springer, 2024.
  • [32] Gongfan Fang, Kunjun Li, Xinyin Ma, and Xinchao Wang. Tinyfusion: Diffusion transformers learned shallow. arXiv preprint arXiv:2412.01199, 2024.
  • [33] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023.
  • [34] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024.
  • [35] Jing Nathan Yan, Jiatao Gu, and Alexander M Rush. Diffusion models without attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8239–8249, 2024.
  • [36] Zhengcong Fei, Mingyuan Fan, Changqian Yu, and Junshi Huang. Scalable diffusion models with state space backbone. arXiv preprint arXiv:2402.05608, 2024.
  • [37] Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, and Bjorn Ommer. Zigma: Zigzag mamba diffusion model. arXiv preprint arXiv:2403.13802, 2024.
  • [38] Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Dim: Diffusion mamba for efficient high-resolution image synthesis. arXiv preprint arXiv:2405.14224, 2024.
  • [39] Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, and Xinggang Wang. Dig: Scalable and efficient diffusion models with gated linear attention. arXiv preprint arXiv:2405.18428, 2024.
  • [40] Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiao Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, and Lu Jiang. Seaweed-7b: Cost-effective training of video generation foundation model. 2025.
  • [41] Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Comput. Surv., 57(2), November 2024.
  • [42] Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, and Lin Ma. Matten: Video generation with mamba-attention. arXiv preprint arXiv:2405.03025, 2024.
  • [43] Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, et al. Lingen: Towards high-resolution minute-length text-to-video generation with linear computational complexity. arXiv preprint arXiv:2412.09856, 2024.
  • [44] Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters. arXiv preprint arXiv:2407.11633, 2024.
  • [45] Armin W Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, and Michael Poli. Star: Synthesis of tailored architectures. In The Thirteenth International Conference on Learning Representations, 2025.
  • [46] Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Clear: Conv-like linearization revs pre-trained diffusion transformers up. arXiv preprint arXiv:2412.16112, 2024.
  • [47] Songhua Liu, Weihao Yu, Zhenxiong Tan, and Xinchao Wang. Linfusion: 1 gpu, 1 minute, 16k image. arXiv preprint arXiv:2409.02097, 2024.
  • [48] Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, and Sourav Bhattacharya. Edit: Efficient diffusion transformers with linear compressed attention. arXiv preprint arXiv:2503.16726, 2025.
  • [49] Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, et al. Ffn fusion: Rethinking sequential computation in large language models. arXiv preprint arXiv:2503.18908, 2025.
  • [50] Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. In Advances in Neural Information Processing Systems, volume 37, 2024.
  • [51] Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting long-context post-training for frontier vision-language models. arXiv preprint arXiv:2504.15271, 2025.
  • [52] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023.

Supplementary Material

  • Section A : Standard Deviation of Experiments

  • Section B : Hybrid Architecture Experiments: Additional details

    • Section B.1 : Experiment details and additional samples

    • Section B.2 : Modulated Regression Targets for MHA

    • Section B.3 : Validation Loss Curves for Self-grafting Experiments

  • Section C : Depth to Width Grafting Experiments: Additional details

  • Section D : Text-to-Image Generation Experiments: Additional details

    • Section D.1 : MHA activation plots

    • Section D.2 : Experiment Details

    • Section D.3 : Generated samples and failure cases

  • Section E : Hyena-X and Hyena-Y operators: Additional details

  • Section F : FLOP calculation

Appendix A Standard Deviation of Experiments

To quantify the variance of our reported results, we repeat two representative experiments, MHA (Hyena-Y) and MLP (width r = 6), with three different random seeds (0, 200, 300), following the exact grafting setup used in the main paper. We report the mean and standard deviation of IS, FID, sFID, Precision, and Recall in Tab. A.1 and observe that the standard deviations are small across all metrics.

Setup | IS | FID | sFID | Precision | Recall
MHA / Hyena-Y | 273.19 ± 0.46 | 2.73 ± 0.01 | 5.06 ± 0.04 | 0.83 ± 0.00 | 0.55 ± 0.00
MLP / higher width (r = 6) | 277.91 ± 0.95 | 2.41 ± 0.01 | 4.48 ± 0.02 | 0.82 ± 0.00 | 0.58 ± 0.00

Table A.1: Mean and standard deviation of IS, FID, sFID, Precision, and Recall computed over three runs with different random seeds (0, 200, 300).
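For reference, the sketch below shows how such per-seed aggregates can be computed. The per-seed metric values are placeholders for illustration only, not the actual measurements behind Tab. A.1.

```python
import numpy as np

# Placeholder per-seed results for one grafting setup (illustrative values only;
# the actual means and standard deviations are reported in Tab. A.1).
runs = {
    0:   {"IS": 273.6, "FID": 2.74, "sFID": 5.02, "Precision": 0.83, "Recall": 0.55},
    200: {"IS": 272.8, "FID": 2.72, "sFID": 5.10, "Precision": 0.83, "Recall": 0.55},
    300: {"IS": 273.2, "FID": 2.73, "sFID": 5.06, "Precision": 0.83, "Recall": 0.55},
}

for metric in ["IS", "FID", "sFID", "Precision", "Recall"]:
    values = np.array([run[metric] for run in runs.values()])
    # Report mean ± standard deviation over the three seeds.
    print(f"{metric}: {values.mean():.2f} ± {values.std():.2f}")
```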

Appendix B Hybrid Architecture Experiments: Additional details

B.1 Experiment details and additional samples

We provide all hyperparameters used for these experiments in Tab. B.1. To ensure a fair comparison, we use identical hyperparameters across all hybrid experiments. Additional qualitative samples generated by our grafted hybrid architectures are shown in Fig. B.1.

Stage 1: Activation Distillation
Initial Learning Rate | 1 × 10^-3
Weight Decay | 0
Epochs | 200
Batch Size | 64
Clip Norm | 10.0
Optimizer | AdamW (betas = (0.9, 0.999))
Loss Function | L1 (MHA), L2 (MLP)

Stage 2: Lightweight Finetuning
Initial Learning Rate | 1 × 10^-4
Weight Decay | 5 × 10^-5
Iterations | 50,000 (100 epochs)
Batch Size | 256
Optimizer | AdamW (betas = (0.9, 0.999))
Scheduler | Linear warmup over 1K steps, then constant learning rate
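To make the two-stage recipe in Tab. B.1 concrete, the following is a minimal PyTorch sketch of the optimizer, loss, and scheduler settings. The function names and the new_operator/model arguments are hypothetical placeholders, not the actual training code used in our experiments.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def stage1_setup(new_operator: torch.nn.Module, replacing_mha: bool):
    """Stage 1 (activation distillation): regress the new operator's outputs
    onto the pretrained operator's activations."""
    optimizer = torch.optim.AdamW(new_operator.parameters(),
                                  lr=1e-3, weight_decay=0.0, betas=(0.9, 0.999))
    # L1 regression loss when replacing MHA, L2 (MSE) when replacing MLP (Tab. B.1).
    loss_fn = torch.nn.L1Loss() if replacing_mha else torch.nn.MSELoss()
    # Gradients are clipped to a max norm of 10.0 at each step, e.g. via
    # torch.nn.utils.clip_grad_norm_(new_operator.parameters(), max_norm=10.0).
    return optimizer, loss_fn

def stage2_setup(model: torch.nn.Module, warmup_steps: int = 1000):
    """Stage 2 (lightweight finetuning): finetune the grafted model end to end."""
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=1e-4, weight_decay=5e-5, betas=(0.9, 0.999))
    # Linear warmup over 1K steps, then a constant learning rate.
    scheduler = LambdaLR(optimizer,
                         lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler
```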