The Hopper Architecture as a Power Delivery Challenge
The NVIDIA H100, the Hopper-architecture GPU that has dominated AI training infrastructure since 2023, operates at the electrical frontier where voltage droop becomes dangerous: 80 billion transistors on an 814 mm² die, fabricated on TSMC's 4N process, delivering up to 700 watts in the SXM5 configuration at sub-volt supply levels. Every characteristic that makes the H100 a powerful AI accelerator also makes it a challenging power delivery problem.
Using published architectural specifications and TSMC process-node data, we built a full-stack PDN model of the H100 in PDNLab. The model captures the complete on-die power delivery network from the global power grid down to all 144 individual Streaming Multiprocessor cores, with the board and package path represented as lumped impedance feeding the die. The resulting model contains 158 interconnected grids, 175 current sources, 146 decoupling capacitors, 794 transmission lines, and over 1,500 connection nodes. This article describes how that model was constructed, the assumptions behind every parameter, and what the results reveal about voltage droop behavior in modern AI accelerators.
From Board to Transistor: The Power Delivery Stack
The H100's power delivery path is a multi-layer system with distinct electrical characteristics at each level. The model captures each layer at the level of detail that matters for droop analysis.
The Die
The GH100 die measures approximately 28.5 mm × 28.5 mm (814 mm²), as published by NVIDIA. It contains 144 Streaming Multiprocessors organized into 8 Graphics Processing Clusters, along with memory controllers, NVLink engines, and a PCIe block. The die is fabricated on TSMC's 4N process, which provides up to 15 metal layers.
Given that the H100 is fabricated on TSMC's 4N process with up to 15 metal layers, the die likely uses a hierarchical power delivery network — a standard approach in high-performance designs at this node. The uppermost thick metal layers (likely in the M13–M15 range) would provide low-resistance global power distribution across the full die area, while intermediate semi-global layers (likely around M7–M8) would distribute power locally within individual SM core regions. NVIDIA does not publish the exact metal layer assignments for the H100's power grid, but this two-level hierarchy is consistent with how advanced-node designs allocate their metal stack for power delivery.
These two grid levels would be connected through dense arrays of stacked vias. A single stacked via spanning several metal layers has a resistance on the order of 12 to 18 Ω, but with millions of vias in parallel across each SM core area, the effective vertical resistance drops to the micro-ohm range.
Die-to-Package Interconnect
The H100 uses flip-chip BGA packaging. Based on the die area (814 mm²) and typical BGA pitch for this process node, we estimate approximately 5,000 to 6,000 solder balls connecting the die to the package substrate, of which roughly 5,000 would be dedicated to power and ground, split across multiple supply rails. NVIDIA does not publish exact bump counts. Each solder ball has a resistance on the order of 0.15 mΩ; with roughly 1,250 balls serving each supply rail, the bump array would present an effective vertical resistance of approximately 120 nΩ per rail.
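Both vertical interconnect stages reduce to simple parallel-resistance arithmetic. The sketch below reproduces the estimates in the text; the via resistance, via count, bump resistance, and bumps-per-rail figures are all assumptions, not published NVIDIA data:

```python
# Effective vertical resistance of parallel via and bump arrays.
# All input values are the article's estimates, not measured figures.

def parallel_resistance(r_each: float, count: int) -> float:
    """Resistance of `count` identical resistors in parallel."""
    return r_each / count

# Stacked vias: ~15 ohm each, on the order of 2 million per SM core area (assumed).
r_via_array = parallel_resistance(15.0, 2_000_000)

# Solder bumps: ~0.15 mohm each, ~1,250 assumed per supply rail.
r_bump_array = parallel_resistance(0.15e-3, 1250)

print(f"via array:  {r_via_array * 1e6:.1f} uOhm")   # micro-ohm range
print(f"bump array: {r_bump_array * 1e9:.0f} nOhm")  # ~120 nOhm
```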
Package and Board
The SXM5 module uses TSMC's CoWoS-S (Chip-on-Wafer-on-Substrate) packaging, integrating the GH100 die alongside five HBM3 memory stacks on a silicon interposer. The package substrate is a multi-layer organic PCB carrying power from edge-mounted connectors to the die. The SXM5 carrier board supplies power from 30+ voltage regulator modules through heavy copper planes.
For the purposes of this model, the specific dimensions of the package substrate and carrier board are not critical. As explained below, the board and package are represented as lumped impedance rather than spatially resolved grids, because their contribution to sub-nanosecond voltage transients reduces to a series impedance.
Modeling the H100 Power Delivery Network in PDNLab
The modeling approach begins at the die level, where the voltage droop events that matter most actually occur. The 814 mm² GH100 die is decomposed into functional blocks — SM cores, L2 cache, memory controllers, and NVLink I/O — each represented as a grid element in PDNLab with physically derived electrical parameters. The board and package are represented as lumped impedance feeding into this on-die network, because their spatial structure is electrically invisible at the nanosecond timescales where droop concentrates.
Die-Level Block Decomposition and Area Allocation
The first step is dividing the 814 mm² die into functional blocks with estimated area allocations. Using an approximate 60/15/15/10 split (SM cores / memory controllers / I/O / L2 cache and control logic, mirroring the power budget assumptions used later to derive current draw) as a proxy for area:
| Functional Block | Area Allocation | Estimated Area | Grid Dimensions in Model |
|---|---|---|---|
| 144 SM Cores | ~60% (~488 mm²) | ~3.39 mm² per SM | 1.84 mm × 1.84 mm each |
| L2 Cache (2 partitions) | ~10% (~81 mm²) | ~40.5 mm² per partition | 1.04 cm × 0.39 cm each |
| HBM3 Memory Controllers (10) | ~15% (~122 mm²) | ~12.2 mm² per controller | Distributed along die edges |
| NVLink + PCIe I/O | ~15% (~122 mm²) | Combined I/O block | Grid along bottom edge |
Each block in PDNLab is modeled as a two-dimensional power grid with its own sheet resistance, inductance, capacitance, and attached current sources. The grid dimensions are derived from the area estimates above.
On-Die Capacitance: From Transistor Physics to PDNLab Parameters
A critical parameter for each block is its on-die decoupling capacitance — the intrinsic charge storage that resists voltage droop during current transients. In PDNLab, each block's capacitor element takes a single value: Capacitance (F), representing the total capacitance within that block's area. This value is derived from TSMC 4N transistor-level physics.
The TSMC 4N process has a gate capacitance of approximately 0.18 to 0.22 fF per transistor. With 80 billion transistors distributed across the 814 mm² die, the total intrinsic transistor capacitance is:
| Step | Calculation | Result |
|---|---|---|
| Gate capacitance per transistor | ~0.2 fF (midpoint of 0.18–0.22 fF) | 0.2 × 10⁻¹⁵ F |
| Total transistor capacitance | 80 × 10⁹ × 0.2 × 10⁻¹⁵ F | ~16 µF |
| Unit capacitance density | 16 µF ÷ 814 mm² | ~19.66 nF/mm² |
This unit density of approximately 19.66 nF/mm² lets us estimate the intrinsic capacitance for any block based on its area. The table below shows the calculated value from pure area-based scaling alongside the adjusted value used in the actual PDNLab model:
| Block | Area | Calculated C (area × 19.66 nF/mm²) | Model Value (adjusted) | Notes |
|---|---|---|---|---|
| SM Core (each) | ~3.39 mm² | ~66.6 nF | 1.1 × 10⁻⁷ F (110 nF) | Adjusted upward to account for MiM decoupling capacitors in upper metal layers |
| L2 Cache (each partition) | ~40.5 mm² | ~796 nF | 2 × 10⁻⁶ F (2 µF) | SRAM structures have higher per-area capacitance than logic; MiM caps contribute significantly |
| HBM3 Memory Controller (each) | ~12.2 mm² | ~240 nF | ~3 × 10⁻⁷ F (300 nF) | I/O circuitry with moderate decoupling |
| NVLink I/O | ~122 mm² | ~2.4 µF | ~2.5 × 10⁻⁶ F (2.5 µF) | High-speed SerDes PHYs with local decoupling |
The model values are intentionally higher than the pure transistor-capacitance calculation because the on-die power grid includes additional capacitance sources beyond gate oxide: MiM (Metal-Insulator-Metal) decoupling capacitors integrated within the upper metal layers, inter-wire coupling capacitance in the power grid, and junction capacitance of inactive transistors. The adjusted values reflect a physically reasonable total that accounts for all of these contributions.
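The area-based scaling above is easy to reproduce. This sketch recomputes the unit density and the intrinsic per-block capacitance from the article's estimated gate capacitance and block areas (all inputs are estimates, not NVIDIA-published values, and the model values in the table are deliberately higher than these intrinsic figures):

```python
# Intrinsic on-die capacitance estimates from gate-capacitance scaling.
# Inputs are the article's assumed values for TSMC 4N and the H100 die.

C_GATE_PER_TRANSISTOR = 0.2e-15   # F, midpoint of the 0.18-0.22 fF estimate
N_TRANSISTORS = 80e9
DIE_AREA_MM2 = 814.0

# Unit capacitance density: ~19.66 nF/mm^2
density_f_per_mm2 = C_GATE_PER_TRANSISTOR * N_TRANSISTORS / DIE_AREA_MM2

blocks_mm2 = {
    "SM core":         3.39,
    "L2 partition":    40.5,
    "HBM3 controller": 12.2,
    "NVLink I/O":      122.0,
}

for name, area in blocks_mm2.items():
    c_nf = area * density_f_per_mm2 * 1e9
    print(f"{name:16s} ~{c_nf:6.0f} nF intrinsic (model value is higher)")
```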
Power Grid Parameters: Global and Core Levels
Beyond capacitance, each grid element in PDNLab is characterized by wire geometry and electrical properties. The H100 model uses two distinct grid tiers:
Global On-Die PDN — The global power grid covers the full 28.5 mm × 28.5 mm die area using what we assume to be the uppermost metal layers (likely M13–M15). The parameters below are estimated from published TSMC process-node characteristics for the 5nm/4N family.
| Parameter | Value | Source / Rationale |
|---|---|---|
| Grid size | 2.85 cm × 2.85 cm | GH100 die dimensions (√814 mm²) |
| Wire width | 3 µm | Wide power straps in top global metal |
| Wire spacing | 15 µm | Power strap pitch for HPC designs |
| Sheet resistance | 0.03 Ω/□ | TSMC 4N top global metal (~0.8–1.2 µm thick Cu) |
| Inductance | 3 nH/cm | Closely spaced Vdd/Vss in M15 |
| Capacitance | 2 pF/cm | Global power wire self-capacitance |
SM Core Grids — All 144 Streaming Multiprocessors are modeled as individual grids, each representing its local core-level power distribution. Each SM grid has a dedicated current source and the decoupling capacitor described above.
| Parameter | Value | Source / Rationale |
|---|---|---|
| Grid size | 1.84 mm × 1.84 mm | SM core area ≈ 3.39 mm² (488 mm² ÷ 144) |
| Wire width | 1 µm | Semi-global metal routing width |
| Wire spacing | 10 µm | Local power distribution pitch |
| Sheet resistance | 0.3 Ω/□ | TSMC 4N semi-global layers (estimated M7–M8) |
| Inductance | 5 nH/cm | Semi-global layer pair |
| Capacitance | 3 pF/cm | Core grid self-capacitance |
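As a rough sanity check on these parameters, one can estimate the end-to-end resistance of a single metal direction in each tier from sheet resistance and strap geometry. This is deliberately crude — a real power grid is a two-dimensional mesh tied down by dense via arrays, so actual spreading resistance is far lower — but it shows the relative scale of the two tiers:

```python
# One-directional strap resistance per grid tier, from sheet resistance
# and geometry. Illustrative only; not a substitute for mesh extraction.

def grid_tier_resistance(sheet_ohm_sq, span_m, wire_w_m, wire_pitch_m):
    squares = span_m / wire_w_m            # squares along one strap
    r_strap = sheet_ohm_sq * squares       # ohms per strap, end to end
    n_straps = int(span_m / wire_pitch_m)  # parallel straps across the span
    return r_strap / n_straps

# Global tier: 2.85 cm span, 3 um wires on an 18 um pitch (3 um + 15 um spacing)
r_global = grid_tier_resistance(0.03, 2.85e-2, 3e-6, 18e-6)

# SM core tier: ~1.84 mm span, 1 um wires on an 11 um pitch (1 um + 10 um spacing)
r_core = grid_tier_resistance(0.3, 1.84e-3, 1e-6, 11e-6)

print(f"global tier ~{r_global:.2f} Ohm, core tier ~{r_core:.2f} Ohm")
```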
Functional Block Grids
Beyond the 144 SM cores, the model includes dedicated grids for every major functional block on the H100 die. The L2 cache is represented as two grids flanking the SM array, corresponding to the two partitions of the roughly 50 MB L2 cache. Ten grids are arranged along the left and right die edges to represent the HBM3 memory controller pairs. The NVLink bus is modeled as a grid along the bottom edge, carrying 18 current sources representing the fourth-generation NVLink engine PHYs.
This level of functional decomposition ensures that the current distribution across the die reflects the actual H100 architecture — not just the compute cores but the memory subsystem and high-speed I/O that together account for 30 to 40 percent of total power consumption.
The Board and Package as Lumped Impedance
The outermost layer of the model represents the voltage regulators and the board-to-die delivery path. Rather than modeling the SXM5 carrier board and package substrate as full two-dimensional power grids, the model uses an ideal capacitor as the VRM voltage source, with transmission lines representing the aggregate series impedance of the board traces, package substrate planes, and solder ball array feeding the die.
This is a deliberate modeling choice. The voltage droop events that determine silicon reliability occur on nanosecond timescales. At those frequencies, the board PDN cannot respond fast enough to influence the droop waveform because its electrical distance from the die is too large. The board's function is to replenish on-die decoupling between transient events, and a lumped RLC representation captures that behavior accurately. A spatially resolved board grid would add computational cost without changing the droop results, because spatial variation across the carrier board is electrically invisible at the frequencies where droop concentrates.
This approach is standard practice in high-frequency PDN analysis where the region of interest is the die itself. The lumped path collapses the VRM, board, and package into an equivalent source impedance, keeping the model focused on the on-die network where droop actually forms.
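The frequency argument can be made concrete with a lumped series-RLC sketch. The R, L, and C values below are illustrative placeholders, not measured H100 path parameters; the point is only that inductive reactance dominates the lumped path at gigahertz frequencies, which is why on-die capacitance must carry the nanosecond transients:

```python
# Magnitude of a lumped series-RLC board/package path vs frequency.
# R, L, C are assumed placeholder values for illustration.
import math

R = 0.2e-3    # ohm, board + package series resistance (assumed)
L = 10e-12    # H, effective loop inductance of the lumped path (assumed)
C = 1e-3      # F, bulk VRM-side decoupling (assumed)

def z_mag(f_hz: float) -> float:
    """|Z| of the series R-L-C path at frequency f_hz."""
    w = 2 * math.pi * f_hz
    return math.sqrt(R**2 + (w * L - 1 / (w * C))**2)

for f in (1e3, 1e6, 1e9):
    print(f"|Z| at {f:8.0e} Hz = {z_mag(f):.3e} Ohm")
# At GHz, w*L dominates: the board path cannot source fast transients.
```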
Workload-Based Voltage Droop Investigation of the H100 Model with PDNLab
With the physical network modeled, the next step is to drive it with realistic workloads. In PDNLab, every current source element is assigned a current profile — a piecewise-linear waveform that specifies current draw in Amperes as a function of time. The simulator steps through this profile at each timestep, injecting the specified current into the grid node where the source is placed. The resulting voltage response across the entire network reveals where and when droop occurs.
Generating a current profile requires two pieces of information: the average current that a block draws (derived from its power budget and supply voltage), and the temporal shape of the switching activity within each clock cycle. The average current sets the total charge per cycle, and the shape determines the di/dt — the rate of current change that drives inductive voltage droop. Together, these define a complete current-versus-time waveform that PDNLab uses to simulate transient voltage behavior across the die.
Deriving Current Draw from Published Specifications
The derivation begins with publicly available data. The H100 in its PCIe configuration has a rated TDP of 350 W. Based on publicly available performance analysis and general GPU architectural trends, we can estimate a reasonable power budget split: approximately 60% of total chip power consumed by the SM compute cores, with the remainder split among I/O (~15%), HBM3 memory (~15%), and L2 cache and control logic (~10%). These are rough assumptions — NVIDIA does not publish exact per-block power breakdowns — but they give us a working basis for the model. From this:
| Step | Calculation | Result |
|---|---|---|
| SM core power budget | 60% × 350 W | 210 W |
| Power per SM | 210 W ÷ 144 SMs | 1.458 W |
| Average current per SM | 1.458 W ÷ 0.6 V (Vdd) | 2.43 A |
The supply voltage of approximately 0.6 V is consistent with TSMC 4N nominal operating voltage for high-performance compute designs. The H100 base clock of approximately 690 MHz gives a clock period of roughly 1.45 ns. At 2.43 A average current, the charge per clock cycle delivered by the PDN to each SM is approximately 3.52 nC.
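The derivation chain above is short enough to verify in a few lines. The TDP and approximate base clock are published figures; the 60% SM share and 0.6 V Vdd are the article's assumptions:

```python
# Per-SM average current and charge-per-cycle for the H100 PCIe model.
# SM_SHARE and VDD are modeling assumptions, not published breakdowns.

TDP_W = 350.0       # W, H100 PCIe rated TDP
SM_SHARE = 0.60     # assumed fraction of power in SM cores
N_SM = 144
VDD = 0.6           # V, assumed TSMC 4N HPC operating point
F_CLK = 690e6       # Hz, approximate base clock

p_sm = TDP_W * SM_SHARE / N_SM   # W per SM
i_avg = p_sm / VDD               # A, average current per SM
t_clk = 1 / F_CLK                # s, clock period
q_cycle = i_avg * t_clk          # C, charge delivered per cycle

print(f"P/SM = {p_sm:.3f} W, I_avg = {i_avg:.2f} A, "
      f"T = {t_clk * 1e9:.2f} ns, Q = {q_cycle * 1e9:.2f} nC")
```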
Voltage Droop Margins
At a nominal Vdd of approximately 0.6 V, the H100 must maintain voltage within tight margins to ensure correct logic operation. Industry practice for high-performance computing silicon typically allows a maximum dynamic voltage droop of 5% to 10% of Vdd. For the H100, this translates to:
| Droop Budget | Voltage Drop | Minimum Vdd |
|---|---|---|
| 5% of Vdd | 30 mV | 0.570 V |
| 7% of Vdd | 42 mV | 0.558 V |
| 10% of Vdd | 60 mV | 0.540 V |
A 30 mV droop on a 0.6 V supply is a 5% event. If the supply drops below the minimum operating voltage, logic timing violations can cause silent data corruption or functional failure. The guardband required to absorb this droop directly reduces the maximum achievable clock frequency. For the H100, even small improvements in droop management translate to measurable frequency and throughput gains across thousands of deployed units.
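For reference, the margin table reduces to one line of arithmetic per row:

```python
# Droop budget arithmetic: allowed drop and minimum Vdd per margin.
VDD = 0.6  # V, nominal supply assumed in the model

for pct in (5, 7, 10):
    drop = VDD * pct / 100
    print(f"{pct:2d}% budget: {drop * 1e3:.0f} mV drop, Vmin = {VDD - drop:.3f} V")
```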
Generating Current Profiles
With the average current known, each SM's current profile is constructed as a triangular waveform: two back-to-back triangular pulses per clock cycle, each with a peak of approximately 5 A and a base width of about half the clock period (~0.725 ns). The area under these two triangles in one 1.45 ns clock cycle equals the required 3.52 nC of charge — the same total that a flat 2.43 A rectangular profile would deliver, but with a finite, bounded di/dt instead of the near-instantaneous current steps at the edges of a rectangular pulse. This triangular shape is a more physically realistic representation of how current ramps in actual transistor switching events, where gate charge and discharge follow smooth RC transitions rather than instantaneous steps.
The profile is specified in PDNLab as a series of (time, current) coordinate pairs that define the piecewise-linear waveform. The simulator interpolates between these points at each timestep, so the profile resolution directly controls the fidelity of the di/dt transient injected into the network.
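A minimal generator for the two-triangle cycle looks like the following. The coordinate-pair structure mirrors the piecewise-linear profile concept described above, not PDNLab's exact file syntax, and the peak is set to ~4.86 A so the charge works out exactly (the text rounds this to 5 A):

```python
# Piecewise-linear (time, current) points for one two-triangle clock cycle.
# Peak and base width follow the article's estimates; syntax is illustrative.

T_CLK = 1.45e-9      # s, one clock period
BASE = T_CLK / 2     # s, base width of each triangle (~0.725 ns)
PEAK_A = 4.86        # A, chosen so charge matches the 2.43 A average exactly

def one_cycle(t0: float):
    """PWL points for two back-to-back triangles starting at t0."""
    pts = []
    for k in range(2):
        start = t0 + k * BASE
        pts += [(start, 0.0),
                (start + BASE / 2, PEAK_A),
                (start + BASE, 0.0)]
    return pts

profile = one_cycle(0.0)
charge = 2 * (0.5 * PEAK_A * BASE)   # area under the two triangles
print(f"{len(profile)} PWL points, charge/cycle = {charge * 1e9:.2f} nC")
```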
Scenario Management
The H100 model uses PDNLab's Scenario Manager to define multiple workload configurations, each assigning different current profiles or activation patterns to the current sources across the die. In this article we examine two scenarios:
- Full activation: All 144 SMs fire simultaneously, representing worst-case di/dt during a large GEMM or all-to-all collective operation where every SM is computing in lockstep.
- Staggered SM activation within a local GPC: Individual SMs within a single Graphics Processing Cluster fire at offset phases rather than simultaneously. This represents the more realistic case where compute waves ripple through the SM pipeline within a GPC, producing lower local di/dt than full synchronous switching.
Each scenario can be simulated independently in minutes, enabling rapid exploration of the design space.
Scenario 1: Full Activation — All 144 SM Cores Firing
The most electrically stressful scenario is full activation: all 144 SM cores drawing current simultaneously with the triangular profile described above. Each SM uses a triangular current profile of length 1.45 ns (one clock period) repeated 3 times, for a total activation window of approximately 4.35 ns. Because every SM fires in phase, the voltage noise propagating outward from each current source overlaps constructively across the grid — especially in the die center where the density of active sources is highest and the electrical distance to the nearest bump is greatest.
Despite being the worst-case synchronous switching scenario, the simulation reveals a peak voltage droop of approximately 5.9 mV. The constructive interference of voltage waves from all 144 sources produces a smooth, spatially broad droop envelope rather than a sharp localized spike, because the uniform activation maintains symmetry across the grid.
Scenario 2: Staggered SM Activation Within a Local GPC
A more realistic scenario: individual SMs within a single Graphics Processing Cluster (SM1 through SM6 and SM8) fire at slight offset delays from one another rather than simultaneously. Each SM uses the same triangular current profile, but with staggered start times, and the activation repeats 6 times total for an overall simulation window ending at approximately 12 ns. This models the compute wave rippling through the SM pipeline within a GPC, a pattern characteristic of real AI workloads where warp schedulers dispatch work in rapid succession rather than in perfect lockstep.
Counter-intuitively, this staggered activation produces a larger peak voltage droop of approximately 7.4 mV — compared to 5.9 mV in the full synchronous case. The reason is constructive interference of voltage waves within a localized region: as each SM fires in sequence, the voltage disturbance from one SM has not yet dissipated before the next SM begins drawing current. The overlapping wavefronts pile up in the confined area of the GPC, creating a cumulative droop that exceeds what any single SM could produce alone and, critically, exceeds the uniformly distributed droop of the full-activation scenario. This is a direct consequence of the wave-propagation physics discussed above — noise does not appear instantaneously across the die; it propagates, and staggered local sources can constructively reinforce it.
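Even a plain current-superposition sketch hints at the pile-up: summing seven triangular pulses at staggered offsets produces an aggregate draw well above any single SM's peak. The 0.2 ns offset here is an assumed illustration, and the droop amplification seen in the simulation additionally involves voltage-wave propagation, which this simple current sum does not model:

```python
# Aggregate current of 7 staggered triangular pulses (one per SM in the GPC).
# Offset is an assumed illustration; this only sums currents, not droop.

def tri(t: float, t0: float, peak: float, base: float) -> float:
    """Single triangular pulse starting at t0 with the given base width."""
    x = t - t0
    if 0 <= x <= base / 2:
        return peak * (2 * x / base)
    if base / 2 < x <= base:
        return peak * (2 - 2 * x / base)
    return 0.0

BASE = 0.725e-9      # s, pulse base width from the profile above
PEAK = 5.0           # A, per-SM peak
OFFSET = 0.2e-9      # s, assumed stagger between consecutive SMs

times = [i * 1e-11 for i in range(400)]   # 0 to 4 ns, 10 ps steps
total = [sum(tri(t, n * OFFSET, PEAK, BASE) for n in range(7)) for t in times]

print(f"peak aggregate current: {max(total):.1f} A (single-SM peak is {PEAK} A)")
```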
Engineering Questions This Model Can Answer
An engineer using this model can ask questions that would otherwise require weeks of physical prototyping or full-chip EDA simulation:
- What happens to the droop envelope if the global grid wire width is doubled?
- How much additional MiM capacitance is needed to keep peak droop below 5% of Vdd (30 mV)?
- What is the droop impact of migrating from lateral to vertical power delivery?
- How does the droop distribution change if SM count is reduced for a lower-power SKU?
- What is the comparative droop behavior of a training workload versus inference?
Each of these questions can be answered by modifying a parameter, running the simulation, and observing the result in minutes rather than days. The H100 model files are included as example projects in PDNLab and serve as templates that engineers can adapt to their own chip, package, and board designs.
As the two scenarios above demonstrate, PDNLab lets GPU architects and power integrity engineers visualize the cumulative voltage droop produced by the activation patterns their AI workloads drive. Seeing that staggered local switching can produce worse droop than full synchronous activation — a result that is not obvious from static analysis alone — is exactly the kind of insight needed to understand the electrical environment of a modern accelerator. With PDNLab, these dynamics become explorable in minutes rather than hidden behind weeks of full-chip extraction.
The Trend Is Clear
The electrical challenges quantified in this H100 model are not unique to NVIDIA. NVIDIA's own Blackwell architecture increases power density further. AMD, Intel, and custom silicon programs at hyperscalers are all pushing in the same direction: more compute per unit area, more current per unit time, lower voltage margins. The fundamental scaling trend is not slowing down.
At the same time, the voltage droop problem becomes harder at each generation. Supply voltages have been dropping, from ~0.8 V in earlier nodes to ~0.6 V at 4nm/5nm. The absolute margin for droop shrinks with each reduction. A 5% droop on 0.8 V is 40 mV of headroom. A 5% droop on 0.6 V is only 30 mV. The physics is unforgiving.
The engineering response must be equally systematic. Voltage droop analysis cannot remain a specialized activity performed by a small team at the end of the design cycle. It must become a routine part of architectural exploration, accessible to the engineers making floorplan, packaging, and power delivery decisions from the earliest stages. PDNLab brings that capability to every engineer working on power delivery for AI silicon.
References
- NVIDIA, "NVIDIA H100 Tensor Core GPU Architecture," NVIDIA Whitepaper, 2022.
- NVIDIA, "NVIDIA Hopper Architecture In-Depth," NVIDIA Technical Blog, March 2022.
- J. Choquette, "NVIDIA Hopper H100 GPU: Scaling Performance," Hot Chips 34, IEEE, 2022.
- TSMC, "N4P and N4X Technology," TSMC 2022 Technology Symposium.
- S.K. Lim, Design for High Performance, Low Power, and Reliable 3D Integrated Circuits, Springer, 2013.
- M. Swaminathan and A.E. Engin, Power Integrity Modeling and Design for Semiconductors and Systems, Prentice Hall, 2007.