The Hopper Architecture as a Power Delivery Challenge

The NVIDIA H100, the Hopper-architecture GPU that has dominated AI training infrastructure since 2023, operates at the electrical frontier where voltage droop becomes dangerous: 80 billion transistors on an 814 mm² die, fabricated on TSMC's 4N process, delivering up to 700 watts in the SXM5 configuration at sub-volt supply levels. Every characteristic that makes the H100 a powerful AI accelerator also makes it a challenging power delivery problem.

Using published architectural specifications and TSMC process-node data, we built a full-stack PDN model of the H100 in PDNLab. The model captures the complete on-die power delivery network from the global power grid down to all 144 individual Streaming Multiprocessor cores, with the board and package path represented as lumped impedance feeding the die. The resulting model contains 158 interconnected grids, 175 current sources, 146 decoupling capacitors, 794 transmission lines, and over 1,500 connection nodes. This article describes how that model was constructed, the assumptions behind every parameter, and what the results reveal about voltage droop behavior in modern AI accelerators.


From Board to Transistor: The Power Delivery Stack

The H100's power delivery path is a multi-layer system with distinct electrical characteristics at each level. The model captures each layer at the level of detail that matters for droop analysis.

[Figure: cross-section schematic of the four-level power delivery stack. Level 1: Board PDN (SXM5 carrier; 30+ VRMs; heavy copper planes; modeled as lumped impedance). Level 2: Package substrate (CoWoS-S; multi-layer organic; land-side decaps; HBM3 integration; ~5,000 bumps). Level 3: Global on-die PDN (est. M13–M15; 28.5 mm × 28.5 mm; 0.03 Ω/□ sheet resistance). Level 4: Core power grid (est. M7–M8; 144 SMs at ~0.6–0.8 V). Spatial resolution increases, and electrical distance to the transistors decreases, from board to core.]
Cross-section of the H100 power delivery stack as modeled in PDNLab. Current flows from board-level VRMs through the package substrate and bump array into the on-die global and core power grids feeding 144 individual SM cores. The specific metal layer assignments are estimated based on TSMC 4N process characteristics. The board and package are modeled as lumped impedance; the on-die network is modeled spatially.

The Die

The GH100 die measures approximately 28.5 mm × 28.5 mm (814 mm²), as published by NVIDIA. It contains 144 Streaming Multiprocessors organized into 8 Graphics Processing Clusters, along with memory controllers, NVLink engines, and a PCIe block. The die is fabricated on TSMC's 4N process, which provides up to 15 metal layers.

Given that the H100 is fabricated on TSMC's 4N process with up to 15 metal layers, the die likely uses a hierarchical power delivery network — a standard approach in high-performance designs at this node. The uppermost thick metal layers (likely in the M13–M15 range) would provide low-resistance global power distribution across the full die area, while intermediate semi-global layers (likely around M7–M8) would distribute power locally within individual SM core regions. NVIDIA does not publish the exact metal layer assignments for the H100's power grid, but this two-level hierarchy is consistent with how advanced-node designs allocate their metal stack for power delivery.

These two grid levels would be connected through dense arrays of stacked vias. A single stacked via spanning several metal layers has a resistance on the order of 12 to 18 Ω, but with millions of vias in parallel across each SM core area, the effective vertical resistance drops to the micro-ohm range.

Die-to-Package Interconnect

The H100 uses flip-chip attachment between die and package. Based on the die area (814 mm²) and typical bump pitch for this process node, we estimate approximately 5,000 to 6,000 solder bumps connecting the die to the package substrate, of which roughly 5,000 would be dedicated to power and ground. NVIDIA does not publish exact bump counts. Each bump has a resistance on the order of 0.15 mΩ. In aggregate, the bump array would present an effective vertical resistance of approximately 120 nΩ per supply rail.
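As a sanity check on these parallel-resistance figures, a few lines of Python reproduce the order of magnitude. The counts used here (2 million vias per SM core area, 1,250 bumps per supply rail) are illustrative assumptions chosen for the sketch, not published figures:

```python
# Order-of-magnitude check on vertical resistance: many individually
# high-resistance vias/bumps in parallel yield a tiny effective resistance.
# The counts below are illustrative assumptions, not published figures.

def parallel_resistance(r_each: float, n: int) -> float:
    """Effective resistance of n identical resistors in parallel."""
    return r_each / n

# Stacked vias: ~15 ohm each (midpoint of 12-18), assume ~2 million per SM area
r_via_eff = parallel_resistance(15.0, 2_000_000)    # micro-ohm range

# Solder bumps: ~0.15 milliohm each, assume ~1,250 per supply rail
r_bump_eff = parallel_resistance(0.15e-3, 1_250)    # ~120 nano-ohm

print(f"effective via resistance:  {r_via_eff * 1e6:.1f} uOhm")
print(f"effective bump resistance: {r_bump_eff * 1e9:.0f} nOhm")
```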

Package and Board

The SXM5 module uses TSMC's CoWoS-S (Chip-on-Wafer-on-Substrate) packaging, integrating the GH100 die alongside five HBM3 memory stacks on a silicon interposer. The package substrate is a multi-layer organic PCB carrying power from edge-mounted connectors to the die. The SXM5 carrier board supplies power from 30+ voltage regulator modules through heavy copper planes.

For the purposes of this model, the specific dimensions of the package substrate and carrier board are not critical. As explained below, the board and package are represented as lumped impedance rather than spatially resolved grids, because their contribution to sub-nanosecond voltage transients reduces to a series impedance.


Modeling the H100 Power Delivery Network in PDNLab

Interactive 3D rendering of the H100 power delivery network in PDNLab. The hierarchical model decomposes the full delivery path — from board-level VRMs through the package substrate and bump array down to 144 individual SM core grids on the die — into tractable grid elements, each with physically derived electrical parameters.

The modeling approach begins at the die level, where the voltage droop events that matter most actually occur. The 814 mm² GH100 die is decomposed into functional blocks — SM cores, L2 cache, memory controllers, and NVLink I/O — each represented as a grid element in PDNLab with physically derived electrical parameters. The board and package are represented as lumped impedance feeding into this on-die network, because their spatial structure is electrically invisible at the nanosecond timescales where droop concentrates.

Die-Level Block Decomposition and Area Allocation

The first step is dividing the 814 mm² die into functional blocks with estimated area allocations. NVIDIA does not publish a per-block floorplan, so we use an approximate 60/15/15/10 power-budget split (SM cores / I/O / HBM3 memory / L2 and control — the same split used to derive current draw later in this article) as a proxy for area:

Functional Block | Area Allocation | Estimated Area | Grid Dimensions in Model
144 SM Cores | ~60% (~488 mm²) | ~3.39 mm² per SM | 1.84 mm × 1.84 mm each
L2 Cache (2 partitions) | ~10% (~81 mm²) | ~40.5 mm² per partition | 1.04 cm × 0.18 cm each
HBM3 Memory Controllers (10) | ~15% (~122 mm²) | ~12.2 mm² per controller | Distributed along die edges
NVLink + PCIe I/O | ~15% (~122 mm²) | Combined I/O block | Grid along bottom edge

Each block in PDNLab is modeled as a two-dimensional power grid with its own sheet resistance, inductance, capacitance, and attached current sources. The grid dimensions are derived from the area estimates above.

On-Die Capacitance: From Transistor Physics to PDNLab Parameters

A critical parameter for each block is its on-die decoupling capacitance — the intrinsic charge storage that resists voltage droop during current transients. In PDNLab, each block's capacitor element takes a single value: Capacitance (F), representing the total capacitance within that block's area. This value is derived from TSMC 4N transistor-level physics.

The TSMC 4N process has a gate capacitance of approximately 0.18 to 0.22 fF per transistor. With 80 billion transistors distributed across the 814 mm² die, the total intrinsic transistor capacitance is:

Step | Calculation | Result
Gate capacitance per transistor | ~0.2 fF (midpoint of 0.18–0.22 fF) | 0.2 × 10⁻¹⁵ F
Total transistor capacitance | 80 × 10⁹ × 0.2 × 10⁻¹⁵ F | ~16 µF
Unit capacitance density | 16 µF ÷ 814 mm² | ~19.66 nF/mm²

This unit density of approximately 19.66 nF/mm² lets us estimate the intrinsic capacitance for any block based on its area. The table below shows the calculated value from pure area-based scaling alongside the adjusted value used in the actual PDNLab model:

Block | Area | Calculated C (area × 19.66 nF/mm²) | Model Value (adjusted) | Notes
SM Core (each) | ~3.39 mm² | ~66.6 nF | 1.1 × 10⁻⁷ F (110 nF) | Adjusted upward to account for MiM decoupling capacitors in upper metal layers
L2 Cache (each partition) | ~40.5 mm² | ~796 nF | 2 × 10⁻⁶ F (2 µF) | SRAM structures have higher per-area capacitance than logic; MiM caps contribute significantly
HBM3 Memory Controller (each) | ~12.2 mm² | ~240 nF | ~3 × 10⁻⁷ F (300 nF) | I/O circuitry with moderate decoupling
NVLink I/O | ~122 mm² | ~2.4 µF | ~2.5 × 10⁻⁶ F (2.5 µF) | High-speed SerDes PHYs with local decoupling

The model values are intentionally higher than the pure transistor-capacitance calculation because the on-die power grid includes additional capacitance sources beyond gate oxide: MiM (Metal-Insulator-Metal) decoupling capacitors integrated within the upper metal layers, inter-wire coupling capacitance in the power grid, and junction capacitance of inactive transistors. The adjusted values reflect a physically reasonable total that accounts for all of these contributions.
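The density scaling above can be reproduced in a few lines. All inputs are the estimates from the text, not published NVIDIA figures; the printed values are the pre-adjustment, area-only capacitances from the calculated column of the table:

```python
# Reproduce the capacitance scaling used in the tables above.
# All inputs are the article's estimates, not published NVIDIA figures.

C_GATE = 0.2e-15        # F per transistor (midpoint of 0.18-0.22 fF)
N_TRANSISTORS = 80e9
DIE_AREA_MM2 = 814.0

c_total = C_GATE * N_TRANSISTORS     # ~16 uF intrinsic transistor capacitance
density = c_total / DIE_AREA_MM2     # ~19.66 nF/mm^2

blocks = {                           # block -> estimated area (mm^2)
    "SM core (each)": 3.39,
    "L2 partition (each)": 40.5,
    "HBM3 controller (each)": 12.2,
    "NVLink I/O": 122.0,
}
for name, area in blocks.items():
    # Area-only estimate, before the MiM/coupling/junction adjustment
    print(f"{name:24s} {area * density * 1e9:8.1f} nF")
```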

Power Grid Parameters: Global and Core Levels

Beyond capacitance, each grid element in PDNLab is characterized by wire geometry and electrical properties. The H100 model uses two distinct grid tiers:

Global On-Die PDN — The global power grid covers the full 28.5 mm × 28.5 mm die area using what we assume to be the uppermost metal layers (likely M13–M15). The parameters below are estimated from published TSMC process-node characteristics for the 5nm/4N family.

Parameter | Value | Source / Rationale
Grid size | 2.85 cm × 2.85 cm | GH100 die dimensions (√814 mm² ≈ 28.5 mm)
Wire width | 3 µm | Wide power straps in top global metal
Wire spacing | 15 µm | Power strap pitch for HPC designs
Sheet resistance | 0.03 Ω/□ | TSMC 4N top global metal (~0.8–1.2 µm thick Cu)
Inductance | 3 nH/cm | Closely spaced Vdd/Vss in M15
Capacitance | 2 pF/cm | Global power wire self-capacitance
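A back-of-envelope sketch shows what these parameters imply for the global grid's lateral resistance. The calculation below considers straps in one routing direction only; the real grid is a two-direction mesh stitched with vias, so the effective resistance seen by any single SM is lower still:

```python
# Back-of-envelope: end-to-end resistance of the global grid in one routing
# direction, from the parameters in the table above. A sketch only -- the
# real grid is a two-direction via-stitched mesh with a lower effective
# resistance at any given point.

SHEET_RES = 0.03        # ohm/square, top global metal
DIE_SIDE = 28.5e-3      # m, die edge length
WIDTH = 3e-6            # m, power strap width
SPACING = 15e-6         # m, gap between straps

squares = DIE_SIDE / WIDTH                     # squares along one full strap
r_strap = SHEET_RES * squares                  # ~285 ohm per strap
n_straps = int(DIE_SIDE / (WIDTH + SPACING))   # straps fitting across the die
r_grid = r_strap / n_straps                    # parallel combination

print(f"{n_straps} straps of {r_strap:.0f} ohm each "
      f"-> {r_grid:.2f} ohm end-to-end (one direction)")
```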

SM Core Grids — All 144 Streaming Multiprocessors are modeled as individual grids, each representing its local core-level power distribution. Each SM grid has a dedicated current source and the decoupling capacitor described above.

Parameter | Value | Source / Rationale
Grid size | 1.84 mm × 1.84 mm | SM core area ≈ 3.39 mm² (488 mm² ÷ 144)
Wire width | 1 µm | Semi-global metal routing width
Wire spacing | 10 µm | Local power distribution pitch
Sheet resistance | 0.3 Ω/□ | TSMC 4N semi-global layers (estimated M7–M8)
Inductance | 5 nH/cm | Semi-global layer pair
Capacitance | 3 pF/cm | Core grid self-capacitance

Functional Block Grids

Beyond the 144 SM cores, the model includes dedicated grids for every major functional block on the H100 die. The L2 cache is represented as two grids flanking the SM array, corresponding to the H100's two L2 cache partitions (~50 MB combined). Ten grids are arranged along the left and right die edges to represent the HBM3 memory controller pairs. The NVLink bus is modeled as a grid along the bottom edge, carrying 18 current sources representing the fourth-generation NVLink engine PHYs.

This level of functional decomposition ensures that the current distribution across the die reflects the actual H100 architecture — not just the compute cores but the memory subsystem and high-speed I/O that together account for 30 to 40 percent of total power consumption.

The Board and Package as Lumped Impedance

The outermost layer of the model represents the voltage regulators and the board-to-die delivery path. Rather than modeling the SXM5 carrier board and package substrate as full two-dimensional power grids, the model uses an ideal capacitor as the VRM voltage source, with transmission-line elements representing the aggregate series impedance of the board traces, package substrate planes, and solder ball array feeding the die.

This is a deliberate modeling choice. The voltage droop events that determine silicon reliability occur on nanosecond timescales. At those frequencies, the board PDN cannot respond fast enough to influence the droop waveform because its electrical distance from the die is too large. The board's function is to replenish on-die decoupling between transient events, and a lumped RLC representation captures that behavior accurately. A spatially resolved board grid would add computational cost without changing the droop results, because spatial variation across the carrier board is electrically invisible at the frequencies where droop concentrates.

This approach is standard practice in high-frequency PDN analysis where the region of interest is the die itself. The lumped path collapses the VRM, board, and package into an equivalent source impedance, keeping the model focused on the on-die network where droop actually forms.
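A rough resonance calculation illustrates the timescale separation. Using the model's total on-die capacitance and an assumed round-number loop inductance of 1 nH for the board-plus-package path (an illustrative value, not a measured H100 figure), the die-package resonance lands near 1 MHz — three orders of magnitude below the GHz content of the droop transients:

```python
import math

# Why the board is electrically invisible at droop timescales: the lumped
# board/package inductance and the total on-die capacitance resonate around
# 1 MHz, far below the GHz content of nanosecond droop events.
# L_LOOP is an assumed round number, not a measured H100 value.

L_LOOP = 1e-9    # H, assumed board + package loop inductance

# Sum of the model's decoupling capacitor values from the tables above
c_die = 144 * 110e-9 + 2 * 2e-6 + 10 * 300e-9 + 2.5e-6   # ~25.3 uF

f_res = 1.0 / (2 * math.pi * math.sqrt(L_LOOP * c_die))
print(f"on-die C: {c_die * 1e6:.1f} uF, "
      f"die-package resonance: {f_res / 1e6:.1f} MHz")
```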


Workload-Based Voltage Droop Investigation of the H100 Model with PDNLab

With the physical network modeled, the next step is to drive it with realistic workloads. In PDNLab, every current source element is assigned a current profile — a piecewise-linear waveform that specifies current draw in Amperes as a function of time. The simulator steps through this profile at each timestep, injecting the specified current into the grid node where the source is placed. The resulting voltage response across the entire network reveals where and when droop occurs.

Generating a current profile requires two pieces of information: the average current that a block draws (derived from its power budget and supply voltage), and the temporal shape of the switching activity within each clock cycle. The average current sets the total charge per cycle, and the shape determines the di/dt — the rate of current change that drives inductive voltage droop. Together, these define a complete current-versus-time waveform that PDNLab uses to simulate transient voltage behavior across the die.

Deriving Current Draw from Published Specifications

The derivation begins with publicly available data. The H100 in its PCIe configuration has a rated TDP of 350 W. Based on publicly available performance analysis and general GPU architectural trends, we can estimate a reasonable power budget split: approximately 60% of total chip power consumed by the SM compute cores, with the remainder split among I/O (~15%), HBM3 memory (~15%), and L2 cache and control logic (~10%). These are rough assumptions — NVIDIA does not publish exact per-block power breakdowns — but they give us a working basis for the model. From this:

Step | Calculation | Result
SM core power budget | 60% × 350 W | 210 W
Power per SM | 210 W ÷ 144 SMs | ~1.458 W
Average current per SM | 1.458 W ÷ 0.6 V (Vdd) | ~2.43 A

The supply voltage of approximately 0.6 V is consistent with TSMC 4N nominal operating voltage for high-performance compute designs. The H100 base clock of approximately 690 MHz gives a clock period of roughly 1.45 ns. At 2.43 A average current, the charge per clock cycle delivered by the PDN to each SM is approximately 3.52 nC.
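The derivation chain from TDP to per-cycle charge can be expressed compactly. The 60% SM power fraction and 0.6 V supply are the modeling assumptions stated above, not published per-block figures:

```python
# Derivation chain from published TDP to per-SM, per-cycle charge.
# The 60% SM power fraction and 0.6 V Vdd are modeling assumptions.

TDP_W = 350.0            # H100 PCIe rated TDP
SM_POWER_FRACTION = 0.6  # assumed share of chip power in the SM cores
N_SM = 144
VDD = 0.6                # V, assumed nominal core supply
F_CLK = 690e6            # Hz, approximate base clock

p_per_sm = TDP_W * SM_POWER_FRACTION / N_SM   # ~1.458 W
i_avg = p_per_sm / VDD                        # ~2.43 A
q_per_cycle = i_avg / F_CLK                   # ~3.52 nC per clock cycle

print(f"P/SM = {p_per_sm:.3f} W, I_avg = {i_avg:.2f} A, "
      f"Q/cycle = {q_per_cycle * 1e9:.2f} nC")
```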

Voltage Droop Margins

At a nominal Vdd of approximately 0.6 V, the H100 must maintain voltage within tight margins to ensure correct logic operation. Industry practice for high-performance computing silicon typically allows a maximum dynamic voltage droop of 5% to 10% of Vdd. For the H100, this translates to:

Droop Budget | Voltage Drop | Minimum Vdd
5% of Vdd | 30 mV | 0.570 V
7% of Vdd | 42 mV | 0.558 V
10% of Vdd | 60 mV | 0.540 V

A 30 mV droop on a 0.6 V supply is a 5% event. If the supply drops below the minimum operating voltage, logic timing violations can cause silent data corruption or functional failure. The guardband required to absorb this droop directly reduces the maximum achievable clock frequency. For the H100, even small improvements in droop management translate to measurable frequency and throughput gains across thousands of deployed units.
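The margin table follows directly from the nominal supply; a minimal sketch:

```python
VDD = 0.6  # V, nominal core supply

# droop budget (%) -> (allowed drop in V, minimum supply in V)
margins = {pct: (VDD * pct / 100, VDD * (1 - pct / 100)) for pct in (5, 7, 10)}

for pct, (drop, vmin) in margins.items():
    print(f"{pct:>2d}% budget: {drop * 1e3:.0f} mV droop allowed, "
          f"Vmin = {vmin:.3f} V")
```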

Generating Current Profiles

With the average current known, each SM's current profile is constructed as a triangular waveform: two back-to-back triangular pulses per clock cycle, each with a peak of 5 A and a base width of approximately half the clock period (~0.725 ns). The area under the two triangles in one 1.45 ns clock cycle delivers approximately the required 3.52 nC of charge — the same total that a flat 2.43 A rectangular profile would deliver, but with a finite, far gentler di/dt than an idealized current step. This triangular shape is a more physically realistic representation of how current ramps in actual transistor switching events, where gate charge and discharge follow smooth RC transitions rather than instantaneous steps.

The profile is specified in PDNLab as a series of (time, current) coordinate pairs that define the piecewise-linear waveform. The simulator interpolates between these points at each timestep, so the profile resolution directly controls the fidelity of the di/dt transient injected into the network.
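Such a profile can be sketched as a list of (time, current) pairs. The base width below is chosen so each cycle delivers exactly the 3.52 nC target at a 5 A peak (~0.704 ns per triangle, slightly narrower than the ~0.725 ns quoted above); the function and its structure are illustrative, not PDNLab's actual input API:

```python
# Sketch of a per-SM piecewise-linear current profile: two triangular
# pulses per 1.45 ns clock cycle with a 5 A peak. The base width is chosen
# so each cycle delivers exactly the 3.52 nC target; this is an illustrative
# construction, not PDNLab's actual input format.

PEAK_A = 5.0
Q_CYCLE = 3.52e-9                 # C, required charge per clock cycle
T_CLK = 1.45e-9                   # s, clock period
BASE = Q_CYCLE / PEAK_A           # ~0.704 ns base width per triangle

def triangular_profile(n_cycles: int) -> list[tuple[float, float]]:
    """(time, current) pairs defining the PWL waveform for n clock cycles."""
    points = []
    for k in range(n_cycles):
        t0 = k * T_CLK
        for tri in range(2):                      # two back-to-back triangles
            ts = t0 + tri * BASE
            points += [(ts, 0.0), (ts + BASE / 2, PEAK_A), (ts + BASE, 0.0)]
        points.append((t0 + T_CLK, 0.0))          # idle until the cycle ends
    return points

profile = triangular_profile(3)                   # scenario 1: 3 cycles

# Verify delivered charge by trapezoidal integration of the PWL waveform
charge = sum((t2 - t1) * (i1 + i2) / 2
             for (t1, i1), (t2, i2) in zip(profile, profile[1:]))
print(f"total charge over 3 cycles: {charge * 1e9:.2f} nC")
```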

Scenario Management

The H100 model uses PDNLab's Scenario Manager to define multiple workload configurations, each assigning different current profiles or activation patterns to the current sources across the die. In this article we examine two scenarios:

  • Full activation: All 144 SMs fire simultaneously, representing worst-case di/dt during a large GEMM or all-to-all collective operation where every SM is computing in lockstep.
  • Staggered SM activation within a local GPC: Individual SMs within a single Graphics Processing Cluster fire at offset phases rather than simultaneously. This represents the more realistic case where compute waves ripple through the SM pipeline within a GPC, producing lower local di/dt than full synchronous switching.

Each scenario can be simulated independently in minutes, enabling rapid exploration of the design space.

Scenario 1: Full Activation — All 144 SM Cores Firing

The most electrically stressful scenario is full activation: all 144 SM cores drawing current simultaneously with the triangular profile described above. Each SM uses a triangular current profile of length 1.45 ns (one clock period) repeated 3 times, for a total activation window of approximately 4.35 ns. Because every SM fires in phase, the voltage noise propagating outward from each current source overlaps constructively across the grid — especially in the die center where the density of active sources is highest and the electrical distance to the nearest bump is greatest.

Despite being the worst-case synchronous switching scenario, the simulation reveals a peak voltage droop of approximately 5.9 mV. The constructive interference of voltage waves from all 144 sources produces a smooth, spatially broad droop envelope rather than a sharp localized spike, because the uniform activation maintains symmetry across the grid.

PDNLab voltage droop simulation results for the H100 with all cores active
Simulation results. The voltage heatmap shows spatial droop across the die over time. SM cores at the die center experience the deepest droop, as their current path through the global grid to the nearest bump array is longest.

Scenario 2: Staggered SM Activation Within a Local GPC

A more realistic scenario: individual SMs within a single Graphics Processing Cluster (SM1 through SM6 and SM8) fire at slight offset delays from one another rather than simultaneously. Each SM uses the same triangular current profile, but with staggered start times, and the activation repeats 6 times total for an overall simulation window ending at approximately 12 ns. This models the compute wave rippling through the SM pipeline within a GPC, a pattern characteristic of real AI workloads where warp schedulers dispatch work in rapid succession rather than in perfect lockstep.

Counter-intuitively, this staggered activation produces a larger peak voltage droop of approximately 7.4 mV — compared to 5.9 mV in the full synchronous case. The reason is constructive interference of voltage waves within a localized region: as each SM fires in sequence, the voltage disturbance from one SM has not yet dissipated before the next SM begins drawing current. The overlapping wavefronts pile up in the confined area of the GPC, creating a cumulative droop that exceeds what any single SM could produce alone and, critically, exceeds the uniformly distributed droop of the full-activation scenario. This is a direct consequence of the wave-propagation physics discussed above — noise does not appear instantaneously across the die; it propagates, and staggered local sources can constructively reinforce it.

PDNLab voltage droop simulation results for staggered SM activation within a local GPC
Staggered SM activation within a single GPC. Despite lower instantaneous di/dt per timestep, the phase-offset firing pattern produces constructive interference of voltage waves in the local region, resulting in a deeper peak droop (~7.4 mV) than the full synchronous scenario (~5.9 mV).

Engineering Questions This Model Can Answer

An engineer using this model can ask questions that would otherwise require weeks of physical prototyping or full-chip EDA simulation:

  • What happens to the droop envelope if the global grid wire width is doubled?
  • How much additional MiM capacitance is needed to keep peak droop below 5% of Vdd (30 mV)?
  • What is the droop impact of migrating from lateral to vertical power delivery?
  • How does the droop distribution change if SM count is reduced for a lower-power SKU?
  • What is the comparative droop behavior of a training workload versus inference?

Each of these questions can be answered by modifying a parameter, running the simulation, and observing the result in minutes rather than days. The H100 model files are included as example projects in PDNLab and serve as templates that engineers can adapt to their own chip, package, and board designs.

As the two scenarios above demonstrate, PDNLab enables GPU architects and power integrity engineers to truly visualize the cumulative voltage droop produced by different activation patterns driven by their AI workloads. The ability to see how staggered local switching can produce worse droop than full synchronous activation — a result that is non-obvious from static analysis alone — is a highly valuable insight for understanding the complex electrical environment of a superchip. With PDNLab, these dynamics become explorable in minutes rather than hidden behind weeks of full-chip extraction.


The Trend Is Clear

The electrical challenges quantified in this H100 model are not unique to NVIDIA. NVIDIA's own Blackwell architecture increases power density further. AMD, Intel, and custom silicon programs at hyperscalers are all pushing in the same direction: more compute per unit area, more current per unit time, lower voltage margins. The fundamental scaling trend is not slowing down.

At the same time, the voltage droop problem becomes harder at each generation. Supply voltages have been dropping, from ~0.8 V in earlier nodes to ~0.6 V at 4nm/5nm. The absolute margin for droop shrinks with each reduction. A 5% droop on 0.8 V is 40 mV of headroom. A 5% droop on 0.6 V is only 30 mV. The physics is unforgiving.

The engineering response must be equally systematic. Voltage droop analysis cannot remain a specialized activity performed by a small team at the end of the design cycle. It must become a routine part of architectural exploration, accessible to the engineers making floorplan, packaging, and power delivery decisions from the earliest stages. PDNLab brings that capability to every engineer working on power delivery for AI silicon.


References

  1. NVIDIA, "NVIDIA H100 Tensor Core GPU Architecture," NVIDIA Whitepaper, 2022.
  2. NVIDIA, "NVIDIA Hopper Architecture In-Depth," NVIDIA Technical Blog, March 2022.
  3. M. Naumov et al., "NVIDIA Hopper H100 GPU: Scaling Performance," Hot Chips 34, IEEE, 2022.
  4. TSMC, "N4P and N4X Technology," TSMC 2022 Technology Symposium.
  5. S.K. Lim, Design for High Performance, Low Power, and Reliable 3D Integrated Circuits, Springer, 2013.
  6. M. Swaminathan and A.E. Engin, Power Integrity Modeling and Design for Semiconductors and Systems, Prentice Hall, 2007.