Achieving 448 Gbps signaling may require system architects and standards bodies, as well as PHY and interconnect designers, to rethink bandwidth provisioning. AI-scale computing offers an opportunity to revisit system architectures, and it may be time to bridge the interconnect, PHY, and compute cluster domains with a single actionable performance metric.

This work introduces a hardware-centric definition of compute cluster bisection bandwidth as a performance metric for AI-scale 448 Gbps systems. Unlike traditional abstractions, this metric is grounded in physical interconnect layout and IO port availability, enabling system architects to evaluate bandwidth provisioning through real, bidirectional link paths. Because it focuses on the minimal physical bisection of a topology, this approach supports clean, cost-aware comparisons across architectures and avoids assumptions about traffic patterns or runtime behavior.1

This article presents a top-down co-design methodology that begins with cluster topology and works downward into the electromagnetic and physical layers. This inversion of the conventional signal integrity (SI) workflow allows one to engage directly with system-level performance metrics, bridging the gap between PHY feasibility and workload-driven cluster efficiency. The framework leverages practical hardware configuration item breakdowns to support scalable accelerator fabrics, enabling link-by-link and pad-by-pad tradeoff analysis across compute domains.

The methodology is demonstrated in the context of 448 Gbps pathfinding and 400G serial interface development, where mmWave test fixtures and parallel PHY validation tools now operate at 100 GHz. These capabilities invite a shift in engineering roles, with PHY-level SI engineers taking on responsibilities traditionally held by hardware systems engineers to support the scale, efficiency, and reuse demands of modern GPU cluster networks. The cost and complexity of integration at 224 and 448 Gbps data rates will require systemic adaptability and reuse of lab infrastructure.

Not all mesh topologies are appropriate for accelerator mesh fabrics, and not all accelerator fabrics qualify as efficient many-point networks. This framework distinguishes between them by evaluating how efficiently a topology uses its available IO to reach across the cluster. By aligning physical interconnect behavior with compute cluster performance, the methodology enables topology-aware, IO-efficient, and PHY-agnostic co-design of high bandwidth AI systems.

Accelerator Mesh Co-Design Challenges

In conventional system design, bisection bandwidth is often defined through abstract or runtime-centric lenses. Graph-theory models treat it as the capacity of the smallest edge cut, assuming unit link capacity and ignoring physical feasibility, while MPI benchmarks measure throughput across a cut during execution, entangled with routing, congestion, and software stack effects. Vendor marketing can cause further confusion by presenting inflated aggregate bandwidth numbers based on non-minimal cuts, often detached from real-world constraints. These approaches, while useful in their respective domains, fail to provide a clean, actionable metric for hardware-centric co-design. They obscure the physical realities of interconnect layout, IO port availability, and packaging constraints: precisely the factors that dominate system-level performance in scaling up accelerator mesh fabrics. Unlike the graph-theory edge-cut models or MPI runtime metrics that OEM architects rely on, the definition introduced below enforces a minimal physical bisection normalized by IO, creating a common language that bridges system-level abstractions and SI realities.

Bisection bandwidth is here defined as the count of real, bidirectional hardware links crossing a minimal physical bisection of the compute node cluster, normalized by the number of IO ports per compute node. This definition is flat, minimal, and grounded in actual hardware, making it cost-aware and topology-sensitive. It favors designs that use IO to reach more destinations, not merely to push faster links to the same endpoints with linear data rate scaling. By avoiding assumptions about traffic patterns, software stack behavior, or promotional bias, this definition offers a layer-aligned view that allows interconnect, PHY, and cluster topology to be co-designed.
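As a minimal sketch, the definition reduces to a simple ratio; the function name and values below are illustrative, not drawn from any specific system:

```python
# Hardware-centric bisection bandwidth, per the definition above: real,
# bidirectional links crossing the minimal physical bisection, normalized
# by the IO ports available on each compute node. Values are hypothetical.

def normalized_bisection(cut_links: int, io_ports_per_node: int) -> float:
    """Bisection links per node IO port (flat, minimal, PHY-agnostic)."""
    return cut_links / io_ports_per_node

# Example: 64 bidirectional links cross the minimal cut of a cluster whose
# nodes each expose 8 IO ports.
print(normalized_bisection(cut_links=64, io_ports_per_node=8))  # -> 8.0
```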

Topology Analysis

Figure 1 introduces the concept of direct node-to-node mesh topology in the context of an SI channel design challenge. It originates from the 2023 SIJ article2 comparing PCB versus cabled backplane implementations for 112 Gbps links. It highlights how cabled backplanes can extend physical reach beyond the limits of traditional PCB routing, enabling more nodes to be interconnected directly in a mesh fabric. This is especially relevant for scaling beyond blade-local accelerator node meshes, where the ability to span blades with high-speed links becomes critical. By showing how topology and physical channel design intersect, Figure 1 provides the beginnings of the many-point network concept, where interconnect architecture is co-designed with physical layer constraints to support larger, more efficient accelerator clusters.

Figure 1. Backplane topologies lead to node-to-node meshing.

Additionally, Figure 1 introduces co-packaged connectors and the physical layer challenges associated with scaling 400 Gbps links to the PCB baseboard. It highlights how co-packaged connector technology addresses reach limitations by enabling high-density, low-loss interconnects directly at the package level.3 This is critical for implementing blade-level meshes, where local interconnects must support high bandwidth with minimal signal degradation.

For extended reach beyond the blade, co-packaged connector meshes can be paired with cabled backplanes or alternate board orientations within the cabinet. For example, a pizza-box switch oriented vertically (see Figure 1b) allows every direct-attach copper cable to remain short and connect directly to a horizontal compute blade stack, similar to how liquid cooling manifolds or 48 V bus bars are routed. Standard 19-in. rack-mount form factors favor rapid development and high-volume manufacturing cost benefits, but they also make short path lengths between nodes difficult to achieve.

Measurement Data for 448 Gbps Pathfinding

It is impossible to perform system-level analysis (and co-design) without a means for physical layer characterization. Currently, industry standards groups are convening to consider ways to achieve 448 Gbps signaling. Some components originally built for 224 Gbps analysis are proving helpful in this area. Figure 2 presents four plots from Samtec’s mmWave test fixture characterization work, each demonstrating aspects of SI performance at or beyond 100 GHz. These plots represent real, deployable measurement capabilities that directly support IEEE and OIF 400G PHY standardization efforts.5-7

Figure 2. Millimeter wave test fixtures, such as the Bulls Eye® ISI Evaluation Board,4 find use in 400G industry pathfinding. (BE90 EVB: Bulls Eye 90 GHz evaluation board. COM: channel operating margin.)

The data in Figure 2 shows that the mmWave test platform demonstrates insertion loss from approximately 6.1 dB to 42.5 dB at 82 GHz Nyquist across 40 to 442 mm paths, maintaining a near-linear response with ~1 dB insertion loss deviation (ILD) bandwidth to 90 GHz. Differential return loss remains around 15 dB through 90 GHz. Unlike fixtures that rely on tightly coupled differential lines, this design achieves ~55 dB adjacent single-ended lane isolation and leverages low-skew true/complement pairs (~1.5 ps P/N skew) to suppress skew-induced inter-symbol interference (ISI) during coupled-line propagation.

The fixture’s ability to maintain linear insertion loss and avoid resonant roll-off up to 100 GHz (see Figure 2c) is critical for validating next-generation SerDes channels, especially for PAM4 and PAM6 signaling. When designing interconnects for large-scale accelerator meshes, this level of measurement fidelity is essential because every millimeter of reach matters and every dB of margin is precious.
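As a back-of-the-envelope illustration of that sensitivity, the endpoint values quoted above imply a loss slope; this is a two-point estimate, not a fit to the measured curves:

```python
# Two-point loss-slope estimate at the 82 GHz Nyquist point, using only the
# endpoints quoted above: ~6.1 dB at 40 mm and ~42.5 dB at 442 mm of path.
il_short_db, len_short_mm = 6.1, 40
il_long_db, len_long_mm = 42.5, 442

slope_db_per_mm = (il_long_db - il_short_db) / (len_long_mm - len_short_mm)
print(f"~{slope_db_per_mm:.3f} dB/mm")          # ~0.091 dB/mm
print(f"~{1 / slope_db_per_mm:.0f} mm per dB")  # ~11 mm of reach per dB of budget
```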

The mmWave test fixtures used in these next-generation serial standard pathfinding studies, such as the Samtec Bulls Eye® ISI Evaluation Board (see Figure 2b), function as 100 GHz single-ended buses that are 8 bits wide for validating x4 SerDes quads. The well-isolated single-ended design prevents skew from being converted to ISI during propagation on coupled transmission lines. While originally designed for multi-lane SerDes characterization, these fixtures are broadly useful for parallel PHY validation as well because their connector test points support wider coaxial count arrays. This flexibility enables high-fidelity signal capture across multiple lanes simultaneously.

Figure 3 presents measurements from a Samtec BE70 ISI Evaluation Board (a 2×8 lane, 224 Gbps channel emulator originally designed for validating 4-lane differential SerDes quads). The board functions as an 8-bit single-ended bus, and its parallel lane structure, channel loss slope, and dynamic range align closely with the Nvidia NVLink chip-to-chip (C2C) die-to-die SI budget.8 The measured reach of approximately 250 mm of PCB stripline plus 300 mm of cable significantly exceeds the 60 mm reach typically associated with NVLink C2C off-package links, suggesting that the electrical feasibility of NVLink C2C may extend well beyond its original design envelope. When paired with low-skew, single-ended coaxial cabling and precision RF fixture design techniques implemented in HVM PCB processes and materials, the evaluation board demonstrates C2C scale electrical budgets at blade-scale physical reaches.

Figure 3. When paired with low-skew, single-ended coaxial cabling and precision RF fixture design techniques implemented in HVM PCB processes and materials, the ISI evaluation board demonstrates C2C scale electrical budgets at blade-scale physical reaches.

Accelerator Mesh Co-Design

Figure 4 illustrates the layered bandwidth stack from the electromagnetic (EM) domain to compute cluster topology. The bottom three layers (electromagnetic domain, circuit assemblies, and statistical channel modeling) were established in previous work as foundational to PHY and interconnect characterization.1 The top layer added here represents the compute cluster performance domain, specifically for IO-bound workloads. This addition is enabled by the hardware-centric bisection bandwidth definition, which connects physical interconnect behavior to system-level performance through the stack. This alignment allows the construction of a closed-form analytic framework by inserting a realistic hardware configuration breakdown. Node counts, blade layouts, and IO port mappings can be thought of as configurations that translate abstract bandwidth metrics into actionable system architecture decisions.

Figure 4. Four levels of potential co-design span from the accelerator mesh down to the EM domain.

Bisection Bandwidth: Detailed Explanation and Scaling

Understanding how bisection bandwidth scales across different mesh topologies is central to evaluating interconnect efficiency in accelerator clusters. Figure 5 visualizes node arrangements and the corresponding bisection cut lines for several canonical topologies: 2D torus, 3D torus, hypercube, and all-to-all. Each diagram shows how many links cross the bisection plane, which directly determines the system’s ability to move data between two halves of the cluster. These link counts are derived from the physical layout and connectivity rules of each topology. For example, a 2D torus with wraparound links (see Figure 5a) yields a bisection link count of 2 × √N, while a 3D torus scales as 2 × N^(2/3), and a hypercube scales linearly as N/2.

Figure 5. Bisection links showing node mesh topologies and scaling. The bisection cut intersects a line, a surface, or a dimensional axis, depending on the topology.
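These scaling rules reduce to a few lines of code. The 2D torus, 3D torus, and hypercube counts follow the formulas above; the all-to-all count (N²/4 for an even split) is the standard result, included here as an assumption for comparison:

```python
import math

# Bidirectional links crossing the minimal bisection, by topology. N is
# assumed to fit the canonical layout: a perfect square (2D torus), a
# perfect cube (3D torus), or a power of two (hypercube).

def torus_2d(n: int) -> int:
    return 2 * math.isqrt(n)         # 2 x sqrt(N); wraparound doubles the cut

def torus_3d(n: int) -> int:
    return 2 * round(n ** (2 / 3))   # 2 x N^(2/3)

def hypercube(n: int) -> int:
    return n // 2                    # N/2; each node has one link across the cut

def all_to_all(n: int) -> int:
    return (n // 2) * (n - n // 2)   # N^2/4; each half links to all of the other

for n in (64, 4096):                 # both are squares, cubes, and powers of two
    print(n, torus_2d(n), torus_3d(n), hypercube(n), all_to_all(n))
```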

The approach shown in Figure 5 for defining bisection bandwidth is intentionally PHY-agnostic. It does not assume a specific signaling rate, encoding scheme, or protocol. Instead, it focuses on the number of bidirectional links crossing the minimal cut, which can later be multiplied by the per-link bus width and data rate. This makes it a clean and minimal abstraction for system-level bandwidth provisioning. By avoiding assumptions about traffic patterns or runtime behavior, it enables comparisons across topologies and hardware implementations.

Figure 6 reinforces this concept by plotting the total bisection link count and the normalized bisection bandwidth per node IO. The latter metric (system bisection bandwidth divided by total outbound IO bandwidth) reveals how efficiently a mesh topology uses its available IO to reach across the cluster. It highlights that some topology types (such as all-to-all) scale bandwidth faster than IO port count, while others (such as star) remain fixed regardless of port count. 

Figure 6. Analyzing HW-centric bisection BW (links) vs. node count (normalized to the number of IO used) helps designers evaluate how efficiently a topology uses IO.

The normalized view in Figure 6 is especially powerful for co-design. It allows system architects to evaluate not just how much bandwidth a topology provides, but how efficiently it uses IO to scatter data away from each node. In AI-scale workloads, where collective operations and many-to-many exchanges dominate, this efficiency can matter more than raw data rate. A topology that doubles its bisection bandwidth without doubling its IO port count is inherently more scalable, though typically more difficult to implement at the interconnect level. The plots in Figure 6 show that mesh topologies can achieve this, especially when paired with co-packaged interconnects and optimized PHYs. This is the essence of the co-design framework: enabling topology-aware, IO-efficient, and PHY-agnostic co-design of accelerator mesh fabrics.
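A short sketch reproduces the Figure 6 trend under assumed per-node port counts (4 for a 2D torus, log2 N for a hypercube, N-1 for all-to-all); the absolute values are illustrative, but the divergence between topologies is the point:

```python
import math

# Normalized bisection BW: links crossing the minimal cut divided by the
# cluster's total outbound IO links (N nodes x ports per node). Per-link
# bus width and data rate cancel out, keeping the metric PHY-agnostic;
# multiply cut links by lanes/link x Gbps/lane to recover absolute bandwidth.

def normalized(cut_links: int, n: int, ports_per_node: int) -> float:
    return cut_links / (n * ports_per_node)

for n in (64, 1024, 4096):  # perfect squares that are also powers of two
    torus = normalized(2 * math.isqrt(n), n, 4)        # 2D torus, 4 ports/node
    cube = normalized(n // 2, n, int(math.log2(n)))    # hypercube, log2(N) ports
    a2a = normalized((n // 2) ** 2, n, n - 1)          # all-to-all, N-1 ports
    print(f"N={n:5}  torus={torus:.3f}  hypercube={cube:.3f}  all-to-all={a2a:.3f}")

# The all-to-all ratio holds near 0.25 as N grows, while the torus ratio
# decays: torus IO keeps pushing to the same neighbors rather than reaching
# farther across the cluster.
```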

Scalable Accelerator Reference Architecture for Mesh Interconnects (SARAMI)

Figure 7 presents a tabular breakdown of node counts mapped to realistic OCP OAM blade form factors, serving as an example SARAMI. This figure translates the abstract concept of scalable accelerator mesh fabrics into tangible hardware configurations through a simple breakdown of hardware configuration items, showing how compute nodes scale within blade and cabinet constraints. It overlays actual dimensions, power envelopes, and packaging densities. By grounding the possible mesh topologies in real-world form factors, system architects can quickly evaluate how many nodes can be supported per blade, how blades stack within a cabinet, and how interconnect provisioning aligns with physical layout and scale-up network goals. This then allows closely coupled PHY/interconnect solutions to be considered as candidate technologies for the links.

Figure 7. SARAMI reference architecture table overlays actual dimensions, power envelopes, and packaging densities.
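A minimal sketch of such a configuration-item breakdown follows; the values are illustrative placeholders, not the entries in Figure 7:

```python
from dataclasses import dataclass

@dataclass
class MeshConfig:
    """SARAMI-style hardware configuration items (placeholder values)."""
    nodes_per_blade: int = 8       # e.g., OAM-class accelerator modules per blade
    blades_per_cabinet: int = 16
    io_ports_per_node: int = 8     # scale-up fabric ports per node
    lanes_per_port: int = 4
    gbps_per_lane: float = 448.0

    @property
    def total_nodes(self) -> int:
        return self.nodes_per_blade * self.blades_per_cabinet

    @property
    def outbound_tbps_per_node(self) -> float:
        # Absolute per-node IO bandwidth: ports x lanes x data rate.
        return self.io_ports_per_node * self.lanes_per_port * self.gbps_per_lane / 1e3

cfg = MeshConfig()
print(f"{cfg.total_nodes} nodes/cabinet, {cfg.outbound_tbps_per_node:.2f} Tbps outbound per node")
# -> 128 nodes/cabinet, 14.34 Tbps outbound per node
```

Pairing a breakdown like this with the topology link counts above closes the loop from blade packaging to cluster-level bisection provisioning.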

Conclusion and Next Steps

This work presents a reference framework for evaluating PHY and interconnect scalability in accelerator mesh fabrics intended for GPU and accelerator scale-up networks. It is grounded in a hardware-centric definition of bisection bandwidth that enables PHY-agnostic interconnect provisioning. By aligning electromagnetic, circuit, and statistical channel domains with compute cluster performance, this article presents a blueprint for a simple closed-form analytic framework that spans from SI to system-level bandwidth provisioning tailored for AI parallel compute workloads.

Future generations of AI hardware systems may need to consider blade orientation in the cabinet as a degree of freedom. For instance, the same traditional 19-in. racks that enable rapid integration and HVM cost containment of circuit card assemblies also push implementations toward highly localized, blade-level compute cluster mesh interconnect. In systems where liquid cooling is required for every 1 kW node, and airflow constraints no longer dictate board orientation, orthogonal board orientations in cabinets can help IO reach more destinations by reducing the geometric path length between compute nodes.

This work offers a practical lens for future system design, where interconnect topology, PHY performance, and packaging geometry are co-optimized. The importance of high system bisection bandwidth becomes especially clear when viewed through the lens of modern AI problem classes. Workloads such as large-scale model parallelism, deep reinforcement learning, graph neural networks, physics-informed simulations, and large-batch data parallelism all rely on frequent, high-volume communication across distributed nodes. Ranging from all-to-all exchanges of activations and gradients to structured mesh updates in scientific machine learning, these patterns demand interconnect architectures and mesh fabrics that can sustain bandwidth across the entire cluster.9,10 The proposed blueprint framework directly addresses this need by overlaying topology-aware co-design that scales bandwidth with node count and IO efficiency as well as link data rate, ensuring that the interconnect does not become the bottleneck in compute-intensive, communication-heavy AI systems.

References

  2. A. Josephson, B. Gore, and J. Sprigler, “Selecting a Backplane: PCB vs. Cable for High-Speed Designs,” Signal Integrity Journal, 2023.
  3. B. Gore, et al., “Beyond 200G: Brick Walls of 400G Links per Lane,” DesignCon 2025.
  4. Bulls Eye® ISI Evaluation Boards, Samtec.
  5. A. Josephson and L. Mei, “Developing a Test Fixture for 200G with Pathway to 400G,” Korea Test Conference, July 14, 2025.
  6. A. Josephson, et al., “Measured 400 Gb/s per Lane Channel Files,” IEEE 802.3 New Ethernet Applications Ad Hoc, 2025.
  7. A. Josephson, “Measurement Results for a 448 Gbps Physical Channel,” OIF 448G AI Workshop, April 2025.
  8. Y. Wei, et al., “9.3 NVLink-C2C: A Coherent Off Package Chip-to-Chip Interconnect with 40Gbps/pin Single-ended Signaling,” 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2023, pp. 160-162, doi: 10.1109/ISSCC42615.2023.10067395.
  9. Y. Zhou, Y. Li, and Y. Wang, “Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication,” ACM Transactions on Architecture and Code Optimization, 2024.
  10. Y. Zhang, Y. Li, Y. Wang, et al., “WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips,” Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), 2024, https://dl.acm.org/doi/epdf/10.1145/3695053.3731101.