AI Super-Connected Era: Why CPO is the Next Trillion-Dollar Semiconductor Opportunity

Since ChatGPT burst onto the scene at the end of 2022, AI has ignited one semiconductor super-cycle after another, spanning computing power (GPUs), storage power (memory), and command scheduling power (CPUs), and creating multiple trillion-dollar market-cap companies.

If there remains a sector in AI infrastructure yet to produce a trillion-dollar "explosive contender," we at Dolphin Insights believe the next most promising area is super-connectivity in the AI era. If computing power solves the "IQ" problem of AI, and storage power solves the "memory" problem, then transport power addresses how to move long-term and short-term memories into and out of the brain center at rocket-like speeds.

Or, to borrow the words of AI luminary Jensen Huang, as computing power and memory bottlenecks are gradually alleviated (with energy remaining a decade-long persistent challenge), the next core bottleneck is high-speed network interconnectivity in the AI era. The traditional cloud-era network infrastructure is utterly incapable of meeting the network bandwidth demands of the trillion-parameter models, Mixture of Experts (MoE) architectures, and partial activation scenarios characteristic of the Agentic AI era.

In this piece, we explore CPO (Co-Packaged Optics), the photonic-electronic transmission technology direction driven by the shift in AI network transmission speeds, as a way to examine network transmission in the AI era. Our study of CPO is divided into:

I. What is CPO, and can it truly replace traditional copper connections?
II. Can it completely replace the currently mainstream pluggable optical modules?
III. Under this trend, how will the competitive landscape change among upstream and downstream companies in the industry?

In this article, we will first sort out the basic issues of the industry chain.

Main Text:

I. What is CPO?

In traditional data center architectures, a crucial component is the "optical module." Its function is to convert incoming optical signals transmitted via fiber optics into electrical signals for the data center, or to convert electrical signals generated within the data center into optical signals for transmission through fiber optics. It acts as a "bridge" and a "translator" in data transmission.

Functionally, a CPO (Co-Packaged Optics) architecture encompasses the functions of a traditional optical module but has the following two distinct differences:

  1. Structural Differences
    Traditional optical modules are pluggable, resembling the RJ45 connector on a home network cable. CPO is entirely different; it integrates the optical engine responsible for photoelectric conversion directly with the chip (primarily the switch ASIC) onto the same packaging substrate or interposer.

  2. Application Scenario Differences
    Optical modules are typically used for inter-rack connectivity (Scale-Out). CPO can be applied both between racks and within a rack (Scale-Up). When used between racks, it replaces traditional optical modules; when used within a rack, it replaces the currently mainstream copper connections.

We can observe that recently, both NVIDIA (NVDA.US) and Broadcom (AVGO.US) are actively promoting their CPO switch solutions.

So, why is CPO technology receiving such significant attention? As data centers' demand for computing power persistently rises, the bandwidth requirements for data transmission are exploding. Data centers are also evolving towards hyperscale computing clusters, and in this process, legacy data transmission technologies will create numerous obstacles:

  1. Bandwidth Bottleneck
    For inter-rack scenarios, the limited panel space on traditional switches and the difficulty of further miniaturizing pluggable optical modules restrict the number of ports a single switch can provide, so the switch cannot keep up with rising bandwidth demands. Currently, pluggable modules support up to 1.6 Tbps per module, with a single switch panel supporting a maximum of 51.2 Tbps. Future modules might reach 3.2 Tbps, pushing switches to a maximum of 102.4 Tbps, which is nearly the limit for pluggable optics.

  2. Signal Integrity Bottleneck
    For intra-rack scenarios, as transmission rates increase, traditional copper cables suffer severe signal attenuation and distortion over distance, so usable cable lengths become ever shorter. Currently, copper cables support a maximum bandwidth of 1.8 TB/s (as with NVIDIA's NVLink copper cables) and are strictly limited to under 2 meters. However, bandwidth demand per GPU is advancing towards 3.6 TB/s.

  3. Thermal and Power Consumption Bottlenecks
    As transmission rates climb, the power consumption of traditional communication links skyrockets, making heat dissipation increasingly difficult. Given the immense energy constraints facing US data center construction, power consumption issues bring significant cost pressure. CPO theoretically addresses these issues effectively; according to NVIDIA, power efficiency can improve by 3.5 times with CPO.
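
The figures quoted in the three bottlenecks above can be sanity-checked with some quick arithmetic. The sketch below is illustrative only and rests on stated assumptions: the port count is simply inferred by dividing the quoted switch bandwidth by the quoted per-module rate, the TB/s to Tbps conversion assumes 8 bits per byte with no encoding overhead, the "3.5 times" efficiency gain is read as the same bandwidth at 1/3.5 of the power, and the 35 W link power is a hypothetical placeholder rather than a measured value.

```python
# Back-of-the-envelope checks for the bottleneck figures quoted above.

def ports_needed(switch_tbps: float, module_tbps: float) -> int:
    """Number of pluggable modules needed to fill a switch's total bandwidth."""
    return round(switch_tbps / module_tbps)

def tbytes_to_tbits(tbytes_per_s: float) -> float:
    """Convert TB/s to Tbps, assuming 8 bits per byte and no coding overhead."""
    return tbytes_per_s * 8

def power_after_gain(baseline_watts: float, efficiency_gain: float = 3.5) -> float:
    """Power for the same bandwidth if efficiency improves by `efficiency_gain`."""
    return baseline_watts / efficiency_gain

if __name__ == "__main__":
    # Both generations imply the same number of front-panel modules.
    print(ports_needed(51.2, 1.6), ports_needed(102.4, 3.2))
    # Copper today vs. the per-GPU demand the article mentions, in Tbps.
    print(tbytes_to_tbits(1.8), tbytes_to_tbits(3.6))
    # Hypothetical 35 W optical link budget, scaled by the quoted 3.5x gain.
    print(power_after_gain(35.0))
```

Under these assumptions, both the current and the next pluggable generation fill the panel with the same number of modules, which is one way to read the claim that panel space, not module speed alone, caps pluggable switches.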

II. Specifically, What are the Data Transmission Scenarios in a Data Center?

Here, we break down the data transmission technology routes across different scenarios and stages within a data center:

  1. Scale-up: Primarily Intra-Rack Interconnects
    This mainly involves hardware interconnects within a rack, especially inside servers, including connections among CPUs, GPUs, NICs, DDR memory, and hard drives. Currently, these connections predominantly use copper as the medium—including PCIe slots and memory slots (PCB copper traces), SATA cables, etc., connecting CPUs, GPUs, and NICs. CPO has the potential to disrupt this mainstream approach.

  2. Scale-out: Primarily Inter-Rack Interconnects
    This involves connections between racks or between servers and switches. These connections primarily rely on optical fiber and pluggable optical modules. Here too, CPO is a key development trend and is progressing faster than intra-rack scenarios.

  3. Furthermore, there are inter-data center connections and connections between data centers and the outside world, which are beyond the scope of this discussion.

  4. From the perspective of industry giants, CPO is currently targeting inter-rack scenarios initially but may address intra-rack scenarios in the future.

III. CPO is Still in the Early Stages of Promotion; What are the Main Bottlenecks?

  1. Maturation of Advanced Packaging Technologies
    Fundamentally, CPO differs significantly from traditional approaches like pluggable optical modules. The production technology for traditional optical modules does not differ drastically from that of general optoelectronic components and modules. CPO, however, requires packaging the optical engine onto a substrate or interposer, relying heavily on advanced packaging technologies like CoWoS. At the same time, CPO differs from standard advanced packaging because it integrates not just electronic integrated circuits (EICs) but also photonic integrated circuits (PICs). This heterogeneous integration necessitates techniques like TSMC's COUPE technology for hybrid bonding. The challenge is twofold. First, these advanced packaging techniques are extremely difficult, and both NVIDIA and Broadcom depend on limited TSMC capacity. Second, the supply of necessary components and equipment, such as optical coupling devices, hybrid bonding tools, testing equipment, and materials like ABF substrates, may also face constraints. Moreover, production yields for these advanced packaging technologies, especially heterogeneous integration, still need significant improvement, which keeps costs far above those of pluggable solutions. TSMC is actively improving yields, but it will take time.

  2. Maintenance and Repair Issues
    Traditional pluggable solutions are easy to maintain and repair due to their "pluggable" nature. CPO is entirely different; its optical engine, substrate, interposer, and even the chip are directly packaged together, making maintenance significantly more challenging. However, these issues can be mitigated through design improvements, such as building in fault tolerance or deploying operational redundancies.

  3. Thermal Management Issues
    High-density packaging of the optical engine and the chip leads to significant localized temperature increases during operation, potentially exceeding the tolerance limits of the lasers. Thus, thermal management is a major issue. Solving this requires introducing more efficient cooling solutions, which also involves cost implications.

  4. Standardization Issues
    Currently, to seize market initiative, companies like NVIDIA and Broadcom are actively launching their complete, proprietary CPO switch solutions. Meanwhile, industry standards (interface standards, packaging standards, etc.) have yet to be established. This prevents upstream and downstream players from conducting R&D, production, and configuration based on unified standards, posing another hurdle to commercial promotion.

In summary, solutions exist for these issues, but they depend on technological maturation and standard-setting, all of which require time. Fundamentally, CPO technology needs to demonstrate an advantage in overall cost.

This raises a further question: Regardless of the approach, cost is always a core consideration. However, other routes, both more advanced and more conservative, are also in development. What is the relationship among them? Here, we differentiate the various technological routes.

IV. Comparison of Technological Routes

  1. CPO (Co-Packaged Optics)
    As discussed, CPO involves packaging the optical engine and the chip on the same substrate. The chip here can be a switch chip (ASIC) or a compute chip like a GPU, but usually refers to a switch chip.

  2. NPO (Near-Packaged Optics)
    NPO is a step below CPO. It doesn't achieve packaging on the same substrate or interposer scale; instead, the optical engine and the chip are packaged merely on the same PCB motherboard. In China, companies like Alibaba (BABA.US) and Huawei are promoting NPO solutions. This can be seen more as a compromise in the absence of advanced packaging capacity, potentially becoming a mainstream approach in the Chinese market for a period, which could affect the penetration of NVIDIA's solutions there.

  3. OIO (Optical I/O)
    OIO can be considered an advanced form of CPO, but without involving switch chips—it relates primarily to compute chips. OIO entails packaging the optical engine with the compute chip or even integrating them directly at the chip level. This is aimed squarely at intra-rack scenarios.

Let's clarify the data center architecture here:
A data center can be viewed as the interconnection of the following parts:

  • Servers focus on computing tasks, housing compute chips like GPUs and CPUs, along with memory and hard drives.

  • Switches handle network communication between servers and from servers to the outside world, performing data exchange via ASIC chips.

  • Additionally, there are storage systems. In current mainstream data center architectures, storage devices are primarily distributed across server nodes, placed inside the servers, and integrated with them.

Based on this architecture, we can envision the application scenarios for CPO. Given this, let's discuss why CPO deployment begins with switch chips. To use an analogy: a switch can be seen as the overpass within a data center. Consequently, the switch bears the greatest pressure in terms of data transmission bandwidth, port density, and the accompanying power consumption bottleneck, making the demand for CPO most urgent there.

  4. CPC (Co-Packaged Copper)
    CPC refers to integrating high-speed copper connectors directly onto the packaging substrate. This route offers significant cost advantages but still fails to resolve copper's bandwidth bottleneck and signal attenuation issues. Thus, its application scenarios are limited, potentially applicable for some connections within a rack, such as between GPU/CPU nodes and switches or storage chips. Currently, NVIDIA's intra-rack solutions still use copper connections but may switch to optical interconnects in the future.

  5. LPO (Linear-Drive Pluggable Optics)
    LPO is a streamlined version of pluggable optics. By removing the internal DSP/CDR chip and retaining only the analog chips, namely the Driver and TIA (the roles of these components will be explained later), it achieves direct signal driving. Essentially, it eliminates the high-power-consumption DSP chip from the optical module, forgoing signal error correction. At the same time, it strengthens the analog chips so that the electrical signal from the switch ASIC drives the laser directly through analog amplification, with no correction of signal errors along the way. However, problems persist here as well. Since the PCB traces are not omitted (causing signal attenuation), and signal quality requirements are even higher, long-distance transmission remains limited. Furthermore, signal integrity issues become particularly prominent when rates move higher (above 1.6T). In other words, simplifying the structure comes at the cost of performance.

In summary, while compromise routes like NPO, CPC, and LPO exist, they will inevitably face bottlenecks as data centers advance towards higher speeds and larger clusters. CPO represents a next-generation solution that must eventually be pursued.

  6. What is an Optical Circuit Switch (OCS), and Will it Threaten CPO's Position?
    This discussion inevitably leads to the Optical Circuit Switch (OCS). The core feature of this type of switch is the complete absence of optical-electrical conversion throughout the process. Using an optical switch matrix, it establishes physical optical paths directly within the optical domain. Intuitively, one can imagine it composed of arrays of tiny mirrors (micromirror arrays) that adjust their angles according to instructions, reflecting light in different directions.

    Superficially, OCS directly forwards optical signals, replacing the optical-to-electrical and electrical-to-optical conversion processes in traditional switches. It might seem that with this technology, CPO wouldn't be needed (at least not for switches). But that is not the case.

    Let's outline how switch architecture is built in a data center:

    • (1) Within a Motherboard: First, core computation in a data center is executed by GPUs. After computation, data needs to be passed to the CPU. Post-CPU processing, it's transmitted to the Network Interface Card (NIC, containing an ASIC), or alternatively, directly from GPU to NIC. These steps can occur on a single motherboard, or at least within a single server.

    • (2) Within a Rack: Next, data must travel from the server to the rack's switch. Multiple servers within a rack can interconnect at high speeds, but the rack requires a top-of-rack switch (ToR switch, the Leaf switch in a leaf-spine topology) for external communication, exchanging data between the rack's contents and the outside world.

    • (3) Between Racks: A data center comprises a cluster of multiple racks. How is communication between racks scheduled? This requires the Spine switch. The Spine switch manages high-speed connections between all Leaf switches and connections external to the data center, acting as the hub of the entire switch network within the data center.

    • (4) OCS primarily aims to replace the Spine switch. Firstly, Spine switches are expensive and power-hungry, making the need for alternatives most pressing. Secondly, OCS has limited functionality; it can only forward signals (reflect light), like a mirror. A traditional switch has more complete functions: it unpacks data packets, examines IP addresses, and decides where to forward them. Therefore, using OCS only as a Spine switch is feasible because it merely executes instructions without decision-making capability. However, attempting to replace Leaf switches with OCS would necessitate adding other components to perform packet processing functions, such as smart Network Interface Cards (SmartNICs). This would complicate the architecture and might not be the optimal solution.

    Thus, the picture becomes clear: Although currently, CPO-based switches like NVIDIA's Quantum X800-Q3450 and Broadcom's Tomahawk 6 - Davisson are Spine switches, competing directly with Google's (GOOG.US) promoted OCS, which also targets replacing traditional Spine switches, the endgame suggests a more complementary relationship. While OCS has the opportunity to replace Spine switches, CPO remains necessary further downstream: from the photoelectric conversion between the optical engine and ASIC in the more numerously deployed Leaf switches, to connections between motherboards within servers (via NIC ASICs or technologies like NVSwitch), and further down to connections between compute chips on the motherboard and between compute chips and the NIC ASIC.
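
The functional difference between a packet switch and an OCS described above can be sketched in a few lines. This is a deliberately simplified model, not how any vendor's product is implemented: the class names, the routing table, and the port numbers are all hypothetical, and real switches involve far more (buffering, queuing, control planes).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    dst_ip: str   # destination address carried in the packet header
    payload: str

class PacketSwitch:
    """Electrical switch: reads each packet's header and picks the output port."""
    def __init__(self, routes):
        self.routes = dict(routes)  # hypothetical table: destination IP -> port

    def forward(self, in_port, packet):
        # The decision depends on the packet contents, not on the input port.
        return self.routes[packet.dst_ip]

class OpticalCircuitSwitch:
    """OCS: mirrors are set in advance; light is redirected without being read."""
    def __init__(self):
        self.mirrors = {}  # input port -> output port, set by an external controller

    def configure(self, in_port, out_port):
        self.mirrors[in_port] = out_port

    def forward(self, in_port, signal):
        # The signal is never inspected; only the pre-set path matters.
        return self.mirrors[in_port]

if __name__ == "__main__":
    a, b = Packet("10.0.0.1", "x"), Packet("10.0.0.2", "y")
    ps = PacketSwitch({"10.0.0.1": 1, "10.0.0.2": 2})
    ocs = OpticalCircuitSwitch()
    ocs.configure(0, 3)
    print(ps.forward(0, a), ps.forward(0, b))    # different ports per destination
    print(ocs.forward(0, a), ocs.forward(0, b))  # same port regardless of content
```

In this toy model, the packet switch separates two packets arriving on the same port, while the OCS sends both wherever its mirrors currently point, which is why something else has to make the forwarding decisions if an OCS is placed at the Leaf layer.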

V. Which Industry Chain Segments are Involved?

(I) First, Let's Analyze the Principles and Architecture of CPO
CPO can be seen as an upgraded optical engine, whose function is photoelectric conversion. It primarily comprises:

  1. Photonic Circuit Part

    • Modulator: Writes electrical signals (0/1 digits) into optical signals by controlling the intensity and phase of light.

    • Detector: This is a PD (Photodiode), converting optical signals back into electrical signals.

    • Waveguide: Can be understood as microscopic optical fibers printed onto the chip.

  2. Electronic Circuit Part

    • Driver: Amplifies the weak electrical signal from the switch or server into a signal powerful enough to precisely control the laser's emission. The driver's output goes to the modulator.

    • TIA (Transimpedance Amplifier): Amplifies the extremely weak electrical signal generated by the PD and converts it into a voltage signal processable by subsequent circuits. Thus, the TIA follows the PD.

  3. Light Source (Laser)
    The modulator cannot emit light on its own but can control it. Hence, a light-emitting component—the laser—is needed in conjunction.
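
To make the division of labor among these parts concrete, here is a toy model of the signal chain: driver, modulator, waveguide, photodiode, TIA, and a final bit decision. It is a sketch under strong simplifications (simple on-off keying, no noise, no bandwidth limits), and every numeric parameter below is a hypothetical placeholder rather than a specification of any real device.

```python
# Toy on-off-keying link: driver -> modulator -> waveguide -> photodiode -> TIA.
# All constants are hypothetical placeholders chosen only for illustration.

DRIVE_SWING_V = 2.0        # driver output voltage for a 1 bit
LASER_MW = 10.0            # continuous light supplied by the laser (mW)
EXTINCTION = 0.1           # fraction of light the modulator passes for a 0 bit
WAVEGUIDE_LOSS_DB = 3.0    # optical loss between modulator and detector
RESPONSIVITY_MA_PER_MW = 0.8   # photodiode: optical power -> current
TIA_GAIN_V_PER_MA = 1.0        # TIA: current -> voltage

LOSS_FACTOR = 10 ** (-WAVEGUIDE_LOSS_DB / 10)

def drive(bits):
    """Driver: turn logic bits into a voltage strong enough for the modulator."""
    return [DRIVE_SWING_V * b for b in bits]

def modulate(drive_volts):
    """Modulator: let the laser light through (1) or mostly block it (0)."""
    return [LASER_MW if v >= DRIVE_SWING_V / 2 else LASER_MW * EXTINCTION
            for v in drive_volts]

def propagate(powers_mw):
    """Waveguide: attenuate the optical power by a fixed loss."""
    return [p * LOSS_FACTOR for p in powers_mw]

def detect(powers_mw):
    """Photodiode: convert optical power into a small current (mA)."""
    return [p * RESPONSIVITY_MA_PER_MW for p in powers_mw]

def amplify(currents_ma):
    """TIA: convert the photodiode current into a usable voltage."""
    return [i * TIA_GAIN_V_PER_MA for i in currents_ma]

def decide(volts):
    """Slice at the midpoint between the expected 1 and 0 voltage levels."""
    high = LASER_MW * LOSS_FACTOR * RESPONSIVITY_MA_PER_MW * TIA_GAIN_V_PER_MA
    threshold = (high + high * EXTINCTION) / 2
    return [1 if v > threshold else 0 for v in volts]

def link(bits):
    return decide(amplify(detect(propagate(modulate(drive(bits))))))

if __name__ == "__main__":
    print(link([1, 0, 1, 1, 0]))
```

Note that the laser power is constant throughout; only the modulator's transmission changes with the data, which matches the point that the modulator controls light but does not emit it.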

Additionally, two other components exist:

  1. DSP and CDR, both used for repairing electrical signals. DSP compensates for physical damage to the electrical signal, while CDR extracts a precise clock from the damaged signal and reconditions the data timing. The DSP chip typically integrates CDR functionality. Similar to LPO, CPO removes the high-power, high-cost, latency-inducing DSP from the optical engine. However, in the CPO scheme, some DSP functions are integrated into the switch ASIC (unlike LPO's brute-force analog amplification approach), and the CDR is integrated into the high-speed SerDes.
    And what is a high-speed SerDes? It includes a Serializer and a Deserializer, located inside the ASIC chip. They are used, respectively, to pack parallel data from within the chip into a high-speed serial data stream, or to unpack a high-speed serial data stream back into multiple lower-speed parallel data streams.
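
The pack and unpack roles of a SerDes can be illustrated with a minimal sketch. It shows only the data reordering, assuming fixed-width words sent most significant bit first; a real SerDes also handles line encoding, equalization, and clock recovery, none of which is modeled here, and the function names are hypothetical.

```python
def serialize(words, width):
    """Serializer: flatten parallel words into one bit stream, MSB first."""
    bits = []
    for word in words:
        for shift in reversed(range(width)):
            bits.append((word >> shift) & 1)
    return bits

def deserialize(bits, width):
    """Deserializer: regroup a serial bit stream into parallel words."""
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words

if __name__ == "__main__":
    parallel = [0b1010, 0b0011]       # two 4-bit words from inside the chip
    stream = serialize(parallel, 4)   # one serial stream on the wire
    print(stream)
    print(deserialize(stream, 4))     # recovered parallel words
```

The round trip returns the original words, which is the property that matters: many slow internal lanes are funneled onto one fast external lane and restored on the other side.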

(II) Next, Let's Examine the Segments Involved in the Entire CPO Industry Chain:

  1. First, the CPO Assembly
    The optical engine within the CPO encompasses the photonic and electronic circuit parts mentioned above. The optical engine and the ASIC chip constitute the main part of the CPO switch. The core question here is: who will manufacture this CPO? Traditional optical modules, composed of optical components, discrete devices, etc., as independent modules, could be fully provided by specialized manufacturers, such as well-known names like Zhongji Innolight (300308.SZ), Eoptolink Technology (300502.SZ), and Coherent. So, what about CPO? Clearly, they can no longer dominate. We tend to believe the industrial value distribution under CPO will evolve as follows:

    • (1) Switch Vendors and Platform Providers Mastering Core Technology: Data center system platform providers and switch chip vendors like NVIDIA, Google, Broadcom, and Marvell Technology (MRVL.US) will define the architecture and standards and sell complete products.

    • (2) Foundries: Wafer fabrication and packaging/testing houses like TSMC (TSM.US), ASE Technology Holding (ASX.US), and Amkor Technology (AMKR.US) will undertake wafer manufacturing, optoelectronic integration, and advanced packaging foundry services.

    • (3) Upstream Component Suppliers: Device manufacturers like Coherent Corp. (COHR.US) and Lumentum Holdings (LITE.US) will continue producing and supplying optoelectronic devices.

    • (4) Traditional Optical Module Manufacturers: Companies like Zhongji Innolight and Eoptolink will, during the transition period, provide intermediate routes like NPO and LPO, and continue offering optical engine modules under compromise CPO designs conceived for maintainability considerations.

  2. Beyond the Core CPO Optical Engine, Several Other Components Require Attention

    • (1) Lasers
      CPO can only integrate the photoelectric conversion components; directly integrating the laser remains challenging, hence external lasers are still required. At the same time, CPO significantly increases the power requirements for lasers (at least 3-4 times), leading to significantly higher performance and reliability demands and thus substantially higher value.
      However, technological route choices exist here:

    • EML Laser: Traditional route, integrating the laser and modulator together. Its advantage is suitability for high bandwidth (>200G) and long-reach communications. This route is monopolized by giants like Lumentum, II-VI (Coherent), and Sumitomo.

    • CW Laser: Emerging route, completely separating the laser. It offers advantages in cost and power consumption and is more compatible with future CPO routes. The supply of CW lasers is relatively flexible. Chinese manufacturers like Yuanjie Technology, Shijia Photons, and Changguang Huaxin have achieved mass production of 70mW/100mW products and secured large orders.

    Next, four crucial fiber optic components, rarely used in traditional pluggable optical module routes:

    • (2) Fiber Array Unit (FAU): Used for precisely mounting optical fibers to achieve high-precision alignment between fibers and waveguides.

    • (3) Polarization Maintaining Fiber (PMF): A specialized optical fiber used to keep the polarization state of light waves constant.

    • (4) Fiber Shuffle: Used to arrange optical fibers, capable of rearranging the order of fibers within complex, high-density equipment.

    • (5) Multi-Fiber Push On (MPO) Connector: Used for interconnecting multi-core optical fibers.

    Why are these components rarely used in traditional optical modules?

    • FAU: In traditional setups, fibers directly plug into standardized interfaces. Under CPO, fibers must achieve high-precision coupling with waveguides on the surface of the photonic chip, necessitating FAUs.

    • PMF: Traditional methods use direct modulation, which is insensitive to the light wave's polarization state, and PMF was previously too expensive for broad industrial application. However, CPO uses an external laser source; laser polarization states can cause massive energy loss, making PMF essential.

    • Fiber Shuffle: Traditional modules typically have just one transmit and one receive fiber, simple enough for manual handling without needing Fiber Shuffle. However, CPO's high fiber count connecting to the backplane necessitates Fiber Shuffle.

    • MPO: Similarly, traditional modules don't require many connectors. Under CPO, achieving 400G and above requires parallel transmission over 8 or even 16 fibers, while panel space is limited, necessitating multi-fiber connectors like MPO.

We will analyze the market size and the investment opportunities within the industrial segments related to CPO in our next article.
