

## Big problems require big solutions

- High Performance Computing (Supercomputers)
- AI / Machine Learning (Custom clusters)
- Hyperscaler Cloud Computing (SaaS)

## When one computer won't do

Two computers are better than one

**Solution**: connect multiple computers with a network

#### We need a better interconnect

## Networks are not designed for memory access

- We have known this for a long time
- Explicit communication
  - Software APIs (e.g. sockets, RDMA)
- Emphasis on bandwidth
  - Large transfers are more efficient because they amortize the per-message overhead
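To make the bandwidth point concrete, the sketch below estimates the effective bandwidth of a single transfer as a function of its size. The 3 µs latency is the network figure quoted in this talk; the 50 GB/s link speed and the simple `latency + size / bandwidth` cost model are assumptions chosen only for illustration.

```python
def effective_bandwidth(size_bytes: int, latency_s: float, link_bw: float) -> float:
    """Bytes per second achieved by one transfer under a latency + size/bandwidth model."""
    transfer_time = latency_s + size_bytes / link_bw
    return size_bytes / transfer_time


LATENCY_S = 3e-6   # 3 µs per-message latency (network figure from the talk)
LINK_BW = 50e9     # assumed 50 GB/s link, illustrative only

for size in (64, 4096, 1 << 20, 1 << 30):
    bw = effective_bandwidth(size, LATENCY_S, LINK_BW)
    print(f"{size:>12} B -> {bw / 1e9:8.3f} GB/s ({100 * bw / LINK_BW:5.1f}% of link)")
```

Under these assumptions a cache-line-sized (64 B) transfer uses a tiny fraction of the link, while a 1 GiB transfer comes close to the full link rate. This is why network software favors large, batched transfers, and why that model is a poor fit for fine-grained memory access.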

### **Memory Fabrics**

**Memory Semantics**

- Load + Store instructions
- Ordering
- Cache coherency (nice to have, not mandatory)

Fundamentally different software from networks
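The difference in software can be sketched with standard-library pieces. In the example below, a local socket pair stands in for a network and an anonymous `mmap` region stands in for fabric-attached memory; both stand-ins are assumptions for illustration, and a real fabric would map device memory rather than anonymous memory.

```python
import mmap
import socket
import struct


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Receive exactly n bytes, looping because recv may return fewer."""
    chunks = b""
    while len(chunks) < n:
        part = sock.recv(n - len(chunks))
        if not part:
            raise ConnectionError("socket closed early")
        chunks += part
    return chunks


# Explicit communication: both sides must call an API (send and receive).
sender, receiver = socket.socketpair()
sender.sendall(struct.pack("<Q", 42))
(value_via_network,) = struct.unpack("<Q", recv_exact(receiver, 8))
sender.close()
receiver.close()

# Memory semantics: a write is a store and a read is a load; nobody calls receive.
region = mmap.mmap(-1, 4096)  # anonymous mapping standing in for fabric memory
struct.pack_into("<Q", region, 0, 42)                      # store
(value_via_memory,) = struct.unpack_from("<Q", region, 0)  # load
region.close()

print(value_via_network, value_via_memory)
```

With the socket, data moves only when both endpoints cooperate. With the mapped region, any party that has the mapping can read or write it directly, which is why ordering and coherency become the software's concern.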

- Network latency: 3 µs (= 3000 ns)
- Fabric latency: 300 ns (10x lower)
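The 10x gap matters most for dependent accesses, where each access cannot start until the previous one returns (for example, walking a linked structure). The sketch below uses the two latency figures from this slide; the chain length of 1,000 accesses is an arbitrary assumption.

```python
NETWORK_LATENCY_NS = 3_000  # 3 µs, from the slide
FABRIC_LATENCY_NS = 300     # from the slide


def chain_time_us(num_dependent_accesses: int, latency_ns: int) -> float:
    """Total time in microseconds for accesses that must run one after another."""
    return num_dependent_accesses * latency_ns / 1_000


hops = 1_000  # assumed chain length, illustrative only
network_us = chain_time_us(hops, NETWORK_LATENCY_NS)
fabric_us = chain_time_us(hops, FABRIC_LATENCY_NS)
print(f"network: {network_us} µs, fabric: {fabric_us} µs, ratio: {network_us / fabric_us}x")
```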

## Fabric-Attached Memory



DRAM: 128GB x 16

Total: 2TB

## Expander



DRAM: 128GB x 16

Expander: 512GB x 2

Total: 3TB
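The capacity totals on these slides follow from simple arithmetic, sketched below. The only assumption is the binary convention of 1 TB = 1024 GB.

```python
GB_PER_TB = 1024  # binary convention, assumed


def total_tb(*banks: tuple) -> float:
    """Sum (size_gb, count) pairs and convert the result to TB."""
    return sum(size_gb * count for size_gb, count in banks) / GB_PER_TB


dram = (128, 16)     # 16 DIMMs of 128 GB
expander = (512, 2)  # 2 expanders of 512 GB

local_only_tb = total_tb(dram)
with_expander_tb = total_tb(dram, expander)
print(local_only_tb, with_expander_tb)
```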

## Pool



Total: 6TB


## **Shared Memory Pool**



## **Memory Fabrics**





Many have tried, none have succeeded



## CXL





Physical Layer: PCIe 5

Coherency: yes

Members (partial list from the logo slide): Microsoft, Meta, Dell EMC, Google, HPE, Huawei, NVIDIA, Intel, IBM, UnifabriX, Ampere, Ellisys, KIOXIA, Synopsys, Enfabrica, Ericsson, ExpectedIT, Arteris, Cadence, Celestial AI, JPC Connectivity, H3C, Lenovo, SanDisk, ScaleFlux, Lightelligence, Liqid

#### The Battle of the Protocols

- CXL announced in 2020
- UALink announced in April 2025
- NVLink Fusion announced in May 2025
- Scale-Up Ethernet announced in May 2025

#### Who will win?



# UALink [April 2025]

Members: 69

Physical Layer: XXXX

Coherency: **no**

#### **Use Cases**

- High Performance Computing (Supercomputers): MPI, OpenMP, OpenSHMEM
- AI / Machine Learning (Custom clusters): xCCL collectives
- Hyperscaler Cloud Computing (SaaS): Spark, microservices

### **Software Solutions**

API for controlling the remote memory

- Allocate/free
- Access control
- Notifications

Necessary primitives

- Atomics
- Barriers
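The pieces listed above can be sketched as a small in-process model. Everything here is hypothetical: `FabricMemoryManager`, its method names, and the host names are invented for illustration and do not correspond to any real fabric manager or library. A lock stands in for hardware atomics and `threading.Barrier` stands in for a fabric-wide barrier.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Region:
    data: bytearray
    owner: str
    allowed: Set[str] = field(default_factory=set)


class FabricMemoryManager:
    """Hypothetical control API; an in-process stand-in, not a real fabric manager."""

    def __init__(self) -> None:
        self._regions: Dict[int, Region] = {}
        self._next_handle = 1
        self._lock = threading.Lock()
        self._listeners: List[Callable[[str, int], None]] = []

    # Notifications
    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        self._listeners.append(callback)

    def _notify(self, event: str, handle: int) -> None:
        for callback in self._listeners:
            callback(event, handle)

    # Allocate / free
    def allocate(self, owner: str, size: int) -> int:
        with self._lock:
            handle = self._next_handle
            self._next_handle += 1
            self._regions[handle] = Region(bytearray(size), owner, {owner})
        self._notify("allocated", handle)
        return handle

    def free(self, caller: str, handle: int) -> None:
        with self._lock:
            if self._regions[handle].owner != caller:
                raise PermissionError(f"{caller} does not own region {handle}")
            del self._regions[handle]
        self._notify("freed", handle)

    # Access control
    def grant(self, caller: str, handle: int, host: str) -> None:
        with self._lock:
            region = self._regions[handle]
            if region.owner != caller:
                raise PermissionError(f"{caller} does not own region {handle}")
            region.allowed.add(host)

    # Atomics
    def fetch_add(self, caller: str, handle: int, offset: int, delta: int) -> int:
        """Atomically add delta to a 64-bit little-endian counter; return the old value."""
        with self._lock:
            region = self._regions[handle]
            if caller not in region.allowed:
                raise PermissionError(f"{caller} may not access region {handle}")
            old = int.from_bytes(region.data[offset:offset + 8], "little")
            new = (old + delta) % (1 << 64)
            region.data[offset:offset + 8] = new.to_bytes(8, "little")
            return old


# Usage: two simulated hosts increment a shared counter between two barriers.
mgr = FabricMemoryManager()
events = []
mgr.subscribe(lambda event, handle: events.append((event, handle)))
h = mgr.allocate("host-a", 64)
mgr.grant("host-a", h, "host-b")
barrier = threading.Barrier(2)


def worker(host: str) -> None:
    barrier.wait()                    # start together
    for _ in range(1000):
        mgr.fetch_add(host, h, 0, 1)
    barrier.wait()                    # finish together


threads = [threading.Thread(target=worker, args=(name,)) for name in ("host-a", "host-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_count = mgr.fetch_add("host-a", h, 0, 0)  # read the counter without changing it
mgr.free("host-a", h)
print(final_count, events)
```

Because every increment goes through the atomic `fetch_add`, no update is lost even though both hosts write the same location. Replacing it with a plain read-modify-write would reintroduce the race that these primitives exist to prevent.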

## Summary

- Disaggregated memory is now a reality
- Memory expanders, pools, and shared pools are changing system design in fundamental ways
- CXL is successful but has competition
- The winner(s) have yet to be decided

## More memory is better



## Cache Coherency

- Cache is used to hold a local copy of data
  - Generally lower latency than main memory
  - Shorter distance for data movement
- Cached shared memory requires cache coherency
  - If two compute units update their local copies simultaneously, who wins?
- Some protocols do not support coherency by design (NVLink, UALink)
- Hardware coherency has not yet been implemented by chip vendors (CXL)
- Rely on software coherency for now
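The "who wins?" question and the software workaround can be illustrated with a toy model. The classes below are a simplified simulation written for this example, not a description of any real cache or protocol: each compute unit has a private write-back cache with no hardware coherency, and the software-coherent variant assumes the two units already take turns (mutual exclusion) around their updates.

```python
class SharedMemory:
    def __init__(self) -> None:
        self.cells = {}


class ComputeUnit:
    """Toy compute unit with a private write-back cache and no hardware coherency."""

    def __init__(self, memory: SharedMemory) -> None:
        self.memory = memory
        self.cache = {}

    def load(self, addr: int) -> int:
        if addr not in self.cache:            # miss: fetch from shared memory
            self.cache[addr] = self.memory.cells.get(addr, 0)
        return self.cache[addr]               # hit: possibly stale

    def store(self, addr: int, value: int) -> None:
        self.cache[addr] = value              # stays local until flushed

    def flush(self, addr: int) -> None:
        self.memory.cells[addr] = self.cache[addr]

    def invalidate(self, addr: int) -> None:
        self.cache.pop(addr, None)


def increment_without_coherency() -> int:
    mem = SharedMemory()
    units = [ComputeUnit(mem), ComputeUnit(mem)]
    for unit in units:
        unit.load(0)                          # both cache the initial value 0
    for unit in units:
        unit.store(0, unit.load(0) + 1)       # both compute 1 from their stale copy
    for unit in units:
        unit.flush(0)                         # last flush wins
    return mem.cells[0]


def increment_with_software_coherency() -> int:
    mem = SharedMemory()
    units = [ComputeUnit(mem), ComputeUnit(mem)]
    for unit in units:                        # units take turns (assumed mutual exclusion)
        unit.invalidate(0)                    # drop any stale copy before reading
        unit.store(0, unit.load(0) + 1)
        unit.flush(0)                         # publish before the next unit runs
    return mem.cells[0]


lost = increment_without_coherency()
kept = increment_with_software_coherency()
print(lost, kept)
```

Without coherency one of the two increments is lost. With explicit invalidate and flush around each update, both are kept, at the cost of extra memory traffic and discipline in the software.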

# NVLink Fusion [May 2025]

Members: NVIDIA

Physical Layer: XXXX

Coherency: yes

# Scale-Up Ethernet [May 2025]

Members: **Broadcom** 

Physical Layer: **Ethernet** 

Coherency: yes