---
title: "GKE and Cloud TPU v6e (Trillium)"
date: 2025-05-27T11:30:40Z
---

If you use TPU Trillium and want to improve the network performance of your Pods, you can balance your network traffic across the VM NICs.

The `ct6e-standard-4t` machine type is backed by two physical NICs. Since the main interface of the VM is used by all the applications and Pods on the host, you can create two additional vNICs on the VM, each attached to one of the physical NICs, and pass them directly to the Pods, so you can multiplex your traffic and consume the total capacity of the physical NICs.

```sh
# Create two additional VPC networks and their subnets
gcloud compute --project=${PROJECT?} \
  networks create \
  tpu-net-1 \
  --mtu=8896 \
  --subnet-mode=custom

gcloud compute --project=${PROJECT?} \
  networks subnets create \
  tpu-net-1-sub \
  --network=tpu-net-1 \
  --region=${REGION?} \
  --range=192.168.0.0/24

gcloud compute --project=${PROJECT?} \
  networks create \
  tpu-net-2 \
  --mtu=8896 \
  --subnet-mode=custom

gcloud compute --project=${PROJECT?} \
  networks subnets create \
  tpu-net-2-sub \
  --network=tpu-net-2 \
  --region=${REGION?} \
  --range=192.168.1.0/24

# Create the TPU node pool with the two additional node networks attached
gcloud container node-pools create POOL_NAME \
  --location=${LOCATION} \
  --cluster=${CLUSTER_NAME} \
  --node-locations=${NODE_ZONES} \
  --machine-type=${MACHINE_TYPE} \
  --tpu-topology=${TPU_TOPOLOGY} \
  --additional-node-network network=tpu-net-1,subnetwork=tpu-net-1-sub \
  --additional-node-network network=tpu-net-2,subnetwork=tpu-net-2-sub \
  --enable-gvnic
```
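
As an optional sanity check (not part of the original walkthrough), you can confirm that both networks and their subnets exist before creating the node pool:

```sh
# List the additional VPC networks and subnets created above
gcloud compute networks list --filter="name~^tpu-net"
gcloud compute networks subnets list --filter="name~^tpu-net" --regions=${REGION?}
```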

Apply the following manifest to install DraNet:

```sh
kubectl apply -f https://raw.githubusercontent.com/google/dranet/refs/heads/main/install.yaml
```
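
Before moving on, you can check that the DraNet Pods came up; the namespace and labels depend on the install manifest, so a loose filter is used here:

```sh
# Look for the DraNet DaemonSet Pods (namespace/labels come from install.yaml)
kubectl get pods --all-namespaces -o wide | grep dranet
```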

Once DraNet is running you'll be able to see the network resources exposed by the DraNet Pods. To avoid noise, DraNet has a flag that allows you to set a client-side filter to control the exposed resources; in this case, we can set the flag to ignore network devices that are `virtual`. The manifest will look like:

```yaml
containers:
- args:
  - /dranet
  - --v=4
  - --filter=attributes["dra.net/virtual"].BoolValue == false
  image: ghcr.io/google/dranet:stable
```
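
The filtered devices are published by the driver as `ResourceSlice` objects, so a quick way to inspect what each node advertises (a sketch; the attribute domain is the one used later in this post) is:

```sh
# List the devices DraNet publishes per node and peek at their attributes
kubectl get resourceslices
kubectl get resourceslices -o yaml | grep -B2 -A2 "gce.dra.net"
```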

First, we tell DraNet what kind of NICs we're interested in and how Pods can claim them. In order to simplify our workloads we can create a `DeviceClass` that matches only the resources exposed by DraNet.

**DeviceClass (dranet):** This selects NICs managed by DraNet.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: dranet
spec:
  selectors:
  - cel:
      expression: device.driver == "dra.net"
```
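
Assuming the manifest above is saved as `deviceclass.yaml` (the filename is only an example), applying and verifying it looks like:

```sh
# Apply the DeviceClass and confirm the API server accepted it
kubectl apply -f deviceclass.yaml
kubectl get deviceclass dranet
```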

**ResourceClaimTemplate (tpu-net-interfaces):** This will request the two additional NICs. Since we created the additional networks with the prefix `tpu-net`, we can leverage the powerful CEL expressions to match on that prefix.

Another important factor is DraNet's ability to pass interface configuration options that allow tuning the interfaces for maximum performance, for example [Big TCP](https://lwn.net/Articles/884104/).

In addition, if you have gVNIC enabled you can use some private ethtool flags that improve TCP performance, like `enable-max-rx-buffer-size`.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: tpu-net-interfaces
spec:
  spec:
    devices:
      requests:
      - name: tpu-net-interface
        deviceClassName: dranet
        count: 2
        selectors:
        - cel:
            expression: device.attributes["gce.dra.net"].networkName.startsWith("tpu-net")
      config:
      - opaque:
          driver: dra.net
          parameters:
            interface:
              mtu: 8896
              gsoMaxSize: 65536
              groMaxSize: 65536
              gsoIPv4MaxSize: 65536
              groIPv4MaxSize: 65536
              disableEbpfPrograms: true
            ethtool:
              privateFlags:
                enable-max-rx-buffer-size: true
```
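
The template has to exist before the workload that references it is created; assuming it is saved as `tpu-net-interfaces.yaml` (an example filename), you can apply it now and later inspect the per-Pod claims generated from it:

```sh
# Apply the ResourceClaimTemplate (example filename)
kubectl apply -f tpu-net-interfaces.yaml

# Once the Pods below are scheduled, one ResourceClaim per Pod is generated
kubectl get resourceclaims
kubectl describe resourceclaim
```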

To test the network performance we'll use [neper](https://github.com/google/neper), a tool created by the Google kernel teams for network performance testing.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: neper
spec:
  selector:
    matchLabels:
      app: neper
  serviceName: neper
  replicas: 2
  template:
    metadata:
      labels:
        app: neper
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
        cloud.google.com/gke-tpu-topology: 4x4
      initContainers:
      - name: "network-optimization-sysctls"
        image: "busybox"
        securityContext:
          privileged: true
        command:
        - sh
        - -c
        - |
          echo 5000 > /proc/sys/net/ipv4/tcp_rto_min_us
          echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save
          echo 0 > /proc/sys/net/ipv4/tcp_slow_start_after_idle
          echo 131072 > /proc/sys/net/core/optmem_max
          echo "4096 41943040 314572800" > /proc/sys/net/ipv4/tcp_rmem
      containers:
      - name: neper
        image: ghcr.io/google/neper:stable
        securityContext:
          privileged: true
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4
      resourceClaims:
      - name: tpu-net-interface
        resourceClaimTemplateName: tpu-net-interfaces
```
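
Note that the StatefulSet sets `serviceName: neper`. If that governing headless Service doesn't already exist in the namespace, a minimal one (a sketch, not part of the original manifests) could be:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: neper
spec:
  clusterIP: None   # headless Service that backs the StatefulSet Pod DNS records
  selector:
    app: neper
```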

We'll get two Pods running:

```sh
$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
neper-0   1/1     Running   0          10m
neper-1   1/1     Running   0          22s
```

Using neper-1 as a server (`kubectl exec -it neper-1 -- sh`), first check the additional IPs assigned with `ip addr`; in this case these IPs are 10.9.9.11 and 10.10.0.11:

```sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1460 qdisc noqueue state UP qlen 1000
    link/ether 16:41:72:68:11:67 brd ff:ff:ff:ff:ff:ff
    inet 10.68.2.12/24 brd 10.68.2.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP qlen 1000
    link/ether 42:01:0a:09:09:0b brd ff:ff:ff:ff:ff:ff
    inet 10.9.9.11/32 scope global eth1
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP qlen 1000
    link/ether 42:01:0a:0a:00:0b brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.11/32 scope global eth2
       valid_lft forever preferred_lft forever
```
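
To confirm that the opaque configuration from the ResourceClaimTemplate was applied to these interfaces, a quick sketch from inside the same shell (assuming `iproute2` and `ethtool` are available in the image) is:

```sh
# MTU, gso_max_size and gro_max_size should reflect the claim configuration
ip -d link show eth1

# The gVNIC private flag requested via ethtool should report as on
ethtool --show-priv-flags eth1
```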

Then run one TCP stream server per NIC:

```sh
for i in 0 1; do
  tcp_stream -C$((52279 + i)) --port=$((38339 + i)) --skip-rx-copy -rw -Z -B16384 --test-length=60 --suicide-length=120 -F100 --num-threads=16 --num-flows=32 -D0 --logtostderr &> test$i.log &
done
```

Then use neper-0 as a client (`kubectl exec -it neper-0 -- sh`) to connect to each TCP server:

```sh
tcp_stream -C52279 --port=38339 --skip-rx-copy -rw -Z -B16384 --test-length=60 --suicide-length=70 -F100 --num-threads=16 --num-flows=32 --client -H 10.9.9.11 -D0 --logtostderr &> test0.log &
tcp_stream -C52280 --port=38340 --skip-rx-copy -rw -Z -B16384 --test-length=60 --suicide-length=70 -F100 --num-threads=16 --num-flows=32 --client -H 10.10.0.11 -D0 --logtostderr &> test1.log &
```

The first test instance recorded a throughput of ~180.17 Gbps, and the second instance simultaneously achieved ~174.73 Gbps.

```sh
grep throughput test*
test0.log:throughput_opt=Mb
test0.log:throughput=180165.51
test0.log:throughput_units=Mbit/s
test0.log:local_throughput=180165511242
test0.log:remote_throughput=177503231653
test1.log:throughput_opt=Mb
test1.log:throughput=174727.08
test1.log:throughput_units=Mbit/s
test1.log:local_throughput=174727081480
test1.log:remote_throughput=175469311719
```

The sum of these two independent tests gives a total aggregated throughput of ~354.9 Gbps.
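
Since both logs report `throughput` in Mbit/s, one way to compute that aggregate directly from the logs is a small one-liner like:

```sh
# Sum the per-test throughput values (Mbit/s) and print the total in Gbps
grep -h '^throughput=' test*.log | awk -F= '{sum += $2} END {printf "total: %.1f Gbps\n", sum / 1000}'
```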
