I have been using VRFs on my Linux routers for some time. While the idea of it sounds pretty good, the actual implementation is pure pain and suffering.

If you want to do firewalling, for example, you will quickly find out that the source interface information is lost once traffic is in a VRF. Nftables rules matching on iifname for an interface that is enslaved to a VRF will not match, because the reported interface name is that of the VRF device instead.

For software that is not VRF-aware, you end up having to prefix the command with ip vrf exec <vrf-name>. It is clever, but quite ugly, especially when it comes to services managed via systemd.
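
For example, assuming a VRF named mgmt, pinging from within it looks like this:

ip vrf exec mgmt ping 192.0.2.1

Every socket the wrapped command creates then ends up bound to the mgmt VRF.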

The commonly recommended way in this case is to edit the systemd unit and change the ExecStart= line to include the ip vrf exec command. The problem is that when the software ships its own systemd unit, you either edit the provided unit, which is ugly, or you create an override, which means that if upstream later changes the ExecStart= command, you will miss those changes.
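
Such an override usually ends up looking like the drop-in below. The service, the mgmt VRF and the paths are made up here; the actual command line depends on what the upstream unit ships:

# /etc/systemd/system/sshd.service.d/10-vrf.conf (hypothetical)
[Service]
# ExecStart= has to be cleared before it can be redefined
ExecStart=
ExecStart=/usr/sbin/ip vrf exec mgmt /usr/sbin/sshd -D

Any later upstream change to ExecStart= has to be copied into this drop-in by hand.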

With all that said, I set off to do some research and see if it could be done better.

Understanding what iproute2 is doing under the hood

I started by checking what iproute2 is doing under the hood for the ip vrf exec command.

A quick look at the source code [1] reveals where the magic is happening. Before iproute2 forks to execute the requested command, it loads a small eBPF program and attaches it to the respective cgroup.

The eBPF program is actually very small:

BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
BPF_MOV64_IMM(BPF_REG_3, idx),
BPF_MOV64_IMM(BPF_REG_2,
	    offsetof(struct bpf_sock, bound_dev_if)),
BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3,
	    offsetof(struct bpf_sock, bound_dev_if)),
BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
BPF_EXIT_INSN(),

It might look a bit daunting at first, but after a short read about the inner workings of eBPF, I could understand what it does.

When iproute2 calls bpf_program_load(), it sets the type to BPF_PROG_TYPE_CGROUP_SOCK, and when it attaches the program with bpf_program_attach(), it sets the attach type to BPF_CGROUP_INET_SOCK_CREATE. The documentation for the BPF_PROG_TYPE_CGROUP_SOCK program type [2] tells us that when our program gets called, it receives a struct bpf_sock as context.
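
To make this more concrete, here is a rough sketch of that load/attach sequence using the raw bpf() syscall: the same six instructions, hand-assembled, plus the two bpf() calls. The helper names and the cgroup path argument are my own; the real iproute2 code goes through its own wrapper functions and does proper error handling.

/*
 * Minimal sketch: load a BPF_PROG_TYPE_CGROUP_SOCK program and attach it
 * to a cgroup at BPF_CGROUP_INET_SOCK_CREATE, roughly what iproute2 does.
 * Error handling is mostly omitted.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int sys_bpf(int cmd, union bpf_attr *attr)
{
        return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

static int attach_bind_prog(const char *cgroup_path, uint32_t ifindex)
{
        /* The six instructions from above, hand-assembled. */
        struct bpf_insn insns[] = {
                /* r6 = r1 (save the context pointer) */
                { .code = BPF_ALU64 | BPF_MOV | BPF_X,
                  .dst_reg = BPF_REG_6, .src_reg = BPF_REG_1 },
                /* r3 = ifindex */
                { .code = BPF_ALU64 | BPF_MOV | BPF_K,
                  .dst_reg = BPF_REG_3, .imm = (int32_t)ifindex },
                /* r2 = offsetof(struct bpf_sock, bound_dev_if) */
                { .code = BPF_ALU64 | BPF_MOV | BPF_K,
                  .dst_reg = BPF_REG_2,
                  .imm = offsetof(struct bpf_sock, bound_dev_if) },
                /* ctx->bound_dev_if = r3 */
                { .code = BPF_STX | BPF_MEM | BPF_W,
                  .dst_reg = BPF_REG_1, .src_reg = BPF_REG_3,
                  .off = offsetof(struct bpf_sock, bound_dev_if) },
                /* r0 = 1, the verdict that lets the socket be created */
                { .code = BPF_ALU64 | BPF_MOV | BPF_K,
                  .dst_reg = BPF_REG_0, .imm = 1 },
                /* return r0 */
                { .code = BPF_JMP | BPF_EXIT },
        };
        union bpf_attr attr;
        int prog_fd, cgroup_fd;

        /* BPF_PROG_LOAD: hand the instructions to the verifier */
        memset(&attr, 0, sizeof(attr));
        attr.prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
        attr.insns     = (uint64_t)(uintptr_t)insns;
        attr.insn_cnt  = sizeof(insns) / sizeof(insns[0]);
        attr.license   = (uint64_t)(uintptr_t)"GPL";
        prog_fd = sys_bpf(BPF_PROG_LOAD, &attr);
        if (prog_fd < 0)
                return -1;

        cgroup_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
        if (cgroup_fd < 0)
                return -1;

        /* BPF_PROG_ATTACH: from now on, every socket created by a task in
         * this cgroup runs the program and gets bound_dev_if set for it. */
        memset(&attr, 0, sizeof(attr));
        attr.target_fd     = cgroup_fd;
        attr.attach_bpf_fd = prog_fd;
        attr.attach_type   = BPF_CGROUP_INET_SOCK_CREATE;
        return sys_bpf(BPF_PROG_ATTACH, &attr);
}

The part this sketch leaves out is that iproute2 first puts the process into a per-VRF cgroup and only then exec()s the requested command, so the program affects that command and its children, nothing else.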

According to the eBPF ABI documentation [3], register R0 holds our program's exit value, registers R1 to R5 hold the arguments for function calls, and R6 to R9 are registers we can use as we please. I will ignore R10, as our program doesn't make use of it.

Back to the program, we can see what is happening:

  • The first instruction copies R1 into R6. R1 contains our context (struct bpf_sock), passed by the kernel when the hook is called.
  • The second loads the interface index into R3.
  • The third loads the offset of the bound_dev_if member of the struct into R2 (notably, R2 is never used afterwards; the store that follows encodes the offset directly).
  • The fourth stores our interface index into the bound_dev_if member of the struct.
  • The fifth sets the return code to 1, allowing the operation to proceed.
  • The sixth finally returns.

I don’t know about you, but to me, this sounds like a lot of extra work for no gain.

Testing my own eBPF program for socket binding

Thinking that the original iproute2 eBPF program was unnecessarily long, I decided to try my own very simplified version:

BPF_ST_MEM(BPF_W, BPF_REG_1, idx, 
        offsetof(struct bpf_sock, bound_dev_if)),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),        

It ran just fine on my local machine with Linux 6.11, but when I tried it on a remote machine running Linux 6.1, the program failed to load.

I soon found out that support for changing the context using BPF_ST_MEM was only introduced in Linux 6.4 [4]. This limitation was easy enough to overcome; I just added an extra step to my program

BPF_MOV64_IMM(BPF_REG_2, idx),
BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_2, 
        offsetof(struct bpf_sock, bound_dev_if)),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),

and with that, the program loaded and worked just fine.

Improving the situation with non-VRF-aware programs

Digging further, I found out about the BPFProgram= option in systemd, which allows attaching BPF programs to the unit's cgroup. This got me wondering whether I could replace the entire ip vrf exec hack with a BPFProgram= line. So I came up with a plan:

  • Load and pin the eBPF program when the VRF gets configured
  • Add the BPFProgram= line via a systemd unit override (sketched below)
  • Profit
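
For the override itself, a couple of lines would do. The service and the pin path below are illustrative; the attach type keyword and exact syntax are described in systemd.resource-control(5):

# /etc/systemd/system/sshd.service.d/10-vrf.conf (hypothetical)
[Service]
# Run the pinned cgroup/sock_create program for every socket this unit creates
BPFProgram=sock_create:/sys/fs/bpf/vrf-mgmt-bind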

During my research to implement this plan, I came across a blog post from JerryXiao that implemented something very similar.

While this solution looks very clever, I was still unhappy that an extra Python program was involved and that an extra service unit was needed to load the BPF program.

The more elegant solution

When looking around the systemd repository for VRF-related things, I stumbled on a GitHub issue requesting native VRF support.

A thought came to mind: what if I tried implementing this feature myself? It took me roughly three weeks to get it working, and the result can be seen here.

Starting with systemd v260, there is native support for binding all sockets a service creates to a given interface, which conveniently can be a VRF interface. You just need to add a BindNetworkInterface= line to the service unit and you're all set. The best part is that this directive supports templating, so you can set BindNetworkInterface=%i and run systemctl start sshd@mgmt, for example.

A minimal example is provided below:

[Unit]
Description=Ping within a VRF

[Service]
Type=simple
ExecStart=/usr/bin/ping 8.8.8.8
BindNetworkInterface=vrf-test

[Install]
WantedBy=multi-user.target
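
And to make use of the templating, the same thing can be written as a template unit where the instance name selects the VRF. The ping@.service name here is just my own illustration:

# ping@.service, started with: systemctl start ping@vrf-test
[Unit]
Description=Ping within the %i VRF

[Service]
Type=simple
ExecStart=/usr/bin/ping 8.8.8.8
BindNetworkInterface=%i

[Install]
WantedBy=multi-user.target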

While the pull request above contains a lot of code, the gist of it is very simple:

  • Parse the interface name from the BindNetworkInterface parameter
  • Check if it’s a valid interface
  • Resolve the interface name into an interface index
  • Load the BPF program and set the interface index
  • Attach the BPF program to the cgroup
  • PROFIT

The BPF program itself is as simple as it gets:

const volatile __u32 ifindex = 0;

SEC("cgroup/sock_create")
int sd_bind_interface(struct bpf_sock *ctx) {
        /* Bind the socket to the VRF interface */
        ctx->bound_dev_if = ifindex;
        return 1;
}

If you look carefully, it’s just the C version of my very simplified BPF program.

Conclusion

Working with VRFs on Linux is full of caveats, especially when it comes to programs that don't offer native VRF support. The traditional approach of wrapping commands with ip vrf exec works, but it creates challenges when it comes to systemd-managed services, where unit overrides eventually conflict with upstream changes.

With systemd v260, this capability is now built-in. Adding BindNetworkInterface=vrf-mgmt to a service unit automatically binds all its sockets to that interface, without needing wrappers, systemd generators, or manually loading eBPF programs. The feature even supports templating (%i), making it easy to run multiple instances of a service, each one in a different VRF.

TL;DR

Using ip vrf exec on systemd units is pure pain.

I dug into what iproute2 actually does and found it’s just loading a tiny eBPF program that binds sockets to an interface. That’s… surprisingly simple?

So I decided to implement native VRF support in systemd. Now you can just add BindNetworkInterface=vrf-mgmt into your service unit and call it a day. The feature will be available starting with systemd v260. The best part is that you can just add a systemd override to an upstream service unit. If upstream ships changes to their service unit, you do not have to worry about updating your override.