eBPF_flat_p2

Introduction

In this post we will pick up where we left off and write the backend, or kernel space, eBPF code for our program, flat, which monitors network latency in a very efficient way.

Make sure to check the previous posts to get up to speed with what we are about to build.

The Big Picture

As described in part 1 of this series, our kernel space code needs to have a view on both ingress and egress traffic, hence we’ll continue using the tc classifier program type.

The focus is on unicast Ethernet traffic whether TCP or UDP for both IPv4 and IPv6 protocols.

Analyzing TCP is easier since the SYN and SYN/ACK flags are available to us to distinguish the two directions of a flow. UDP is connectionless and lacks these flags, making it hard to accurately distinguish flows. Nonetheless, we are going to be fine (read: accurate) as long as we are not dealing with QUIC and similar protocols.

Next, we need a data structure, or struct, to store the required packet information such as IP addresses, source and destination ports, TTL and timestamp, so that we can process it in our user space code.

Lastly, we need an eBPF map to share that struct between the user and kernel space programs efficiently.

First UPDATE - 12 October 2023

While finalizing part 3 of this series, I realized this post needs some improvements. So, here’s what’s changed:

Initially, I wrote this post using the BPF_MAP_TYPE_PERF_EVENT_ARRAY map type and then migrated to BPF_MAP_TYPE_RINGBUF since it provides better performance, event ordering, and has less memory overhead
Added a better explanation for tc actions here
Improved formatting
Added a reference section for useful resources and further reading

Second UPDATE - 26 October 2023

Maxim Bogushevich kindly notified me of this commit, which amends the comment for the bpf_skb_pull_data function. The description of that function has been corrected in this update.

The Kernel Space Program

Alrighty. Time to write some C code. The code in this post will sit inside the bpf/flat.c file.

The Required Libraries

These are all the libraries that our program requires to run:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#include <linux/bpf.h>
#include <linux/bpf_common.h>

#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/in.h>
#include <linux/in6.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <linux/pkt_cls.h>

#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

Just by looking at their names, you can probably tell what most of them do. Do not worry if you are not familiar with C; most of these libraries just provide human-friendly names (symbolic constants) for the data that we are going to process. For example, ETH_P_IP instead of 0x0800 to signify IPv4 and IPPROTO_TCP instead of 6 for TCP.

We could also write most of these types ourselves without including all these libraries. If you are curious to see an example, check out this repository from Cilium.

If you have followed the Setup an eBPF Development Environment article, you can use the “go to” feature of your IDE on these symbolic constants and types to see where they are coming from. For instance ctrl+click or F12 in VSCode.

The License

I described the why and the how of adding a license in the previous post. We will use the same principle here by adding this line to our program:

char _license[] SEC("license") = "Dual MIT/GPL";

Defining The Packet Data Structure

A data structure is needed to store the information we need for each packet. Specifically:

Source IP
Destination IP
Source port
Destination port
Protocol (TCP/UDP)
TTL or Time to Live
SYN flag to determine if we have a TCP SYN packet
ACK flag which, together with the SYN flag, determines if we have a TCP SYN/ACK packet
ts to timestamp when a packet is received

We will name it packet_t, and this is what it amounts to:

struct packet_t {
    struct in6_addr src_ip;
    struct in6_addr dst_ip;
    __be16 src_port;
    __be16 dst_port;
    __u8 protocol;
    __u8 ttl;
    bool syn;
    bool ack;
    uint64_t ts;
};

One cool trick is to use the in6_addr struct from in6.h to store both our IPv4 and IPv6 addresses in one field without needing to have distinct types for each address family.

You may wonder how this is possible. Aren’t they different after all?

This format is called an IPv4-mapped IPv6 address and is described in RFC 4291. It allows us to embed an IPv4 address in an IPv6 address as depicted below:

 |                80 bits               | 16 |      32 bits        |
 +--------------------------------------+----+---------------------+
 |0000..............................0000|FFFF|    IPv4 address     |
 +--------------------------------------+----+---------------------+

The in6_addr type, which holds the source and destination addresses inside the IPv6 header, is defined like this:

struct in6_addr {                /* Offset  Size */
    union {
        __u8     u6_addr8[16];   /* 0       16   */
        __be16   u6_addr16[8];   /* 0       16   */
        __be32   u6_addr32[4];   /* 0       16   */
    } in6_u;                     /* 0       16   */

    /* size: 16, cachelines: 1, members: 1 */
    /* last cacheline: 16 bytes */
};

Notice that it is a union type. It gives us three ways to access the same 16 bytes:

__u8 u6_addr8[16]: As sixteen 8-bit (1-byte) chunks
__be16 u6_addr16[8]: As eight 16-bit (2-byte) chunks
__be32 u6_addr32[4]: As four 32-bit (4-byte) chunks

To embed IPv4 addresses in IPv6, we will go with the third option, four u32s: store the IPv4 address at index 3 (the last element), set the 16 bits before it to all Fs, and leave the 80 bits before that as zeroes. I will demonstrate this as we progress.
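
To make the layout concrete, here is a small user space sketch of the same mapping. The addr128 union, map_ipv4 and is_ipv4_mapped are hypothetical names used only for illustration; they mirror the shape of in6_addr but are not part of the kernel headers or of flat.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical user space mirror of the in6_addr union: three views
 * over the same 16 bytes. */
union addr128 {
    uint8_t  u8[16];
    uint16_t u16[8];
    uint32_t u32[4];
};

/* Build an IPv4-mapped IPv6 address from the four octets a.b.c.d. */
static union addr128 map_ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    union addr128 out;
    uint8_t octets[4] = { a, b, c, d };

    memset(&out, 0, sizeof(out));    /* first 80 bits stay zero          */
    out.u16[5] = 0xffff;             /* bits 80..95 are all ones         */
    memcpy(&out.u32[3], octets, 4);  /* last 32 bits: IPv4, network order */
    return out;
}

/* Return 1 if addr has the ::ffff:a.b.c.d layout, 0 otherwise. */
static int is_ipv4_mapped(const union addr128 *addr) {
    static const uint8_t prefix[12] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff };
    return memcmp(addr->u8, prefix, sizeof(prefix)) == 0;
}
```

Writing 0xffff through the 16-bit view works on any host because all of its bits are set, so byte order does not matter for that field.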

The rest of the types in the packet_t struct are pretty self-explanatory. The src_port and dst_port fields are defined as __be16 since the port numbers are kept in big (network) endian byte order.
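
Since the user space program has to decode the exact same bytes, it can help to pin down the layout of the record. The sketch below uses a hypothetical packet_mirror struct, with plain byte arrays standing in for struct in6_addr and uint16_t for __be16, and assumes a 1-byte bool and natural alignment, which holds on common Linux targets but is still an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user space mirror of packet_t. */
struct packet_mirror {
    uint8_t  src_ip[16];
    uint8_t  dst_ip[16];
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  protocol;
    uint8_t  ttl;
    bool     syn;
    bool     ack;
    uint64_t ts;
};

/* Compile-time checks: the build fails if the layout ever drifts. */
_Static_assert(sizeof(bool) == 1, "layout assumes a 1-byte bool");
_Static_assert(offsetof(struct packet_mirror, src_port) == 32,
               "ports follow the two 16-byte addresses");
_Static_assert(offsetof(struct packet_mirror, ts) == 40,
               "ts starts at byte 40 with no padding before it");
_Static_assert(sizeof(struct packet_mirror) == 48,
               "one record is 48 bytes");

/* Size of one record as the consumer would read it from the map. */
static size_t packet_record_size(void) {
    return sizeof(struct packet_mirror);
}
```

Under those assumptions the fields pack without any padding, so each record takes 48 bytes.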

Defining The eBPF Map

Maps are data structures that can be used to share data between different eBPF programs and also between eBPF and user space programs using syscalls. In our case, the data we store inside packet_t struct will be passed onto our user space program using an eBPF map for further processing.

Let’s define a map and call it pipe:

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 512 * 1024);  /* 512 KB */
} pipe SEC(".maps");

__uint(type, BPF_MAP_TYPE_RINGBUF) uses a macro provided by libbpf (in bpf_helpers.h) that records an integer value in the definition of the struct, where the loader can read it back. The first argument, type, is the name of the field, and the second argument, BPF_MAP_TYPE_RINGBUF, is the value of the field. This line is essentially defining the type of the eBPF map: a ringbuf map.

BPF_MAP_TYPE_RINGBUF is a type of eBPF map introduced in Linux kernel 5.8, known as the BPF ring buffer. It is designed to very efficiently exchange data between user and kernel space programs. It’s a multi-producer, single-consumer (MPSC) queue that can be safely shared across multiple CPUs simultaneously as opposed to BPF_MAP_TYPE_PERF_EVENT_ARRAY which uses per-CPU buffers.

We also define a size of 512 KBs for our map using __uint(max_entries, 512 * 1024).
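
As far as I know, the kernel only accepts a ring buffer size that is a power of two and a multiple of the page size, which 512 * 1024 satisfies on systems with 4 KB pages. A minimal sketch of that rule, where ringbuf_size_ok is a hypothetical helper and not part of libbpf:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if size is a power of two and a multiple of page_size. */
static bool ringbuf_size_ok(uint64_t size, uint64_t page_size) {
    if (size == 0 || page_size == 0) {
        return false;
    }
    bool power_of_two = (size & (size - 1)) == 0;  /* exactly one bit set */
    bool page_aligned = (size % page_size) == 0;
    return power_of_two && page_aligned;
}
```

For example, 512 * 1024 passes with a 4096-byte page size, while 500 * 1024 does not because it is not a power of two.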

SEC(".maps") is another eBPF macro that places the struct in a specific section of the ELF binary. This is necessary because the BPF loader uses the ELF sections to determine what types of objects (maps, programs, etc…) are in the binary.

You can view all the available maps as of this writing here

In terms of performance, the eBPF ring buffer beats the perf buffer for all practical purposes. It supports efficient reading of data from a user space program through a memory-mapped region, without extra memory copying or syscalls into the kernel. It also supports epoll notifications and the ability to busy-loop for minimal latency.

How Should Everything Look By Now

So far, our program should look like this:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#include <linux/bpf.h>
#include <linux/bpf_common.h>

#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/in.h>
#include <linux/in6.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <linux/pkt_cls.h>

#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "Dual MIT/GPL";

struct packet_t {
    struct in6_addr src_ip;
    struct in6_addr dst_ip;
    __be16 src_port;
    __be16 dst_port;
    __u8 protocol;
    __u8 ttl;
    bool syn;
    bool ack;
    uint64_t ts;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 512 * 1024);  /* 512 KB */
} pipe SEC(".maps");

Hunting Packets

Alright, we have reached the fun part. Let’s go cherry-pick some packets.

Let’s start off by defining our eBPF program’s hook point, the tc classifier, and naming the program flat. We also specify that it receives a pointer to the network packet’s metadata, aka __sk_buff.

SEC("tc")
int flat(struct __sk_buff* skb) {
 // rest of the code goes here
}

Inside the function body, we start with our first sanity check by calling the bpf_skb_pull_data helper function with the packet’s metadata and 0 as the len. The helper pulls in non-linear data when the skb is non-linear and not all of len is part of the linear section; with a len of 0, it makes all bytes in the linear part of the skb readable and writable. If there’s an error (a negative return value), we tell the kernel to continue processing the packet normally without using our program by returning TC_ACT_OK, which is defined as 0:

if (bpf_skb_pull_data(skb, 0) < 0) {
  return TC_ACT_OK;
}

Sometimes packet data is non-linear, meaning that the buffers storing the packet data are not contiguous in memory.

Understanding TC Actions

It is important to understand the actual purpose of tc actions. Here are the three most common actions you will encounter:

TC_ACT_OK: Stops processing the packet any further and passes it to the Linux network stack. This means that if you have multiple eBPF programs attached, the ones that come after will not get to see and process the packet
TC_ACT_PIPE: Lets the next action or eBPF program process the packet
TC_ACT_SHOT: Stops processing the packet and drops it

The rest of the actions can be found here.
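
When debugging, it can be handy to turn a return value back into its name. The sketch below repeats the three values as I understand them from linux/pkt_cls.h (0, 2 and 3) under local names, so that it compiles without kernel headers; tc_action_name is a hypothetical helper.

```c
/* Local copies of the values of TC_ACT_OK, TC_ACT_SHOT and TC_ACT_PIPE. */
enum {
    ACT_OK   = 0,
    ACT_SHOT = 2,
    ACT_PIPE = 3
};

/* Map a tc action value to a printable name. */
static const char *tc_action_name(int action) {
    switch (action) {
    case ACT_OK:
        return "TC_ACT_OK";
    case ACT_SHOT:
        return "TC_ACT_SHOT";
    case ACT_PIPE:
        return "TC_ACT_PIPE";
    default:
        return "other";
    }
}
```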

Checking For Broadcast And Multicast Packets

Let’s check the packet type and if it is a broadcast or a multicast packet we pass it to the Linux kernel network stack, showing our disinterest:

if (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) {
  return TC_ACT_OK;
}

Checking Packet Boundaries

Next, we load the extents of the packet data (not to be confused with the payload), in other words, where it starts (head also known as data) and where it ends (tail also known as data_end) in memory:

void* head = (void*)(long)skb->data;
void* tail = (void*)(long)skb->data_end;

After that, to make the verifier happy, we conduct a bound check to ensure the start of our packet data (head) plus the size of the Ethernet header does not extend beyond the end of the packet data (tail). If it does, the packet is either malformed or not an Ethernet frame. In that case we pass the packet up to the Linux network stack.

if (head + sizeof(struct ethhdr) > tail) {
  return TC_ACT_OK;
}

To guarantee safety, BPF needs us to first ensure that we have not reached the end of the packet’s linear part (tail) and only then access packet data. As a result, most packet data accesses must be preceded by bound checks.
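
The same idea can be tried out in plain user space C. This is only an analogue of what the verifier enforces, not BPF code; in_bounds is a hypothetical helper written so that the arithmetic cannot run past the end of the buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if `need` bytes starting at `offset` fit between head
 * and tail. tail must not be before head. */
static bool in_bounds(const uint8_t *head, const uint8_t *tail,
                      size_t offset, size_t need) {
    size_t avail = (size_t)(tail - head);
    /* Compare against the remaining bytes instead of adding to a
     * pointer, so nothing is computed past the end of the buffer. */
    return offset <= avail && need <= avail - offset;
}
```

With a 60-byte frame, asking for a 14-byte Ethernet header at offset 0 succeeds, while asking for 20 bytes at offset 54 fails because only 6 bytes remain.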

Defining Headers

Alright. Let’s define the header types that we would require:

struct ethhdr* eth = head;  // Assign eth header since we have it
struct iphdr* ip;
struct ipv6hdr* ipv6;
struct tcphdr* tcp;
struct udphdr* udp;

To learn more about frame, packet and segment headers, check out this post.

Next, we initialize an empty (zeroed) packet_t struct as well as an offset. As we dissect the packet and move up the network layers, the offset keeps track of where we are, so that we do not cross the socket buffer boundaries and make the verifier mad.

struct packet_t pkt = { 0 };

uint32_t offset = 0;

IP Packets

The EtherType field inside the Ethernet header determines which layer 3 protocol is encapsulated in the payload of the frame. Specifically, 0x0800 for IPv4 and 0x86DD for IPv6.

In Linux, the h_proto field in the Ethernet header represents the EtherType. If you recall from the previous post, this field is in fact big endian, so we will use the bpf_ntohs helper function to convert it from network (big endian) byte order to the host system’s byte order (usually little endian) for further analysis.

switch (bpf_ntohs(eth->h_proto)) {
    // we do our checks here
}
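
To see what the conversion amounts to, here is a user space sketch that reads the EtherType straight from the raw bytes of a frame. The ethertype function and the ETHERTYPE_OFFSET name are hypothetical; the offset of 12 follows from the two 6-byte MAC addresses that precede the field.

```c
#include <stdint.h>

#define ETHERTYPE_OFFSET 12  /* destination MAC (6) + source MAC (6) */

/* Read the big endian EtherType from a raw Ethernet frame of at least
 * 14 bytes. Taking the most significant byte first yields the value in
 * host order on any machine. */
static uint16_t ethertype(const uint8_t *frame) {
    return (uint16_t)((frame[ETHERTYPE_OFFSET] << 8) |
                      frame[ETHERTYPE_OFFSET + 1]);
}
```

A frame carrying the bytes 0x08 0x00 at that position is IPv4 (0x0800), and 0x86 0xDD is IPv6 (0x86DD).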

IPv4

If we have an IPv4 (ETH_P_IP) packet:

Set the offset to the size of the Ethernet header plus the size of the IP header to conduct a bound check. Note that sizeof(struct iphdr) is the fixed 20 bytes, so this assumes an IPv4 header without options
Construct our IP header by moving to the byte after the Ethernet header
If the packet is not TCP or UDP, pass the packet up to the Linux kernel network stack
Embed the IPv4 into IPv6:
- Place the source and destination IP of our packet into the last index of in6_addr type using the u6_addr32 field
- Pad the 16 bits before it with Fs using the u6_addr16 field
- The 80 bits before that are already all zeroes since we zero-initialized pkt, so we don’t need to set them explicitly
Fill in the protocol and TTL into pkt struct
Break out of the switch statement to go for layer 4 checks

case ETH_P_IP:
  offset = sizeof(struct ethhdr) + sizeof(struct iphdr);

  if (head + offset > tail) {
      return TC_ACT_OK;
  }

  ip = head + sizeof(struct ethhdr);

  if (ip->protocol != IPPROTO_TCP && ip->protocol != IPPROTO_UDP) {
      return TC_ACT_OK;
  }

  // Place the source and destination IP of our packet into the last index of `in6_addr`
  pkt.src_ip.in6_u.u6_addr32[3] = ip->saddr;
  pkt.dst_ip.in6_u.u6_addr32[3] = ip->daddr;

  // Pad the 16 bits before IP address with all Fs as per the RFC
  pkt.src_ip.in6_u.u6_addr16[5] = 0xffff;
  pkt.dst_ip.in6_u.u6_addr16[5] = 0xffff;

  pkt.protocol = ip->protocol;
  pkt.ttl = ip->ttl;

  break;
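
The snippet above relies on sizeof(struct iphdr), which is the fixed 20-byte part of the header. IPv4 headers can also carry options, in which case the real length is given by the IHL field in 32-bit words. The sketch below shows how that length can be derived from the first byte of the header; ipv4_header_len is a hypothetical user space helper and is not part of flat.

```c
#include <stdint.h>

/* Header length in bytes from the first byte of an IPv4 header:
 * version in the high nibble, IHL (in 32-bit words) in the low nibble.
 * Returns 0 if the byte cannot belong to a valid IPv4 header. */
static unsigned ipv4_header_len(uint8_t version_ihl) {
    unsigned version = version_ihl >> 4;
    unsigned ihl = version_ihl & 0x0f;

    if (version != 4 || ihl < 5) {
        return 0;
    }
    return ihl * 4;
}
```

The usual first byte, 0x45, gives 20 bytes, while 0x46 means one 4-byte option word and a 24-byte header.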

IPv6

If we have an IPv6 (ETH_P_IPV6) packet:

Set the offset to the size of the Ethernet header plus the size of the IPv6 header to conduct a bound check
Construct our IPv6 header by moving to the byte after the Ethernet header
If the packet is not TCP or UDP, pass the packet up to the Linux kernel network stack
Copy the source and destination IPv6 addresses into pkt struct as they are
Fill in the protocol (nexthdr) and TTL (hop_limit) into pkt struct
Break out of the switch statement to go for layer 4 checks

case ETH_P_IPV6:
  offset = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);

  if (head + offset > tail) {
      return TC_ACT_OK;
  }

  ipv6 = head + sizeof(struct ethhdr);

  if (ipv6->nexthdr != IPPROTO_TCP && ipv6->nexthdr != IPPROTO_UDP) {
      return TC_ACT_OK;
  }

  pkt.src_ip = ipv6->saddr;
  pkt.dst_ip = ipv6->daddr;

  pkt.protocol = ipv6->nexthdr;
  pkt.ttl = ipv6->hop_limit;

  break;

default:  // We did not have an IPv4 or IPv6 packet!
  return TC_ACT_OK;
}

TCP and UDP Packets

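First, let’s conduct our bound check to ensure that there is enough room after the IP header for a TCP or UDP header. Note that the condition below requires room for both headers, so a UDP datagram that is shorter than a TCP header (20 bytes) is passed up the stack without being recorded: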

if (head + offset + sizeof(struct tcphdr) > tail || head + offset + sizeof(struct udphdr) > tail) {
    return TC_ACT_OK;
}
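
A per-protocol variant of this check could look like the user space sketch below. It is not the code used in flat: l4_min_header_len and l4_header_fits are hypothetical helpers, and the lengths are the minimum TCP header (20 bytes, no options) and the UDP header (8 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Same values as IPPROTO_TCP and IPPROTO_UDP. */
enum { PROTO_TCP = 6, PROTO_UDP = 17 };

/* Minimum layer 4 header size for a protocol, or 0 if unsupported. */
static size_t l4_min_header_len(uint8_t protocol) {
    switch (protocol) {
    case PROTO_TCP:
        return 20;
    case PROTO_UDP:
        return 8;
    default:
        return 0;
    }
}

/* Returns 1 if the layer 4 header starting at l4_offset fits in a
 * packet of total_len bytes, 0 otherwise. */
static int l4_header_fits(uint8_t protocol, size_t l4_offset, size_t total_len) {
    size_t need = l4_min_header_len(protocol);

    if (need == 0 || l4_offset > total_len) {
        return 0;
    }
    return need <= total_len - l4_offset;
}
```

With a 44-byte packet whose layer 4 header starts at byte 34, the UDP check passes (8 of the remaining 10 bytes are needed) while the TCP check fails.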

Now, in order to process TCP or UDP segments, we need to look at the protocol field that was extracted in the previous step.

switch (pkt.protocol) {
  // we do our checks here
}

TCP

If we have a TCP (IPPROTO_TCP) packet:

Use head and the offset to locate our TCP header
If the packet is a SYN or SYN/ACK, extract the source and destination ports as well as the SYN and ACK flags
Use bpf_ktime_get_ns helper function to get the time elapsed since system boot to accurately timestamp the packets in order to calculate their latency in the user space program later on
Send the data to the user space program using the bpf_ringbuf_output helper function. More on that later.

case IPPROTO_TCP:
    tcp = head + offset;

    if (tcp->syn) {
        pkt.src_port = tcp->source;
        pkt.dst_port = tcp->dest;
        pkt.syn = tcp->syn;
        pkt.ack = tcp->ack;
        pkt.ts = bpf_ktime_get_ns();

        if (bpf_ringbuf_output(&pipe, &pkt, sizeof(pkt), 0) < 0) {
            return TC_ACT_OK;
        }
    }
    break;

If you are wondering why we are just checking for tcp->syn and not tcp->ack, remember that a TCP SYN/ACK packet is just a packet with both the syn and ack bits set to 1. Checking tcp->syn alone therefore matches both SYN and SYN/ACK packets, and the ack field we store tells them apart later.
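
The same logic, expressed on the raw TCP flags byte, can be sketched in user space as follows. The enum and classify_flags are hypothetical names; the bit values 0x02 for SYN and 0x10 for ACK come from the TCP header format.

```c
#include <stdint.h>

#define FLAG_SYN 0x02
#define FLAG_ACK 0x10

enum handshake_step {
    NOT_HANDSHAKE = 0,
    STEP_SYN      = 1,
    STEP_SYN_ACK  = 2
};

/* Classify a segment from its TCP flags byte (byte 13 of the header). */
static enum handshake_step classify_flags(uint8_t flags) {
    if (!(flags & FLAG_SYN)) {
        return NOT_HANDSHAKE;  /* no SYN bit: not part of the handshake */
    }
    return (flags & FLAG_ACK) ? STEP_SYN_ACK : STEP_SYN;
}
```

A plain ACK or a data segment is ignored, which matches the kernel space code only emitting events when the syn bit is set.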

UDP

If we have a UDP (IPPROTO_UDP) packet:

Use head and the offset to locate our UDP header
Extract the source and destination ports
Use bpf_ktime_get_ns helper function to get the time elapsed since system boot to accurately timestamp the packets in order to calculate their latency in the user space program later on
Send the data to the user space program using the bpf_ringbuf_output helper function.

case IPPROTO_UDP:
    udp = head + offset;

    pkt.src_port = udp->source;
    pkt.dst_port = udp->dest;
    pkt.ts = bpf_ktime_get_ns();

    if (bpf_ringbuf_output(&pipe, &pkt, sizeof(pkt), 0) < 0) {
        return TC_ACT_OK;
    }
    break;

default:  // We did not have a TCP or UDP segment
    return TC_ACT_OK;
}

Sending Data to User Space Program

In the previous step, we saw the use of the bpf_ringbuf_output helper function. Simply put, bpf_ringbuf_output is used to send data from an eBPF program in the kernel to a user space program. The data is sent via a ring buffer.

The data can be anything that the eBPF program wants to send, and is typically related to the context in which the eBPF program is running.

Let’s break down the arguments of this helper function:

bpf_ringbuf_output(&pipe, &pkt, sizeof(pkt), 0)

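&pipe is a reference to the BPF map we defined (it has to be of type BPF_MAP_TYPE_RINGBUF)
&pkt and sizeof(pkt) are a pointer to the data (packet_t struct) to be sent and its size. This could be any data that the eBPF program wants to send to the user space program
0 is the flags argument. Passing 0 lets the kernel decide when to notify the user space consumer about new data, while BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP can be used to override that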

ringbuf provides two APIs to send data from the kernel to user space programs:

bpf_ringbuf_output, which copies the data into the ring buffer and allows for a one-to-one migration from bpf_perf_event_output
bpf_ringbuf_reserve and bpf_ringbuf_submit, which are more efficient than bpf_ringbuf_output: the program reserves space first and writes the record directly into the ring buffer, avoiding an extra copy, and if the buffer is full the call to reserve fails before any work is spent on preparing the record. Learn more about them here.

The Complete Program

Here is the complete program:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#include <linux/bpf.h>
#include <linux/bpf_common.h>

#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/in.h>
#include <linux/in6.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <linux/pkt_cls.h>

#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "Dual MIT/GPL";

struct packet_t {
    struct in6_addr src_ip;
    struct in6_addr dst_ip;
    __be16 src_port;
    __be16 dst_port;
    __u8 protocol;
    __u8 ttl;
    bool syn;
    bool ack;
    uint64_t ts;
};


struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 512 * 1024); /* 512 KB */
} pipe SEC(".maps");


SEC("tc")
int flat(struct __sk_buff* skb) {

    if (bpf_skb_pull_data(skb, 0) < 0) {
        return TC_ACT_OK;
    }

    if (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) {
        return TC_ACT_OK;
    }

    void* head = (void*)(long)skb->data;
    void* tail = (void*)(long)skb->data_end;

    if (head + sizeof(struct ethhdr) > tail) {
        return TC_ACT_OK;
    }

    struct ethhdr* eth = head;
    struct iphdr* ip;
    struct ipv6hdr* ipv6;
    struct tcphdr* tcp;
    struct udphdr* udp;

    struct packet_t pkt = { 0 };

    uint32_t offset = 0;

    switch (bpf_ntohs(eth->h_proto)) {
    case ETH_P_IP:
        offset = sizeof(struct ethhdr) + sizeof(struct iphdr);

        if (head + offset > tail) {
            return TC_ACT_OK;
        }

        ip = head + sizeof(struct ethhdr);

        if (ip->protocol != IPPROTO_TCP && ip->protocol != IPPROTO_UDP) {
            return TC_ACT_OK;
        }

        pkt.src_ip.in6_u.u6_addr32[3] = ip->saddr;
        pkt.dst_ip.in6_u.u6_addr32[3] = ip->daddr;

        pkt.src_ip.in6_u.u6_addr16[5] = 0xffff;
        pkt.dst_ip.in6_u.u6_addr16[5] = 0xffff;

        pkt.protocol = ip->protocol;
        pkt.ttl = ip->ttl;

        break;

    case ETH_P_IPV6:
        offset = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);

        if (head + offset > tail) {
            return TC_ACT_OK;
        }

        ipv6 = head + sizeof(struct ethhdr);

        if (ipv6->nexthdr != IPPROTO_TCP && ipv6->nexthdr != IPPROTO_UDP) {
            return TC_ACT_OK;
        }

        pkt.src_ip = ipv6->saddr;
        pkt.dst_ip = ipv6->daddr;

        pkt.protocol = ipv6->nexthdr;
        pkt.ttl = ipv6->hop_limit;

        break;

    default:
        return TC_ACT_OK;
    }

    if (head + offset + sizeof(struct tcphdr) > tail || head + offset + sizeof(struct udphdr) > tail) {
        return TC_ACT_OK;
    }

    switch (pkt.protocol) {
    case IPPROTO_TCP:
        tcp = head + offset;

        if (tcp->syn) {
            pkt.src_port = tcp->source;
            pkt.dst_port = tcp->dest;
            pkt.syn = tcp->syn;
            pkt.ack = tcp->ack;
            pkt.ts = bpf_ktime_get_ns();

            if (bpf_ringbuf_output(&pipe, &pkt, sizeof(pkt), 0) < 0) {
                return TC_ACT_OK;
            }
        }
        break;

    case IPPROTO_UDP:
        udp = head + offset;

        pkt.src_port = udp->source;
        pkt.dst_port = udp->dest;
        pkt.ts = bpf_ktime_get_ns();

        if (bpf_ringbuf_output(&pipe, &pkt, sizeof(pkt), 0) < 0) {
            return TC_ACT_OK;
        }
        break;

    default:
        return TC_ACT_OK;
    }

    return TC_ACT_OK;
}

Conclusion

In this post we wrote the kernel space code of our program and saw how to handle both IPv4 and IPv6 packets as well as TCP and UDP segments. We also used a BPF map to pass the data to the user space program efficiently.

In the next post, we will write our user space program to process the data we pass on to it and display some meaningful information.

Thanks for reading!

Building an Efficient Network Flow Monitoring Tool with eBPF - Part 2