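Author: 六炅
This article is a contributed submission to Seebug Paper. Seebug Paper welcomes your contributions, and accepted submissions receive a gift. Submission email: paper@seebug.org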

Preface

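CodeQL is a capable code analysis and scanning tool, and for me it has been a great help in vulnerability hunting. I have been using it for a while, but always on open-source projects, where creating and analyzing a database with CodeQL is relatively simple and comes with few restrictions. Recently, while hunting for Java deserialization gadget chains, I came across gadgetinspector, which analyzes bytecode to obtain an AST and generates candidate call chains according to predefined conditions. That made me think a tool like CodeQL would be more convenient for this kind of analysis. However, for a compiled language with no source code available, a database cannot be created through the normal route. Some small tools for this already exist online, but I still wanted to understand part of how CodeQL works in order to find a way to create a database without source code, and to deepen my understanding of static code analysis along the way.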

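The discussion below uses Java as the target language, and the conclusions are not guaranteed to match reality exactly. The test project used in this article is available on GitHub. It includes the .idea/ directory, so you can open it in IDEA and debug it yourself after adjusting the file paths.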

Prerequisites

  • CodeQL CLI 2.9.1
  • JDK 16
  • Windows OS

Building Database

CodeQL's historical documentation (from before it joined GitHub) roughly describes how a database is created; see the diagram in [1].

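The workflow is roughly as follows: while javac compiles the target code, the Extractor interacts with it. The Extractor generates one trap file from the content of each Java file, and the actual database is later built from those trap files. It also saves a copy of every Java file it processes inside the database, so that the surrounding code can be shown when query results are displayed.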

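Each language has its own independent Extractor. The documentation explains the benefit of this design: different languages have different characteristics.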

CodeQL can create a database with the following command. The example here is a minimal Maven project that contains a single Java file printing Hello World.

codeql database create -l java -c "mvn clean compile" C:\Users\trganda\Documents\databases\app

The directory structure of the resulting database is as follows

|-- codeql-database.yml
|-- db-java # database relation files
|-- log # various logs
|   |-- agent.9008554372696040130.log
|   |-- agent.11337701303103251140.log
|   |-- build-tracer.log
|   |-- database-create-20220509.114127.634.log
|   |-- database-index-files-20220509.114151.283.log
|   |-- ext # configuration files for javac
|   |   |-- javac.args
|   |   |-- javac.env
|   |   |-- javac.orig
|   |   `-- javac.properties
|   |-- javac-errors.log
|   |-- javac-extractor-1683275.log
|   `-- javac-output-11812.log
`-- src.zip # source files

Analyze Build Process

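From the official documentation (using-indirect-build-tracing) and from database-create-20220509.114127.634.log, you can see that database creation is actually split into several sub-steps. The commands executed at each step are as follows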

codeql database init --language=java --source-root=C:\Users\trganda\Documents\app --allow-missing-source-root=false --allow-already-existing -- C:\Users\trganda\Documents\databases\app

codeql database trace-command --working-dir=C:\Users\trganda\Documents\app --index-traceless-dbs --no-db-cluster -- C:\Users\trganda\Documents\databases\app mvn clean compile

codeql database finalize --mode=normal --no-db-cluster -- C:\Users\trganda\Documents\databases\app
    |-- codeql database trace-command --working-dir=C:\Users\trganda\Documents\app --no-tracing -- C:\Users\trganda\Documents\databases\app C:\Program Files\codeql\java\tools\pre-finalize.cmd
    |-- codeql dataset import --dbscheme=C:\Program Files\codeql\java\semmlecode.dbscheme -- C:\Users\trganda\Documents\databases\app\db-java C:\Users\trganda\Documents\databases\app\trap\java
    |-- codeql database cleanup --mode=normal -- C:\Users\trganda\Documents\databases\app
    `-- codeql dataset cleanup --mode=normal -- C:\Users\trganda\Documents\databases\app\db-java
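
The three top-level steps in the log can also be driven one at a time. The following is a minimal sketch that only assembles the argument lists; every flag is copied from the log above (CodeQL CLI 2.9.1) and may differ in other versions. The class name `DatabaseCreateSteps` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the argument lists for the three top-level steps seen in the log above. */
final class DatabaseCreateSteps {

    /** Step 1: create the empty database skeleton. */
    static List<String> init(String sourceRoot, String database) {
        return List.of("codeql", "database", "init",
                "--language=java",
                "--source-root=" + sourceRoot,
                "--allow-missing-source-root=false",
                "--allow-already-existing",
                "--", database);
    }

    /** Step 2: run the build command under the tracer. */
    static List<String> traceCommand(String sourceRoot, String database, List<String> build) {
        List<String> cmd = new ArrayList<>(List.of("codeql", "database", "trace-command",
                "--working-dir=" + sourceRoot,
                "--index-traceless-dbs",
                "--no-db-cluster",
                "--", database));
        cmd.addAll(build); // the build command follows the database path
        return cmd;
    }

    /** Step 3: finalize the database (this triggers the nested sub-steps). */
    static List<String> finalizeDatabase(String database) {
        return List.of("codeql", "database", "finalize",
                "--mode=normal", "--no-db-cluster", "--", database);
    }

    public static void main(String[] args) {
        String src = "C:\\Users\\trganda\\Documents\\app";
        String db = "C:\\Users\\trganda\\Documents\\databases\\app";
        System.out.println(String.join(" ", init(src, db)));
        System.out.println(String.join(" ",
                traceCommand(src, db, List.of("mvn", "clean", "compile"))));
        System.out.println(String.join(" ", finalizeDatabase(db)));
    }
}
```

Each list can be handed to `new ProcessBuilder(list)` to run the step; keeping the steps separate is what makes it possible to debug a single one in isolation.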

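Of these, the second command (codeql database trace-command) is the one related to the Extractor. Let's look at what exactly it does.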

The executable behind the codeql command is codeql.cmd in the installation directory, with the following content

@echo off
rem Wrapper provided for users who explicitly configured VS Code to point to codeql.cmd
"%~dp0\codeql.exe" %*
exit /b %errorlevel%

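On Windows it delegates to an exe file to process the command, which is not what we were hoping to see. Fortunately there is also a shell script named codeql, which serves the Linux platform. It can be used to understand the internal logic of codeql.exe.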

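What it roughly does is set the environment variables CODEQL_PLATFORM, CODEQL_JAVA_HOME and CODEQL_DIST and then run codeql.jar. Looking back at database-create-20220509.114127.634.log, it records that the Java extractor was loaded successfully (Successfully loaded extractor Java). The extractor is located in the java directory of the installation, with its tools under java\tools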

|-- COPYRIGHT
|-- LICENSE
|-- codeql-extractor.yml
|-- semmlecode.dbscheme
|-- semmlecode.dbscheme.stats
`-- tools
    |-- autobuild-fat.jar
    |-- autobuild.cmd
    |-- autobuild.sh
    |-- codeql-java-agent.jar
    |-- compiler-tracing.spec
    |-- linux
    |   `-- ...
    |-- pre-finalize.cmd
    |-- pre-finalize.sh
    |-- semmle-extractor-java.jar
    `-- tracing-config.lua

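Here you can see some jars and scripts, as well as the configuration file codeql-extractor.yml. codeql-java-agent.jar is an agent: it is injected into the JVM before the compilation phase begins and is used to carry out the extractor operation. The meaning of the remaining pieces can be guessed from the log output, so I won't dig into them here.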

Since these are jar files, they are fairly easy to analyze. Load codeql.jar and, from the java\tools directory, autobuild-fat.jar, codeql-java-agent.jar and semmle-extractor-java.jar into IDEA or jd-gui.

In IDEA's Run/Debug Configurations, add two JAR Application configurations, set up as follows

codeql database init

Path to JAR: C:\Program Files\codeql\tools\codeql.jar
VM options: --add-modules jdk.unsupported
Program arguments: database init --language=java --source-root=<your working path> --allow-missing-source-root=false --allow-already-existing -- <your database path>
Working directory: <your working path>
Environment variables: CODEQL_DIST=C:\Program Files\codeql;CODEQL_JAVA_HOME=C:\Program Files\codeql\tools\win64\java;CODEQL_PLATFORM=win64

codeql database trace-command

Path to JAR: C:\Program Files\codeql\tools\codeql.jar
VM options: --add-modules jdk.unsupported
Program arguments: database trace-command --working-dir=<your working path> --index-traceless-dbs --no-db-cluster -- <your database path> mvn clean compile
Working directory: <your working path>
Environment variables: CODEQL_DIST=C:\Program Files\codeql;CODEQL_JAVA_HOME=C:\Program Files\codeql\tools\win64\java;CODEQL_PLATFORM=win64
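
Outside IDEA, the two run configurations amount to starting codeql.jar with three environment variables set. Below is a minimal sketch of that launch using `ProcessBuilder`. The class name `CodeqlJarLauncher` is hypothetical, and the location of the bundled `java` binary under `tools/<platform>/java/bin` is an assumption derived from the CODEQL_JAVA_HOME value above, not something verified against the real launcher.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: prepare a process that runs codeql.jar the way the run configurations do. */
final class CodeqlJarLauncher {

    static ProcessBuilder build(Path dist, String platform, List<String> cliArgs) {
        // Assumption: the bundled JRE lives at <dist>/tools/<platform>/java.
        Path javaHome = dist.resolve("tools").resolve(platform).resolve("java");

        List<String> cmd = new ArrayList<>();
        cmd.add(javaHome.resolve("bin").resolve("java").toString());
        cmd.add("--add-modules");          // same VM option as in the configurations
        cmd.add("jdk.unsupported");
        cmd.add("-jar");
        cmd.add(dist.resolve("tools").resolve("codeql.jar").toString());
        cmd.addAll(cliArgs);               // e.g. database init ... / database trace-command ...

        ProcessBuilder pb = new ProcessBuilder(cmd);
        Map<String, String> env = pb.environment();
        env.put("CODEQL_DIST", dist.toString());
        env.put("CODEQL_JAVA_HOME", javaHome.toString());
        env.put("CODEQL_PLATFORM", platform);
        return pb.inheritIO();             // forward stdout/stderr to this console
    }

    public static void main(String[] args) {
        ProcessBuilder pb = build(Paths.get("C:\\Program Files\\codeql"), "win64",
                List.of("database", "init", "--language=java"));
        // Only print the command; call pb.start() to actually launch it.
        System.out.println(pb.command());
    }
}
```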

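The debugging target is codeql database trace-command. Before debugging, run codeql database init once to initialize the database. Then set a breakpoint at com.semmle.cli2.CodeQL#main and debug codeql database trace-command. The handling logic for database-related commands lives in com.semmle.cli2.database, and from the class names it is easy to find the class for trace-command: com.semmle.cli2.database.TraceCommandCommand. Skimming this class shows that the execution logic is in com.semmle.cli2.database.TraceCommandCommand#executeSubcommand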

protected void executeSubcommand() {
    this.actionVersion = new CodeQLActionVersion() {
        protected boolean isVeryOldAction() {
            return TraceCommandCommand.this.command.size() == 3 && ((String)TraceCommandCommand.this.command.get(1)).endsWith(File.separator + "working" + File.separator + "tracer-env.js") && ((String)TraceCommandCommand.this.command.get(2)).endsWith(File.separator + "working" + File.separator + "env.tmp");
        }
    };
    super.executeSubcommand();
}

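Set another breakpoint here and start debugging. If all goes well, execution reaches super.executeSubcommand();, that is, the method DatabaseProcessCommandCommon#executeSubcommand. The method is long, so look directly at the tail end of it: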

protected void executeSubcommand() {
    ...
    Iterator var32 = commandlines.iterator();

    while(var32.hasNext()) {
        List<String> cmdArgs = (List)var32.next();
        this.printProgress("Running command in {}: {}", new Object[]{workingDir, cmdArgs});
        Builder8 p = new Builder8(cmdArgs, LogbackUtils.streamFor(this.logger(), "build-stdout", true), LogbackUtils.streamFor(this.logger(), "build-stderr", true), Env.systemEnv().getenv(), workingDir.toFile());
        this.env.addToProcess(p);
        List<String> cmdProcessor = new ArrayList();
        CommandLine.addCommandProcessor(cmdProcessor, this.env.expander);
        p.prependArgs(cmdProcessor);
        tracerSetup.enableTracing(p);
        StreamAppender streamOutAppender = new StreamAppender(Streams.out());

        int result;
        try {
            LogbackUtils.addAppender(streamOutAppender);
            result = p.execute();
        } finally {
            LogbackUtils.removeAppender(streamOutAppender);
        }

        if (result != 0) {
            cmdProcessor.addAll(cmdArgs);
            throw new UserError("Exit status " + result + " from command: " + cmdProcessor);
        }
    }
    ...
}

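From the supplied command mvn clean compile it constructs a Builder8, which wraps ProcessBuilder. Once construction is complete, p.execute() is called to run the command. The full command line executed is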

"C:\Program Files\codeql\tools\win64\tracer.exe" "C:\Program Files\codeql\tools\win64\runner.exe" cmd.exe /C type NUL && mvn clean compile

The relevant environment variables (the ones added by codeql) are as follows

CODEQL_PLATFORM=win64;
CODEQL_PLATFORM_DLL_EXTENSION=.dll;
CODEQL_EXTRACTOR_JAVA_LOG_DIR=C:\Users\trganda\Documents\databases\app2\log;
CODEQL_JAVA_HOME=C:\Program Files\codeql\tools\win64\java;
CODEQL_EXTRACTOR_JAVA_SCRATCH_DIR=C:\Users\trganda\Documents\databases\app2\working;
ODASA_TRACER_CONFIGURATION=C:\Users\trganda\Documents\databases\app2\working\tracing\compiler-tracing1707598060791117786.spec;
SEMMLE_JAVA_TOOL_OPTIONS='-javaagent:C:\Program Files\codeql\java\tools/codeql-java-agent.jar=ignore-project,java' '-Xbootclasspath/a:C:\Program Files\codeql\java\tools/codeql-java-agent.jar';
CODEQL_EXTRACTOR_JAVA_WIP_DATABASE=C:\Users\trganda\Documents\databases\app2;
CODEQL_EXTRACTOR_JAVA_ROOT=C:\Program Files\codeql\java;
CODEQL_EXTRACTOR_JAVA_TRAP_DIR=C:\Users\trganda\Documents\databases\app2\trap\java;
CODEQL_TRACER_LOG=C:\Users\trganda\Documents\databases\app2\log\build-tracer.log;
CODEQL_EXTRACTOR_JAVA_SOURCE_ARCHIVE_DIR=C:\Users\trganda\Documents\databases\app2\src;
CODEQL_DIST=C:\Program Files\codeql;

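Many familiar faces appear among these environment variables; we have seen them in the Java extractor. Because the command above involves tracer.exe and runner.exe, analyzing those directly would require additional reverse-engineering tools and would overcomplicate things, so I won't go down that path for now. Instead, use Process Hacker to observe what happens while the command runs.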

Judging from the process tree, the last three java.exe processes run the following commands in order

"C:\Program Files\Common Files\Oracle\Java\javapath\java.exe" -classpath "C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\..\boot\plexus-classworlds-2.6.0.jar"   "-Dclassworlds.conf=C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\..\bin\m2.conf"   "-Dmaven.home=C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\.."   "-Dlibrary.jansi.path=C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\..\lib\jansi-native"   "-Dmaven.multiModuleProjectDirectory=C:\Users\trganda\Documents\app"   org.codehaus.plexus.classworlds.launcher.Launcher clean compile

"C:\Program Files\Java\jdk-16.0.1\bin\java.exe" -classpath "C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\..\boot\plexus-classworlds-2.6.0.jar" "-Dclassworlds.conf=C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\..\bin\m2.conf" "-Dmaven.home=C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\.." "-Dlibrary.jansi.path=C:\Program Files\JetBrains\IntelliJ IDEA 2021.3\plugins\maven\lib\maven3\bin\..\lib\jansi-native" -Dmaven.multiModuleProjectDirectory=C:\Users\trganda\Documents\app org.codehaus.plexus.classworlds.launcher.Launcher clean compile

"C:\Program Files\Java\jdk-16.0.1\bin\java.exe" -Dfile.encoding=windows-1252 -Xmx1024M -Xms256M --add-opens java.base/sun.reflect.annotation=ALL-UNNAMED -classpath "C:\Program Files\codeql\java\tools\semmle-extractor-java.jar" com.semmle.extractor.java.JavaExtractor --jdk-version 16 --javac-args @@@C:\Users\trganda\Documents\databases\app\log\ext\javac.args

The first two invoke the Maven toolchain. The most striking one is the last command: it runs semmle-extractor-java.jar and passes in the file javac.args, whose content looks like this

-Xprefer:source
-d
C:\Users\trganda\Documents\app\target\classes
-classpath
C:\Users\trganda\Documents\app\target\classes;
-sourcepath
C:\Users\trganda\Documents\app\src\main\java;C:\Users\trganda\Documents\app\target\generated-sources\annotations;
-s
C:\Users\trganda\Documents\app\target\generated-sources\annotations
-g
-nowarn
-target
1.7
-source
1.7
-encoding
UTF-8
C:\Users\trganda\Documents\app\src\main\java\org\example\App.java

This kind of file is called a command-line argument file and is used to pass arguments to javac. It is presumably generated while Maven runs.
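
To see what the extractor receives, the argument file can be split into options and source files. The sketch below assumes the layout shown in the listing, one argument per line, and does not handle the quoting rules of general javac argument files. The class name `JavacArgsFile` and the set of value-taking options are illustrative and limited to the options that appear above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: split a one-argument-per-line javac.args file into options, flags and sources. */
final class JavacArgsFile {
    // Options from the listing that consume the next line as their value.
    private static final Set<String> WITH_VALUE =
            Set.of("-d", "-classpath", "-sourcepath", "-s", "-target", "-source", "-encoding");

    final Map<String, String> options = new LinkedHashMap<>();
    final List<String> flags = new ArrayList<>();
    final List<String> sources = new ArrayList<>();

    static JavacArgsFile parse(List<String> lines) {
        JavacArgsFile result = new JavacArgsFile();
        for (int i = 0; i < lines.size(); i++) {
            String arg = lines.get(i).trim();
            if (arg.isEmpty()) {
                continue;
            }
            if (WITH_VALUE.contains(arg) && i + 1 < lines.size()) {
                result.options.put(arg, lines.get(++i).trim()); // option plus its value
            } else if (arg.startsWith("-")) {
                result.flags.add(arg);                          // stand-alone flag such as -g
            } else {
                result.sources.add(arg);                        // anything else: a source file
            }
        }
        return result;
    }

    static JavacArgsFile parse(Path file) throws IOException {
        return parse(Files.readAllLines(file));
    }

    public static void main(String[] args) throws IOException {
        JavacArgsFile parsed = parse(Path.of(args[0]));
        System.out.println("sourcepath = " + parsed.options.get("-sourcepath"));
        System.out.println("sources    = " + parsed.sources);
    }
}
```

For the listing above, `sources` would contain the single entry for App.java and `options.get("-source")` would be 1.7.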

Tracer

You may wonder how semmle-extractor-java.jar gets executed. Although tracer.exe and runner.exe were not analyzed, javac.env and the environment variable SEMMLE_JAVA_TOOL_OPTIONS suggest that while "C:\Program Files\codeql\tools\win64\tracer.exe" "C:\Program Files\codeql\tools\win64\runner.exe" cmd.exe /C type NUL && mvn clean compile is running, codeql-java-agent.jar is planted into the JVM as an agent.

The following part was added later.

Under the $CODEQL_HOME/tools directory there is a tracer directory containing a file named base.lua. Open it and the comments state its purpose quite openly.

-- Overview:
-- Each traced language contains a `tracing-config.lua` file that defines two functions:
-- GetCompatibleVersions() -> [versionNumbers]. This function returns a list of major versions that
--   are compatible with this `tracing-config.lua` file.
-- RegisterExtractorPack(languageId) -> [matchers]. This function is called at by
--   the Lua tracer runtime. It returns a list of matchers for this language.
--   A matcher is a  function of the form function(compilerName, compilerPath, compilerArguments, languageID) -> Table | nil.
--   The return value of a matcher is either `nil` (no match) or a table with the following keys:
--     `trace`: True if the processes created by the compiler (and extractor) should be traced for the current language
--     `replace`: If true, then the compiler process is not run
--     `invocations`: A list of extractor invocations. Each invocation is a table with key `path` (absolute path to the executable)
--                    and key `arguments` XOR `transformedArguments` (see explanation below)
--   For convenience, the `CreatePatternMatcher` function is provided that deals with most of the low-level details
--   of creating matchers.
--
-- `compilerArguments` has the following structure:
-- {
--   "nativeArgumentPointer": Opaque pointer that can be used to create transformations of these command line arguments
--                        that are executed in C++. This is mostly necessary for Windows, where we want to
--                        prepend/append to the command line without parsing it
--   "argv": Posix-only, array of command line arguments passed to the compiler
--   "commandLineString": Windows-only, the string passed to CreateProcess*(), with the path to the compile removed (and converted to UTF-8).
--                  Can be parsed into an argv array using `NativeCommandLineToArgv`, but be warned, this is not
--                  a canonical interpretation of the command line.
-- }
-- The arguments for an extractor invocation have two possible shapes:
--   either, the invocation sets the key `transformedArguments` (like `BuildExtractorInvocation` does), which is a table with
--   the following keys:
--     `nativeArgumentPointer`: The same opaque pointer, copied from the compiler invocation
--     `prepend`: A list of arguments to prepend to the arguments from the compiler
--     `append`: A list of arguments to append to the arguments from the compiler
--   alternatively, it sets the key `arguments`, which is a table with the following keys:
--     `argv`: Posix-only: The command line arguments (without argv[0])
--     `commandLineString`: Windows-only: The command line string (without the leading path to the executable).
--                    This will be converted internally to UTF-16 before execution.
--
-- The user can specify an extra lua config file on the command line.
-- This is loaded after all enabled languages have been loaded. This file also needs to contain a `GetCompatibleVersions`
-- function, just like a regular tracing config.
-- Second, it is required to contain a function
-- RegisterExtraConfig() -> [{languageID -> [matchers]}], i.e. a function that returns a table
--   mapping language IDs to a list of matchers. For each language ID, these matchers will _overwrite_ the matchers
-- registered by that language.
-- Furthermore, this function has full access to the implementation details of `base.lua`. However, obviously
-- no guarantees about compatibility are made when accessing internal functions or state.
--
-- If tracing is enabled for multiple languages, the languages are processed in lexicographical order of the language ID.
-- For each language, the matchers are processed in the order supplied, until the first matcher returns non-nil.
-- Then, matching for that language is stopped.
-- Matchers between different languages are not allowed to cooperate - each language is supposed to be independent
-- of the other possibly active languages.
-- There is one exception, though: If two languages specify `replace=true` for the same compiler invocation,
-- then matching for the second language is aborted without action. In this case, a log message is emitted.

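This file works together with tracer.exe. Each language's extractor ships a tracing-config.lua file, which acts somewhat like a plugin and must implement two functions, GetCompatibleVersions and RegisterExtractorPack. The former declares the versions it supports; the latter is called by the tracer and returns a list of matchers. A matcher can identify a compiler and insert arguments. Taking the Java extractor as an example, its tracing-config.lua is as follows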

function RegisterExtractorPack(id)
    local pathToAgent = AbsolutifyExtractorPath(id, 'tools' .. PathSep ..
                                                    'codeql-java-agent.jar')
    -- inject our CodeQL agent into all processes that boot a JVM
    return {
        CreatePatternMatcher({'.'}, MatchCompilerName, nil, {
            jvmPrependArgs = {
                '-javaagent:' .. pathToAgent .. '=ignore-project,java,kotlin:experimental',
                '-Xbootclasspath/a:' .. pathToAgent
            }
        })
    }
end

-- Return a list of minimum supported versions of the configuration file format
-- return one entry per supported major version.
function GetCompatibleVersions() return {'1.0.0'} end

As the comment states, the agent file codeql-java-agent.jar is injected into the JVM.
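
For readers unfamiliar with `-javaagent`, the JVM calls a static `premain` method in the agent jar before the application's `main`, passing whatever follows the `=` sign as a single string. The sketch below shows that general mechanism only. It is not the code of codeql-java-agent.jar; the class name `SketchAgent` and the comma-splitting of the option string are assumptions made for illustration.

```java
import java.lang.instrument.Instrumentation;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of a Java agent entry point. In a real agent jar this class would be
 * declared public and named in the Premain-Class attribute of the manifest.
 */
final class SketchAgent {

    /** Splits an option string such as "ignore-project,java" (assumed format). */
    static List<String> parseOptions(String agentArgs) {
        if (agentArgs == null || agentArgs.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(agentArgs.split(","));
    }

    /** Called by the JVM before main() when started with -javaagent:agent.jar=options. */
    public static void premain(String agentArgs, Instrumentation inst) {
        // A real agent would hook the compiler here; this one only reports its options.
        System.err.println("agent loaded with options " + parseOptions(agentArgs));
    }

    public static void main(String[] args) {
        // Simulate the JVM's call for demonstration; Instrumentation is unused here.
        premain("ignore-project,java", null);
    }
}
```

Because the hook runs inside every JVM that the build starts, the agent can observe the javac invocation from within the build process and does not need to parse Maven's output.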

The following is my earlier understanding.

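This process cannot be observed directly in Process Hacker, but there are still some clues that confirm it.

  • The log file build-tracer.log contains Reading configuration file ...\working\tracing\compiler-tracing12908925883751484166.spec
  • compiler-tracing12908925883751484166.spec is derived from compiler-tracing.spec, which contains the corresponding agent arguments
  • tracer.exe contains the string ODASA_TRACER_CONFIGURATION, which points to the spec file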

You can open codeql-java-agent.jar with jd-gui and read its code. In com.semmle.extractor.java.Utils#loadClass you will see

private static Class<?> loadClass(String name) {
    Class result;
    try {
        result = Class.forName(name);
    } catch (ClassNotFoundException var10) {
        String extractorTools = getExtractorTools();
        if (extractorTools == null) {
            throw new RuntimeException("Failed to determine SEMMLE_DIST", var10);
        }

        File extractorJar = new File(extractorTools, "semmle-extractor-java.jar");
        if (!extractorJar.exists() || !extractorJar.canRead()) {
            throw new RuntimeException("Cannot read semmle-extractor-java jar from " + extractorJar + " -- check SEMMLE_DIST", var10);
        }

        URL url;
        try {
            url = extractorJar.getAbsoluteFile().toURI().toURL();
        } catch (MalformedURLException var9) {
            throw new RuntimeException("Failed to convert " + extractorJar + " to URL", var9);
        }

        URLClassLoader loader = new URLClassLoader(new URL[]{url});

        try {
            result = loader.loadClass(name);
        } catch (ClassNotFoundException var8) {
            throw new RuntimeException("Failed to load " + name + " from " + extractorJar + " -- check SEMMLE_DIST", var8);
        }
    }

    return result;
}

So semmle-extractor-java.jar is loaded through Utils. codeql-java-agent.jar contains little code, and its general logic can be understood by reading it statically.

Java Extractor

From the analysis so far, the Extractor's work is done in semmle-extractor-java.jar. Based on what Process Hacker showed, add a Debug configuration in IDEA

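I switched machines partway through, so some paths look different, but this does not affect reading. Also, semmle-extractor-java.jar contains no manifest file (MANIFEST.MF), so the jar cannot be run directly; create an Application configuration for debugging instead.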

Main class: com.semmle.extractor.java.JavaExtractor
Program arguments: --jdk-version 16 --javac-args @@@E:\Documents\databases\app\log\ext\javac.args
Environment variables: CODEQL_PLATFORM=win64;CODEQL_PLATFORM_DLL_EXTENSION=.dll;CODEQL_EXTRACTOR_JAVA_LOG_DIR=E:\Documents\databases\app2\log;CODEQL_JAVA_HOME=E:\Program Files\codeql\tools\win64\java;CODEQL_EXTRACTOR_JAVA_SCRATCH_DIR=E:\Documents\databases\app2\working;CODEQL_EXTRACTOR_JAVA_WIP_DATABASE=E:\Documents\databases\app2;CODEQL_EXTRACTOR_JAVA_ROOT=E:\Program Files\codeql\java;CODEQL_EXTRACTOR_JAVA_TRAP_DIR=E:\Documents\databases\app2\trap\java;CODEQL_TRACER_LOG=E:\Documents\databases\app2\log\build-tracer.log;CODEQL_EXTRACTOR_JAVA_SOURCE_ARCHIVE_DIR=E:\Documents\databases\app2\src;CODEQL_DIST=E:\Program Files\codeql

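Don't rush into debugging. First just run it and see what changes under the database/app directory. Note that the corresponding files must be placed in the log/ext directory; they can be taken from a normal database creation run. After the run, two directories are added, src and trap: src holds the project's source code, and trap holds the trap files.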

codeql provides commands for importing trap files and generating the database; they also appeared in the creation steps listed earlier.

Usage: codeql dataset <command> <argument>...
[Plumbing] Work with raw QL datasets.
Commands:
  import   [Plumbing] Import a set of TRAP files to a raw dataset.
  upgrade  [Plumbing] Upgrade a dataset so it is usable by the current tools.
  cleanup  [Plumbing] Clean up temporary files from a dataset.
  check    [Plumbing] Check a particular dataset for internal consistency.
  measure  [Plumbing] Collect statistics about the relations in a particular
             dataset.

The trap folder lists information about the project sources and the JDK classes they depend on. Its structure is as follows

|-- Java
    |-- classes
    |-- diagnostics
    `-- E_\Projects\IdeaProjects\app\src\main\java\org\example\

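The trap files for the project sources are in E_\Projects\IdeaProjects\app\src\main\java\org\example\, which contains three files: App.java.dep, App.java.set and App.java.trap.gz. App.java.trap.gz can be decompressed to inspect the trap content. The project code only calls System.out.println to print Hello, World!, so the trap content, reproduced in full below, is relatively simple.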
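
The decompression step needs nothing beyond the JDK. Here is a minimal sketch that reads a .trap.gz file line by line, assuming the content is UTF-8 text as the listing suggests; the class name `TrapReader` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

/** Sketch: read a gzip-compressed trap file as a list of text lines. */
final class TrapReader {

    static List<String> readLines(Path trapGz) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(trapGz)),
                StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TrapReader <path to App.java.trap.gz>
        for (String line : readLines(Path.of(args[0]))) {
            System.out.println(line);
        }
    }
}
```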

The CodeQL DB architecture is based on Datalog. If you are familiar with Datalog, the content of this file will be much easier to understand.

// Generated by the CodeQL Java extractor
#10000=@"E:/Projects/IdeaProjects/app/src/main/java/org/example/App.java;sourcefile"
files(#10000,"E:/Projects/IdeaProjects/app/src/main/java/org/example/App.java")
#10001=@"E:/Projects/IdeaProjects/app/src/main/java/org/example;folder"
folders(#10001,"E:/Projects/IdeaProjects/app/src/main/java/org/example")
#10002=@"E:/Projects/IdeaProjects/app/src/main/java/org;folder"
folders(#10002,"E:/Projects/IdeaProjects/app/src/main/java/org")
#10003=@"E:/Projects/IdeaProjects/app/src/main/java;folder"
folders(#10003,"E:/Projects/IdeaProjects/app/src/main/java")
#10004=@"E:/Projects/IdeaProjects/app/src/main;folder"
folders(#10004,"E:/Projects/IdeaProjects/app/src/main")
#10005=@"E:/Projects/IdeaProjects/app/src;folder"
folders(#10005,"E:/Projects/IdeaProjects/app/src")
#10006=@"E:/Projects/IdeaProjects/app;folder"
folders(#10006,"E:/Projects/IdeaProjects/app")
#10007=@"E:/Projects/IdeaProjects;folder"
folders(#10007,"E:/Projects/IdeaProjects")
#10008=@"E:/Projects;folder"
folders(#10008,"E:/Projects")
#10009=@"E:/;folder"
folders(#10009,"E:/")
containerparent(#10009,#10008)
containerparent(#10008,#10007)
containerparent(#10007,#10006)
containerparent(#10006,#10005)
containerparent(#10005,#10004)
containerparent(#10004,#10003)
containerparent(#10003,#10002)
containerparent(#10002,#10001)
containerparent(#10001,#10000)
#10010=@"loc,{#10000},0,0,0,0"
locations_default(#10010,#10000,0,0,0,0)
hasLocation(#10000,#10010)
numlines(#10000,9,8,0)
#10011=@"package;org.example"
packages(#10011,"org.example")
cupackage(#10000,#10011)
#10012=@"class;org.example.App"
#10013=@"loc,{#10000},3,14,3,16"
locations_default(#10013,#10000,3,14,3,16)
hasLocation(#10012,#10013)
numlines(#10012,6,6,0)
#10014=@"type;void"
primitives(#10014,"void")
#10015=@"unknown;sourcefile"
files(#10015,"")
#10016=@"loc,{#10015},0,0,0,0"
locations_default(#10016,#10015,0,0,0,0)
hasLocation(#10014,#10016)
#10017=@"callable;{#10012}.<init>(){#10014}"
locations_default(#10013,#10000,3,14,3,16)
hasLocation(#10017,#10013)
numlines(#10017,1,1,0)
#10018=*
stmts(#10018,0,#10017,0,#10017)
#10019=*
locations_default(#10019,#10000,3,14,3,16)
hasLocation(#10018,#10019)
numlines(#10018,1,1,0)
#10020=*
stmts(#10020,20,#10018,0,#10017)
#10021=*
locations_default(#10021,#10000,3,14,3,16)
hasLocation(#10020,#10021)
numlines(#10020,1,1,0)
#10022=@"class;java.lang.Object"
#10023=@"callable;{#10022}.<init>(){#10014}"
callableBinding(#10020,#10023)
#10024=@"class;java.lang.String"
#10025=@"array;1;{#10024}"
arrays(#10025,"String[]",#10024,1,#10024)
locations_default(#10016,#10015,0,0,0,0)
hasLocation(#10025,#10016)
#10026=@"field;{#10025};length"
#10027=@"type;int"
fields(#10026,"length",#10027,#10025,#10026)
#10028=@"modifier;public"
modifiers(#10028,"public")
hasModifier(#10026,#10028)
#10029=@"modifier;final"
modifiers(#10029,"final")
hasModifier(#10026,#10029)
#10030=@"callable;{#10025}.clone(){#10025}"
methods(#10030,"clone","clone()",#10025,#10025,#10030)
hasModifier(#10030,#10028)
extendsReftype(#10025,#10022)
#10031=@"class;java.lang.Cloneable"
implInterface(#10025,#10031)
#10032=@"class;java.io.Serializable"
implInterface(#10025,#10032)
#10033=@"callable;{#10012}.main({#10025}){#10014}"
#10034=@"loc,{#10000},5,24,5,27"
locations_default(#10034,#10000,5,24,5,27)
hasLocation(#10033,#10034)
numlines(#10033,4,4,0)
#10035=*
stmts(#10035,0,#10033,0,#10033)
#10036=*
locations_default(#10036,#10000,6,5,8,5)
hasLocation(#10035,#10036)
numlines(#10035,3,3,0)
#10037=*
exprs(#10037,62,#10014,#10033,-1)
callableEnclosingExpr(#10037,#10033)
#10038=*
locations_default(#10038,#10000,5,19,5,22)
hasLocation(#10037,#10038)
numlines(#10037,1,1,0)
#10039=@"params;{#10033};0"
params(#10039,#10025,0,#10033,#10039)
paramName(#10039,"args")
#10040=@"loc,{#10000},5,30,5,42"
locations_default(#10040,#10000,5,30,5,42)
hasLocation(#10039,#10040)
#10041=*
exprs(#10041,63,#10025,#10039,-1)
callableEnclosingExpr(#10041,#10033)
#10042=*
locations_default(#10042,#10000,5,30,5,37)
hasLocation(#10041,#10042)
numlines(#10041,1,1,0)
#10043=*
exprs(#10043,62,#10024,#10041,0)
callableEnclosingExpr(#10043,#10033)
#10044=*
locations_default(#10044,#10000,5,30,5,35)
hasLocation(#10043,#10044)
numlines(#10043,1,1,0)
#10045=*
stmts(#10045,14,#10035,0,#10033)
#10046=*
locations_default(#10046,#10000,7,9,7,45)
hasLocation(#10045,#10046)
numlines(#10045,1,1,0)
#10047=*
exprs(#10047,61,#10014,#10045,0)
callableEnclosingExpr(#10047,#10033)
statementEnclosingExpr(#10047,#10045)
#10048=*
locations_default(#10048,#10000,7,9,7,44)
hasLocation(#10047,#10048)
numlines(#10047,1,1,0)
#10049=*
#10050=@"class;java.io.PrintStream"
exprs(#10049,60,#10050,#10047,-1)
callableEnclosingExpr(#10049,#10033)
statementEnclosingExpr(#10049,#10045)
#10051=*
locations_default(#10051,#10000,7,9,7,18)
hasLocation(#10049,#10051)
numlines(#10049,1,1,0)
#10052=@"callable;{#10050}.println({#10024}){#10014}"
callableBinding(#10047,#10052)
#10053=*
exprs(#10053,22,#10024,#10047,0)
callableEnclosingExpr(#10053,#10033)
statementEnclosingExpr(#10053,#10045)
#10054=*
locations_default(#10054,#10000,7,29,7,42)
hasLocation(#10053,#10054)
numlines(#10053,1,1,0)
#10055=*
#10056=@"class;java.lang.System"
exprs(#10055,62,#10056,#10049,-1)
callableEnclosingExpr(#10055,#10033)
statementEnclosingExpr(#10055,#10045)
#10057=*
locations_default(#10057,#10000,7,9,7,14)
hasLocation(#10055,#10057)
numlines(#10055,1,1,0)
#10058=@"field;{#10056};out"
variableBinding(#10049,#10058)
namestrings("""Hello World!""","Hello World!",#10053)

Its content is not hard to understand. First, this file is created according to the semmlecode.dbscheme file; each language's extractor ships one such file.

#10000=@"E:/Projects/IdeaProjects/app/src/main/java/org/example/App.java;sourcefile"

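#10000 can be understood as a label, similar to a row id in a database table. The labels of each trap file are independent of those in other trap files.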
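
Two forms of label definition appear in the listing: `#N=@"key"` and `#N=*`. The reading assumed here, which the listing is consistent with but this article does not confirm, is that a keyed label names an entity that can be matched across trap files, while `*` asks for a fresh id with no key. The sketch below only recognizes the two syntactic forms; the class name `TrapLabel` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: recognize the two label-definition forms seen in the trap listing. */
final class TrapLabel {
    // #<digits>= followed by either * or @"<key>"
    private static final Pattern DEFINITION =
            Pattern.compile("#(\\d+)=(?:\\*|@\"(.*)\")");

    final int id;
    final String key; // null when the label was defined with *

    private TrapLabel(int id, String key) {
        this.id = id;
        this.key = key;
    }

    /** Returns the parsed label, or null if the line is not a label definition. */
    static TrapLabel parse(String line) {
        Matcher m = DEFINITION.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new TrapLabel(Integer.parseInt(m.group(1)), m.group(2));
    }

    boolean isFresh() {
        return key == null;
    }

    public static void main(String[] args) {
        TrapLabel keyed = parse("#10011=@\"package;org.example\"");
        TrapLabel fresh = parse("#10018=*");
        System.out.println(keyed.id + " key=" + keyed.key);
        System.out.println(fresh.id + " fresh=" + fresh.isFresh());
    }
}
```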

files(#10000,"E:/Projects/IdeaProjects/app/src/main/java/org/example/App.java")

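This is a tuple, built according to the conventions laid down in semmlecode.dbscheme. In that file you can see declarations such as the following (the excerpt shown is the one for folders; judging by the two-column tuple above, files has the same shape)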

folders(
  unique int id: @folder,
  string name: string ref
);

So the line above denotes a file whose id is #10000 and whose path is E:/Projects/IdeaProjects/app/src/main/java/org/example/App.java

The remaining tuples can all be read by the same logic.
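
Reading the tuples mechanically is straightforward. The sketch below splits a line such as `files(#10000,"...")` into a relation name and its arguments, treating a doubled quote inside a string as a literal quote, which is how the `namestrings` line in the listing appears to encode "Hello World!". Labels and numbers are returned as raw text, and a tuple with no arguments yields one empty argument. The class name `TrapTuple` is hypothetical, and the quoting rule is inferred from the listing alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split one trap tuple line into its relation name and arguments. */
final class TrapTuple {
    final String relation;
    final List<String> args;

    private TrapTuple(String relation, List<String> args) {
        this.relation = relation;
        this.args = args;
    }

    /** Returns the parsed tuple, or null if the line is not of the form name(...). */
    static TrapTuple parse(String line) {
        int open = line.indexOf('(');
        if (open <= 0 || !line.endsWith(")")) {
            return null;
        }
        String relation = line.substring(0, open);
        if (!relation.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            return null; // e.g. a label definition whose key contains '('
        }
        int end = line.length() - 1; // index of the closing parenthesis
        List<String> args = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inString = false;
        for (int i = open + 1; i < end; i++) {
            char c = line.charAt(i);
            if (inString) {
                if (c != '"') {
                    current.append(c);
                } else if (i + 1 < end && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote: literal quote character
                    i++;
                } else {
                    inString = false;    // closing quote
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == ',') {
                args.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        args.add(current.toString());
        return new TrapTuple(relation, args);
    }

    public static void main(String[] args) {
        TrapTuple t = parse("packages(#10011,\"org.example\")");
        System.out.println(t.relation + " " + t.args);
    }
}
```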

Next, let's step into the source code and see what it actually does.

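Set a breakpoint at com.semmle.extractor.java.JavaExtractor#main. It first creates a JavaExtractor object from the supplied arguments and then calls runExtractor to perform the extraction and generate the trap files. The log object for the javac-extractor*.log file is created in a static initializer by LOG_ID = MarkerFactory.getMarker("javac-extractor" + PID);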

public static void main(String[] args) {
    String allArgs = StringUtil.glue(" ", args);
    JavaExtractor extractor = new JavaExtractor(args);
    boolean hasJavacErrors = false;

    try {
        hasJavacErrors = !extractor.runExtractor();
    } catch (Throwable var8) {
        label102: {
            if (extractor.log != null) {
                extractor.log.error("Exception running the extractor with arguments: {}", allArgs);
                extractor.log.error("Exception: ", var8);
            }

            if (!(var8 instanceof Abort) && !(var8 instanceof FatalError)) {
                if (!(var8 instanceof OutOfMemoryError) && !(var8 instanceof UnknownError)) {
                    break label102;
                }

                throw var8;
            }

            throw var8;
        }
    } finally {
        extractor.close();
    }

    if (extractor.strictJavacErrors && hasJavacErrors) {
        throw new UserError("Compilation errors were reported by javac.");
    }
}

Step into runExtractor. The code is long, so I have added some comments to make it easier to follow

boolean runExtractor() {
    long time = System.nanoTime();
    long cpuTime = getCurrentThreadCpuTime();
    Context context = this.output.getContext();
    /* Create the log writer; output goes to the javac-output<PID> file */
    Factory<PrintWriter> logFactory = new Factory<PrintWriter>() {
        public PrintWriter make(Context c) {
            return new PrintWriter(LogbackUtils.streamFor(JavaExtractor.this.log, "javac-output" + JavaExtractor.PID, false));
        }
    };
    context.put(Log.outKey, logFactory);
    context.put(Log.errKey, logFactory);
    JavacFileManager.preRegister(context, this.specialSourcepathHandling);
    /* javac arguments */
    Arguments arguments = this.setupJavacOptions(context);
    Options.instance(context).put("ignore.symbol.file", "ignore.symbol.file");
    JavaFileManager jfm = (JavaFileManager)context.get(JavaFileManager.class);
    JavaFileManager bfm = jfm instanceof DelegatingJavaFileManager ? ((DelegatingJavaFileManager)jfm).getBaseFileManager() : jfm;
    JavacFileManager dfm = (JavacFileManager)bfm;
    dfm.handleOptions(arguments.getDeferredFileManagerOptions());
    arguments.validate();
    if (jfm.isSupportedOption(Option.MULTIRELEASE.primaryName) == 1) {
        Target target = Target.instance(context);
        List<String> list = List.of(target.multiReleaseValue());
        jfm.handleOption(Option.MULTIRELEASE.primaryName, list.iterator());
    }

    JavaCompiler compiler = JavaCompiler.instance(context);
    compiler.genEndPos = true;

    /* List the files to be compiled */
    Set<JavaFileObject> fileObjects = arguments.getFileObjects();
    /* DiagnosticTrapWriter writes diagnostic information (i.e. logs) into trap/java/diagnostics */
    DiagnosticTrapWriter diagWriter = this.dw.getDiagnosticTrapWriter();
    if (diagWriter != null) {
        Iterator var14 = fileObjects.iterator();

        while(var14.hasNext()) {
            JavaFileObject jfo = (JavaFileObject)var14.next();
            diagWriter.writeFileArgument(jfo);
        }
    }

    /* Parse the source files with javac to obtain the context information */
    javac_extend.com.sun.tools.javac.util.List<JCCompilationUnit> parsedFiles = compiler.parseFiles(fileObjects);
    compiler.enterTrees(compiler.initModules(parsedFiles));
    Queue<Queue<javac_extend.com.sun.tools.javac.comp.Env<AttrContext>>> groupedTodos = Todo.instance(context).groupByFile();
    long javacInitTime = System.nanoTime() - time;
    long javacInitCpuTime = getCurrentThreadCpuTime() - cpuTime;
    if (diagWriter != null) {
        diagWriter.writeCompilationFileTime((double)javacInitCpuTime / 1.0E9D, (double)javacInitTime / 1.0E9D, 0.0D, 0.0D);
    }

    int prevErr = 0;

    while(true) {
        long currJavacCpu;
        long cpu;
        long currJavacTime;
        while(true) {
            JCCompilationUnit cu;
            while(true) {
                Queue todo;
                do {
                    /* Check the pending work; if nothing is left, return */
                    if ((todo = (Queue)groupedTodos.poll()) == null) {
                        long totalExtractorTime = System.nanoTime() - this.extractorStartTime;
                        this.log(String.format("Javac init time: %.1fs", (double)javacInitTime / 1.0E9D));
                        this.log(String.format("Javac attr time: %.1fs", (double)this.javacTime / 1.0E9D));
                        this.log(String.format("Extractor time: %.1fs", (double)this.extractorTime / 1.0E9D));
                        long otherTime = totalExtractorTime - javacInitTime - this.javacTime - this.extractorTime;
                        this.log(String.format("Other time: %.1fs", (double)otherTime / 1.0E9D));
                        this.log(String.format("Total time: %.1fs", (double)totalExtractorTime / 1.0E9D));
                        int totalErrors = compiler.errorCount();
                        compiler.close();
                        if (diagWriter != null) {
                            diagWriter.writeCompilationFinished((double)getCurrentThreadCpuTime() / 1.0E9D, (double)totalExtractorTime / 1.0E9D);
                        }

                        if (totalErrors != 0) {
                            this.log.error(LOG_ID, totalErrors + " errors were reported by javac.");
                            return false;
                        }

                        return true;
                    }

                    cu = null;
                    Iterator var23 = todo.iterator();

                    while(var23.hasNext()) {
                        javac_extend.com.sun.tools.javac.comp.Env<AttrContext> env = (javac_extend.com.sun.tools.javac.comp.Env)var23.next();
                        if (cu == null) {
                            cu = env.toplevel;
                        } else if (cu != env.toplevel) {
                            throw new CatastrophicError("Not grouped by file: CUs " + cu + " and " + env.toplevel);
                        }
                    }
                } while(cu == null);

                if (diagWriter != null) {
                    diagWriter.writeCompilationFileStart(cu);
                }

                cpu = getCurrentThreadCpuTime();
                time = System.nanoTime();

                try {
                    Queue<javac_extend.com.sun.tools.javac.comp.Env<AttrContext>> queue = compiler.attribute(todo);
                    String envFlowChecks = System.getenv("CODEQL_EXTRACTOR_JAVA_FLOW_CHECKS");
                    if (envFlowChecks == null || Boolean.valueOf(envFlowChecks)) {
                        compiler.flow(queue);
                    }
                    break;
                } catch (StackOverflowError | Exception var36) {
                    this.logThrowable(cu, var36);
                }
            }

            currJavacTime = System.nanoTime() - time;
            this.javacTime += currJavacTime;
            currJavacCpu = getCurrentThreadCpuTime() - cpu;
            cpu = getCurrentThreadCpuTime();
            time = System.nanoTime();

            try {
                CharSequence cachedContent = dfm.getCachedContent(cu.getSourceFile());
                if (cachedContent == null) {
                    try {
                        cachedContent = cu.getSourceFile().getCharContent(false);
                    } catch (IOException var37) {
                        this.logThrowable(cu, var37);
                        continue;
                    }
                }

                String contents = ((CharSequence)cachedContent).toString();
                /**
                 * Run the extractor on the result produced by the compiler.
                 * this.output holds the paths where the `trap` and `src` files are saved:
                 *    trapFolder=E:\Documents\databases\app2\trap\java
                 *    sourceArchiveFolder=E:\Documents\databases\app2\src
                 */
                (new CompilationUnitExtractor(this.output, cu, this.dw)).process(contents);
            } catch (StackOverflowError | Exception var38) {
                this.logThrowable(cu, var38);
            }
            break;
        }

        long currExtractorTime = System.nanoTime() - time;
        this.extractorTime += currExtractorTime;
        long currExtractorCpu = getCurrentThreadCpuTime() - cpu;
        if (diagWriter != null) {
            diagWriter.writeCompilationFileTime((double)currJavacCpu / 1.0E9D, (double)currJavacTime / 1.0E9D, (double)currExtractorCpu / 1.0E9D, (double)currExtractorTime / 1.0E9D);
            int currErr = compiler.errorCount();
            int deltaErr = currErr - prevErr;
            if (deltaErr > 0) {
                String errorMsg = String.valueOf(deltaErr);
                diagWriter.writeDiagnostic(DiagSeverity.ErrorHigh, errorMsg, DiagKind.SOURCE, (Label)null);
            }

            prevErr = currErr;
        }
    }
}
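
The nested while(true) blocks produced by the decompiler make the loop above hard to follow. Its control flow can be approximated by the following minimal sketch. It is a simplified model rather than the extractor's real code: the function name, the parameters and the (unit, text) pairs are hypothetical stand-ins for compiler.attribute/compiler.flow, CompilationUnitExtractor.process and the todo queue.

```python
import time
from collections import deque


def run_extraction(grouped_todos, attribute, extract, log):
    """Simplified model of the per-file loop; every name here is hypothetical.

    grouped_todos: deque of (compilation_unit, source_text) pairs
    attribute:     callable standing in for compiler.attribute / compiler.flow
    extract:       callable standing in for CompilationUnitExtractor.process
    log:           callable that receives error messages
    Returns (javac_seconds, extractor_seconds, failed_units).
    """
    javac_time = 0.0
    extractor_time = 0.0
    failed = []

    while grouped_todos:
        cu, source_text = grouped_todos.popleft()

        # Phase 1: let the compiler attribute the unit.
        start = time.perf_counter()
        try:
            attribute(cu)
        except Exception as exc:
            # Log the failure and move on to the next unit.
            log(f"{cu}: attribute failed: {exc}")
            failed.append(cu)
            continue
        javac_time += time.perf_counter() - start

        # Phase 2: hand the attributed unit and its text to the extractor.
        start = time.perf_counter()
        try:
            extract(cu, source_text)
        except Exception as exc:
            # A failing unit is logged but does not stop the run.
            log(f"{cu}: extraction failed: {exc}")
            failed.append(cu)
        extractor_time += time.perf_counter() - start

    return javac_time, extractor_time, failed
```

The point of the sketch is that each compilation unit goes through two timed phases, and that an exception in either phase is logged and skipped instead of aborting the whole run.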

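The process function takes the contents of the source file as input and processes them, while the cu (JCCompilationUnit) object passed in when the CompilationUnitExtractor is created holds the context information produced by the compiler.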

Taking ClassDeclExtractor#visitClassDef as an example, it calls this.onDemand.getClassKey to obtain the unique label of the current class. The other methods work in a similar way.

public void visitClassDef(JCClassDecl that) {
    if (this.onDemand.getOutput().getTrackClassOrigins()) {
        this.attributeClassFile(that);
    }

    if (this.extractedClasses.add(that)) {
        this.enclosingCallables.push((Object)null);
        this.enclosingStatements.push((Object)null);
        if (that.type instanceof ClassType) {
            this.onDemand.extractPrivateMembers((ClassType)that.type);
        } else {
            this.log.error(DiagKind.SOURCE, this.treeUtil, "Unexpected type for class " + that.name + ": " + that.type, that);
        }

        Label classId;
        if (that.sym != null) {
            /* Obtain the label, e.g. #10012 */
            classId = this.onDemand.getClassKey(that.sym);
            this.treeUtil.writeKeyedLocation(this.writer, that, classId);
            this.treeUtil.writeJavadocAssociation(this.writer, classId, that);

    ...
}
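
A unique label per class implies some kind of lookup from an entity to its label. The sketch below shows one way to picture that: a table that allocates a label the first time a key is seen and returns the same label afterwards. This is an assumption made for illustration, not the actual implementation of getClassKey; the LabelTable class, the starting number and the "class;..." key format are all hypothetical.

```python
class LabelTable:
    """Hands out one numeric label per distinct key (illustrative only)."""

    def __init__(self, first_id=10000):
        self._next_id = first_id
        self._labels = {}

    def get_key(self, key):
        # Reuse the label if this entity was seen before, otherwise allocate one.
        if key not in self._labels:
            self._labels[key] = f"#{self._next_id}"
            self._next_id += 1
        return self._labels[key]


labels = LabelTable()
hello = labels.get_key("class;org.example.Hello")  # hypothetical key format
other = labels.get_key("class;org.example.Other")
again = labels.get_key("class;org.example.Hello")  # same label as `hello`
```

With a table like this, every reference to the same class resolves to the same label, no matter how many times the class is visited.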

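The java extractor as a whole contains too much code, so I did not dig into every part. The Extractor uses a package named javac_extend.com.sun.tools.javac to drive javac, whereas the JDK only ships com.sun.tools.javac. It is not clear exactly where the two differ, but the package has evidently been modified. The whole jar is something of a patchwork, with many components modified and embedded in it.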

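With this, the overall workflow of the Extractor is roughly clear:

  • Create a javac compiler object from the javac configuration files
  • Have javac preprocess the source files one by one
  • Build the trap files from the results of the previous step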

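Because the material involved is broad, digging further could bog me down. Understanding what it does and how to use it is enough for now, and I may come back to it if the opportunity arises.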

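The analysis above suggests that, while building the database, codeql does not need to compile the source code completely; it only borrows some information from javac as it processes the sources. Furthermore, as long as a correct javac.args can be constructed for the source files, the trap files can be generated, and running codeql database finalize afterwards yields a database.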
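
The split into initialization, trap generation and finalization can be scripted. The minimal sketch below only assembles the three command lines and does not execute them. It uses codeql database trace-command for the middle step, as in the creation log shown earlier, with a reduced set of flags; invoking the extractor directly on a hand-written javac.args is not shown. The function name and the example paths are hypothetical.

```python
import shlex


def database_steps(source_root, database, build_command):
    """Return the three CLI invocations as argument lists (nothing is executed).

    build_command is a list such as ["mvn", "clean", "compile"].
    """
    return [
        # 1. Create the empty database skeleton.
        ["codeql", "database", "init", "--language=java",
         f"--source-root={source_root}", "--", database],
        # 2. Run the build under the tracer so the extractor writes trap files.
        ["codeql", "database", "trace-command",
         f"--working-dir={source_root}", "--", database, *build_command],
        # 3. Import the trap files and finish the database.
        ["codeql", "database", "finalize", "--", database],
    ]


if __name__ == "__main__":
    for step in database_steps("app", "databases/app", ["mvn", "clean", "compile"]):
        print(shlex.join(step))
```

Each list can be passed to subprocess.run as-is, which avoids shell quoting problems with paths that contain spaces.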

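This idea was already mentioned in [2]. Its drawbacks are that the correctness of decompiled code cannot be fully guaranteed, and that differences in the order in which files are compiled can cause errors when the trap files are built. A simpler and more direct alternative is to construct a compile command from the decompiled output and pass that command to codeql database create; related projects exist on github.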

Tools

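According to the analysis above, CodeQL does not care about the full build process or its output when creating a database; it only uses part of the data produced during compilation. So for any java code, whatever its build system, database creation can proceed as long as the compiler does not exit on errors while compiling the java files.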

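The script below can be used to create a database, using the dubbo project as an example. ecj.jar must be downloaded beforehand. ecj is used because it tolerates compilation errors better than javac, which keeps database creation from failing.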

import pathlib
import os


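def compile_cmd_file_create(save_path, ecj_path):
    # Collect every .java file under save_path into an argument file for ecj.
    with open("{}/file.txt".format(save_path), "w+") as f:
        for java_path in pathlib.Path(save_path).glob('**/*.java'):
            f.write(str(java_path) + "\n")
    ecj_absolute_path = pathlib.Path(ecj_path).resolve()
    # -8: Java 8 compliance, -warn:none: silence warnings,
    # -noExit: do not call System.exit when compilation ends.
    compile_cmd = "java -jar {} -encoding UTF-8 -8 " \
                  "-warn:none -noExit @{}/file.txt".format(ecj_absolute_path, save_path)

    # The same command line is written for Windows (run.cmd) and for Unix-like shells (run.sh).
    with open("{}/run.cmd".format(save_path), "w+") as f:
        f.write(compile_cmd)

    with open("{}/run.sh".format(save_path), "w+") as f:
        f.write(compile_cmd)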


if __name__ == '__main__':
    self_ecj_path = os.getcwd() + r"/ecj-4.6.1.jar"

    compile_cmd_file_create(os.getcwd() + r"/dubbo", self_ecj_path)

After the script runs, run.sh and run.cmd are generated in os.getcwd() + r"/dubbo". Then change into os.getcwd() + r"/dubbo" and run

codeql database create --language=java -c "bash run.sh"  <path to database>

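and a database is created quickly. Whether it matches a database created the normal way has not been verified. A database created this way, ignoring compilation errors, will lose data-flow information and with it much of its purpose: when a file's dependencies cannot be found during compilation, the trap file generated for it is incomplete.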
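
A rough first check on such a database is to compare the .java files on disk with the ones copied into the database's src.zip. The sketch below is written under two assumptions: that only files processed by the extractor end up in src.zip, and that comparing by file name is good enough (same-named files in different packages collapse into one). The function name is hypothetical. The check only shows whether a file reached the extractor; an archived file can still have an incomplete trap file because of unresolved dependencies.

```python
import pathlib
import zipfile


def extraction_coverage(source_dir, src_zip):
    """Compare .java files on disk with .java entries in the source archive.

    Returns (files_on_disk, files_also_in_archive, missing_file_names).
    """
    on_disk = {p.name for p in pathlib.Path(source_dir).glob("**/*.java")}
    with zipfile.ZipFile(src_zip) as archive:
        archived = {
            pathlib.PurePosixPath(name).name
            for name in archive.namelist()
            if name.endswith(".java")
        }
    missing = sorted(on_disk - archived)
    return len(on_disk), len(on_disk & archived), missing
```

A long list of missing names would suggest that much of the project never reached the extractor, so queries over that database would see only part of the code.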

References

1. https://help.semmle.com/lgtm-enterprise/user/help/generate-database.html
2. https://testanull.com/build-codeql-db-without-source-code
3. http://www.bjnorthway.com/1324/


This article was published by Seebug Paper. Please credit the source when reprinting. Article URL: http://www.bjnorthway.com/1921/