<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Framework for Automatic Exploit Generation for JIT Compilers</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/15/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10356719</idno>
					<idno type="doi">10.1145/3465413.3488573</idno>
					<title level='j'>Checkmate '21: Proceedings of the 2021 Research on offensive and defensive techniques in the Context of Man At The End (MATE) Attacks</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Xiyu Kang</author><author>Saumya Debray</author><author>Yonghwi Kwon</author><author>Sebastian Banescu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper proposes a framework for automatic exploit generation in JIT compilers, focusing in particular on heap corruption vulnerabilities triggered by dynamic code, ie, code generated at runtime by the JIT compiler. The purpose is to help assess the severity of vulnerabilities and thereby assist with vulnerability triage. The framework consists of two components: the first extracts high-level representations of exploitation primitives from existing exploits, and the second uses the primitives so extracted to construct exploits for new bugs. We are currently building a prototype implementation of the framework focusing on JavaScript JIT compilers. To the best of our knowledge, this is the first proposal to consider automatic exploit generation for code generated dynamically by JIT compilers.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Each primitive is extracted from an existing exploit and represented in terms of one sequence of operations (described below). The operations exchange data between several objects. Among the several objects, we choose the object whose metafield is changed as the CoreObj because the change of a metafield can give us a primitive. Each primitive has an associated ability that specifies its operational behavior, i.e., what it is able to do. We have the following 4 kinds of primitives:</p><p>&#8226; Read(addr). Ability(addrs). The primitive is able to read the value at the given address (addr). The ability for this kind of primitives is described by the addresses (addrs) it can read from. &#8226; Write(addr, value). Ability(addrs). The primitive is able to write an attacker-controlled value to an attacker-controlled address. The ability for this kind of primitives is described the same as the read primitives. We do not give a value expression because most of the time we can use the CoreObj to directly overwrite the whole field at the given address. &#8226; Ip-hijack(addr). Ability(n). The primitive is able to overwrite the instruction pointer register with an attacker-controlled address. Such primitives are created by modifying a code pointer metafield. Before the modification, the instruction pointer register uses the original code pointer. After the modification, the register uses the modified code pointer. We compare the two register values, and record the number of consecutive bytes starting from the lowest byte that are different. We use n to represent the number and say that its ability is being able to control the lowest n bytes. &#8226; Tycon(obj, DstType). Ability(SrcType, DstType). The primitive is able to convert the type of an object to another attacker-specified type DstType. The primitive's ability is being able to convert an object of SrcType to another object of DesType.</p><p>Note that this is not intended to be an exhaustive list. Moreover, our framework is not tied to any particular set of primitives.</p><p>Each primitive is represented by a sequence of operations. The operations all belong to the following three kinds:</p><p>&#8226; ReadData(obj, field). This operation reads the value at a specified field of a specified object.</p><p>&#8226; WriteData(obj, field, data). This operation writes attackerspecified data to an attacker-specified field of an attackerspecified object. &#8226; CreateObj(specification). This operation creates an object that is the same as the given specification. We extract objects from the memory. For each object, we find out its type and non-metafield values. We use such values and type as its specification because they determine which code we should use and the arguments. An example of such specifications can be found in the appendix.</p><p>Besides primitives, another important concept is exploitation plan. We building exploits by combining primitives and exploitation plans.</p><p>An exploitation plan is a mixture of real code and descriptions of wanted primitives. The descriptions look the same as our 4 kinds of primitives. addr is represented by a tuple (base, index, offset). After replacing the descriptions with real primitives, the execution of the exploitation plan will achieve the goal of the attacker such as spawning a shell. For example, an exploitation plan may look like this:</p><p>If we replace the primitive descriptions with real primitives, the exploitation plan becomes the final exploit. The execution of it will create a RWX page, inject SHELLCODE into it, and eventually run the SHELLCODE and give us a shell.</p><p>Figure <ref type="figure">1</ref> shows the conceptual structure of our framework. There are two phases: Preprocessing and Exploit Creation. The first phase is responsible for extracting primitives from existing exploits, while the second applies the extracted primitives to the exploitation of new bugs. Details of their behavior are discussed in Sections 3 and 4. The Object Database stores information about object structures and relevant methods and is shared between the two phases. Each object is specified by the type of the object and its sequence of fields; each field is represented by its type and size (in bytes).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PREPROCESSING</head><p>Preprocessing involves extracting exploitation primitives from existing exploits. The Primitive Extractor assumes that we are able to detect the objects created in the full execution trace of an exploit. We need this assumption because our approach is based on reasoning about the data flow between different objects.</p><p>By data flow, we mean that the field values and field addresses in an object flow to another object. For example, a data flow can be like this: value in field 1 of object A is copied to field 2 of object B. The data flow information is captured in our representation of primitives. The representation of each primitive is a sequence of operations. For example, there is a read operation in the sequence which reads field 1 of object A. Later, there is a write operation that writes this value to field 2 of object B. So we are able to capture this data flow in our representation: field 1 of object A -&gt; field 2 of object B. The representations of primitives are generated by algorithm 1. The two algorithms are going to be discussed in section 3.1.</p><p>Before identifying objects, we need to find out the memory areas that store objects. We use the high address shared by a group of objects to describe a memory area. For example, if the addresses of a group of objects all begin with 0x135a94, we say that the shared high address is 0x135a94 and the corresponding memory area is 0x135a94.</p><p>For v8, we are interested in two kinds of memory areas: Map area, areas that store other objects. Map is an object. But we separate it from other objects because it is special for the following reason. Map stores type information of other objects. The structure of most objects starts with a pointer to a Map object. The two facts allow us to use Map to help identify other objects in the following way. First, we find pointers to Map area. Each such pointer indicates the beginning of an object. Then we use its associated Map to determine its type and structure. With its structure, we are able to identify this whole object in the memory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Primitive Extraction</head><p>As discussed in Section 2, each primitive is represented by a sequence of operations. Algorithm 1 shows the procedure used to extract the sequence of primitives primitives(&#119890;) in T (&#119890;) given an instruction-level execution trace T (&#119890;) for an exploit &#119890;, We first extract the sequence of operations OpSeq(&#119890;) from T (&#119890;). OpSeq(&#119890;) is initially empty. We begin by gathering all memory reads and writes in T (&#119890;) and use them to analyze the data flow between objects. We scan through the reads and writes from the beginning to the end. If we find there is an object created, we record this object and append a CreateObj operation to OpSeq(&#119890;). If we find a write to an recorded object, we append a WriteData operation to OpSeq(&#119890;).</p><p>If we find a read from an recorded object, we append a ReadData operation to OpSeq(&#119890;). Eventually, OpSeq(&#119890;) stores all the creation, read, and write operations on objects in T (&#119890;).</p><p>In this algorithm, FindObjMatch looks at our objs list and returns the object that the current instruction is accessing. FindField returns the field being accessed. FindSrc returns where the value being written comes from. It either comes from a previous memory read or it is self-defined. isHeadOfObjects returns whether the current memory write is a Map pointer. Such pointers represent the beginning of objects. ObjTypeAnalysis returns the type value stored at a specific offset from the Map pointer. FindObjMemFields returns the memory values of the object's fields. isValidObj tells whether the object agrees with its structure. We then separate OpSeq(&#119890;) into different groups. Each group contains the data flow between several objects: Data inside the group does not flow out. Data outside the group does not flow in. This is the criteria on how we separate the groups. Each group is potentially a primitive. The last thing is to filter out those groups that are unlikely to be primitives. Read and write primitives usually involve the modification of bound field, for example the length field or the pointer field to an element area. Type Confusion primitives usually involve the modification of type field. Ip-hijack primitives usually involve the modification of code pointer field such as a function pointer. For the groups that do not have the above patterns, we remove them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Primitive Database</head><p>This database has 4 categories as described in Section 2: Read, Write, Ip-hijack, Type Confusion. Each primitive is stored in its corresponding category. Each primitive is represented by a 3-element</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Workshop Presentation</head><p>Checkmate '21, November 19, 2021, Virtual Event, Republic of Korea tuple: (sequence, usage, ability). We use a sequence of operations to represent extracted primitives. This has been explained in section 2 and 3. The operation indices in the sequence which are used to invoke the primitive is called its usage. Each primitive has its own ability which is described in section 2. In the beginning of section 7, we show what our Primitive Database looks like.</p><p>For each category of primitives, we sort them according to abilities (defined in section 2): powerful primitives are before other primitives. For read and write primitives, we have Ability(addrs) to describe their ability. addrs is a set of addresses that can be accessed. If the size of the set is greater than others, it is considered to be more powerful. For Ip-hijack primitives, their ability is described by Ability(n). If an Ip-hijack primitive has a bigger n than others, it is considered to be more powerful. TyCon primitives are equally powerful.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPLOIT CREATION</head><p>Exploit creation for a JIT compiler given a bug PoC involves constructing an input for the JIT compiler, namely, a source program, whose JIT compilation and execution will exploit the target vulnerability appropriately. In our case, both the bug PoC and the constructed exploit are JavaScript programs. This section discusses the modules involved in this process in our framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Memory Analyzer</head><p>The Memory Analyzer maintains the program state. A program state keeps track of values and permissions (read/write/execute) associated with registers and memory locations as well as objects that are in memory, i.e., the locations of the objects, the fields comprising the objects, and the values for those fields. The Memory Analyzer is responsible for monitoring the program state and passing the state information to the Primitive Application module and the Exploitation module. The two modules use the state information to make sure that the memory is in the desired state during the exploitation process. Algorithms 4 and 2, together with the example in the appendix show how this module is used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Bug Analyzer</head><p>JIT compilers have optimization bugs. Some of the bugs will eventually cause heap corruption bugs in the generated dynamic code. Our Bug Analyzer is to understand how we can modify a PoC file, so that the optimization bug and the heap corruption bug are still triggered.</p><p>The idea is to identify necessary features in the PoC file. As long as the features are satisfied, the heap corruption bug will be triggered. The set of features that we consider is the optimization actions taken by the JIT compiler. Given a PoC file, we randomly modify it many times and collect their optimization action sequence. For those modifications where the heap corruption bug is triggered, we possibly use the Longest Common Subsequence algorithm on their optimization action sequences. The algorithm will give us a sequence of features that are common in these modifications. In other words, the sequence of features needs to be satisfied for the heap corruption bug to be triggered.</p><p>Another job of Bug Analyzer is analyzing the ability of a bug. Since we are dealing with heap corruption bugs, we are interested in which values can be written to which addresses. We are not clear about how this will work. But we need to decide which and how the values in a PoC determine the addresses being corrupted and the data used in the corruption.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Primitive Application</head><p>4.3.1 Assumptions. The Primitive Application module assumes that if an object A is allocated before another object B in time, A is before B in the direction of heap growth. This allows us to easily overwrite B with A, which saves the trouble of heap manipulation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Applied</head><p>Primitives. This module takes an operation sequence from the Primitive Database and applies it to new bugs. The applied primitives are added to the Applied Primitives database. This database has 4 categories: Read, Write, Ip-hijack, Type Confusion. Each primitive is stored in one category. We record each primitive's usage and ability. Usage tells us what JavaScript code we should use to call the primitive. For each category of primitives, we sort them according to abilities: powerful primitives are before other primitives. This has been discussed in section 3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3">Primitive Construction.</head><p>We use algorithm 2 to apply the primitives in our Primitive Database to new bugs. The pre-and post-condition checks make sure each operation is done correctly. For example, we have a write operation that writes value 0x1 to field A. The post condition is that field A should have value 0x1. The post condition then becomes the pre condition of the next operation. object.database.Implement is pretty straight forward. For example, an operation says that we need to create an array of 2 zeros. The function then generates code "[0, 0]" that creates the array. applied-primitives.Implement takes an operation and decides which category of primitives it demands. Then it chooses the most powerful one of that category from the Applied Primitives database. Next it asks the memory-analyzer to look at the memory and solve the parameters in the chosen primitive.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Exploitation</head><p>The Exploit Plan Database contains lots of exploitation plans. Each plan gives a different way of using primitives to reach users' goal of attack. We set some plans in the database. One example is given in section 2. Our system is able to automatically implement the plans and eventually reach a goal of attack such as spawning a shell. Besides, users can also define their exploitation plans to achieve their goal of attack.</p><p>The definition of exploitation plan is given in section 2. In order to turn a plan into a real exploit, we traverse all the descriptions of wanted primitives in this plan, and replace them with real primitives. The real primitives are chosen from the Applied Primitives database. The most powerful one is firstly chosen. The Memory Analyzer is used to solve the parameters in the primitive. If this primitive is not able to finish the job of the description, we choose the next primitive until there is no primitive to choose.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">IMPLEMENTATION STATUS</head><p>A prototype of the PREPROCESSING component has been finished. We applied the component to 4 exploits two of which involve JIT compilation. We are able to extract totally 1764 primitives which contain all the primitives actually constructed in each exploit. To be more concrete, we extracted 3 primitives from the exploit for bug <ref type="bibr">[3]</ref>, 6 from [7], 25 from [6], 1730 from <ref type="bibr">[8]</ref>. We extracted more than a thousand primitives from the last one is because that we detected so many objects and the bound field or type field of the objects is modified by v8. In our implementation, if a bound field or type field is modified, then it is potentially a primitive. That is why so many primitives were found.</p><p>The idea is to extract the data flow within different object groups. An object group contains objects that only access objects within the group. If the metafield of an object is modified, it is potentially a primitive. If the metafield is a bound field such as length field, we classify it into Read and Write primitives. If the metafield is a executable code pointer, we classify it into Ip-hijack primitives. If it is a type field, we classify it into TyCon primitives. The modified object is within an object group. We use a sequence of operations (ReadData, WriteData, CreateObj) to model the creation of the object group and the data flow within it. The sequence is also our representation for the primitive.</p><p>We are working on the second component now. The Bug Analyzer is an important module to be implemented. It analyzes the JIT compiler to understand how a PoC file can be modified so that the bug will still be triggered. In order to do so, we plan to adopt the method in section 4.2.</p><p>The Memory Analyzer is not implemented. It is expected to monitor memory values. We believe its implementation is not a problem because there are other papers doing a similar job. For example, Gollum <ref type="bibr">[14]</ref> used SHAPESHIFTER to log all live objects in the memory and record the memory values at a specific address.</p><p>The Exploitation and Primitive Application module are partly implemented. They apply primitives to bugs and translate exploitation plans into real exploits. We will continue to finish all the rest modules and evaluate them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">RELATED WORKS</head><p>There are a group of papers about automatic exploit generation (AEG) that focuses on exploiting programs that take pure data as input. They are not able to exploit huge programs that take source code as input, for example, interpreters and browsers. This is because the two kinds of programs require different format of input. Besides, these papers rely on fuzzing or symbolic execution to discover primitives and generate exploits. Fuzzing is not an ideal choice for huge programs because huge programs have huge numbers of program paths for fuzzing to explore, and fuzzing is not efficient and sometimes cannot give us useful results. Symbolic execution also has path explosion problem which makes it timeconsuming and not applicable system-wide.</p><p>In 2011, Avgerinos et al. <ref type="bibr">[2]</ref> proposed the first AEG system. They symbolize the input values and observe which memory addresses they can control. If they are able to control a return address and a buffer on the stack, they solve the symbolic formulas, inject shellcode to the buffer, and modify the return address to the buffer. They are able to exploit some programs such as iwconfig and socat.</p><p>In 2012, Mayhem <ref type="bibr">[4]</ref> was proposed. They took a similar way. First, they use taint analysis to find the path that taints the instruction pointer. Second, they use symbolic execution to analyze how they can get to the point where the instruction pointer is tainted and how they can control the instruction pointer. By solving the symbolic formulas, they are able to make the instruction pointer point to an address that contains injected shellcode. They are able to exploit programs such as iwconfig and htget.</p><p>There are other works <ref type="bibr">[13, 15&#347;17, 23, 24, 29</ref>] that also take similar ways. They rely on fuzzing or symbolic execution to exploit programs that take data as input.</p><p>On other other hand, there are a group of papers focusing on the automatic exploitation of heap allocators because once we control heap allocators, we control the programs that use the allocators. However, these papers only focus on what happens within a component of programs -the allocators. They are not able to exploit bugs that exist in programs but have nothing to do with their allocators. Moreover, they do not consider and cannot exploit dynamic code.</p><p>Dusan et al. <ref type="bibr">[20]</ref> symbolizes the overflowed values and uses symbolic execution to find the controlled instructions that can be used as primitives. HAEPG [30] symbolizes all input bytes and uses symbolic execution on function paths to search for interested instructions as primitives. HeapHopper <ref type="bibr">[9]</ref> uses symbolic execution in a similar way. They symbolize the corrupted allocator metadata and explore its influences and thus find primitives. Insu et al. <ref type="bibr">[28]</ref> defines several actions and uses fuzzing to try different combinations of the actions. They want to detect specific primitive patterns resulted from the actions.</p><p>Besides, automatic kernel exploitation is also a hot topic. FUZE <ref type="bibr">[27]</ref> uses fuzzing and symbolic execution to find exploitable states for UAF bugs. Then they use symbolic execution to determine the relationship between input and the exploitable states. KOOBE <ref type="bibr">[5]</ref> uses fuzzing to try differnet program path and uses symbolic execution to search for primitives among the influenced instructions by the bug.</p><p>Last but not least, there are papers about the automatic exploitation of interpreters and browsers as well. This is most similar to our work. However, these papers do not consider dynamic code. They focus on analyzing interpreter and browser code. Therefore, they cannot analyze and exploit code that is dynamically generated in interpreters and browsers.</p><p>In 2018, PrimGen <ref type="bibr">[11]</ref> was proposed to automatically generate primitives for browsers. They perform static analysis to find sinks after the crash point. Then they use symbolic execution on the local areas: from the crash point to the sinks. Solving the symbolic formulas will give them the values that lead to the sinks.</p><p>In 2019, Gollum <ref type="bibr">[14]</ref> corrupts different objects on the heap in a fuzzing manner and sees what primitives they can get. They didn't use symbolic execution to find primitives. Instead, each crash after the corruption of an object is considered to be a potential primitive.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION</head><p>We proposed a framework for automatic exploit generation in JIT compilers, focusing in particular on heap corruption vulnerabilities triggered by dynamic code. The framework contains two components: PREPROCESSING, EXPLOIT CREATION FROM BUG POC. The first component models primitives into sequences of object operations. With this modelling, this component is able to extract totally 1764 primitives from 4 exploits. Two of the exploits involve JIT compilation. The second component applies the extracted primitives to new bugs and thereby generates exploits. We are working on the second component, especially the Bug Analyzer which is the main module in the second component.</p></div></body>
		</text>
</TEI>
