Keynote 4: Formal semantics extraction from manuals of instruction sets

    by Professor Mizuhito Ogawa, Japan Advanced Institute of Science and Technology (JAIST).

    Malware includes many obfuscation techniques and the most powerful tool for de-obfuscation is believed to be a symbolic execution. For developing such tools, we first need to clarify the formal semantics of binary instructions. However, the number of such instructions are more than thousands. For instance, x86 (ech for 32 bits and 64 bits) has about 1500-2000 instructions, and ARM has 30-40 chip set variations in which each has 100-200 instructions. Hence, if everything is done manually, huge human effort is required. Instead, we aim to automatically extract from their manuals. We will overview how Natural Language Processing (NLP) techniques are applied for this purpose by combining software enginerring methods.

    Short bio:

    Mizuhito Ogawa graduated the master course of University of Tokyo, majoring Mathematics. He worked NTT laboratories on functional programs, dataflow machine, and dataflow analysis till 2001. Then, he was JST PRESTO fellow until 2003, and stayed at University of Tokyo as a visiting researcher. Since 2003 till now, he has been in JAIST as a professor in Schools of Information Science. Since 2020, he also joined Interpretable AI center in JAIST. His research interest covers from theory to practical tool implementation. For instance, the confluence of rewriting systems, the computational content of classical proofs, and the decidability of the reachability of infinite state transition systems are such examples in theory. Tool implementations are mostly performed under collaboration with Vietnamese universities and students. Such examples are, (1) SMT solver, raSAT, on polynomial constraints over reals,  which was ranked second in QFNRA category of SMT-COMP 2016 and 2017, (2) dynamic symbolic execution for binary code, e.g., x86-32, ARM Cortex-A/M (32 bits), MIPS (32bits), targeting on malware. The tool implementation are BE-PUM (x86), Corana (ARM), and SyMIPS (MIPS). Currently, 64bits extensions are under collaboration with LQDTU, especially based on formal semantics extraction from instruction manuals in English.