Java Bytecode Normalization for Code Similarity Analysis
Analyzing the similarity of two code fragments has many applications, including code clone, vulnerability and plagiarism detection. Most existing approaches for similarity analysis work on source code. However, in scenarios like plagiarism detection, copyright violation detection or Software Bill of Materials creation source code is often not available and thus similarity analysis has to be performed on binary formats. Java bytecode is a binary format executable by the Java Virtual Machine and obtained from the compilation of Java source code. Performing similarity detection on bytecode is challenging because different compilers can compile the same source code to syntactically vastly different bytecode. In this work we assess to what extent one can nonetheless enable similarity detection by bytecode normalization, a procedure to transform Java bytecode into a representation that is identical for the same original source code, irrespective of the Java compiler and Java version used during compilation. Our manual study revealed 16 classes of compilation differences that various compilation environments may induce. Based on these findings, we implemented bytecode normalization in a tool jNorm. It uses Jimple as intermediate representation, applies common code optimizations and transforms all classes of compilation difference to a normalized form, thus achieving a representation of the bytecode that is identical despite different compilation environments. Our evaluation, performed on more than 300 popular Java projects, shows that solely the act of incrementing a compiler version may cause differences in 46% of all resulting bytecode files. By applying bytecode normalization, one can remove more than 99% of these differences, thus acting as a crucial enabler for subsequent applications of bytecode similarity analysis.
Wed 18 SepDisplayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna change
10:30 - 12:00 | Synthesis and verificationTechnical Papers at EI 2 Pichelmayer Chair(s): Peter Thiemann University of Freiburg, Germany | ||
10:30 15mTalk | Inductive Predicate Synthesis Modulo Programs Technical Papers Scott Wesley Dalhousie University, Maria Christakis TU Wien, Jorge A. Navas Certora, Richard Trefler University of Waterloo, Valentin Wüstholz ConsenSys, Arie Gurfinkel University of Waterloo | ||
10:45 15mTalk | Fearless Asynchronous Communications with Timed Multiparty Session Protocols Technical Papers Ping Hou University of Oxford, Nicolas Lagaillardie Imperial College London, Nobuko Yoshida University of Oxford | ||
11:00 15mTalk | Java Bytecode Normalization for Code Similarity Analysis Technical Papers Stefan Schott Heinz Nixdorf Institut, Paderborn University, Serena Elisa Ponta SAP Security Research, Wolfram Fischer SAP Security Research, Jonas Klauke Heinz Nixdorf Institut, Paderborn University, Eric Bodden Heinz Nixdorf Institut, Paderborn University and Fraunhofer IEM | ||
11:30 15mTalk | Higher-Order Specifications for Deductive Synthesis of Programs with Pointers Technical Papers David Young University of Kansas, USA, Ziyi Yang National University of Singapore, Ilya Sergey National University of Singapore, Alex Potanin Australian National University | ||
11:45 15mTalk | Matching Plans for Frame Inference in Compositional Reasoning Technical Papers Andreas Lööw Imperial College London, Daniele Nantes-Sobrinho Imperial College London, Sacha-Élie Ayoun Imperial College London, Petar Maksimović Imperial College London, UK, Philippa Gardner Imperial College London |