In this session, we will present real-life applications of compiler techniques that help kaChing achieve ultra confidence and power its incredible 5-minute commit-to-production cycle [1]. We'll talk about idempotency analysis [2], dependency detection, on-the-fly optimisations, automatic memoization [3], type unification [4] and more! This talk is not suitable for the faint-hearted... If you want to dive deep, learn about advanced JVM topics, devour bytecode and see first-hand applications of theoretical computer science, join us.
[1] http://eng.kaching.com/2010/05/deployment-infrastructure-for.html
[2] http://en.wikipedia.org/wiki/Idempotence
[3] http://en.wikipedia.org/wiki/Memoization
[4] http://eng.kaching.com/2009/10/unifying-type-parameters-in-java.html
2. Engineering at kaChing: TDD from day one. Full regression suite runs in less than 3 minutes. Deploy to production 30+ times a day. People have written and launched new features during the interview process.
4. Seriously… (Java focused): Software Analysis; Anatomy of a compiler; Creating meta tests; Leveraging Types; Levels of interpretation; Descriptors and signatures; DRY your code (fewer bugs, greater reach for experts, higher testability).
5. Software Analysis: Run a series of analyses on the code base. Catch common mistakes due to distracted developers, new hires or bad APIs.
6. Anatomy of a Compiler: Source Code -> Lexical Analysis -> Tokens -> Syntactic Analysis -> Abstract Syntax Tree -> Semantic Analysis -> Annotated Abstract Syntax Tree -> Intermediate Representation Generation -> Intermediate Representation -> Optimization -> Optimized Intermediate Representation -> Machine Code Generation -> Target Code
7. int x = 1; int y = x + 2; Lexical Analysis: IDENT(int) IDENT(x) ASGN NUMBER(1) SEMICOLON IDENT(int) IDENT(y) ASGN IDENT(x) PLUS NUMBER(2) SEMICOLON
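The token stream on this slide can be reproduced with a very small hand-written lexer. The following is a minimal sketch, not the talk's implementation: the class name `TinyLexer` is hypothetical, and it assumes the input is `int x = 1; int y = x + 2;` with `int` treated as a plain identifier, as the slide does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lexer for the slide's snippet; keywords are not distinguished from identifiers. */
final class TinyLexer {
  static List<String> tokenize(String src) {
    List<String> tokens = new ArrayList<>();
    int i = 0;
    while (i < src.length()) {
      char c = src.charAt(i);
      if (Character.isWhitespace(c)) {
        i++; // whitespace separates tokens and is then discarded
      } else if (Character.isLetter(c)) {
        int start = i;
        while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
        tokens.add("IDENT(" + src.substring(start, i) + ")");
      } else if (Character.isDigit(c)) {
        int start = i;
        while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
        tokens.add("NUMBER(" + src.substring(start, i) + ")");
      } else if (c == '=') {
        tokens.add("ASGN");
        i++;
      } else if (c == '+') {
        tokens.add("PLUS");
        i++;
      } else if (c == ';') {
        tokens.add("SEMICOLON");
        i++;
      } else {
        throw new IllegalArgumentException("Unexpected character '" + c + "' at offset " + i);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ", tokenize("int x = 1; int y = x + 2;")));
  }
}
```

Note that the whitespace is gone from the output, which is why analyses that care about layout have to run before this stage.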
11. Optimizations: Simple optimizations can be done on the Abstract Syntax Tree. Other optimizations require specialized representations: Static Single Assignment form, Control Flow Graph.
20. Finding bad code snippets: Describe bad code snippets using regular expressions. The analysis is done on the source code, before lexical analysis, because information such as whitespace is lost afterwards. Extremely easy to implement.
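A regular-expression check of this kind fits in a few lines. The sketch below is illustrative only: the class name `BadSnippetFinder` and the empty-catch-block rule are assumptions, not rules taken from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical source-level check: reports the line numbers where a "bad" pattern occurs. */
final class BadSnippetFinder {
  // Assumed example rule: a catch block whose body contains nothing but whitespace.
  static final Pattern EMPTY_CATCH = Pattern.compile("catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}");

  static List<Integer> findLines(Pattern bad, String source) {
    List<Integer> lines = new ArrayList<>();
    Matcher m = bad.matcher(source);
    while (m.find()) {
      lines.add(lineOf(source, m.start()));
    }
    return lines;
  }

  /** 1-based line number of the given character offset. */
  private static int lineOf(String source, int offset) {
    int line = 1;
    for (int i = 0; i < offset; i++) {
      if (source.charAt(i) == '\n') line++;
    }
    return line;
  }

  public static void main(String[] args) {
    String source = "try {\n  run();\n} catch (Exception e) {\n}\n";
    System.out.println(findLines(EMPTY_CATCH, source));
  }
}
```

Running such a check as a unit test over every source file is one way to turn it into a meta test; the trade-off is that regular expressions know nothing about syntax, so rules have to stay simple.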
29. void doStuff(BigDecimal a, BigDecimal b) { boolean c = a.equals(b); } Code: 0: aload_1 1: aload_2 2: invokevirtual java/math/BigDecimal.equals(Ljava/lang/Object;)Z 5: istore_3 6: return
30. ClassFile { u4 magic; u2 minor_version; u2 major_version; u2 constant_pool_count; cp_info constant_pool[constant_pool_count-1]; u2 access_flags; u2 this_class; u2 super_class; u2 interfaces_count; u2 interfaces[interfaces_count]; u2 fields_count; field_info fields[fields_count]; u2 methods_count; method_info methods[methods_count]; u2 attributes_count; attribute_info attributes[attributes_count]; }
31. ClassFile { u4 magic; u2 minor_version; u2 major_version; u2 constant_pool_count; cp_info constant_pool[constant_pool_count-1]; u2 access_flags; u2 this_class; u2 super_class; u2 interfaces_count; u2 interfaces[interfaces_count]; u2 fields_count; field_info fields[fields_count]; u2 methods_count; method_info methods[methods_count]; u2 attributes_count; attribute_info attributes[attributes_count]; } const #18 = Asciz java/math/BigDecimal; const #19 = Asciz equals; const #20 = Asciz (Ljava/lang/Object;)Z;
32. ClassFile { u4 magic; u2 minor_version; u2 major_version; u2 constant_pool_count; cp_info constant_pool[constant_pool_count-1]; u2 access_flags; u2 this_class; u2 super_class; u2 interfaces_count; u2 interfaces[interfaces_count]; u2 fields_count; field_info fields[fields_count]; u2 methods_count; method_info methods[methods_count]; u2 attributes_count; attribute_info attributes[attributes_count]; } method_info { u2 access_flags; u2 name_index; u2 descriptor_index; u2 attributes_count; attribute_info attributes[attributes_count]; }
33. method_info { u2 access_flags; u2 name_index; u2 descriptor_index; u2 attributes_count; attribute_info attributes[attributes_count]; } Code_attribute { u2 attribute_name_index; u4 attribute_length; u2 max_stack; u2 max_locals; u4 code_length; u1 code[code_length]; u2 exception_table_length; { u2 start_pc; u2 end_pc; u2 handler_pc; u2 catch_type; } exception_table[exception_table_length]; u2 attributes_count; attribute_info attributes[attributes_count]; }
34. Code_attribute { u2 attribute_name_index; u4 attribute_length; u2 max_stack; u2 max_locals; u4 code_length; u1 code[code_length]; u2 exception_table_length; { u2 start_pc; u2 end_pc; u2 handler_pc; u2 catch_type; } exception_table[exception_table_length]; u2 attributes_count; attribute_info attributes[attributes_count]; } 0: aload_1 1: aload_2 2: invokevirtual #2; 5: istore_3 6: return
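The fixed-size fields at the top of the ClassFile structure shown on the slides above can be read directly with a `DataInputStream`, since class files are big-endian. This is a minimal sketch under stated assumptions: `ClassFileHeader` is a hypothetical helper, not something from the talk, and it stops before the constant pool, whose entries are variable-length and need a real parser.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper holding the first four fields of the ClassFile structure. */
final class ClassFileHeader {
  final int magic;             // u4, 0xCAFEBABE for a valid class file
  final int minorVersion;      // u2
  final int majorVersion;      // u2, e.g. 49 for Java 5 and 52 for Java 8
  final int constantPoolCount; // u2, one more than the number of constant pool slots

  private ClassFileHeader(int magic, int minorVersion, int majorVersion, int constantPoolCount) {
    this.magic = magic;
    this.minorVersion = minorVersion;
    this.majorVersion = majorVersion;
    this.constantPoolCount = constantPoolCount;
  }

  /** Reads the header fields in the order the ClassFile structure declares them. */
  static ClassFileHeader read(InputStream in) {
    try {
      DataInputStream data = new DataInputStream(in);
      int magic = data.readInt();
      int minor = data.readUnsignedShort();
      int major = data.readUnsignedShort();
      int constantPoolCount = data.readUnsignedShort();
      return new ClassFileHeader(magic, minor, major, constantPoolCount);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    // Reads this class's own compiled bytes, assuming they are reachable as a resource.
    try (InputStream in = ClassFileHeader.class.getResourceAsStream("ClassFileHeader.class")) {
      if (in == null) {
        System.out.println("class file not found as a resource");
        return;
      }
      ClassFileHeader h = read(in);
      System.out.printf("magic=%08X major=%d constants=%d%n", h.magic, h.majorVersion, h.constantPoolCount);
    }
  }
}
```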
41. Two Passes: (1) Find all classes, fields and methods annotated with the specified annotations. (2) Find all instructions referring to these classes, fields and methods.
46. Java 5+ Type System: Primitives, Objects, Generics, Wildcards, Intersection types
47. Erasure: Object eraser() { return new ArrayList<String>(); } Object obj = eraser(); // impossible to recover that obj is a list of strings
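Erasure can be observed through reflection. The sketch below contrasts the slide's `eraser()` with an anonymous subclass, the trick behind the super type tokens listed in the references; the class name `ErasureDemo` and the helper `recoverElementType` are hypothetical names chosen for illustration.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;

final class ErasureDemo {
  /** Same as the slide: the runtime object is a plain ArrayList. */
  static Object eraser() {
    return new ArrayList<String>();
  }

  /** The anonymous subclass records "extends ArrayList<String>" in its own class file. */
  static Object tokenStyle() {
    return new ArrayList<String>() {};
  }

  /** Returns the first type argument of the object's generic superclass if it is a concrete class. */
  static Class<?> recoverElementType(Object o) {
    Type superclass = o.getClass().getGenericSuperclass();
    if (superclass instanceof ParameterizedType) {
      Type arg = ((ParameterizedType) superclass).getActualTypeArguments()[0];
      if (arg instanceof Class) {
        return (Class<?>) arg;
      }
    }
    return null; // only a type variable such as E is visible, or no generics at all
  }

  public static void main(String[] args) {
    System.out.println(recoverElementType(eraser()));
    System.out.println(recoverElementType(tokenStyle()));
  }
}
```

For `eraser()` the generic superclass is `AbstractList<E>`, so only the type variable is visible and the helper returns null. For the anonymous subclass the declaration itself carries `String`, which is the kind of class-file type information the following slides rely on.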
48. Compiled classes $ cat MustBeSerializable.java import java.io.Serializable; interface MustBeSerializable<T extends Serializable> {} $ cat ExtendsMustBeSerializable.java class Value {} class ExtendsMustBeSerializable implements MustBeSerializable<Value> {}
49. Compiled classes $ javac MustBeSerializable.java $ rm MustBeSerializable.java $ ls MustBeSerializable.class ExtendsMustBeSerializable.java $ javac ExtendsMustBeSerializable.java -cp . ExtendsMustBeSerializable.java:2: type parameter Value is not within its bound class ExtendsMustBeSerializable implements MustBeSerializable<Value> {} ^ 1 error
50. Compiled classes: The compiler must write type information into the class file to preserve proper semantics. When compiling other classes, it needs to read that type information and check against those contracts.
51. Taking a peek at classes $ javap -v MustBeSerializable | grep -A 1 'Signature;' const #3 = Asciz Signature; const #4 = Asciz <T::Ljava/io/Serializable;>Ljava/lang/Object;;
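The same information that `javap` prints from the Signature attribute is available at runtime through the generic reflection API. A minimal sketch, assuming a hypothetical helper class `SignaturePeek` and redeclaring the slide's interface so the example is self-contained:

```java
import java.io.Serializable;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.ArrayList;
import java.util.List;

interface MustBeSerializable<T extends Serializable> {}

final class SignaturePeek {
  /** Lists each declared type parameter with each of its bounds, e.g. "T extends java.io.Serializable". */
  static List<String> describeTypeParameters(Class<?> c) {
    List<String> result = new ArrayList<>();
    for (TypeVariable<?> variable : c.getTypeParameters()) {
      for (Type bound : variable.getBounds()) {
        result.add(variable.getName() + " extends " + bound.getTypeName());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(describeTypeParameters(MustBeSerializable.class));
  }
}
```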
53. Signatures: Primitives: B for byte, C for char, D for double, … Objects: Lclassname; such as Ljava/lang/String; Arrays: [B for byte[], [[D for double[][] Void: V … 8 pages of documentation
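The basic encoding rules listed on this slide are enough to build descriptors for non-generic types. The following sketch uses a hypothetical `Descriptors` class; it covers primitives, objects, arrays and void, and deliberately ignores the generic signature grammar that fills the rest of the documentation.

```java
/** Hypothetical helper that builds JVM descriptors from java.lang.Class objects. */
final class Descriptors {
  static String of(Class<?> c) {
    if (c.isArray()) return "[" + of(c.getComponentType()); // one '[' per array dimension
    if (c == void.class) return "V";
    if (c == boolean.class) return "Z";
    if (c == byte.class) return "B";
    if (c == char.class) return "C";
    if (c == short.class) return "S";
    if (c == int.class) return "I";
    if (c == long.class) return "J";
    if (c == float.class) return "F";
    if (c == double.class) return "D";
    return "L" + c.getName().replace('.', '/') + ";"; // internal name uses '/' instead of '.'
  }

  /** Method descriptor: parameter descriptors in parentheses, followed by the return descriptor. */
  static String ofMethod(Class<?> returnType, Class<?>... parameterTypes) {
    StringBuilder sb = new StringBuilder("(");
    for (Class<?> parameter : parameterTypes) {
      sb.append(of(parameter));
    }
    return sb.append(')').append(of(returnType)).toString();
  }

  public static void main(String[] args) {
    // The descriptor of BigDecimal.equals(Object), as seen in the constant pool earlier.
    System.out.println(ofMethod(boolean.class, Object.class));
  }
}
```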
61. Unification: But if we have MyClass extends AbstractCallable<String> { … AbstractCallable<T> implements Callable<T> { … Unification.getActualTypeArgument(MyClass.class, Callable.class, 0);
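One way such a lookup could work is to walk the generic superclass and interface declarations while carrying the bindings of type variables along. This is a sketch under stated assumptions, not kaChing's `Unification` class: `UnificationSketch` is a hypothetical name, only the method name and argument order are borrowed from the slide, and nested arguments such as `Map<K, V>` are returned without substituting their inner variables.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

abstract class AbstractCallable<T> implements Callable<T> {}

class MyClass extends AbstractCallable<String> {
  @Override public String call() { return "ok"; }
}

final class UnificationSketch {
  /** Type bound to parameter `index` of `target` as seen from `start`, or null if unknown. */
  static Type getActualTypeArgument(Class<?> start, Class<?> target, int index) {
    return search(start, target, index, new HashMap<TypeVariable<?>, Type>());
  }

  private static Type search(Type type, Class<?> target, int index, Map<TypeVariable<?>, Type> bindings) {
    Class<?> raw;
    if (type instanceof ParameterizedType) {
      ParameterizedType parameterized = (ParameterizedType) type;
      raw = (Class<?>) parameterized.getRawType();
      TypeVariable<?>[] variables = raw.getTypeParameters();
      Type[] arguments = parameterized.getActualTypeArguments();
      Map<TypeVariable<?>, Type> extended = new HashMap<>(bindings);
      for (int i = 0; i < variables.length; i++) {
        // Bind the declared variable to the argument, resolving it through earlier bindings.
        extended.put(variables[i], resolve(arguments[i], bindings));
      }
      bindings = extended;
    } else if (type instanceof Class) {
      raw = (Class<?>) type;
    } else {
      return null; // null superclass, or a kind of type this sketch does not handle
    }
    if (raw == target) {
      return bindings.get(target.getTypeParameters()[index]);
    }
    Type found = search(raw.getGenericSuperclass(), target, index, bindings);
    if (found != null) return found;
    for (Type implemented : raw.getGenericInterfaces()) {
      found = search(implemented, target, index, bindings);
      if (found != null) return found;
    }
    return null;
  }

  private static Type resolve(Type type, Map<TypeVariable<?>, Type> bindings) {
    return (type instanceof TypeVariable && bindings.containsKey(type)) ? bindings.get(type) : type;
  }

  public static void main(String[] args) {
    System.out.println(getActualTypeArgument(MyClass.class, Callable.class, 0));
  }
}
```

For the slide's example the walk goes MyClass, then `AbstractCallable<String>` (binding T to String), then `Callable<T>` (binding Callable's own parameter to String), so the lookup yields `String.class`.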
62. Unification – Want to Try? class MergeOfIntegerAndString extends Merge<Integer, String> {} class Merge<K, V> implements OneTypeParam<Map<K, V>> {} interface OneTypeParam<T>
68. Just-In-Time Providers: Pattern matching on types. Marshaller<?> is a pattern for Marshaller<User>, Marshaller<Portfolio>, … Can be arbitrarily complex, including wildcards, intersection types etc. http://github.com/pascallouisperez/guice-jit-providers
70. References: JVM spec http://www.amazon.com/Java-Virtual-Machine-Specification-2nd/dp/0201432943 Class File spec http://java.sun.com/docs/books/jvms/second_edition/ClassFileFormat-Java5.pdf Super Type Tokens http://gafter.blogspot.com/2006/12/super-type-tokens.html Unifying Type Parameters in Java http://eng.kaching.com/2009/10/unifying-type-parameters-in-java.html Type Safe Bit Fields Using Higher Kinded Types http://eng.kaching.com/2010/08/type-safe-bit-fields-using-higher.html
Editor's notes
A token is a string of characters, categorized according to the rules as a symbol.
Abstract interpretation
In compiler design, static single assignment form (often abbreviated as SSA form or simply SSA) is an intermediate representation (IR) in which every variable is assigned exactly once. Existing variables in the original IR are split into versions, new variables typically indicated by the original name with a subscript, so that every definition gets its own version.
In functional language compilers, such as those for Scheme, ML and Haskell, continuation-passing style (CPS) is generally used where one might expect to find SSA in a compiler for Fortran or C. SSA is formally equivalent to a well-behaved subset of CPS.
Internal name of the method’s owner class, method’s name and method’s descriptor.