© 2017 Arm Limited
SFO17-314 Optimizing Golang for High Performance with ARM64 Assembly
Wei Xiao
Staff Software Engineer
Wei.Xiao@arm.com
September 27, 2017
Linaro Connect SFO17
Agenda
• Introduction
• Differences from GNU Assembly
• Integrate assembly into Golang
• Optimize CRC32 for arm64
• Optimize SHA256 for arm64
• Optimize IndexByte for arm64
• Work Summary and Next steps
Introduction
• Assembly optimization benefits
• Take advantage of ARMv8 capabilities
– Hardware-specific instructions (such as SVC, AES, SHA, etc.)
– Vector (Single Instruction Multiple Data) Instructions
• Others
– No need for CGo dependency
– Avoid runtime context switching overhead
– Optimized code (vs Go compiler)
– Faster compilation
Assembly Optimization Current Status
• Go Standard packages with assembly optimization
crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5
crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512
hash/crc32 math math/big reflect
runtime runtime/cgo runtime/internal/atomic runtime/internal/sys
strings sync/atomic syscall …
red – arm64 optimization ongoing
black – no arm64 optimization
Assembly Terminology
• Mnemonic
• CALL, MOVW, MOVD, …
• Register
• R1, F0, V3, …
• Immediate
• $1, $0x100, …
• Memory
• (R1), 8(R3), …
Registers in AArch64
Instruction Differences from GNU Assembly
• Semi-abstract instruction set (Plan 9 from Bell Labs)
• Architecture independent mnemonics like MOVD
• Some architecture aspects shine through
• Assembler may insert prologues, remove ‘unreachable’ instructions
• Instructions may be expanded by the assembler
• Not all instructions available
• BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly
// func Add(a, b int) int
TEXT ·Add(SB),$0-24
	MOVD a+0(FP), R0     // first argument
	MOVD b+8(FP), R1     // second argument
	ADD R1, R0, R0       // R0 = R0 + R1
	MOVD R0, ret+16(FP)  // store the result
	RET
Operand Differences from GNU Assembly
• Data flow from left to right
• ADD R1, R2 → R2 += R1
• SUBW R12<<29, R7, R8 → R8 = R7 - (R12<<29)
• Memory operands: base + offset
• MOVH (R1), R2 → R2 = *R1
• MOVBU 8(R3), R4 → R4 = *(8 + R3)
• MOVD mypackage·myvar(SB), R8 → R8 = myvar
• Addresses
• MOVD $8(R1), R3 → R3 = R1 + 8
• MOVD $·myvar(SB), R4 → R4 = &myvar
• In these examples myvar is declared as:
package mypackage
var myvar int64
• The middle dot (·) in symbol names is Unicode U+00B7
Go Assembly Extension for arm64
• Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd
• Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T>
• Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd
• Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>]
• Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>]
• Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go
• Full details
• https://go-review.googlesource.com/c/go/+/41654
Assembly Build Rule
• Toolchain will select appropriate assembly files according to GOOS+GOARCH
• Using file extensions, e.g.
• sys_linux_arm64.s
• sys_darwin_arm64.s
• Example: assembly files for: hash/crc32
• crc32_amd64p32.s
• crc32_amd64.s
• crc32_arm64.s
• crc32_ppc64le.s crc32_table_ppc64le.s
• crc32_s390x.s
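The file-name rule above can be sketched in Go. This is a simplified model, not the toolchain's code: `matchFile`, `knownOS` and `knownArch` are hypothetical names, the OS and architecture lists are small illustrative subsets, and the real logic (in `go/build`) also handles build tags and other cases.

```go
package main

import (
	"fmt"
	"strings"
)

// knownOS and knownArch are illustrative subsets, not the toolchain's full lists.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{
	"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true,
}

// matchFile reports whether the _GOOS/_GOARCH suffix of an assembly file name
// allows the file to be built for the given goos/goarch (simplified model).
func matchFile(name, goos, goarch string) bool {
	parts := strings.Split(strings.TrimSuffix(name, ".s"), "_")
	n := len(parts)
	// name_GOOS_GOARCH.s
	if n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]] {
		return parts[n-2] == goos && parts[n-1] == goarch
	}
	// name_GOARCH.s
	if n >= 2 && knownArch[parts[n-1]] {
		return parts[n-1] == goarch
	}
	// name_GOOS.s
	if n >= 2 && knownOS[parts[n-1]] {
		return parts[n-1] == goos
	}
	// No recognised suffix: the file is not constrained by its name.
	return true
}

func main() {
	files := []string{
		"sys_linux_arm64.s", "sys_darwin_arm64.s",
		"crc32_amd64.s", "crc32_arm64.s", "crc32_table_ppc64le.s",
	}
	for _, f := range files {
		fmt.Printf("%-24s linux/arm64: %v\n", f, matchFile(f, "linux", "arm64"))
	}
}
```

Under this model only `sys_linux_arm64.s` and `crc32_arm64.s` are selected for a linux/arm64 build.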
Prototype
• Function call is the bridge between Go and assembly
• Function declaration
• src/runtime/timestub.go
• func walltime() (sec int64, nsec int32)
• Function assembly implementation
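• src/runtime/sys_linux_arm64.s
• Anatomy of the TEXT directive, in generic form: TEXT package·function(SB), flag, $framesize-argsize
• package (optional), followed by the middle dot
• function name
• flag (optional)
• stack frame size
• arguments size (optional)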
Pseudo-registers
• FP: Frame Pointer
• Points to the bottom of the argument list
• Offsets are positive
• Offsets must include a name, e.g. arg+0(FP)
• SP: Stack Pointer
• Points to the top of the space allocated for local variables
• Offsets are negative
• Offsets must include a name, e.g. ptr-8(SP)
• SB: Static Base
• Named offsets from a global base
[Figure: two stack diagrams for the FP and SP pseudo-registers, each drawn from low address to high address]
Calling Convention
• All arguments are passed on the stack
• Offsets from FP
• Return arguments follow input arguments
• Start of return arguments aligned to pointer size
• All registers are caller saved, except:
• Stack pointer register (RSP)
• G context pointer register (R28)
• Frame pointer (R29)
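As a worked example of these rules, the sketch below computes FP-relative offsets and the argument size that appears after the dash in a TEXT directive. The `field` type and `layout` helper are hypothetical names for illustration only, and the sizes assume a 64-bit target. It is applied to `Add(a, b int) int` from the earlier slide ($0-24) and to `castagnoliUpdate(crc uint32, p []byte) uint32` from the CRC32 slide ($0-36).

```go
package main

import "fmt"

// field is one argument or result slot. Sizes and alignments are for a
// 64-bit target such as arm64 (illustrative type, not a toolchain API).
type field struct {
	name  string
	size  int
	align int
}

// alignUp rounds off up to the next multiple of align.
func alignUp(off, align int) int {
	return (off + align - 1) / align * align
}

// layout assigns FP-relative offsets to the arguments, then to the results,
// and returns the offsets together with the total argument-area size.
func layout(args, results []field) (map[string]int, int) {
	offs := make(map[string]int)
	off := 0
	for _, f := range args {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	off = alignUp(off, 8) // results start on a pointer-size boundary
	for _, f := range results {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	return offs, off
}

func main() {
	// func Add(a, b int) int
	_, addSize := layout(
		[]field{{"a", 8, 8}, {"b", 8, 8}},
		[]field{{"ret", 8, 8}},
	)
	fmt.Println("Add argument size:", addSize)

	// func castagnoliUpdate(crc uint32, p []byte) uint32
	// A slice occupies three words: base pointer, length, capacity.
	offs, crcSize := layout(
		[]field{{"crc", 4, 4}, {"p_base", 8, 8}, {"p_len", 8, 8}, {"p_cap", 8, 8}},
		[]field{{"ret", 4, 4}},
	)
	fmt.Println("castagnoliUpdate argument size:", crcSize)
	fmt.Println("p_len offset:", offs["p_len"], "ret offset:", offs["ret"])
}
```

The computed sizes agree with the `$0-24` and `$0-36` annotations in the slides, and the `ret` offset agrees with `ret+32(FP)`.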
arm64 Stack Frame
[Figure: arm64 stack frame layout without and with a frame pointer, drawn from low address to high address]
Optimize CRC32 for arm64 – Before
• Pure Go table-driven implementation
src/hash/crc32/crc32_generic.go
func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
	crc = ^crc
	for _, v := range p {
		crc = tab[byte(crc)^v] ^ (crc >> 8)
	}
	return ^crc
}
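To try the table-driven loop outside the standard library, a minimal sketch follows. It copies the loop into a standalone program, builds a Castagnoli table with `crc32.MakeTable`, and compares the result with `crc32.Checksum`. The package-level `castagnoli` variable and the test string are choices made for this example.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC-32C lookup table built by the standard library.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// simpleUpdate mirrors the table-driven loop: one table lookup per input byte.
func simpleUpdate(crc uint32, tab *crc32.Table, p []byte) uint32 {
	crc = ^crc
	for _, v := range p {
		crc = tab[byte(crc)^v] ^ (crc >> 8)
	}
	return ^crc
}

func main() {
	data := []byte("123456789")
	fmt.Printf("simpleUpdate:   %08x\n", simpleUpdate(0, castagnoli, data))
	fmt.Printf("crc32.Checksum: %08x\n", crc32.Checksum(data, castagnoli))
}
```

Both lines should print the same value, since the library computes the same checksum by whatever implementation it selects for the platform.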
Optimize CRC32 for arm64 – After
• Assembly for arm64
src/hash/crc32/crc32_arm64.s
// func castagnoliUpdate(crc uint32, p []byte) uint32
TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
	MOVWU crc+0(FP), R9     // CRC value
	MOVD p+8(FP), R13       // data pointer
	MOVD p_len+16(FP), R11  // len(p)

	CMP $8, R11
	BLT less_than_8

update:
	MOVD.P 8(R13), R10
	CRC32CX R10, R9
	SUB $8, R11

	CMP $8, R11
	BLT less_than_8

	JMP update
…
done:
	MOVWU R9, ret+32(FP)
	RET
[Figure: argument frame layout for castagnoliUpdate: crc at 0(FP), p.base at 8(FP), p.len at 16(FP), p.cap, ret at 32(FP)]
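The shape of the assembly loop can be modelled in plain Go. In the sketch below, `crc32cx` is a software stand-in for the CRC32CX instruction (eight bytes per step, least significant byte first), and all function names are hypothetical. Two assumptions are made because the slides do not show them: the tail shorter than eight bytes is handled one byte at a time, and the caller inverts the CRC before and after the call.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// crc32cByte folds a single byte into crc using the lookup table.
func crc32cByte(crc uint32, b byte) uint32 {
	return castagnoli[byte(crc)^b] ^ (crc >> 8)
}

// crc32cx is a software model of CRC32CX: it folds the eight bytes of v
// into crc, starting with the least significant byte.
func crc32cx(crc uint32, v uint64) uint32 {
	for i := uint(0); i < 8; i++ {
		crc = crc32cByte(crc, byte(v>>(8*i)))
	}
	return crc
}

// castagnoliUpdateModel follows the structure of the assembly: an 8-byte
// main loop, then a tail. The byte-wise tail is an assumption of this sketch.
func castagnoliUpdateModel(crc uint32, p []byte) uint32 {
	for len(p) >= 8 {
		crc = crc32cx(crc, binary.LittleEndian.Uint64(p)) // MOVD.P + CRC32CX
		p = p[8:]                                         // SUB $8, R11
	}
	for _, b := range p {
		crc = crc32cByte(crc, b)
	}
	return crc
}

func main() {
	data := []byte("123456789")
	// Assumed wrapper behaviour: invert before and after the update.
	got := ^castagnoliUpdateModel(^uint32(0), data)
	fmt.Printf("model:   %08x\n", got)
	fmt.Printf("library: %08x\n", crc32.Checksum(data, castagnoli))
}
```

The model does the same arithmetic as the table-driven version; the speedup in the real code comes from the hardware instruction consuming eight bytes at once.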
Optimize CRC32 for arm64 – Result
• Optimization with assembly
• 2X-7X speedup
Optimize SHA256 for arm64
• SHA256 introduction
block rounds K Hash
SHA-256 512bits 64 32bits 32bits 256bits
Optimize SHA256 for arm64 – Message schedule
src/crypto/sha256/sha256block.go
for i := 0; i < 16; i++ {
	j := i * 4
	w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
}
for i := 16; i < 64; i++ {
	v1 := w[i-2]
	t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
	v2 := w[i-15]
	t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
	w[i] = t1 + w[i-7] + t2 + w[i-16]
}
With the arm64 SHA-256 instructions, four schedule words are produced per step (pseudocode):
for i := 16; i < 64; i+=4 {
	SHA256SU0 Vn.S4, Vd.S4
	SHA256SU1 Vm.S4, Vn.S4, Vd.S4
}
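For reference, the pure Go message schedule can be written as a standalone function. The sketch below is an illustration rather than the standard library's code: the names `sigma0`, `sigma1` and `schedule` are chosen here, and the shift-and-or rotations from the slide are expressed with `bits.RotateLeft32` using a negative count for a right rotation.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// sigma0 and sigma1 are the small sigma functions of the SHA-256 message
// schedule. RotateLeft32 with a negative count rotates right.
func sigma0(x uint32) uint32 {
	return bits.RotateLeft32(x, -7) ^ bits.RotateLeft32(x, -18) ^ (x >> 3)
}

func sigma1(x uint32) uint32 {
	return bits.RotateLeft32(x, -17) ^ bits.RotateLeft32(x, -19) ^ (x >> 10)
}

// schedule expands one 64-byte block into the 64 words used by the rounds.
func schedule(block *[64]byte) [64]uint32 {
	var w [64]uint32
	// The first 16 words are the block itself, read as big-endian uint32.
	for i := 0; i < 16; i++ {
		w[i] = binary.BigEndian.Uint32(block[i*4:])
	}
	// Each later word depends on four earlier ones; this is the part that
	// SHA256SU0/SHA256SU1 compute four words at a time.
	for i := 16; i < 64; i++ {
		w[i] = sigma1(w[i-2]) + w[i-7] + sigma0(w[i-15]) + w[i-16]
	}
	return w
}

func main() {
	var block [64]byte
	copy(block[:], "abc")
	w := schedule(&block)
	fmt.Printf("w[0]=%08x w[16]=%08x w[63]=%08x\n", w[0], w[16], w[63])
}
```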
Optimize SHA256 for arm64 – Hash Computation
src/crypto/sha256/sha256block.go
for i := 0; i < 64; i++ {
	t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]

	t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))

	h = g
	g = f
	f = e
	e = d + t1
	d = c
	c = b
	b = a
	a = t1 + t2
}
With the arm64 SHA-256 instructions, four rounds are processed per step (pseudocode):
for i := 0; i < 64; i+=4 {
	SHA256H Vm, Vn, Vd.4S
	SHA256H2 Vm, Vn, Vd.4S
}
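A single round of the loop above can be isolated as a function, which makes it easier to see what one quarter of a SHA256H/SHA256H2 step has to compute. This is a sketch: `sha256Round` and `rotr` are hypothetical names, and the state is passed as an array in the order a, b, c, d, e, f, g, h.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotr rotates x right by n bits.
func rotr(x uint32, n int) uint32 { return bits.RotateLeft32(x, -n) }

// sha256Round applies one SHA-256 round to the working state (a..h),
// given the round constant k and the schedule word w for that round.
func sha256Round(s [8]uint32, k, w uint32) [8]uint32 {
	a, b, c, d, e, f, g, h := s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7]
	t1 := h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (^e & g)) + k + w
	t2 := (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))
	// The state shifts down by one position; a and e receive new values.
	return [8]uint32{t1 + t2, a, b, c, d + t1, e, f, g}
}

func main() {
	// SHA-256 initial hash values and the first round constant.
	state := [8]uint32{
		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
	}
	const k0 = 0x428a2f98
	next := sha256Round(state, k0, 0x61626380) // w[0] for the message "abc"
	fmt.Printf("%08x\n", next)
}
```

Only two of the eight state words are recomputed in each round; the other six move down by one position.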
Optimize SHA256 for arm64 – Implementation
src/crypto/sha256/sha256block_arm64.s
Optimize SHA256 for arm64 – Result
• Optimization with assembly
• 2X-16X speedup
Optimize IndexByte for arm64 – Before
[Figure: searching the bytes "H E L L O W O R L D …" for the byte 'D' held in R2, with registers R0 and R1 annotated on the input]
src/runtime/asm_arm64.s
Optimize IndexByte for arm64 – After
• Assembly implementation with SIMD
• SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16
Compare 16 bytes in parallel
More details:
• Input slice shorter than 16
• Input slice address not 16-byte aligned
• Input slice size not 16-byte aligned
• Count trailing zeros (not leading zeros)
• Implementation:
• https://go-review.googlesource.com/c/go/+/41654
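The idea behind the SIMD version can be sketched in portable Go. The code below is a model under stated assumptions, not the assembly's algorithm: `matchMask` stands in for comparing a 16-byte chunk in parallel and packing the result into one bit per byte, and `indexByte` is a hypothetical name. The trailing-zero count then yields the first match, which is why trailing rather than leading zeros are counted. Alignment handling is omitted; short inputs and tails are covered by using a shorter final chunk.

```go
package main

import (
	"fmt"
	"math/bits"
)

// matchMask models a 16-lane byte compare: bit i of the result is set when
// chunk[i] == c. The chunk must be at most 16 bytes long.
func matchMask(chunk []byte, c byte) uint16 {
	var m uint16
	for i, b := range chunk {
		if b == c {
			m |= uint16(1) << uint(i)
		}
	}
	return m
}

// indexByte returns the index of the first c in s, or -1 if c is absent.
func indexByte(s []byte, c byte) int {
	for base := 0; base < len(s); base += 16 {
		end := base + 16
		if end > len(s) {
			end = len(s) // final chunk shorter than 16 bytes
		}
		if m := matchMask(s[base:end], c); m != 0 {
			// The lowest set bit is the earliest matching byte.
			return base + bits.TrailingZeros16(m)
		}
	}
	return -1
}

func main() {
	s := []byte("HELLO WORLD")
	fmt.Println(indexByte(s, 'D'), indexByte(s, 'L'), indexByte(s, 'Z'))
}
```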
Optimize IndexByte for arm64 – Result
• Optimization with SIMD
• 1.5X-8X speedup
Work Summary
Disassembler (arm64):
• https://go-review.googlesource.com/c/arch/+/43651
• https://go-review.googlesource.com/c/arch/+/56810
• https://go-review.googlesource.com/c/go/+/58930
• https://go-review.googlesource.com/c/go/+/56331
• https://go-review.googlesource.com/c/go/+/49530
Assembler (arm64):
• https://go-review.googlesource.com/c/go/+/33594
• https://go-review.googlesource.com/c/go/+/33595
• https://go-review.googlesource.com/c/go/+/41511
• https://go-review.googlesource.com/c/go/+/41654
• https://go-review.googlesource.com/c/go/+/45850
• https://go-review.googlesource.com/c/go/+/54951
• https://go-review.googlesource.com/c/go/+/54990
• https://go-review.googlesource.com/c/go/+/57852
• https://go-review.googlesource.com/c/go/+/58350
• https://go-review.googlesource.com/c/go/+/56030
• https://go-review.googlesource.com/c/go/+/46438
• https://go-review.googlesource.com/c/go/+/41653
Optimizations:
• https://go-review.googlesource.com/c/go/+/40074
• https://go-review.googlesource.com/c/go/+/61550
• https://go-review.googlesource.com/c/go/+/61570
• https://go-review.googlesource.com/c/go/+/33597
• https://go-review.googlesource.com/c/go/+/64490
• https://go-review.googlesource.com/c/go/+/55610
Others:
• https://go-review.googlesource.com/c/go/+/61511
• https://go-review.googlesource.com/c/go/+/62850
• https://go-review.googlesource.com/c/go/+/45112
• https://go-review.googlesource.com/c/go/+/44390
• https://go-review.googlesource.com/c/go/+/42971
• https://go-review.googlesource.com/c/go/+/40511
• https://go-review.googlesource.com/c/arch/+/37172
Next Steps
• Crypto optimizations:
• aes, elliptic, …
• SIMD optimizations:
• strings, bytes, runtime, reflect, …
• Compiler SSA arm64 back-end optimizations
• Others
• Internal arm64 linker
• Tool for arm64: race detector, memory sanitizer, …
• New architecture features
• ...
Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!
CGo
[Figure: CGo sits between the Go ABI and the C ABI]
package print

// #include <stdio.h>
// #include <stdlib.h>
import "C"
import "unsafe"

// Print writes s to C's stdout through cgo.
func Print(s string) {
	cs := C.CString(s) // allocates a C copy of s
	C.fputs(cs, (*C.FILE)(C.stdout))
	C.free(unsafe.Pointer(cs)) // the C copy must be freed explicitly
}
Branch Difference from GNU Assembly
• On arm64: B is alias for JMP, BL is alias for CALL
• Jump to labels
	JMP L1
	NOP
L1:
	NOP
L2:	NOP
	NOP
	B L2
• Call and indirect jump
	BL p·foo(SB)
	MOVD $p·foo(SB), R3
	CALL (R3)
	B (R3)
	MOVD 0(R26), R4
	JMP (R4)
• Jump relative to PC (useful in macros!)
	JMP 2(PC)
	NOP
	NOP
	NOP
	NOP
	JMP -2(PC)

Más contenido relacionado

Más de Linaro

Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...Linaro
 
HKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramHKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramLinaro
 
HKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNHKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNLinaro
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...Linaro
 
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...Linaro
 
HKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionHKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionLinaro
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersLinaro
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightLinaro
 

Más de Linaro (20)

Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
 
HKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramHKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready Program
 
HKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNHKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NN
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
 
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
 
HKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionHKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: Introduction
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with Coresight
 

Último

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 

Optimizing GoLang for High Performance with ARM64 Assembly - SFO17-314

  • 1. © 2017 Arm Limited SFO17-314 Optimizing Golang for High Performance with ARM64 AssemblyWei Xiao Staff Software Engineer Wei.Xiao@arm.com September 27, 2017 Linaro Connect SFO17
  • 2. © 2017 Arm Limited2 Agenda • Introduction • Differences from GNU Assembly • Integrate assembly into Golang • Optimize CRC32 for arm64 • Optimize SHA256 for arm64 • Optimize IndexByte for arm64 • Work Summary and Next steps
  • 3. © 2017 Arm Limited3 Introduction • Assembly optimization benefits • Take advantages of ARMv8 capabilities – Hardware specific instructions (such as SVC, AES, SHA and etc.) – Vector (Single Instruction Multiple Data) Instructions • Others – No need for CGo dependency – Avoid runtime context switching overhead – Optimized code (vs Go compiler) – Faster compilation
  • 4. © 2017 Arm Limited4 Assembly Optimization Current Status • Go Standard packages with assembly optimization crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5 crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512 hash/crc32 math math/big reflect runtime runtime/cgo runtime/internal/atomicruntime/internal/sys strings sync/atomic syscall …… red – arm64 optimization ongoing black – no arm64 optimization
  • 5. © 2017 Arm Limited5 Assembly Terminology • Mnemonic • CALL, MOVW, MOVD, … • Register • R1, F0, V3, … • Immediate • $1, $0x100, … • Memory • (R1), 8(R3), … Registers in AArch64
  • 6. © 2017 Arm Limited6 Instruction Differences from GNU Assembly • Semi-abstract instruction set (Plan 9 from Bell Labs) • Architecture independent mnemonics like MOVD • Some architecture aspects shine through • Assembler may insert prologues, remove ‘unreachable’ instructions • Instructions may be expanded by the assembler • Not all instructions available • BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly 1 // func Add(a, b int) int 2 TEXT ·Add(SB),$0-24 3 MOVD arg1+0(FP), R0 4 MOVD arg2+8(FP), R1 5 ADD R1, R0, R0 6 MOVD R0, ret+16(FP) 7 RET
  • 7. © 2017 Arm Limited7 Operand Differences from GNU Assembly • Data flow from left to right • ADD R1, R2 → R2 += R1 • SUBW R12<<29, R7, R8 → R8 = R7 – (R12<<29) • Memory operands: base + offset • MOVH (R1), R2 → R2 = *R1 • MOVBU 8(R3), R4 → R4 = *(8 + R3) • MOVD mypackage·myvar(SB), R8 → R8 = *myvar • Addresses • MOVD $8(R1), R3 → R3 = R1 + 8 • MOVD $·myvar(SB), R4 → R4 = &myvar package mypackage var myvar int64 Unicode U+00B7
  • 8. © 2017 Arm Limited8 Go Assembly Extension for arm64 • Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd • Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T> • Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd • Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>] • Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>] • Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go • Full details • https://go-review.googlesource.com/c/go/+/41654
  • 9. Assembly Build Rule
      • The toolchain selects the appropriate assembly files according to GOOS and GOARCH
      • Selection is based on file name suffixes, e.g. sys_linux_arm64.s, sys_darwin_arm64.s
      • Example: assembly files for hash/crc32
          • crc32_amd64p32.s
          • crc32_amd64.s
          • crc32_arm64.s
          • crc32_ppc64le.s, crc32_table_ppc64le.s
          • crc32_s390x.s
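The suffix rule above can be sketched in Go as follows. This is a simplified illustration, not the toolchain's actual code: `matchesTarget`, `knownOS` and `knownArch` are hypothetical names, the two tables list only a handful of values, and build tags are ignored entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// Small illustrative subsets; the real toolchain knows many more values.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{
	"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true,
}

// matchesTarget reports whether an assembly file would be selected for the
// given target, looking only at a trailing _GOOS_GOARCH, _GOOS or _GOARCH.
func matchesTarget(filename, goos, goarch string) bool {
	name := strings.TrimSuffix(filename, ".s")
	parts := strings.Split(name, "_")
	n := len(parts)
	switch {
	case n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]]:
		return parts[n-2] == goos && parts[n-1] == goarch
	case n >= 2 && knownOS[parts[n-1]]:
		return parts[n-1] == goos
	case n >= 2 && knownArch[parts[n-1]]:
		return parts[n-1] == goarch
	}
	return true // no recognized suffix: the file applies to every target
}

func main() {
	files := []string{
		"sys_linux_arm64.s", "sys_darwin_arm64.s",
		"crc32_amd64.s", "crc32_arm64.s", "crc32_table_ppc64le.s",
	}
	for _, f := range files {
		if matchesTarget(f, "linux", "arm64") {
			fmt.Println(f) // prints sys_linux_arm64.s and crc32_arm64.s
		}
	}
}
```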
  • 10. Prototype
      • A function call is the bridge between Go and assembly
      • Function declaration (src/runtime/timestub.go): func walltime() (sec int64, nsec int32)
      • Function assembly implementation: runtime/sys_linux_arm64.s
      • Parts labeled on the slide: package (optional), middle dot, function name, flag (optional), stack frame size, arguments size (optional)
  • 11. Pseudo-registers
      • FP: Frame Pointer
          • Points to the bottom of the argument list
          • Offsets are positive
          • Offsets must include a name, e.g. arg+0(FP)
      • SP: Stack Pointer
          • Points to the top of the space allocated for local variables
          • Offsets are negative
          • Offsets must include a name, e.g. ptr-8(SP)
      • SB: Static Base
          • Named offsets from a global base
      • [Figure: two stack diagrams, each drawn from low address to high address]
  • 12. Calling Convention
      • All arguments are passed on the stack, at offsets from FP
      • Return values follow the input arguments
          • The start of the return values is aligned to the pointer size
      • All registers are caller-saved, except:
          • Stack pointer register (RSP)
          • G context pointer register (R28)
          • Frame pointer (R29)
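The layout rules above determine both the FP offsets and the argument size written after the dash in a TEXT directive such as $0-24. The Go sketch below works them out for three functions that appear in this deck, assuming the stack-based convention described on the slide and an 8-byte pointer size. The `field` type and the `layout` helper are hypothetical names for illustration.

```go
package main

import "fmt"

const ptrSize = 8 // pointer size on arm64, in bytes

// field describes one argument or result slot: name, size and alignment in bytes.
type field struct {
	name        string
	size, align int
}

// alignUp rounds off up to the next multiple of a.
func alignUp(off, a int) int { return (off + a - 1) / a * a }

// layout returns the offset of every slot from FP and the total size of the
// argument and result area. Results start at a pointer-aligned offset.
func layout(args, results []field) (map[string]int, int) {
	offsets := make(map[string]int)
	off := 0
	for _, f := range args {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	if len(results) > 0 {
		off = alignUp(off, ptrSize)
	}
	for _, f := range results {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	return offsets, off
}

func main() {
	// func Add(a, b int) int
	_, total := layout(
		[]field{{"a", 8, 8}, {"b", 8, 8}},
		[]field{{"ret", 8, 8}})
	fmt.Println("Add:", total) // 24, matching $0-24

	// func castagnoliUpdate(crc uint32, p []byte) uint32
	// A slice occupies three words: base pointer, length, capacity.
	offs, total := layout(
		[]field{{"crc", 4, 4}, {"p_base", 8, 8}, {"p_len", 8, 8}, {"p_cap", 8, 8}},
		[]field{{"ret", 4, 4}})
	fmt.Println("castagnoliUpdate:", offs["p_len"], offs["ret"], total) // 16 32 36

	// func walltime() (sec int64, nsec int32)
	_, total = layout(nil, []field{{"sec", 8, 8}, {"nsec", 4, 4}})
	fmt.Println("walltime:", total) // 12
}
```

The 4-byte crc argument is followed by padding so that the slice header starts at 8(FP), which is why the CRC32 assembly later in the deck reads p+8(FP) and writes ret+32(FP).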
  • 13. arm64 Stack Frame
      • [Figure: stack frame layouts without and with a frame pointer, drawn from low address to high address]
  • 14. Optimize CRC32 for arm64 – Before
      • Pure Go table-driven implementation (src/hash/crc32/crc32_generic.go):
          func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
              crc = ^crc
              for _, v := range p {
                  crc = tab[byte(crc)^v] ^ (crc >> 8)
              }
              return ^crc
          }
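The table-driven loop above can be tried outside the standard library with a few extra lines. The sketch below copies the function, swaps the package-internal `Table` for the exported `crc32.Table`, and compares the result with `crc32.Checksum`. The `castagnoliTable` variable is a local name chosen for this example.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoliTable is the 256-entry lookup table for the Castagnoli polynomial.
var castagnoliTable = crc32.MakeTable(crc32.Castagnoli)

// simpleUpdate folds p into crc one byte at a time, with one table lookup per byte.
func simpleUpdate(crc uint32, tab *crc32.Table, p []byte) uint32 {
	crc = ^crc
	for _, v := range p {
		crc = tab[byte(crc)^v] ^ (crc >> 8)
	}
	return ^crc
}

func main() {
	data := []byte("123456789")
	got := simpleUpdate(0, castagnoliTable, data)
	want := crc32.Checksum(data, castagnoliTable)
	fmt.Printf("simpleUpdate: %#x, hash/crc32: %#x\n", got, want)
}
```

Each byte costs a dependent table load, which is the bottleneck that a hardware CRC instruction removes.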
  • 15. Optimize CRC32 for arm64 – After
      • Assembly for arm64 (src/hash/crc32/crc32_arm64.s):
          // func castagnoliUpdate(crc uint32, p []byte) uint32
          TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
              MOVWU   crc+0(FP), R9      // CRC value
              MOVD    p+8(FP), R13       // data pointer
              MOVD    p_len+16(FP), R11  // len(p)
              CMP     $8, R11
              BLT     less_than_8
          update:
              MOVD.P  8(R13), R10
              CRC32CX R10, R9
              SUB     $8, R11
              CMP     $8, R11
              BLT     less_than_8
              JMP     update
          …
          done:
              MOVWU   R9, ret+32(FP)
              RET
      • Argument frame: crc at 0(FP), p.base at 8(FP), p.len at 16(FP), p.cap at 24(FP), ret at 32(FP)
  • 16. Optimize CRC32 for arm64 – Result
      • Optimization with assembly: 2X-7X speedup
  • 17. Optimize SHA256 for arm64
      • SHA256 introduction
      • [Figure: SHA-256 parameters: 512-bit block, 64 rounds, 32-bit words, 32-bit round constants K, 256-bit hash]
  • 18. Optimize SHA256 for arm64 – Message schedule
      • Go implementation (src/crypto/sha256/sha256block.go):
          for i := 0; i < 16; i++ {
              j := i * 4
              w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
          }
          for i := 16; i < 64; i++ {
              v1 := w[i-2]
              t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
              v2 := w[i-15]
              t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
              w[i] = t1 + w[i-7] + t2 + w[i-16]
          }
      • With the arm64 SHA-256 instructions (pseudocode, four schedule words per iteration):
          for i := 16; i < 64; i+=4 {
              SHA256SU0 Vn.S4, Vd.S4
              SHA256SU1 Vm.S4, Vn.S4, Vd.S4
          }
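The message schedule loop above can be run on its own. The sketch below wraps it in a hypothetical `schedule` function, writes the rotations with `math/bits`, and feeds it the single padded block for the message "abc" (built by the illustrative helper `paddedABC`).

```go
package main

import (
	"fmt"
	"math/bits"
)

// schedule expands one 64-byte block into the 64-word SHA-256 message schedule.
func schedule(p *[64]byte) [64]uint32 {
	var w [64]uint32
	// The first 16 words are the block itself, read as big-endian uint32.
	for i := 0; i < 16; i++ {
		j := i * 4
		w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
	}
	// Each later word mixes four earlier words; RotateLeft32 with a negative
	// count is a rotate right.
	for i := 16; i < 64; i++ {
		v1 := w[i-2]
		t1 := bits.RotateLeft32(v1, -17) ^ bits.RotateLeft32(v1, -19) ^ (v1 >> 10)
		v2 := w[i-15]
		t2 := bits.RotateLeft32(v2, -7) ^ bits.RotateLeft32(v2, -18) ^ (v2 >> 3)
		w[i] = t1 + w[i-7] + t2 + w[i-16]
	}
	return w
}

// paddedABC returns the padded block for the 3-byte message "abc":
// the message, a 0x80 byte, zeros, and the bit length (24) in the last byte.
func paddedABC() *[64]byte {
	var b [64]byte
	copy(b[:], "abc")
	b[3] = 0x80
	b[63] = 24
	return &b
}

func main() {
	w := schedule(paddedABC())
	fmt.Printf("w[0]=%08x w[16]=%08x w[17]=%08x\n", w[0], w[16], w[17])
}
```

Because w[i] depends on w[i-2], the scalar loop is a serial chain; the SHA256SU0/SHA256SU1 pair computes four words of that chain per iteration.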
  • 19. Optimize SHA256 for arm64 – Hash Computation
      • Go implementation (src/crypto/sha256/sha256block.go):
          for i := 0; i < 64; i++ {
              t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]
              t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))
              h = g
              g = f
              f = e
              e = d + t1
              d = c
              c = b
              b = a
              a = t1 + t2
          }
      • With the arm64 SHA-256 instructions (pseudocode, four rounds per iteration):
          for i := 0; i < 64; i+=4 {
              SHA256H  Vm, Vn, Vd.4S
              SHA256H2 Vm, Vn, Vd.4S
          }
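A single iteration of the round loop above can be isolated as a pure function, which makes the data flow easier to follow. In this sketch the `state` type and the `round` and `rotr` helpers are hypothetical names; the arithmetic is the same as in the loop body. The values used in `main` are the standard SHA-256 initial hash values, the first round constant, and the first schedule word of the padded message "abc".

```go
package main

import (
	"fmt"
	"math/bits"
)

// state holds the eight working variables a..h in order.
type state [8]uint32

// rotr rotates x right by n bits.
func rotr(x uint32, n int) uint32 { return bits.RotateLeft32(x, -n) }

// round applies one SHA-256 round with round constant k and schedule word w.
func round(s state, k, w uint32) state {
	a, b, c, d, e, f, g, h := s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7]
	t1 := h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (^e & g)) + k + w
	t2 := (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))
	// Six of the eight variables just shift down by one position;
	// only the new a and e carry fresh information.
	return state{t1 + t2, a, b, c, d + t1, e, f, g}
}

func main() {
	iv := state{
		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
	}
	fmt.Printf("%08x\n", round(iv, 0x428a2f98, 0x61626380))
}
```

Only two words change per round, so the state fits in two 128-bit vectors, and SHA256H and SHA256H2 each update one of them for four rounds at a time.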
  • 20. Optimize SHA256 for arm64 – Implementation
      • Source: src/crypto/sha256/sha256block_arm64.s
  • 21. Optimize SHA256 for arm64 – Result
      • Optimization with assembly: 2X-16X speedup
  • 22. Optimize IndexByte for arm64 – Before
      • Implementation in src/runtime/asm_arm64.s
      • [Figure: search for the byte “D” in “HELLO WORLD…”, with registers R0, R1 and R2 marked]
  • 23. Optimize IndexByte for arm64 – After
      • Assembly implementation with SIMD
      • SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16 compares 16 bytes in parallel
      • More details (cases to handle):
          • Input slice shorter than 16 bytes
          • Input slice address not 16-byte aligned
          • Input slice size not a multiple of 16 bytes
          • Count trailing zeros (not leading zeros)
      • Implementation: https://go-review.googlesource.com/c/go/+/41654
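Go has no portable SIMD intrinsics, so the sketch below illustrates the same idea with a scalar analogue: it compares 8 bytes at a time inside a uint64 (a "SIMD within a register" trick) instead of 16 bytes with CMEQ. It is not the implementation from the linked change. The function name `indexByte` and the constants are chosen for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	ones  uint64 = 0x0101010101010101 // 0x01 in every byte lane
	highs uint64 = 0x8080808080808080 // 0x80 in every byte lane
)

// indexByte returns the index of the first c in s, or -1 if c is absent.
func indexByte(s []byte, c byte) int {
	pattern := ones * uint64(c) // c replicated into all 8 lanes
	i := 0
	for ; i+8 <= len(s); i += 8 {
		// After the XOR, a lane is zero exactly where s matches c.
		x := binary.LittleEndian.Uint64(s[i:]) ^ pattern
		// Flag zero lanes. Lanes above the first match may be flagged
		// spuriously, but the lowest flagged lane is always a real match.
		if m := (x - ones) & ^x & highs; m != 0 {
			// Little-endian load: the lowest lane is the earliest byte,
			// so trailing (not leading) zeros locate the first match.
			return i + bits.TrailingZeros64(m)/8
		}
	}
	// Tail shorter than one chunk: fall back to a byte loop.
	for ; i < len(s); i++ {
		if s[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	s := []byte("HELLO WORLD")
	fmt.Println(indexByte(s, 'D'), indexByte(s, 'L'), indexByte(s, 'Z')) // 10 2 -1
}
```

The sketch sidesteps the alignment cases from the list above because `binary.LittleEndian.Uint64` makes no alignment assumption; it only has to deal with the short tail.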
  • 24. Optimize IndexByte for arm64 – Result
      • Optimization with SIMD: 1.5X-8X speedup
  • 25. Work Summary
      • Disassembler (arm64):
          • https://go-review.googlesource.com/c/arch/+/43651
          • https://go-review.googlesource.com/c/arch/+/56810
          • https://go-review.googlesource.com/c/go/+/58930
          • https://go-review.googlesource.com/c/go/+/56331
          • https://go-review.googlesource.com/c/go/+/49530
      • Assembler (arm64):
          • https://go-review.googlesource.com/c/go/+/33594
          • https://go-review.googlesource.com/c/go/+/33595
          • https://go-review.googlesource.com/c/go/+/41511
          • https://go-review.googlesource.com/c/go/+/41654
          • https://go-review.googlesource.com/c/go/+/45850
          • https://go-review.googlesource.com/c/go/+/54951
          • https://go-review.googlesource.com/c/go/+/54990
          • https://go-review.googlesource.com/c/go/+/57852
          • https://go-review.googlesource.com/c/go/+/58350
          • https://go-review.googlesource.com/c/go/+/56030
          • https://go-review.googlesource.com/c/go/+/46438
          • https://go-review.googlesource.com/c/go/+/41653
      • Optimizations:
          • https://go-review.googlesource.com/c/go/+/40074
          • https://go-review.googlesource.com/c/go/+/61550
          • https://go-review.googlesource.com/c/go/+/61570
          • https://go-review.googlesource.com/c/go/+/33597
          • https://go-review.googlesource.com/c/go/+/64490
          • https://go-review.googlesource.com/c/go/+/55610
      • Others:
          • https://go-review.googlesource.com/c/go/+/61511
          • https://go-review.googlesource.com/c/go/+/62850
          • https://go-review.googlesource.com/c/go/+/45112
          • https://go-review.googlesource.com/c/go/+/44390
          • https://go-review.googlesource.com/c/go/+/42971
          • https://go-review.googlesource.com/c/go/+/40511
          • https://go-review.googlesource.com/c/arch/+/37172
  • 26. Next Steps
      • Crypto optimizations: aes, elliptic, …
      • SIMD optimizations: strings, bytes, runtime, reflect, …
      • Compiler SSA arm64 back-end optimizations
      • Others
          • Internal arm64 linker
          • Tools for arm64: race detector, memory sanitizer, …
          • New architecture features
          • ...
  • 28. CGo
      • [Figure: CGo sits between the Go ABI and the C ABI]
      • Example:
          package print
          // #include <stdio.h>
          // #include <stdlib.h>
          import "C"
          import "unsafe"
          func Print(s string) {
              cs := C.CString(s)                // copy the Go string into C memory
              C.fputs(cs, (*C.FILE)(C.stdout))
              C.free(unsafe.Pointer(cs))        // the C copy must be freed explicitly
          }
  • 29. Branch Difference from GNU Assembly
      • On arm64: B is an alias for JMP, BL is an alias for CALL
      • Jump to labels:
              JMP L1
              NOP
          L1: NOP
          L2: NOP
              NOP
              B L2
      • Call and indirect jump:
              BL $p.foo
              MOV $p·foo, R3
              CALL (R3)
              B (R3)
              MOV 0(R26), R4
              JMP (R4)
      • Jump relative to PC (useful in macros!):
              JMP 2(PC)
              NOP
              NOP
              NOP
              NOP
              JMP -2(PC)