2. What is Loongson?
• “A Chinese Challenge to Intel”
• Microprocessor development project at ICT (Institute of Computing Technology, Chinese Academy of Sciences)
• STMicroelectronics manufactures and sells the chips
• MIPS compatible, but independently developed
6. Also other OSes
• Linux: Debian, RedFlag, Mandriva...
• NetBSD
• Windows CE
7. GS464 (Loongson 3A)
• Scalable Architecture
• Reconfigurable CPU core and L2
• Hardware-assisted x86 emulation
• Low power consumption
8. Scalable Architecture
Scalable Architecture Design
• Scalable interconnection network
  • Crossbar + mesh
  • 8x8 crossbar
  • A single crossbar connects the cores, the L2s, and the four directions (E, S, W, N)
• Directory-based cache coherence protocol
  • Distributed L2 caches are globally addressed
  • Each cache block has a directory entry
  • Both data cache and instruction cache are recorded in the directory
• [partly illegible] ... core on 65nm (3B), 4 core on 32nm (3C)
[Figure: four cores (P0 to P3) and four L2 banks attached to one 8x8 crossbar (X bar), with E, S, W, and N mesh ports]
9. Reconfigurable CPU core and L2
Reconfigurable architecture
• General-purpose core: GS464
• Special-purpose core: GStera
• The DMA engine can be configured to achieve high performance
• 8 configurable address windows on each master port allow page migration across L2 and memory
10. Hardware-assisted x86 emulation
• With software-based binary translation, some x86 instructions require tens of MIPS instructions because of the differences between the ISAs
• 200+ new instructions were added to reduce the instruction count of translated code
11. Virtual machine architecture
[Figure 1. GS464 microarchitecture. GS464 adopts a nine-stage dynamical pipeline. Legend: BHT: Branch history table; BRQ: Bandwidth request; DTLB: Data translation look-aside buffer; ITLB: Instruction translation look-aside buffer; RAS: Return address stack; TAP: Test access port]
[Figure 2. The GS464 virtual machine’s software architecture. The x86 operating systems and applications are built on MIPS Linux system through virtual machine monitor. Layers, top to bottom: Microsoft Windows on the system-level x86 virtual machine, Linux applications on x86 on the process-level x86 virtual machine, and Linux applications on MIPS; Linux on MIPS; enhanced MIPS core]
• It’s just QEMU on Linux
• QEMU modified to improve performance, using new instructions
Hardware support for EFlag: a major difference between the x86 and MIPS ISAs is that the x86 ISA uses EFlags.
12. x86 EFlag support
• Most x86 fixed-point arithmetic instructions generate EFlag bits
• Branch directions of branch instructions are determined according to the EFlag
• MIPS doesn’t have a flag register!
Therefore the translated code must check each result and set/clear bits in a virtual EFlag register at run time
• That’s very costly
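To make the cost concrete, here is a minimal sketch of the work a pure-software translator has to do for one 32-bit SUB. It is an illustration in Python, not code from QEMU or from the paper; the function names `sub32_with_flags` and `je_taken` are made up for this example, and only four of the EFlag bits are modeled.

```python
MASK32 = 0xFFFFFFFF

def sub32_with_flags(a: int, b: int) -> tuple[int, dict[str, int]]:
    """Emulate a 32-bit x86 SUB and derive four EFlag bits in software."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    flags = {
        "ZF": int(result == 0),                # zero flag
        "SF": result >> 31,                    # sign flag = bit 31 of the result
        "CF": int(a < b),                      # carry flag = unsigned borrow
        "OF": ((a ^ b) & (a ^ result)) >> 31,  # signed overflow
    }
    return result, flags

def je_taken(flags: dict[str, int]) -> bool:
    """JE (jump if equal) is taken when ZF is set."""
    return flags["ZF"] == 1

result, flags = sub32_with_flags(5, 5)
print(result, flags, je_taken(flags))
```

Every flag needs its own compare, shift, or mask, which is why a single x86 arithmetic instruction can expand into many MIPS instructions when there is no hardware help.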
13. x86 EFlag support: Solution
• Add new instructions to handle EFlag
• Generate EFlag
• Branch on EFlag
14.
Number of
instructions  Instruction                      Comment
0             SUB ECX EDX
1             JE X86_target
(a)
0.00          SUBU Result Recx Redx
0.01          SRL Rsf Result 31                /*SF=Result[31]*/
0.02          BEQ Result R0 L1
0.03          ADD Rzf R0 R0                    /*ZF=0*/
0.04          B L2
0.05          NOP
0.06          L1: ADDI Rzf R0 1                /*ZF=1*/
...
0.35          B L8
0.36          NOP
0.37          L7: ADDI Rcf R0 1                /*CF=1*/
0.38          L8: ADD Recx Result R0
1.00          BNE Rzf R0 MIPS_target
1.01          NOP
(b)
0.0           SUBU Result Recx Redx            /*Generating Sub result*/
0.1           SETFLAG
0.2           SUBU Reflag Recx Redx            /*Generating EFLAGS*/
1.0           X86JE Reflag MIPS_target         /*Branch on EFLAGS*/
(c)
0.0           SUB Result Recx Redx             /*Generating Sub result*/
0.1           X86SUB Reflag Recx Redx          /*Generating EFLAGS*/
1.0           X86JE Reflag MIPS_target         /*Branch on EFLAGS*/
(d)
15. x87 support
• Register stack:
• Maintaining TOP pointer is costly
• Calculating absolute register number from
relative register number is costly
• Emulating x87 tag to detect stack overflow/
underflow is costly
• 80-bit floating point:
MIPS only has 64-bit floating point!
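The register-stack bookkeeping listed above can be sketched as follows. This is a simplified Python model written for these notes, not the translator's actual code: the class name `X87Stack` is hypothetical, and the tag is reduced to a plain empty/valid bit per register.

```python
class X87Stack:
    """Simplified model of the eight-entry x87 register stack."""

    def __init__(self) -> None:
        self.regs = [0.0] * 8
        self.empty = [True] * 8  # simplified tag: True means the slot is empty
        self.top = 0             # TOP pointer

    def physical(self, i: int) -> int:
        """Map the relative register ST(i) to an absolute register number."""
        return (self.top + i) & 7

    def push(self, value: float) -> None:
        new_top = (self.top - 1) & 7  # a push decrements TOP
        if not self.empty[new_top]:
            raise OverflowError("x87 stack overflow")
        self.top = new_top
        self.regs[new_top] = value
        self.empty[new_top] = False

    def pop(self) -> float:
        if self.empty[self.top]:
            raise IndexError("x87 stack underflow")
        value = self.regs[self.top]
        self.empty[self.top] = True
        self.top = (self.top + 1) & 7  # a pop increments TOP
        return value

    def st(self, i: int) -> float:
        slot = self.physical(i)
        if self.empty[slot]:
            raise IndexError("x87 stack underflow")
        return self.regs[slot]

stack = X87Stack()
stack.push(1.0)
stack.push(2.0)
print(stack.st(0), stack.st(1), stack.top)
```

Each x87 instruction touches TOP, the relative-to-absolute mapping, and the tag, so doing all three in software on every instruction adds up quickly.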
16. x87 support: Solution
• The TOP value is calculated in the decode stage, using register renaming
A new field in the FP control register holds the TOP pointer
=> Saves 10+ instructions per x87 instruction
• New instructions to emulate the x87 tag, and a new exception to detect stack overflow/underflow
• New instructions for 80-bit floating point:
• 80-bit FP number in two 64-bit registers => 64-bit FP number in one 64-bit register
• 64-bit FP number in one 64-bit register => 80-bit FP number in two 64-bit registers
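For reference, the 80-bit to 64-bit direction looks roughly like this when done in software. The sketch decodes the x87 double extended format (1 sign bit, 15-bit exponent with bias 16383, 64-bit significand with an explicit integer bit) and lets Python's float do the rounding; the function name is made up for this example, and the rounding behavior of the real hardware instructions is not specified in the slides.

```python
import math
import struct

def x87_extended_to_double(raw: bytes) -> float:
    """Convert a 10-byte little-endian x87 extended value to a 64-bit float."""
    if len(raw) != 10:
        raise ValueError("expected 10 bytes")
    significand, sign_exp = struct.unpack("<QH", raw)
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exponent = sign_exp & 0x7FFF
    if exponent == 0x7FFF:
        # All-ones exponent: infinity if the fraction is zero, else NaN.
        fraction = significand & 0x7FFFFFFFFFFFFFFF
        return sign * math.inf if fraction == 0 else math.nan
    if significand == 0:
        return sign * 0.0
    try:
        # value = significand * 2^(exponent - bias - 63); extended denormals
        # are far below the double range and simply underflow to zero here.
        return sign * math.ldexp(significand, exponent - 16383 - 63)
    except OverflowError:
        return sign * math.inf  # too large for a 64-bit double

one = struct.pack("<QH", 0x8000000000000000, 0x3FFF)
print(x87_extended_to_double(one))
```

The field extraction, re-biasing, and special-case handling are the steps that a dedicated conversion instruction folds into one operation.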
17.
Number of
instructions  Instruction                 Comment
0             FLD *%R10
1             FMUL *16(%R10)
2             FSTP *%R10
(a)
0.00          LD Rtmp1 12(R8)             /*convert 1st operand*/
0.01          LD Rtmp2 4(R8)
0.02          ANDI Rsign Rtmp1            /*get sign bit and sign bit of exp*/
0.03          DSLL32 Rsign Rsign 16       /*get biased exponent*/
...
0.23          DMTC1 F8 Rfp2
1.00          MUL.d F9 F7 F8              /*64-bit multiply*/
2.00          DMFC1 Rres F9
2.01          DSRL32 Rsign Rres 31        /*get sign bit*/
...
2.12          SD Rres1 12(R8)             /*write back result*/
2.13          SD Rres2 4(R8)
(b)
0.0           GSLQC1 F4 4(R8)             /*128-bit load to F4 and F5*/
0.1           CVT.d.ld F7 F4 F5           /*80-bit to 64-bit convert*/
0.2           GSLQC1 F2 20(R8)            /*128-bit load to F2 and F3*/
0.3           CVT.d.ld F8 F2 F3           /*80-bit to 64-bit convert*/
1.0           MUL.d F9 F7 F8              /*64-bit multiplication*/
2.0           CVT.ud.d F7 F9              /*64-bit to high part of 80-bit*/
2.1           CVT.ld.d F8 F9              /*64-bit to low part of 80-bit*/
2.2           GSSQC1 F7 4(R8)             /*128-bit store*/
18. Multimedia instructions
• x86 has MMX, SSE, SSE2...
• MIPS has an extension instruction set called MDMX, but it is very different from the x86 multimedia instructions
• Added an original SIMD instruction set which is similar to SSE2
19. New addressing mode
• MIPS only supports
“(base) + disp” for fixed/float,
“(base) + (index)” for float
• x86 has more flexible addressing modes
ex: “(base) + (index) x scale + disp”
• A “(base) + (index) + disp8” addressing mode was added to translate them
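One way to picture how a translator could use the new mode is shown below. This Python model is only an assumption about the lowering, not something stated in the slides: it treats disp8 as a signed 8-bit displacement, applies the scale with a separate shift, and counts how many extra operations are needed before the single "(base) + (index) + disp8" access. The function names are invented for the example.

```python
MASK64 = (1 << 64) - 1

def x86_effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """Reference x86 address: base + index * scale + disp."""
    return (base + index * scale + disp) & MASK64

def translated_effective_address(base: int, index: int, scale: int, disp: int) -> tuple[int, int]:
    """Return (address, extra_ops) for a lowering onto base + index + disp8."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale must be 1, 2, 4, or 8")
    extra_ops = 0
    if scale != 1:
        # Assumption: the scale is applied by one shift of the index register.
        index = (index << (scale.bit_length() - 1)) & MASK64
        extra_ops += 1
    if not -128 <= disp <= 127:
        # Assumption: a displacement that does not fit disp8 is added first.
        base = (base + disp) & MASK64
        disp = 0
        extra_ops += 1
    return (base + index + disp) & MASK64, extra_ops

print(translated_effective_address(0x1000, 3, 4, 16))
```

With plain MIPS addressing, the same access would also need separate adds to combine base, index, and displacement before the load or store.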
20. Bounded load and store
• x86 has segmented addressing
• Bounded load/store instructions were added to handle this
They read a bound register as the memory-access boundary
• They raise an address exception if the memory access exceeds the boundary
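The behavior can be modeled in a few lines of Python. The sketch assumes the bound is an exclusive upper limit on the accessed bytes, which the slides do not spell out; `AddressError` and `bounded_load` are names chosen for the example.

```python
class AddressError(Exception):
    """Stands in for the address exception raised by the hardware."""

def bounded_load(memory: bytes, address: int, size: int, bound: int) -> int:
    """Load `size` bytes at `address`, failing if the access crosses `bound`."""
    if address < 0 or address + size > bound:
        raise AddressError(f"access at {address:#x} exceeds bound {bound:#x}")
    return int.from_bytes(memory[address:address + size], "little")

segment = bytes(range(16))
print(hex(bounded_load(segment, 4, 4, bound=16)))
```

Without such an instruction, the translator has to emit an explicit compare and branch around every memory access that is subject to a segment limit.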
21. Fixed-point multiplication and division
• MIPS fixed-point multiplication/division instructions use the special Hi/Lo registers as destination
An additional operation is needed to move data from the Hi/Lo registers to general-purpose registers
• Added fixed-point multiplication/division instructions which use a general-purpose register as destination
22. Byte insertion and extraction
• x86 supports 8-, 16-, 32-, and 64-bit operations
• MIPS only supports 32- and 64-bit operations
• Added flexible byte insertion instructions that can insert 8, 16, or 32 bits from any location of a register to any location of another register
Also added flexible byte extraction instructions
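As an illustration of what insertion and extraction mean here, the following Python sketch models both on 64-bit register values. The helper names and the bit-offset parameters are assumptions for the example; the slides do not give the real instruction encodings.

```python
def extract_field(src: int, src_bit: int, width: int) -> int:
    """Return `width` bits of `src`, starting at bit `src_bit`."""
    return (src >> src_bit) & ((1 << width) - 1)

def insert_field(dst: int, dst_bit: int, src: int, src_bit: int,
                 width: int, reg_bits: int = 64) -> int:
    """Copy `width` bits from `src` into `dst`, leaving other bits untouched."""
    mask = (1 << width) - 1
    field = (src >> src_bit) & mask
    cleared = dst & ~(mask << dst_bit)
    return (cleared | (field << dst_bit)) & ((1 << reg_bits) - 1)

# Emulating an 8-bit move into AH: write the low byte of one register
# into bits 8..15 of another.
rax = 0x1122334455667788
rbx = 0xAB
print(hex(insert_field(rax, 8, rbx, 0, 8)))
```

On plain MIPS each of these helpers corresponds to a short sequence of shifts, masks, and ORs, which is what a single insertion or extraction instruction replaces.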
23. CAM
• Translation of indirect branches is costly, because the translator must look up the branch target dynamically
• It requires a
<x86 branch target:MIPS branch target>
hash table to keep the mapping information
• A 64-entry CAM was added to speed it up
• CAM entry format: PID, Address, Data
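The lookup path can be sketched as a small cache in front of the software hash table. This Python model is a guess at the general shape only: the entry format (PID, Address, Data) and the 64-entry size come from the slide, while the FIFO replacement policy, the single-level lookup, and all names are assumptions made for the example.

```python
from collections import OrderedDict

class BranchTargetCAM:
    """Model of a small CAM keyed by (PID, x86 address)."""

    def __init__(self, entries: int = 64) -> None:
        self.entries = entries
        self.table = OrderedDict()

    def lookup(self, pid: int, address: int):
        return self.table.get((pid, address))

    def insert(self, pid: int, address: int, data: int) -> None:
        key = (pid, address)
        if key not in self.table and len(self.table) >= self.entries:
            self.table.popitem(last=False)  # assumed FIFO eviction
        self.table[key] = data

def translate_indirect_target(cam: BranchTargetCAM, full_map: dict,
                              pid: int, x86_target: int) -> int:
    """Return the MIPS target, trying the CAM before the software table."""
    hit = cam.lookup(pid, x86_target)
    if hit is not None:
        return hit                    # fast path
    mips_target = full_map[x86_target]  # slow path: software hash table
    cam.insert(pid, x86_target, mips_target)
    return mips_target

cam = BranchTargetCAM()
mapping = {0x401000: 0x120004000}
print(hex(translate_indirect_target(cam, mapping, pid=1, x86_target=0x401000)))
```

Keying on the PID as well as the address lets entries from different emulated processes coexist without being confused with each other.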
24.
Number of
instructions  Instruction           Comment
0             MOV %RAX %R11
1             JMPQ %*R11
(a)
0             MOVE Rr11 Rrax
1.0           CAMPV Rtmp Rr11       /* Look up the first level indirect jump address */
1.1           CAMPV Rtgt Rtmp       /* Look up the final jump address */
1.2           JR Rtgt
(b)
Figure 5. Example of indirect branch target translation: The original x86 program (a), and the program translated with Godson-3 content-associated memory (CAM) instructions (b). The boldface text indicates new instructions for x86 emulation.
25. Context Switch Optimization
• The binary translator stores translated code through the data cache, so executing it requires flushing it from the data cache and loading it into the instruction cache
• Hardware keeps the data cache, the instruction cache, and L2 coherent
• The binary translator context-switches between the translator and the translated code, which requires saving/restoring the target machine's registers, which are simulated in general-purpose registers
• To reduce the cost, 128-bit load and store instructions were added
• They save/restore up to four x86 registers at a time
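The arithmetic behind the last two bullets: four 32-bit x86 registers fit into one 128-bit transfer, so eight registers need two transfers instead of eight. The Python sketch below only models that packing; the register-file layout and the function names are assumptions for the example.

```python
import struct

def save_context(regs: list[int]) -> list[bytes]:
    """Pack eight 32-bit x86 register values into two 128-bit blocks."""
    if len(regs) != 8:
        raise ValueError("expected eight 32-bit registers")
    return [struct.pack("<4I", *regs[i:i + 4]) for i in range(0, 8, 4)]

def restore_context(blocks: list[bytes]) -> list[int]:
    """Unpack 128-bit blocks back into 32-bit register values."""
    regs: list[int] = []
    for block in blocks:
        regs.extend(struct.unpack("<4I", block))
    return regs

x86_regs = [0x11111111, 0x22222222, 0x33333333, 0x44444444,
            0x55555555, 0x66666666, 0x77777777, 0x88888888]
blocks = save_context(x86_regs)
print(len(blocks), [len(b) * 8 for b in blocks])
```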
26. Benchmark results
[Figure: bar chart of performance (percent, 0 to 100) per benchmark, with two series: "No hardware support" and "Hardware support". The per-benchmark axis labels did not survive extraction.]
27.
               Godson SPEC Ratio               Pentium SPEC Ratio
Benchmark      2E-750 MHz 2F-800 MHz 3A-800 MHz PIII-800 MHz PIV-1.4 GHz
164.gzip       209        251        324        344          397
175.vpr        237        239        391        261          246
176.gcc        282        329        369        241          350
181.mcf        271        232        421        229          255
186.crafty     356        362        415        352          386
197.parser     202        152        225        231          331
252.eon        289        441        526        90.7         125
253.perlbmk    235        321        330        397          547
254.gap        238        243        229        260          441
255.vortex     236        274        297        383          478
256.bzip2      247        241        268        249          314
300.twolf      313        331        486        269          287
SPECint2000    256        275        345        260          326
168.wupwise    307        308        325        248          474
171.swim       247        273        336        218          244
172.mgrid      156        155        184        99.2         320
173.applu      188        268        200        154          333
177.mesa       373        438        400        265          265
178.galgel     -          345        583        -            -
179.art        349        693        1254       115          109
183.equake     250        303        278        190          493
187.facerec    -          111        177        -            -
188.ammp       277        283        364        174          200
189.lucas      -          284        251        -            -
191.fma3d      -          108        128        -            -
200.sixtrack   131        217        184        137          224
301.apsi       172        197        225        190          199
SPECfp2000     232        254        289        171          263
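A quick way to sanity-check the summary rows: a SPEC CPU2000 composite score is the geometric mean of the per-benchmark ratios. The Python snippet below recomputes SPECint2000 for the 3A-800 column from the twelve integer ratios in the table; the variable and function names are just for this example.

```python
import math

# SPECint2000 ratios for the Godson 3A-800 column of the table.
godson_3a_int = {
    "164.gzip": 324, "175.vpr": 391, "176.gcc": 369, "181.mcf": 421,
    "186.crafty": 415, "197.parser": 225, "252.eon": 526, "253.perlbmk": 330,
    "254.gap": 229, "255.vortex": 297, "256.bzip2": 268, "300.twolf": 486,
}

def geometric_mean(values) -> float:
    """Geometric mean computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(round(geometric_mean(godson_3a_int.values()), 1))
```

The result lands within one point of the 345 listed in the SPECint2000 row, so that row is consistent with the individual ratios above it.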
28. Conclusion
• GS464 adds 200+ instructions and a number of optimizations for x86 emulation
• As a result, binary translation runs 2x to 3x faster than the original QEMU
• That’s nearly 70% of the performance of a MIPS native binary
• The CPU's own performance is poor, though
• The paper doesn’t give enough information to judge the actual performance of the emulation on a real chip...
• Anyway, Loongson-3 looks like a good try, and an interesting one!
29. Papers & Slides
• “GODSON-3: A SCALABLE MULTICORE
RISC PROCESSOR WITH X86
EMULATION”
• “Micro-architecture of Godson-3 Multi-Core
Processor”
• “Efficient Binary Translation System with Low
Hardware Cost”
• “Godson-3 Multicore RISC Processor”