SlideShare una empresa de Scribd logo
1 de 24
Descargar para leer sin conexión
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 1
Janus – A Gigaflop RISC + VLIW SoC Tile
Pier S. PAOLUCCI a,b,*, Ben ALTIERI a, Federico AGLIETTI a,
Stefano V. BASILE a, Piergiovanni BAZZANA a, Sergio BRUZZONE a,
Alessandro CATASTA a, Antonio CERRUTO a, Maurizio COSIMI a,
Yves FUSELLA d, Philippe KAJFASZ c, Andrea MICHELOTTI a,
Elena PASTORELLI a, Silvia PIRIA a, Enrico REMONDINI a,
Andrea RICCIARDI a, Fabrizio ROSCIARELLI a
– a IPITEC srl, an ATMEL Company, Via Vito Giuseppe Galati 87, 00155 Roma, Italy
– b INFN Roma, Dip. Fisica Uni. Roma “La Sapienza”, P.le Aldo Moro 5, 00185 Roma, Italy
– c THALES Communications 66, rue du Fossé Blanc – BP 156 - 92231 Gennevilliers Cedex, France
– d ATMEL, Zone Industrielle 13106 Rousset cedex, France
– *Corresponding author: pier.paolucci@roma1.infn.it (Pier Stanislao PAOLUCCI)
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 2
Agenda (1/2)
mAgic™ DSP VLIW Core
– Complex Domain, 40-bit Floating Point VLIW DSP Core, 15 ops/cycle
– Automatic VLIW Scheduling
– Dynamic Program Decompression
– Low clock, high ILP core: 1.0 Gigaflops @ 100 MHz, 180 nm CMOS
– SoC interface
Janus ™ = ARM7 + mAgic VLIW DSP Core
– Audio Beam-forming, Physical Modeling
– Architecture, Floor-plan, Technological results (180 nm CMOS)
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 3
Agenda (2/2)
Dimensional Analysis for Deep Sub-Micron
(DSM) VLIW tile design methodology for high
performance at moderate clock speed
– ILP*frequency vs. Wire Delay balance on present and future designs
– Memory area vs. Operator Area on present and future designs
– Validation of the dimensional analysis using mAgic detailed Gate
Counts and Technological Implementation feedbacks.
– Tiles for Short Wires:
RISC+VLIW Tile for SoC on future DSM designs
Hypothesis for a 90 nm multiple tile design
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 4
mAgic™ DSP Core
1.0 Gigaflops @ 100 MHz, 15 ops/cycle
Complex Domain, 40-bit Floating Point VLIW DSP Core
Seamless VLIW: from linear assembler to VLIW scheduling
DyProDe: Dynamic Program Decompression: 4 PM bit/op
Low clock, high ILP core: easier SoC Design Closure, less
internal pipelines (no need for custom operators), higher
efficiency on applications
Memory mapped slave on the controller’s system bus
Expected dissipation: less than 500mW (typical)
2.4X the performance or 41% the clock at 40 bit vs.
classical 32-bit stand-alone floating point DSP
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 5
mAgic™ VLIW DSP core Architecture
mAgic – ARM I/F
PARM Memory
Left 512x40
PARM Memory
Right 512x40
Data Memory
Left 6Kx40
Data Memory
Right 6Kx40
Buffer Data
Memory Left
2Kx40
Buffer Data
Memory Right
2Kx40
External Memory I/F
Multiple
Address
Generation
Unit
Address
Register
File
Operator
Block
Data
Register File
VLIW Program Memory
Flow control and VLIW Decoder
Instruction
Decoder
Condition
Generation
Status
Register
Program
Counter
On chip 8 k*128 VLIW
Program Memory
(equivalent to 24k using
our patented DyProDe
Code compression
scheme)
2*256 entry multi-port
Register File
On core 8 k*80 Data
Memory
10 arithmetic operation
per cycle, native
complex arithmetic,
single cycle butterfly
Multiple Address
Generation Unit
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 6
Native Complex Domain Arithmetic
Significant DSP applications regard wave-processing in
audio, radio or ultrasound domains complex domain
The area required for each floating point arithmetic operator
is <1/2 mm2 on 180 nm and ~0.1 mm2 on 90 nm CMOS.
mAgic VLIW DSP benchmarks (Native Complex Domain
Support, added operators, added memory bandwidth):
–1024 points FFT:
5962 cycles on mAgic VLIW DSP vs 14400 on C67
–64 output from a 64 taps complex FIR filter:
4663 cycles on mAgic VLIW DSP vs. 8225 on C67
– Single cycle butterfly (40 bit floating point)
– Single cycle complex mulacc (40 bit floating point)
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 7
mAgic™ Operator block
• Operator
Block: 10
Float/Int Op
per Cycle
• Complex
Arithmetic
support
• Vector 2
Arithmetic
• Butterfly
Arithmetic
• Large (512)
Multiport
(8+8)
Register File
Conv2
Div2
Sh/Log2
Conv1
Div1
Sh/Log1
LEFT
0 1 2 3
4 5 6 7
FP/I
*
FP/I
*
RIGHT
0 1 2 3
4 5 6 7
FP/I
*
FP/I
*
FP/I
-
FP/I
+
FP/I
- +
FP/I
+ -
L Memory R Memory
LMemory
RMemory
Mul1 Mul2 Mul4Mul3
Cadd1 Cadd2
Add1 Add2
Min
Max2
Min
Max1
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 8
Automatic VLIW Scheduling
The designer writes in a serial fashion, and the
assembler schedules optimized code that takes
advantage of the DSPs Instruction Level Parallelism,
accounting for data dependencies and latencies
IS SCHEDULED AS:
A=B+C; D=E*F;Q=Memory[I]
L=M+N;
G=A+D; P=Q*R
A SEQUENTIAL CODE LIKE:
A=B+C
D=E*F
G=A+D
L=M+N
Q=Memory[I]
P=Q*R
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 9
Multiple Address Generation Unit
S: 11 bits Start reg,
vector absolute base
address or circular
buffer starting address
L: 11 bits Length reg,
vector length
A: 11 bits Address
reg, offset or abs base
address
M: 7 bits Increment
reg, increment
P: 9 bits Page reg, for
internal memories
pages addressing
Description Address out Modified A
Assembly
suffix
Just Use S+A A -
Modify & Use S+A+M A M
Use Modify & Update S+A A+M U
Modify Use & Update S+A+M A+M MU
Just Use S+A A -
Modify & Use S+A+M A M
Use Modify & Update S+A A+M U
Modify Use & Update S+A+M A+M MU
Just Use S+A A -
Modify & Use S+ (A+M)mod.L A M
Use Modify & Update S+A (A+M)mod.L U
Modify Use & Update S+ (A+M)mod.L (A+M)mod.L MU
Modular:
Offset:
L=0
Linear:
S=L=0
SLAMP fields and MAGU Addressing Modes
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 10
mAgic™ DSP Core interface with SoC Bus and
External Memories
Slave memory
mapped device on
the controller’s
system bus
1.6 Mbit internal
Program and Data
Core Memories are
memory mapped as
well
XM DMA and SoC
System Bus activities
run in parallel with
the core on
dedicated Double
Port Buffers
mAAR
Core Memories
mAgic DSP core
Global
Sequencer
mAgic Internal Resources Mapper (mIrm )
PM
P2P3
XM
Burst Service
Local Sequencer
Register
File
Operators
M
A
G
U
PC
P 4
P1 P0
CSE Reg
mAgic System Mode Interface
ASB slave Wrapper
Registers
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 11
Examples of target Gigaflops Applications
Hands free home phone
– Audio Beam-Forming
Better audio hearing aids and ear prosthesis / Real time
modeling of cochlea
– Real-time Differential Equation Solution
Missile Guidance / Seeker
– Radar Beam-forming / Anti-jamming
SW Ultrasound Scanner / Better Diagnostic Image Quality
– Ultrasound Beam-forming
JANUS: An ARM7TDMI + mAgic VLIW DSP SoC for those
applications.
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 12
Janus: a DUAL CORE VLIW DSP + RISC
mAgic VLIW FiPU DSP:
1.5 GOPS @ 100 MHz
(1.0 GFLOPS)
32 bit ARM7TDMI RISC
Set of SoC peripherals
1.9 Mbit SRAM on Board
352 BGA, 243 functional I/O
1.2 W worst case @ 100 MHz
Arm7TDMI
32K ARM
Mem
ASB / APB Bridge
mAgic VLIW
GigaFlops
DSP core
8Kx128 bit
Program
Mem
Shared
Memory
Data Buffer 2 x 2k word
Double Bank, Double Port
Amba ASB
EBI
Data / Program Bus Mux
Program Bus
Mux / Demux
Data Bus
Mux / Demux
Data Mem
2 x 6k x 40 bit
Double Bank
Double Port
SPI0
USART0
USART1
TIMER
Watchdog
PIO
PDC
ADDA
Clock Gen
IRQ Ctrl
Run Mode data paths
System Mode data paths
ARM exclusive data paths
SPI1
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 13
Janus Floorplan & Technological Measurements
DSP Reg Files:
2 * 8 Ports *
256 Regs * 40 bit
mAgic VLIW floating point
DSP SW Core: 15
op/cycle, 390 Kgate
DSP Progr SRAM:
2 * 1 Port *
8K * 64 bit
ARM7TDMI 32 BIT
RISC CORE
RISC
PERIPHERALS:
135 Kgate
RISC SRAM:
4 * 1 Port *
8K * 8 bit
DSP Data SRAM:
8 * 2 Ports *
2K * 40 bit
Atmel 180 nm, five-level, Aluminum CMOS
Pad excluded, 39 mm2
Pad included, 55 mm2
243 functional IO
352 Ball Grid Array package
1.8 V (Core), 3.3 V (I/O)
<1.2 W (worst case) @ 100MHz
1.5 Gops, 1.0 Gigaflops
55 Kgate/mm2 effective density (10 mm2 for
550Kgate logic required for the software
macros: VLIW DSP + Arm peripherals +
testability stuff)
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 14
HW Tools: JANUS Test & Evaluation Board
CLK
DIV
3.3V
RST
PLL
PIO
PIO
ARM DATA H
SRAM
128Kx8
ARM DATA L
SRAM
128Kx8
FLASH
ARM PRG
512Kx16
SSRAM
MAGIC
DATA L
128Kx36
EXTCLK
MTM DIP SWITCH
SPI-0
M-ICE
USART 0
RS232
Janus
PIO USARTs
RST
XMA
XMD[15:0]
CLKs
CNTRLs
SPIsADDA
ARMD
PLL
ICE
ARMC
ARMA
XMD[55:40]
XMD[31:16]
XMD[71:56]
XMD[39:32]
XMD[79:72]
SPI-1
SSRAM
MAGIC
DATA H
128Kx36
SSRAM
MAGIC
DATA E
128Kx36
USART 1
USB
CNTRL
USB
D-9 RS232D-9 RS232
RS232RS232
3.3-1.8
EXT
PSU
uPR
CODEC
LED
BUFF
CODEC
(opt)
CODEC
(opt)
CODEC
(opt)
LINE
IN
LINE
OUT
LINE
IN
(opt)
LINE
OUT
(opt)
LINE
IN
(opt)
LINE
OUT
(opt)
LINE
IN
(opt)
LINE
OUT
(opt)
CLK
DIV
25
MHz
MTM Works @ 100MHz and
Provides the following resources:
– Memories for mAgic and ARM
– Stereo Audio CODECs (up to 4)
– Serial I/O:
1 USB 2.0 Full Speed (12 Mbps)
2 RS232/LVTTL a/sync serial lines
2 SPI serial I/O lines
1 ÷ 4 audio codecs
– IO connectors (USART, SPI, PIO, AUDIO)
– Configuration DIP SWITCH
– Status 7-segment Display
– JTAG ARM M-ICE connector
– Size: 5 x 5 inch2 (12.7x12.7 cm2)
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 15
Development Effort
Our SIMD/VLIW mAgic DSP architecture is largely based on the
know-how acquired by some of the authors thanks to their
participation to three generations of the Massively Parallel
Processing experiment APE, conducted since 1983 by INFN (Istituto
Nazionale di Fisica Nucleare). The design and development of
custom VLIW processors, hardware and system software for those
machines accounted for more than 300 Person Years.
From that background, the development and validation of mAgic
VLIW DSP and the essential system software required
approximately 65 Person Years.
The integration and validation at Janus level with the ARM core and
the set of pre-validated ARM peripherals, plus physical design and
validation board activities should required approximately 10 Person
Years.
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 16
A Differentiation Era beyond the Single
Processor Barrier
Log(SystemComplexity)
Time
Massive Parallel Computers
SuperScalar
32 bit Float Multiplier
Resources available on a single chip
1980
Single Processor
Barrier
1990 2000 2010
Risc
Forced
Convergence
Era
Multi Core Differentiation
Era
Power density, overhead logic and interconnect delays for monolithic high clock
speed processor design are approaching embarrassing figures according to
ITRS. The human brain has got a processing power >> 106 times than DSP
processors of 2003, yet runs at <<100 Hz, ~10W. A precious Hint for adoption of
moderate clock speed, better memory architectures and parallelism
management to exploit higher silicon densities provided by DSM technologies?
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 17
Dimensional Analysis for Deep Sub-Micron
Processor design with ASIC methodology
A possible concern regarding synthesizable cores for next decade
SoCs:
maximum frequency dominated by wire delays, not by gate
propagation
our simple dimensional analysis suggests that:
– adequate ILP + multiple tiles, lower clock speed, lower pipelining,
shorter wire lengths simpler design closure
– floating point relatively un-expensive Balancing Memory area vs.
Operating area
– DSP treatment of wave phenomena useful ILP >= 15 native
complex domain support
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 18
Multiple Tile SoC: Tiles for Short Wires
External Memories Interface
RISCMem
DSP Prog
Mem
DSPData
mem
Reg File
VLIW
DSP
RISC
Y+ Communication Interface
Y- Communication Interface
X-Communication
X+Communication
RISCMem
DSP Prog
Mem
DSPData
mem
Reg File
VLIW
DSP
RISC
Y+ Communication Interface
Y- Communication Interface
X-Communication
X+Communication
Y- Communication Interface Y- Communication Interface
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 19
ILP * frequency vs. Wire Delay
r and c := technological resistance and capacitance per wire unit length (e.g.
180 nm rc = 58 ps/mm2, 90 nm rc = 160ps/mm2)
L := processor’s tile size; 2L:=worst Manhattan path length
g:= gate density (e.g. 180 nm g=55Kgate/mm2; 90 nm g=200Kgate/mm2)
o:= arithmetic operator cost (e.g. 7Kgate 24 bit floating point, 25Kgate 40 bit)
)4(
rco
g
66.ILP*wfwG
)3(2L
o
g
ILP
)2(
rc38.
1pt
2TLL2
)1(
2)L2(rc38.
1
RC38.
1
wf0f
≤=
≤
==
≤=≤ logic delay f0, wire delay fw. f0<< fw
indicates designs far away from
interconnect troubles
LT repeater insertion critical length ~
4.4mm ~ 419 ps on 180 nm
L2 areas support VLIW ILP determined by
gate density over operator cost
peak power for a VLIW area below critical
area (NOTE the interesting
independency on size and ILP)
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 20
Dimentional Hints
Equation (4) Gw =fw * ILP <= .66 g/rco, provides suggestions:
At one extreme, it states the obvious. You should divide a die
too large into N smaller tiles.
At intermediate scale, it suggests to insert at least the
number of operators that can be reached by non repeated
wires. In fact a lower frequency reduces the number of local
pipelines, permits the adoption of a classical ASIC RTL
methodology and simplifies the adoption of lower Vdd.
A lower bound for the ILP is imposed by:
– adequate program and data memory inside each tile
– maintain a higher granularity to avoid classical massive parallel processing
inefficiencies
– balance memory vs processing resources
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 21
mAgic VLIW Arith Oper Gate Count
#Units Combina
torial
Non
Comb.
Total Grand
Total
4 40 bit Float Adder 10900 8400 19300 77200
4 40 bit Float and 32 bit Int
Mult
24100 2700 26800 107200
4 32 bit Int Adder 900 1600 2500 10000
2 Float Div & Sqrt Seed 3300 700 4000 8000
2 40 bit Shift & Logic Unit 6600 800 7400 14800
2 Float<->Int Converter 4500 400 4900 9800
1 Decoder 1200 1700 2900 2900
Total VLIW Arithmetic
Operators
251300
SUPPORT FOR NATIVE COMPLEX DOMAIN 40-bit FLOATING
POINT ARITHMETIC <= 1.5 mm2 on 90 nm CMOS
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 22
mAgic VLIW Floating Point DSP Gate Count
Combinato
rial
Non
Comb.
Total
VLIW Arithmetic & Logic Operators 186800 64500 251300
VLIW Multiple Address Generation Unit 18900 13100 32000
VLIW Flow Controller 8300 18000 26300
VLIW DMA + Global Status Controller 4600 11300 15900
VLIW Program Decompression Engine 7600 10500 18100
VLIW Local Memory Mux Logic & Bist 18700 13800 32500
VLIW<->RISC Memory Mapping Interface 1300 6800 8100
VLIW DSP ENGINE TOTAL GATE COUNT 250400 141400 391800
payload for mAgic VLIW core >= 70% grows to 90% considering
processor + memory tiles
equations (3) and (4) are reasonable approximations
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 23
Dimensional Analysis results
On 180 nm Atmel technology, the gate delay for a 25 Kgate
40-bit floating point operator with two pipeline stages is 8 ns.
Therefore f0<< fw for the full VLIW core with ILP = 15.
Complete RISC+VLIW logic area ~ 10 mm2, 550 Kgates
On 90 nm copper CMOS, f0 / fw ~ 1 / 6for the 250 Kgate
complex domain data path with 10 floating point operators. A
pipeline stage will be required on core<->memory busses.
8 Gigaflops, 4 Janus tiles SOC feasible with ASIC
methodology on 90 nm technology with classical RTL
synchronous design. Insertion of repeaters and promotion of
global wires on thicker layers helps, but is not yet mandatory.
Hot Chips 15
August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 24
Summary
mAgic VLIW DSP: a synthesizable complex domain floating point
core: 1.0 Gigaflops @ 100 MHz, 180 nm CMOS
Janus: a 180 nm RISC+floating point VLIW platform for Gigaflops
SoC applications
A Dimensional Analysis based on the technological parameter g/rco
provides an interconnect bound to the processing power of simple
synthesizable RTL designs on DSM techologies with ASIC
methodology
We will follow a simple roadmap for SoC integration of multiple
gigaflops on 90 nm using:
– Tiles for Short Wires
– Appropriate VLIW parallelism
– Moderate Clock Speed
– This methodology keeps the cost down to a few tens of designers and M$.

Más contenido relacionado

La actualidad más candente

Altera Cyclone IV FPGA Customer Presentation
Altera Cyclone IV FPGA Customer PresentationAltera Cyclone IV FPGA Customer Presentation
Altera Cyclone IV FPGA Customer PresentationAltera Corporation
 
智慧城市通用交通資訊端點系統
智慧城市通用交通資訊端點系統智慧城市通用交通資訊端點系統
智慧城市通用交通資訊端點系統艾鍗科技
 
Apple A10 Series Application Processor
Apple A10 Series Application ProcessorApple A10 Series Application Processor
Apple A10 Series Application ProcessorJJ Wu
 
Synopsys User Group Presentation
Synopsys User Group PresentationSynopsys User Group Presentation
Synopsys User Group Presentationemlawgr
 
Webinar: Nova família de microcontroladores STM32WL – Sub Giga Multiprotocolo
Webinar: Nova família de microcontroladores STM32WL – Sub Giga MultiprotocoloWebinar: Nova família de microcontroladores STM32WL – Sub Giga Multiprotocolo
Webinar: Nova família de microcontroladores STM32WL – Sub Giga MultiprotocoloEmbarcados
 
Sil dgcis themis_n_specifications_v1.0_beta
Sil dgcis themis_n_specifications_v1.0_betaSil dgcis themis_n_specifications_v1.0_beta
Sil dgcis themis_n_specifications_v1.0_betabonnaudfrederic
 
student_pres120202final
student_pres120202finalstudent_pres120202final
student_pres120202finalJohn Marquis
 
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Nansen Chen
 
Sample inventory report
Sample inventory reportSample inventory report
Sample inventory reportKamran Arshad
 
FOSOH-V (TM) preliminary schematics
FOSOH-V (TM) preliminary schematicsFOSOH-V (TM) preliminary schematics
FOSOH-V (TM) preliminary schematicsAli Uzel
 
Webinar: Criando Soluções LoRaWAN Otimizadas com Silicon Labs
Webinar: Criando Soluções LoRaWAN Otimizadas com Silicon LabsWebinar: Criando Soluções LoRaWAN Otimizadas com Silicon Labs
Webinar: Criando Soluções LoRaWAN Otimizadas com Silicon LabsEmbarcados
 
Nanos and Nanos2 modules catalog september 2014
Nanos and Nanos2 modules catalog september 2014Nanos and Nanos2 modules catalog september 2014
Nanos and Nanos2 modules catalog september 2014ETEP
 
Slides for RFID-MIMO Prototype based on GnuRadio
Slides for RFID-MIMO Prototype based on GnuRadioSlides for RFID-MIMO Prototype based on GnuRadio
Slides for RFID-MIMO Prototype based on GnuRadioAmelia Jiménez Sánchez
 

La actualidad más candente (20)

Msp430g2453
Msp430g2453Msp430g2453
Msp430g2453
 
Chap 02[1]
Chap 02[1]Chap 02[1]
Chap 02[1]
 
Altera Cyclone IV FPGA Customer Presentation
Altera Cyclone IV FPGA Customer PresentationAltera Cyclone IV FPGA Customer Presentation
Altera Cyclone IV FPGA Customer Presentation
 
智慧城市通用交通資訊端點系統
智慧城市通用交通資訊端點系統智慧城市通用交通資訊端點系統
智慧城市通用交通資訊端點系統
 
Apple A10 Series Application Processor
Apple A10 Series Application ProcessorApple A10 Series Application Processor
Apple A10 Series Application Processor
 
Synopsys User Group Presentation
Synopsys User Group PresentationSynopsys User Group Presentation
Synopsys User Group Presentation
 
Webinar: Nova família de microcontroladores STM32WL – Sub Giga Multiprotocolo
Webinar: Nova família de microcontroladores STM32WL – Sub Giga MultiprotocoloWebinar: Nova família de microcontroladores STM32WL – Sub Giga Multiprotocolo
Webinar: Nova família de microcontroladores STM32WL – Sub Giga Multiprotocolo
 
Sil dgcis themis_n_specifications_v1.0_beta
Sil dgcis themis_n_specifications_v1.0_betaSil dgcis themis_n_specifications_v1.0_beta
Sil dgcis themis_n_specifications_v1.0_beta
 
MG3130 gesture recognition kit
MG3130 gesture recognition kitMG3130 gesture recognition kit
MG3130 gesture recognition kit
 
Pc 104 series 1 application showcase
Pc 104 series 1 application showcasePc 104 series 1 application showcase
Pc 104 series 1 application showcase
 
student_pres120202final
student_pres120202finalstudent_pres120202final
student_pres120202final
 
Assignmentdsp
AssignmentdspAssignmentdsp
Assignmentdsp
 
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
Run Simulations and Then Become An Inventor (Best Paper Award in CDNLive Taiw...
 
Sample inventory report
Sample inventory reportSample inventory report
Sample inventory report
 
FOSOH-V (TM) preliminary schematics
FOSOH-V (TM) preliminary schematicsFOSOH-V (TM) preliminary schematics
FOSOH-V (TM) preliminary schematics
 
Webinar: Criando Soluções LoRaWAN Otimizadas com Silicon Labs
Webinar: Criando Soluções LoRaWAN Otimizadas com Silicon LabsWebinar: Criando Soluções LoRaWAN Otimizadas com Silicon Labs
Webinar: Criando Soluções LoRaWAN Otimizadas com Silicon Labs
 
Zynq ultrascale
Zynq ultrascaleZynq ultrascale
Zynq ultrascale
 
Nanos and Nanos2 modules catalog september 2014
Nanos and Nanos2 modules catalog september 2014Nanos and Nanos2 modules catalog september 2014
Nanos and Nanos2 modules catalog september 2014
 
V30 Brochure(EN)-s
V30 Brochure(EN)-sV30 Brochure(EN)-s
V30 Brochure(EN)-s
 
Slides for RFID-MIMO Prototype based on GnuRadio
Slides for RFID-MIMO Prototype based on GnuRadioSlides for RFID-MIMO Prototype based on GnuRadio
Slides for RFID-MIMO Prototype based on GnuRadio
 

Destacado

Propuesta comercial feria de cali 2016
Propuesta comercial feria de cali 2016 Propuesta comercial feria de cali 2016
Propuesta comercial feria de cali 2016 J&M Comunicaciones
 
Ankit pareek portfolio
Ankit pareek portfolioAnkit pareek portfolio
Ankit pareek portfolioAnkit Pareek
 
Cgt appel greve sncf 23 juin 2016
Cgt appel greve sncf 23 juin 2016Cgt appel greve sncf 23 juin 2016
Cgt appel greve sncf 23 juin 2016Quoimaligne Idf
 
Resultados 4º cross country
Resultados 4º cross countryResultados 4º cross country
Resultados 4º cross countryCiudaddeldeporte
 
Mobutu received plane from president Kennedy
Mobutu received plane from president KennedyMobutu received plane from president Kennedy
Mobutu received plane from president KennedyThierry Debels
 
Propuesta Comercial Carnaval de Barranquilla 2017
Propuesta Comercial Carnaval de Barranquilla 2017Propuesta Comercial Carnaval de Barranquilla 2017
Propuesta Comercial Carnaval de Barranquilla 2017J&M Comunicaciones
 
бсмп зож это...
бсмп зож   это...бсмп зож   это...
бсмп зож это...sk1ll
 
Importancia de la cultura en el comercio internacional
Importancia de la cultura en el comercio internacionalImportancia de la cultura en el comercio internacional
Importancia de la cultura en el comercio internacionalCarlos Alberto Gonzalez Ardila
 
Pediatric Preparations
Pediatric Preparations Pediatric Preparations
Pediatric Preparations Kdurant36
 
Educating Pharmacy Technicians for Success: Understanding Current Issues in ...
Educating Pharmacy Technicians for Success:  Understanding Current Issues in ...Educating Pharmacy Technicians for Success:  Understanding Current Issues in ...
Educating Pharmacy Technicians for Success: Understanding Current Issues in ...FLAVORx
 
Morphologic changes in viral infections
Morphologic changes in viral infectionsMorphologic changes in viral infections
Morphologic changes in viral infectionsShridhan
 
South Korea--Macro Environmental Analysis
South Korea--Macro Environmental AnalysisSouth Korea--Macro Environmental Analysis
South Korea--Macro Environmental Analysistomasptacek
 
Mahoney Pipeline Status Siem Reap V2a
Mahoney Pipeline Status Siem Reap V2aMahoney Pipeline Status Siem Reap V2a
Mahoney Pipeline Status Siem Reap V2agambelguy
 

Destacado (20)

Propuesta comercial feria de cali 2016
Propuesta comercial feria de cali 2016 Propuesta comercial feria de cali 2016
Propuesta comercial feria de cali 2016
 
Ankit pareek portfolio
Ankit pareek portfolioAnkit pareek portfolio
Ankit pareek portfolio
 
تقرير
تقريرتقرير
تقرير
 
Cgt appel greve sncf 23 juin 2016
Cgt appel greve sncf 23 juin 2016Cgt appel greve sncf 23 juin 2016
Cgt appel greve sncf 23 juin 2016
 
Jin nations extraordinary history part-3
Jin nations extraordinary history part-3Jin nations extraordinary history part-3
Jin nations extraordinary history part-3
 
Best practices
Best practicesBest practices
Best practices
 
Resultados 4º cross country
Resultados 4º cross countryResultados 4º cross country
Resultados 4º cross country
 
Mobutu received plane from president Kennedy
Mobutu received plane from president KennedyMobutu received plane from president Kennedy
Mobutu received plane from president Kennedy
 
Propuesta Comercial Carnaval de Barranquilla 2017
Propuesta Comercial Carnaval de Barranquilla 2017Propuesta Comercial Carnaval de Barranquilla 2017
Propuesta Comercial Carnaval de Barranquilla 2017
 
Trabjo numero 1 (1)
Trabjo numero 1 (1)Trabjo numero 1 (1)
Trabjo numero 1 (1)
 
Cuarto jornada mañana
Cuarto jornada mañanaCuarto jornada mañana
Cuarto jornada mañana
 
бсмп зож это...
бсмп зож   это...бсмп зож   это...
бсмп зож это...
 
Importancia de la cultura en el comercio internacional
Importancia de la cultura en el comercio internacionalImportancia de la cultura en el comercio internacional
Importancia de la cultura en el comercio internacional
 
Kari u3 ea_rugj
Kari u3 ea_rugjKari u3 ea_rugj
Kari u3 ea_rugj
 
Guión cine ficción
Guión cine ficciónGuión cine ficción
Guión cine ficción
 
Pediatric Preparations
Pediatric Preparations Pediatric Preparations
Pediatric Preparations
 
Educating Pharmacy Technicians for Success: Understanding Current Issues in ...
Educating Pharmacy Technicians for Success:  Understanding Current Issues in ...Educating Pharmacy Technicians for Success:  Understanding Current Issues in ...
Educating Pharmacy Technicians for Success: Understanding Current Issues in ...
 
Morphologic changes in viral infections
Morphologic changes in viral infectionsMorphologic changes in viral infections
Morphologic changes in viral infections
 
South Korea--Macro Environmental Analysis
South Korea--Macro Environmental AnalysisSouth Korea--Macro Environmental Analysis
South Korea--Macro Environmental Analysis
 
Mahoney Pipeline Status Siem Reap V2a
Mahoney Pipeline Status Siem Reap V2aMahoney Pipeline Status Siem Reap V2a
Mahoney Pipeline Status Siem Reap V2a
 

Similar a 9.atmel

Cyclone II FPGA Overview
Cyclone II FPGA OverviewCyclone II FPGA Overview
Cyclone II FPGA OverviewPremier Farnell
 
TULIPP overview
TULIPP overviewTULIPP overview
TULIPP overviewTulipp. Eu
 
Semiconductor overview
Semiconductor overviewSemiconductor overview
Semiconductor overviewNabil Chouba
 
MYC-C7Z015 CPU Module
MYC-C7Z015 CPU ModuleMYC-C7Z015 CPU Module
MYC-C7Z015 CPU ModuleLinda Zhang
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoEmbarcados
 
MIPI DevCon 2016: Image Sensor and Display Connectivity Disruption
MIPI DevCon 2016: Image Sensor and Display Connectivity DisruptionMIPI DevCon 2016: Image Sensor and Display Connectivity Disruption
MIPI DevCon 2016: Image Sensor and Display Connectivity DisruptionMIPI Alliance
 
BFSK RT In FPGA Thesis Pres Jps
BFSK RT In FPGA Thesis Pres JpsBFSK RT In FPGA Thesis Pres Jps
BFSK RT In FPGA Thesis Pres Jpsjpsvenn
 
Technology overview
Technology overviewTechnology overview
Technology overviewvirtuehm
 
20170626 imx6 quad_vs_apq8016
20170626 imx6 quad_vs_apq801620170626 imx6 quad_vs_apq8016
20170626 imx6 quad_vs_apq8016Advantech
 
IoT support for .NET Core - IoT Saturday 2020
IoT support for .NET Core - IoT Saturday 2020IoT support for .NET Core - IoT Saturday 2020
IoT support for .NET Core - IoT Saturday 2020Mirco Vanini
 
FPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusionFPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusionPersiPersi1
 
IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)Mirco Vanini
 
Esp32 datasheet
Esp32 datasheetEsp32 datasheet
Esp32 datasheetMoises .
 

Similar a 9.atmel (20)

Cyclone II FPGA Overview
Cyclone II FPGA OverviewCyclone II FPGA Overview
Cyclone II FPGA Overview
 
TULIPP overview
TULIPP overviewTULIPP overview
TULIPP overview
 
Semiconductor overview
Semiconductor overviewSemiconductor overview
Semiconductor overview
 
MYC-C7Z015 CPU Module
MYC-C7Z015 CPU ModuleMYC-C7Z015 CPU Module
MYC-C7Z015 CPU Module
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
 
Msp430
Msp430Msp430
Msp430
 
MIPI DevCon 2016: Image Sensor and Display Connectivity Disruption
MIPI DevCon 2016: Image Sensor and Display Connectivity DisruptionMIPI DevCon 2016: Image Sensor and Display Connectivity Disruption
MIPI DevCon 2016: Image Sensor and Display Connectivity Disruption
 
BFSK RT In FPGA Thesis Pres Jps
BFSK RT In FPGA Thesis Pres JpsBFSK RT In FPGA Thesis Pres Jps
BFSK RT In FPGA Thesis Pres Jps
 
Technology overview
Technology overviewTechnology overview
Technology overview
 
FPGA / SOC teknologi - i dag og i fremtiden
FPGA / SOC teknologi - i dag og i fremtidenFPGA / SOC teknologi - i dag og i fremtiden
FPGA / SOC teknologi - i dag og i fremtiden
 
20170626 imx6 quad_vs_apq8016
20170626 imx6 quad_vs_apq801620170626 imx6 quad_vs_apq8016
20170626 imx6 quad_vs_apq8016
 
uElectronics ongoing activities at ESA
uElectronics ongoing activities at ESAuElectronics ongoing activities at ESA
uElectronics ongoing activities at ESA
 
IoT support for .NET Core - IoT Saturday 2020
IoT support for .NET Core - IoT Saturday 2020IoT support for .NET Core - IoT Saturday 2020
IoT support for .NET Core - IoT Saturday 2020
 
Practica 2
Practica 2Practica 2
Practica 2
 
FPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusionFPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusion
 
IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)IoT support for .NET (Core/5/6)
IoT support for .NET (Core/5/6)
 
Asus Tinker Board
Asus Tinker BoardAsus Tinker Board
Asus Tinker Board
 
Evolution of processors
Evolution of processorsEvolution of processors
Evolution of processors
 
Esp32 datasheet
Esp32 datasheetEsp32 datasheet
Esp32 datasheet
 
Cyclone IV FPGA Device
Cyclone IV FPGA DeviceCyclone IV FPGA Device
Cyclone IV FPGA Device
 

9.atmel

  • 1. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 1 Janus – A Gigaflop RISC + VLIW SoC Tile Pier S. PAOLUCCI a,b,*, Ben ALTIERI a, Federico AGLIETTI a, Stefano V. BASILE a, Piergiovanni BAZZANA a, Sergio BRUZZONE a, Alessandro CATASTA a, Antonio CERRUTO a, Maurizio COSIMI a, Yves FUSELLA d, Philippe KAJFASZ c, Andrea MICHELOTTI a, Elena PASTORELLI a, Silvia PIRIA a, Enrico REMONDINI a, Andrea RICCIARDI a, Fabrizio ROSCIARELLI a – a IPITEC srl, an ATMEL Company, Via Vito Giuseppe Galati 87, 00155 Roma, Italy – b INFN Roma, Dip. Fisica Uni. Roma “La Sapienza”, P.le Aldo Moro 5, 00185 Roma, Italy – c THALES Communications 66, rue du Fossé Blanc – BP 156 - 92231 Gennevilliers Cedex, France – d ATMEL, Zone Industrielle 13106 Rousset cedex, France – *Corresponding author: pier.paolucci@roma1.infn.it (Pier Stanislao PAOLUCCI)
  • 2. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 2 Agenda (1/2) mAgic™ DSP VLIW Core – Complex Domain, 40-bit Floating Point VLIW DSP Core, 15 ops/cycle – Automatic VLIW Scheduling – Dynamic Program Decompression – Low clock, high ILP core: 1.0 Gigaflops @ 100 MHz, 180 nm CMOS – SoC interface Janus ™ = ARM7 + mAgic VLIW DSP Core – Audio Beam-forming, Physical Modeling – Architecture, Floor-plan, Technological results (180 nm CMOS)
  • 3. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 3 Agenda (2/2) Dimensional Analysis for Deep Sub-Micron (DSM) VLIW tile design methodology for high performance at moderate clock speed – ILP*frequency vs. Wire Delay balance on present and future designs – Memory area vs. Operator Area on present and future designs – Validation of the dimensional analysis using mAgic detailed Gate Counts and Technological Implementation feedbacks. – Tiles for Short Wires: RISC+VLIW Tile for SoC on future DSM designs Hypothesis for a 90 nm multiple tile design
  • 4. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 4 mAgic™ DSP Core 1.0 Gigaflops @ 100 MHz, 15 ops/cycle Complex Domain, 40-bit Floating Point VLIW DSP Core Seamless VLIW: from linear assembler to VLIW scheduling DyProDe: Dynamic Program Decompression: 4 PM bit/op Low clock, high ILP core: easier SoC Design Closure, less internal pipelines (no need for custom operators), higher efficiency on applications Memory mapped slave on the controller’s system bus Expected dissipation: less than 500mW (typical) 2.4X the performance or 41% the clock at 40 bit vs. classical 32-bit stand-alone floating point DSP
  • 5. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 5 mAgic™ VLIW DSP core Architecture mAgic – ARM I/F PARM Memory Left 512x40 PARM Memory Right 512x40 Data Memory Left 6Kx40 Data Memory Right 6Kx40 Buffer Data Memory Left 2Kx40 Buffer Data Memory Right 2Kx40 External Memory I/F Multiple Address Generation Unit Address Register File Operator Block Data Register File VLIW Program Memory Flow control and VLIW Decoder Instruction Decoder Condition Generation Status Register Program Counter On chip 8 k*128 VLIW Program Memory (equivalent to 24k using our patented DyProDe Code compression scheme) 2*256 entry multi-port Register File On core 8 k*80 Data Memory 10 arithmetic operation per cycle, native complex arithmetic, single cycle butterfly Multiple Address Generation Unit
  • 6. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 6 Native Complex Domain Arithmetic Significant DSP applications regard wave-processing in audio, radio or ultrasound domains complex domain The area required for each floating point arithmetic operator is <1/2 mm2 on 180 nm and ~0.1 mm2 on 90 nm CMOS. mAgic VLIW DSP benchmarks (Native Complex Domain Support, added operators, added memory bandwidth): –1024 points FFT: 5962 cycles on mAgic VLIW DSP vs 14400 on C67 –64 output from a 64 taps complex FIR filter: 4663 cycles on mAgic VLIW DSP vs. 8225 on C67 – Single cycle butterfly (40 bit floating point) – Single cycle complex mulacc (40 bit floating point)
  • 7. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 7 mAgic™ Operator block • Operator Block: 10 Float/Int Op per Cycle • Complex Arithmetic support • Vector 2 Arithmetic • Butterfly Arithmetic • Large (512) Multiport (8+8) Register File Conv2 Div2 Sh/Log2 Conv1 Div1 Sh/Log1 LEFT 0 1 2 3 4 5 6 7 FP/I * FP/I * RIGHT 0 1 2 3 4 5 6 7 FP/I * FP/I * FP/I - FP/I + FP/I - + FP/I + - L Memory R Memory LMemory RMemory Mul1 Mul2 Mul4Mul3 Cadd1 Cadd2 Add1 Add2 Min Max2 Min Max1
  • 8. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 8 Automatic VLIW Scheduling The designer writes in a serial fashion, and the assembler schedules optimized code that takes advantage of the DSPs Instruction Level Parallelism, accounting for data dependencies and latencies IS SCHEDULED AS: A=B+C; D=E*F;Q=Memory[I] L=M+N; G=A+D; P=Q*R A SEQUENTIAL CODE LIKE: A=B+C D=E*F G=A+D L=M+N Q=Memory[I] P=Q*R
  • 9. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 9 Multiple Address Generation Unit S: 11 bits Start reg, vector absolute base address or circular buffer starting address L: 11 bits Length reg, vector length A: 11 bits Address reg, offset or abs base address M: 7 bits Increment reg, increment P: 9 bits Page reg, for internal memories pages addressing Description Address out Modified A Assembly suffix Just Use S+A A - Modify & Use S+A+M A M Use Modify & Update S+A A+M U Modify Use & Update S+A+M A+M MU Just Use S+A A - Modify & Use S+A+M A M Use Modify & Update S+A A+M U Modify Use & Update S+A+M A+M MU Just Use S+A A - Modify & Use S+ (A+M)mod.L A M Use Modify & Update S+A (A+M)mod.L U Modify Use & Update S+ (A+M)mod.L (A+M)mod.L MU Modular: Offset: L=0 Linear: S=L=0 SLAMP fields and MAGU Addressing Modes
  • 10. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 10 mAgic™ DSP Core interface with SoC Bus and External Memories Slave memory mapped device on the controller’s system bus 1.6 Mbit internal Program and Data Core Memories are memory mapped as well XM DMA and SoC System Bus activities run in parallel with the core on dedicated Double Port Buffers mAAR Core Memories mAgic DSP core Global Sequencer mAgic Internal Resources Mapper (mIrm ) PM P2P3 XM Burst Service Local Sequencer Register File Operators M A G U PC P 4 P1 P0 CSE Reg mAgic System Mode Interface ASB slave Wrapper Registers
  • 11. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 11 Examples of target Gigaflops Applications Hands free home phone – Audio Beam-Forming Better audio hearing aids and ear prosthesis / Real time modeling of cochlea – Real-time Differential Equation Solution Missile Guidance / Seeker – Radar Beam-forming / Anti-jamming SW Ultrasound Scanner / Better Diagnostic Image Quality – Ultrasound Beam-forming JANUS: An ARM7TDMI + mAgic VLIW DSP SoC for those applications.
  • 12. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 12 Janus: a DUAL CORE VLIW DSP + RISC mAgic VLIW FiPU DSP: 1.5 GOPS @ 100 MHz (1.0 GFLOPS) 32 bit ARM7TDMI RISC Set of SoC peripherals 1.9 Mbit SRAM on Board 352 BGA, 243 functional I/O 1.2 W worst case @ 100 MHz Arm7TDMI 32K ARM Mem ASB / APB Bridge mAgic VLIW GigaFlops DSP core 8Kx128 bit Program Mem Shared Memory Data Buffer 2 x 2k word Double Bank, Double Port Amba ASB EBI Data / Program Bus Mux Program Bus Mux / Demux Data Bus Mux / Demux Data Mem 2 x 6k x 40 bit Double Bank Double Port SPI0 USART0 USART1 TIMER Watchdog PIO PDC ADDA Clock Gen IRQ Ctrl Run Mode data paths System Mode data paths ARM exclusive data paths SPI1
  • 13. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 13 Janus Floorplan & Technological Measurements DSP Reg Files: 2 * 8 Ports * 256 Regs * 40 bit mAgic VLIW floating point DSP SW Core: 15 op/cycle, 390 Kgate DSP Progr SRAM: 2 * 1 Port * 8K * 64 bit ARM7TDMI 32 BIT RISC CORE RISC PERIPHERALS: 135 Kgate RISC SRAM: 4 * 1 Port * 8K * 8 bit DSP Data SRAM: 8 * 2 Ports * 2K * 40 bit Atmel 180 nm, five-level, Aluminum CMOS Pad excluded, 39 mm2 Pad included, 55 mm2 243 functional IO 352 Ball Grid Array package 1.8 V (Core), 3.3 V (I/O) <1.2 W (worst case) @ 100MHz 1.5 Gops, 1.0 Gigaflops 55 Kgate/mm2 effective density (10 mm2 for 550Kgate logic required for the software macros: VLIW DSP + Arm peripherals + testability stuff)
  • 14. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 14 HW Tools: JANUS Test & Evaluation Board CLK DIV 3.3V RST PLL PIO PIO ARM DATA H SRAM 128Kx8 ARM DATA L SRAM 128Kx8 FLASH ARM PRG 512Kx16 SSRAM MAGIC DATA L 128Kx36 EXTCLK MTM DIP SWITCH SPI-0 M-ICE USART 0 RS232 Janus PIO USARTs RST XMA XMD[15:0] CLKs CNTRLs SPIsADDA ARMD PLL ICE ARMC ARMA XMD[55:40] XMD[31:16] XMD[71:56] XMD[39:32] XMD[79:72] SPI-1 SSRAM MAGIC DATA H 128Kx36 SSRAM MAGIC DATA E 128Kx36 USART 1 USB CNTRL USB D-9 RS232D-9 RS232 RS232RS232 3.3-1.8 EXT PSU uPR CODEC LED BUFF CODEC (opt) CODEC (opt) CODEC (opt) LINE IN LINE OUT LINE IN (opt) LINE OUT (opt) LINE IN (opt) LINE OUT (opt) LINE IN (opt) LINE OUT (opt) CLK DIV 25 MHz MTM Works @ 100MHz and Provides the following resources: – Memories for mAgic and ARM – Stereo Audio CODECs (up to 4) – Serial I/O: 1 USB 2.0 Full Speed (12 Mbps) 2 RS232/LVTTL a/sync serial lines 2 SPI serial I/O lines 1 ÷ 4 audio codecs – IO connectors (USART, SPI, PIO, AUDIO) – Configuration DIP SWITCH – Status 7-segment Display – JTAG ARM M-ICE connector – Size: 5 x 5 inch2 (12.7x12.7 cm2)
  • 15. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 15 Development Effort Our SIMD/VLIW mAgic DSP architecture is largely based on the know-how acquired by some of the authors thanks to their participation to three generations of the Massively Parallel Processing experiment APE, conducted since 1983 by INFN (Istituto Nazionale di Fisica Nucleare). The design and development of custom VLIW processors, hardware and system software for those machines accounted for more than 300 Person Years. From that background, the development and validation of mAgic VLIW DSP and the essential system software required approximately 65 Person Years. The integration and validation at Janus level with the ARM core and the set of pre-validated ARM peripherals, plus physical design and validation board activities should required approximately 10 Person Years.
  • 16. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 16 A Differentiation Era beyond the Single Processor Barrier Log(SystemComplexity) Time Massive Parallel Computers SuperScalar 32 bit Float Multiplier Resources available on a single chip 1980 Single Processor Barrier 1990 2000 2010 Risc Forced Convergence Era Multi Core Differentiation Era Power density, overhead logic and interconnect delays for monolithic high clock speed processor design are approaching embarrassing figures according to ITRS. The human brain has got a processing power >> 106 times than DSP processors of 2003, yet runs at <<100 Hz, ~10W. A precious Hint for adoption of moderate clock speed, better memory architectures and parallelism management to exploit higher silicon densities provided by DSM technologies?
  • 17. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 17 Dimensional Analysis for Deep Sub-Micron Processor design with ASIC methodology A possible concern regarding synthesizable cores for next decade SoCs: maximum frequency dominated by wire delays, not by gate propagation our simple dimensional analysis suggests that: – adequate ILP + multiple tiles, lower clock speed, lower pipelining, shorter wire lengths simpler design closure – floating point relatively un-expensive Balancing Memory area vs. Operating area – DSP treatment of wave phenomena useful ILP >= 15 native complex domain support
  • 18. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 18 Multiple Tile SoC: Tiles for Short Wires External Memories Interface RISCMem DSP Prog Mem DSPData mem Reg File VLIW DSP RISC Y+ Communication Interface Y- Communication Interface X-Communication X+Communication RISCMem DSP Prog Mem DSPData mem Reg File VLIW DSP RISC Y+ Communication Interface Y- Communication Interface X-Communication X+Communication Y- Communication Interface Y- Communication Interface
  • 19. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 19 ILP * frequency vs. Wire Delay r and c := technological resistance and capacitance per wire unit length (e.g. 180 nm rc = 58 ps/mm2, 90 nm rc = 160ps/mm2) L := processor’s tile size; 2L:=worst Manhattan path length g:= gate density (e.g. 180 nm g=55Kgate/mm2; 90 nm g=200Kgate/mm2) o:= arithmetic operator cost (e.g. 7Kgate 24 bit floating point, 25Kgate 40 bit) )4( rco g 66.ILP*wfwG )3(2L o g ILP )2( rc38. 1pt 2TLL2 )1( 2)L2(rc38. 1 RC38. 1 wf0f ≤= ≤ == ≤=≤ logic delay f0, wire delay fw. f0<< fw indicates designs far away from interconnect troubles LT repeater insertion critical length ~ 4.4mm ~ 419 ps on 180 nm L2 areas support VLIW ILP determined by gate density over operator cost peak power for a VLIW area below critical area (NOTE the interesting independency on size and ILP)
  • 20. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 20 Dimentional Hints Equation (4) Gw =fw * ILP <= .66 g/rco, provides suggestions: At one extreme, it states the obvious. You should divide a die too large into N smaller tiles. At intermediate scale, it suggests to insert at least the number of operators that can be reached by non repeated wires. In fact a lower frequency reduces the number of local pipelines, permits the adoption of a classical ASIC RTL methodology and simplifies the adoption of lower Vdd. A lower bound for the ILP is imposed by: – adequate program and data memory inside each tile – maintain a higher granularity to avoid classical massive parallel processing inefficiencies – balance memory vs processing resources
  • 21. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 21 mAgic VLIW Arith Oper Gate Count #Units Combina torial Non Comb. Total Grand Total 4 40 bit Float Adder 10900 8400 19300 77200 4 40 bit Float and 32 bit Int Mult 24100 2700 26800 107200 4 32 bit Int Adder 900 1600 2500 10000 2 Float Div & Sqrt Seed 3300 700 4000 8000 2 40 bit Shift & Logic Unit 6600 800 7400 14800 2 Float<->Int Converter 4500 400 4900 9800 1 Decoder 1200 1700 2900 2900 Total VLIW Arithmetic Operators 251300 SUPPORT FOR NATIVE COMPLEX DOMAIN 40-bit FLOATING POINT ARITHMETIC <= 1.5 mm2 on 90 nm CMOS
  • 22. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 22 mAgic VLIW Floating Point DSP Gate Count Combinato rial Non Comb. Total VLIW Arithmetic & Logic Operators 186800 64500 251300 VLIW Multiple Address Generation Unit 18900 13100 32000 VLIW Flow Controller 8300 18000 26300 VLIW DMA + Global Status Controller 4600 11300 15900 VLIW Program Decompression Engine 7600 10500 18100 VLIW Local Memory Mux Logic & Bist 18700 13800 32500 VLIW<->RISC Memory Mapping Interface 1300 6800 8100 VLIW DSP ENGINE TOTAL GATE COUNT 250400 141400 391800 payload for mAgic VLIW core >= 70% grows to 90% considering processor + memory tiles equations (3) and (4) are reasonable approximations
  • 23. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 23 Dimensional Analysis results On 180 nm Atmel technology, the gate delay for a 25 Kgate 40-bit floating point operator with two pipeline stages is 8 ns. Therefore f0<< fw for the full VLIW core with ILP = 15. Complete RISC+VLIW logic area ~ 10 mm2, 550 Kgates On 90 nm copper CMOS, f0 / fw ~ 1 / 6for the 250 Kgate complex domain data path with 10 floating point operators. A pipeline stage will be required on core<->memory busses. 8 Gigaflops, 4 Janus tiles SOC feasible with ASIC methodology on 90 nm technology with classical RTL synchronous design. Insertion of repeaters and promotion of global wires on thicker layers helps, but is not yet mandatory.
  • 24. Hot Chips 15 August 2003Janus – A Gigaflops RISC+VLIW SoC TilePier S. PAOLUCCI 24 Summary mAgic VLIW DSP: a synthesizable complex domain floating point core: 1.0 Gigaflops @ 100 MHz, 180 nm CMOS Janus: a 180 nm RISC+floating point VLIW platform for Gigaflops SoC applications A Dimensional Analysis based on the technological parameter g/rco provides an interconnect bound to the processing power of simple synthesizable RTL designs on DSM techologies with ASIC methodology We will follow a simple roadmap for SoC integration of multiple gigaflops on 90 nm using: – Tiles for Short Wires – Appropriate VLIW parallelism – Moderate Clock Speed – This methodology keeps the cost down to a few tens of designers and M$.