EXPLOIT THE
INTEGRATED GRAPHICS
IN PACKET PROCESSING

Speaker:        Francesco Corazza
Supervisor:     Prof. Fulvio Risso
Course:         Progetto di Reti Locali
Academic year:  2010/2011




Scenario
Packet processing demands ever-higher performance:
• Increasing network speed
• More intelligence in network devices
• Deeper packet analysis
• …



Intel is the best network hardware choice thanks to:
• Economies of scale
• Price/quality ratio
• Power consumption




           We will deal with packet processing on Intel platforms…




Overview
Issues:
   • Intel
      • Has not yet delivered efficient tools for our needs
   • Discrete GPU
      • Heavy
      • Expensive
      • Power-hungry
      • Limited by the bus bottleneck

Focus:
   • Consumer platforms
   • CPU + GPU solutions


                    Two different objectives can be identified…




Presentation Structure
Objectives:
• Focus on the Field
• Focus on Integrated Graphics

Chapter Division:
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can cost-effective hardware be exploited in these applications?
• GPU solutions
• CPU+GPU solutions
FOCUS ON THE FIELD




Focus on the field
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can cost-effective hardware be exploited in these applications?




Packet processing Applications
• Memory intensive
  • Frequent data loads from packets
  • Huge amount of data involved in the processing
• No data locality
  • Unpredictable loads from different memory areas
• Small tasks, over a large number of packets
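
The three traits above can be made concrete with a minimal C sketch: a tiny per-packet task (counting packets per flow) whose table slot depends on the packet content, so consecutive packets touch unrelated memory locations. Every name here (`packet_t`, `flow_table`, `flow_hash`, `count_flows`) and the table size are assumptions made for illustration, not part of the presentation.

```c
#include <stddef.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 4096u  /* hypothetical table size (power of two) */

/* Hypothetical minimal packet descriptor: only the fields the task reads. */
typedef struct {
    uint32_t src_ip;
    uint32_t dst_ip;
} packet_t;

/* Per-flow packet counters; a real table would be far larger than the caches. */
static uint32_t flow_table[FLOW_TABLE_SIZE];

/* Illustrative hash: the table slot depends on packet content, so the
   accesses follow no locality pattern. */
static uint32_t flow_hash(const packet_t *p)
{
    uint32_t h = p->src_ip ^ (p->dst_ip * 2654435761u);
    return h & (FLOW_TABLE_SIZE - 1u);
}

/* Small task repeated over a large number of packets: one load from the
   packet, then one data-dependent load/store on the table. */
static void count_flows(const packet_t *pkts, size_t n)
{
    for (size_t i = 0; i < n; i++)
        flow_table[flow_hash(&pkts[i])]++;
}
```

The loop body is almost pure load/store work with hardly any arithmetic, which is the "memory intensive, no data locality" profile described above.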








General computing vs. Packet processing
                      Core activity           Memory access patterns   Structure

General Computing     CPU-bound               Locality pattern         Complex tasks launched once
Application           ALU-based computation   Caches are useful        Small amount of memory required

Packet Processing     Memory-bound            Random pattern           Very repetitive small tasks
Application           Load/Store-based        Unpredictable loads      Huge amount of memory involved
                      computation             from memory



        Differences in hardware will mirror differences in software…








Network Processors
• Memory
  • Narrow data buses
  • Multiple data buses
  • Memory hierarchies
  • Few caches
• Superscalar execution
  • Massive number of threads
  • Thread-level parallelism
  • Zero-overhead switching
  • Asynchronous code

Packet processing Applications
• Memory intensive
  • Huge amount of data involved in the processing
  • Frequent data loads from packets
• No data locality
  • Unpredictable loads from different memory areas
• Small tasks, over a large number of packets


   Packet processing is a market niche, so the industry was obliged to
    move to solutions borrowed from the mainstream consumer market…




Network Hardware Evolution
Economies of scale have pushed out specialized hardware (listed in chronological order):

• Network Processors
     • Cisco
     • Tilera
     • …
• Consumer Processors
     • GPU solutions
           • Nvidia Fermi
     • CPU+GPU solutions
           • Our investigation lies here
• Hybrid Processors
     • Intel Many Integrated Core
     • AMD Fusion




Focus on the field
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
   • GPU
   • CPU + GPU
   • Intel MIC

• How can cost-effective hardware be exploited in these applications?




GPU – Features
• Shared Memory
  • High bandwidth
  • Coalesced access
• Lots of Execution Units
  • Slow cores
  • Massive parallelism
• SIMT execution model
  • More flexible than SIMD

Packet processing Applications
• Memory intensive
  • Huge amount of data involved in the processing
  • Frequent data loads from packets
• No data locality
  • Unpredictable loads from different memory areas
• Small tasks, over a large number of packets




CPU + GPU solutions
… just wait a few slides to find out how it ends up




   Let's take a look at the architectures that we will face in the future…




Intel MIC (Many Integrated Core)
• Built on the Single-Chip Cloud Computer and Larrabee research projects
   • Programming GPU with x86 Instruction Set

• Development tools in common with Xeon
   • Same tools can compile both for the processor and for the co-processor
   • HPC market target

• Knights Corner (First Implementation):
  • 50 x86 cores, each with four threads, 64 KB L1 and 256 KB L2 cache, and a
    512-bit vector unit; GDDR5 memory; PCI Express 2.0




Focus on the field
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can cost-effective hardware be exploited in these applications?
   • GPGPU
   • DirectCompute
   • OpenCL




GPGPU – Overview
• General-Purpose computing on graphics processing units
  • Programming GPUs through accessible programming interfaces
    and industry-standard languages such as C
  • Allows software developers to use stream processing on
    non-graphics data
• Competing interfaces
  • Nvidia Compute Unified Device Architecture (CUDA)
  • AMD Stream (now merged into OpenCL)
  • Microsoft DirectCompute (new subset of the DirectX 10/11 APIs)
• Convergence towards standardization (like OpenGL)
  • Khronos Group OpenCL


                    These frameworks lie just above the hardware…




GPGPU – Layer representation

Layer (top to bottom)    Examples
Applications             Media playback or processing, media UI, recognition, technical, etc.
Domain Languages         Accelerator, Brook+, Rapidmind, Ct
Domain Libraries         MKL, ACML, cuFFT, D3DX, etc.
Compute Languages        DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
Processors               CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)




GPGPU – Analysis
• CUDA
  • Tight hardware integration
  • Dependence on Nvidia hardware
• OpenCL
  • Gives up lower-level hooks into the architecture
  • Heterogeneous computational resources
  • Integration in the Khronos family (e.g. OpenGL)
• DirectCompute
  • Windows only (Wine/Mono are immature)
  • Integration in DirectX APIs
  • GPGPU under the hood of Windows 7


      Given their wider reach, we are going to cover the latter two…




DirectCompute
Exposes the compute functionality of the GPU as a new
type of shader (tool that determines the final appearance of an object's surface)
• Compute Shader
   • Delivers the performance of 3-D games to new applications
• Rendering integration
  • Demonstrates tight integration between computation and rendering
• Supported by all processor vendors
  • DirectX 10.1/11.0 respectively support Compute Shader 4.0/5.0
• Scalable parallel processing model
  • Code should scale for several generations




DirectCompute – Rendering Pipeline




Render scene → Write out scene image → Use Compute for image post-processing → Output final image




DirectCompute – Programming Model
• Dispatch
  • 3D grid of thread groups
• Thread Group
  • 3D grid of threads
  • numThreads(nX, nY, nZ)
• Thread
  • One invocation of a shader



Threads in the same group run concurrently
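
How the three levels combine into the IDs a shader receives can be sketched in plain C. The two relations are the ones documented for the HLSL system values `SV_DispatchThreadID` and `SV_GroupIndex`; the `uint3` struct and the function names are assumptions made for this illustration.

```c
#include <stdint.h>

/* Three-component unsigned vector, mirroring HLSL's uint3 (illustrative type). */
typedef struct { uint32_t x, y, z; } uint3;

/* SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID,
   applied per component. */
static uint3 dispatch_thread_id(uint3 group_id, uint3 group_thread_id,
                                uint3 num_threads)
{
    uint3 id = {
        group_id.x * num_threads.x + group_thread_id.x,
        group_id.y * num_threads.y + group_thread_id.y,
        group_id.z * num_threads.z + group_thread_id.z
    };
    return id;
}

/* SV_GroupIndex: the thread's flattened index inside its own group. */
static uint32_t group_index(uint3 group_thread_id, uint3 num_threads)
{
    return group_thread_id.z * num_threads.x * num_threads.y
         + group_thread_id.y * num_threads.x
         + group_thread_id.x;
}
```

For example, with numthreads(4, 4, 1), thread (1, 2, 0) of group (2, 3, 0) has dispatch thread ID (9, 14, 0) and group index 9.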




DirectCompute – Execution Model

                    • A thread is executed by a scalar processor

                    • A thread group is executed on a multiprocessor

                    • A compute shader kernel is launched as a grid of
                      thread groups (only one grid of thread groups
                      can execute on a device at one time)




DirectCompute – Example HLSL code
// One element of the output buffer
struct BufferStruct { uint4 color; };

// Thread-group size: 4 x 4 x 1 threads in this example
#define thread_group_size_x 4
#define thread_group_size_y 4
RWStructuredBuffer<BufferStruct> g_OutBuff;

// Number of threads in a thread group, e.g. [numthreads( 4, 4, 1 )]
[numthreads( thread_group_size_x, thread_group_size_y, 1 )]
void main( uint3 threadIDInGroup  : SV_GroupThreadID,
           uint3 groupID          : SV_GroupID,
           uint  groupIndex       : SV_GroupIndex,
           uint3 dispatchThreadID : SV_DispatchThreadID )
{
  // Must match the host call; assumed to be Dispatch(16, 16, 1)
  const uint N_THREAD_GROUPS_X = 16;
  // Buffer stride in elements; assumes data stride = data width (i.e. no padding)
  uint stride = thread_group_size_x * N_THREAD_GROUPS_X;
  // Flatten the 2D dispatch thread ID into a linear buffer index
  uint idx = dispatchThreadID.y * stride + dispatchThreadID.x;
  // uint4 (not float4) so the value matches the type of BufferStruct.color
  uint4 color = uint4( groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y );
  g_OutBuff[ idx ].color = color;
}
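
The index arithmetic of this shader can be checked on the CPU with a small C sketch. It assumes, as the shader comment does, a dispatch of 16 x 16 x 1 groups of 4 x 4 x 1 threads; the loops are a sequential stand-in for the GPU, and all names are made up for the illustration.

```c
#include <stdint.h>
#include <string.h>

#define GROUP_SIZE_X 4u   /* thread_group_size_x in the shader */
#define GROUP_SIZE_Y 4u   /* thread_group_size_y in the shader */
#define N_GROUPS_X   16u  /* assumed dispatch of 16 x 16 x 1 groups */
#define N_GROUPS_Y   16u
#define WIDTH  (GROUP_SIZE_X * N_GROUPS_X)  /* 64 threads per row */
#define HEIGHT (GROUP_SIZE_Y * N_GROUPS_Y)  /* 64 rows of threads */

/* Counts how many threads write each buffer element. */
static uint8_t writes[WIDTH * HEIGHT];

/* Sequential stand-in for the GPU dispatch: visits every thread of every
   group and applies the same index computation as the shader. */
static void emulate_dispatch(void)
{
    memset(writes, 0, sizeof writes);
    for (uint32_t gy = 0; gy < N_GROUPS_Y; gy++)
      for (uint32_t gx = 0; gx < N_GROUPS_X; gx++)
        for (uint32_t ty = 0; ty < GROUP_SIZE_Y; ty++)
          for (uint32_t tx = 0; tx < GROUP_SIZE_X; tx++) {
              uint32_t x = gx * GROUP_SIZE_X + tx;  /* dispatchThreadID.x */
              uint32_t y = gy * GROUP_SIZE_Y + ty;  /* dispatchThreadID.y */
              uint32_t idx = y * WIDTH + x;         /* y * stride + x */
              writes[idx]++;
          }
}

/* Returns 1 if every element was written exactly once, 0 otherwise. */
static int each_element_written_once(void)
{
    for (uint32_t i = 0; i < WIDTH * HEIGHT; i++)
        if (writes[i] != 1) return 0;
    return 1;
}
```

Under these assumptions the output buffer needs 64 x 64 = 4096 elements, and no two threads write the same element, so the shader needs no synchronization.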




OpenCL – Overview
Open Computing Language
• Access to heterogeneous computational resources
• Parallel execution on single or multiple processors
   • GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and Handheld Profiles
• Works with graphics APIs
  • OpenGL
• C99 with extensions
  • Familiar to developers
  • Rich set of built-in functions
  • Easy to develop data- and task-parallel compute programs
  • Defines hardware and numerical precision requirements




OpenCL – Execution Model (I)
• Work item
  • Basic unit of work on an OpenCL device
• Kernel
  • Basic unit of executable code
  • Similar to a C function
  • Data-parallel or task-parallel
• Program
  • Collection of kernels and functions
  • Analogous to a dynamic library
• Context
  • Environment within which work-items execute
• Applications
  • Queue kernel execution instances
       • Queued in-order: one queue per device
       • Executed in-order or out-of-order




OpenCL – Coding (I)
• Work-item
  • Smallest execution entity
  • Every time a Kernel is launched, lots of work-items (a number
    specified by the programmer) are launched, each one executing the
    same code
  • Unique ID
       • Accessible from the kernel
       • Used to distinguish the data to be processed by each work-item
• Work-group
  • Allows communication and cooperation between work-items
  • Reflects the organization of work-items
       • (N-dimensional grid of work-items, N = 1, 2 or 3)
       • Independent element of execution in the N-D domain
• ND-Range
  • Computation domain (organization level)
  • Specifies how work-groups are organized
       • (N-dimensional grid of work-groups, N = 1, 2 or 3)
       • Defines the total number of work-items that execute in parallel




OpenCL – Coding (II)




OpenCL – Coding (III)
Process a 1024 x 1024 image
Global problem dimensions:
   • 1024 x 1024 = 1 kernel execution per pixel
   • 1,048,576 total executions




Scalar:

void scalar_mul( int n,
                 const float *a,
                 const float *b,
                 float *result )
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

Data-parallel:

kernel void dp_mul( global const float *a,
                    global const float *b,
                    global float *result )
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
// execute dp_mul over "n" work-items
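
What "execute dp_mul over n work-items" means can be sketched in plain C: the kernel body becomes a function of the work-item ID, and a loop plays the role of the runtime. This is only a sequential stand-in, not the OpenCL host API; `dp_mul_item` and `run_over_range` are hypothetical names.

```c
#include <stddef.h>

/* Body of the data-parallel kernel, with the work-item ID passed explicitly
   (in OpenCL C it would come from get_global_id(0)). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *result)
{
    result[id] = a[id] * b[id];
}

/* Hypothetical sequential stand-in for enqueueing the kernel over a 1-D range
   of n work-items. A real device may run the items in parallel and in any
   order, which is safe here because no two items write the same output. */
static void run_over_range(size_t n, const float *a, const float *b, float *result)
{
    for (size_t id = 0; id < n; id++)
        dp_mul_item(id, a, b, result);
}
```

For the 1024 x 1024 image above, the range would cover 1,048,576 work-items, one per pixel.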
FOCUS ON
INTEGRATED GRAPHICS




CPU+GPU solutions
The architectures involved are:
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion




                    Let’s compare them…




CPU+GPU solutions
Architecture                      Market Target                  Release Date
Intel Core (Sandy Bridge)         Desktop / Hi-End               01/2011
Intel Atom E600 (Tunnel Creek)    Mobile / Industrial embedded   11/2010
Nvidia Tegra (Tegra 2)            Mobile / Tablets               01/2010
AMD Fusion                        Consumer / Desktop             01/2011




Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
   • Features
   • Integrated GPU
   • AVX (Advanced Vector Extensions)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion




Sandy Bridge – Features (I)
• CPU die redesigned
  • The chip's northbridge and the GPU are both on-die (in previous
    versions they were on a physically separate chip)
• LLC (Last Level Cache, formerly L3 Cache)
   • Thanks to the new ring bus, the LLC is shared among all components,
     including the GPU
   • In previous generations, each individual core had its own private path to the LLC
• Unified Memory Architecture (UMA)
  • Architecture where the graphics subsystem does not have
    exclusive dedicated memory and uses the host system’s memory
  • Dynamic Video Memory Technology (DVMT)
• Hyper-Threading




Sandy Bridge – Features (II)
• Turbo Boost Technology 2.0
   • Adjusts the processor core and GPU frequencies to increase
     performance while staying within the allotted power/thermal budget
   • The processor can increase individual core speed or graphics speed as
     the workload dictates
   • Developers cannot directly control it
• AVX (Advanced Vector Extensions)
   • Extends SIMD instructions from 128 bits to 256 bits
   • A single AVX instruction works on eight single-precision floating-point
     values at a time, instead of the four that SSE provides
   • Increased processor performance with a minimal increase in power
     (HUGI: Hurry Up And Get Idle)

         The next diagram shows the level of integration that Intel has reached…




Sandy Bridge – Block Diagram




             Now we have to zoom in on the graphics processor…




Sandy Bridge – Integrated GPU (I)




Sandy Bridge – Integrated GPU (II)
• DirectCompute support
  • DirectX 10.1
  • The internal ISA maps one-to-one with most DirectX10 API
    instructions resulting in a very CISC-like architecture
• Execution Unit (EU)
  • The pipeline decoder uses only fixed-function logic to limit the
    overall power consumption (unlike NVIDIA and AMD, which have
    programmable stream processors)
  • Each EU can dual-issue, picking instructions from multiple threads
  • Transcendental math is handled by hardware in the EU, and its
    performance has been sped up considerably


   The GPU's parallel capabilities are exploited thanks to DirectCompute, but
                          what about the CPU?




AVX – Overview
• KEY FEATURES
     • Wider vectors
           • Increased from 128 to 256 bits
           • Two 128-bit load ports
     • Enhanced data rearrangement
           • New 256-bit primitives for broadcasts, masked loads and stores, and data permutes
     • Three and four operands
           • Non-destructive source operands for both AVX 128 and AVX 256
     • Flexible unaligned memory access support
     • Extensible new opcode encoding (VEX)
• BENEFITS
     • Higher peak FLOPS with good power efficiency
     • Organize, access and pull only the necessary data more quickly and efficiently
     • Fewer register copies, better register use for both vector and scalar code
     • More opportunities to fuse load and compute operations
     • Code size reduction



         Some assembly instructions can show the power of AVX…




AVX – Instructions (I)




AVX – Instructions (II)




AVX – Code Example (I)




High level code:

#include <immintrin.h>

void foo(float *a, float *b, float *r)
{
     __m256 s1, s2, res;

     s1 = _mm256_loadu_ps(a);
     s2 = _mm256_loadu_ps(b);

     res = _mm256_add_ps(s1, s2);

     _mm256_storeu_ps(r, res);
}

Assembly:

; -- Begin _foo
ALIGN 16
PUBLIC _foo
_foo         PROC NEAR
; parameter 1: 4 + esp
; parameter 2: 8 + esp
; parameter 3: 12 + esp
$B2$1:       ; Preds $B2$0
   mov eax, DWORD PTR [4+esp]
   mov edx, DWORD PTR [8+esp]
   mov ecx, DWORD PTR [12+esp]
   vmovups ymm0, YMMWORD PTR [eax]
   vaddps ymm1, ymm0, YMMWORD PTR [edx]
   vmovups YMMWORD PTR [ecx], ymm1
   ; LOE ebx ebp esi edi
$B2$2:       ; Preds $B2$1
   ret       ;10.1
   ALIGN     16
             ; LOE
_foo         ENDP
;_foo        ENDS
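
The function above adds exactly eight floats. A possible extension to arrays of any length is sketched below, using the same three intrinsics; the name `add_arrays`, the remainder loop and the compile-time `__AVX__` guard are assumptions added for the illustration, not part of the original example.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Adds two float arrays of arbitrary length n.
   When AVX is enabled at compile time (e.g. -mavx on GCC/Clang), the first
   loop handles 8 floats per iteration; the scalar loop finishes the
   remainder and is also the full fallback when AVX is not enabled. */
static void add_arrays(const float *a, const float *b, float *r, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8) {
        __m256 s1 = _mm256_loadu_ps(a + i);   /* unaligned 256-bit loads */
        __m256 s2 = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(r + i, _mm256_add_ps(s1, s2));
    }
#endif
    for (; i < n; i++)   /* tail elements, or everything without AVX */
        r[i] = a[i] + b[i];
}
```

The unaligned load/store variants are kept, as in the original, so the caller does not have to guarantee 32-byte alignment.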




AVX – Benchmarks




AVX – Benchmarks




SIMD processing works best with data-parallel applications where the data is
arranged in a structure-of-arrays (SoA) format. Graphics and image processing
applications are often highly parallel and well-structured, and thus are
typically good candidates for SIMD processing. Geometry or mesh data, on the
other hand, is not always uniformly structured in a neat grid.
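
The difference between the two layouts can be sketched in C on packet metadata. The field names, the batch size of 8 and the helper functions are hypothetical, chosen only to illustrate the SoA idea.

```c
#include <stddef.h>
#include <stdint.h>

#define N_PKTS 8  /* hypothetical batch size */

/* Array of structures (AoS): the fields of one packet are adjacent, so the
   same field of consecutive packets is strided in memory. */
typedef struct {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint32_t length;
} pkt_aos_t;

/* Structure of arrays (SoA): each field is contiguous across packets, so
   consecutive values of one field sit next to each other in memory. */
typedef struct {
    uint32_t src_ip[N_PKTS];
    uint32_t dst_ip[N_PKTS];
    uint32_t length[N_PKTS];
} pkt_soa_t;

/* Convert a batch of N_PKTS packets from AoS to SoA. */
static void aos_to_soa(const pkt_aos_t *in, pkt_soa_t *out)
{
    for (size_t i = 0; i < N_PKTS; i++) {
        out->src_ip[i] = in[i].src_ip;
        out->dst_ip[i] = in[i].dst_ip;
        out->length[i] = in[i].length;
    }
}

/* Sum of packet lengths: a unit-stride loop over a single field, the access
   pattern that SIMD code handles best. */
static uint32_t total_length(const pkt_soa_t *batch)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < N_PKTS; i++)
        sum += batch->length[i];
    return sum;
}
```

With the AoS layout the same sum would read every third 32-bit word, so a vector load would fetch mostly unwanted fields.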




Sandy Bridge – Conclusion
• Interesting features for packet processing
   • Integrated Memory controller
   • DirectCompute
   • AVX
• CPU+GPU integration is only at the physical level
  • Packet processing can exploit the CPU or the GPU
  • Unpredictable evolution
       • DirectCompute could exploit the CPU
       • AVX could exploit the GPU
   • The upcoming Ivy Bridge will support both OpenCL and DirectX 11




Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
   • Features
   • Block Diagram
   • Customization
• Nvidia Tegra (Tegra 2)
• AMD Fusion




Atom E600 – Features (I)
• SoC (System on Chip)
• Power optimized
  • Fanless performance
• Flexible and open I/O
   • Adapts to application-specific needs
   • PCIe instead of a proprietary FSB
• 7-year long-life support


• Hyper-Threading Technology
  • Two logical processors
• SSE3 (Streaming SIMD Extensions 3)
  • Support for SIMD instructions




Atom E600 – Features (II)
• Power saving
  • Intel SpeedStep Technology
       • Enables the operating system to program a processor to transition to
         lower frequency and/or voltage levels while executing a workload
   • Deep power down technology
       • Able to reduce static power consumption by turning off power to cache
         and other sub-systems in the processor.
   • In-order processing
       • Improves power efficiency: the CPU does not reorder the instruction
         stream to extract instruction-level parallelism
• DirectCompute support
  • None: Tunnel Creek supports only DirectX 9


        The next diagram gives an inside view of the Atom architecture…




Atom E600 – Block Diagram




Atom does not support DirectCompute, so we have to concentrate on the great
flexibility of the architecture…




Atom E600 – Customization
• Open connection
  • Developers can attach the processor to a variety of chipsets
       • Application-specific third-party chipsets
       • FPGAs
       • ASICs
   • Processor can be used without a chipset (limited I/O needs)
       • The processor's four PCIe connections can attach to discrete
         PCIe peripherals such as Ethernet controllers




Atom E600 – Conclusion
• Interesting features for packet processing
   • Power saving features
   • Long-life support
   • Flexible architecture
• No native GPGPU support; alternatives:
  • Old-school GPGPU
       • Use OpenGL ES 2.0 shaders (programmable shaders)
       • Rewrite the code as a fragment shader
   • Wait for Cedar Trail (2011 – not yet released)
       • DirectX 10.1




Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
   • Features
   • Block Diagram

• AMD Fusion




Tegra – Features
• SoC (System-on-a-Chip)
  • Dual-core ARM CPU
  • GeForce GPU
• ULP (ultra-low power)
• Graphics support
  • No DirectX support
  • No CUDA support
  • OpenGL ES 2.0 support




       The next diagram gives a quantitative view of a Tegra chip…




Tegra – Block Diagram




Tegra – Conclusion
• Interesting features for packet processing
   • Integrated Memory controller
   • Low power consumption
• No native GPGPU support; alternatives:
  • Old-school GPGPU
       • Use OpenGL ES 2.0 shaders (programmable shaders)
       • Rewrite the code as a fragment shader
   • Wait for Tegra 3 (third quarter of 2011)
       • DirectX 11
       • CUDA




Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion
   • AMD Vision
   • Features
   • APU Roadmap
   • Integration Highlights




Fusion – AMD Vision
Fusion is a step forward in technology:




AMD has realized this heterogeneous architecture by developing APUs…




Fusion – Features (I)
                    Video




Fusion – Features (II)
• DirectCompute support (DirectX 11)
• OpenCL 1.1
  • Additive capabilities of an APU and a discrete graphics solution
  • Power-oriented benefits
• Massive SIMD GPU (SSE5)
  • Programmable scalar and vector processor cores
• APU family
  • Bulldozer (Sandy Bridge's competitor)
       • Performance and scalability
   • Bobcat (Atom's competitor)


          Let's compare these two solutions…




Fusion – Features (III)




    Bulldozer and Bobcat also differ in their market target…




Fusion – APU roadmap




        The high level of integration differentiates APUs from CPUs…




Fusion – Integration Highlights
• Shared memory
  • Lower latencies
• PCI Express
  • Cuts down some latencies
• No discrete GPU means lower
  • Cost
  • Power
  • Motherboard complexity




Fusion – Conclusion
• Interesting features for packet processing
   • OpenCL/DirectCompute/SSE5
   • Tightly integrated architecture
   • New technology (First-Come-First-Served)
• OpenCL
  • Could be the “El Dorado” for packet processing
       • CPU/GPU working in AND/OR configuration
       • Shared Memory
       • Embedded implementation of Fusion technology
   • AMD has declared its support for it, to bring the power of heterogeneous
     computing to the mainstream
CONCLUSIONS




Summary (I)
This presentation has covered several ways of exploiting
integrated graphics and, more generally, consumer architectures
for packet processing:

• GPGPU-driven solutions
   • CUDA, OpenCL, DirectX 11
• SIMD-driven solutions
   • Exploit highly parallel operations through SIMD instruction sets
   • AVX, SSE
• Custom hardware solutions
   • Design flexible modules tailored to specific needs
   • FPGA

            The first kind of solution is the most in vogue at the moment…




Summary (II)
          OpenCL   SSE          FPGA   DirectCompute   OpenGL

          X        V (AVX)      X      V               V

          X        V (SSE 3)    V      X               V

          X        V (SSE 3)    X      X               V

          V        V (SSE 5)    X      V               V




Recommendations
Writing explicitly parallel code is more efficient than relying on
hardware parallelization:
THANK YOU
Questions?





Exploit the Integrated Graphics in Packet Processing

  • 1. EXPLOIT THE INTEGRATED GRAPHICS IN PACKET PROCESSING Speaker: Prof. Fulvio Risso Supervisor: Progetto di Reti Locali Course: 2010/2011 Academic year: Francesco Corazza
  • 2. Francesco Corazza 2 Scenario Packet processing are demanding more performances: • Increasing network speed • More intelligence in network devices • Deeper packet analysis • … Intel is the best network hardware choice thanks to: • Scale economy • Price/quality ratio • Power Consumption We will deal with packet processing on Intel platforms…
  • 3. Francesco Corazza 3 Overview Issues: • Intel • Have not yet deployed efficient tools for our needs • Discrete GPU • Heavy • Expensive • Not power-saving • Affected by BUS bottleneck Focus: • Consumer platforms • CPU + GPU solutions Two different objectives can be identified…
  • 4. Francesco Corazza 4 Presentation Structure Objectives: Focus on Focus on Integrated the Field Graphics Chapter Division: What is the How convenient hardware hardware can be What kind of best fit on exploited in these app? application is these packet applications? Which What is the CPU+GP processing? features GPU hardware U solutions differentiate most solutions them from profitable for general these app? computing?
  • 5. FOCUS ON THE FIELD
  • 6. Francesco Corazza Focus on the Field 6 Focus on the field • What kind of application is packet processing? • Which features differentiate them from general computing? • What is the hardware best fit on these applications? • What is the hardware most profitable for these app? • How convenient hardware can be exploited in these app?
  • 7. Francesco Corazza Focus on the Field 7 Packet processing Applications • Memory intensive • Frequent data load from packet • Huge amount of data involved in the processing • No data locality • Unpredictable loads from different memory areas • Small tasks, over a large number of packets
  • 8. Francesco Corazza Focus on the Field 8 Focus on the field • What kind of application is packet processing? • Which features differentiate them from general computing? • What is the hardware best fit on these applications? • What is the hardware most profitable for these app? • How convenient hardware can be exploited in these app?
  • 9. Francesco Corazza 11 General computing vs. Packet processing Memory Core access Structure activity patterns CPU bounded Locality pattern Complex tasks General launched once Computing Application ALU-based computation Caches are Small amount of useful memory required Memory Very repetitive Packet bounded Random pattern small tasks Processing Load/Store- Unpredictable Application based loads from Huge amount of computation memory memory involved Differences in hardware will mirror differences in software…
  • 10. Francesco Corazza Focus on the Field 12 Focus on the field • What kind of application is packet processing? • Which features differentiate them from general computing? • What is the hardware best fit on these applications? • What is the hardware most profitable for these app? • How convenient hardware can be exploited in these app?
  • 11. Francesco Corazza Focus on the Field 13 Network Processors Packet processing Applications • Memory • Narrow data buses • Memory intensive • Huge amount of data involved • Multiple data buses in the processing • Frequent data load from packet • Memory Hierarchies • Few caches • No data locality • Unpredictable loads from • Superscalar execution different memory areas • Massive number of threads • Thread-level parallelism • Small tasks, over a large • Zero-overhead switching number of packets • Asynchronous code Packet processing is a market niche, so the industry was obliged to move to solutions borrowed from mainstream consumer market…
  • 12. Francesco Corazza Focus on the Field 14 Network Hardware Evolution The scale economies have dropped out specific hardware: • Network Processors • CISCO • Tilera • … T • Consumer Processors I • GPU solutions • Nvidia Fermi M • CPU+GPU solutions E • Our investigation lays here • Hybrid Processors • Intel Many Integrated Core • AMD Fusion
  • 13. Francesco Corazza Focus on the Field 15 Focus on the field • What kind of application is packet processing? • Which features differentiate them from general computing? • What is the hardware best fit on these applications? • What is the hardware most profitable for these app? • GPU • CPU + GPU • Intel MIC • How convenient hardware can be exploited in these app?
  • 14. Francesco Corazza Focus on the Field 16 GPU – Features Packet processing Applications • Shared Memory • Memory intensive • High bandwidth • Huge amount of data involved in the processing • Coalesced access • Frequent data load from packet • No data locality • Unpredictable loads from • Lots of Execution Units different memory areas • Slow cores • Massive parallelism • Small tasks, over a large number of packets • SIMT execution model • More flexible than SIMD
  • 15. Francesco Corazza Focus on the Field 19 CPU + GPU solutions … just wait few slides to find out how it will end up Let's take a look to the architectures that we will face in the future…
  • 16. Francesco Corazza Focus on the Field 20 Intel MIC (Many Integrated Core) • Built from Single-Chip Cloud Computer and Larrabee researches • Programming GPU with x86 Instruction Set • Development tools in common with Xeon • Same tools can compile both for the processor and for the co-processor • HPC market target • Knights Corner (First Implementation): • 50 x86 cores: four threads, 64KB L1, 256KB L2 cache, 512-bit vector unit, GDDR5 memory, PCI Express 2.0
  • 17. Francesco Corazza Focus on the Field 21 Focus on the field • What kind of application is packet processing? • Which features differentiate them from general computing? • What is the hardware best fit on these applications? • What is the hardware most profitable for these app? • How convenient hardware can be exploited in these app? • GPGPU • DirectCompute • OpenCL
  • 18. Francesco Corazza Focus on the Field 22 GPGPU – Overview • General-Purpose computing on graphics processing units • Programming GPUs through accessible programming interfaces and industry-standard languages such as C • Allows software developers to use stream processing on non- graphics data • Competing interfaces • Nvidia Compute Unified Device Architecture (CUDA) • AMD Stream (now joined into OpenCL) • Microsoft DirectCompute (new subset of DirectX10/11 APIs) • Convergence towards standardization (like OpenGL) • Khronos Group OpenCL These frameworks lye just above hardware…
  • 19. Francesco Corazza Focus on the Field 23 GPGPU – Layer representation Media playback or processing, Applications media UI, recognition, etc. Technical Accelerator, Brook+, Rapidmind, Ct Domain Domain Libraries Languages MKL, ACML, cuFFT, D3DX, etc. DirectCompute, CUDA, CAL, Compute Languages OpenCL, LRB Native, etc. Processors CPU, GPU, Larrabee nVidia, Intel, AMD, S3, etc.
  • 20. Francesco Corazza Focus on the Field 25 GPGPU – Analysis • CUDA • Tight hardware integration • Depence on Nvidia hardware • OpenCL • Give up lower-level hooks into the architecture • Heterogeneous computational resources • Integration in the Khronos family (eg. OpenGL) • DirectCompute • Only Windows (Wine/Mono are immature) • Integration in DirectX APIs • GPGPU under the hood of Windows 7 For their spread, we are going to cover the latter two languages…
  • 21. Francesco Corazza Focus on the Field 26 DirectCompute Exposes the compute functionality of the GPU as a new type of shader (tool that determines the final appearance of an object's surface) • Compute Shader • Delivers the performance of 3-D games to new applications • Rendering integration • Demonstrates tight integration between computation and rendering • Supported by all processor vendors • DirectX 10.1/11.0 respectively support Compute Shader 4.0/5.0 • Scalable parallel processing model • Code should scale for several generations
  • 22. Francesco Corazza Focus on the Field 27 DirectCompute – Rendering Pipeline Render scene Write out scene image Use Compute for image post-processing Output final image
  • 23. Francesco Corazza Focus on the Field 30 DirectCompute – Programming Model Dispatch • 3D grid of thread groups Thread Group • 3D grid of threads • numThreads(nX, nY, nZ) Thread • One invocation of a shader Threads in the same group run concurrently
  • 24. Francesco Corazza Focus on the Field 31 DirectCompute – Execution Model • A thread is executed by a scalar processors • A thread group is executed on a multiprocessor • A compute shader kernel is launched as a grid of thread- groups (Only one grid of thread groups can execute on a device at one time)
  • 25. Francesco Corazza Focus on the Field 35 DirectCompute – Example HLSL code struct BufferStruct{ uint4 color;}; // group size #define thread_group_size_x 4 #define thread_group_size_y 4 RWStructuredBuffer<BufferStruct> g_OutBuff; /* This is the number of threads in a thread group, 4x4x1 in this example case */ // e.g.: [numthreads( 4, 4, 1 )] [numthreads( thread_group_size_x, thread_group_size_y, 1 )] void main( uint3 threadIDInGroup : SV_GroupThreadID, uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID ) { int N_THREAD_GROUPS_X = 16; // assumed equal to 16 in dispatch(16,16,1) int stride = thread_group_size_x * N_THREAD_GROUPS_X; // buffer stide, assumes data stride = data width (i.e. no padding) int idx = dispatchThreadID.y * stride + dispatchThreadID.x; float4 color = float4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y); g_OutBuff[ idx ].color = color; }
  • 26. Francesco Corazza Focus on the Field 36 OpenCL – Overview Open Computing Language • Access to heterogeneous computational resources • Parallel execution on single or multiple processors • GPU, CPU, GPU + CPU or multiple GPUs • Desktop and Handheld Profiles • Work with graphics APIs • OpenGL • C99 with extensions • Familiar to developers • Rich set of built-in functions • Easy to develop data- and task- parallel compute programs • Defines hardware and numerical precision requirements
  • 27. Francesco Corazza Focus on the Field 37 OpenCL – Execution Model (I) • Work item • Basic unit of work on an OpenCL device • Kernel • Basic unit of executable code • Similar to a C function • Data-parallel or task-parallel • Program • Collection of kernels and functions • Analogous to a dynamic library • Context • Environment within which work-items execute • Applications • Queue kernel execution instances • In-order: one queue to a device • Executed in-order or out-of-order
  • 28. Francesco Corazza Focus on the Field 43 OpenCL – Coding (I) • Work-item • Smallest execution entity • Every time a kernel is launched, many work-items (a number specified by the programmer) are launched, each one executing the same code • Unique ID • Accessible from the kernel • Used to distinguish the data to be processed by each work-item • Work-group • Allows communication and cooperation between work-items • Reflects the organization of work-items • (N-dimensional grid of work-items, N = 1, 2 or 3) • Independent element of execution in the N-D domain • ND-Range • Computation domain (organization level) • Specifies how work-groups are organized • (N-dimensional grid of work-groups, N = 1, 2 or 3) • Defines the total number of work-items that execute in parallel
  • 29. Francesco Corazza Focus on the Field 44 OpenCL – Coding (II)
  • 30. Francesco Corazza Focus on the Field 45 OpenCL – Coding (III) Process a 1024 x 1024 image Global problem dimensions: • 1024 x 1024 = 1 kernel execution per pixel • 1,048,576 total executions
    Scalar:
    void scalar_mul( int n, const float *a, const float *b, float *result )
    {
        int i;
        for (i=0; i<n; i++)
            result[i] = a[i] * b[i];
    }
    Data-parallel:
    kernel void dp_mul( global const float *a, global const float *b, global float *result )
    {
        int id = get_global_id(0);
        result[id] = a[id] * b[id];
    }
    // execute dp_mul over "n" work-items
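The link between the scalar loop and the data-parallel kernel can be made explicit by emulating a 1-D ND-range on the CPU. This is a sketch, not OpenCL host code: `run_1d_range` and `dp_mul_item` are hypothetical names, and the work-groups run one after another here instead of in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the data-parallel kernel; `id` stands in for get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *result)
{
    result[id] = a[id] * b[id];
}

/* Hypothetical stand-in for enqueuing a 1-D ND-range: every work-item of
   every work-group executes the same kernel body with its own global ID. */
static void run_1d_range(size_t global_size, size_t local_size,
                         const float *a, const float *b, float *result)
{
    /* Assumption of this sketch: the global size is a multiple of the work-group size. */
    assert(local_size > 0 && global_size % local_size == 0);

    for (size_t group = 0; group < global_size / local_size; group++)
        for (size_t local = 0; local < local_size; local++)
            dp_mul_item(group * local_size + local, a, b, result);
}
```

For the 1024 x 1024 image on the slide, the global size would be 1,048,576 work-items; the work-group size is left to the programmer or to the runtime.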
  • 32. Francesco Corazza Focus on Integrated Graphics 47 CPU+GPU solutions The architectures involved are: • Intel Core 2nd Generation (Sandy Bridge) • Intel Atom E600 Series (Tunnel Creek) • Nvidia Tegra (Tegra 2) • AMD Fusion Let’s compare them…
  • 33. Francesco Corazza Focus on Integrated Graphics 48 CPU+GPU solutions
    Platform                Market Target                   Release Date
    Intel Sandy Bridge      Desktop / Hi-End                01/2011
    Intel Atom E600         Mobile / Industrial embedded    11/2010
    Nvidia Tegra 2          Mobile / Tablets                01/2010
    AMD Fusion              Consumer / Desktop              01/2011
  • 34. Francesco Corazza Focus on Integrated Graphics 49 Focus on Integrated Graphics • Intel Core 2nd Generation (Sandy Bridge) • Features • Integrated GPU • AVX (Advanced Vector Extensions) • Intel Atom E600 Series (Tunnel Creek) • Nvidia Tegra (Tegra 2) • AMD Fusion
  • 35. Francesco Corazza Focus on Integrated Graphics 50 Sandy Bridge – Features (I) • CPU die redesigned • The chip’s northbridge and GPU are both on-die (in previous versions they were on a physically separate chip) • LLC (Last Level Cache, formerly L3 Cache) • Thanks to the new ring bus, the LLC is shared amongst all components, including the GPU • Each individual core has its own private path to the LLC • Unified Memory Architecture (UMA) • Architecture where the graphics subsystem does not have exclusive dedicated memory and uses the host system’s memory • Dynamic Video Memory Technology (DVMT) • Hyper-Threading
  • 36. Francesco Corazza Focus on Integrated Graphics 51 Sandy Bridge – Features (II) • Turbo Boost Technology 2.0 • Adjusts the processor core and GPU frequencies to increase performance while staying within the allotted power/thermal budget • The processor can increase individual core speed or graphics speed as the workload dictates • Developers cannot directly control it • AVX (Advanced Vector Extensions) • Extends SIMD instructions from 128 bits to 256 bits • AVX enables a single instruction to work on eight single-precision floating-point values at a time instead of the four that 128-bit SSE provides • Increased processor performance with a minimal increase in power consumption (HUGI: Hurry Up and Get Idle) The next diagram shows the level of integration that Intel has reached…
  • 37. Francesco Corazza Focus on Integrated Graphics 52 Sandy Bridge – Block Diagram Now we have to zoom in on the graphics processor…
  • 38. Francesco Corazza Focus on Integrated Graphics 53 Sandy Bridge – Integrated GPU (I)
  • 39. Francesco Corazza Focus on Integrated Graphics 54 Sandy Bridge – Integrated GPU (II) • DirectCompute support • DirectX 10.1 • The internal ISA maps one-to-one with most DirectX 10 API instructions, resulting in a very CISC-like architecture • Execution Unit (EU) • The pipeline decoder uses only fixed-function logic to limit the overall power consumption (unlike NVIDIA and AMD, which have programmable stream processors) • Each EU can dual-issue, picking instructions from multiple threads • Transcendental math is handled by hardware in the EU and its performance has been sped up considerably The GPU’s parallel capabilities are exploited thanks to DirectCompute, but what about the CPU?
  • 40. Francesco Corazza Focus on Integrated Graphics 55 AVX – Overview • KEY FEATURES • Wider vectors • Increased from 128 to 256 bits • Two 128-bit load ports • Enhanced data rearrangement • Use the new 256-bit primitives to broadcast, mask loads and stores, and permute data • Three and four operands • Non-destructive source for both AVX 128 and AVX 256 • Flexible unaligned memory access support • Extensible new opcode (VEX) • BENEFITS • Higher peak FLOPS with good power efficiency • Organize, access and pull only necessary data more quickly and efficiently • Fewer register copies, better register use for both vector and scalar code • More opportunities to fuse load and compute operations • Code size reduction Some assembly instructions can show the power of AVX…
  • 41. Francesco Corazza Focus on Integrated Graphics 56 AVX – Instructions (I)
  • 42. Francesco Corazza Focus on Integrated Graphics 57 AVX – Instructions (II)
  • 43. Francesco Corazza Focus on Integrated Graphics 58 AVX – Code Example (I)
    High level code:
    #include <immintrin.h>
    void foo(float *a, float *b, float *r)
    {
        __m256 s1, s2, res;
        s1 = _mm256_loadu_ps(a);
        s2 = _mm256_loadu_ps(b);
        res = _mm256_add_ps(s1, s2);
        _mm256_storeu_ps(r, res);
    }
    Assembly:
    ; -- Begin _foo
    ALIGN 16
    PUBLIC _foo
    _foo PROC NEAR
    ; parameter 1: 4 + esp
    ; parameter 2: 8 + esp
    ; parameter 3: 12 + esp
    $B2$1: ; Preds $B2$0
        mov eax, DWORD PTR [4+esp]
        mov edx, DWORD PTR [8+esp]
        mov ecx, DWORD PTR [12+esp]
        vmovups ymm0, YMMWORD PTR [eax]
        vaddps ymm1, ymm0, YMMWORD PTR [edx]
        vmovups YMMWORD PTR [ecx], ymm1
    ; LOE ebx ebp esi edi
    $B2$2: ; Preds $B2$1
        ret ;10.1
        ALIGN 16
    ; LOE
    _foo ENDP
    ;_foo ENDS
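The function on the slide adds exactly eight floats. A possible extension to arrays of arbitrary length is sketched below; `add_arrays` is a hypothetical name, and the code assumes the AVX path is only wanted when the compiler defines `__AVX__` (for example with `-mavx` on GCC or Clang), falling back to a scalar loop otherwise.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* r[i] = a[i] + b[i] for n floats, eight at a time when built with AVX. */
static void add_arrays(const float *a, const float *b, float *r, size_t n)
{
    assert(a != NULL && b != NULL && r != NULL);
    size_t i = 0;

#if defined(__AVX__)
    /* Main loop: one 256-bit register holds eight single-precision values. */
    for (; i + 8 <= n; i += 8) {
        __m256 s1 = _mm256_loadu_ps(a + i);
        __m256 s2 = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(r + i, _mm256_add_ps(s1, s2));
    }
#endif

    /* Scalar tail (or the whole array when AVX is not enabled). */
    for (; i < n; i++)
        r[i] = a[i] + b[i];
}
```

The unaligned load and store intrinsics are kept from the slide, so the caller does not have to align the buffers to 32 bytes.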
  • 44. Francesco Corazza Focus on Integrated Graphics 61 AVX – Benchmarks
  • 45. Francesco Corazza Focus on Integrated Graphics 62 AVX – Benchmarks SIMD processing works best with data-parallel applications where the data is arranged in a structure-of-arrays (SoA) format. Graphics and image processing applications are often highly parallel and well-structured, and thus are typically good candidates for SIMD processing. Geometry or mesh data, on the other hand, is not always uniformly structured in a neat grid.
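Applied to packet processing, the SoA layout might look like the sketch below. The field names, the batch size of 64 and both helper functions are assumptions made for illustration; they do not come from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array of structures (AoS): the fields of one packet sit next to each other. */
struct packet_aos {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t length;
    uint8_t  ttl;
};

/* Structure of arrays (SoA): the same field of consecutive packets is
   contiguous, so a SIMD load can fetch several values of one field at once. */
enum { BATCH = 64 };
struct packet_batch_soa {
    uint32_t src_ip[BATCH];
    uint32_t dst_ip[BATCH];
    uint16_t length[BATCH];
    uint8_t  ttl[BATCH];
};

/* Sum of packet lengths, AoS layout: each access strides over a whole struct. */
static uint32_t total_length_aos(const struct packet_aos *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].length;
    return sum;
}

/* Sum of packet lengths, SoA layout: a unit-stride loop over one array,
   which is the access pattern compilers vectorize most easily. */
static uint32_t total_length_soa(const struct packet_batch_soa *b, size_t n)
{
    assert(n <= BATCH);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += b->length[i];
    return sum;
}
```

Both functions return the same result; only the memory access pattern differs.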
  • 46. Francesco Corazza Focus on Integrated Graphics 63 Sandy Bridge – Conclusion • Interesting features for packet processing • Integrated memory controller • DirectCompute • AVX • CPU+GPU integration is only at the physical level • Packet processing can exploit either the CPU or the GPU • Unpredictable evolution • DirectCompute could exploit the CPU • AVX could exploit the GPU • The upcoming Ivy Bridge will support both OpenCL and DirectX 11
  • 47. Francesco Corazza Focus on Integrated Graphics 64 Focus on Integrated Graphics • Intel Core 2nd Generation (Sandy Bridge) • Intel Atom E600 Series (Tunnel Creek) • Features • Block Diagram • Customization • Nvidia Tegra (Tegra 2) • AMD Fusion
  • 48. Francesco Corazza Focus on Integrated Graphics 65 Atom E600 – Features (I) • SoC (System on Chip) • Power optimized • Fanless performance • Flexible and open I/O • Adapts to application-specific needs • PCIe instead of proprietary FSB • 7-year long-life support • Hyper-Threading Technology • Two logical processors • SSE3 (Streaming SIMD Extensions) • Support for SIMD instructions
  • 49. Francesco Corazza Focus on Integrated Graphics 66 Atom E600 – Features (II) • Power saving • Intel SpeedStep Technology • Enables the operating system to program a processor to transition to lower frequency and/or voltage levels while executing a workload • Deep power down technology • Able to reduce static power consumption by turning off power to the cache and other sub-systems in the processor • In-order processing • Gives greater power efficiency: the CPU does not reorder the instruction stream to extract instruction-level parallelism • No DirectCompute support • Tunnel Creek supports only DirectX 9 The next diagram shows the internals of the Atom architecture…
  • 50. Francesco Corazza Focus on Integrated Graphics 67 Atom E600 – Block Diagram Atom does not support DirectCompute, so we have to concentrate on the great flexibility of the architecture…
  • 51. Francesco Corazza Focus on Integrated Graphics 68 Atom E600 – Customization • Open connection • Developers can attach the processor to a variety of chipsets • application-specific third-party chipsets • FPGAs • ASIC • Processor can be used without a chipset (limited I/O needs) • The processor’s four PCIe connections can attach to discrete PCIe peripherals such as Ethernet controllers
  • 52. Francesco Corazza Focus on Integrated Graphics 69 Atom E600 – Conclusion • Interesting features for packet processing • Power saving features • Long support • Flexible architecture • No native support for GPGPU • Old school GPGPU • Use OpenGL ES 2.0 shaders (programmable shaders) • Rewrite the code as a fragment shader • Wait for Cedar Trail (2011 – not yet released) • DirectX 10.1
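Rewriting code as a fragment shader also means moving the input data into textures. The sketch below assumes the data has to travel through 8-bit RGBA texels, which is the common case on OpenGL ES 2.0 without extensions; the type, the function names and the byte order are hypothetical choices for illustration, and no OpenGL call is made.

```c
#include <assert.h>
#include <stdint.h>

/* One RGBA8 texel: four 8-bit channels, as a fragment shader would sample it. */
typedef struct { uint8_t r, g, b, a; } texel_rgba8;

/* Pack a 32-bit word (e.g. an IPv4 address) into one texel.
   Byte order (r = least significant byte) is an arbitrary choice of this sketch. */
static texel_rgba8 pack_word(uint32_t w)
{
    texel_rgba8 t = {
        (uint8_t)(w & 0xFFu),
        (uint8_t)((w >> 8) & 0xFFu),
        (uint8_t)((w >> 16) & 0xFFu),
        (uint8_t)((w >> 24) & 0xFFu)
    };
    return t;
}

/* Inverse of pack_word: rebuild the word from a texel read back from the GPU. */
static uint32_t unpack_word(texel_rgba8 t)
{
    return (uint32_t)t.r
         | ((uint32_t)t.g << 8)
         | ((uint32_t)t.b << 16)
         | ((uint32_t)t.a << 24);
}

/* Number of texture rows needed to hold n_words texels at a given width. */
static unsigned texture_rows(unsigned n_words, unsigned width)
{
    assert(width > 0);
    return (n_words + width - 1) / width;
}
```

Each fragment then processes one texel, so the output image plays the role that the result buffer plays in DirectCompute or OpenCL.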
  • 53. Francesco Corazza Focus on Integrated Graphics 70 Focus on Integrated Graphics • Intel Core 2nd Generation (Sandy Bridge) • Intel Atom E600 Series (Tunnel Creek) • Nvidia Tegra (Tegra 2) • Features • Block Diagram • AMD Fusion
  • 54. Francesco Corazza Focus on Integrated Graphics 71 Tegra – Features • SoC (System on Chip) • Dual-core ARM CPU • GeForce GPU • ULP (ultra-low power) • Graphics support • No DirectX support • No CUDA support • OpenGL ES 2.0 support The next diagram shows an overview of a Tegra chip…
  • 55. Francesco Corazza Focus on Integrated Graphics 72 Tegra – Block Diagram
  • 56. Francesco Corazza Focus on Integrated Graphics 73 Tegra – Conclusion • Interesting features for packet processing • Integrated memory controller • Low power consumption • No native support for GPGPU • Old school GPGPU • Use OpenGL ES 2.0 shaders (programmable shaders) • Rewrite the code as a fragment shader • Wait for Tegra 3 (third quarter of 2011) • DirectX 11 • CUDA
  • 57. Francesco Corazza Focus on Integrated Graphics 74 Focus on Integrated Graphics • Intel Core 2nd Generation (Sandy Bridge) • Intel Atom E600 Series (Tunnel Creek) • Nvidia Tegra (Tegra 2) • AMD Fusion • AMD Vision • Features • APU Roadmap • Integration Highlights
  • 58. Francesco Corazza Focus on Integrated Graphics 75 Fusion – AMD Vision Fusion is a step forward in technology: AMD has realized this heterogeneous architecture by developing APUs…
  • 59. Francesco Corazza Focus on Integrated Graphics 76 Fusion – Features (I) Video
  • 60. Francesco Corazza Focus on Integrated Graphics 77 Fusion – Features (II) • DirectCompute support (DirectX 11) • OpenCL 1.1 • Additive capabilities of an APU and a discrete graphics solution • Power-oriented benefits • Massive SIMD GPU (SSE5) • Programmable scalar and vector processor cores • APU family • Bulldozer (Sandy Bridge’s opponent) • Performance and scalability • Bobcat (Atom’s opponent) Let’s compare these two solutions…
  • 61. Francesco Corazza Focus on Integrated Graphics 79 Fusion – Features (III) Bulldozer and Bobcat also differ in their market target…
  • 62. Francesco Corazza Focus on Integrated Graphics 81 Fusion – APU roadmap The high level of integration differentiates APUs from CPUs…
  • 63. Francesco Corazza Focus on Integrated Graphics 82 Fusion – Integration Highlights • Shared memory • Lower latencies • PCI Express • Cut down some latencies • No discrete GPU, less • Cost • Power • Motherboard complexity
  • 64. Francesco Corazza Focus on Integrated Graphics 83 Fusion – Conclusion • Interesting features for packet processing • OpenCL/DirectCompute/SSE5 • Tightly integrated architecture • New technology (first come, first served) • OpenCL • Could be the “El Dorado” for packet processing • CPU/GPU working in AND/OR configuration • Shared memory • Embedded implementation of Fusion technology • AMD has declared its support for it, to bring the power of heterogeneous computing into the mainstream
  • 66. Francesco Corazza Conclusions 85 Summary (I) This presentation has shown several ways of exploiting integrated graphics and, more generally, consumer architectures for packet processing: • GPGPU-driven solutions • CUDA, OpenCL, DirectX 11 • SIMD-driven solutions • Exploit highly parallel operations through the CPU’s SIMD implementation • AVX, SSE • Custom hardware solutions • Design flexible modules tailored to specific needs • FPGA The first kind of solution (GPGPU-driven) is the most in vogue at the moment…
  • 67. Francesco Corazza Conclusions 86 Summary (II) OpenCL DirectCompute OpenGL SSE FPGA V X X V V (AVX) V X V X V (SSE 3) V X X X V (SSE 3) V V X V V (SSE 5)
  • 68. Francesco Corazza Conclusions 87 Recommendations Writing explicitly parallel code is more efficient than relying on hardware parallelization: