SlideShare a Scribd company logo
1 of 5
Download to read offline
ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012



     A Generic Describing Method of Memory Latency
       Hiding in a High-level Synthesis Technology
                                          Akira Yamawaki and Seiichi Serikawa
                                     Kyushu Institute of Technology, Kitakyushu, Japan
                                        Email: {yama, serikawa}@ecs.kyutech.ac.jp

Abstract—We show a generic describing method of hardware               [8] needs the load/store unit described in HDL which is a tiny
including a memory access controller in a C-based high-level           processor dedicated to memory access. Thus, the high-speed
synthesis technology (Handel-C). In this method, a prefetching         simulation at C level cannot be performed in an early stage
mechanism to improve the performance by hiding memory                  during the development period. We propose a generic method
access latency can be systematically described in C language.
                                                                       to describe the hardware including a functionality hiding
We demonstrate that the proposed method is very simple and
easy through a case study. The experimental result shows               application-specific memory access latency at C language.
that although the proposed method introduces a little hardware         This paper pays attention to the Handel-C [9] that is one of
overhead, it can improve the performance significantly.                the HLS tools that has been widely used for the hardware
                                                                       design.
Index Terms—high-level synthesis, memory access, data                      To hide memory access latency, our method employs the
prefetch, FPGA, Handel-C, latency hiding                               software-pipelining [10] which is widely used in the research
                                                                       domain of the high performance computing. The software-
                        I. INTRODUCTION                                pipelining reconstructs the programming list by copying the
    The system-on-chip (SoC) used for the embedded                     load and store operations in the loop into front of the loop
products must achieve the low cost and respond to the short            and into back of the loop respectively. Since this method is
life-cycle. Thus, to reduce the development burdens such as            very simple, the user can easily use it and describe the
the development period and cost for the SoC, the high-level            hardware with the feature hiding memory access latency.
synthesis (HLS) technologies for the hardware design                   Consequently, the generic method of the C-level memory
converting the high abstract algorithm-level description like          latency hiding can be introduced into the conventional HLS
C, C++, Jave, and MATLAB to a register transfer-level (RTL)            technology.
description in a hardware description language (HDL) have                  Generally, the performance estimation is performed to
been proposed and developed. In addition, many researchers             estimate the effect of the software pipelining. The
in this research domain have noticed that the memory access            conventional method [10] uses the average memory latency
latency has to be hidden to extract the inherent performance           for the processor with a write-back cache memory. For a
of the hardware.                                                       hardware module in an embedded SoC, such cache memory
    Some researchers and developers have proposed the                  is very expensive and cannot be employed. Thus, new
hardware platforms combining the HLS technology with an                performance estimation method is needed. Thus, we propose
efficient memory access controller (MAC). The MACs as a                the new estimation method considering of the hardware
part of the platforms at high-level description such as C, C++         module to be mounted onto the SoC. The rest of the paper is
and MATLAB have been shown [1, 2, 3]. In these platforms,              organized as follows. Section 2 shows the target hardware
the designer has only to write the data processing hardware            architecture. Section 3 describes the load/store functions in
with the simple interfaces communicating with the MACs in              Handel-C, Section 4 describes the templates of the memory
order to access to the memory. However they did not consider           access and data processing. Section 5 demonstrates the
hiding the memory access latency.                                      software-pipelining method to hide memory access latency.
    Ref. [4] describes some memory access schedulers built             Section 6 explains new method of the performance estimation
as hardware to reduce the memory access latency by issuing             based on the hardware architecture shown in Section 2,
the memory commands efficiently and hiding the refresh cycle           considering the load and store for the software-pipelining.
of the DRAM memory. This proposal hides only the latencies             Section 7 shows the experimental results. Finally, Section 8
native to the DRAM such as the RAS-CAS latency, the bank               concludes this paper and remarks the future work.
switching latency and the refresh cycle. Thus, this scheduler
cannot hide the application-specific latency like the streaming
data buffering, the block data buffering, and the window
buffering. Some HLS tools [5, 6, 7, and 8] can describe the
memory access behavior in C language as well as the data
processing hardware. Ref. [5, 6, 7] however have never shown
a generic describing method to hide memory access latency.
Thus, the designers must have a deep knowledge of the HLS
tool used and write the MAC well relying on their skill. Ref.                            Figure.1 Target architecture.
© 2012 ACEEE                                                      51
DOI: 01.IJIT.02.02.44
ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012


        II. TARGET HARDWARE ARCHITECTURE
    Fig. 1 shows the architecture of the target hardware. This
architecture is familiar to the hardware designers and the
HLS technologies. This architecture consists of the memory
access part (MAP), the input/output FIFOs (IF/OF) and the
data processing part (DP). They are written in C language of
the Handel-C. The MAP is a control intensive hardware to
load and store the data in the memory. The MAP accesses to
the memory via wishbone bus [11] in a conventional request-
ready fashion. The MAP has the register file including the
control/status registers, mailbox (MB), of the hardware
module. The designer of the hardware module can arbitrarily
define each register in the MB. Each register can be accessed
by the external devices such as the embedded processor.
The DP is a data processing hardware which processes the
streaming data. Any HLS technology is good at handling the                                       Figure.2 Load function.
streaming hardware.
    The MAP accesses the memory according to the memory
access pattern written in C. The MAP loads the data into the
IF, converting the memory data to the streaming data. The
DP processes the streaming data in the IF and stores the
processed data into the OF as streaming data. The MAP
reads the streaming data in the OF and stores the read data
into the memory according to the memory access pattern.
The MAP and the DP are decoupled by the input/output
FIFOs. Thus, the hardware description is not confused about
the memory access and the data processing. Generally, any
HLS technology has the primitives of the FIFOs.

                III. LOAD/STORE FUNCTION
     Generally, the memory such as SRAM, SDRAM and DDR
SDRAM support the burst transfer which includes the
continuous 4 to 8 words. Thus, we describe the load/store
primitives supporting burst transfer as function calls as shown                                  Figure.3 Store function
in Fig. 2 and Fig. 3 respectively.
                                                                             Then, the bus request is issued setting WE_O to 1 in the
     As for the load function shown in Fig. 2, the bus request
                                                                             lines 3-6. When the acknowledgement (ACK_I) is asserted
is issued setting WE_O to 0 in the lines 2-5. When the
                                                                             by the memory, this function performs the burst transfer of
acknowledgement (ACK_I) is asserted by the memory, this
                                                                             storing in the lines 7-22. During the burst transfer, the OF is
function performs the burst transfer of loading in the lines 6-
                                                                             popped and the popped data is outputted into the output
20. In the Handel-C, “par” performs the statements in the
                                                                             port (DAT_O) one by one per 1 clock. As similar to the load
block in parallel at 1 clock. In contrast, the “seq” performs
                                                                             function, the current address is added by the burst length for
the statements in the block sequentially. Each statement
                                                                             the next transfer in the line 19.
consumes 1 clock. That is, the continuous words in the burst
transfer are pushed into the input FIFO (IF) one by one. The
‘!’ is the primitive pushing into the FIFO in Handel-C. When
the specified FIFO (in this example it is IF) is full, this statement
is blocked until the FIFO has an empty space. When burst
transfer finishes, the current address is added by the burst
length as shown in the line 17 in order to issue the next burst
transfer of loading. As for the store function shown in Fig. 3,
this function attempts to pop the output FIFO (OF) to store
the data processed by the data processing part (DP) in the
line 2. The ‘?’ is the primitive popping the FIFO in the Handel-
C. When the OF is empty, this statement is blocked until the
DP finishes the data processing and pushes the result into
the OF.                                                                                 Figure. 4 Template of memory access part
© 2012 ACEEE                                                            52
DOI: 01.IJIT.02.02.44
ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012




           Figure. 5 Template of data processing part




              Figure. 6 Whole of hardware module

     IV. MEMORY ACCESS AND DATA PROCESSING
                                                                                         Figure. 8 Execution snapshot
    By using load/store functions shown in Fig. 2 and Fig. 3,
the memory access part can be written easily as shown in                   Until all data are processed, above flow is repeated. When
Fig. 4. Since Fig. 4 is a template of the memory access part           the data processing finishes completely, the MAP exits the
(MAP), it can be modified as the designer’s like. The MAP              main loop in the lines 12 to 15 and sets the end flag to 1 in the
cooperates with the data processing part (DP) shown in Fig.            line 16. Consequently, the MAP is blocked in the line 4 and the
5. Fig. 5 is also the template. So, the designer must describe         DP is blocked in the line 7. Thus, this hardware module can
the data processing in the lines 8 to 10.                              be executed with new data.
    In this example, the mailbox #0 (MB[0]) is used for the                Fig. 6 shows how to describe the whole of the hardware
start flag of the hardware module. The MB[4] is used for the           module. The MB ( ) is the function of the behavior of the
end flag indicating that the hardware module finishes the              mailbox as bus slave. Due to paper limitation, its detail is
data processing completely. The MB[1], MB[2] and MB[3]                 omitted. In this program, the MB, the MAP and the DP are
are used for the parameters of the addresses used. By utilizing        executed in parallel.
the mailbox (MB), different memory access patterns can be
realized flexibly.                                                                     V. SOFTWARE PIPELINING
    When the MAP is invoked, it loads the memory by burst                  To hide the memory access latency, the software pipelining
transfer and pushes the loaded words into the input FIFO               has been applied to the original source code of the application
(IF) in the line 14. The DP is blocked by the pop statement to         program [10]. Our proposal applies the software pipelining to
the IF in the line 7. When the MAP pushes the IF, the DP               the program of the memory access part (MAP) as shown in
pops the words in IF to the temporary array (i_dat[i]). Then,          Fig. 4. Fig. 7 shows the overview of the software pipelining
the DP processes the popped data and generates the result              to the MAP program.
into the temporary array (o_dat[i]). At the same time, the                 In the software pipelining, the burst transfer of loading
                                                                       (mem_load) and storing (mem_store) in the main loop are
                                                                       copied to the front of the main loop and the back of it respec-
                                                                       tively. In the main loop, the data used at the next iteration is
                                                                       loaded at the current iteration. Thus, the memory accesses of
                                                                       the MAP are overlapped with the data processing part (DP).
                                                                       In addition, the size of the input and output FIFOs is doubled.

                                                                                     VI. PERFORMANCE ESTIMATION
                                                                           In the conventional software pipelining, its effect is
                  Figure. 7 Software pipelining                        estimated by using the average memory latency [10]. However
MAP is blocked by the pop statement in the line 16 until the           the average memory latency is measured on the processor
DP pushes the result data into the OF. When the DP pushes              with a cache memory. So, it is not applied to the hardware
the processed data into the OF, the MAP can perform the                module used in SoCs. The hardware module generally does
burst transfer of storing.                                             not have a cache memory. So, the memory access latency

© 2012 ACEEE                                                      53
DOI: 01.IJIT.02.02.44
ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012


due to every load and store should be considered. When
Tdp is the data processing time of the data processing part
(DP), the normal execution time (Tne) without software
pipelining can be calculated as following expression.
       Tne  ( Tl  Tdp  Ts )  n .                  (1)
where Tl is the load latency, Ts is the store latency and n is
the number of iterations. The execution snapshot when the
software pipelining is applied is shown as Fig. 8. The NE
means the normal execution and the SP means the software
pipelined execution. The number of each square indicates
the iteration number.
As shown in upper side of Fig. 8, if Tdp is greater than equal
to Tl+Ts, the memory access latency is enough hidden. So,                                Figure. 9 Performance evaluation
the data processing time is dominant of the total execution               The experimental result is shown in Fig. 9 which is the break-
time. In contrast, as shown in lower side of Fig. 8, if Tdp is            down of the execution time of the data processing part (DP).
lower than Tl + Ts, the memory latency affects the performance.           The horizontal axis means the number of clock cycles of the
The total execution time comes closer to the memory                       delay loop. The vertical axis shows the number of clock cycles
bottleneck. The software pipelined execution time (Tsp) can               consumed until the hardware finishes the execution for all
be calculated as follows.                                                 data. The black bar is the stall time of the DP due to the
                                                                          memory latency. The white bar is the clock cycles consumed
       Tdp  n  Tl  Ts , Tdp  ( Tl  Ts ).
Tsp                                                        (2)
       Tdp  ( Tl  Ts )  n , Tdp  ( Tl  Ts ).

When the speedup ratio (Tne / Tsp) is greater than 1, the
software pipelining is effective for the hardware module.

           VII. EXPERIMENT AND DISCUSSION
A. Performance Evaluation
    In order to confirm the effect of the software pipelining to
the performance, we have described the whole of hardware
shown in Fig. 4, Fig. 5, Fig. 6 and Fig. 7 in Handel-C (DK
Design tool 5.4 of Mentor Graphics). For the data processing
part in Fig. 5, we insert the delay loop into the lines 8 to 10 as                           Figure. 10 Speedup ratio
a data processing. Varying the number of clock cycles of the
delay loop, we have measured the execution time by using
the logic simulator (Modelsim10.1 of Mentor Graphics). In
addition, we have assumed that a 32bits width a DDR SDRAM
with the burst length of 4 is used. Its load latency is 8 clock
cycles and its store latency is 9 clock cycles. The number of
iterations has been set to 16384. That is, the data size was
256KB. The clock frequency has been set to 100MHz.




                                                                                          Figure. 11 Total Execution time
                                                                          by the other execution except for the memory stall time. Al-
                                                                          though the clock cycle of the delay loop is zero, the DP needs
                                                                          some clock cycles in addition to the stall time. This is be-
                                                                          cause the overhead pushing and popping FIFOs is included.
                                                                          This overhead is 6 clock cycles per iteration. The hiding
                                                                          memory latency by software pipelining can improve the per-
                                                                          formance significantly compared with the normal execution.
                                                                              Fig. 10 shows the speedup ratio. The ETne and the ETsp
                                                                          mean the calculated normal execution time and the software-
© 2012 ACEEE                                                         54
DOI: 01.IJIT.02.02. 44
ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012


pipelined execution time respectively. The Tne and the Tsp                                   ACKNOWLEDGMENT
are measured results. By hiding memory latency, the speed-
                                                                          This research was partially supported by the Grant-in-
ups of 1.21 to 1.79 can be achieved. In addition, the result
                                                                       Aid for Young Scientists (B) (22700055).
shows that our estimation method can get the same tendency
to the measured results.
                                                                                                 REFERENCES
    Fig. 11 shows the results of the estimated execution time
and the measured execution time. Where the data processing             [1] T. Papenfuss and H. Michel, “A Platform for High Level
time (Tdp) is lower than Tl + Ts =17, the performance is closer        Synthesis of Memory-intensive Image Processing Algorithms,” in
to the memory bottleneck plus the inherent overhead due to             Proc. of the 19th ACM/SIGDA Int’l Symp. on Field Programmable
                                                                       Gate Arrays, pp. 75–78, 2011.
FIFO access, by overlapping the data processing onto the
                                                                       [2] H. Gadke-Lutjens, and B. Thielmann and A. Koch, “A Flexible
memory access. Once the Tdp becomes larger than equal to               Compute and Memory Infrastructure for High-Level Language to
the Tl + Ts=17, the data processing occupies the performance.          Hardware Compilation,” in Proc. of the 2010 Int’l Conf. on Field
Thus, the execution time is increasing as the delay loop               Programmable Logic and Applications, pp. 475–482, 2010.
(computation) is becoming larger.                                      [3] J. S. Kim, L. Deng, P. Mangalagiri, K. Irick, K. Sobti, M.
                                                                       Kandemir, V. Narayanan, C. Chakrabarti, N. Pitsianis, and X. Sun,
B. Hardware Size                                                       “An Automated Framework for Accelerating Numerical Algorithms
    By reconstructing the hardware description as shown in             on Reconfigurable Platforms Using Algorithmic/Architectural
Fig. 7 for applying the software pipelining, the hardware size         Optimization,” in IEEE Transactions on Computers, vol. 58, No.
may increase compared with the normal version. To confirm              12, pp. 1654–1667, December, 2009.
this hardware overhead, we have implemented the Handel-C               [4] A. M. Kulkarni and V. Arunachalam, “FPGA Implementation
description into the FPGA. The target FPGA is Spartan6 and             & Comparison of Current Trends in Memory Scheduler for
                                                                       Multimedia Application,” in Proc. of the Int’l Workshop on
the ISE13.1 is used for implementation. The result shows that
                                                                       Emerging Trends in Technology, pp.1214–1218, 2011.
the software pipelined version uses 1.09 times of logic                [5] Mitrionics, Mitrion User’s Guide 1.5.0-001, Mitrionics, 2008.
resources than the normal version. The difference between              [6] D. Pellerin and S. Thibault, Practical FPGA Programming in
clock frequencies of both versions is about 2% only. Thus,             C, Prentice Hall, 2005.
the hardware overhead due to applying the software                     [7] D. Lau, O. Pritchard and P. Molson, “Automated Generation
pipelining is very small and it can be compensated by                  of Hardware Accelerators with Direct Memory Access from ANSI/
performance improvement.                                               ISO Standard C Functions,” IEEE Symp. on Field-Programmable
                                                                       Custom Computing Machines, pp.45–56, 2006.
                     VIII. CONCLUSION                                  [8] A. Yamawaki and M. Iwane, “High-level Synthesis Method
                                                                       Using Semi-programmable Hardware for C Program with Memory
    We have shown a generic describing method of hardware              Access,” Engineering Letters, Vol. 19, Issue 1, pp. 50–56, 2011.
including a memory access controller in a C-based high-level           [9] Mentor Graphics, “Handel-C Synthesis Methodology,” http:/
synthesis technology, Handel-C. This method is very simple             /www.mentor.com/products/fpga/handel-c/, 2011.
and easy, so any designer can employ the memory hiding                 [10] S. P. Vanderwiel, “Data Prefetch Mechanisms,” ACM
                                                                       Computing Surveys, Vol. 32, No. 2, pp. 174–199, 2000.
method for the design entry in C language level. The
                                                                       [11] OpenCores, Wishbone B4 WISHBONE System-on-Chip (SoC)
experimental result shows that the proposed method can                 Interconnection Architecture for Portable IP Cores, OpenCores,
improve the performance significantly by hiding memory                 2010.
access latency. The new performance estimation can be useful
because the estimated performances have shown the same
tendency to the measured results. The proposed method does
not introduce the significant bad effect to the normal version
hardware. As future work, we plan to apply our method to
other commercial HLS tools. Also, we will use more application
programs and practically integrate hardware modules into a
SoC.




© 2012 ACEEE                                                      55
DOI: 01.IJIT.02.02. 44

More Related Content

What's hot

PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...Ritu Arora
 
Assembly Language Lecture 1
Assembly Language Lecture 1Assembly Language Lecture 1
Assembly Language Lecture 1Motaz Saad
 
Implementation and Design of High Speed FPGA-based Content Addressable Memory
Implementation and Design of High Speed FPGA-based Content Addressable MemoryImplementation and Design of High Speed FPGA-based Content Addressable Memory
Implementation and Design of High Speed FPGA-based Content Addressable Memoryijsrd.com
 
301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilogSrinivas Naidu
 
Hidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.comHidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.comGoa App
 
Time and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data RetentionTime and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data RetentionIJMTST Journal
 
Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013Buvana Buvana
 

What's hot (11)

PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...
 
Tso and ispf
Tso and ispfTso and ispf
Tso and ispf
 
Assembly Language Lecture 1
Assembly Language Lecture 1Assembly Language Lecture 1
Assembly Language Lecture 1
 
Implementation and Design of High Speed FPGA-based Content Addressable Memory
Implementation and Design of High Speed FPGA-based Content Addressable MemoryImplementation and Design of High Speed FPGA-based Content Addressable Memory
Implementation and Design of High Speed FPGA-based Content Addressable Memory
 
301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog301378156 design-of-sram-in-verilog
301378156 design-of-sram-in-verilog
 
Hidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.comHidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.com
 
Time and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data RetentionTime and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
Time and Low Power Operation Using Embedded Dram to Gain Cell Data Retention
 
Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013
 
Embedded _c_
Embedded  _c_Embedded  _c_
Embedded _c_
 
Fg3110541060
Fg3110541060Fg3110541060
Fg3110541060
 
Es notes unit 2
Es notes unit 2Es notes unit 2
Es notes unit 2
 

Viewers also liked

Specification-based Verification of Incomplete Programs
Specification-based Verification of Incomplete ProgramsSpecification-based Verification of Incomplete Programs
Specification-based Verification of Incomplete ProgramsIDES Editor
 
Improvement of Random Forest Classifier through Localization of Persian Handw...
Improvement of Random Forest Classifier through Localization of Persian Handw...Improvement of Random Forest Classifier through Localization of Persian Handw...
Improvement of Random Forest Classifier through Localization of Persian Handw...IDES Editor
 
KidzFrame: Supporting Awareness in the Daycare
KidzFrame: Supporting Awareness in the DaycareKidzFrame: Supporting Awareness in the Daycare
KidzFrame: Supporting Awareness in the DaycareIDES Editor
 
Identification of Reactive Power Reserve in Transmission Network
Identification of Reactive Power Reserve in Transmission NetworkIdentification of Reactive Power Reserve in Transmission Network
Identification of Reactive Power Reserve in Transmission NetworkIDES Editor
 
Resource Identification Using Mobile Queries
Resource Identification Using Mobile QueriesResource Identification Using Mobile Queries
Resource Identification Using Mobile QueriesIDES Editor
 
Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...
Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...
Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...IDES Editor
 
A New Soft-Switched Resonant DC-DC Converter
A New Soft-Switched Resonant DC-DC ConverterA New Soft-Switched Resonant DC-DC Converter
A New Soft-Switched Resonant DC-DC ConverterIDES Editor
 
Development of Dual Frequency Alternator Technology Based Power Source For Mi...
Development of Dual Frequency Alternator Technology Based Power Source For Mi...Development of Dual Frequency Alternator Technology Based Power Source For Mi...
Development of Dual Frequency Alternator Technology Based Power Source For Mi...IDES Editor
 
ANN Based PID Controlled Brushless DC drive System
ANN Based PID Controlled Brushless DC drive SystemANN Based PID Controlled Brushless DC drive System
ANN Based PID Controlled Brushless DC drive SystemIDES Editor
 

Viewers also liked (9)

Specification-based Verification of Incomplete Programs
Specification-based Verification of Incomplete ProgramsSpecification-based Verification of Incomplete Programs
Specification-based Verification of Incomplete Programs
 
Improvement of Random Forest Classifier through Localization of Persian Handw...
Improvement of Random Forest Classifier through Localization of Persian Handw...Improvement of Random Forest Classifier through Localization of Persian Handw...
Improvement of Random Forest Classifier through Localization of Persian Handw...
 
KidzFrame: Supporting Awareness in the Daycare
KidzFrame: Supporting Awareness in the DaycareKidzFrame: Supporting Awareness in the Daycare
KidzFrame: Supporting Awareness in the Daycare
 
Identification of Reactive Power Reserve in Transmission Network
Identification of Reactive Power Reserve in Transmission NetworkIdentification of Reactive Power Reserve in Transmission Network
Identification of Reactive Power Reserve in Transmission Network
 
Resource Identification Using Mobile Queries
Resource Identification Using Mobile QueriesResource Identification Using Mobile Queries
Resource Identification Using Mobile Queries
 
Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...
Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...
Application of Solar Powered High Voltage Discharge Plasma for NOX Removal in...
 
A New Soft-Switched Resonant DC-DC Converter
A New Soft-Switched Resonant DC-DC ConverterA New Soft-Switched Resonant DC-DC Converter
A New Soft-Switched Resonant DC-DC Converter
 
Development of Dual Frequency Alternator Technology Based Power Source For Mi...
Development of Dual Frequency Alternator Technology Based Power Source For Mi...Development of Dual Frequency Alternator Technology Based Power Source For Mi...
Development of Dual Frequency Alternator Technology Based Power Source For Mi...
 
ANN Based PID Controlled Brushless DC drive System
ANN Based PID Controlled Brushless DC drive SystemANN Based PID Controlled Brushless DC drive System
ANN Based PID Controlled Brushless DC drive System
 

Similar to A Generic Describing Method of Memory Latency Hiding in a High-level Synthesis Technology

VTU University Micro Controllers-06ES42 lecturer Notes
VTU University Micro Controllers-06ES42 lecturer NotesVTU University Micro Controllers-06ES42 lecturer Notes
VTU University Micro Controllers-06ES42 lecturer Notes24x7house
 
ARM Embeded_Firmware.pdf
ARM Embeded_Firmware.pdfARM Embeded_Firmware.pdf
ARM Embeded_Firmware.pdfhakilic1
 
Instruction Set Architecture
Instruction Set ArchitectureInstruction Set Architecture
Instruction Set ArchitectureJaffer Haadi
 
Ch2 embedded processors-i
Ch2 embedded processors-iCh2 embedded processors-i
Ch2 embedded processors-iAnkit Shah
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDhritiman Halder
 
Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)Danilo Calle
 
8086 MICROPROCESSOR
8086 MICROPROCESSOR8086 MICROPROCESSOR
8086 MICROPROCESSORAlxus Shuvo
 
Computer Arithmetic and Processor Basics
Computer Arithmetic and Processor BasicsComputer Arithmetic and Processor Basics
Computer Arithmetic and Processor BasicsShinuMMAEI
 
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...csandit
 
soc ip core based for spacecraft application
soc ip core based for spacecraft applicationsoc ip core based for spacecraft application
soc ip core based for spacecraft applicationnavyashree pari
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessorsmumbahelp
 
A NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdf
A NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdfA NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdf
A NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdfSaiReddy794166
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfanil0878
 

Similar to A Generic Describing Method of Memory Latency Hiding in a High-level Synthesis Technology (20)

VTU University Micro Controllers-06ES42 lecturer Notes
VTU University Micro Controllers-06ES42 lecturer NotesVTU University Micro Controllers-06ES42 lecturer Notes
VTU University Micro Controllers-06ES42 lecturer Notes
 
ARM Embeded_Firmware.pdf
ARM Embeded_Firmware.pdfARM Embeded_Firmware.pdf
ARM Embeded_Firmware.pdf
 
Instruction Set Architecture
Instruction Set ArchitectureInstruction Set Architecture
Instruction Set Architecture
 
Bc0040
Bc0040Bc0040
Bc0040
 
Ch2 embedded processors-i
Ch2 embedded processors-iCh2 embedded processors-i
Ch2 embedded processors-i
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache Memory
 
chameleon chip
chameleon chipchameleon chip
chameleon chip
 
Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)
 
D031201021027
D031201021027D031201021027
D031201021027
 
Co question 2008
Co question 2008Co question 2008
Co question 2008
 
8086 MICROPROCESSOR
8086 MICROPROCESSOR8086 MICROPROCESSOR
8086 MICROPROCESSOR
 
Computer Arithmetic and Processor Basics
Computer Arithmetic and Processor BasicsComputer Arithmetic and Processor Basics
Computer Arithmetic and Processor Basics
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
 
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
 
soc ip core based for spacecraft application
soc ip core based for spacecraft applicationsoc ip core based for spacecraft application
soc ip core based for spacecraft application
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessor
 
A NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdf
A NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdfA NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdf
A NETWORK-BASED DAC OPTIMIZATION PROTOTYPE SOFTWARE 2 (1).pdf
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
 

More from IDES Editor

Power System State Estimation - A Review
Power System State Estimation - A ReviewPower System State Estimation - A Review
Power System State Estimation - A ReviewIDES Editor
 
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Artificial Intelligence Technique based Reactive Power Planning Incorporating...Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Artificial Intelligence Technique based Reactive Power Planning Incorporating...IDES Editor
 
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...IDES Editor
 
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...IDES Editor
 
Line Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFCLine Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFCIDES Editor
 
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...IDES Editor
 
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
Assessing Uncertainty of Pushover Analysis to Geometric ModelingAssessing Uncertainty of Pushover Analysis to Geometric Modeling
Assessing Uncertainty of Pushover Analysis to Geometric ModelingIDES Editor
 
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...IDES Editor
 
Selfish Node Isolation & Incentivation using Progressive Thresholds
Selfish Node Isolation & Incentivation using Progressive ThresholdsSelfish Node Isolation & Incentivation using Progressive Thresholds
Selfish Node Isolation & Incentivation using Progressive ThresholdsIDES Editor
 
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...IDES Editor
 
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...IDES Editor
 
Cloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability FrameworkCloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability FrameworkIDES Editor
 
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP BotnetGenetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP BotnetIDES Editor
 
Enhancing Data Storage Security in Cloud Computing Through Steganography
Enhancing Data Storage Security in Cloud Computing Through SteganographyEnhancing Data Storage Security in Cloud Computing Through Steganography
Enhancing Data Storage Security in Cloud Computing Through SteganographyIDES Editor
 
Low Energy Routing for WSN’s
Low Energy Routing for WSN’sLow Energy Routing for WSN’s
Low Energy Routing for WSN’sIDES Editor
 
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...IDES Editor
 
Rotman Lens Performance Analysis
Rotman Lens Performance AnalysisRotman Lens Performance Analysis
Rotman Lens Performance AnalysisIDES Editor
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesIDES Editor
 
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...IDES Editor
 
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...IDES Editor
 

More from IDES Editor (20)

Power System State Estimation - A Review
Power System State Estimation - A ReviewPower System State Estimation - A Review
Power System State Estimation - A Review
 
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Artificial Intelligence Technique based Reactive Power Planning Incorporating...Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
 
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
 
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
 
Line Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFCLine Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFC
 
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
 
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
Assessing Uncertainty of Pushover Analysis to Geometric ModelingAssessing Uncertainty of Pushover Analysis to Geometric Modeling
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
 
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
 
Selfish Node Isolation & Incentivation using Progressive Thresholds
Selfish Node Isolation & Incentivation using Progressive ThresholdsSelfish Node Isolation & Incentivation using Progressive Thresholds
Selfish Node Isolation & Incentivation using Progressive Thresholds
 
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
 
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
 
Cloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability FrameworkCloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability Framework
 
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP BotnetGenetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
 
Enhancing Data Storage Security in Cloud Computing Through Steganography
Enhancing Data Storage Security in Cloud Computing Through SteganographyEnhancing Data Storage Security in Cloud Computing Through Steganography
Enhancing Data Storage Security in Cloud Computing Through Steganography
 
Low Energy Routing for WSN’s
Low Energy Routing for WSN’sLow Energy Routing for WSN’s
Low Energy Routing for WSN’s
 
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
 
Rotman Lens Performance Analysis
Rotman Lens Performance AnalysisRotman Lens Performance Analysis
Rotman Lens Performance Analysis
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
 
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
 
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
 

Recently uploaded

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

A Generic Describing Method of Memory Latency Hiding in a High-level Synthesis Technology

  • 1. ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012 A Generic Describing Method of Memory Latency Hiding in a High-level Synthesis Technology Akira Yamawaki and Seiichi Serikawa Kyushu Institute of Technology, Kitakyushu, Japan Email: {yama, serikawa}@ecs.kyutech.ac.jp Abstract—We show a generic describing method of hardware [8] needs the load/store unit described in HDL which is a tiny including a memory access controller in a C-based high-level processor dedicated to memory access. Thus, the high-speed synthesis technology (Handel-C). In this method, a prefetching simulation at C level cannot be performed in an early stage mechanism to improve the performance by hiding memory during the development period. We propose a generic method access latency can be systematically described in C language. to describe the hardware including a functionality hiding We demonstrate that the proposed method is very simple and easy through a case study. The experimental result shows application-specific memory access latency at C language. that although the proposed method introduces a little hardware This paper pays attention to the Handel-C [9] that is one of overhead, it can improve the performance significantly. the HLS tools that has been widely used for the hardware design. Index Terms—high-level synthesis, memory access, data To hide memory access latency, our method employs the prefetch, FPGA, Handel-C, latency hiding software-pipelining [10] which is widely used in the research domain of the high performance computing. The software- I. INTRODUCTION pipelining reconstructs the programming list by copying the The system-on-chip (SoC) used for the embedded load and store operations in the loop into front of the loop products must achieve the low cost and respond to the short and into back of the loop respectively. Since this method is life-cycle. Thus, to reduce the development burdens such as very simple, the user can easily use it and describe the the development period and cost for the SoC, the high-level hardware with the feature hiding memory access latency. synthesis (HLS) technologies for the hardware design Consequently, the generic method of the C-level memory converting the high abstract algorithm-level description like latency hiding can be introduced into the conventional HLS C, C++, Jave, and MATLAB to a register transfer-level (RTL) technology. description in a hardware description language (HDL) have Generally, the performance estimation is performed to been proposed and developed. In addition, many researchers estimate the effect of the software pipelining. The in this research domain have noticed that the memory access conventional method [10] uses the average memory latency latency has to be hidden to extract the inherent performance for the processor with a write-back cache memory. For a of the hardware. hardware module in an embedded SoC, such cache memory Some researchers and developers have proposed the is very expensive and cannot be employed. Thus, new hardware platforms combining the HLS technology with an performance estimation method is needed. Thus, we propose efficient memory access controller (MAC). The MACs as a the new estimation method considering of the hardware part of the platforms at high-level description such as C, C++ module to be mounted onto the SoC. The rest of the paper is and MATLAB have been shown [1, 2, 3]. In these platforms, organized as follows. Section 2 shows the target hardware the designer has only to write the data processing hardware architecture. Section 3 describes the load/store functions in with the simple interfaces communicating with the MACs in Handel-C, Section 4 describes the templates of the memory order to access to the memory. However they did not consider access and data processing. Section 5 demonstrates the hiding the memory access latency. software-pipelining method to hide memory access latency. Ref. [4] describes some memory access schedulers built Section 6 explains new method of the performance estimation as hardware to reduce the memory access latency by issuing based on the hardware architecture shown in Section 2, the memory commands efficiently and hiding the refresh cycle considering the load and store for the software-pipelining. of the DRAM memory. This proposal hides only the latencies Section 7 shows the experimental results. Finally, Section 8 native to the DRAM such as the RAS-CAS latency, the bank concludes this paper and remarks the future work. switching latency and the refresh cycle. Thus, this scheduler cannot hide the application-specific latency like the streaming data buffering, the block data buffering, and the window buffering. Some HLS tools [5, 6, 7, and 8] can describe the memory access behavior in C language as well as the data processing hardware. Ref. [5, 6, 7] however have never shown a generic describing method to hide memory access latency. Thus, the designers must have a deep knowledge of the HLS tool used and write the MAC well relying on their skill. Ref. Figure.1 Target architecture. © 2012 ACEEE 51 DOI: 01.IJIT.02.02.44
  • 2. ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012 II. TARGET HARDWARE ARCHITECTURE Fig. 1 shows the architecture of the target hardware. This architecture is familiar to the hardware designers and the HLS technologies. This architecture consists of the memory access part (MAP), the input/output FIFOs (IF/OF) and the data processing part (DP). They are written in C language of the Handel-C. The MAP is a control intensive hardware to load and store the data in the memory. The MAP accesses to the memory via wishbone bus [11] in a conventional request- ready fashion. The MAP has the register file including the control/status registers, mailbox (MB), of the hardware module. The designer of the hardware module can arbitrarily define each register in the MB. Each register can be accessed by the external devices such as the embedded processor. The DP is a data processing hardware which processes the streaming data. Any HLS technology is good at handling the Figure.2 Load function. streaming hardware. The MAP accesses the memory according to the memory access pattern written in C. The MAP loads the data into the IF, converting the memory data to the streaming data. The DP processes the streaming data in the IF and stores the processed data into the OF as streaming data. The MAP reads the streaming data in the OF and stores the read data into the memory according to the memory access pattern. The MAP and the DP are decoupled by the input/output FIFOs. Thus, the hardware description is not confused about the memory access and the data processing. Generally, any HLS technology has the primitives of the FIFOs. III. LOAD/STORE FUNCTION Generally, the memory such as SRAM, SDRAM and DDR SDRAM support the burst transfer which includes the continuous 4 to 8 words. Thus, we describe the load/store primitives supporting burst transfer as function calls as shown Figure.3 Store function in Fig. 2 and Fig. 3 respectively. Then, the bus request is issued setting WE_O to 1 in the As for the load function shown in Fig. 2, the bus request lines 3-6. When the acknowledgement (ACK_I) is asserted is issued setting WE_O to 0 in the lines 2-5. When the by the memory, this function performs the burst transfer of acknowledgement (ACK_I) is asserted by the memory, this storing in the lines 7-22. During the burst transfer, the OF is function performs the burst transfer of loading in the lines 6- popped and the popped data is outputted into the output 20. In the Handel-C, “par” performs the statements in the port (DAT_O) one by one per 1 clock. As similar to the load block in parallel at 1 clock. In contrast, the “seq” performs function, the current address is added by the burst length for the statements in the block sequentially. Each statement the next transfer in the line 19. consumes 1 clock. That is, the continuous words in the burst transfer are pushed into the input FIFO (IF) one by one. The ‘!’ is the primitive pushing into the FIFO in Handel-C. When the specified FIFO (in this example it is IF) is full, this statement is blocked until the FIFO has an empty space. When burst transfer finishes, the current address is added by the burst length as shown in the line 17 in order to issue the next burst transfer of loading. As for the store function shown in Fig. 3, this function attempts to pop the output FIFO (OF) to store the data processed by the data processing part (DP) in the line 2. The ‘?’ is the primitive popping the FIFO in the Handel- C. When the OF is empty, this statement is blocked until the DP finishes the data processing and pushes the result into the OF. Figure. 4 Template of memory access part © 2012 ACEEE 52 DOI: 01.IJIT.02.02.44
  • 3. ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012 Figure. 5 Template of data processing part Figure. 6 Whole of hardware module IV. MEMORY ACCESS AND DATA PROCESSING Figure. 8 Execution snapshot By using load/store functions shown in Fig. 2 and Fig. 3, the memory access part can be written easily as shown in Until all data are processed, above flow is repeated. When Fig. 4. Since Fig. 4 is a template of the memory access part the data processing finishes completely, the MAP exits the (MAP), it can be modified as the designer’s like. The MAP main loop in the lines 12 to 15 and sets the end flag to 1 in the cooperates with the data processing part (DP) shown in Fig. line 16. Consequently, the MAP is blocked in the line 4 and the 5. Fig. 5 is also the template. So, the designer must describe DP is blocked in the line 7. Thus, this hardware module can the data processing in the lines 8 to 10. be executed with new data. In this example, the mailbox #0 (MB[0]) is used for the Fig. 6 shows how to describe the whole of the hardware start flag of the hardware module. The MB[4] is used for the module. The MB ( ) is the function of the behavior of the end flag indicating that the hardware module finishes the mailbox as bus slave. Due to paper limitation, its detail is data processing completely. The MB[1], MB[2] and MB[3] omitted. In this program, the MB, the MAP and the DP are are used for the parameters of the addresses used. By utilizing executed in parallel. the mailbox (MB), different memory access patterns can be realized flexibly. V. SOFTWARE PIPELINING When the MAP is invoked, it loads the memory by burst To hide the memory access latency, the software pipelining transfer and pushes the loaded words into the input FIFO has been applied to the original source code of the application (IF) in the line 14. The DP is blocked by the pop statement to program [10]. Our proposal applies the software pipelining to the IF in the line 7. When the MAP pushes the IF, the DP the program of the memory access part (MAP) as shown in pops the words in IF to the temporary array (i_dat[i]). Then, Fig. 4. Fig. 7 shows the overview of the software pipelining the DP processes the popped data and generates the result to the MAP program. into the temporary array (o_dat[i]). At the same time, the In the software pipelining, the burst transfer of loading (mem_load) and storing (mem_store) in the main loop are copied to the front of the main loop and the back of it respec- tively. In the main loop, the data used at the next iteration is loaded at the current iteration. Thus, the memory accesses of the MAP are overlapped with the data processing part (DP). In addition, the size of the input and output FIFOs is doubled. VI. PERFORMANCE ESTIMATION In the conventional software pipelining, its effect is Figure. 7 Software pipelining estimated by using the average memory latency [10]. However MAP is blocked by the pop statement in the line 16 until the the average memory latency is measured on the processor DP pushes the result data into the OF. When the DP pushes with a cache memory. So, it is not applied to the hardware the processed data into the OF, the MAP can perform the module used in SoCs. The hardware module generally does burst transfer of storing. not have a cache memory. So, the memory access latency © 2012 ACEEE 53 DOI: 01.IJIT.02.02.44
  • 4. ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012 due to every load and store should be considered. When Tdp is the data processing time of the data processing part (DP), the normal execution time (Tne) without software pipelining can be calculated as following expression. Tne  ( Tl  Tdp  Ts )  n . (1) where Tl is the load latency, Ts is the store latency and n is the number of iterations. The execution snapshot when the software pipelining is applied is shown as Fig. 8. The NE means the normal execution and the SP means the software pipelined execution. The number of each square indicates the iteration number. As shown in upper side of Fig. 8, if Tdp is greater than equal to Tl+Ts, the memory access latency is enough hidden. So, Figure. 9 Performance evaluation the data processing time is dominant of the total execution The experimental result is shown in Fig. 9 which is the break- time. In contrast, as shown in lower side of Fig. 8, if Tdp is down of the execution time of the data processing part (DP). lower than Tl + Ts, the memory latency affects the performance. The horizontal axis means the number of clock cycles of the The total execution time comes closer to the memory delay loop. The vertical axis shows the number of clock cycles bottleneck. The software pipelined execution time (Tsp) can consumed until the hardware finishes the execution for all be calculated as follows. data. The black bar is the stall time of the DP due to the memory latency. The white bar is the clock cycles consumed  Tdp  n  Tl  Ts , Tdp  ( Tl  Ts ). Tsp   (2)  Tdp  ( Tl  Ts )  n , Tdp  ( Tl  Ts ). When the speedup ratio (Tne / Tsp) is greater than 1, the software pipelining is effective for the hardware module. VII. EXPERIMENT AND DISCUSSION A. Performance Evaluation In order to confirm the effect of the software pipelining to the performance, we have described the whole of hardware shown in Fig. 4, Fig. 5, Fig. 6 and Fig. 7 in Handel-C (DK Design tool 5.4 of Mentor Graphics). For the data processing part in Fig. 5, we insert the delay loop into the lines 8 to 10 as Figure. 10 Speedup ratio a data processing. Varying the number of clock cycles of the delay loop, we have measured the execution time by using the logic simulator (Modelsim10.1 of Mentor Graphics). In addition, we have assumed that a 32bits width a DDR SDRAM with the burst length of 4 is used. Its load latency is 8 clock cycles and its store latency is 9 clock cycles. The number of iterations has been set to 16384. That is, the data size was 256KB. The clock frequency has been set to 100MHz. Figure. 11 Total Execution time by the other execution except for the memory stall time. Al- though the clock cycle of the delay loop is zero, the DP needs some clock cycles in addition to the stall time. This is be- cause the overhead pushing and popping FIFOs is included. This overhead is 6 clock cycles per iteration. The hiding memory latency by software pipelining can improve the per- formance significantly compared with the normal execution. Fig. 10 shows the speedup ratio. The ETne and the ETsp mean the calculated normal execution time and the software- © 2012 ACEEE 54 DOI: 01.IJIT.02.02. 44
  • 5. ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012 pipelined execution time respectively. The Tne and the Tsp ACKNOWLEDGMENT are measured results. By hiding memory latency, the speed- This research was partially supported by the Grant-in- ups of 1.21 to 1.79 can be achieved. In addition, the result Aid for Young Scientists (B) (22700055). shows that our estimation method can get the same tendency to the measured results. REFERENCES Fig. 11 shows the results of the estimated execution time and the measured execution time. Where the data processing [1] T. Papenfuss and H. Michel, “A Platform for High Level time (Tdp) is lower than Tl + Ts =17, the performance is closer Synthesis of Memory-intensive Image Processing Algorithms,” in to the memory bottleneck plus the inherent overhead due to Proc. of the 19th ACM/SIGDA Int’l Symp. on Field Programmable Gate Arrays, pp. 75–78, 2011. FIFO access, by overlapping the data processing onto the [2] H. Gadke-Lutjens, and B. Thielmann and A. Koch, “A Flexible memory access. Once the Tdp becomes larger than equal to Compute and Memory Infrastructure for High-Level Language to the Tl + Ts=17, the data processing occupies the performance. Hardware Compilation,” in Proc. of the 2010 Int’l Conf. on Field Thus, the execution time is increasing as the delay loop Programmable Logic and Applications, pp. 475–482, 2010. (computation) is becoming larger. [3] J. S. Kim, L. Deng, P. Mangalagiri, K. Irick, K. Sobti, M. Kandemir, V. Narayanan, C. Chakrabarti, N. Pitsianis, and X. Sun, B. Hardware Size “An Automated Framework for Accelerating Numerical Algorithms By reconstructing the hardware description as shown in on Reconfigurable Platforms Using Algorithmic/Architectural Fig. 7 for applying the software pipelining, the hardware size Optimization,” in IEEE Transactions on Computers, vol. 58, No. may increase compared with the normal version. To confirm 12, pp. 1654–1667, December, 2009. this hardware overhead, we have implemented the Handel-C [4] A. M. Kulkarni and V. Arunachalam, “FPGA Implementation description into the FPGA. The target FPGA is Spartan6 and & Comparison of Current Trends in Memory Scheduler for Multimedia Application,” in Proc. of the Int’l Workshop on the ISE13.1 is used for implementation. The result shows that Emerging Trends in Technology, pp.1214–1218, 2011. the software pipelined version uses 1.09 times of logic [5] Mitrionics, Mitrion User’s Guide 1.5.0-001, Mitrionics, 2008. resources than the normal version. The difference between [6] D. Pellerin and S. Thibault, Practical FPGA Programming in clock frequencies of both versions is about 2% only. Thus, C, Prentice Hall, 2005. the hardware overhead due to applying the software [7] D. Lau, O. Pritchard and P. Molson, “Automated Generation pipelining is very small and it can be compensated by of Hardware Accelerators with Direct Memory Access from ANSI/ performance improvement. ISO Standard C Functions,” IEEE Symp. on Field-Programmable Custom Computing Machines, pp.45–56, 2006. VIII. CONCLUSION [8] A. Yamawaki and M. Iwane, “High-level Synthesis Method Using Semi-programmable Hardware for C Program with Memory We have shown a generic describing method of hardware Access,” Engineering Letters, Vol. 19, Issue 1, pp. 50–56, 2011. including a memory access controller in a C-based high-level [9] Mentor Graphics, “Handel-C Synthesis Methodology,” http:/ synthesis technology, Handel-C. This method is very simple /www.mentor.com/products/fpga/handel-c/, 2011. and easy, so any designer can employ the memory hiding [10] S. P. Vanderwiel, “Data Prefetch Mechanisms,” ACM Computing Surveys, Vol. 32, No. 2, pp. 174–199, 2000. method for the design entry in C language level. The [11] OpenCores, Wishbone B4 WISHBONE System-on-Chip (SoC) experimental result shows that the proposed method can Interconnection Architecture for Portable IP Cores, OpenCores, improve the performance significantly by hiding memory 2010. access latency. The new performance estimation can be useful because the estimated performances have shown the same tendency to the measured results. The proposed method does not introduce the significant bad effect to the normal version hardware. As future work, we plan to apply our method to other commercial HLS tools. Also, we will use more application programs and practically integrate hardware modules into a SoC. © 2012 ACEEE 55 DOI: 01.IJIT.02.02. 44