2. HIGH LEVEL OPTIMIZATIONS IN CODE
1. Floating-point to fixed-point conversion
2. Simple loop transformations
3. Loop tiling/blocking
4. Loop Splitting
5. Array Folding
3. CODE OPTIMIZATION
1) Floating-point to Fixed-point Conversion:
• Reduces cycle count by 75% and energy consumption
by 76% for an MPEG-2 video compression algorithm.
• Involves a trade-off between the cost of
implementation and the quality of the algorithm.
• Done using Fixed-C data types.
• E.g. fixed a, *b, c[8];
a = fixed(5, 4, s, wt, *b);
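Fixed-C is not part of standard C, so the conversion above cannot be compiled as written. The following is a minimal sketch in plain C of the same idea, assuming a hypothetical Q15 format (1 sign bit, 15 fractional bits). The names q15_t, q15_from_float, q15_to_float and q15_mul are illustrative only and do not come from the Fixed-C tool.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q15 format: value = raw / 2^15, range [-1, 1). */
typedef int16_t q15_t;

/* Convert a float to Q15, rounding to nearest and saturating. */
static q15_t q15_from_float(float x)
{
    assert(x >= -1.0f && x <= 1.0f);  /* Q15 cannot represent more */
    int32_t v = (int32_t)(x * 32768.0f + (x >= 0.0f ? 0.5f : -0.5f));
    if (v > INT16_MAX) v = INT16_MAX; /* +1.0 saturates to 32767 */
    if (v < INT16_MIN) v = INT16_MIN;
    return (q15_t)v;
}

/* Convert Q15 back to float (exact, since 2^15 is a power of two). */
static float q15_to_float(q15_t q)
{
    return (float)q / 32768.0f;
}

/* Multiply two Q15 values using a 32-bit intermediate product.
 * Assumes >> on a negative int32_t is an arithmetic shift, which is
 * implementation-defined in C but holds on common compilers. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */
    p += 1 << 14;                         /* round to nearest */
    p >>= 15;                             /* back to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;     /* only for (-1) * (-1) */
    return (q15_t)p;
}
```

The integer multiply and shift replace a floating-point multiply, which is where the cycle and energy savings come from on processors without a floating-point unit. The price is the limited range and precision, which is the cost versus quality trade-off noted above.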
2) Array Folding:
• Options for reducing the storage requirements of
large arrays must be explored, since memory space
is limited in embedded systems.
• Inter-array folding shares memory space among
arrays that are not needed during overlapping
time intervals.
• Intra-array folding exploits the fact that only a
limited subset of the elements of an array is
needed at any one time.
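Both folding ideas above can be sketched in a few lines of C. This is an illustration under assumed sizes (N, K) and with hypothetical names (folded, phase1_sum, phase2_sum_of_squares, last_k_sum); it is not taken from a specific tool. The union shares storage between two arrays with disjoint lifetimes, and the ring buffer keeps only the subset of elements that is needed at a time.

```c
#include <assert.h>
#include <stdint.h>

#define N 64   /* hypothetical array length */
#define K 4    /* hypothetical window length */

/* Folding across arrays: 'samples' and 'squares' are never live at
 * the same time, so they share one block of storage. */
static union {
    int16_t samples[N];      /* live in phase 1 only */
    int32_t squares[N / 2];  /* live in phase 2 only */
} folded;

static_assert(sizeof folded == N * sizeof(int16_t),
              "folded storage is no larger than one array");

static int32_t phase1_sum(void)
{
    int32_t sum = 0;
    for (int i = 0; i < N; i++) folded.samples[i] = (int16_t)i;
    for (int i = 0; i < N; i++) sum += folded.samples[i];
    return sum;              /* 'samples' is dead from here on */
}

static int32_t phase2_sum_of_squares(void)
{
    int32_t sum = 0;
    for (int i = 0; i < N / 2; i++) folded.squares[i] = i * i;
    for (int i = 0; i < N / 2; i++) sum += folded.squares[i];
    return sum;
}

/* Folding within an array: only the last K values are needed, so
 * logical index i is folded onto physical index i % K.
 * Feeds the values 0..n-1 and returns the sum of the last K. */
static int32_t last_k_sum(int n)
{
    int32_t ring[K] = {0};
    int32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum -= ring[i % K];  /* value leaving the window */
        ring[i % K] = i;     /* overwrite it with the new value */
        sum += i;
    }
    return sum;
}
```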
5. CODE OPTIMIZATION
3) Loop Tiling/Blocking:
• It is essential to reuse data held in “small”
memories, including caches and scratch-pad
memories.
• Blocked or tiled algorithms improve locality of
reference.
• The innermost loop becomes restricted, as it
accesses fewer array elements.
• If a proper blocking factor is selected, the
elements are still in the cache when the next
iteration of the innermost loop starts.
• Improves performance for matrix multiplication by
reducing the number of memory references through
data reuse (the reuse factor).
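A tiled matrix multiplication might look like the sketch below. The matrix size N and blocking factor B are assumptions chosen for illustration; in practice B would be tuned so that a B x B block fits in the cache or scratch-pad memory. The function name matmul_tiled is hypothetical.

```c
#include <assert.h>

#define N 8   /* matrix dimension (hypothetical) */
#define B 4   /* blocking factor (hypothetical) */

static_assert(N % B == 0, "blocking factor must divide N");

/* Computes c = a * b, visiting b in B x B tiles so that each tile
 * is reused for every row i before the next tile is touched. */
static void matmul_tiled(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0;

    for (int kk = 0; kk < N; kk += B)          /* tile row of b */
        for (int jj = 0; jj < N; jj += B)      /* tile column of b */
            for (int i = 0; i < N; i++)
                for (int k = kk; k < kk + B; k++) {
                    int r = a[i][k];           /* kept in a register */
                    /* restricted innermost loop: only B elements */
                    for (int j = jj; j < jj + B; j++)
                        c[i][j] += r * b[k][j];
                }
}
```

The result is identical to the untiled triple loop; only the order in which elements are visited changes.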
4) Loop Splitting:
• The efficiency of an algorithm improves if loops
are split so that one loop body handles the
regular cases and a second one handles the
exceptions.
• Splitting nested loops can save cycles for
various applications and target processors.
• Cycle count can be reduced by up to 75%.
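As a small illustration of loop splitting, consider a 3-point averaging filter whose first and last elements need special treatment. The filter, the length LEN and the function names are assumptions made for this sketch. The first version tests for the boundary on every iteration; the split version handles the two exceptions outside the loop, so the regular loop body is free of conditionals.

```c
#include <assert.h>

#define LEN 16   /* hypothetical signal length */

static_assert(LEN >= 2, "filter needs at least two samples");

/* Unsplit form: the boundary tests run on every iteration. */
static void smooth_checked(const int in[LEN], int out[LEN])
{
    for (int i = 0; i < LEN; i++) {
        int left  = (i > 0)       ? in[i - 1] : in[i];
        int right = (i < LEN - 1) ? in[i + 1] : in[i];
        out[i] = (left + in[i] + right) / 3;
    }
}

/* Split form: exceptions handled separately, regular cases in a
 * branch-free loop body. */
static void smooth_split(const int in[LEN], int out[LEN])
{
    out[0] = (in[0] + in[0] + in[1]) / 3;              /* exception */
    for (int i = 1; i < LEN - 1; i++)                  /* regular */
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;
    out[LEN - 1] = (in[LEN - 2] + in[LEN - 1] + in[LEN - 1]) / 3;
}
```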
6. CODE OPTIMIZATION
Simple Loop Transformations:
Loop Permutation
• Helps in reuse of array elements in the cache, as
the next iteration of the loop body will access
an adjacent location in memory.
Loop Fusion, Loop Fission
• Two loops can be merged into a single loop:
Loop Fusion.
• A single loop can be split into two loops:
Loop Fission.
Loop Unrolling
• The number of copies of the loop body is called
the unrolling factor (2 or more).
• Reduces loop overhead (fewer branches per
executed loop body) and improves speed, but
increases code size.
• Restricted to loops with a constant number of
iterations.
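The three transformations can each be shown on a toy loop. The sketch below assumes small, fixed array sizes and uses hypothetical function names; it shows the loops after permutation, fusion and unrolling, with the untransformed form described in the comments.

```c
#include <assert.h>

#define ROWS 4
#define COLS 8
#define LEN  8

static_assert(LEN % 2 == 0, "unrolling by 2 needs an even trip count");

/* Loop permutation: C stores arrays row by row, so the column index
 * j is made the innermost loop and consecutive iterations touch
 * adjacent memory. (The permuted-away form iterates j outside i.) */
static long sum_row_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

/* Loop fusion: two separate loops over 0..LEN-1 (one scaling x, one
 * computing y from x) merged into a single loop. Fission is the
 * reverse step. */
static void scale_and_offset_fused(int x[LEN], int y[LEN])
{
    for (int i = 0; i < LEN; i++) {
        x[i] *= 2;
        y[i] = x[i] + 1;
    }
}

/* Loop unrolling with unrolling factor 2: half as many loop
 * branches, at the cost of a larger loop body. */
static long sum_unrolled2(const int v[LEN])
{
    long sum = 0;
    for (int i = 0; i < LEN; i += 2) {
        sum += v[i];
        sum += v[i + 1];
    }
    return sum;
}
```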
7. EMBEDDED C FOR HIGH
PERFORMANCE DSP PROGRAMMING
• Performance is the key to digital signal
processing because it translates into
application-based end-user systems.
• Changes in technological and economic
requirements make it more and more
expensive to continue programming the DSP
processor in assembly languages.
• DSP architectures are not easy to program
optimally due to their non-orthogonality.
8.
• Stronger error correction and encryption
algorithms must be added to match up to the
increased complexity in DSP.
• Communication protocols have become more
sophisticated and require much more code to
implement.
• Multiple protocol stacks have been
implemented to be compatible with multiple
service providers.
• In addition, backward compatibility with older
protocols is also needed to stay synchronized
with provider networks that are in a slow
process of upgrading.
9. ENTERING WITH EMBEDDED C
• Embedded C is designed to bridge the
performance mismatch between the signal
processing algorithms, standard C and the
architecture.
• It is an extension of the C language with the
primitives that are needed by signal processing
applications and that are commonly provided by
DSP processors.
• Maintainability and portability of code are the
key winners in this process.
10. REQUIREMENTS FOR I/O HARDWARE
ADDRESSING INTERFACE
1. The device driver source code must be
portable.
2. The interface must not prevent
implementations from producing machine code
that is as efficient as other methods.
3. The design should permit encapsulation of
the system-dependent access method.
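One way to meet these three requirements is to hide the access method behind small inline functions, so that only one definition changes per target. The sketch below is an assumption-laden illustration: io_read8, io_write8, STATUS_REG and STATUS_READY are hypothetical names, not the interface defined by the Embedded C specification, and the "register" is an ordinary variable so that the code runs on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a device register so the sketch runs on a host. */
static volatile uint8_t fake_status_reg;

/* System-dependent part: on real hardware this macro would expand
 * to the fixed address of the register instead. */
#define STATUS_REG (&fake_status_reg)

/* Encapsulated access method. As inline functions these compile
 * down to a single load or store. */
static inline uint8_t io_read8(volatile uint8_t *reg)
{
    return *reg;
}

static inline void io_write8(volatile uint8_t *reg, uint8_t value)
{
    *reg = value;
}

/* Portable driver code: no addresses or access details appear here. */
#define STATUS_READY 0x01u

static int device_is_ready(void)
{
    return (io_read8(STATUS_REG) & STATUS_READY) != 0;
}

static void device_set_ready(void)
{
    io_write8(STATUS_REG, (uint8_t)(io_read8(STATUS_REG) | STATUS_READY));
    assert(device_is_ready());  /* read back to confirm the write */
}
```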
11. MEMORY MANAGEMENT IN AN
AEROSPACE EMBEDDED CODE
• Dynamic Allocation eases development by
providing system memory to application
processes as needed at runtime and retrieving
the memory when it is no longer needed.
• C’s runtime library function malloc() can exhibit
wildly unpredictable performance and become a
bottleneck in multithreaded programs on multicore
systems.
• Hence, dynamic memory allocation is forbidden
in safety-critical embedded avionics code.
12. WHY NOT DYNAMIC MEMORY
ALLOCATION IN AVIONICS?
• Dynamic memory is a poor choice for
mission-critical code because general-purpose
allocators are based on list allocator algorithms
that organize a pool of contiguous memory
locations into a singly linked list.
• These list allocators allocate memory using
malloc() and de-allocate it for reuse using
free(), which places a burden on the programmer
to balance each call to malloc() with a
corresponding call to free().
13. THEN WHAT IS THE SOLUTION?
• Customized memory allocation functions that
more closely match specific allocation
scenarios are used such as:
1. Stack-based allocator
2. Thread-local allocator
3. In-Memory Database Systems (IMDS)
• The performance, stability and predictability
of safety-critical code improve when the above
custom allocators are used.
14. STACK-BASED ALLOCATOR
• In this algorithm, each allocation returns the
address of the current position of the stack
pointer and advances the pointer by the amount
of the request.
• When memory is no longer needed, the stack
pointer is rewound.
• Processing overhead is reduced because there is
no chain of pointers to manage, nor are there any
allocation sizes or contiguous locations to track.
• A memory leak can’t be accidentally introduced
through improper de-allocation because the
application does not have to track specific
allocations.
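The behaviour described above fits in a few lines of C. This is a minimal sketch rather than a production allocator: the pool size and the names stack_alloc_t, stack_init, stack_alloc, stack_mark and stack_rewind are assumptions, and requests are rounded up so that every returned address is suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 1024   /* hypothetical pool size in bytes */

typedef struct {
    _Alignas(max_align_t) uint8_t buf[POOL_SIZE];
    size_t top;          /* the "stack pointer": next free offset */
} stack_alloc_t;

static void stack_init(stack_alloc_t *s)
{
    s->top = 0;
}

/* Return the current stack position and advance it by the request.
 * Returns NULL when the pool is exhausted. */
static void *stack_alloc(stack_alloc_t *s, size_t n)
{
    const size_t align = _Alignof(max_align_t);
    if (n > POOL_SIZE) return NULL;
    size_t need = (n + align - 1) / align * align;  /* round up */
    if (need > POOL_SIZE - s->top) return NULL;
    void *p = &s->buf[s->top];
    s->top += need;
    return p;
}

/* Remember the current position so it can be restored later. */
static size_t stack_mark(const stack_alloc_t *s)
{
    return s->top;
}

/* Rewind: releases everything allocated since the mark at once. */
static void stack_rewind(stack_alloc_t *s, size_t mark)
{
    assert(mark <= s->top);
    s->top = mark;
}
```

Allocation is a bounds check and an addition, and release is a single assignment, so the cost of both is constant and predictable. The restriction is that memory must be released in the reverse order of allocation.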
16. THREAD-LOCAL ALLOCATOR
• A custom thread-local allocator avoids conflicts
by assigning a specific memory pool to each
thread.
• The thread’s allocations are performed from this
pool without interference from other threads’
requests, thus enhancing performance and
predictability.
• It uses a Pending Request List (PRL) for each
thread to coordinate the release of memory
blocks that are freed by a thread other than the
one that performed the original allocation.
• Memory that is allocated and de-allocated by
the same thread requires no coordination, and
therefore no lock conflicts occur.
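The lock-free, same-thread path can be sketched with C11 thread-local storage. This is a simplified illustration under stated assumptions: fixed-size blocks, hypothetical names (block_t, tl_alloc, tl_free) and sizes, and no Pending Request List, so a block must be freed by the thread that allocated it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE        32   /* hypothetical block size in bytes */
#define BLOCKS_PER_THREAD 16   /* hypothetical pool size per thread */

/* A free block stores the link to the next free block in itself. */
typedef union block {
    union block *next;
    uint8_t payload[BLOCK_SIZE];
} block_t;

/* Each thread gets its own copy of these three objects. */
static _Thread_local block_t  pool[BLOCKS_PER_THREAD];
static _Thread_local block_t *free_list;
static _Thread_local int      pool_ready;

/* Allocate one block from the calling thread's pool. No lock is
 * needed because no other thread can see this pool. */
static void *tl_alloc(void)
{
    if (!pool_ready) {               /* first use in this thread */
        for (int i = 0; i < BLOCKS_PER_THREAD - 1; i++)
            pool[i].next = &pool[i + 1];
        pool[BLOCKS_PER_THREAD - 1].next = NULL;
        free_list = pool;
        pool_ready = 1;
    }
    block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;                        /* NULL when the pool is empty */
}

/* Return a block to the calling thread's pool. In this sketch the
 * block must belong to the calling thread; a full allocator would
 * route foreign blocks through the owner's PRL instead. */
static void tl_free(void *p)
{
    block_t *b = p;
    assert((uintptr_t)b >= (uintptr_t)pool &&
           (uintptr_t)b <  (uintptr_t)(pool + BLOCKS_PER_THREAD));
    b->next = free_list;
    free_list = b;
}
```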
17. IN-MEMORY DATABASE SYSTEMS
(IMDS)
• The benefits of custom memory allocators can also
be harnessed by integrating third-party
software such as an IMDS.
• An IMDS manages application objects in RAM.
• Memory allocation & de-allocation of
application objects is also done using malloc()
and free().
• With an IMDS, concurrency among multiple threads
is managed automatically via transactions.
18. APPLICATIONS IN MILITARY
• A sensor object could represent either optical
sensors for tracking missile targets or biosensors
for defense in chemical warfare or motion
sensors to aid in navigating an aircraft.
• This sensor object occupies memory from the
memory pool; when the code completes, free()
returns the memory to the heap and the space is
relinquished for reuse.
• malloc() is responsible for memory fragmentation
and for deciding the allocator type.
19. EMBEDDED C IN FPGA SWITCHING
TECHNOLOGY
• C algorithms can be applied to programmable &
flexible FPGAs to achieve ultra-low latency.
• Parallelism involves unrolling a software process
into multiple parallel hardware processes.
• Recently applied on Wall Street.
• Has potential use for military purposes.