2. Nearly all basic radical steps in architecture were
made by our team before anybody else in the industry
• “Carry save arithmetic” – one of the two basic technologies still in use for
the main arithmetic primitive operations
– my student’s work (1954), presented at a university conference (1955).
• The best possible definition and implementation of architecture functionality
in the Elbrus computer (1978), widely used in our country, including:
– High level programming architecture support (not just support of existing
HLLs corrupted by outdated architectures) – without parallel execution
functionality (HW of that time was not ready for it);
not implemented so far in any existing computer
– A real HLL, EL-76 (1976), for Elbrus computers
– A clean, best possible OS kernel (no privileged mode) supporting real high
level programming
• The Elbrus architecture, whose main goal is a real HLL (EL-76), with the Elbrus OS
kernel as a byproduct, fully solved the security problem, including the possibility
of supporting correctness proofs of user programs.
3. OUR RADICAL STEPS (first in the industry)
(cont.)
• The very first in the industry implementation of an OOO superscalar (Elbrus 1 – 1978)
and, what is even more important, at an early stage (after the second generation of
Elbrus computers in 1985) getting rid of the superscalar approach, showing its weak
points, and starting to look for a more robust solution to the parallel execution problem.
• Successful implementation of a cluster-based VLIW architecture with fine grained
parallel execution (Elbrus 3, end of the 90s), probably for the first time in the industry.
• Suggestion and the first implementation of Binary Translation (BT) technology for
designing a new architecture built on radically new principles but binary
compatible with the old ones (Elbrus 3, end of the 90s).
• Design and simulation of radically new principles of fine grained parallel
architecture, and extension of the HLL (like EL-76) and OS (like the Elbrus OS kernel)
to support them.
5. Drawbacks of current superscalar (SS)
• Program conversion in SS is rather complicated:
parallel algorithm → sequential binary → implicitly parallel inside SS → sequential at retirement.
• SS has a performance limit (independent of the available HW).
• Inability to use all available HW properly.
• A paradoxical situation exists with the SMT mechanism: SMT is used instead of the natural parallelism of the algorithm.
• Rather complicated VECTOR HW and MULTI-THREAD programming.
• The current architecture has corrupted all of today’s HLLs.
• The current architecture does not support dynamic data typing and object oriented data memory.
This excludes the possibility of supporting good security and debugging facilities.
• The current organization of computations does not allow good optimization.
The compiler has no full information about the algorithm and the HW (corrupted HLL).
The cache structure of today’s architecture hides its internal organization, preventing the compiler from
good optimization of its operation.
• Today’s architecture is far from being universal.
• Etc.
An extremely important point here is that
all the above-mentioned drawbacks (including those of HLLs and OSs) have a single source –
the inheriting of the principles of ancient, early days’ computing, with its strong HW size constraints,
as the basis of current architecture.
6. EARLY DAYS’ COMPUTING
Main constraint – shortage of HW: a single execution unit (EU) and a small linear memory.
The execution unit was un-improvable:
carry save and high radix arithmetic.
Therefore, the whole architecture was un-improvable and universal
under said constraints.
Basic architecture decisions:
Single Instruction Pointer binary (SIP)
Simple unstructured linear memory (LM)
No data type support (No DT)
The binary was the sequence (SIP) of instructions for the main resource – the single EU.
The argument of an instruction was the address of another resource – a memory location (LM).
There was no data type support (No DT) – shortage of resources.
All execution optimization was the programmer’s job; he knew the algorithm and the HW
resources well. At that time both the algorithms to be executed and the HW were rather
simple, so the programmer was able to do his job very well.
The input binary included instructions on how to use the resources,
rather than a description of the algorithm.
The design was the best possible for those constraints.
7. SUPERSCALAR (SS)
With SS the situation became different:
• No HW size constraint
• The main constraint is the requirement of user level compatibility with old
computers
(SIP, LM, no dynamic data types)
• Program size, HW complexity and the optimization job became very big
The many drawbacks of superscalar presented above can be split into two areas:
• Bad functionality (semantics of data and operations)
Without supporting dynamic data types in HW, it is impossible to correct this drawback.
It is impossible to support real high level programming and full security.
8. SUPERSCALAR (SS) (cont.)
• Bad performance
In SS, optimization is performed by the programmer, the language compiler and the HW.
Programmer
• Now the job is too complicated for him, and he does not know the complicated HW
• Due to the corrupted HLL, he cannot specify the results of optimization correctly.
Compiler: optimization is the right job for it (and for it only),
but there are no good conditions for it in SS
• Due to the corrupted HLL, the compiler has no full information about the algorithm
• The compiler is not local to the model – it has not enough info about the model
HW either, including the cache structure, which is hidden from the compiler for
compatibility reasons.
HW (BPU, prefetching, eviction): it is the wrong job for it
• HW has no algorithm information
• HW structure is not adjusted to the algorithm structure (“artificial binding”)
9. BEST POSSIBLE COMPUTER SYSTEM
The radical step toward the Best Possible System (BPS) should
move the design to the strongly opposite extreme –
from care about resources to care about algorithms.
Two BPS systems will be discussed.
UNCONSTRAINED BPS, with the only constraints being
the algorithm itself and the HW resource size of the specific model.
CONSTRAINED BPS, with the previous constraints
plus user level compatibility with x86 (or ARM, etc.).
All mechanisms designed for the unconstrained BPS are best possible and should
be used as the basis of the constrained BPS. Besides, a few mechanisms should be
added for compatibility support.
For this, the following requirements should be satisfied by the language, compiler
and HW of the unconstrained BPS.
10. New language for BPS
The compiler should have full information about the algorithm.
That means the algorithm should be presented in a new universal language
that is not corrupted by old architectures.
The programmer’s job is to optimize the algorithm only, not its execution.
His responsibility is only to give full information about the algorithm to the compiler.
This language should have at least three important features:
• Support for the presentation of fine grained parallel algorithms (parallelism)
• The right functionality (semantics) of its elements, including dynamic
data typing and the capability feature
• The possibility to present exhaustive information about the algorithm
The second feature is completely implemented in the EL-76 language, used in
several generations of computers in our country.
11. COMPILER for BPS
Only the compiler can and should do optimization in BPS,
but it needs the following good conditions for that:
• It should have full information about the algorithm
The programmer should provide it using the new language
• It should have full information about the HW model
The compiler should be local to the HW model
The distributable binary should be just a simple recoding of the new HLL
without any optimizations
The compiler will use some dynamic information from execution to be
able to tune its optimizations dynamically
• The structure of the HW elements should be suitable for good optimization
control by the compiler (see the next slide).
A local to model compiler removes compatibility requirements from the HW,
because the local compiler receives the binary and, if needed for HW improvement,
the HW can be changed together with the compiler.
12. HW requirements for BPS
HW in BPS should not do any optimizations (BPU, prefetching, eviction, etc.) –
it cannot do them well enough: it has no algorithm info and cannot do
complex reasoning at run time for analysis.
It should do resource allocation according to the compiler’s instructions.
The main point here is that the HW structure should avoid “artificial binding” (AB)
like SIP, cache lines, vectors in AVX, full virtual pages, etc.
The data structure in HW should not contradict that of the algorithm.
The data in HW should be like a Lego set, which will allow the compiler to do
restructuring for optimization.
The BPS should use an Elbrus-like object oriented memory structure.
13. CONSTRAINED BPS
All past architectures reached an un-improvable state for their constraints. This is
true for the current SS as well.
Therefore, at least a relaxation of the current constraints, while retaining user level
ISA compatibility (x86, ARM, etc.), is an absolutely necessary condition to step
forward and build a constrained BPS.
We cannot change the semantics of the current ISA. The only possibility is to change
the binary presentation by means of BT.
So, the only possible step forward for a constrained computer architecture is
the use of a BT system.
With BT, the constrained BPS will use all the mechanisms of the unconstrained BPS,
adding three more mechanisms to support the basic compatibility
requirements (SIP, LM). These mechanisms are:
• Retirement
• Check Point
• Memory Lock Table
Unfortunately, for semantic compatibility reasons the constrained BPS cannot
support security and aggressive procedure level parallelization.
15. In a constrained architecture, the functionality (semantics) of all its elements (data and
operations) is strictly determined by compatibility requirements.
In this section we present the main functional features of the unconstrained
computer system and its elements, which were developed in accordance with the
approach described above.
The implementation of all mechanisms good for both constrained and unconstrained
systems will be the subject of the following sections.
Primitive data types and operations
Besides the traditional ones (integer, FP, etc.), they include
Data and Functional Descriptors – DD and FD – references to an object and a procedure.
DYNAMIC PRIMITIVE DATA TYPES
For primitive data, HW supports data types together with values dynamically (with
TAGs).
TYPE SAFETY APPROACH
All primitive operations check the types of their arguments, as in the sketch below.
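As an illustration only, here is a minimal sketch in C of what tagged primitive data and a type-checked operation could look like; the tag set, the word layout and the trap behavior are assumptions for this example, not the actual Elbrus encoding:

```c
/* A minimal sketch (assumed encoding) of a tagged machine word: every value
 * carries its type tag in HW, and every primitive operation checks the tags
 * of its arguments before executing. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { TAG_EMPTY, TAG_INT, TAG_FLOAT, TAG_DD, TAG_FD } Tag;

typedef struct {
    Tag tag;        /* dynamic type, kept by HW alongside the value */
    long long bits; /* raw value payload */
} Word;

/* Type-safe integer add: the "HW" traps on a tag mismatch instead of
 * silently reinterpreting the bits, as an untagged architecture would. */
Word add_int(Word a, Word b) {
    if (a.tag != TAG_INT || b.tag != TAG_INT) {
        fprintf(stderr, "type trap: ADD expects two integers\n");
        exit(1);
    }
    return (Word){ TAG_INT, a.bits + b.bits };
}
```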
16. User defined data types (objects) functionality
“Natural” requirements for the mechanism of user defined data types
(objects) and their implementation:
1) Every procedure can generate a new data object and receive a reference (DD) to this new object
2) This procedure, using the received reference, can do anything possible with this new object:
– Read data from this object
– Read it as a constant only
– Update any element
– Delete this object
3) No other procedure can access this object just after it has been generated, but this procedure
can give a reference to this object to anybody it knows (has a reference to), with all or a
decreased subset of the rights listed above
4) Any procedure can generate a copy of a reference to any object it knows, possibly with decreased
rights
5) After the object has been deleted, nobody can access it (all existing references become obsolete)
This “natural” description quite uniquely identifies a rather simple HW implementation with very high
overall execution efficiency (compared with traditional systems).
17. User defined data types (cont.)
An object can have a user defined Object Type Name (OTN). The OTN is also primitive
data, allocated to an object by its creator.
Primitive HW operations check the types of their arguments.
A procedure also can check the type of any object it is working with.
The compaction algorithm – an efficient solution of the dangling pointer problem
(compared with the less efficient Garbage Collection, GC) – was developed in the Elbrus
computer. It should be used in the unconstrained BPS.
With this approach, the user (similarly to existing systems) explicitly kills the
already used object, which (unlike GC) immediately frees physical (but,
unfortunately, not virtual) memory.
When virtual memory is close to overflow, a background compaction algorithm
scans the whole memory sequentially, deleting the DDs of killed objects and
decrementing the virtual addresses of still alive objects, which results in a
compacted virtual memory and the possibility to reuse all virtual memory freed
from killed objects.
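A minimal sketch of this background compaction pass, under simplifying assumptions (objects kept in a flat array sorted by virtual address; the rewriting of DDs and page alignment are omitted for brevity):

```c
/* Compaction sweep: killed objects are dropped and live ones are slid down,
 * so all virtual space freed by killed objects becomes reusable. Real HW
 * must also rewrite every DD in memory to the new addresses. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t vaddr;  /* current virtual address of the object */
    size_t size;
    bool   alive;  /* cleared when the user explicitly kills the object */
} Object;

/* Returns the new top of compacted virtual memory. */
size_t compact(Object *objs, size_t n) {
    size_t next_free = 0;
    for (size_t i = 0; i < n; i++) {
        if (!objs[i].alive)
            continue;              /* killed: its virtual space is reclaimed */
        objs[i].vaddr = next_free; /* slide the live object down */
        next_free += objs[i].size;
    }
    return next_free;
}
```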
18. Procedure mechanism (user defined operations)
Here we also would like to discuss the first “natural” requirements for the procedure construct to
support language level functionality consistent with the “abstract algorithm” ideas.
1) Any procedure can define another procedure, and define any information accessible to the
original procedure as global data for the new procedure. In a real running program, the only
thing needed to define the new procedure is to generate (there is a special instruction in the ISA) a
Functional Descriptor (FD), which allows calling this new procedure.
2) The procedure which generated this FD can give it to anybody it has access to, and this
new owner also can call the new procedure (only call it, without access to its global data,
executable code, etc., which can be used by the called procedure only).
3) The procedure which generates the FD includes in the FD the virtual address of the code to be
executed by the new procedure when it is called, and it also includes in the FD
the virtual address of the global data object, which can be used by the instructions of the new called
procedure. Therefore, both references are included in the FD (a reference to the code and a
reference to the global data)
4) Any procedure which has the FD of the new procedure can call this procedure and can give it
some parameters. Parameter passing is logically an atomic step – the new procedure does
not start (not a single instruction of the called procedure is executed) before the caller specifies all
parameters; the caller has no access to the parameters passed to the callee after the call is executed
5) The caller can receive some return data as the result of the procedure execution. These data can be
used by the caller’s code. Here we also have atomic return value passing
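A minimal sketch of the two references carried by an FD, under the same illustrative assumptions as the DD sketch above (a real FD is an unforgeable HW value; holders can only call through it):

```c
/* Hypothetical Functional Descriptor (FD): a pair of references - one to
 * the code, one to the global data object - created by a special ISA
 * instruction. An FD holder cannot read the code or the globals. */
#include <stdint.h>

typedef struct { uint64_t object_id; } DD;  /* DD as sketched on slide 16 */

typedef struct {
    uint64_t code_vaddr;  /* reference to the code executed on call */
    DD       globals;     /* global data object, visible to the callee only */
} FD;

/* Requirement 1: in a running program, defining a new procedure is just
 * generating an FD (one special instruction in real HW). */
FD make_procedure(uint64_t code_vaddr, DD globals) {
    return (FD){ code_vaddr, globals };
}
```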
19. Procedure mechanism (user defined operations) (cont.)
An extremely important notion for a procedure is the procedure context – this is the
only set of data which the called procedure can use. The called procedure can
use nothing besides the procedure context.
The procedure context includes:
• Global data given to the procedure by its creator procedure
• Parameter data from the caller
• All data returned to the procedure by the procedures it has called.
The restriction of a procedure to context-only access is the result of HW architecture
features:
• Dynamic data types and type safety of primitive operations
• Strong support of the semantics of references (DD and FD)
This is the foundation of capability technology, which ensures strong inter procedure
protection.
Implementing all these features in HW is a rather simple and efficient job.
20. Full solution of the security problem
Strong inter procedure protection ensures that no attacker can corrupt the
functioning of the system SW (if it has no internal mistakes) or the model HW.
The attacker cannot access any system data, as a result of the capability feature, simply
because the attacker will never have any references (DD or FD) to system data.
Nobody can send them to him, and he is unable to “create” them artificially.
He is also unable to do anything bad without real references to system SW.
However, today a lot of security problems are the result of the attacker’s ability to
exploit mistakes in the user programs he is working with.
Logically, the only remedy here is the possibility to use the well developed technology of
program correctness proof.
However, with today’s architectures (x86, ARM, etc.) even a procedure without any
mistakes can be corrupted by an attacker due to the imperfect old architecture.
This is not the case with a capability system, where a correctness proof gives a reliable
result.
The presented approach fully solves the security problem.
This technology was fully implemented in the Elbrus computer about 40 years ago.
Unfortunately, nobody until now is even close to this solution.
22. Object oriented memory (OOM)
OOM was designed and used in two generations of the Elbrus
computer with good results. Unfortunately, at that time there
was no requirement for caches, but now it can easily be
extended to caches. The current Narch design was made on a
traditional memory and cache structure; however, this
memory structure does not correspond to the above philosophy.
The OOM design can be used in full in the unconstrained BPS.
Unfortunately, it cannot be used for the memory system of the
constrained BPS (Narch) for compatibility reasons.
However, it can be used in its cache system.
According to preliminary estimations, the OOM structure, even for
the constrained BPS, can decrease cache sizes by up to 2-3
times and nearly eliminate performance losses due to cache
misses.
23. Object oriented memory (OOM) implementation
The organization of physical memory and of all cache levels is, in general, the same. The
following description applies to all of them.
The size of physical memory allocated for an object is equal to the object size. However,
each allocated object is also placed in the virtual space. This space has fixed size pages. For
each new object, virtual space is allocated from the beginning of a new page. If the size of the
object is smaller than the page size, then the end of the virtual space of this page is empty
(not used). If the object is bigger than a virtual page, then a number of pages are allocated
for it, and the last one may be not fully used (see the sketch below).
One of the main results of this organization is that each page can include data of one object
only; a page can never include data of more than one object. All free space is explicitly
visible to the HW and the compiler (no “artificial binding”).
In memory, as well as in caches, an arbitrary physical part of the object can be allocated (by
the compiler local to the model) in some specific cache.
All physical space (of variable size), both in memory and at any cache level, is allocated
dynamically. Therefore, the whole free space is highly fragmented, and it is
very difficult, if possible at all, to allocate a rather big contiguous piece of an object.
We split objects into pages to cope with this problem.
However, at the cache level even the page size is too big from this viewpoint.
Therefore, the parts of an object allocated at the cache levels are split by the local to model
compiler into even smaller pieces (all of them parts of the same virtual page).
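The sketch below illustrates the per-object page allocation rule in C; the page size and the flat page counter are assumptions for the example:

```c
/* OOM virtual allocation: every object starts at a fresh page boundary, so
 * no page ever holds data of two objects, and the unused page tail is
 * explicitly visible free space. */
#include <stdint.h>

#define PAGE_SIZE 4096u          /* assumed fixed virtual page size */

static uint64_t next_page = 0;   /* next free virtual page number */

typedef struct {
    uint64_t object_id;   /* OOM identifies objects by number, not address */
    uint64_t first_page;
    uint64_t pages;       /* the last page may be only partly used */
} Allocation;

Allocation alloc_object(uint64_t id, uint32_t size) {
    uint64_t pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;  /* round up */
    Allocation a = { id, next_page, pages };
    next_page += pages;   /* the next object begins on a new page */
    return a;
}
```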
24. Object oriented memory (OOM) implementation (cont.)
The system supports special lists for all free spaces. Each list keeps the free
areas of a certain set of sizes (most likely, powers of 2).
Each free area is linked into one of the bidirectional lists through the first word
of this free piece, as sketched below.
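A minimal sketch of these segregated free lists; the power-of-2 size classes follow the text, while the struct layout is an assumption for illustration:

```c
/* Segregated free lists: one doubly linked list per power-of-2 size class,
 * threaded through the first words of the free pieces themselves, so the
 * lists cost no extra storage. */
#include <stddef.h>

typedef struct FreePiece {
    struct FreePiece *prev, *next;  /* links live in the free space itself */
    size_t size;
} FreePiece;

#define CLASSES 32
static FreePiece *free_list[CLASSES];  /* class k holds pieces of ~2^k bytes */

static int size_class(size_t size) {
    int k = 0;
    while (((size_t)1 << k) < size) k++;  /* smallest power of 2 >= size */
    return k;
}

void insert_free(FreePiece *p) {
    int k = size_class(p->size);
    p->prev = NULL;
    p->next = free_list[k];
    if (p->next) p->next->prev = p;     /* bidirectional: O(1) unlink later */
    free_list[k] = p;
}
```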
Actually, OOM uses virtual numbers of objects instead of virtual memory
addresses. Therefore, in the case of an object with a size of many pages, all its
pages will have the same virtual object number. Full identification of a specific
element of the object includes the virtual object number and its index inside the
object. However, the descriptor includes the virtual object number only.
In OOM, an object does not necessarily have to be present in memory. Some
objects can be generated, for example, in the Level 1 cache only, or at other levels
of caches.
25. Object oriented memory (OOM) implementation (cont.)
This memory/cache system organization allows stronger compiler control over
execution.
The compiler knows all the semantic information of the program and does more
sophisticated optimization.
The compiler can preload the needed data to a high cache level, at first
without appointing the more valuable register memory, and move these data
from cache to register only at the last moment. But now even preloading
directly into a register could sometimes be a good alternative – now we have a
big register file.
This cache organization allows accessing the first level cache directly
from an instruction by physical address, without using a virtual address and
associative search.
To do this, the base register (BR) can support a special mode, in which it includes
pointers to the physical location in the first level cache together with the
virtual address.
26. Procedure mechanism (implementation)
In the past we used the “strands” approach for this implementation. While the “strand”
approach is substantially better than superscalar, it still allows dramatic
improvement.
In the strand implementation, each strand is a HW resource. The parallelism level of a
dynamically executed program varies depending on the dynamic resource
situation; therefore, execution should be able to dynamically fork a new strand,
which requires a new resource.
Typically, for such a situation the deadlock avoidance problem should be solved. A static
solution of this problem decreases performance. This is less dangerous for loops,
because loops can be executed nearly without stopping and forking strands.
However, it is not so good for scalar code.
Here we will discuss a substantially more advanced suggestion, which is good for
scalar code and increases performance for loops as well. It can be used in the constrained
BPS (Narch) as well.
This will improve the already declared performance data for Narch.
27. Procedure mechanism (implementation) (cont.)
In the new approach, the code to be executed is presented as a fine grained parallel
graph with instructions in its nodes and dependencies presented by the arcs of the
graph.
The compiler splits this graph into a number of streams, similar to the strands in the current
implementation.
Instead of the frontend of the current design, the new approach has only a code buffer for the
whole graph (not for separate streams).
Four basic technologies are used here:
• Register allocation, which is not so trivial in the case of fine grained dynamic
code execution
• Speculative execution (control and data speculation) – the same as today in Narch
• Dynamic execution of the parallel instruction graph by “workers”
• Loading of the instruction graph into the instruction buffer
28. Register allocation
DL/CL technology
The scalar code (streams) graph can be crossed by both DL and CL lines. The code can have
several DLs and CLs, each having a corresponding number – DLn and CLn.
All instructions which cross a DL or CL include this information, and the HW knows when a
specific line has been crossed.
When some DLn has been crossed, that means that some register WEBs are already free (all
reads and writes are finished) and can be reused. The registers which were freed with
DLn can be used by the compiler in instructions after the corresponding CLn. Therefore, the
corresponding CLn also can be crossed by the corresponding streams.
If some instruction marked by CLn is being executed and the corresponding DLn has not been
crossed yet, this instruction will wait until that happens.
The program will be executed correctly in any case, but the execution time can be improved.
Dynamic feedback (in HW) collects information on whether any CL was waiting, and using
this information the compiler can later recompile the procedure, lifting the corresponding DL a
little bit. Eventually, the program will work without any time losses for CL waits.
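A minimal sketch of the DL/CL handshake as described above; the flag arrays and the feedback mechanism are illustrative assumptions, not the HW implementation:

```c
/* DL/CL handshake: crossing DLn signals that a set of registers is dead and
 * reusable; an instruction marked CLn may not issue until DLn has been
 * crossed. A recorded wait feeds recompilation. */
#include <stdbool.h>

#define MAX_LINES 64

static bool dl_crossed[MAX_LINES];  /* set by instructions crossing DLn */
static bool cl_waited[MAX_LINES];   /* dynamic feedback for the compiler */

void cross_dl(int n) { dl_crossed[n] = true; }

/* Called before issuing an instruction marked CLn: returns true when the
 * instruction may proceed, and records a wait otherwise. */
bool try_cross_cl(int n) {
    if (!dl_crossed[n]) {
        cl_waited[n] = true;  /* compiler may later lift DLn a little */
        return false;         /* stall this stream */
    }
    return true;
}
```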
29. Speculative execution (control and data)
Branch execution in the new approach is similar to the previous one.
The BT compiler in the constrained version and the high level language compiler in the
unconstrained version generate a fine grained parallel binary for the HW.
Unlike superscalar with BPU technology, where all branches are critical and
each branch needs predicted speculative execution, with performance
losses in case of misprediction, in our case, due to the explicitly parallel
execution, according to our statistics 80% of branches are not critical and can
be executed without speculation.
Even for critical branches in our case, when the predicate is known well ahead,
or has a very strong prediction by the compiler, there is no need for speculation.
Critical branches with a late predicate and a bad compiler prediction should
execute both alternatives speculatively, until the predicate is known.
As a result, in our case we have no performance losses for branches at all.
The situation is similar with data speculation.
30. Dynamic execution of the parallel instruction graph by “workers”
For the constrained architecture, the compiler will do all the decoding itself instead of the
HW; therefore, each instruction in the code is ready to be loaded into the
corresponding execution unit. For the unconstrained case, each instruction also
will not need any decoding.
For each instruction the compiler will calculate a “Priority Value Number” (PVN).
This number is the number of clocks from this instruction up to the end of the
scalar code along the longest path (see the sketch below). The compiler will present the code as a
number of dependent instruction sequences – “streams” (similar to the strands in the
previous design).
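A minimal sketch of PVN computation as a longest-path pass over the instruction graph; the graph encoding and the fixed successor fan-out are assumptions for the example:

```c
/* PVN: for each instruction, the number of clocks to the end of the scalar
 * code along the longest dependency path. Visiting nodes in reverse
 * topological order makes this a single linear pass over the graph. */
#include <stdint.h>

typedef struct {
    uint32_t latency;   /* clocks this instruction takes */
    uint32_t nsucc;
    uint32_t succ[4];   /* indices of dependent instructions (assumed max 4) */
    uint32_t pvn;       /* result: longest path to the end */
} Insn;

/* 'order' lists instruction indices in reverse topological order, so every
 * successor's PVN is final before its predecessors are visited. */
void compute_pvn(Insn *g, const uint32_t *order, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        Insn *in = &g[order[i]];
        uint32_t best = 0;
        for (uint32_t s = 0; s < in->nsucc; s++)
            if (g[in->succ[s]].pvn > best)
                best = g[in->succ[s]].pvn;
        in->pvn = in->latency + best;  /* workers pick highest PVN first */
    }
}
```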
In this architecture, from the very beginning the processor will execute not a
“single instruction pointer” sequential code, but the whole graph of the
algorithm – all streams, with the explicitly parallel structure visible to the HW.
To make it possible, the processor, besides the register file, includes the code buffer.
The new technology removes the frontend from HW entirely. There are many
other advantages of this step as well.
As the code will be executed in fine grained parallel mode, each register should
have an EMPTY/FULL (E/F) bit to prevent reading from an empty register and to
make the reading instruction wait until the result is assigned.
31. Dynamic execution of the parallel instruction graph by “workers” (cont.)
Our engine has a number of “workers” in each cluster, whose job is to take the
next instructions from the most important streams and to allocate them to a
corresponding execution unit.
The number of workers in each cluster should be enough to keep all execution
units busy every clock.
Our preliminary guess is that each cluster should have about 16 workers.
A worker loads into the Reservation Station (RS) a candidate instruction which is ready to be
executed (all argument registers are FULL, or the instruction which should generate their
values has already been sent into the RS – this needs yet another bit (RS) in each register –
and the destination is EMPTY).
Besides the E/F and RS bits, each register has (one byte) the head of the list of the
streams which are waiting for the result to be written into this register from some
other stream.
If at least one argument of the next instruction to be allocated is not ready, the
worker stops working with this stream and puts the stream into the unidirectional
list of one of the registers which is not ready, as sketched below. This work requires two register
assignments, which can be done in parallel; however, at this point the worker is free
of work anyway, and it searches for any other stream ready to be handled.
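A minimal sketch of the per-register bits and the stream-parking step described above; the data layout is an assumption for illustration:

```c
/* Per-register state and stream parking: a stream whose next instruction
 * has a not-ready argument is linked onto that register's waiting list;
 * a write to the register later wakes the whole list. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    full;       /* E/F bit: value has been written */
    bool    in_rs;      /* RS bit: producer already sent to the RS */
    int16_t wait_head;  /* head of the list of parked streams, -1 if none */
} Reg;

typedef struct {
    uint32_t next_insn; /* index of the next instruction in this stream */
    int16_t  wait_next; /* link for a register's waiting list */
} Stream;

/* Park stream 'sid' on a not-ready register: two assignments, which the
 * text notes can be done in parallel in HW. */
void park(Reg *r, Stream *streams, int16_t sid) {
    streams[sid].wait_next = r->wait_head;
    r->wait_head = sid;
}
```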
32. Loading of the instruction graph into the instruction buffer
DL/CL technology helps to solve the big code problem.
For the code buffer, it is necessary to have an extension mechanism. When the
code before CLn has been executed, it is necessary to load the next part of the code,
between CLn and CLn+k. Similarly, when DLn has been crossed, all the code area
above it can be freed.
The size of the code between CLn and CLn+k is not bigger than the size of the
register file.
33. Example: Structure of Recurrent Loop Dependencies
Use the loop iteration parallelism (both iteration internal and
inter-iteration) as fully as possible.
Loop iteration analysis performed by the compiler:
– Find instructions which are self-dependent over the iterations
– Find the groups of instructions which, being self-dependent,
are also mutually dependent over the iterations (“rings” of data
dependency)
– The rest of the instructions create sequences, or graphs, of
dependent instructions (a number of “rows”)
– The result of each row is either an output of the iteration
(a STORE, for example), or is used by other row(s) or ring(s).
Each “ring” and/or “row” loop produces data which are
consumed by other small loops. Each producer can have a
number of consumers. However, producer and consumer
should be connected through a buffer, giving the producer the
possibility to go forward if the consumer is not yet ready to use
these data, as sketched below.
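A minimal sketch of such a producer/consumer decoupling buffer; the depth and the single-producer, single-consumer discipline are assumptions for the example:

```c
/* Decoupling buffer between a producing ring/row and a consumer: a ring
 * buffer lets the producer run ahead while a consumer lags, and stalls it
 * only when the buffer is full. */
#include <stdbool.h>
#include <stdint.h>

#define BUF_SLOTS 8   /* assumed depth of the producer run-ahead window */

typedef struct {
    uint64_t slot[BUF_SLOTS];
    uint32_t head, tail;  /* producer writes at head, consumer reads at tail */
} RingBuf;

bool produce(RingBuf *b, uint64_t v) {
    if (b->head - b->tail == BUF_SLOTS)
        return false;                    /* full: producer must stall */
    b->slot[b->head++ % BUF_SLOTS] = v;
    return true;
}

bool consume(RingBuf *b, uint64_t *v) {
    if (b->head == b->tail)
        return false;                    /* empty: consumer must wait */
    *v = b->slot[b->tail++ % BUF_SLOTS];
    return true;
}
```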
34. Basic components of computer technology, their current state
and our involvement in their implementation
(block diagram; its numbered components are listed below)
1. Primitive data types and operations introduction
2.1 User defined data types functionality – Objects introduction
2.2 User defined operations functionality – Procedure introduction
2.2.1 Intra & inter proc parallelism
3. “New” HLL introduction
3.0.1 Parallelism
4.1 User structural data architecture support – Object oriented memory implementation
4.1.1 To be extended to cache
4.2 User “operations” – procedure implementation
4.2.1 Intra (fine grained) & inter procedure execution parallelism architecture implementation
5. “New” OS kernel introduction
35. Green parts of computer technology were fully
implemented by our (ELBRUS) team in a real design
(1978) before anybody else in the industry.
Yellow parts require moderate extensions of some
of the green technologies to support fine grained
parallelism.
The red part is the introduction of intra & (fine grained)
inter procedure parallelism. All basic decisions are
well developed; they need to be implemented in a real
design.
36. The block diagram above includes all basic parts of computer
technology and indicates their current states.
1. Introduction of primitive data types and operations
The implementation of arithmetic is highlighted in green – over 60 years ago this
implementation reached the un-improvable state.
The carry save algorithm – my student’s work in 1954, a university presentation in
1955. The first western publication was in 1956.
High radix arithmetic – James E. Robertson, mid 50s. I had a meeting with him in
Moscow in 1958.
2.x Introduction of the functionality of user defined data types (Objects) & operations
(Procedures)
This functionality must be defined with the main, and maybe the only, basic goal:
• To fully correspond to the natural meaning of these notions, without corruption
by trying to achieve optimization, security or other goals.
• If this job is not constrained by any compatibility requirements (especially with
early days’ architecture), this approach ensures the best possible byproduct
results for all these goals.
This problem was fully solved in the Elbrus architecture (1978) and showed
outstanding results in two generations of computers widely used in our country.
Though it is difficult to prove theoretically, it is rather evident that
this approach is the best possible, just because the above goal (the natural meaning
of the functional elements) has only one solution.
37. 2.2.1 Intra & inter proc parallelism
Procedure definitions should be extended with intra (fine grained) & inter procedure
parallel execution semantics.
It was not possible to implement this in Elbrus times, because the HW was unable to
support it.
This is part of the work to be done on the parallel architecture implementation.
All basic approaches have already been suggested by our team.
3. “New” HLL introduction
3.0.1 HLL parallelism extension
We have already implemented a new language for such a design in Elbrus (EL-76).
According to the declared general design principle, this language should be (and is)
with dynamic data types and the type safety approach.
It should be extended with parallel semantics.
4.1 Object oriented memory implementation
Unlike the superscalar memory and cache organization, object oriented memory allows
the local to model compiler to do efficient optimization.
Object oriented memory is fully implemented in Elbrus.
38. 4.1.1 To be extended to cache
In Elbrus times there was no need to use caches.
All suggestions in this area have already been made.
4.2 Procedure implementation
For an advanced architecture, the procedure is a highly important feature.
Elbrus made a very clean functional implementation of the procedure. The basic result is highly modular
programming with strong inter-procedure protection.
This is also a clean and best possible implementation.
The main design step to be done here is its extension for intra & inter procedure parallelism support.
4.2.1 Intra (fine grained) & inter procedure execution parallelism implementation
These are the main design efforts for finishing the design of the best possible architecture.
Only about 10 years of silicon technology progress were required to make it possible to implement a
radically parallel architecture.
Our team has reached this point with big past experience in this area:
• The industry-first real OOO superscalar (Elbrus 1, 2) – 1978
• Even more important, we found out that it is not the best approach and got rid of it after the
second generation (Elbrus 2) – 1985
• VLIW (Elbrus 3) with the first successful cluster – ~2000
• Strands (already in Intel) – 2007-2013
• Clean loop implementation based on strands – 2007-2013
All these approaches, while reaching good results, are not the best possible (including strands).
Now we have suggested a radical improvement, close to Data Flow, both for scalar code and for loops
(also, it looks like, for the first time in the industry).
39. 5. “New” OS kernel introduction
Elbrus 1, 2 are the first and the best possible full implementation of this
technology. Due to its basic principles, Elbrus did not need to use privileged
mode programming even in the OS kernel.
An OS kernel implementation having the same functionality is about four
times simpler (smaller in size) compared with today’s OSs and can be
implemented in application mode only.
40. Results
• Elbrus, Narch and Narch+ are made strictly according to the approach
presented in this paper. The results are impressive. These are the results
of the work and application of the widely used architectures Elbrus 1, 2, 3 and of
detailed simulation of the future design.
• This approach allows implementation of an architecture unconstrained by
any compatibility restriction (Narch+), or compatible with one of the existing
architectures – x86, ARM, POWER, etc. – or even with all of them together
in one HW model with BT (Narch).
41. Main results over the most powerful Intel processors:
Narch
• Extremely high performance in both Single Job and MT applications –
unreachable for any existing architecture, maybe reaching an absolutely un-
improvable level;
already shown in detailed simulation,
before the introduction of all performance mechanisms:
2x+ on ST
2x on MT with the same area
After finishing debugging:
3x – 4x on ST
2.5x – 3x on MT with the same area
• Substantially better power efficiency and smaller area with the same performance:
20% – 30% better power efficiency
60% of the area
• A much simpler architecture design
• Un-improvable for any current architecture; fully compatible with x86, ARM
or any other current architecture.
42. Main results over the most powerful Intel processors:
Narch+
• Performance is many tens of times higher, both for ST and for MT
• Extremely simple and power efficient
• Substantially simpler and more reliable SW debugging (according to Elbrus
experience – by 10 times)
• Full solution of the security problem for HW, OS and user programs
(with correctness proofs) – all attackers will be jobless
• Really universal; this is a rather important feature. No architecture since
the very first vacuum tube computers has had this characteristic.
It is very likely that after the Narch+ introduction (if this happens), it will not be
necessary to design a myriad of specialized architectures for graphics, computer
vision, machine learning and so on.
Narch+ will be an absolutely un-improvable architecture nearly from the very first
design.