Radical step in computer
architecture
Boris Babayan
Nearly all basic radical steps in architecture were
made by our team before anybody in industry
• “Carry save arithmetic” – one of the two basic technologies still in use for
main arithmetic primitive operations
– my student’s work (1954), presented at university conference (1955).
• The best possible definition and implementation of architecture functionality in the
Elbrus computer (1978), widely used in our country, including
– High level programming architecture support (not just support of existing
HLLs corrupted by outdated architectures) – without parallel execution
functionality (HW of that time was not ready for it);
still not implemented in any other existing computer
– Real HLL EL – 76 (1976) for Elbrus computers
– A clean, best possible OS kernel (no privileged mode) for supporting real high
level programming
• The Elbrus architecture, whose main goal was a real HLL (EL – 76), with the Elbrus OS
kernel as a byproduct, fully solved the security problem, including the possibility
of supporting correctness proofs of user programs.
OUR RADICAL STEPS (first in industry)
(cont.)
• The very first-in-technology implementation of an OOO superscalar (Elbrus 1 – 1978)
and, even more important, getting rid of the superscalar approach at an early stage
(after the second generation of Elbrus computers in 1985), showing its weak
points and starting the search for a more robust solution to the parallel execution problem.
• Successful implementation of a cluster-based VLIW architecture with fine grained
parallel execution (Elbrus 3, end of the 90s), probably for the first time in technology.
• Suggestion and the first implementation of Binary Translation (BT) technology for
designing a new architecture built on radically new principles but binary
compatible with the old ones (Elbrus 3, end of the 90s).
• Design and simulation of radically new principles of fine grained parallel
architecture and extension of HLL (like EL – 76) and OS (like Elbrus OS kernels) for
their support.
General computer system
structure
Drawbacks of current superscalar (SS)
• Program conversion in SS is rather complicated.
Parallel algorithm → sequential binary → implicitly parallel inside SS → sequential at retirement
• SS has performance limit (independent of available HW).
• Inability to use all available HW properly.
• A paradoxical situation exists with the SMT mechanism → SMT is used instead of the natural parallelism of the algorithm.
• Rather complicated VECTOR HW and MULTI-THREAD programming.
• Current architecture corrupted all today’s HLLs.
• Current architecture does not support dynamic data typing and object oriented data memory.
This excludes the possibility of supporting good security and debugging facilities.
• Current organization of computations does not allow good optimization.
The compiler has no full information about the algorithm and the HW (corrupted HLL).
The cache structure of today's architecture hides its internal organization, preventing the compiler
from optimizing its operation well.
• Today’s architecture is far from being universal.
• Etc.
An extremely important point here is that
all the above-mentioned drawbacks (including HLL, OS) have a single source –
current architecture inherits as its basic principles those of ancient, early days'
computing with its strong HW size constraints.
EARLY DAYS' COMPUTING
Main constraint – shortage of HW → single execution unit (EU) and small linear memory
The execution unit was un-improvable:
Carry save and high radix arithmetic
Therefore, the whole architecture was un-improvable and universal
within said constraints
Basic architecture decisions
Single Instruction Pointer binary (SIP)
Simple unstructured linear memory (LM)
No data types support (No DT)
The binary was a sequence (SIP) of instructions for the main resource – the single EU
The argument of an instruction is the address of another resource – a memory location (LM)
No data type support (No DT) – shortage of resources
All execution optimization was the programmer's job; he knew the algorithm and the HW
resources well. At that time both the algorithms to be executed and the HW were rather
simple, so the programmer was able to do this job very well
The input binary includes instructions on how to use resources,
rather than a description of the algorithm.
The design was the best possible for those constraints.
SUPERSCALAR (SS)
With SS the situation became different:
• No HW size constraint
• The main constraint is the requirement of user level compatibility with old
computers
(SIP, LM, No Dynamic Data Types)
• Program size, HW complexity and the optimization job became very big
The many drawbacks of superscalar presented above can be split into two areas:
• Bad functionality (semantics of data and operations)
Without supporting dynamic data types in HW, it is impossible to correct this drawback:
it is impossible to support real high level programming and full security.
SUPERSCALAR (SS) (cont.)
• Bad performance
In SS optimization is executed by the programmer, the language compiler and the HW.
Programmer
• Now it is too complicated for him, and he does not know the complicated HW
• Due to the corrupted HLL he cannot specify the results of optimization correctly.
Compiler
Optimization is the right job for it (and for it only),
but there are no good conditions for it in SS
• Due to the corrupted HLL the compiler has no full information about the algorithm
• Now the compiler is not local to the model – it has not enough info about the model
HW either, including the cache structure, which is hidden from the compiler for
compatibility reasons.
HW (BPU, prefetching, eviction) – it is the wrong job for it
HW has no algorithm information
HW structure is not adjusted to the algorithm structure (“artificial binding”)
BEST POSSIBLE COMPUTER SYSTEM
A radical step toward the Best Possible System (BPS) should
move the design to the strongly opposite extreme –
from caring about resources to caring about algorithms
Two BPS systems will be discussed.
UNCONSTRAINED BPS, with the only constraints –
the algorithm's limitations and the specific model's HW resource sizes
CONSTRAINED BPS, with the previous constraints
plus user level compatibility with x86 (or ARM, etc.)
All mechanisms designed for unconstrained BPS are best possible and should
be used as basic in constrained BPS. Besides, a few mechanisms should be
added for compatibility support.
For this, the following requirements should be satisfied by the language, the compiler
and the HW of an unconstrained BPS
New language for BPS
Compiler should have full information about algorithm.
That means that algorithm should be presented in a new universal language
that is not corrupted by old architectures.
The programmer's job is to optimize the algorithm only, not its execution.
His only responsibility is to give full information about the algorithm to the compiler.
This language should have at least three important features:
• Support of presentation of fine grained parallel algorithms (parallelism)
• The right functionality (semantics) of its elements including dynamic
data typing and capability feature
• Possibility to present exhaustive information about algorithm
The second feature is completely implemented in EL-76 language used in
several generations of computers in our country.
COMPILER for BPS
Only the compiler can and should do optimization in BPS,
but it should have the following good conditions for that:
• It should have full information about algorithm
Programmer should give it using the new language
• It should have full information about HW model
Compiler should be local to the HW model
Distributable binary should be just a simple recoding of new HLL
without any optimizations
Compiler will use some dynamic information from execution to be
able to tune optimization dynamically
• The structure of HW elements should be suitable for good optimization
control by compiler (see next slide).
A local-to-model compiler removes compatibility requirements from the HW, because
the local compiler receives the binary, and, if needed for HW improvement, the HW
can be changed together with the compiler.
HW requirements for BPS
HW in BPS should not do any optimizations (BPU, prefetching, eviction, etc.) –
it cannot do them well enough: it has no algorithm info and cannot do
complex reasoning at run time for analysis.
It should do resource allocation according to compiler instructions.
The main point here is that the HW structure should avoid “artificial binding” (AB)
like SIP, cache lines, vectors in AVX, full virtual pages, etc.
The data structure in HW should not contradict that of the algorithm.
The data in HW should be like a Lego set, which allows the compiler to do
restructuring for optimization.
The BPS should use an Elbrus-like object oriented memory structure.
CONSTRAINED BPS
All past architectures reached an un-improvable state for their constraints. This is
true for current SS as well.
Therefore, at least relaxation of the current constraints, while retaining user level
ISA compatibility (x86, ARM, etc.), is an absolutely necessary condition to step
forward and build a constrained BPS.
We cannot change the semantics of the current ISA. The only possibility is to change
the binary presentation by means of BT.
So, the only possible step forward for a constrained computer architecture is
the use of a BT system.
With BT, a constrained BPS will use all mechanisms of the unconstrained BPS,
adding three more mechanisms to support the basic compatibility
requirements (SIP, LM). These mechanisms are:
• Retirement
• Check Point
• Memory Lock Table
Unfortunately, for semantics compatibility reasons a constrained BPS cannot
support security and aggressive procedure level parallelization.
Functionality (semantics) of basic
elements
In constrained architecture functionality (semantics) of all its elements (data and
operations) is strongly determined by compatibility requirements
In this section we are going to present the main functional features of unconstrained
computer system and its elements, which were developed in accordance with the
approach described above.
The implementation of all mechanisms good for both constrained and unconstrained
systems will be the subject of the following sections.
Primitive data types and operations
Besides the traditional ones (integer, FP, etc.), they include
Data and Functional Descriptors – DD and FD – references to an object and a procedure
DYNAMIC PRIMITIVE DATA TYPES
For primitive data, HW supports data types together with values dynamically (with
TAGs).
TYPE SAFETY APPROACH
All primitive operations are checking types of their arguments.
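The tag-checking idea above can be sketched in Python as follows. This is a hypothetical illustration, not the Elbrus HW design; the `Tagged` wrapper, the tag names, and the `add` operation are invented for this sketch:

```python
# Sketch of HW-tagged primitive data: every value carries a type tag,
# and every primitive operation checks its arguments' tags (type safety).
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    tag: str        # e.g. "int", "fp", "DD" (data descriptor), "FD" (functional descriptor)
    value: object

def add(a: Tagged, b: Tagged) -> Tagged:
    # A primitive operation refuses mismatched or non-numeric tags.
    if a.tag != b.tag or a.tag not in ("int", "fp"):
        raise TypeError(f"add: incompatible tags {a.tag}, {b.tag}")
    return Tagged(a.tag, a.value + b.value)
```

With this model, adding two "int" values succeeds, while adding an "int" to a descriptor raises a type error at the operation itself, which is the essence of the type safety approach.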
User defined data types (objects) functionality
“Natural” requirements to the mechanism of user defined data types
(objects) and their implementation
1) Every procedure can generate a new data object and receive a reference (DD) to this new object
2) This procedure, using the received reference, can do anything possible with this new object:
– Read data from this object
– Read full constant only
– Update any element
– Delete this object
3) No other procedure can access this object just after it has been generated, but this procedure
can give a reference to this object to anybody it knows (has a reference to), with all or
decreased rights listed above
4) Any procedure can generate a copy of a reference to any object it knows, possibly with decreased
rights
5) After the object has been deleted, nobody can access it (all existing references are obsolete)
This “natural” description quite uniquely determines a rather simple HW implementation with very high
overall execution efficiency (compared with traditional systems).
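The five rules above can be sketched in Python. This is a hypothetical model, not the Elbrus implementation; the `Heap` and `DD` classes and the right names (`read`, `write`, `delete`) are invented for this sketch:

```python
# Sketch of the "natural" object rules: a data descriptor (DD) names an object
# and carries a subset of rights; copies may only shrink rights (rule 4);
# deletion makes every outstanding DD obsolete (rule 5).
class Heap:
    def __init__(self):
        self._objects = {}      # object id -> storage
        self._next = 0

    def new_object(self, size):
        # Rule 1: the creator receives a DD with all rights.
        oid = self._next; self._next += 1
        self._objects[oid] = [0] * size
        return DD(self, oid, frozenset({"read", "write", "delete"}))

class DD:
    def __init__(self, heap, oid, rights):
        self._heap, self._oid, self.rights = heap, oid, rights

    def _check(self, right):
        if self._oid not in self._heap._objects:
            raise ReferenceError("obsolete DD: object was deleted")
        if right not in self.rights:
            raise PermissionError(f"DD lacks '{right}' right")

    def read(self, i):
        self._check("read"); return self._heap._objects[self._oid][i]

    def write(self, i, v):
        self._check("write"); self._heap._objects[self._oid][i] = v

    def delete(self):
        self._check("delete"); del self._heap._objects[self._oid]

    def copy(self, rights):
        # Rule 4: a copy may only carry decreased rights.
        return DD(self._heap, self._oid, self.rights & frozenset(rights))
```

For example, a read-only copy made with `copy({"read"})` can still read the object but any write through it fails, and after `delete()` every remaining DD to the object raises an error.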
User defined data types (cont.)
An object can have a user defined Object Type Name (OTN). The OTN is also primitive
data, assigned to the object by its creator.
Primitive HW operations check the types of their arguments.
A procedure can also check the type of any object it is working with.
The compaction algorithm – an efficient solution to the dangling pointer problem
(compared with less efficient Garbage Collection, GC) – was developed in the Elbrus
computer. It should be used in the unconstrained BPS.
With this approach, the user (similarly to existing systems) explicitly kills the
no longer needed object, which (unlike GC) immediately frees physical (but,
unfortunately, not virtual) memory.
When virtual memory is close to overflow, a background compaction algorithm
searches the whole memory sequentially, deleting the DDs of killed objects and
decrementing the virtual addresses of still alive objects, which results in a
compacted virtual memory and the possibility to reuse all the virtual memory freed
from killed objects.
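A minimal sketch of this compaction idea, under the simplifying assumption that objects are tracked in allocation order (rewriting of surviving DDs is elided here; the `VirtualSpace` class and its method names are invented for this sketch):

```python
# Sketch of Elbrus-style compaction: killing an object frees physical memory
# at once, but its virtual range is reclaimed only when a background pass
# compacts the virtual space by sliding live objects down.
class VirtualSpace:
    def __init__(self):
        self.objects = []       # [virtual_base, size, alive], in allocation order
        self.top = 0            # next free virtual address

    def allocate(self, size):
        idx = len(self.objects)
        self.objects.append([self.top, size, True])
        self.top += size
        return idx              # index plays the role of a DD in this sketch

    def kill(self, idx):
        # Explicit kill: physical memory is freed immediately,
        # but the virtual range stays occupied until compaction.
        self.objects[idx][2] = False

    def compact(self):
        # Background pass: drop dead ranges, slide live objects down,
        # shrinking the virtual space (DD rewriting elided).
        new_top, survivors = 0, []
        for base, size, alive in self.objects:
            if alive:
                survivors.append([new_top, size, True])
                new_top += size
        self.objects, self.top = survivors, new_top
```

After compaction the virtual space is dense again, so all the virtual memory freed from killed objects can be reused.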
Procedure mechanism (user defined operations)
Here also we would like to discuss the “natural” requirements for the procedure construct to
support language level functionality consistent with the “abstract algorithm” ideas.
1) Any procedure can define another procedure, and define any information accessible to the
original procedure as global data for the new procedure. In a real running program the only
thing to do for definition of the new procedure is to generate (there is a special instruction for this in the ISA) a
Functional Descriptor (FD), which allows calling this new procedure.
2) The procedure which generated this FD can give the new FD to anybody it has access to, and this
new owner can also call the new procedure (call only, without access to its global data,
executable code, etc., which can be used by the called procedure only).
3) The procedure which generates the FD includes in the FD the virtual address of the code to be executed by
the new procedure when it is called, and it also includes in the FD
the virtual address of the global data object which can be used by the instructions of the new called
procedure. Therefore both references are included in the FD (a reference to code and a
reference to global data)
4) Any procedure which has the FD of the new procedure can call this procedure and can give it
some parameters. Parameter passing is logically an atomic step – the new procedure does
not start (not a single instruction of the called procedure is executed) before the caller specifies all
parameters; the caller has no access to the parameters passed to the callee after the call is executed
5) The caller can receive some return data as a result of procedure execution. These data can be
used by the caller's code. Here also we have atomic return value passing
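The FD construct above can be sketched as a pair of references, much like a closure. This is a hypothetical illustration; the `FD` class and the `make_counter_fd` example procedure are invented for this sketch:

```python
# Sketch of a Functional Descriptor (FD): a pair of references, one to code and
# one to the creator's global data object. A holder may call through the FD but
# cannot look inside either reference.
class FD:
    def __init__(self, code, globals_ref):
        self._code = code               # reference to the executable code
        self._globals = globals_ref     # reference to creator-supplied global data

    def call(self, *params):
        # Atomic parameter passing: the callee sees all parameters at once;
        # the caller gets back only the return value.
        return self._code(self._globals, *params)

def make_counter_fd():
    # The creating procedure defines the code plus its global data object
    # and hands out an FD that anybody may call.
    state = {"count": 0}
    def code(globals_ref, step):
        globals_ref["count"] += step
        return globals_ref["count"]
    return FD(code, state)
```

Note how the holder of the FD can only call it; the `state` object is reachable solely through the FD's global data reference, mirroring the protection rule in point 2.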
Procedure mechanism (user defined operations) (cont.)
An extremely important notion for a procedure is the procedure context – this is the
only set of data which the called procedure can use. The called procedure can
use nothing besides the procedure context.
Procedure context includes:
• Global data given to procedure by creator procedure
• Parameters data from caller
• All data returned to the procedure by the procedures it calls.
The restriction of a procedure to context-only access is the result of HW architecture
features:
• Dynamic data types and type safety support in primitive operations
• Strong support of the semantics of references (DD and FD)
This is the foundation of capability technology, which ensures strong inter procedure
protection.
The implementation of all these features in HW is a rather simple and efficient job.
Full solution of security problem
Strong inter procedure protection ensures that no attacker can corrupt the
functioning of the system SW (if it has no internal mistakes) and the model HW.
An attacker cannot access any system data, as a result of the capability feature, just
because the attacker will never have any references (DD or FD) to system data.
Nobody can send them to him, and he is unable to “create” them artificially.
He is also unable to do anything bad without real references to system SW.
However, now a lot of security problems result from the possibility for an attacker to
exploit mistakes in the user programs he is working with.
Logically, the only remedy here is the possibility of using the well developed technology of
program correctness proof.
However, with today's architectures (x86, ARM, etc.) even a procedure without any
mistakes can be corrupted by an attacker due to the imperfect old architecture.
This is not the case with a capability system, where a correctness proof gives a reliable
result.
The presented approach fully solves the security problem.
This technology was fully implemented in the Elbrus computer about 40 years ago.
Unfortunately, nobody is even close to this solution till now.
Implementation of above
functionality
Object oriented memory (OOM)
OOM was designed and used in two generations of the Elbrus
computer with good results. Unfortunately, at that time there
was no requirement for caches. But now it can be easily
extended to caches. The current Narch design was made on a
traditional memory and cache structure. However, this
memory structure doesn't correspond to the above philosophy.
The OOM design can be used to the full degree in an unconstrained BPS.
Unfortunately, it cannot be used for the memory system of a
constrained BPS (Narch) due to compatibility reasons.
However, it can be used in its cache system.
The OOM structure, even for a constrained BPS, according to
preliminary estimations can decrease cache sizes by up to 2-3
times and nearly exclude performance losses due to cache
misses.
Object oriented memory (OOM) implementation
The organizations of physical memory and all cache levels are, in general, the same. The
following description is related to all of them.
The size of the physical memory allocated for an object is equal to the object size. However,
each allocated object is also loaded into the virtual space. This space has fixed size pages. For
each new object, virtual space is allocated from the beginning of a new page. If the size of the
object is smaller than the page size, then the end of the virtual space of this page is empty
(not used). If the object is bigger than the virtual page, then a number of pages are allocated
for it, and the last one can be not fully used.
One of the main results of this organization is that each page can include data of one object
only. A page can never include data of more than one object. All free space is explicitly
visible to the HW and the compiler (no “artificial binding”).
In memory, as well as in caches, an arbitrary physical part of the object can be allocated (by
the local-to-model compiler) in some specific cache.
All physical space (of variable size), both in memory and at any cache level, is allocated
dynamically. Therefore, the whole free space is fragmented to a high degree, and it is
very difficult, sometimes impossible, to allocate a rather big contiguous piece of an object.
We split objects into pages to cope successfully with this problem.
However, for the cache levels even the page size is big from this viewpoint.
Therefore, the parts of an object allocated at cache levels are split by the local-to-model compiler
into even smaller parts (all these parts are parts of the same virtual page).
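The page-boundary rule above can be sketched in Python. This is a hypothetical model of the virtual-space side only (the `OOM` class, a 4 KB page size, and the method names are assumptions of this sketch):

```python
# Sketch of OOM virtual allocation: each object starts on a fresh page boundary,
# so no page ever holds data of two objects, and a descriptor carries only the
# virtual object number.
PAGE = 4096

def pages_needed(size):
    # Number of fixed-size virtual pages an object of `size` bytes occupies.
    return (size + PAGE - 1) // PAGE

class OOM:
    def __init__(self):
        self.next_page = 0
        self.table = {}         # virtual object number -> (first page, size)
        self.next_von = 0

    def allocate(self, size):
        von = self.next_von; self.next_von += 1
        self.table[von] = (self.next_page, size)
        self.next_page += pages_needed(size)    # next object: fresh page
        return von              # the descriptor holds the virtual object number only

    def element_id(self, von, index):
        # Full identification = virtual object number + index inside the object.
        base_page, size = self.table[von]
        assert 0 <= index < size
        return (von, index)
```

A 100-byte object leaves the rest of its page unused, and a 5000-byte object takes two pages with the last one not fully used, exactly as described above.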
Object oriented memory (OOM) implementation (cont.)
The system supports special lists for all free spaces. Each list keeps the free
areas of a certain set of sizes (most likely, powers of 2).
Each free area is listed in one of the bidirectional lists through the first word
of this free piece.
Actually, OOM uses virtual numbers of objects instead of virtual memory
addresses. Therefore, in the case of an object with the size of many pages, all its
pages will have the same virtual object number. Full identification of a specific
element of the object will include the virtual object number and its index inside the
object. However, the descriptor includes the virtual object number only.
In OOM each object should not be necessarily presented in memory. Some
objects can be generated, for example, in Level 1 cache only or in other levels
of caches.
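The free-space lists can be sketched as follows. This is a hypothetical simplification: real HW would thread each area through its first word as a bidirectional list, while this sketch just keeps Python lists per power-of-two size class:

```python
# Sketch of the free-space lists: each list holds free areas of one
# power-of-two size class; allocation takes any area from a class
# that is large enough.
import math

class FreeLists:
    def __init__(self):
        self.lists = {}     # size class (power of 2) -> free area addresses

    @staticmethod
    def size_class(size):
        # Round a size up to the next power of two.
        return 1 << max(0, math.ceil(math.log2(size)))

    def free(self, addr, size):
        self.lists.setdefault(self.size_class(size), []).append(addr)

    def allocate(self, size):
        # First fit over size classes, smallest adequate class first.
        for cls in sorted(self.lists):
            if cls >= size and self.lists[cls]:
                return self.lists[cls].pop()
        return None         # no free area large enough
```

For example, a freed 48-byte area lands in the 64-byte class and can satisfy a 60-byte request.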
Object oriented memory (OOM) implementation (cont.)
This memory/cache system organization allows stronger compiler control over
execution.
The compiler knows all the program's semantic information and does more
sophisticated optimization.
The compiler can preload the needed data to a high cache level, at first
without assigning the more valuable register memory, and move this data
from cache to a register only at the last moment. But now even preloading
directly into a register sometimes can be a good alternative – now we have a
big register file.
This cache organization allows access to the first level cache directly
from an instruction by physical address, without using a virtual address and
associative search.
To do this, the base register (BR) can support a special mode, in which it includes
pointers to the physical location in the first level cache together with the
virtual address.
Procedure mechanism (implementation)
In the past we used the “strands” approach for this implementation. While the “strands”
approach is substantially better than superscalar, it still allows dramatic
improvement.
In the strand implementation each strand is a HW resource. The parallelism level of a
dynamically executed program varies depending on the dynamic resource
situation; therefore, execution should be able to dynamically fork a new strand,
which requires a new resource.
Typically, for such a situation the deadlock avoidance problem should be solved. A static
solution to this problem decreases performance. This is less dangerous for loops,
because loops can be executed nearly without stopping and forking strands.
However, it is not so good for scalar code.
Here we will discuss a substantially more advanced suggestion, which is good for
scalar code and increases performance for loops as well. It can be used in a constrained
BPS (Narch) as well.
This will improve the already declared performance data for Narch.
Procedure mechanism (implementation) (cont.)
In the new approach, the code to be executed is presented as a fine grained parallel
graph with instructions in its nodes and dependencies presented by the arcs of the
graph.
The compiler splits this graph into a number of streams, similar to strands in the current
implementation.
Instead of the frontend in the current design, the new approach has only a code buffer for the
whole graph (not for separate streams).
Four basic technologies are used here:
• Register allocation, which is not so trivial in the case of fine grained dynamic
code execution.
• Speculative execution (control and data speculation) – same as today in Narch
• Dynamic execution of parallel instruction graph by “workers”
• Instruction graph loading into instruction buffer
Register allocation
DL/CL technology
The scalar code (streams) graph can be crossed by both DL and CL lines. Code can have
several DLs and CLs, each having a corresponding number – DLn and CLn.
All instructions which cross a DL or CL include this information, and the HW knows when a
specific line is crossed.
When some DLn is crossed, that means that some register WEBs are already free (all
reads and writes are finished) and can be reused. The registers which were freed with
DLn can be used by the compiler in instructions after the corresponding CLn. Therefore,
the corresponding CLn can also be crossed by the corresponding streams.
If some instruction marked by CLn is being executed and the corresponding DLn is not
crossed yet, this instruction will wait until this happens.
The program will be executed correctly in any case, but the time of execution can be improved.
Dynamic feedback (in HW) collects information on whether any CL was waiting, and using
this information the compiler can later recompile the procedure, lifting the corresponding DL a
little bit. Eventually, the program will work without any time losses on CL waits.
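The DL/CL handshake above can be sketched in a few lines. This is a hypothetical model (the `DLCL` class and method names are invented for this sketch), showing both the wait rule and the dynamic feedback counter used for recompilation:

```python
# Sketch of DL/CL synchronization: crossing DLn frees a group of register WEBs;
# any instruction marked CLn must wait until DLn has been crossed before it may
# reuse those registers.
class DLCL:
    def __init__(self):
        self.crossed = set()    # DL numbers already crossed
        self.waits = 0          # dynamic feedback: how often a CL had to wait

    def cross_dl(self, n):
        # Register WEBs freed by DLn are now reusable past CLn.
        self.crossed.add(n)

    def try_cl(self, n):
        if n not in self.crossed:
            self.waits += 1     # feedback for the compiler: lift DLn earlier
            return False        # the CLn instruction must wait
        return True
```

A nonzero `waits` counter after a run is exactly the signal the compiler uses to lift the corresponding DL in a recompilation.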
Speculative execution (control and data)
Branch execution in the new approach is similar to the previous one.
The BT compiler in the constrained version and the high level language compiler in the
unconstrained version generate a fine grained parallel binary for the HW.
Unlike superscalar with BPU technology, where all branches are critical and
each branch needs predicted speculative execution, with performance
losses in the case of misprediction, in our case, due to explicitly parallel
execution, according to our statistics 80% of branches are not critical and can
be executed without speculation.
Even for critical branches in our case, when the predicate is known well ahead,
or has a very strong prediction by the compiler, there is no need for speculation.
Critical branches with a late predicate and bad compiler prediction should
execute both alternatives speculatively until the predicate is known.
As a result, in our case we have no performance losses on branches at all.
The situation with data speculation is similar.
Dynamic execution of parallel instruction graph by “workers”
For the constrained architecture, the compiler will do all decoding itself instead of the
HW; therefore, each instruction in the code is ready to be loaded into the
corresponding execution unit. For the unconstrained case, each instruction also
will not need any decoding.
For each instruction the compiler will calculate a “Priority Value Number” (PVN).
This number is the number of clocks from this instruction up to the end of the
scalar code along the longest path. The compiler will present the code as a
number of dependent instruction sequences – “streams” (similar to strands in
the previous design).
In this architecture, from the very beginning the processor will execute not a
“single instruction pointer” sequential code, but the whole graph of the
algorithm – all streams, with the explicitly parallel structure visible to the HW.
To make this possible, the processor, besides the register file, includes the code buffer.
The new technology removes the frontend from HW entirely. There are many
other advantages of this step as well.
As code will be executed in fine grained parallel mode, each register should
have an EMPTY/FULL (E/F) bit to prevent reading from an empty register and
make the reading instruction wait until the result is assigned.
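The PVN defined above is just the longest-path length from an instruction to the end of the graph. A minimal sketch of how a compiler could compute it (the function and argument names are invented; latencies in clocks and a successor map are assumed inputs):

```python
# Sketch of computing the "Priority Value Number" (PVN): for every instruction
# node, the number of clocks along the longest path from that node to the end
# of the scalar code graph.
def compute_pvn(latency, successors):
    """latency: node -> clocks; successors: node -> list of dependent nodes."""
    pvn = {}
    def longest(node):
        # Longest path = own latency + best path among dependents (memoized).
        if node not in pvn:
            pvn[node] = latency[node] + max(
                (longest(s) for s in successors.get(node, [])), default=0)
        return pvn[node]
    for node in latency:
        longest(node)
    return pvn
```

Workers would then prefer the stream whose next instruction has the highest PVN, since it lies on the critical path.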
Dynamic execution of parallel instruction graph by “workers” (cont.)
Our engine has a number of “workers” in each cluster, whose job is to take the
next instructions from the most important streams and to allocate them to the
corresponding execution units.
The number of workers in each cluster should be enough to keep all execution
units busy each clock.
Our preliminary guess is that each cluster should have about 16 workers.
A worker loads into the Reservation Station (RS) a candidate instruction which is
ready to be executed: all argument registers are FULL, or the instruction which should
generate their values has already been sent to the RS (this needs yet another bit, RS,
in each register), and the destination is EMPTY.
Besides the E/F and RS bits, each register has (one byte) the head of the list of
streams which are waiting for the result to be written into this register from some
other stream.
If at least one argument of the next instruction to be allocated is not ready, the
worker stops working with this stream and puts this stream into the unidirectional
list of one of the registers which is not ready. This work requires two register
assignments, which can be done in parallel; however, at this point the worker is free
of this work anyway, and it searches for any other stream ready to be handled.
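The worker/register interaction above can be sketched as follows. This is a hypothetical simplification (the RS bit and the PVN-based stream choice are omitted; the `Machine` class and its fields are invented for this sketch):

```python
# Sketch of "worker" dispatch over E/F-bit registers: a worker takes the next
# instruction of a stream; if an argument register is still EMPTY, the stream
# is parked on that register's wait line until the result is written.
class Machine:
    def __init__(self, nregs):
        self.full = [False] * nregs     # E/F bit per register
        self.value = [None] * nregs
        self.waiting = {}               # register -> streams parked on it
        self.ready = []                 # streams a worker may pick up

    def write(self, reg, val):
        # Writing a result sets FULL and wakes the register's wait line.
        self.value[reg], self.full[reg] = val, True
        self.ready.extend(self.waiting.pop(reg, []))

    def worker_step(self, stream):
        op, args, dest = stream[0]      # next instruction of the stream
        missing = [r for r in args if not self.full[r]]
        if missing:
            # Park the stream on a not-ready register; the worker is now
            # free to look for another ready stream.
            self.waiting.setdefault(missing[0], []).append(stream)
            return False
        stream.pop(0)
        self.write(dest, op(*(self.value[r] for r in args)))
        return True
```

A stream whose instruction needs two empty registers gets parked, reawakened as each argument arrives, and finally executed once both E/F bits are FULL.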
Loading of instruction graph into instruction buffer
DL/CL technology helps to solve the big code problem.
For the code buffer, it is necessary to have an extension mechanism. When the code
before CLn has been executed, it is necessary to load the next part of the code
between CLn and CLn+k. Similarly, when DLn is crossed, all the code area
above it can be freed.
The size of the code between CLn and CLn+k is not bigger than the size of the
register file.
Example: Structure of Recurrent Loop Dependencies
• Use loop iteration parallelism (both iteration-internal and
inter-iteration) as fully as possible
• Loop iteration analysis performed by the compiler:
– Find instructions which are self-dependent over iterations
– Find the groups of instructions which, being self-dependent,
are also mutually dependent over the iterations (“rings” of data
dependency)
– The rest of the instructions create sequences, or a graph of
dependent instructions (a number of “rows”)
– The result of each row is either an output of the iteration
(a STORE, for example), or is used by other row(s) or ring(s).
• Each “ring” and/or “row” loop produces data which are
consumed by other small loops. Each producer can have a
number of consumers. However, producer and consumer
should be connected through a buffer, giving the producer the
possibility to go forward if the consumer is not yet ready to use
these data
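The producer/consumer decoupling at the end of the list can be sketched as a bounded buffer. This is a hypothetical illustration (the `ChannelBuffer` class is invented for this sketch; a HW implementation would of course not use a Python deque):

```python
# Sketch of ring/row decoupling: each producer loop feeds its consumers
# through a bounded buffer, so the producer can run ahead while a consumer
# is not yet ready to take the data.
from collections import deque

class ChannelBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.q = deque()

    def produce(self, item):
        if len(self.q) >= self.capacity:
            return False        # buffer full: the producer must stall
        self.q.append(item)
        return True

    def consume(self):
        # Returns the oldest item, or None if the consumer must wait.
        return self.q.popleft() if self.q else None
```

The producer only stalls when the buffer is full, which is exactly the slack that lets rings and rows of one iteration overlap with later iterations.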
1. Primitive data types and operations introduction
2.1 User defined data types functionality – Objects introduction
2.2 User defined operations functionality – Procedure introduction
2.2.1 Intra & inter proc parallelism
3. “New” HLL introduction
3.0.1 Parallelism
4.1 User structural data architecture support – Object oriented memory implementation
4.1.1 To be extended to cache
4.2 User “operations” – procedure implementation
4.2.1 Intra (fine grained) & inter procedure execution parallelism architecture implementation
5. “New” OS kernel introduction
Basic components of computer technology, their current state
and our involvement in their implementation
The green parts of computer technology were fully
implemented by our (ELBRUS) team in a real design
(1978) before anybody else in technology.
The yellow parts require moderate extensions of some
of the green technologies to support fine grained
parallelism.
The red part is the introduction of intra & (fine grained)
inter procedure parallelism. All basic decisions are
well developed; they need to be implemented in a real
design.
The block diagram above includes all basic parts of computer
technology and indicates their current states.
1. Introduction of primitive data types and operations
The implementation of arithmetic is highlighted in green – over 60 years ago this
implementation reached the un-improvable state
• Carry save algorithm – my student's work in 1954 – university presentation in
1955. The first western publication in 1956.
• High radix arithmetic – James E. Robertson, mid 50s. I had a meeting with him in
Moscow in 1958.
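The carry save idea can be shown in a few lines of Python: three operands are reduced to a sum word and a carry word without any carry propagation, so only one final carry-propagating add remains (the function names are invented for this sketch; assumes non-negative integers):

```python
# Sketch of carry-save addition: reduce three addends to two words
# (sum and carry) with no carry propagation between bit positions.
def carry_save_add(a, b, c):
    s = a ^ b ^ c                               # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1  # carries, shifted one position left
    return s, carry

def csa_sum(a, b, c):
    # Only one carry-propagating add is needed at the very end.
    s, carry = carry_save_add(a, b, c)
    return s + carry
```

Because each bit of `s` and `carry` depends on a fixed number of input bits, chains of such reductions (as in multipliers) run without waiting for carry propagation, which is why the technique remains a basic arithmetic primitive.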
2.x Introduction of functionality of user defined data types (Objects) & operations
(Procedure)
This functionality must be defined with the main, and maybe the only, basic goal:
• To fully correspond to the natural meaning of these notions, without corruption
by trying to achieve any optimization, security or other goals.
• If this job is not constrained by any compatibility requirements (especially with
early days' architecture), this approach ensures the best possible byproduct
results for all these goals.
This problem was fully solved in the Elbrus architecture (1978) and showed
outstanding results in two generations of computers widely used in our country.
Though it is difficult to prove this theoretically, it is rather evident that
this approach is the best possible, just because the above goal (the natural meaning
of the functional elements) has only one solution.
2.2.1 Intra & inter proc parallelism
The procedure definition should be extended with intra (fine grained) & inter procedure
parallel execution semantics.
It was not possible to implement this in Elbrus times, because the HW was unable to
support it.
This is part of the work to be done on the parallel architecture implementation.
All basic approaches have already been suggested in our team.
3. “New” HLL introduction
3.0.1 HLL parallelism extension
We have already implemented a new language for such a design in Elbrus (EL – 76).
According to the declared general design principle, this language should be (and is)
with dynamic data types and the type safety approach.
It should be extended with parallel semantics.
4.1 Object oriented memory implementation
Unlike superscalar memory and cache organization, object oriented memory allows
efficient optimization by the local-to-model compiler.
Object oriented memory is fully implemented in Elbrus.
4.1.1 To be extended to cache.
In Elbrus times there was no need to use caches.
All suggestions in this area have already been made.
4.2 Procedure implementation
For advanced architecture procedure is a highly important feature.
Elbrus made a very clean functional implementation of procedures. The basic result is highly modular
programming with strong inter-procedure protection.
This is also a clean and best possible implementation.
The main design step to be done here is its extension for intra & inter procedure parallelism support.
4.2.1 Intra (fine grained) & inter procedure execution parallelism implementation
These are the main design efforts needed to finish the design of the best possible architecture.
Only about ten years of progress in silicon technology were required to make a radically
parallel architecture possible to implement.
Our team has reached this point with extensive past experience in this area:
• The industry-first real OOO superscalar (Elbrus 1, 2), 1978
• Even more important, we found out that it is not the best approach and got rid of it after the second generation (Elbrus 2), 1985
• VLIW (Elbrus 3), with the first successful cluster, ~2000
• Strands (already in Intel), 2007-2013
• Clean loop implementation based on strands, 2007-2013
All these approaches, while reaching good results, are not the best possible (including strands).
Now we have suggested a radical improvement, close to Data Flow, both for scalars and for loops (also,
it seems, for the first time in industry).
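The dataflow idea mentioned above can be illustrated with a toy scheduler. This is entirely hypothetical code, not the proposed design: an instruction fires as soon as all of its operands are available, with no instruction-pointer ordering.

```python
# Toy dataflow interpreter: instructions fire when their operands are ready,
# not in program order. Purely illustrative of the firing-rule principle.
import operator

def run_dataflow(instrs, inputs):
    """instrs: list of (dest, op, src1, src2). Repeatedly fires any
    instruction whose source operands are available."""
    values = dict(inputs)
    pending = list(instrs)
    while pending:
        for i, (dest, op, a, b) in enumerate(pending):
            if a in values and b in values:   # firing rule: operands ready
                values[dest] = op(values[a], values[b])
                pending.pop(i)
                break
        else:
            raise RuntimeError("deadlock: no instruction can fire")
    return values

# Listed out of "sequential" order on purpose: t2 depends on t1.
program = [
    ("t2", operator.mul, "t1", "c"),
    ("t1", operator.add, "a", "b"),
]
result = run_dataflow(program, {"a": 2, "b": 3, "c": 4})
print(result["t2"])   # 20
```

Execution order here is determined purely by data dependences, which is the property that removes the single-instruction-pointer bottleneck criticized earlier in the deck.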
5. “New” OS kernel introduction
Elbrus 1 and 2 are the first and best possible full implementations of this
technology. Due to its basic principles, Elbrus did not need privileged-mode
programming even in the OS kernel.
An OS kernel with the same functionality is about four times simpler
(smaller in size) than today's OSes and can be implemented entirely in
application mode.
Results
• Elbrus, Narch and Narch+ were built strictly according to the approach
presented in this paper. The results are impressive. They come from the work
on, and wide application of, the Elbrus 1, 2 and 3 architectures, and from
detailed simulation of the future design.
• This approach allows implementation of an architecture unconstrained by
any compatibility restriction (Narch+), or compatible with one of the existing
architectures - x86, ARM, POWER, etc. - or even with all of them together
in one HW model with BT (Narch).
Main results over the most powerful Intel processors:
Narch
• Extremely high performance in both single-job (ST) and MT applications -
unreachable for any existing architecture, and possibly reaching an absolutely
un-improvable level
Already shown in detailed simulation,
before introduction of all performance mechanisms:
2x+ on ST
2x on MT with the same area
After debugging is finished:
3x - 4x on ST
2.5x - 3x on MT with the same area
• Substantially better power efficiency and smaller area at the same performance
20% - 30% better power efficiency
60% of the area
• Much simpler architecture design
• Un-improvable for any current architecture; fully compatible with x86, ARM
or any other current architecture.
Main results over the most powerful Intel processors:
Narch+
• Performance is many tens of times higher, both for ST and for MT
• Extremely simple and power efficient
• Substantially simpler and more reliable SW debugging (according to Elbrus
experience - about 10 times)
• Full solution of the security problem for HW, OS and user programs
(with correctness proof) - all attackers will be jobless
• Really universal, which is a rather important feature. No architecture since
the very first vacuum-tube computers has had this characteristic.
It is very likely that after Narch+ introduction (if this happens), it will no
longer be necessary to design a myriad of specialized architectures for graphics,
computer vision, machine learning and so on.
Narch+ will be an absolutely un-improvable architecture almost from the very
first design.

Radical step in computer architecture

  • 1. Radical step in computer architecture Boris Babayan
  • 2. Nearly all basic radical steps in architecture were made by our team before anybody in industry • “Carry save arithmetic” – one of the two basic technologies still in use for main arithmetic primitive operations – my student’s work (1954), presented at university conference (1955). • The best possible architecture functionality definition and implementation in Elbrus computer (1978) widely used in our country including – High level programming architecture support (not just support of the existing HLL corrupted by outdated architecture) – without parallel execution functionality (HW of that time was not ready for that) not implemented so far in any existing computers – Real HLL EL – 76 (1976) for Elbrus computers – Clean best possible OS kernel (no privilege mode) for supporting real High Level programming • Elbrus architecture, which main goal is a real HLL EL – 76, and Elbrus OS kernel as a byproduct, fully solved security problem including possibility of supporting user programs’ correctness proof.
  • 3. OUR RADICAL STEPS (first in industry) (cont.) • The very first-in-technology implementation of OOO superscalar (Elbrus 1 – 1978) and what is even more important at the early stage (after the second generation of Elbrus computers in 1985) getting rid of superscalar approach showing its weak points and starting to find more robust solution of parallel execution problem. • Successful implementation of cluster-based VLIW architecture with fine grained parallel execution (Elbrus 3, end of 90s), probably for the first time in technology. • Suggestion and the fist implementation of Binary Translation (BT) technology for designing a new architecture built on radically new principles but binary compatible with the old ones (Elbrus 3, end of 90s). • Design and simulation of radically new principles of fine grained parallel architecture and extension of HLL (like EL – 76) and OS (like Elbrus OS kernels) for their support.
  • 5. Drawbacks of current superscalar (SS) • Program conversion in SS is rather complicated. Parallel algorithm  sequential binary  implicitly parallel inside SS  sequential at retirement • SS has performance limit (independent of available HW). • Inability to use all available HW properly. • Funny situation exists with SMT mechanism  using SMT instead of using natural algorithm parallelism. • Rather complicated VECTOR HW and MULTI-THREAD programming. • Current architecture corrupted all today’s HLLs. • Current architecture does not support dynamic data typing and object oriented data memory. This excludes possibility to support good security and debugging facility. • Current organization of computations does not allow good optimization. Compiler has no full information about algorithm and HW (corrupted HLL). Cache structure of today’s architecture hides its internal structure preventing compiler from good optimization of its operation. • Today’s architecture is far from being universal. • Etc. An extremely important point here is that all the above-mentioned drawbacks (including HLL, OS) have a single source – inheriting of principles of ancient, early days computing with strong HW size constraints for current architecture as its basic ones.
  • 6. EARLY DAY’S COMPUTING Main constraint – shortage of HW  single execution unit EU and small linear memory Execution unit was un-improvable Carry cave and high radix arithmetic Therefore, the whole architecture was un-improvable and universal with said constraints Basic architecture decisions Single Instruction Pointer binary (SIP) Simple unstructured linear memory (LM) No data types support (No DT) Binary was the sequence (SIP) of instructions for the main resource - single EU Argument of instructions – address of another resource – memory location (LM) No any data type support (No DT) – shortage of resources All execution optimization was programmer’s job, he knows algorithm and HW resources well. At that time both algorithms to be executed and HW were rather simple, so programmer was able to do his job very well Input binary includes instructions how to use resources, rather than the algorithm description. Design was best possible for those constraints.
  • 7. SUPERSCALAR (SS) With SS the situation became different: • No HW size constraint • The main constraint is requirement of user level compatibility with old computer (SIP, LM, No Dynamic data Types) • Program size, HW complexity and optimization job became very big Many drawbacks of superscalar presented above can be split in two areas: • Bad functionality (semantics of data and operations) Without supporting dynamic data types in HW it is impossible to correct this drawback. It is impossible to support real high level programming and full security.
  • 8. SUPERSCALAR (SS) (cont.) • Bad performance In SS optimization is executed by programmer, language compiler and HW. Programmer • Now it is too complicated for him and he doesn't know complicated HW • Due to corrupted HLL he cannot specify results of optimization correctly. Compiler Optimization is the right job for it (for it only), but there are no good conditions for that in SS • Due to corrupted HLL compiler has no full information about algorithm • Now compiler is not local to model – it has no enough info about model HW as well including cache structure, which is hidden from compiler for compatibility reasons. HW (BPU, prefetching, eviction) it is a wrong job for it HW has no algorithm information HW structure is not adjusted for algorithm structure (“artificial binding”)
  • 9. BEST POSSIBLE COMPUTER SYSTEM Radical step for Best Possible System (BPS) should move the design into a strongly opposite extreme – from resources to algorithms care Two BPS systems will be discussed. UNCONSTRAINED BPS with the only constraint – algorithm limitation and specific model HW resources size CONSTRAINED BPS with previous constraints plus user level compatibility with x86 (or ARM, etc.) All mechanisms designed for unconstrained BPS are best possible and should be used as basic in constrained BPS. Besides, a few mechanisms should be added for compatibility support. For this the following requirements should be satisfied for language, compiler and HW for unconstrained BPS
  • 10. New language for BPS Compiler should have full information about algorithm. That means that algorithm should be presented in a new universal language that is not corrupted by old architectures. Programmer’s job is to optimize algorithm only, but not its execution. His responsibility is only to give full information about algorithm to compiler. This language should have at least three important features: • Support of presentation of fine grained parallel algorithms (parallelism) • The right functionality (semantics) of its elements including dynamic data typing and capability feature • Possibility to present exhaustive information about algorithm The second feature is completely implemented in EL-76 language used in several generations of computers in our country.
  • 11. COMPILER for BPS Only compiler can and should do optimization in BPS , but it should have the following good conditions for that: • It should have full information about algorithm Programmer should give it using the new language • It should have full information about HW model Compiler should be local to the HW model Distributable binary should be just a simple recoding of new HLL without any optimizations Compiler will use some dynamic information from execution to be able to tune optimization dynamically • The structure of HW elements should be suitable for good optimization control by compiler (see next slide). Local to model compiler removes compatibility requirements from HW, because local compiler receives binary and, if needed for HW improvement, it can be changed together with compiler.
  • 12. HW requirements for BPS HW in BPS should not do any optimizations (BPU, prefetching, eviction, etc.) – it cannot do this good enough, it has no algorithm info and cannot do complex reasoning at run time for analysis. It should do resources allocation according to compiler instruction. The main point here is that HW structure should avoid “artificial binding” (AB) like SIP, Cache line, Vectors in AVX, Full virtual pages, etc. The data structure in HW should not contradict to that of algorithm. The data in HW should be like Lego Set, which will allow compiler to do restructuring for optimization. The BPS should use Elbrus like object oriented memory structure.
  • 13. CONSTRAINED BPS All past architectures reach un-improvable state for their constraints. This is true for current SS as well. Therefore, at least relaxation of current constraints, with retaining user level ISA compatibility (x86, ARM, etc.), is an absolutely necessary condition to step forward and build constrained BPS. We cannot change semantics of current ISA. The only possibility is to change binary presentation by means of BT. So, the only possible step forward for constrained computer architecture is usage of BT system. With BT constrained BPS will use all mechanisms of unconstrained BPS with adding three more mechanisms to support basic compatibility requirements (SIP, LM). These mechanisms are: • Retirement • Check Point • Memory Lock Table Unfortunately, for semantics compatibility reasons constrained BPS cannot support security and aggressive procedure level parallelization.
  • 15. In a constrained architecture the functionality (semantics) of all its elements (data and operations) is strongly determined by compatibility requirements. In this section we are going to present the main functional features of an unconstrained computer system and its elements, which were developed in accordance with the approach described above. All mechanism implementations good for both constrained and unconstrained systems will be the subject of the following sections. Primitive data types and operations Besides the traditional ones (integer, FP, etc.) they include Data and Functional Descriptors (DD and FD): references to objects and procedures. DYNAMIC PRIMITIVE DATA TYPES For primitive data, HW supports data types together with values dynamically (with TAGs). TYPE SAFETY APPROACH All primitive operations check the types of their arguments.
  • 16. User defined data types (objects) functionality "Natural" requirements for the mechanism of user defined data types (objects) and their implementation: 1) Every procedure can generate a new data object and receive a reference (DD) to this new object 2) This procedure, using the received reference, can do anything possible with this new object: – Read data from this object – Read a full constant only – Update any element – Delete this object 3) No other procedure can access this object just after it has been generated, but this procedure can give a reference to this object to any procedure it knows (has a reference to), with all or a reduced set of the rights listed above 4) Any procedure can generate a copy of a reference to any object it knows, possibly with decreased rights 5) After the object has been deleted, nobody can access it (all existing references are obsolete) This "natural" description quite uniquely identifies a rather simple HW implementation with very high overall execution efficiency (compared with traditional systems).
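The five rules above can be sketched in software. This is a minimal illustration of the capability idea, not the Elbrus HW mechanism; all class and method names here are our own invention:

```python
# Sketch of the "natural" object rules: a data descriptor (DD) carries rights,
# copies may only narrow those rights, and deleting the object makes every
# outstanding DD obsolete. Names are illustrative, not Elbrus terminology.
class Heap:
    def __init__(self):
        self._objects = {}          # object id -> storage
        self._next_id = 0

    def new_object(self, size):     # rule 1: creator receives a full-rights DD
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = [0] * size
        return DD(self, oid, {"read", "write", "delete"})

class DD:
    def __init__(self, heap, oid, rights):
        self._heap, self._oid, self._rights = heap, oid, frozenset(rights)

    def _check(self, right):
        if self._oid not in self._heap._objects:
            raise RuntimeError("dangling reference: object was deleted")
        if right not in self._rights:
            raise PermissionError(right)

    def read(self, i):              # rule 2: access only through the DD
        self._check("read")
        return self._heap._objects[self._oid][i]

    def write(self, i, v):
        self._check("write")
        self._heap._objects[self._oid][i] = v

    def copy(self, rights):         # rules 3, 4: copies may only decrease rights
        assert set(rights) <= self._rights
        return DD(self._heap, self._oid, rights)

    def delete(self):               # rule 5: all existing DDs become obsolete
        self._check("delete")
        del self._heap._objects[self._oid]
```

A read-only copy handed to another procedure cannot write or delete, and after deletion every copy raises an error, which is exactly the behavior the slide's rules 3-5 demand.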
  • 17. User defined data types (cont.) An object can have a user defined Object Type Name (OTN). OTN is also primitive data, allocated to an object by its creator. Primitive HW operations check the types of their arguments. A procedure can also check the type of any object it is working with. The compaction algorithm, an efficient solution of the dangling pointer problem (compared with the less efficient Garbage Collection, GC), was developed in the Elbrus computer. It should be used in an unconstrained BPS. With this approach, the user (similarly to existing systems) explicitly kills the already used object, which (unlike GC) immediately frees physical (but, unfortunately, not virtual) memory. When virtual memory is close to overflow, a background compaction algorithm scans the whole memory sequentially, deleting DDs of killed objects and decrementing the virtual addresses of still alive objects, which results in compacted virtual memory and the possibility of reusing all virtual memory freed from killed objects.
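The compaction sweep can be modeled in a few lines. This is our own simplification of the idea, not the Elbrus implementation: killed objects are dropped, and each live object's virtual base is decremented so freed virtual space becomes reusable:

```python
# Minimal model of the background compaction pass: walk virtual memory in
# order, drop killed objects, and pack the live ones downward. The returned
# relocation map is what the sweep would apply when rewriting surviving DDs.
def compact(objects):
    """objects: list of (virtual_base, size, alive) sorted by virtual_base.
    Returns (packed object list, {old_base: new_base} for live objects)."""
    new_base, result, relocation = 0, [], {}
    for base, size, alive in objects:
        if alive:
            relocation[base] = new_base   # live DDs get their base decremented
            result.append((new_base, size, True))
            new_base += size
        # killed objects are simply skipped: their virtual space is reclaimed
    return result, relocation
```

After the sweep the live objects occupy a contiguous prefix of virtual memory, and everything above `new_base` is free for new allocations.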
  • 18. Procedure mechanism (user defined operations) Here also we would like to discuss the "natural" requirements for procedure construction to support language-level functionality consistent with the "abstract algorithm" ideas. 1) Any procedure can define another procedure, and define any information accessible to the original procedure as global data for the new procedure. In a real running program the only thing needed to define the new procedure is to generate (there is a special instruction in the ISA) a Functional Descriptor (FD), which allows calling this new procedure. 2) The procedure which generated this FD can give the new FD to anybody it has access to, and the new owner can also call the new procedure (only call it, without access to its global data, executable code, etc., which can be used by the called procedure only). 3) The procedure which generates the FD includes in it the virtual address of the code to be executed when the new procedure is called, and also the virtual address of the global data object, which can be used by the instructions of the new called procedure. Therefore both references are included in the FD (a reference to the code and a reference to the global data). 4) Any procedure which has the FD of the new procedure can call it and can give it some parameters. Parameter passing is logically an atomic step: the new procedure does not work (not a single instruction of the called procedure is executed) before the caller specifies all parameters; the caller has no access to the parameters passed to the callee after the call is executed. 5) The caller can receive some return data as a result of procedure execution. These data can be used by the caller code. Here also we have atomic return value passing.
  • 19. Procedure mechanism (user defined operations) (cont.) An extremely important notion for a procedure is the procedure context: this is the only set of data which the called procedure can use. The called procedure can use nothing besides the procedure context. The procedure context includes: • Global data given to the procedure by the creator procedure • Parameter data from the caller • All data returned to the procedure by procedures called by this procedure. The restriction of a procedure to context-only access is the result of HW architecture features: • Dynamic data types and type safety support in primitive operations • Strong support of the semantics of references (DD and FD) This is the foundation of capability technology, which ensures strong inter-procedure protection. Implementation of all these features in HW is a rather simple and efficient job.
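The FD semantics described on these two slides (a pair of references, code plus global data, whose holder may only call) maps naturally onto closures. A hedged sketch with invented names, not the HW encoding:

```python
# An FD pairs a code reference with a global-data reference. The holder can
# only call it; the callee's context is exactly its creator-given globals,
# the caller's parameters, and what its own callees return.
def make_fd(code, global_data):
    """Models the ISA instruction that generates a Functional Descriptor."""
    def fd(*params):                          # atomic parameter passing
        return code(global_data, *params)     # atomic return value passing
    return fd                                 # holder sees neither component

def counter_code(state, step):
    # `state` is the global data object bound in by the FD's creator;
    # `step` is a parameter from the caller.
    state["n"] += step
    return state["n"]

fd = make_fd(counter_code, {"n": 0})   # creator builds the FD and may hand it out
```

Any procedure handed `fd` can call it and observe results, but has no way to reach `counter_code` or the `{"n": ...}` state object through the FD, mirroring the "call only, no access to globals or code" rule.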
  • 20. Full solution of the security problem Strong inter-procedure protection ensures that no attacker can corrupt the functioning of system SW (if it has no internal mistakes) and model HW. An attacker cannot access any system data, as a result of the capability feature, just because the attacker will never have any references (DD or FD) to system data. Nobody can send them to him, and he is unable to "create" them artificially. He is also unable to do anything bad without real references to system SW. However, today a lot of security problems are the result of an attacker's ability to exploit mistakes in the user programs he is working with. Logically, the only remedy here is the possibility of using the well developed technology of program correctness proof. However, with today's architectures (x86, ARM, etc.) even a procedure without any mistakes can be corrupted by an attacker due to the imperfect old architecture. This is not the case with a capability system, where a correctness proof gives a reliable result. The presented approach fully solves the security problem. This technology was fully implemented in the Elbrus computer about 40 years ago. Unfortunately, nobody to date is even close to this solution.
  • 22. Object oriented memory (OOM) OOM was designed and used in two generations of the Elbrus computer with good results. Unfortunately, at that time there was no requirement for a cache. But now it can be easily extended to caches. The current Narch design was made on a traditional memory and cache structure. However, this memory structure doesn't correspond to the above philosophy. The OOM design can be used to the full degree in an unconstrained BPS. Unfortunately, it cannot be used for the memory system of a constrained BPS (Narch) due to compatibility reasons. However, it can be used in its cache system. The OOM structure, even for a constrained BPS, according to preliminary estimations can decrease cache sizes by 2-3 times and nearly eliminate performance losses due to cache misses.
  • 23. Object oriented memory (OOM) implementation The organization of physical memory and of all cache levels is, in general, the same. The following description applies to all of them. The size of physical memory allocated for an object is equal to the object size. However, each allocated object is also loaded into the virtual space. This space has fixed-size pages. For each new object, virtual space is allocated from the beginning of a new page. If the size of the object is smaller than the page size, then the end of the virtual space of this page is empty (not used). If the object is bigger than a virtual page, then a number of pages are allocated for it and the last one may not be fully used. One of the main results of this organization is that each page can include data of one object only; no page can ever include data of more than one object. All free space is explicitly visible to HW and compiler (no "artificial binding"). In memory, as well as in caches, an arbitrary physical part of the object can be allocated (by the compiler local to the model) in some specific cache. All physical space (of variable size), both in memory and at any cache level, is allocated dynamically. Therefore, the whole free space is highly fragmented, and it is very difficult, sometimes impossible, to allocate a rather big piece of an object. We split objects into pages to cope with this problem. However, at the cache level even the page size is big from this viewpoint. Therefore, parts of an object allocated at cache levels are split by the compiler local to the model into even smaller parts (all these parts belong to the same virtual page).
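The "each object starts on a fresh page" rule is easy to state as arithmetic. A small sketch with an assumed page size; the names are ours:

```python
# Virtual-space allocation under the one-object-per-page rule: every object
# begins at a page boundary, occupies ceil(size / PAGE_SIZE) pages, and the
# tail of its last page stays unused rather than being shared.
PAGE_SIZE = 4096   # assumed fixed page size for illustration

def allocate_virtual(next_free_page, size):
    """Returns (base_address, pages_used, new_next_free_page)."""
    pages = -(-size // PAGE_SIZE)        # ceiling division
    base = next_free_page * PAGE_SIZE    # object begins a new page
    return base, pages, next_free_page + pages
```

Because no page is shared, both HW and the model-local compiler can see exactly which tail space on each page is free, which is the "no artificial binding" property the slide emphasizes.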
  • 24. Object oriented memory (OOM) implementation (cont.) The system supports special lists for all free spaces. Each list keeps the free areas of a certain set of sizes (most likely, powers of 2). Each free area is listed in one of the bidirectional lists through the first word of this free piece. Actually, OOM uses virtual numbers of the objects instead of virtual memory addresses. Therefore, in the case of an object with a size of many pages, all its pages will have the same virtual object number. Full identification of a specific element of the object will include the virtual object number and its index inside the object. However, the descriptor includes the virtual object number only. In OOM an object does not necessarily have to be present in memory. Some objects can be generated, for example, in the Level 1 cache only or in other cache levels.
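The free-space lists can be sketched as follows. For brevity this model uses singly linked lists where the slide specifies bidirectional ones, and all names are our own; the key point it shows is the list threaded through the first word of each free piece:

```python
# Size-classed free lists: free areas are grouped by the smallest power of two
# that covers them, and each class's list is threaded through the first word
# of every free piece (modeled here by a dict standing in for memory).
class FreeLists:
    def __init__(self, num_classes=16):
        self.heads = [None] * num_classes   # one list head per size class
        self.memory = {}                    # addr -> first word (next pointer)

    @staticmethod
    def size_class(size):
        return max(0, (size - 1).bit_length())   # smallest k with 2**k >= size

    def free(self, addr, size):
        k = self.size_class(size)
        self.memory[addr] = self.heads[k]   # link through the first word
        self.heads[k] = addr

    def allocate(self, size):
        k = self.size_class(size)
        addr = self.heads[k]
        if addr is None:
            return None                     # no free piece of this class
        self.heads[k] = self.memory.pop(addr)
        return addr
```

Threading the list through the free pieces themselves costs no extra storage, which matters when free space is as fragmented as the previous slide describes.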
  • 25. Object oriented memory (OOM) implementation (cont.) This memory/cache system organization allows stronger compiler control over execution. The compiler knows all the program's semantic information and does more sophisticated optimization. The compiler can preload the needed data to a high cache level, at first without committing the more valuable register memory, and move this data from cache to register only at the last moment. But now even preloading directly into a register can sometimes be a good alternative: now we have a big register file. This cache organization allows access to the first level cache directly from an instruction by physical address, without using a virtual address and associative search. To do this, the base register (BR) can support a special mode in which it includes pointers to the physical location in the first level cache together with its virtual address.
  • 26. Procedure mechanism (implementation) In the past we used the "strands" approach for this implementation. While the "strand" approach is substantially better than superscalar, it still allows dramatic improvement. In the strand implementation, each strand is a HW resource. The parallelism level of a dynamically executed program varies depending on the dynamic resource situation; therefore, execution should be able to dynamically fork a new strand, which requires a new resource. Typically, for such a situation a deadlock avoidance problem has to be solved. A static solution of this problem decreases performance. This is less dangerous for loops, because loops can be executed nearly without stopping and forking strands. However, it is not so good for scalar code. Here we will discuss a substantially more advanced suggestion, which is good for scalar code and increases performance for loops as well. It can be used in a constrained BPS (Narch) as well. This will improve the already declared performance data for Narch.
  • 27. Procedure mechanism (implementation) (cont.) In the new approach, the code to be executed is presented as a fine grained parallel graph with instructions in its nodes and dependencies presented by the arcs of the graph. The compiler splits this graph into a number of streams, similar to strands in the current implementation. Instead of the frontend in the current design, the new approach has only a code buffer for the whole graph (not for separate streams). Four basic technologies are used here: • Register allocation, which is not so trivial in the case of fine grained dynamic code execution • Speculative execution (control and data speculation): same as today in Narch • Dynamic execution of the parallel instruction graph by "workers" • Instruction graph loading into the instruction buffer
  • 28. Register allocation DL/CL technology The scalar code (streams) graph can be crossed both by DL and by CL lines. The code can have several DLs and CLs, each having a corresponding number: DLn and CLn. All instructions which cross a DL or CL include this information, and HW knows when a specific line has been crossed. When some DLn has been crossed, that means that some register webs are already free (all reads and writes are finished) and can be reused. The registers which were freed with DLn can be used by the compiler in instructions after the corresponding CLn. Therefore, the corresponding CLn can also be crossed by the corresponding streams. If some instruction marked by CLn is being executed and the corresponding DLn has not been crossed yet, this instruction will wait until this happens. The program will be executed correctly, but the time of execution can be improved. Dynamic feedback (in HW) collects information on whether any CL was waiting, and using this information the compiler can later recompile the procedure, lifting the corresponding DL a little bit. Eventually, the program will work without any time losses for CL waits.
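Our reading of the DL/CL protocol can be sketched with events. This is a software model of the slide's description, with invented names; the HW version is of course not built from threads:

```python
# DL/CL as a synchronization protocol: crossing DLn signals that a set of
# register webs is dead; an instruction marked CLn stalls until DLn has been
# crossed, so registers freed at DLn may be reused after CLn. The `waited`
# flags model the dynamic feedback the compiler uses to lift a DL on recompile.
import threading

class DLCL:
    def __init__(self, num_lines):
        self.crossed = [threading.Event() for _ in range(num_lines)]
        self.waited = [False] * num_lines   # feedback: did any CLn ever stall?

    def cross_dl(self, n):      # a stream crosses DLn: its register webs are free
        self.crossed[n].set()

    def cross_cl(self, n):      # a stream reaches an instruction marked CLn
        if not self.crossed[n].is_set():
            self.waited[n] = True           # record the stall for recompilation
            self.crossed[n].wait()          # correct but slower: wait for DLn
```

If `waited[n]` stays false, the compiler's placement of DLn was early enough and no time is lost, which is the steady state the slide describes.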
  • 29. Speculative execution (control and data) Branch execution in the new approach is similar to the previous one. The BT compiler in the constrained version and the high level language compiler in the unconstrained version generate fine grained parallel binary for HW. Unlike superscalar with BPU technology, where all branches are critical and need predicted speculative execution for each branch, with performance losses in case of misprediction, in our case, due to explicitly parallel execution, according to our statistics 80% of branches are not critical and can be executed without speculation. Even with critical branches, in our case, when the predicate is known well ahead, or has a very strong prediction by the compiler, there is no need for speculation. Critical branches with a late predicate and bad compiler prediction should execute both alternatives speculatively until the predicate is known. As a result, in our case we have no performance losses for branches at all. The situation with data speculation is similar.
  • 30. Dynamic execution of the parallel instruction graph by "workers" For the constrained architecture, the compiler will do all decoding itself instead of HW; therefore, each instruction in the code is ready to be loaded into the corresponding execution unit. In the unconstrained case, each instruction will also not need any decoding. For each instruction the compiler will calculate a "Priority Value Number" (PVN). This number is the number of clocks from this instruction to the end of the scalar code along the longest path. The compiler will present the code as a number of dependent instruction sequences, "streams" (similar to strands in the previous design). In this architecture, from the very beginning the processor will execute not "single instruction pointer" sequential code, but the whole graph of the algorithm: all streams, with the explicitly parallel structure visible to HW. To make this possible the processor, besides the register file, includes the code buffer. The new technology removes the frontend from HW entirely. There are many other advantages of this step as well. As code will be executed in fine grained parallel mode, each register should have an EMPTY/FULL (E/F) bit to prevent reading from an empty register and to make the reading instruction wait until the result is assigned.
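The PVN as defined above is a longest-path computation over the dependency DAG. A minimal sketch of how the compiler could compute it (our own formulation; function and argument names are illustrative):

```python
# PVN: for each instruction, the number of clocks along the longest dependency
# path from that instruction to the end of the scalar code. Computed by a
# memoized walk over the DAG of consumers.
def priority_value_numbers(latency, successors):
    """latency: {instr: clocks}; successors: {instr: [instrs consuming it]}."""
    pvn = {}
    def longest(i):
        if i not in pvn:
            tails = [longest(s) for s in successors.get(i, [])]
            pvn[i] = latency[i] + max(tails, default=0)
        return pvn[i]
    for i in latency:
        longest(i)
    return pvn
```

Workers can then prefer the stream whose head instruction has the largest PVN, since delaying it would lengthen the whole schedule.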
  • 31. Dynamic execution of the parallel instruction graph by "workers" (cont.) Our engine has a number of "workers" in each cluster, whose job is to take the next instructions from the most important streams and allocate them to a corresponding execution unit. The number of workers in each cluster should be enough to keep all execution units busy every clock. Our preliminary guess is that each cluster should have about 16 workers. A worker loads into the Reservation Station (RS) a candidate instruction which is ready to be executed (all argument registers are FULL, or the instruction which should generate their value has already been sent into the RS, which needs yet another bit (RS) in each register, and the destination is EMPTY). Besides the E/F and RS bits, each register has (one byte) the head of the list of the streams which are waiting for the result to be written into this register from some other stream. If at least one argument of the next instruction to be allocated is not ready, the worker stops working with this stream and puts the stream into the one-directional list of one of the registers which is not ready. This work requires two register assignments, which can be done in parallel; however, at this point the worker is free anyway, and it searches for any other stream ready to be handled.
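A toy software model of this worker discipline follows. It is our own construction for illustration only: one worker at a time, values standing in for E/F bits, and per-register wait lists exactly as the slide describes:

```python
# Toy model of worker-driven dataflow execution: a register is FULL when it
# holds a value; a worker runs down a stream until an argument is EMPTY, then
# parks the stream on that register's wait list; writing a register wakes the
# streams parked on it.
from collections import deque

def run_streams(streams, initial_full):
    """streams: {name: [(dst_reg, src_regs, fn), ...]}. Runs all to completion."""
    full = dict(initial_full)          # reg -> value (presence models the F bit)
    waiters = {}                       # reg -> streams parked on it
    ready = deque(streams)             # streams a free worker may pick up
    pos = {s: 0 for s in streams}
    while ready:
        s = ready.popleft()            # a free worker takes a stream
        while pos[s] < len(streams[s]):
            dst, srcs, fn = streams[s][pos[s]]
            missing = [r for r in srcs if r not in full]
            if missing:                # park on the first not-ready register
                waiters.setdefault(missing[0], []).append(s)
                break
            full[dst] = fn(*[full[r] for r in srcs])   # write sets FULL
            pos[s] += 1
            for w in waiters.pop(dst, []):   # wake streams waiting on dst
                ready.append(w)
    return full
```

Even this toy shows the key property: no stream ever busy-waits, so a small pool of workers can keep execution units fed from whichever streams are ready.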
  • 32. Loading of the instruction graph into the instruction buffer DL/CL technology helps to solve the big code problem. For the code buffer, it is necessary to have its extension. When the code before CLn is executed, it is necessary to load the next part of the code, between CLn and CLn+k. Similarly, when DLn is crossed, the whole code area above it can be freed. The size of the code between CLn and CLn+k is not bigger than the size of the register file.
  • 33. Example: Structure of Recurrent Loop Dependencies • Use loop iteration parallelism (both iteration-internal and inter-iteration) as fully as possible • Loop iteration analysis performed by the compiler: – Find instructions which are self-dependent over iterations – Find the groups of instructions which, being self-dependent, are also mutually dependent over the iterations ("rings" of data dependency) – The rest of the instructions create sequences, or graphs of dependent instructions (a number of "rows") – The result of each row is either an output of the iteration (a STORE, for example), or is used by another row(s) or ring(s) • Each "ring" and/or "row" loop produces data which are consumed by other small loops. Each producer can have a number of consumers. However, producer and consumer should be connected through a buffer, giving the producer the possibility to go forward if the consumer is not yet ready to use the data.
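The producer/consumer decoupling at the end of the slide can be sketched with a bounded buffer. A minimal model of our own (the squaring "ring" and the add-one "row" are arbitrary stand-ins):

```python
# A producer "ring" feeds a consumer "row" through a bounded buffer, so the
# producer may run ahead by up to buffer_size iterations when the consumer
# is not yet ready, which is exactly the decoupling the slide asks for.
from collections import deque

def run_loop(n, buffer_size=4):
    buf = deque()                 # decoupling buffer between the two small loops
    out = []
    produced = consumed = 0
    while consumed < n:
        if produced < n and len(buf) < buffer_size:
            buf.append(produced * produced)   # producer ring: i -> i*i
            produced += 1
        elif buf:
            out.append(buf.popleft() + 1)     # consumer row: x -> x + 1
            consumed += 1
    return out
```

With several consumers, each would get its own buffer fed from the same producer, matching the "each producer can have a number of consumers" remark.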
  • 34. Basic components of computer technology, their current state and our involvement in their implementation 1. Primitive data types and operations introduction 2.1 User defined data types functionality (Objects introduction) 2.2 User defined operations functionality (Procedure introduction) 2.2.1 Intra & inter proc parallelism 3. "New" HLL introduction 3.0.1 Parallelism 4.1 User structural data architecture support (Object oriented memory implementation) 4.1.1 To be extended to cache 4.2 User "operations" (procedure) implementation 4.2.1 Intra (fine grained) & inter procedure execution parallelism architecture implementation 5. "New" OS kernel introduction
  • 35. Green parts of computer technology were fully implemented by our (ELBRUS) team in a real design (1978) before anybody else in the technology. Yellow parts require moderate extensions of some of the green technologies to support fine grained parallelism. The red part is the introduction of intra & (fine grained) inter procedure parallelism. All basic decisions are well developed and need to be implemented in a real design.
  • 36. The block diagram above includes all basic parts of computer technology and indicates their current state. 1. Introduction of primitive data types and operations The implementation of arithmetic is highlighted in green: over 60 years ago this implementation reached the un-improvable state. • Carry save algorithm: my student's work in 1954, university presentation in 1955. The first western publication in 1956. • High radix arithmetic: James E. Robertson, mid 50s. I had a meeting with him in Moscow in 1958. 2.x Introduction of the functionality of user defined data types (Objects) & operations (Procedures) This functionality must be defined with one main and maybe only basic goal: • To fully correspond to the natural meaning of these notions, without corruption by trying to pursue optimization, security or other goals. • If this job is not constrained by any compatibility requirements (especially with early days' architectures), this approach ensures the best possible byproduct results for all these goals. This problem was fully solved in the Elbrus architecture (1978) and showed outstanding results in two generations of computers widely used in our country. Though it is difficult to prove theoretically, it is rather evident that this approach is the best possible, just because the above goal (the natural meaning of functional elements) has only one solution.
  • 37. 2.2.1 Intra & inter proc parallelism Procedure definitions should be extended with intra (fine grained) & inter procedure parallel execution semantics. It was not possible to implement this in Elbrus times, because HW was unable to support it. This is part of the work to be done on the parallel architecture implementation. All basic approaches have already been suggested in our team. 3. "New" HLL introduction 3.0.1 HLL parallelism extension We have already implemented a new language for such a design in Elbrus (EL – 76). According to the declared general design principle this language should be (and is) a language with dynamic data types and the type safety approach. It should be extended with parallel semantics. 4.1 Object oriented memory implementation Unlike superscalar memory and cache organization, object oriented memory allows efficient optimization by the compiler local to the model. Object oriented memory is fully implemented in Elbrus.
  • 38. 4.1.1 To be extended to cache In Elbrus times there was no need to use caches. All suggestions in this area have already been made. 4.2 Procedure implementation For an advanced architecture the procedure is a highly important feature. Elbrus made a very clean functional implementation of the procedure. The basic result is highly modular programming with strong inter-procedure protection. This is also a clean and best possible implementation. The main design step to be done here is its extension for intra & inter procedure parallelism support. 4.2.1 Intra (fine grained) & inter procedure execution parallelism implementation These are the main design efforts for finishing the design of the best possible architecture. Only now, with about 10 more years of silicon technology progress, has it become possible to implement a radically parallel architecture. Our team has reached this point with big past experience in this area: the industry-first real OOO superscalar (Elbrus 1, 2), 1978; even more important, we found out that it is not the best approach and got rid of it after the second generation (Elbrus 2), 1985; VLIW (Elbrus 3) with the first successful cluster, ~2000; strands (already at Intel), 2007 - 2013; clean loop implementation based on strands, 2007 - 2013. All these approaches, while reaching good results, are not the best possible (including strands). Now we have suggested a radical improvement close to Data Flow both for scalar code and for loops (also, it seems, for the first time in industry).
  • 39. 5. "New" OS kernel introduction Elbrus 1 and 2 are the first and the best possible full implementation of this technology. Due to its basic principles, Elbrus did not need to use privileged-mode programming even in the OS kernel. An OS kernel implementation with the same functionality is about four times simpler (smaller in size) compared with today's OSs and can be implemented in application mode only.
  • 40. Results • Elbrus, Narch and Narch+ are made strictly according to the approach presented in this paper. The results are impressive. These are the results of the work and application of the widely used Elbrus 1, 2, 3 architecture and of detailed simulation of the future design. • This approach allows implementation of an architecture unconstrained by any compatibility restrictions (Narch+), or compatible with one of the existing architectures (x86, ARM, POWER, etc.), or even with all of them together in one HW model with BT (Narch).
  • 41. Main results over the most powerful Intel processors: Narch • Extremely high performance both in single job (ST) and MT applications, unreachable for any existing architecture; maybe it can reach an absolute un-improvable level – Already shown in detailed simulation, before introduction of all performance mechanisms: 2x+ on ST, 2x on MT with the same area – After finishing debugging: 3x - 4x on ST, 2.5x - 3x on MT with the same area • Substantially higher power efficiency and smaller area with the same performance: 20% - 30% power efficiency, 60% area • Much simpler architecture design • Un-improvable for any current architecture; fully compatible with x86 or ARM or any other current architecture.
  • 42. Main results over the most powerful Intel processors: Narch+ • Performance is many tens of times higher both for ST and for MT • Extremely simple and power efficient • Substantially simpler and more reliable SW debugging (according to Elbrus experience, by 10 times) • Full solution of the security problem for HW, OS and user programs (with correctness proof): all attackers will be jobless • Really universal, which is a rather important feature. No architecture since the very first vacuum tube computers has had this characteristic. It is very likely that after the Narch+ introduction (if this happens), it will not be necessary to design a myriad of specialized architectures for graphics, computer vision, machine learning and so on. Narch+ will be an absolutely un-improvable architecture nearly from the very first design.