This talk/tutorial was one that I delivered to multiple organizations -- ranging from semiconductor houses, to start-up system vendors, to research and academic institutions -- back in the 2002 time frame. As the abstract below illustrates, it captures the key principles behind the router designs of two of the most popular and landmark switch/routers in our industry -- the Cisco...
One may divide the evolution of switch architectures into roughly four generations. The first generation consisted of the simplest bus-based shared-memory switches. The second generation involved slightly more advanced techniques, distributing the processing and memory to the line cards. The third generation replaced the bus as the means of connecting inputs to outputs with a variety of interconnection fabrics, such as a crossbar. The fourth generation took two forms: the interconnection of smaller ASIC-based components in a regular fashion, or the interconnection of distributed line cards via a high-performance centralized core.
Here each packet is stored in the centralized shared memory and is examined by the CPU. In the absence of DMA, each packet crosses the backplane 4 times! The three bottlenecks here are: the CPU, the memory, and the backplane. The architecture is blocking if the bus bandwidth or the CPU processing capacity is less than 4·N·R, for N ports running at line rate R. The delay incurred by the packets/cells is a function of the CPU speed and memory I/O. Assuming memory is adequate, the throughput is upper-bounded by the CPU power or the bus speed.
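To make the 4·N·R condition concrete, here is a minimal back-of-the-envelope sketch; the port count, line rate, bus bandwidth, and CPU forwarding rate below are illustrative assumptions, not figures for any particular product:

```python
# Back-of-the-envelope check for a first-generation (bus + shared memory + CPU) switch.
# All figures are illustrative assumptions, not vendor specifications.

N = 16                 # number of ports (assumed)
R = 100e6              # line rate per port, bits/sec (assumed: 100 Mb/s)
bus_bw = 2e9           # backplane/bus bandwidth, bits/sec (assumed: 2 Gb/s)
cpu_pps = 500e3        # CPU forwarding capacity, packets/sec (assumed)
pkt_bits = 64 * 8      # worst case: 64-byte packets

# Without DMA each packet crosses the bus 4 times (line card -> memory, memory -> CPU,
# CPU -> memory, memory -> line card), so the bus must carry 4*N*R to stay non-blocking.
required_bus_bw = 4 * N * R
required_pps = N * R / pkt_bits          # aggregate packet arrival rate at full load

print("bus non-blocking:", bus_bw >= required_bus_bw)
print("CPU keeps up:    ", cpu_pps >= required_pps)
```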
This is still the architecture of many smaller commercial Ethernet switching platforms. Typical backplane speeds are in the neighborhood of 2 Gb/s. Very sophisticated techniques can yield up to about 20 Gb/s.
An example is the Cisco Catalyst 2820 series of Ethernet switches. With 24 10BaseT and 2 100BaseT ports, the minimum bus throughput needed is 880 Mb/s, and the device has a 1 Gb/s bus, so the switch is not bottlenecked by the bus. However, the 10 Mb/s ports require a forwarding rate of 20 Kpps for 64B packets, whereas only 14.8 Kpps is supportable. This implies that the performance is CPU-limited.
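The arithmetic behind these numbers, as I read it: the factor of two for bus crossings is my assumption about how the 880 Mb/s figure was derived, and 14.88 Kpps is the standard 64B line rate of a 10 Mb/s Ethernet port:

```python
# Catalyst 2820 back-of-the-envelope (my reconstruction, not vendor data).
# Assumption: each packet crosses the shared bus twice (once in, once out), which is
# how I read the 880 Mb/s minimum bus throughput figure.
aggregate_line_rate = 24 * 10e6 + 2 * 100e6        # 440 Mb/s of offered traffic
min_bus_throughput = 2 * aggregate_line_rate       # 880 Mb/s, against a 1 Gb/s bus

# 64B frame on the wire: 64B frame + 8B preamble + 12B inter-frame gap = 84B
frame_bits = (64 + 8 + 12) * 8
pps_per_10M_port = 10e6 / frame_bits               # ~14,880 pps per 10BaseT port
print(min_bus_throughput / 1e6, "Mb/s;", round(pps_per_10M_port), "pps per 10 Mb/s port")
```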
Here some amount of routing functionality, in the form of a small cache of recently seen addresses, is distributed onto the line cards, so the line cards can have dedicated packet-forwarding engines for the routing/lookup function. This allows line-rate processing even for small packets. Packets whose destination address is found in the cache go through the fast path, crossing the backplane only once. Packets whose addresses are not found in the cache must go through the CPU, i.e. the slow path. The delay and blocking characteristics are the same as before, except that throughput can also be limited by the bus bandwidth or the performance of the lookup engines.
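A minimal sketch of this fast-path/slow-path split, assuming an exact-match destination cache on the line card (real lookups are longest-prefix match; all names here are illustrative, not taken from any product):

```python
# Second-generation line card with a route cache (illustrative, exact-match for simplicity).
class LineCard:
    def __init__(self, cpu_lookup):
        self.route_cache = {}          # recently seen destination -> output port
        self.cpu_lookup = cpu_lookup   # slow path: full routing-table lookup on the central CPU

    def forward(self, dest, packet):
        port = self.route_cache.get(dest)
        if port is None:                          # cache miss: punt to the CPU (slow path)
            port = self.cpu_lookup(dest)
            self.route_cache[dest] = port         # fill the cache for subsequent packets
        return self.send_to_fabric(port, packet)  # forward toward the output line card

    def send_to_fabric(self, port, packet):
        return (port, packet)                     # stand-in for the transfer across the backplane

routing_table = {"10.1.2.3": 1, "192.168.7.9": 2}
card = LineCard(lambda dest: routing_table.get(dest, 0))
card.forward("10.1.2.3", b"first packet")          # miss: slow path, then cached
print(card.forward("10.1.2.3", b"second packet"))  # hit: fast path
```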
The 3COM CoreBuilder is an example. This is a larger Ethernet platform, with up to 17 slots of 24 10BaseT ports each. This gives a minimum required bus bandwidth of about 4 Gb/s, whereas the platform only has about 2 Gb/s (which, as I pointed out earlier, is the economic/technical limit for cheaper switches). The platform does have ASIC-based switching capable of handling up to 650 Kpps, which exceeds the required performance for Ethernet slots but is a little below that for Fast Ethernet slots. So here we have a platform whose performance is bus-bandwidth limited.
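The same kind of arithmetic for the CoreBuilder; the per-slot packet rate is my own illustration and assumes 64B frames with standard Ethernet framing overhead:

```python
# CoreBuilder back-of-the-envelope (my reconstruction).
frame_bits = (64 + 8 + 12) * 8                      # 64B frame + preamble + inter-frame gap
slots, ports_per_slot, port_rate = 17, 24, 10e6
min_bus_bw = slots * ports_per_slot * port_rate     # ~4.08 Gb/s needed vs. ~2 Gb/s available
pps_per_10bt_slot = ports_per_slot * port_rate / frame_bits   # ~357 Kpps, under the 650 Kpps ASIC
print(min_bus_bw / 1e9, "Gb/s;", round(pps_per_10bt_slot / 1e3), "Kpps per 10BaseT slot")
```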
Third-generation architectures replaced the bus with a switched interconnect that can handle multiple transactions in parallel. One could have a backplane or mid-plane design, with multiple switch interconnects. The point-to-point links can also be faster than the bus, reaching speeds of multiple gigabits per second. The interconnect is usually non-blocking for unicast traffic. The delays in such a system are in the tens of microseconds for an unloaded system. Theoretically, full line-rate throughput is possible if the scheduling across the interconnect and the queueing on the line cards can keep up, which isn’t often the case! Today, this is the state of the art for many switches/routers, including the Cisco GSR family.
The multi-gigabit and multi-terabit architectures involve interconnecting what are essentially smaller switches in some regular topology. Each node is an ASIC-based switch. Note that most of these architectures distribute the forwarding or data plane, while keeping the routing or control plane centralized.
Another technique adopted in fourth-generation architectures is to place a dense switch core away from the line cards, with the forwarding distributed on the line cards. The core and the line cards may be interconnected in a regular topology, combining this approach with the previous one. Today’s high-performance systems are in the 20-100 Gb/s range, with between 8 and 32 ports at OC-48 or OC-192 speeds.
We’ll look now at data flow through an IP router. Our focus will be on the line cards, since much of the packet processing happens there; I will cover the scheduling of packets through the fabric in detail in Lecture 5 tomorrow. So here we will look at the packet flow through an incoming line card, through the fabric, and back through an outgoing line card.

On the ingress side, the physical-layer interface converts the optical signal into an electronic bit stream. The input framer delineates this stream to extract packet data, which is passed to the packet-processing section of the line card. This section consists of a forwarding or lookup engine, which is responsible for basic IP address lookup and forwarding, but also performs classification/marking, shaping, and policing, and applies filters and ACLs. The traffic manager or scheduler is responsible for ensuring QoS via shaping and virtual output queueing. The fabric interface fragments the packets into cells and prepares them for transmission through the fabric.

In the output direction, the fabric interface reassembles the incoming cells and stores the packets in the buffer memory, where buffer-acceptance policies such as RED or WRED may be applied to them. The packets are queued based on class, flow, priority, etc. Finally, the link scheduler schedules packets for transmission using one of several scheduling strategies such as RR, WRR, DRR, SCFQ, or fair queueing, and the outgoing data is framed and transmitted on the output links.
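A minimal software sketch of that line-card pipeline, with each stage reduced to a function; every name here is mine rather than a vendor's, and each function stands in for what is really a framer, forwarding engine/NP, traffic manager, or fabric-interface ASIC:

```python
# Illustrative model of the line-card data path described above (not any real product).

CELL = 64                                  # fabric cell size in bytes (assumed)
FIB = {"10.1.2.3": 3}                      # destination -> output port (exact-match stand-in for LPM)

def ingress(packet, voqs):
    """Lookup/classify, segment into cells, place on a virtual output queue toward the fabric."""
    port = FIB.get(packet["dest"], 0)      # forwarding engine (ACLs, policing, marking omitted)
    payload = packet["payload"]
    cells = [payload[i:i + CELL] for i in range(0, len(payload), CELL)]  # fabric interface: packet -> cells
    voqs.setdefault(port, []).extend(cells)                              # traffic manager: per-output VOQ
    return port

def egress(cells, queues):
    """Reassemble cells, enqueue by class, and schedule onto the output link."""
    pkt = {"payload": b"".join(cells), "class": 0}    # fabric interface: cells -> packet
    queues.setdefault(pkt["class"], []).append(pkt)   # buffer acceptance (RED/WRED) omitted for brevity
    for cls in sorted(queues):                        # trivial priority pick standing in for RR/WRR/DRR
        if queues[cls]:
            return queues[cls].pop(0)                 # handed to the output framer for transmission

voqs, queues = {}, {}
port = ingress({"dest": "10.1.2.3", "payload": b"x" * 200}, voqs)
print(port, egress(voqs[port], queues))
```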
Let’s now see how the functional map is realized in chips. The input and output framers are in one chip; framers include the Agere TADM, AMCC Ganges, Vitesse 9184, and the Cypress POSIC. The forwarding engine can be broken out: it might be a single chip, for example NPs such as the Agere NP10 or IBM Ranier or Sanford, or it may be a chip set with an NP, a co-processor, and an NSE, which together perform the forwarding and lookup function. The TM is usually an ASIC chipset with separate chips for the ingress and egress directions, such as the QX1 from EZChip, the TMC10 from Internet Machines, or the TM10 from Agere.
Now let’s dive into two contemporary examples. I’ll start with the Juniper M Series routers. This is a basic fact sheet. The important point to note here is that the M40 and the M160 have a throughput half of what their numbers suggest. Also, their oft-advertised packet-processing figure of 40 Mpps is for 64B packets, which is significantly below the roughly 62 Mpps of aggregate performance that you’d need for 40B packets.
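As a quick sanity check on the 62 Mpps figure, assuming the M40's usable throughput is 20 Gb/s (i.e. half of the advertised 40 Gb/s, per the point above):

```python
# Minimum-size (40B) IP packets at 20 Gb/s of usable throughput (assumption per the text).
usable_throughput = 20e9            # bits/sec
min_packet_bits = 40 * 8
print(usable_throughput / min_packet_bits / 1e6, "Mpps")   # -> 62.5 Mpps
```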
The system architecture is cleanly separated into a CPU-based routing engine and an ASIC-based forwarding engine. The routing engine is responsible for running the routing process and other management software, all within the Juniper JUNOS operating system. The forwarding engine is built around a computer-scale, ASIC-based packet processor, the Internet Processor family of ASICs. The Internet Processor had 1M gates (6.5M transistors, compared to 7.5M for the Pentium II and 55M for the Pentium IV), while the Internet Processor II has over 2.5M gates. The architecture is unique in that it is similar to the centralized shared-memory CPU-based first-generation architecture I spoke about in Lecture 2, except that it replaces the central CPU with a very high-performance ASIC. Also, as we’ll see in a minute, the centralized shared memory is actually implemented as a distributed, pooled memory spread over all of the line cards. So although one might expect a high-end core router to push all packet processing to the line cards, as the 3rd and 4th generation switch architectures do, the M Series is unique in concentrating all packet processing in a single high-performance centralized ASIC.