( 01 Jul 2003 )
By Tom Riordan, Vice President and General Manager, MIPS Processor Division, PMC-Sierra
To set the stage for a discussion of the 'networking microprocessor' evolution, let's first take a quick look at the characteristics that distinguish networking applications from the better-known computing, or PC-centric, applications. Before delving into this, however, one point of clarification needs to be made: network processors (NPUs) are devices designed exclusively for networking, whereas a networking microprocessor can be any microprocessor used in a networking application. Both types of devices will be touched upon in this article.
Networking applications are distinguished by the following characteristics:
- First, the speed required for one of the primary networking tasks, packet forwarding, is in the extreme case beyond the capability of general-purpose programmable devices.
- Second, and in direct opposition to the first, the features and services desired for more discriminating packet forwarding (packet 'processing') often require execution of a large body of software that can only be reasonably developed and maintained for a processor with high-level language programmability and operating system-capable protection mechanisms.
- Third, in general purpose computing the average amount of time that a task takes is usually the most important metric, whereas in networking, as in many embedded applications, it is the worst-case time required to complete the task that drives the implementation.
- Finally, packet processing is inherently a data-driven process. In many general purpose computing tasks, the data is relatively static and the program manipulates it multiple times and in various ways (this is why caches are so effective in general purpose computing), whereas in packet processing the data is always in flux and caches must be tuned in order to be effective. More about this later.
So, where have we been and where might we go from here? In the beginning, networking microprocessors and general purpose microprocessors were one and the same. That is, the microprocessor that you would find in a typical router was no different than the one you would find in a large variety of other high-end embedded applications. Often, this was a variant of the venerable Motorola 68000 architecture. In the early 1990s, as router performance requirements rose, a switch to the more powerful RISC architectures occurred with early movers choosing the 64-bit MIPS architecture. In that time frame it was still the case that there was no difference between the MIPS microprocessor used in a high-performance router and the one used in a high-performance workstation.
For example, the first RISC processor used in a router was the R4600 designed by Quantum Effect Devices (QED). Silicon Graphics (SGI) used this same processor to power their first generation 'Indy' graphics workstation. The next generation routers used QED's R5000, which was also used in the second generation SGI Indy. The R4600 was a 133-MHz device in 0.8 micron CMOS, and the R5000 was a 200-MHz device built in 0.64 micron CMOS. A block diagram illustrating the main features of this class of processor appears in Figure 1. Key attributes of these processors for networking were quite simply high clock rate and the ability to load, store and process data in 64-bit chunks. A feature that became increasingly important in networking over time (it was, of course, mandatory in workstations from the start) was the full-featured, UNIX-capable, memory management unit and translation look-aside buffer.
 Fig 1: Early networking RISC processors
With the continued application of new process technologies, these processors that once sold for hundreds of dollars and powered equipment costing thousands of dollars are now running at nearly 1/2GHz, selling for tens of dollars, and powering equipment like laser printers and personal video recorders (PVRs). This is one of the most effective applications of Moore's Law that I have ever witnessed.
As with all types of electronic equipment, there is always a need for better performance. This has been particularly acute in networking equipment, where the heavy Fermi-Dirac particles (electrons) that govern the speed of electronic devices have been struggling to keep up with the massless Bose-Einstein particles (photons) that were permitting dramatic speedups in fibre optic transmission rates.
To try and keep up, RISC processors used in networking made their next performance move by striving for greater efficiency. Using a dual-issue superscalar processor with a large integrated level-2 cache, QED's RM7000 processor delivered high MIPS per megahertz while maintaining a low power 5-stage pipeline. The RM7000 was introduced at 300MHz in 0.25 micron CMOS and has been scaled up to 600MHz in the latest 0.13 micron incarnation. A block diagram of the RM7000 appears in Figure 2.
 Fig 2: Second generation networking processors
To tune the cache subsystem for packet processing, a packet bypass capability was introduced so that the more transient packet data could go directly into the L1 data cache without polluting the L2 cache. The L2 cache could then be more effective for holding common tables, statistics and, of course, instructions.
Despite the best efforts of QED and the other embedded RISC vendors, the scenario faced at the end of the decade is summarized in Figure 3. The drastic decline in the number of instructions available per 64-byte packet was coming at the same time that the desire to generate more revenue per packet (by adding more 'services', i.e., priority fees) was requiring more instructions to perform more complex packet classification.
 Fig 3: Packet processing gap
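The shrinking instruction budget behind this gap can be estimated with simple arithmetic. The sketch below is illustrative only: the function name and the line rates are assumptions, the core is taken to be a 600MHz part sustaining one instruction per clock, and Ethernet preamble and inter-frame gap are ignored.

```python
def instructions_per_packet(clock_hz, ipc, line_rate_bps, packet_bytes=64):
    """Instructions one processor can spend per packet at full line rate.

    Assumes back-to-back minimum-size packets and ignores framing overhead.
    """
    packets_per_second = line_rate_bps / (packet_bytes * 8)
    return clock_hz * ipc / packets_per_second


# Hypothetical line rates; 600 MHz and an IPC of 1.0 are assumed values.
for name, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    budget = instructions_per_packet(600e6, 1.0, rate)
    print(f"{name}: about {budget:.0f} instructions per 64-byte packet")
```

Under these assumptions, every tenfold increase in line rate cuts the per-packet budget by a factor of ten, while richer classification pushes the required instruction count the other way.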
The growing disparity illustrated above was not lost on the VC and entrepreneurial community and a vast array of start-ups were launched or relaunched to address this problem. These start-ups were grouped into several categories with at least a dozen companies attacking each of the packet processing sub-tasks. With no attempt whatsoever at completeness, Table 1 shows some of the groupings and players.
 Table 1: Packet processing players.
While all of these companies and devices are engaged in packet processing, only those in the network processor (NPU) column could actually be called processors in the sense that they execute a program of some sort. Products from companies in the other columns served either as assists to the NPUs or provided a complete hardwired solution for specific applications. Most of the start-ups in the NPU category have either been purchased or simply gone out of business. The better classification and encryption companies seem to have carved out a successful niche for themselves, but the traffic management solutions seem to have gained little, if any, traction.
Essentially all of the NPU vendors attacked the 'processing gap' by applying the oft-tried but seldom-successful technique of large-scale parallelism. While the hardware problems inherent in this solution are themselves substantial, they pale in comparison to the software problems. At the risk of over-generalization, it is probably fair to say that if the software is written at a high enough level to be portable and maintainable then most of the performance gain is lost, and if the software is written at a low enough level to achieve the desired performance then it is neither portable nor maintainable.
Cisco's chief development officer, Mario Mazzola, sums up his view in an EE Times interview: 'Many of the current generation chips are microcoded, and the number of instructions available per packet and the memory space for the microcode is limited. This creates a difficult and laborious development environment. Even worse, when you want to add a feature, there is a lot more difficult work and regression testing.' Reflecting on Mazzola's words, it is worth noting that the Cisco Toaster and the Intel IXP both fall into the category of network processors to which he is referring.
At PMC-Sierra, which acquired QED in August 2000, we approached the processing gap by extending in a balanced, incremental fashion four key performance-critical axes while maintaining strict software compatibility with previous processor generations.
PMC-Sierra's new device, the RM9000x2 Integrated Multiprocessor:
1. Increases the critical pipeline depth by 50% to achieve a concomitant 50% frequency boost (reaching 1GHz in 0.13 micron) and adds branch prediction to maintain a high IPC;
2. Provides fully integrated, cache-coherent, 2-way multiprocessing, for a potential 2x performance boost;
3. Integrates the performance-critical memory controller to minimize latency on table lookups, etc.; and
4. Integrates a 'next generation' high-performance packetized I/O bus, HyperTransport, to replace the bandwidth-constrained PCI bus. HyperTransport has dual, unidirectional point-to-point differential links and is compatible with existing PCI software.
Relative to a non-integrated uniprocessor with a 5-stage pipeline, one might expect an overall 3x performance boost from these enhancements: 1.5x from the frequency increase multiplied by 2x from two-way multiprocessing. This estimate assumes that no performance gain comes from the integrated DDR memory controller or the HyperTransport link; i.e., the integrated DDR and packetized I/O are required just to achieve the performance potential of the frequency boost and two-way multiprocessing.
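As a rough check on that estimate, the arithmetic can be written out. The sketch below is not from PMC-Sierra: the function and the `parallel_fraction` parameter are assumptions, and Amdahl's law is used here only to show how the best-case figure shrinks when part of the workload cannot be split across the two cores.

```python
def estimated_speedup(freq_gain, cores, parallel_fraction=1.0):
    """Frequency gain times an Amdahl's-law multiprocessing gain.

    parallel_fraction is the (assumed) share of the work that can be
    divided evenly across the cores; 1.0 is the best case.
    """
    serial = 1.0 - parallel_fraction
    mp_gain = 1.0 / (serial + parallel_fraction / cores)
    return freq_gain * mp_gain


print(estimated_speedup(1.5, 2))       # best case: 1.5 * 2
print(estimated_speedup(1.5, 2, 0.8))  # hypothetical workload, 80% parallel
```

With a perfectly divisible workload the result is the 3x quoted above; with a hypothetical 80% parallel workload it falls to roughly 2.5x.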
As illustrated in Figure 4, the HyperTransport link serves as an aggregation port for other interfaces like Ethernet, PCI, PCI-X, SPI4.2, etc. The RM9000x2 also incorporates EJTAG debugging capability for both cores including integrated per core instruction trace buffers to help programmers track down that elusive last bug.
 Fig 4: PMC-Sierra's RM9000x2 packet processing subsystem
Refinements to data caching
Several refinements were made to the data caching system to maximize packet-processing performance, including the ability to deposit packet headers or entire packets directly into either processor's L2 cache. This feature permits the processor to begin processing packet data immediately, avoiding the latency of both the initial write to the DDR memory by I/O and the subsequent read by the processor.
Additionally, the RM9000x2 implements the 5-state MOESI cache coherency protocol, allowing the processors to share both unmodified and modified cached data without the round-trip write-memory, read-memory latency required by the simple 4-state MESI protocol. If the software model adopted requires that the processors share packet data in a bucket-brigade or pipelined fashion, the MOESI protocol is critical for achieving the expected performance improvement.
While PMC-Sierra was busily developing the high-performance technologies manifested in the RM9000x2, the telecommunications market itself was imploding. As a result, the performance gap has changed quite a bit. In particular, the widespread adoption of 10Gbit transmission has been pushed out and 40Gbit is nowhere in sight. With these changes, the graph in Figure 5 looks much less depressing today from a processor vendor's perspective.
 Fig 5: Closing the packet processing gap.
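As an aside on the coherency protocols described earlier, the difference between MESI and MOESI when one processor reads a line that the other has modified can be sketched as a tiny model. This is a deliberately simplified illustration, not a description of the RM9000x2 implementation: the function name and the memory-access counts are assumptions based on the textbook behavior of the two protocols.

```python
def remote_read_of_modified_line(protocol):
    """Model processor B reading a line that processor A holds as Modified.

    Returns (state_in_A, state_in_B, memory_accesses) afterwards.
    """
    if protocol == "MESI":
        # A writes the dirty line back to memory, then B reads it from memory.
        return "S", "S", 2
    if protocol == "MOESI":
        # A keeps the dirty line as Owned and supplies it cache-to-cache.
        return "O", "S", 0
    raise ValueError(f"unknown protocol: {protocol}")


for name in ("MESI", "MOESI"):
    print(name, remote_read_of_modified_line(name))
```

In a bucket-brigade software model, where one processor regularly picks up packet data that the other has just modified, the two avoided memory accesses are the round trip that MOESI removes.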
Given the data-thirsty nature of packet processing, one might expect future versions of the RM9000 family to target both greater memory and I/O throughput. Let's examine how each of these might be improved in turn.
First, memory system throughput must be looked at from both latency and bandwidth perspectives. For networking, raw bandwidth is important due to the streaming of packet payload in and out of memory prior to and after processing respectively. Latency is equally important due to all the dependent accesses for routing/forwarding and the statistics collection pick-and-poke accesses for network management.
While moving to DDR-II is a natural evolution for increasing bandwidth, that alone may not be adequate for the highest performance applications and it will do little, if anything, for latency. It is also possible to increase bandwidth by doubling or even tripling the number of memory interfaces, but the silicon and packaging cost of this alternative will be prohibitive in the majority of cases.
There are several competing memory vendor solutions to the latency problem, the best known being FCRAM and RLDRAM, referring to fast cycle and reduced latency DRAM respectively. Both of these technologies suffer from being non-mainstream and consequently have limited adoption due to availability and pricing concerns. It remains to be seen whether either of these technologies becomes pervasive enough for adoption by networking equipment vendors, who have to date been very conservative regarding DRAM selection, staying very close to the PC mainstream. The signs are not positive: historically, memory has been treated as a commodity where bit count was the only relevant metric, and system performance issues were addressed by the processor vendors via multilevel cache hierarchies.
Alternate solutions
Since deep cache hierarchies are not adequate for packet processing, alternate solutions must be found. One reasonable possibility is to exploit the parallelism inherent in packet flow; since there are many packets in flight under full load, multiple packets can be processed simultaneously. The RM9000x2 is PMC-Sierra's first incarnation of this approach.
In a single-threaded packet processing model, the memory system usage is bursty; that is, looked at in its entirety, processing a packet has periods with no memory references when computation and other book-keeping is occurring. By using the dual-processor RM9000x2 to tackle two packets simultaneously, the 'holes' left in the memory stream by one processor can be used by the other, permitting a potential doubling of the packet processing rate if the memory interface bandwidth was originally 50% unused. By providing a fully coherent multiprocessing environment, the RM9000x2 can achieve this performance gain while maintaining a software-friendly development environment.
It is tempting to say that if two is twice as good, then four must be twice as good again, and so on. This geometric form of 'more is always better' is a fallacy. The reality of tightly coupled large-scale parallelism never quite matches the dream.
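The 'holes' argument, and the reason it does not extend indefinitely, can be illustrated with an idealized model. The sketch below is an assumption-laden simplification: the function name is hypothetical, and it ignores contention, coherency traffic and software overhead, all of which make real scaling worse than shown.

```python
def packet_rate_gain(processors, memory_busy_fraction):
    """Ideal throughput gain over one processor sharing one memory interface.

    memory_busy_fraction is the share of per-packet time during which a
    single processor keeps the memory interface busy.
    """
    if not 0.0 < memory_busy_fraction <= 1.0:
        raise ValueError("memory_busy_fraction must be in (0, 1]")
    # Gain grows with processor count until the memory interface saturates.
    return min(float(processors), 1.0 / memory_busy_fraction)


# With the interface 50% unused by one processor, two processors fill it.
for n in (1, 2, 4, 8):
    print(n, "processors ->", packet_rate_gain(n, 0.5), "x")
```

Even in this best case, the gain stops at 2x once the memory interface is fully occupied, so adding a third or fourth processor buys nothing unless memory bandwidth grows as well.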
The precise future of processors in networking is unknown. Nevertheless, it never hurts to extrapolate based on experience, data points, intuition, etc., and speculation is fun. One question is whether the NPU as a distinct genre will survive; that is, will there be NPUs just as there are DSPs? Although at first blush the comparison to DSPs might give one hope for their long-term survival, a more detailed look leaves one far less certain.
To a large degree, DSPs were successful as a distinct processor class because the processing was very algorithmic and straight-line rather than the kind of branchy, control flow-oriented processing that is found in both network and general purpose processing. This straight-line characteristic allowed DSPs to achieve significant performance advantages via deep pipelines, avoiding the software problems of large-scale parallelism. Moreover, while there is a relatively finite set of DSP algorithms that can be provided as library routines to ease software development and maintenance, the number of possible permutations of networking protocols and services is essentially infinite.
As telecom spending continues to come into alignment with telecom revenues, there will be an increasing focus on the 'three Cs' of business success: cost, cost, and cost. For network processing, this translates to the cost of software development, the cost of software maintenance, and the cost of silicon. While highly parallel NPUs may survive in niche applications, software and silicon costs make it more likely that the mainstream of the networking market will be served by processors with simple, straightforward, backward-compatible programming models, standard networking-specific interfaces, and possibly some integrated packet processing assist hardware.