Traffic management: a growing nightmare for SOC designers

( 01 Feb 2008 )
By Ron Wilson, Executive Editor, EDN

The SOC (system on chip) began life in the image of the board-level computers that preceded it: as a central processor that a CPU bus connected to local memory and peripheral controllers. That CPU-centric, bus-oriented architecture has since been the underlying plan for many SOCs. But integration has brought complexity in the form of complex peripherals with their own DMA (direct-memory-access) controllers, coprocessors, and additional central processors, all on the same die. Accordingly, the interconnect architecture of SOCs is changing. The old CPU-centric bus is fast retreating to within the functional blocks of the chip; multiple buses, specialized point-to-point links, and on-chip networks are replacing it.

Change is rapid, and architects are nearly unanimous in worrying that the change has far outrun the tools necessary to support it. “Today, we still see a number of classic SOC designs, with an ARM core, peripherals, and a memory interface,” observes Hugh Durdan, vice president of marketing at ASIC supplier eSilicon. “Even when these designs grow to include multiple processing cores, they often stick with the classic AMBA AHB [Advanced Microcontroller Bus Architecture Advanced High Performance Bus] structure.”

But there are growing indications that the centralized-bus approach to SOC interconnect is simply running out of steam (see sidebar “Is the problem bus bandwidth or processor bandwidth?”). This problem appears to be partly architectural. As the number of processing nodes on a chip increases and as the data traffic that those nodes generate or consume grows and becomes more varied, the simple demand for raw bandwidth becomes a problem (Figure 1). Yes, it is possible with nine layers of metal and statistical-timing tools to give a multimaster bus almost-arbitrary bandwidth. But the costs in layout complexity, signal-integrity analysis, power consumption, and congestion—especially in this day of stringent design-for-manufacturing rules—make this approach nearly intractable.

In part, too, the problem involves tools. The classic tool for provisioning the classic SOC bus is, to put it bluntly, Microsoft Excel. In simpler times, architects could just add up aggregate-bandwidth requirements of the blocks on the bus, add in a bit of head room to account for times of peak congestion, and use the sum to determine the necessary bus bandwidth. Available bus bandwidth so exceeded the needs of individual blocks that a problem was almost mathematically impossible.
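
The spreadsheet method described above amounts to a sum and a margin. The following Python sketch shows that calculation; the block names, the demand figures, and the 25% head room are invented for illustration and do not come from any real design.

```python
# Hypothetical blocks on a shared bus: (name, average demand in MB/s).
# All names and numbers here are illustrative assumptions.
BLOCKS = [
    ("cpu", 200.0),
    ("video_decoder", 450.0),
    ("dma_peripherals", 50.0),
]


def required_bus_bandwidth(blocks, headroom=0.25):
    """Sum the average demands and pad the total by a fractional head room."""
    aggregate = sum(demand for _, demand in blocks)
    return aggregate * (1.0 + headroom)


# 700 MB/s of aggregate demand plus 25% head room.
print(f"{required_bus_bandwidth(BLOCKS):.1f} MB/s")  # prints "875.0 MB/s"
```

The calculation says nothing about when each block makes its requests, which is why it works only while the bus has bandwidth to spare.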

But those days are gone. “You really can’t tell anything from aggregate-bandwidth estimates any more,” warns Silistix’s vice president of marketing, David Lautzenheiser. Just as the centralized bus is rapidly giving way to more complex interconnect architectures, the spreadsheet is yielding to a more complex brew of system-level modeling, statistical tools, and cycle-accurate models that the skill and patience of the architects bind together.
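
One way to see why aggregate estimates mislead is to compare two traffic traces with the same average demand. The Python sketch below is a toy model only: the traces, the time-slot granularity, and the two-slot window are assumptions chosen to make the point, not measurements.

```python
def average_demand(trace):
    """Mean number of words requested per time slot."""
    return sum(trace) / len(trace)


def peak_window_demand(trace, window):
    """Largest mean demand over any run of `window` consecutive slots."""
    best = 0.0
    for start in range(len(trace) - window + 1):
        best = max(best, sum(trace[start:start + window]) / window)
    return best


# Words requested per time slot (hypothetical traces).
steady = [1, 1, 1, 1, 1, 1, 1, 1]  # regular, ADC-like samples
bursty = [0, 0, 0, 8, 0, 0, 0, 0]  # one cache-line-fill-like burst

# Both traces average 1.0 word per slot, so a spreadsheet treats them alike.
print(average_demand(steady), average_demand(bursty))

# Over a two-slot window, the bursty trace peaks at 4.0 words per slot.
print(peak_window_demand(steady, 2), peak_window_demand(bursty, 2))

# Sharing one link: the average is 2.0, but the two-slot peak is 5.0.
combined = [s + b for s, b in zip(steady, bursty)]
print(average_demand(combined), peak_window_demand(combined, 2))
```

A link sized for the combined average plus modest head room would still stall the steady source whenever the burst arrives.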

Assessing the problem

Aggregate bandwidth isn’t the right question, and a centralized bus isn’t always the right answer for two main reasons. First, traffic can differ enormously in its characteristics. Second, functional blocks can differ just as much in their data and timing needs. The problem of analyzing and provisioning on-chip interconnect is not a matter of providing enough for everyone to be happy but rather providing just the right kind of interconnect between just the right blocks. Often, you can achieve that goal with a bus. If you can’t, myriad other techniques present themselves. A multimedia SOC well illustrates the variety of data flows a designer must face. To begin with, a CPU will usually turn up somewhere. That CPU will produce at least two data flows with distinctive signatures: the continual fetching of new instructions and the sporadic two-way traffic of load and store operations.

Caches in the CPU block usually modify this traffic pattern. So, the traffic pattern from the CPU core is a random scattering of bursts as the caches empty or fill lines. This scenario differs greatly from the traffic signatures that emerge from other devices. A baseband signal in a radio SOC, for example, looks like a word or two of data at regular—sometimes very short—intervals from an ADC. Video entering from a camera or DVD player looks similar. But the intermediate data that the video-compression engine pushes to local memory can look like a series of macroblocks that the engine stores and loads in a nearly random sequence rather than a stream of pixels that the scan line organizes. Each type of data has a natural signature. And, as in the case of CPU cores, local memory and state machines can alter this signature.

Bandwidth and latency

Just as kinds of traffic have signatures, kinds of functional blocks have personalities. CPUs, hard-wired signal-processing pipelines, video encoders, serial ports, and DRAM interfaces all have different needs and wants. Despite all the focus on the bus bandwidth surrounding them, “processors are notoriously sensitive to latency, though their bandwidth requirements are modest when compared with some of the bandwidth hogs,” observes Gideon Intrater, vice president of solutions architecture at MIPS Technologies. A CPU’s cache controller may not often ask for data, but, when it does, the whole CPU may be sitting there waiting.

In contrast, some functional blocks are just interested in raw bandwidth. “These [products] include high-performance networking devices—PON [passive optical networking] is a great example—video engines, such as MPEG encoders in DVD recorders and H.264 decoders in HDTVs, and imaging engines, such as the rasterizers in printers and the JPEG encoders in digital cameras,” Intrater says. “Fortunately, in most systems, the bandwidth hogs are less sensitive to latency, and processors that are sensitive to latency are not much in the way of bandwidth hogs.”

Beyond this distinction, there are blocks with special requirements. Imaging or video processors that work with discrete cosine transforms typically process pixels in macroblocks—often an 8×8-pixel square of information—and so need to be able to easily load and store these blocks without explicitly gathering or scattering the pixels across a scan-line-oriented memory.

But the grand prize for fussiness has to go to that ubiquitous memory element, the DRAM. With a complex synchronous interface and a segmented memory array that imposes a significant latency for issuing requests in the wrong sequence, DRAM offers an effective bandwidth that can vary enormously, depending on how you access it. In many cases, only one organization of read and write requests can provide anything close to theoretical throughput. Unfortunately, although these strict requirements are reasonably well-matched to the traffic pattern from a CPU’s L2 cache controller, they don’t even resemble the aggregate-traffic pattern when several dissimilar processors are all sharing a DRAM port.
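
A deliberately simplified model shows how strongly request order can affect DRAM throughput. The Python sketch below tracks only which row is open in each bank; the hit and miss costs are assumed round numbers, and the model ignores bank parallelism, refresh, and read/write turnaround, all of which matter in a real device.

```python
ROW_HIT_CYCLES = 4    # assumed cost when the requested row is already open
ROW_MISS_CYCLES = 20  # assumed cost to close one row and open another


def service_cycles(requests):
    """Total cycles to serve (bank, row) requests in order, open-row model."""
    open_rows = {}  # bank -> row currently open in that bank
    total = 0
    for bank, row in requests:
        if open_rows.get(bank) == row:
            total += ROW_HIT_CYCLES
        else:
            total += ROW_MISS_CYCLES
            open_rows[bank] = row
    return total


# Two hypothetical clients, each making four accesses to its own row
# of the same bank.
stream_a = [(0, 10)] * 4
stream_b = [(0, 99)] * 4

grouped = stream_a + stream_b
interleaved = [req for pair in zip(stream_a, stream_b) for req in pair]

print(service_cycles(grouped))      # 2 misses + 6 hits = 64 cycles
print(service_cycles(interleaved))  # 8 misses = 160 cycles
```

The same eight requests cost two and a half times as much when two well-behaved streams simply alternate, which is the situation several dissimilar processors create at a shared DRAM port.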

In simple language, then, the goal of interconnecting architecture is: Adapt the traffic patterns to the needs of the client blocks so that all data arrives in adequate time with minimum expended energy. Within the constraints of the architects’ knowledge of the traffic and clients and the power of their tools, this description fairly depicts what happens today. But those are serious constraints. The first problem is that, to understand traffic patterns, the architect must understand usage models for the completed system, Lautzenheiser points out. “Use profiles are important here, and they involve the intangibles of user perception. In a handset displaying video, the criterion for success may not be how many megabytes per second a bus delivers to the video decoder, but rather whether the user thinks his display seems to twinkle. Predicting such qualitative user perceptions requires searching out those unanticipated use models that put the most stress on the system interconnect. And [completing this task] before you can put a physical prototype in the hands of a real user requires system-level modeling.”

In fact, this area may be the one place in system design in which the elusive ESL (electronic-system-level) tools have established a firm role. “It is best if the system architects can begin modeling data flows even before the IP [intellectual-property] blocks themselves are fully defined,” says Charlie Janac, president and chief executive officer of interconnect-IP company Arteris. “The data flow in a real sense defines the chip.”

With these ideas in mind, Lautzenheiser and other experts suggest an architectural design flow that begins at the conceptual level: What blocks should connect to what? From this simple connection diagram, the flow then attempts—using ESL modeling of the blocks and the best possible information on use models—to quantify the data flows as to volume, signature, and client requirements. Finally, the architects attempt to prove that the on-chip interconnect meets these criteria. But this process is iterative. “There are always surprises,” Lautzenheiser says. “The trick is to have an easy path between architectural design and implementation so that you can iterate quickly, and an underlying scheme that is extensible enough to accommodate changes without major architectural revisions.”

Linking and shaping

There are two fundamental tools for building the on-chip network: links and shaping resources. Links provide a data pathway between blocks. You can implement them as a portion of a shared bus, as a dedicated connection, as a pathway in a switched network, or in more creative ways. Shaping resources include caches, FIFOs, queuing engines, and reorder buffers. Perhaps the easiest way to move from a classic CPU-bus architecture to a more powerful approach is to segment the buses. “In media SOCs, you can [achieve this goal] by dividing the traffic into low-latency transactions, high-bandwidth transactions, and less critical peripheral transactions,” suggests Janac. Then you can provision each type of transaction separately, even if doing so means multiple bus connections on some blocks. Brian Gardner, vice president of IP products at Denali Software, points out that this approach not only places each type of traffic on a physical medium appropriate to it, but also connects data flows that you can model in similar ways, substantially increasing modeling accuracy for the resulting link.

The result of this process is often a hierarchy of interconnect links, Lautzenheiser says. This result is rather clearly visible in the Raza Microelectronics network processor (Figure 1). A similar pattern emerges when you look inside some functional blocks—for instance, at the interconnect within a modern CPU, such as the ARM Cortex A9. Here, private links connect the processing core to its L1 instruction and data caches. A multimaster, high-speed-cache bus, in turn, connects the L1 caches to an L2 cache serving as many as four processor cores. Additional links for control and snooping logic lurk in the background. Two system-bus connections come from the L2 controller.

But what happens when it’s impractical to provide an appropriate physical link for each data flow? An excellent example is the SOC’s DRAM controller, which must take in whatever mix of traffic the multiple processors on the chip throw at it but must somehow mold that cacophony into the highly ordered stream of commands acceptable to a DDR DRAM. To get a look at the complexity of the problem, look inside one such controller (Figure 2). This sample device processes transactions in a number of discrete stages. First, an arbitration engine determines access to the controller based on a priority scheme. Then, the controller loads requests into command, read, and write queues, in which they await reordering based on quality-of-service needs and the realities of DRAM timing. Finally, the requests pass through a transaction-processing engine that looks ahead into the queues to reorganize and gather requests for optimum DRAM usage.
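
The staged flow described above can be sketched in a few lines of Python. This is not the design of the controller in Figure 2: the `Request` fields, the fixed-priority arbitration, and the row-hit-first lookahead policy are all assumptions made for illustration, and the sketch omits separate read and write queues and any protection against starvation.

```python
from collections import namedtuple

# Hypothetical request record; a lower priority number is more urgent.
Request = namedtuple("Request", "master priority bank row")


def arbitrate(pending):
    """Admit requests most urgent first; ties keep their arrival order."""
    return sorted(pending, key=lambda r: r.priority)


def schedule(queue, lookahead=4):
    """Issue requests, preferring one that hits an open row within the window.

    If no request in the lookahead window hits an open row, the oldest
    request in the queue is issued.
    """
    queue = list(queue)
    open_rows = {}  # bank -> row currently open
    issued = []
    while queue:
        window = queue[:lookahead]
        pick = next(
            (r for r in window if open_rows.get(r.bank) == r.row),
            window[0],
        )
        queue.remove(pick)
        open_rows[pick.bank] = pick.row
        issued.append(pick)
    return issued


# Two equal-priority masters alternating between rows 1 and 5 of bank 0.
requests = [
    Request("a", 1, 0, 1),
    Request("b", 1, 0, 5),
    Request("a", 1, 0, 1),
    Request("b", 1, 0, 5),
]
order = schedule(arbitrate(requests))
print([r.row for r in order])  # [1, 1, 5, 5]
```

With the lookahead window, the alternating requests leave the controller grouped by row, so the DRAM sees two row openings instead of four.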

Versions of the same thinking are becoming more and more common in other functional blocks. Often, signal-processing or encoder/decoder blocks contain sophisticated DMA engines that provide many of the same functions as a DRAM controller—scatter/gather processing, reordering of data, and buffering to turn random requests into bursts. The engines perform these functions not so much to please the DRAM controller—in fact, they may puzzle the DRAM controller—as to adapt the traffic to the physical link over which it must travel.
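
Turning scattered requests into bursts can be illustrated with a small coalescing routine. The Python sketch below is a hypothetical example: the function name, the 4-byte word, and the 8-word burst limit are assumptions, and sorting the addresses is legitimate only when the buffered accesses have no ordering dependencies among them.

```python
def coalesce(addresses, word_bytes=4, max_burst_words=8):
    """Group word addresses into (start_address, word_count) bursts.

    Duplicate addresses are merged, and addresses are served in ascending
    order, so this applies only to accesses that may be freely reordered.
    """
    bursts = []
    for addr in sorted(set(addresses)):
        if bursts:
            start, count = bursts[-1]
            contiguous = addr == start + count * word_bytes
            if contiguous and count < max_burst_words:
                bursts[-1] = (start, count + 1)
                continue
        bursts.append((addr, 1))
    return bursts


# Five word accesses arriving out of order.
scattered = [0x108, 0x100, 0x104, 0x200, 0x10C]
print(coalesce(scattered))  # [(256, 4), (512, 1)]
```

Five separate transactions become one four-word burst and one single-word access, which suits a link or memory that is efficient only when it moves data in runs.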

Results and futures

The payoff from all this work can be substantial. SOC tuning can improve system performance by a factor of four or five for basically the same functional blocks, according to MIPS President and Chief Executive Officer John Bourgoin. Put into perspective, that magnitude of acceleration is more than you can get from a CPU upgrade or a faster process and even more than adding a hardware accelerator sometimes delivers.

As SOC architectures move from functionally divided heterogeneous multiprocessing toward semisymmetric multiprocessing, the need for focus on interconnect architecture shifts from a performance-enhancement and energy-saving alternative to a mandatory part of system design. Without a detailed analysis of data flows, a dynamic multiprocessing system may simply require more interconnect provisioning than an affordable silicon design can offer.

And that possibility brings us back to the tools question. Today, Lautzenheiser says, the tool flow for architects who are not still working with white boards and spreadsheets is an ad-hoc combination of three things: ESL-modeling tools; statistical analyses based on queuing theory, which can give a good overall picture but may completely miss corner cases; and cycle-accurate models, which can explore corners but may be too slow to find them.
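
A small calculation suggests how a queuing-theory estimate can look healthy while a corner case does not. The Python sketch below uses the textbook M/M/1 result for mean time in system, W = 1/(mu - lambda). The choice of M/M/1 (random arrivals, exponentially distributed service times, one server) is an assumption made for simplicity, as are the rates and the 16-request burst; the article does not say which models architects actually use.

```python
def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def burst_tail_latency(burst_size, service_rate):
    """Completion time of the last request in a back-to-back burst.

    Assumes the burst arrives all at once at an empty queue and that
    requests are served one at a time at a fixed rate.
    """
    return burst_size / service_rate


# Hypothetical link: serves 1 request per cycle, sees 0.5 requests per cycle.
print(mm1_mean_latency(0.5, 1.0))    # 2.0 cycles on average
print(burst_tail_latency(16, 1.0))   # 16.0 cycles for the end of a burst
```

The average says the link is half idle with a two-cycle latency, yet the last request of a single 16-request burst waits eight times longer, which is the kind of corner a statistical summary can hide.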


“You need architecture-level modeling,” Arteris’ Janac agrees. “The problems have become too complex for a spreadsheet. But at ESL, there’s no reality to even the best tools—they represent an abstraction. So, you have to bind your system-level explorations to cycle-accurate models of your IP.”

One way to do this binding, MIPS’ Intrater suggests, is the use of hardware emulation. “The cycle-accurate models are based on RTL, and they are far too slow for booting Linux or running an application. The state of the art today is to manually combine coarse-grained statistical tools with cycle-accurate simulations. But if you have RTL models of the interconnect at an early enough stage, you can use emulation to run billions of cycles through cycle-accurate models.”

This state of the tools preserves architecture as art. Modeling a scheme at the right levels, having accurate use models from realistic users, looking in the right places for critical cases, and making accurate conclusions from the cycle-accurate models are still—despite the gradual emergence of system-level tools, especially from interconnect-IP providers—matters of art and experience. But it is an art that has grown increasingly vital, and, as we move into the age of large multiprocessing systems on dice, it will be indispensable.


For more information
· ARM: www.arm.com
· Arteris: www.arteris.com
· Denali Software: www.denali.com
· eSilicon: www.esilicon.com
· Microsoft: www.microsoft.com
· MIPS Technologies: www.mips.com
· Raza Microelectronics: www.razamicro.com
· Silistix: www.silistix.com
· Tensilica: www.tensilica.com


Author Information
You can reach Executive Editor Ron Wilson at 1-408-345-4427 and ronald.wilson@reedbusiness.com.


AT A GLANCE
SOCs (systems on chips) are outgrowing centralized-bus-based architectures.
Accurate use models are vital to understanding traffic patterns.
A combination of ESL (electronic-system-level) and cycle-accurate approaches is necessary to understand interconnect.
As SOCs evolve, interconnect modeling will become mandatory.



Is the problem bus bandwidth or processor bandwidth?

As communications and media-stream data rates increase, most chip designers quickly recognize that data bandwidth is a key factor in design success. Tensilica’s experience with more than 100 multiprocessor-chip designs suggests some basic ideas to untangle the bandwidth confusion.

Inadequate data communications shows up in two ways. The first is a shortage of raw bandwidth: The total sustained-bandwidth demands for data communications for I/O, off-chip memory, and bandwidth between on-chip blocks exceed the maximum sustained bandwidth of the interfaces. The second is excessive latency: The worst-case or typical latency for access to data causes data starvation under some important circumstances, such as when inadequate overall bandwidth adds contention latency.

The bottlenecks in communications appear in multiple places in the design. The three most common locations are at the off-chip memory controller; across the on-chip bus, especially when there are multiple masters, particularly processors and DMA controllers, accessing multiple slaves (memories and I/O interfaces); and at the interface between the processor and the bus.

The processor-bus bottleneck relates to the peak data-transfer rate across the interface, which is often close to the peak on-chip bus bandwidth. The bottleneck also relates to the rate at which you can move data from remote memories across the on-chip bus or from off-chip; into local memory; into and out of processor registers; through local memories; and out to remote memories. With a traditional RISC processor accessing data across an on-chip bus, this data movement might take 10 to 20 processor cycles per 32-bit data word, even if the data is already on-chip and the processor does not modify the data along the way.
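
The cost of 10 to 20 cycles per word is easier to judge as a data rate. The Python sketch below converts it, assuming a hypothetical 400-MHz processor clock; the clock frequency is an illustrative figure, not one given in the text.

```python
def copy_bandwidth_mb_per_s(clock_mhz, cycles_per_word, word_bytes=4):
    """Sustained data rate in MB/s (1 MB = 10**6 bytes).

    Assumes every word of `word_bytes` bytes costs a fixed number of
    processor cycles to move.
    """
    return clock_mhz * word_bytes / cycles_per_word


# Assumed 400-MHz clock, 32-bit (4-byte) words.
print(copy_bandwidth_mb_per_s(400, 10))  # 160.0 MB/s at 10 cycles per word
print(copy_bandwidth_mb_per_s(400, 20))  # 80.0 MB/s at 20 cycles per word
```

Under these assumptions, a processor that could in principle touch 1600 MB/s at one word per cycle sustains only 80 to 160 MB/s when it moves the data itself.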

You can break the processor-bus bottleneck in two ways. First, using a second bus master, such as another processor or a DMA (direct-memory-access) controller, to move the data into and out of the first processor's local memory can partially relieve the bottleneck. However, RISC's 32-bit registers and load-store register remain problems. Second, you can extend the processor interface to permit high-bandwidth data streams to bypass the bus-interface bottleneck. Processors with direct-connection ports and queues can move an order of magnitude more data through processor execution units than RISC processors.

Bus and nonbus interconnect can work in harmony in complex chip designs. Best practices suggest that, when a designer knows that a pair of subsystems requires high interconnect bandwidth and well-bounded latency, he should establish an optimized path between them. Other communications, including communications of the two sensitive subsystems, can safely use a common multimaster/multislave on-chip bus for data movement that is less sensitive to bandwidth and latency. This separation retains the generality of the common bus but reduces the risk of contention surprises, eases modeling, and enhances performance scalability of the platform.

Optimizing communications for the high-bandwidth links also helps energy efficiency. Direct connections, such as processor-to-processor-communications queues, often use 10 times less energy per data transfer than the equivalent bus-based transfer.

Author's Biography
Chris Rowen is president and chief executive officer of Tensilica Inc.


Captions

Figure 1 Raza Microelectronics’ XLR processor architecture suggests how far SOCs have departed from the classic CPU-centric, bus-based interconnect scheme.

Figure 2 Today’s high-end DRAM controllers have become list processors with queuing, reordering, and quality-of-service functions.
