01 May 2010
By Robert Cravotta, Technical Editor, EDN
Moore’s Law observes that the number of transistors doubles for the same area every two years. The relentless fulfillment of this observation has been the rallying point for those who predict that 32-bit processors will replace 8-bit processors. The argument starts with the fact that the relative size difference between an 8- and a 32-bit-processor core approaches zero compared with the other resources on the chip as the transistor geometry continues to shrink (Figure 1). As the difference in the silicon area of 8- and 32-bit cores shrinks to nothing, 8-bit processors lose the price advantage that they once enjoyed.
In 2004, 32-bit processors hit a pricing milestone when Philips, now NXP, and Atmel offered ARM7 processors with 8-bit features, such as atomic bit manipulation and brownout-detection circuits, for as little as $3. However, providing a low-cost processor does not change the evaluation process; other considerations also matter in a designer’s choice of processor. Although this price point brought 32-bit processors into consideration for a new set of applications, it did not spell the end of the market for 8-bit processors (Reference 1).
In 2006, Luminary Micro, now Texas Instruments, opened its doors for business with a 32-bit ARM Cortex-M3 microcontroller that sold for less than $1. At this price, 16-bit processors would surely feel some pressure. Once again, price is only one advantage that the smaller processors have. Like 8-bit processors, 16-bit processors have a class of applications to which they deliver just enough performance at the best price and power-consumption level, making it difficult for general-purpose 32-bit architectures to compete (Reference 2).
In late 2009, NXP rolled out an ARM Cortex-M0 processor that sells for 65 cents. This price places this device squarely in pricing competition with 8-bit processors. The lowest public pricing information puts 8-bit processors at 45 cents to $10 per device (Reference 3). As people predicted, the difference in pricing between 32- and 8-bit processors is trending toward zero.
A few other things make this new pricing milestone with the Cortex-M0 a little more interesting, however, and worthy of a deeper look. The Cortex-M0 has replaced the Cortex-M3 as ARM’s smallest, lowest-power, and most energy-efficient 32-bit-processor core to date, and the M3 is the clear migration target from the M0. Designers can implement the M0 core in as few as 12,000 gates. To reach that size, the M0 implements a substantially smaller subset of the 16-bit Thumb2 instruction-set architecture that the M3 fully supports (Figure 2). ARM based the subset on the statistical frequency of the most commonly used Thumb2 instructions. The cost of the constrained instruction set is that the system must sometimes use multiple instructions to perform what a single instruction in the full Thumb2 instruction set could do.
NXP claims that the code density of its M0 processor is better than the code density of the 8- and 16-bit processors on the market. Code density can be a loose proxy for processing performance; smaller code for the same function might correlate with fewer memory fetches and faster execution for the same task. There might also be a loose correlation to a lower energy budget for systems that switch between sleep and active modes. The faster system might consume more power, but it may also require less energy to perform the same task as the slower system because the faster processor can go back to sleep sooner. So there are some technical issues that bear investigation with regard to M0 and smaller processors.
Inflection point?
Another reason to explore how low 32-bit processors can go is that ARM claims that the Cortex-M0 has the fastest adoption rate of any of the company’s processor cores. ARM also claims that half of its M0 licensees are new to ARM, with a strong implication that those vendors were traditionally serving the 8- and 16-bit-application areas. The public list names only NXP, Triad Semiconductor, and Melfas out of at least 15 licensees, so it is hard to draw any conclusions. However, considering ARM’s statements, the Cortex-M0 may have crossed a key threshold, and its adoption by so many new licensees may signal an inflection point in the market serving 8-, 16-, and low-end 32-bit applications.
In addition to processor vendors’ rolling out smaller and lower-priced 32-bit processors, some traditional 8- and 16-bit-processor vendors have rolled out their own 32-bit products. Microchip in 2007 added the 32-bit, MIPS-based PIC32 processor to its line of more than 650 PIC processors. The PIC32 uses the same development tool set as the 8- and 16-bit devices, and the Explorer 16 platform hosts the processor because the platform maintains the software, peripheral, and pin compatibility that the 16-bit processors supported on that same platform.
In 2009, Cypress Semiconductor rolled out the 32-bit Cortex-M3-based PSoC5 (programmable system on chip) alongside the single-cycle, 8051-based PSoC3. The 32-bit PSoC5 roll-out is not a big surprise. The 8051-based PSoC3 is a surprise, however, because the company had for years offered a proprietary 8-bit PSoC1 product. The PSoC Creator software tool set supports development for both new processor families, and PSoC Designer supports the PSoC1. PSoC Creator also makes it easier for developers to migrate from or between 8- and 32-bit designs.
In 2007, Freescale took the 8- and 32-bit common tools a step further with the Flexis line of processors. These processors share pin, tool, and common peripheral IP (intellectual property). In each of these cases, the companies provide not just a silicon migration path between their 8- and 32-bit-processor options but also a common tool set and common peripheral API (application-programming interface) to reduce the pain of an architectural migration.
The 8- and 32-bit-processor markets continue to approach pricing parity, but part of the basis of that pricing parity is the fact that 8-bit processors rely on older, fully depreciated, process geometries and the fact that 32-bit processors rely on advanced process geometries to approach matching that pricing. The assumption in the market seems to be that 8-bit processors will not continue to move down the process curve. Until recently, however, little price competition existed to drive the need to make that move. So pricing parity alone is probably not sufficient to replace the 8-bit-processor market.
Benchmarks
At this point, NXP’s code-density claim for the M0 becomes more important. However, measuring code density and processing performance is tricky at best, especially when the processing architectures differ significantly and aim at different problems. In the case of NXP’s claim, the company was comparing code density and processing performance using the CoreMark benchmark. CoreMark’s developers introduced it in 2009, and it focuses exclusively on the processor core rather than the memory architecture’s ability to hide latency. It comprises several core functions that try to exercise 8-, 16-, and 32-bit operation in roughly equal amounts. A state-machine component, which is basically an 8-bit implementation, covers 8-bit operation, and 8-bit processors are strong in this task.
A double-link list is another component of the benchmark that is a processing sweet spot for 16-bit architectures; however, the benchmark sizes the link list to be appropriate for 8-bit architectures because the list contains only 14 elements. This detail is important because the designers of the benchmark considered the implications of using the benchmark on different-sized architectures. When running the benchmarks, however, you must understand these types of trade-offs to ensure that the compiler is generating code with the appropriate assumptions.
In the case of the double-link list, it is a reasonable assumption that a compiler will specify 32-bit pointers for a 32-bit processor and 16-bit pointers for a 16-bit processor. However, what size pointers should the compiler use for an 8-bit processor? Remember that the benchmark should exercise a task that would be reasonable for the target processor to perform; otherwise, the exercise will produce noise.
Unfortunately, when you are compiling code like this, you probably need to explicitly tell the compiler to use 8-bit pointers. Letting it use 16- or 32-bit pointers on an 8-bit processor grossly overstates the code and data memory needed for a data structure that you would never use in that form on such a small machine. Rather than occupying 3 bytes per data element (1 byte of data plus two 1-byte pointers), the structure would occupy 5 or 9 bytes per data element. The code would also require additional instructions to load the 16- or 32-bit addresses.
On an 8-bit processor, a double-link list would reasonably use 8-bit data with two 8-bit pointers. Using 8-bit pointers in this data structure might necessitate the use of a base or an index pointer, and it would place a hard limit on the size of the list so that the entire data structure would fit within the 8-bit address. In this case, the list is 14 elements long—far short of the approximately 80-element maximum for implementing 8-bit pointers with this type of data structure.
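The 8-bit layout described above can be sketched in C. This is a minimal illustration, not code from CoreMark: the names `node8_t`, `pool`, `NIL`, and the helper functions are hypothetical, and the links are stored as 8-bit element indices relative to a fixed array, which is one way to realize the "base or index pointer" idea. With raw 8-bit byte addresses instead of indices, a 256-byte page would hold 256 / 3 = 85 such elements, in line with the approximate limit quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: links are 8-bit indices into a fixed pool,
   not machine pointers. NIL marks "no neighbor". */
#define POOL_SIZE 14u   /* list length quoted for the benchmark */
#define NIL       0xFFu

typedef struct {
    uint8_t data;
    uint8_t prev;   /* index of the previous node, or NIL */
    uint8_t next;   /* index of the next node, or NIL */
} node8_t;          /* three 1-byte members: 3 bytes per element */

/* The same node with full-width links, for comparison only. Its size
   depends on the target's pointer width and structure padding. */
typedef struct node_ptr {
    uint8_t data;
    struct node_ptr *prev;
    struct node_ptr *next;
} node_ptr_t;

static node8_t pool[POOL_SIZE];   /* the base that all indices refer to */

/* Link pool[0] .. pool[POOL_SIZE-1] in order; returns the head index. */
uint8_t list_init(void) {
    for (uint8_t i = 0; i < POOL_SIZE; i++) {
        pool[i].data = i;
        pool[i].prev = (i == 0) ? NIL : (uint8_t)(i - 1);
        pool[i].next = (i == POOL_SIZE - 1) ? NIL : (uint8_t)(i + 1);
    }
    return 0;
}

/* Unlink node idx (assumed to be in the list); returns the new head. */
uint8_t list_remove(uint8_t head, uint8_t idx) {
    assert(idx < POOL_SIZE);
    uint8_t p = pool[idx].prev;
    uint8_t n = pool[idx].next;
    if (p != NIL) pool[p].next = n; else head = n;
    if (n != NIL) pool[n].prev = p;
    pool[idx].prev = NIL;
    pool[idx].next = NIL;
    return head;
}

/* Count the nodes reachable from head by following next links. */
uint8_t list_length(uint8_t head) {
    uint8_t count = 0;
    for (uint8_t i = head; i != NIL; i = pool[i].next) count++;
    return count;
}
```

Reserving one index value as `NIL` limits an index-based list to 255 elements, far more than the 14 the benchmark needs, and every link update is a single-byte load or store on an 8-bit core.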
Another component of the benchmark is matrix manipulations. This component favors those architectures that can implement looping optimizations and comprises 16- and 32-bit operations that favor architectures with 32-bit math units or other features, such as SIMD (single-instruction/multiple-data) extensions. The final component of the CoreMark benchmark is a 16-bit CRC (cyclic redundancy check) that acts as a verification task and helps balance the 16-bit operations with the 8- and 32-bit operations. However, just because an operation is a 16- or 32-bit operation does not mean that an 8-bit processor is completely inappropriate for the task. Infineon’s 8-bit XC878 core has 16- and 32-bit extended, semiautonomous peripherals that allow the system to perform these extended tasks without overburdening the processor core (Figure 3). These extended peripherals are appropriate for an application-specific processor with a well-known set of tasks and constraints to meet tight cost and power targets.
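To make the 16-bit CRC component concrete, here is a minimal bitwise sketch. It assumes the widely documented CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF, no bit reflection) purely for illustration; it is not claimed to be the variant or the code that CoreMark uses, and the function names are hypothetical. The point is that the work arrives one byte at a time, and only the running remainder is 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold one input byte into a running CRC-16/CCITT-FALSE remainder.
   Parameters chosen for illustration: polynomial 0x1021, MSB first. */
uint16_t crc16_update(uint16_t crc, uint8_t byte) {
    crc ^= (uint16_t)((uint16_t)byte << 8);
    for (uint8_t bit = 0; bit < 8; bit++) {
        if (crc & 0x8000u)
            crc = (uint16_t)((crc << 1) ^ 0x1021u);
        else
            crc = (uint16_t)(crc << 1);
    }
    return crc;
}

/* CRC over a buffer, starting from the initial value 0xFFFF. */
uint16_t crc16(const uint8_t *data, size_t len) {
    assert(data != NULL || len == 0);
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc16_update(crc, data[i]);
    return crc;
}
```

On an 8-bit core, a compiler would typically split each 16-bit shift and XOR into a pair of byte operations, which is the kind of extra instruction overhead that a hardware CRC peripheral can absorb.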
Unfortunately, when comparing 8- and 32-bit architectures, you cannot completely separate out the performance of the components in the CoreMark benchmark so that you examine only those that are relevant to your target. Moreover, as the Infineon processor shows, it is sometimes economically feasible to make a specialized part, which further complicates apples-to-apples comparisons unless you have a deep understanding of the problem and the target processors. Code density is a tough measurement to compare because, as the double-link-list example shows, each processor size is appropriate for analogous implementations of different types and sizes of data sets.
Power and energy
The 8- and 16-bit processors also often have an advantage over 32-bit processors in power consumption or, more important, system-level energy. When comparing processors, you must measure the energy the system consumes while asleep, the energy it wastes when it wakes up, and the energy it consumes to actively perform its tasks. The energy the system loses while the processor wakes up from sleep is a function of the system settling time, the interval before the clock signal settles into the range for proper processor operation. How often the system has to wake up compared with how much energy it consumes performing active processing determines the impact of this wake-up loss on the system.
Pricing aside, 8- and 16-bit processors use older process geometries because the larger geometries allow a much lower leakage current than do more advanced processes. This fact is especially important for systems that sleep most of the time. However, choosing a process geometry that yields a lower sleep or leakage current is a trade-off because it also means that the system has a higher active current when the system is awake. As a result, the energy consumption depends on the ratio of sleep time to active processing time that the system will experience. Smaller and larger processors can take significantly different amounts of time to finish a task, further complicating the trade-off. A 32-bit processor’s ability to complete active processing more quickly than an 8-bit processor can offset the 32-bit device’s higher power consumption because it can spend even more time in sleep mode and yield a net savings in system energy dissipation.
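The sleep, wake-up, and active terms can be combined into a simple per-period energy estimate, sketched below. Everything here is an assumption for illustration: the structure, the function name, and especially the two device profiles, whose numbers are invented and do not describe any real part. The model charges one wake-up and one task per period and sleeps for the rest of it.

```c
#include <assert.h>

typedef struct {
    double sleep_power_uw;   /* microwatts drawn while asleep */
    double active_power_uw;  /* microwatts drawn while processing */
    double wake_energy_uj;   /* microjoules lost per wake-up (settling) */
    double task_time_s;      /* seconds of active time per task */
} mcu_profile_t;

/* Hypothetical profiles: a low-leakage, slower part and a faster part
   with higher sleep and active power. Numbers are made up. */
static const mcu_profile_t SLOW_MCU = { 1.0, 3000.0, 2.0, 0.010 };
static const mcu_profile_t FAST_MCU = { 3.0, 9000.0, 5.0, 0.002 };

/* Microjoules per period: sleep energy + wake-up loss + active energy.
   (microwatts x seconds = microjoules) */
double energy_per_period_uj(const mcu_profile_t *m, double period_s) {
    assert(m->task_time_s <= period_s);
    double sleep_time_s = period_s - m->task_time_s;
    return m->sleep_power_uw * sleep_time_s
         + m->wake_energy_uj
         + m->active_power_uw * m->task_time_s;
}
```

With these invented numbers, waking once per second costs the faster part roughly 26 µJ against roughly 33 µJ for the slower one, because it returns to sleep sooner. Waking once per minute reverses the result (roughly 203 µJ against 92 µJ), because sleep leakage dominates the period.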
Contemporary processors are implementing ever-more-sophisticated power-management techniques. These innovative approaches go beyond the process-geometry issues to the heart of resource allocation and sizing. One small example, which NXP and Texas Instruments are using, is ROM that houses system drivers and libraries, representing the final integration of a function as a hardware block and a firmware block. Using ROM in this way provides stability to targeted low-level functions, and it may reduce the amount of program flash a design might otherwise need if the designer left those functions as software for the end developer. The need for smaller flash can in a small way affect the total silicon cost and energy requirements for the system. By itself, this savings is not large, but combining many of these small types of savings can result in real and measurable cost and energy savings.
Crystal ball
Although 32-bit processors can approach cost and energy parity with 8-bit processors, contemporary discussion about low-end 32-bit processors often overlooks an analogous relationship between FPGAs and ASSPs (application-specific standard products) at the high end of the processing market. The processing sweet spot for FPGAs is a task that can leverage arbitrarily wide signal-processing algorithms that designers implement as hardware-acceleration blocks. FPGAs have an advantage over DSPs when the signal-processing algorithm is specialized or wide enough to benefit from using more parallel execution units than a hardened processor architecture has.
Designers base the number of execution units they implement in a DSP or a microprocessor on a trade-off between silicon cost, energy consumption, and the ability of the target applications to keep all of the implemented execution units busy enough to justify their inclusion in the device. Texas Instruments and Freescale offer the C6472 and MSC8156 DSPs, respectively, which have six cores. Both companies explored the choice of eight-core configurations, but the six-core configurations struck the best balance of cost, power, and resource usage for the range of targeted wireless applications. An FPGA need not balance the execution units across multiple application designs as an ASSP does because each design can independently implement the optimum number and type of execution resources for each design.
However, as algorithms and applications mature, patterns emerge. Architects of ASSPs can take advantage of these patterns to provide systems that are better than FPGAs from cost and energy perspectives in high-volume applications. DSP vendors, such as Texas Instruments and Freescale, have integrated into their processor architectures the Turbo and Viterbi decoding algorithms as hardware accelerators. An FPGA has a tough time competing with these types of processors with hardware accelerators when they provide a perfect match with the processing requirements of a design.
Does this relationship mean the eventual end of FPGAs, or does it mean that a key value of FPGAs is that designers can feasibly, technically, and economically implement innovative designs with an FPGA years before ASSPs can competitively support those same designs? In a similar fashion, 8- and 16-bit processors will be able to reach lower price and energy thresholds years before 32-bit processors can feasibly support those same thresholds.
At the low end of the embedded-design market, the key constraint is not the amount of processing performance you can cram into a unit of time but rather what kind of processing you can perform with ambient energy. Energy-scavenging-based designs and low-speed, cascading, or feedback-based-processing mesh designs are potentially huge emerging applications that smaller processors will be able to enable years before 32-bit processors feasibly can. However, these types of applications have an additional significant hurdle to overcome before they can explode onto the scene because the programming paradigm of the past few decades has directed programming languages, development tools, and processor architectures to focus on optimizing processing capability over a unit of time rather than how to extract processing value in a variable and energy-starved environment.
Author Information
You can reach Technical Editor Robert Cravotta at 1-661-296-5096 and rcravotta@edn.com.
References
1. Cravotta, Robert, “Reaching down: 32-bit processors aim for 8 bits,” EDN, Feb 17, 2005, pg 31.
2. Cravotta, Robert, “Putting the squeeze on 16-bit processors,” EDN, Feb 15, 2007, pg 60.
3. Cravotta, Robert, “‘I’d like to buy a µ’: the 36th annual microprocessor directory,” EDN, Oct 22, 2009, pg 28.
Captions
Figure 2: The Cortex-M0 implements a subset of the Cortex-M3 Thumb2 instruction-set architecture (courtesy NXP).
Figure 3: Semiautonomous peripherals allow a smaller system to perform extended tasks without overburdening the processor core (courtesy Infineon).