Consumer ICs: Designing for reliability

(01 Jun 2008)
By Michael Santarini, Senior Editor, EDN

If you’ve purchased a Microsoft Xbox 360 or a Sony PlayStation 3, the person who sold you the system most likely recommended adding a cooling fan to go along with it. Chances are good that you reluctantly forked over the extra $30, even though the fan compensates for what appears to be a design flaw that shouldn’t exist in the first place. And, if you were an early Xbox 360 customer, you probably received a recall notice from Microsoft offering free replacement of ICs, IC-cooling systems, or both, which were prone to causing system slowdowns or even failures. Even if you didn’t yet own a 360, you had probably heard about the recall and the possibility that the 360 had design flaws, but you went ahead and bought one anyway.

It’s an increasingly interesting phenomenon in the industry: Consumers are buying products that they know are prone to early failures. In the consumer-electronics market, the drive for the latest and greatest digital “bling” often overcomes better judgment, so purchases of consumer electronics are quickly becoming emotional ones. Many people buy a new game console every four years or so, as each new generation becomes available; a new mobile phone and MP3 player every year; and a TV and PC every four to six years.

Although consumers now seem willing to fork over cash for the bling, will they be willing to do so in the future if product failures start occurring before their mobile-phone contract has expired? Even if consumers aren’t worried, the makers of consumer products should be, because early defects will sooner or later cause costly recalls and may even turn consumers and OEMs against the defective brands. In the game-console world, consumers have only three choices: the Xbox 360, the PS3, and the Wii. In the TV, cell-phone, and most other consumer-electronics niches, however, consumers and OEMs have many choices—and very long memories.

Product longevity, or the lack thereof, becomes an even more daunting problem when you take into account the ever-increasing complexity of designing and manufacturing the leading-edge ICs that power consumer devices. The semiconductor industry now focuses largely on ensuring that fabricated IC designs achieve sufficient yields, that the ICs pass functional tests so they can go into products in large volumes, and that those products reach store shelves before the competition’s. But, as IC processes become more advanced and consumer demand for greater performance and more system functions increases, IC failures will become more commonplace unless vendors tackle reliability issues.

Providers of military, automotive, and medical ICs have long practiced high-reliability techniques to ensure that devices last. Those designing and manufacturing ICs for the consumer and OEM markets have also paid close attention to reliability, and most target an MTBF (mean time between failures) of at least 10 years, longer than most consumers will keep the products. Experts say that reliability will always be a major concern for semiconductor vendors, but these vendors must overcome many obstacles to produce reliable parts that also meet customers’ increasing demands for faster, smaller, and higher-performance products. Most consumer-device manufacturers employ reliability-engineering groups that set guidelines for, and closely monitor, designs through each step of the design, manufacturing, packaging, and burn-in processes. Burn-in is an important step because it subjects devices to accelerated-lifetime tests under worst-case conditions, such as high temperature and humidity. As a manufacturer develops each new silicon process, these reliability-engineering groups constantly look out for both new and re-emerging failure mechanisms (Figure 1). Today, they also must monitor trends, such as gate leakage and process variability, that can complicate the manufacture of reliable ICs (see box “The shifting sands of silicon” in the Web version of this article at www.edn.com/080306cs).
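
Burn-in compresses years of field use into hours by running parts hot. A common way to quantify that compression, though the article does not say which model any vendor uses, is the Arrhenius acceleration factor AF = exp[(Ea/k)(1/T_use - 1/T_stress)], with temperatures in kelvins. The Python sketch below assumes a hypothetical 0.7-eV activation energy, a 55°C use temperature, and a 125°C oven; it also converts the 10-year MTBF target into FIT (failures per 10^9 device-hours), assuming a constant failure rate.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant, eV/K
HOURS_PER_YEAR = 24 * 365


def arrhenius_acceleration(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Acceleration factor of a stress temperature relative to a use temperature (Arrhenius)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k))


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Failure rate in FIT for a given MTBF, assuming a constant (exponential) failure rate."""
    return 1e9 / mtbf_hours


# Hypothetical inputs, not taken from any vendor's qualification plan.
af = arrhenius_acceleration(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
burn_in_hours = 1000.0
equivalent_years = burn_in_hours * af / HOURS_PER_YEAR

print(f"acceleration factor: {af:.1f}")
print(f"{burn_in_hours:.0f} h of burn-in ~ {equivalent_years:.1f} years at 55 C")
print(f"10-year MTBF ~ {mtbf_hours_to_fit(10 * HOURS_PER_YEAR):.0f} FIT")
```

With these assumed numbers the factor comes out near 78, so 1,000 oven hours stand in for roughly nine years of use. Activation energy differs from one failure mechanism to another, and the result is very sensitive to it.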

“There’s no such thing as the ‘same old, same old’ in the reliability world,” says Jack Hergenrother, PhD, manager of technology for System Z Test in the IBM systems and technology group. “We’ve been progressing continuously in our understanding of new failure mechanisms and new ways of looking at potential wear-out and failure mechanisms.” According to Hergenrother, this phenomenon is not unique to IBM. “It’s an industrywide thing,” he says. “In the last 10 years [as Moore’s Law has evolved], some new mechanisms have come up, and you need to factor in those [mechanisms] during the qualification and design process. That [need] is true from both a chip- and a system-reliability perspective.”

Experts say that the industry has been able to address reliability issues adequately and quickly at all phases of development. According to John Chen, vice president of technology and foundry operations at graphics-processor vendor Nvidia, the industry will be able to overcome these problems in the next few years. “Designers need to be aware of these issues, so they can take full advantage of advanced technology and avoid pitfalls,” he says. Nvidia and Xilinx are both at the forefront in creating designs that use new logic processes, so they and their foundry partners must be aware of potential failures, according to Glenn O’Rourke, senior director of product-development engineering in the advanced-products group at Xilinx (see box “Fab or fabless: Reliability is still top goal”). “We more than double the [number of] transistors in our designs every 18 months because graphics engines require massive processing power,” says Chen.

In 1996, Nvidia co-founder Chris Malachowsky designed the company’s first chip, a 1-million-transistor design that was massive for its time. In comparison, the company’s latest graphics processor, built in 65-nm technology, exceeds 1 billion transistors. “We can always use smaller, faster, and higher performing transistors, unlike some applications that are pad-limited and are not scalable,” says Chen. “We have a great opportunity for riding the wave of Moore’s Law; we are always at the forefront of the technology. However, new challenges come with being one of the first companies to use a new technology.”
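
The growth Chen describes can be checked with simple arithmetic. The sketch below assumes, since the article does not give a date, that the 1-billion-transistor part arrived around the article’s 2008 publication date; it computes the average doubling period implied by going from 1 million to 1 billion transistors.

```python
import math


def doubling_period_months(start_count: float, end_count: float, years: float) -> float:
    """Average months per doubling implied by growing from start_count to end_count."""
    doublings = math.log2(end_count / start_count)
    return years * 12.0 / doublings


# Counts from the text; the 2008 end year is an assumption (the article's date).
start, end, years = 1e6, 1e9, 2008 - 1996
print(f"{math.log2(end / start):.1f} doublings in {years} years")
print(f"one doubling every {doubling_period_months(start, end, years):.1f} months")
```

A factor of 1,000 is almost exactly 10 doublings, so 12 years works out to roughly one doubling every 14 to 15 months, consistent with “more than double every 18 months.”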

IC-failure mechanisms
For the 130-, 90-, 65-, and 45-nm process nodes, IC-reliability groups have paid the most attention to failure mechanisms such as NBTI (negative-bias-temperature instability), hot-carrier effects, EM (electromigration), gate-oxide integrity, and SERs (soft-error rates). NBTI and hot-carrier effects are two commonly monitored failure mechanisms, both leading to a loss of gate control (References 1 and 2). NBTI is a key reliability issue of immediate concern in CMOS devices under stress from negative gate voltages. Hot-carrier effects occur when an electron or a hole gains enough kinetic energy to overcome a potential barrier, becoming a “hot carrier,” and then migrates to a different area of the device. In both NBTI and hot-carrier effects, the transistor’s drive current decreases, slowing the gate and degrading circuit timing, which can eventually cause failures.
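
To see why a shrinking drive current shows up as a timing problem, one can use the alpha-power law, in which saturation current scales roughly as (V_DD - V_th)^α and gate delay scales as C·V_DD/I_D. The sketch assumes hypothetical values (V_DD = 1.0 V, fresh V_th = 0.3 V, α = 1.3); it is a first-order model, not a sign-off timing method.

```python
def relative_drive_current(vdd: float, vth: float, delta_vth: float, alpha: float = 1.3) -> float:
    """I_dsat after a threshold-voltage shift, relative to the fresh device (alpha-power law)."""
    overdrive_fresh = vdd - vth
    overdrive_aged = vdd - vth - delta_vth
    if overdrive_aged <= 0:
        raise ValueError("threshold shift leaves no gate overdrive")
    return (overdrive_aged / overdrive_fresh) ** alpha


def relative_gate_delay(vdd: float, vth: float, delta_vth: float, alpha: float = 1.3) -> float:
    """Delay ~ C * Vdd / I_dsat, so at fixed load and supply it is the inverse current ratio."""
    return 1.0 / relative_drive_current(vdd, vth, delta_vth, alpha)


# Hypothetical device: 1.0-V supply, 0.3-V fresh threshold.
for shift_mv in (0, 10, 30, 50):
    slowdown = relative_gate_delay(1.0, 0.3, shift_mv / 1000.0) - 1.0
    print(f"dVth = {shift_mv:2d} mV -> gate delay +{slowdown * 100:.1f}%")
```

Under these assumptions a 30-mV threshold shift slows the gate by about 6%, enough to break a path that had little timing margin.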

NBTI became a problem at the 90-nm node, but manufacturers quickly addressed it. According to Li-Pen Yuan, group R&D director for extraction- and power-integrity products at Synopsys, the initial studies of NBTI typically focused on devices under always-on dc stress, in which case the problem is worse. Devices running on ac have less of a problem with NBTI because the stress is intermittent, so the transistors are not continuously overstressed. NBTI remains an issue that reliability and design groups must monitor, however, especially if their designs target dc-system applications, such as mobile computing or handheld devices.
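
The dc-versus-ac distinction can be illustrated with the power-law form often used to describe NBTI drift, ΔV_th ≈ A·t^n, with a time exponent n commonly quoted between about 1/6 and 1/4. The sketch below makes a deliberately crude assumption: ac stress is treated as dc stress applied for only the duty-cycle fraction of the time, which ignores the partial recovery that occurs when the stress is removed. The prefactor A and the exponent are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def nbti_delta_vth(stress_s: float, prefactor: float = 3e-4,
                   exponent: float = 0.25, duty_cycle: float = 1.0) -> float:
    """Threshold shift (V) from a power-law NBTI model, dVth = A * (duty * t)**n.

    duty_cycle = 1.0 represents always-on dc stress; smaller values approximate ac
    stress as dc stress applied for that fraction of the time (recovery is ignored).
    """
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")
    return prefactor * (duty_cycle * stress_s) ** exponent


ten_years = 10 * SECONDS_PER_YEAR
dc_shift = nbti_delta_vth(ten_years)
ac_shift = nbti_delta_vth(ten_years, duty_cycle=0.5)
print(f"dc stress, 10 years: {dc_shift * 1000:.0f} mV")
print(f"50% duty ac stress, 10 years: {ac_shift * 1000:.0f} mV ({ac_shift / dc_shift:.2f}x dc)")
```

Note the weak time dependence: with n = 0.25, stressing 16 times longer only doubles the shift.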

NBTI hasn’t disappeared but has faded into the background, says IBM’s Hergenrother. “A few years ago, it caused some problems,” he says. “You don’t hear about it too much anymore because we have figured out how to deal with it. Today, you hear more about PBTI [positive-bias-temperature instability], which is similar to NBTI, except that it happens on [an NFET rather than a PFET]. The physics of PBTI are different enough that it will become a problem at later technology nodes. This time, the industry will likely be more ready for it.”

IC manufacturers squeeze more speed from transistors and minimize leakage power by using strain engineering, a technique for enhancing performance by modulating strain, or stress, in the transistor channel. Modulating strain enhances carrier mobility and, thus, conductivity through the channel. One side effect of the technique is that it can aggravate hot-carrier effects, which can shift the threshold voltage and shorten an IC’s lifetime. “Intuitively, if you use strain engineering, you make the transistor faster and higher power and may cause more hot-electron, or hot-carrier, effects,” says Chen. He explains that strain engineering induces a higher electric field near the drain side of the transistor, causing electrons in the N channel to quickly reach velocity saturation; the faster the electrons move, the more current the transistor delivers. “[The moving electron] will hit other electron-hole pairs and generate other electrons,” he says. “It’s an avalanche effect, impact ionization, that creates more electrons, and, when they get too much energy, they jump into the MOS-gate dielectric and get trapped there, causing a threshold shift and ultimately device failures. But manufacturers have figured out how to increase the barrier entering the gate dielectric. In that regard, it helps: It increases the hot electrons but creates a barrier to stop electrons from getting trapped in the dielectric. The net effect is equal or fewer hot-carrier effects.”
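
Hot-carrier lifetime is typically measured under elevated drain voltage and extrapolated down to the operating voltage. One widely used empirical form models lifetime as τ ∝ exp(B/V_DS). The sketch below assumes that form with a hypothetical fitting constant B and hypothetical stress conditions; real constants are process-specific.

```python
import math


def extrapolate_hc_lifetime(tau_stress_h: float, vds_stress: float,
                            vds_use: float, b_volts: float) -> float:
    """Extrapolate hot-carrier lifetime with the empirical model tau = A * exp(B / Vds)."""
    # The unknown prefactor A cancels when taking the ratio tau_use / tau_stress.
    return tau_stress_h * math.exp(b_volts * (1.0 / vds_use - 1.0 / vds_stress))


# Hypothetical: 100 h to a failure criterion at 1.8-V stress, 1.2-V operation, B = 25 V.
tau_use = extrapolate_hc_lifetime(100.0, vds_stress=1.8, vds_use=1.2, b_volts=25.0)
print(f"extrapolated lifetime at 1.2 V: {tau_use:,.0f} h ({tau_use / 8760:.1f} years)")
```

Because the exponent is large, small errors in B or in the assumed operating voltage move the extrapolated lifetime by large factors.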

EM, the most diligently monitored failure mechanism, occurs when high current density through thin metal interconnect gradually pushes metal atoms along the direction of electron flow. Over time, the migrating atoms can leave voids that open a trace or pile up into extrusions that touch an adjacent trace, causing a short circuit; either can lead to device failure. Because EM develops over time, it can cause failures long after the chips have left testing. Both the semiconductor and the EDA industries have been aware of EM for many years. “EDA vendors offer analysis tools to detect areas of a design that are susceptible to EM,” says Synopsys’ Yuan. With each new process, the number of EM-sensitive areas has grown, but not dramatically. “A typical design 10 years ago would have a few areas that were sensitive to EM,” says Yuan. “Today, a design may have just 10 [areas]. It isn’t like the problem is exploding.” Because EM remains a problem, however, tools for preventing it will likely become more common in the mainstream designer’s toolbox.
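
The screening that EM tools perform can be sketched in two pieces: flag interconnect segments whose current density exceeds a limit, and estimate how lifetime scales with current density and temperature using Black’s equation, MTTF = A·J^(-n)·exp(Ea/kT). Everything in the sketch is an assumption for illustration: the segment names and data, the density limit, n = 2, and Ea = 0.9 eV are hypothetical, and this is not how any particular commercial tool works.

```python
import math
from dataclasses import dataclass

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant, eV/K


@dataclass
class WireSegment:
    name: str
    current_ma: float    # average current through the segment, mA
    width_um: float      # metal width, micrometers
    thickness_um: float  # metal thickness, micrometers

    def current_density(self) -> float:
        """Current density in mA per square micrometer of cross section."""
        return self.current_ma / (self.width_um * self.thickness_um)


def flag_em_risks(segments, limit_ma_per_um2: float):
    """Names of segments whose current density exceeds the (hypothetical) limit."""
    return [s.name for s in segments if s.current_density() > limit_ma_per_um2]


def black_mttf_ratio(j_ratio: float, t_ref_c: float, t_new_c: float,
                     ea_ev: float = 0.9, n: float = 2.0) -> float:
    """MTTF(new) / MTTF(ref) from Black's equation, MTTF = A * J**-n * exp(Ea / kT)."""
    t_ref_k = t_ref_c + 273.15
    t_new_k = t_new_c + 273.15
    thermal = math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_new_k - 1.0 / t_ref_k))
    return j_ratio ** (-n) * thermal


segments = [  # hypothetical nets
    WireSegment("clk_trunk", current_ma=2.0, width_um=1.0, thickness_um=0.3),
    WireSegment("pg_strap", current_ma=1.0, width_um=0.2, thickness_um=0.2),
    WireSegment("sig_17", current_ma=0.05, width_um=0.1, thickness_um=0.2),
]
print("over limit:", flag_em_risks(segments, limit_ma_per_um2=10.0))
print(f"2x current density, same temperature: {black_mttf_ratio(2.0, 85, 85):.2f}x MTTF")
print(f"same density, 85 C -> 105 C: {black_mttf_ratio(1.0, 85, 105):.2f}x MTTF")
```

With n = 2, doubling the current density through a segment cuts its projected EM lifetime by a factor of four, which is why widening a flagged wire to lower J is a common fix.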

Another failure mechanism involves gate-oxide integrity: current through the gate dielectric slowly degrades it until it breaks down, which can lead to failures. Chen notes that new materials, such as high-k dielectrics combined with metal gates, will help improve reliability in this area. Intel pioneered these materials, and the rest of the silicon manufacturers will soon follow. According to Chen, some 45-nm designs and, more likely, 32-nm designs will use high-k dielectrics based on hafnium oxide instead of the traditional silicon-dioxide gate oxide. Manufacturers grow silicon-dioxide gate oxide on the silicon during processing, which creates a remarkably smooth surface. In high-k fabrication, however, manufacturers deposit the hafnium oxide on the silicon in composite layers. “If you use one type of layer, it usually doesn’t work,” says Chen. A multilayer high-k dielectric usually has fewer pinholes because pinholes in separate layers rarely line up. High-k dielectric materials also usually improve TDDB (time-dependent-dielectric-breakdown) performance. However, unlike silicon dioxide, composite layers contain more traps, which can capture electrons in N-channel devices or holes in P-channel devices and cause soft breakdowns, he says. This trapping degrades mobility and, in the long term, can create threshold instability. Manufacturers have developed various process tricks to overcome this issue. “One way is to put a silicon-dioxide layer between the high-k-metal layer and the silicon,” says Chen.
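
Why a physically thicker high-k stack can still act like a very thin gate oxide is captured by the equivalent oxide thickness (EOT): the layers of a gate stack behave as capacitors in series, so EOT = Σ t_i·(3.9/k_i), where 3.9 is the relative permittivity of silicon dioxide. The sketch assumes a hypothetical stack: a 0.5-nm SiO2 interlayer, the kind Chen describes, under 2.0 nm of hafnium oxide with an assumed k of 20.

```python
K_SIO2 = 3.9  # relative permittivity of silicon dioxide


def equivalent_oxide_thickness(layers):
    """EOT in nm for a stack of (thickness_nm, relative_permittivity) layers in series.

    Series capacitors add as 1/C = sum(t_i / (eps0 * k_i)), so the SiO2 thickness giving
    the same capacitance is K_SIO2 * sum(t_i / k_i).
    """
    return K_SIO2 * sum(thickness / k for thickness, k in layers)


# Hypothetical stack: 0.5-nm SiO2 interlayer plus 2.0 nm of HfO2 (k assumed to be 20).
stack = [(0.5, K_SIO2), (2.0, 20.0)]
physical = sum(thickness for thickness, _ in stack)
print(f"physical thickness {physical:.1f} nm, EOT {equivalent_oxide_thickness(stack):.2f} nm")
```

The 2.5-nm physical stack behaves electrically like about 0.89 nm of SiO2. The interlayer adds its full thickness to the EOT, so the interface quality it buys costs some of the scaling advantage of the high-k film.
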

Soft errors, long a concern in the military- and aerospace-IC and memory markets, are now becoming a greater concern in logic devices (Reference 3). They are typically caused by alpha particles emitted from packaging materials or by naturally occurring neutron strikes. Essentially, an alpha particle or neutron can strike a device, deposit charge, and flip bits in memory devices or even flip latches in your circuit. “It is getting to be a bigger challenge with each technology generation,” says IBM’s Hergenrother. “At the active areas of the devices, the volume of the critical-area devices keep going down, which means that you have to deposit smaller and smaller amounts of charge to create an upset in your transistor.” It’s difficult to eliminate alpha emitters from packaging materials, so you must build immunity to both cosmic and alpha particles into your system. You can address soft errors at many levels. “[IBM] looks at SER at the technology level to make transistors soft-error-tolerant, at the circuit level … to arrange transistors into latches and flip-flops so that it is robust even if one of the transistors does flip,” he says. “Then, we look at the chip level for robust error-detection and -correction mechanisms, so, even if there is an error, we catch it and correct it before it propagates any undesirable data. On top of those mechanisms, we have system-level protection, which is another layer of error detection and correction.”
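
Chip-level error detection and correction of the kind Hergenrother describes is commonly built on Hamming-style codes. The article does not say which codes IBM uses, so the sketch below is simply the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word, such as one upset by a particle strike. Production memories more often use SEC-DED variants that add an overall parity bit so that double-bit errors are at least detected.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword; parity bits sit at positions 1, 2, and 4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data_bits, flipped_position or 0)."""
    bits = list(codeword)
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position  # XOR of the positions of all 1 bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # a nonzero syndrome names the flipped position
    return [bits[2], bits[4], bits[5], bits[6]], syndrome


# Simulate a single-event upset on bit position 6 of a stored word.
stored = hamming74_encode([1, 0, 1, 1])
upset = list(stored)
upset[5] ^= 1
recovered, position = hamming74_decode(upset)
print(f"stored {stored}, read {upset}, corrected position {position}, data {recovered}")
```

The decoder recovers the original data and reports which bit was upset, so the error is corrected before the word is used.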

Several failure mechanisms can lead to reliability issues, and the semiconductor industry has been diligent about identifying and correcting them before they reach consumers. As devices move closer to the limits of physics and CMOS, however, you may wonder whether reliability will become a harder problem to manage.


For more information

Apache Design Solutions, www.apache-da.com
IBM, www.ibm.com
Intel, www.intel.com
Microsoft Corp, www.microsoft.com
Nintendo, www.nintendo.com
Nvidia Corp, www.nvidia.com
Sony, www.sony.com
Synopsys, www.synopsys.com
Toshiba, www.toshiba.com
UMC, www.umc.com
Xilinx, www.xilinx.com
References
1. Peters, Laura, “NBTI: A Growing Threat to Device Reliability,” Semiconductor International, March 1, 2004, www.semiconductor.net/article/CA386329.
2. Peters, Laura, “Strained Silicon: Essential for 45 nm,” Semiconductor International, March 1, 2007, www.semiconductor.net/article/CA6418539.
3. Santarini, Michael, “Cosmic radiation comes to ASIC and SOC design,” EDN, May 12, 2005, pg 46, www.edn.com/article/CA529381.


Fab or fabless: reliability is still top goal

IC reliability is a top concern for both companies that own their own fabs and those that don't. For example, IBM creates its products in house, allowing the company to address reliability concerns and even design trade-offs at all levels of product development: technology and transistor development, circuit-level design, chip design, package design, and system implementation. "Fabless" companies, such as Nvidia and Xilinx, on the other hand, must rely on external foundries to manufacture products. And because the two companies are in highly competitive markets, they tend to be the first to jump to a new foundry's process when it becomes available. But before they begin designing with that process, they make sure it passes rigorous qualifications, such as ISO (International Organization for Standardization) 9000x, ISO 14000, and OCEA (Office of the China Economic Area) standards.

Glenn O'Rourke, senior director of product-development engineering in the advanced-products group at Xilinx, says that the company uses both UMC (United Microelectronics Corp) and Toshiba as suppliers, so Xilinx must ensure that both sources can make comparable versions of Xilinx's devices. To achieve this goal, Xilinx develops a reference model of its designs and ensures that both suppliers can meet its targets. "We develop a silicon reference model upfront, and … both fabs drive to those goals," says O'Rourke. He also notes that, because Xilinx's chips find use in a variety of applications, the company analyzes its designs' lifetime performance (Figure A). "We do an accelerated burn-in to mimic the lifetime of the product," he says. "We do full characterization across temperature and voltage to see how the product's performance is changing over its lifetime. We leverage that data to cover the usage and the specifications for the lifetime of the product." Xilinx then makes the results of those reports available to customers.
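
The approach O'Rourke outlines, characterizing parts at points during accelerated burn-in and projecting the results over the product's lifetime, can be sketched as a curve fit. This is not Xilinx's method: the sketch assumes that delay degradation follows a power law in stress time, fits that law in log-log space to checkpoint data, and converts a 10-year field life into equivalent stress hours with an assumed acceleration factor. The checkpoint data are synthetic, generated from known parameters so that the fit can be checked.

```python
import math


def fit_power_law(hours, degradation):
    """Least-squares fit of degradation = a * hours**n in log-log space; returns (a, n)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(d) for d in degradation]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n = sxy / sxx
    return math.exp(mean_y - n * mean_x), n


# Synthetic checkpoints (percent delay increase), generated from a = 0.5, n = 0.2.
checkpoints_h = [48, 168, 500, 1000]
delay_shift_pct = [0.5 * h ** 0.2 for h in checkpoints_h]

a, n = fit_power_law(checkpoints_h, delay_shift_pct)
acceleration = 78.0  # assumed burn-in-to-field acceleration factor (hypothetical)
equivalent_stress_h = 10 * 8760 / acceleration
projected = a * equivalent_stress_h ** n
print(f"fit: a = {a:.3f}, n = {n:.3f}")
print(f"projected delay shift after 10 field years: {projected:.2f}%")
```

A projection like this is one way to size a timing guardband that covers the specified lifetime rather than only the fresh part.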



Captions

Figure 1 Xilinx devices find use in multiple applications, which in turn target long lifetimes. Xilinx uses lifetime-performance analysis, which runs full characterization during various points in the burn-in process.

Figure A Xilinx and other large companies examine reliability at multiple phases in IC development.
