Intel Cougar Point has failed
January 31, 2011 1:43 PM

As part of ongoing quality assurance, Intel Corporation has discovered a design issue in a recently released support chip, the Intel® 6 Series, code-named Cougar Point, and has implemented a silicon fix. Intel has identified an issue in the chipset that accompanies the new Sandy Bridge processors, one that will likely require the recall of all existing motherboards. The price tag for the recall is currently estimated at $700 million.

The Sandy Bridge processors are the second generation of Core i-Series processors released by Intel. All of the current generation runs on the P67 chipset, codenamed Cougar Point. The new series was released officially on January 9, 2011. The specific issue is that the SATA 3Gbps ports that typically run hard disk drives and CD/DVD drives may degrade with age and eventually fail. It may take until April for Sandy Bridge to return to the market.

This isn't the first time Intel has had quality control questions. The most recent involved reports of P55-based motherboards having the sockets burn out. There is talk already that the cost may greatly exceed that of the Pentium processor FDIV bug for which Intel took a write-down of $475 million in 1995.

Lastly, in a testament to Intel's size, the market seemed to not care much. The stock price closed at $21.46, down under 1% on the day. "This is a minor negative and not as big an issue as it seems," said Miller Tabak analyst Brendan Furlong.

posted by Mister Fabulous (27 comments total) 4 users marked this as a favorite

For a company with a $120B market cap and $35B in '09 revenue, a $700M price tag to fix this isn't quite a company-shattering problem. Instead, it's more of a QC issue and a ding to their quality image.
posted by msbutah at 1:53 PM on January 31, 2011 [1 favorite]

From what I've heard, this doesn't affect the first two SATA ports on the motherboard, so most consumer/business computers are likely unaffected, since they typically won't be using more than just those first two SATA ports.
posted by yeoz at 1:56 PM on January 31, 2011

Well, duh. Sandy Bridge. You make bridges out of concrete and steel, not sand.
posted by Cool Papa Bell at 1:58 PM on January 31, 2011 [4 favorites]

For a company with a $120B market cap and $35B in '09 revenue, a $700M price tag to fix this isn't quite a company-shattering problem.

Still, I wouldn't want to be the engineer who cost the company about 4000 years' salary.
posted by Combustible Edison Lighthouse at 2:09 PM on January 31, 2011 [7 favorites]

This is crappy timing for me, since just last week I ordered an ASUS P8P67, which is due tomorrow. On top of that, I have now discovered that there are apparently a ton of problems with this mobo, all of which just started coming to light this weekend on places like HardOCP.
So, the moral of the story is, if I'm ordering the unit, it's probably going to be broken and you should avoid buying it. I plan on filing a business model patent soon.
posted by Old'n'Busted at 2:10 PM on January 31, 2011 [4 favorites]

Good thing I own AMD stock. Which is actually up 4.5% today.
posted by delmoi at 2:10 PM on January 31, 2011 [1 favorite]

Yeah, I wouldn't want to be on the team that gets the lucite slab for this one. It's like an inverse ship-it award.
posted by GuyZero at 2:11 PM on January 31, 2011

But that's only like 6 years of Goldman Sachs officers.
posted by a robot made out of meat at 2:15 PM on January 31, 2011

Still, I wouldn't want to be the engineer who cost the company about 4000 years' salary.

Unlike bonuses, quality control problems have a fabulous way of dissipating themselves over every division the product touched.
posted by geoff. at 2:26 PM on January 31, 2011 [1 favorite]

The specific issue is that the SATA 3Gbps ports that typically run hard disk drives and CD/DVD drives may degrade with age and eventually fail.

So once this fix is implemented I'll never need to replace any more hardware?!
posted by DU at 2:31 PM on January 31, 2011

Has anyone seen details of the bug and how it destroys hardware? It seems... subtle.
posted by GuyZero at 2:47 PM on January 31, 2011

You make bridges out of concrete and steel, not sand.

Sand is often a component of concrete.
posted by kenko at 2:47 PM on January 31, 2011 [1 favorite]

Anand sez, "Intel expects it’ll cost $700M to actually recall and fix hardware in the market today and another $300M of lost revenue for the chipset business while this is all happening. Altogether we’re talking about a billion dollar penalty."

He also notes this will put back the launch of new MacBook Pros 'til at least April.
posted by BeerFilter at 2:48 PM on January 31, 2011

GuyZero, from the same AnandTech article I linked above:

"The symptoms are pretty simple to check for. Intel says you’d see an increase in bit error rates on a SATA link over time. Transfers will retry if there is an error but eventually, if the error rate is high enough, you’ll see reduced performance as the controller spends more time retrying than it does sending actual data. Ultimately you could see a full disconnect - your SATA drive(s) would no longer be visible at POST or you’d see a drive letter disappear in Windows."

So to clarify, it doesn't destroy your disks. The four 3Gb/s SATA ports they might be connected to slowly degrade.
posted by BeerFilter at 2:52 PM on January 31, 2011
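
The retry behavior described in that quote can be sketched with a toy model. This is only an illustration, assuming independent bit errors and that a frame is resent until it arrives clean. The function names, the 8 KiB frame size, and the 300 MB/s figure (the usable rate of a 3Gb/s link after 8b/10b encoding) are assumptions made for the example, not details from Intel or AnandTech.

```python
# Toy model: how a rising bit error rate eats into usable SATA throughput.
# Assumes independent bit errors and resend-until-clean retries.

FRAME_BITS = 8192 * 8        # assumed frame size: 8 KiB payload
LINK_RATE_MB_S = 300.0       # 3Gb/s link after 8b/10b encoding


def frame_success_probability(bit_error_rate: float, frame_bits: int = FRAME_BITS) -> float:
    """Probability that every bit of one frame arrives intact."""
    return (1.0 - bit_error_rate) ** frame_bits


def effective_throughput(bit_error_rate: float,
                         link_rate: float = LINK_RATE_MB_S,
                         frame_bits: int = FRAME_BITS) -> float:
    """Useful data rate once retries are accounted for.

    With success probability p, a frame needs 1/p sends on average,
    so only a fraction p of the link carries new data.
    """
    return link_rate * frame_success_probability(bit_error_rate, frame_bits)


if __name__ == "__main__":
    for ber in (0.0, 1e-7, 1e-6, 1e-5, 1e-4):
        print(f"BER {ber:.0e}: {effective_throughput(ber):6.1f} MB/s")
```

Under these assumptions, a bit error rate of 1e-5 already leaves only about half of the nominal throughput, which matches the "more time retrying than sending" description.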

It's actually costing Intel $1bn - $700 mill for sorting out the stock that's out there, and $300 mill in lost sales 'cos it won't be able to resume shipping until the end of Feb. An acute observer can probably derive some new information about Intel's channel figures from that.

There's a lot that's odd about the bug. The company says that the chips will gradually go wrong under some circumstances, and that under ten percent will be affected over the three year lifespan most PCs have. Intensive use of SATA (or USB 3) hastens the problem, so I'm guessing that it's some sort of localised thermal overload that gradually cooks the driver circuitry - the power handling bits of the chip that convert the very weedy internal signals to beefy thrusts of electrons that can survive in the outside world.

However, the fix was simple. It took the engineers almost no time to find the problem, once enough faulty chips had been returned by customers who'd been testing them, and 'a few days' to come up with the solution and qualify it to the point that they were happy to shut down production and issue the recall. Given the variables involved when you make any change to circuitry as complex as modern computer logic, this is astonishingly fast.

This fix involves a late metal layer. Chips are built up through a complex series of etching and deposition: the metal layers, of which there are eight (I think) on this chip, connect components that have previously been created in the silicon itself. The metal is deposited in the layers as the chip moves through various stages on the production line. (If you ever get a chance to visit a chip fab, grab it. Awesome places.)

A late metal layer is very good for Intel, as it means all the chips in the fab plant up to that stage - the vast majority - can be completed correctly (a complete fresh start is extraordinarily expensive and time consuming). But that means that it's just signals that need to be changed, whereas thermal hotspots tend to arise when you've got the basic component design or placement wrong. It could be a wholesale rewiring of the transistors in the drivers, but I'd expect that to use more than one metal layer. Could be wrong.

So, hm. But Intel has been unusually quick off the mark and remarkably forthcoming so far. More may yet be on its way.
posted by Devonian at 2:53 PM on January 31, 2011 [9 favorites]

Given the variables involved when you make any change to circuitry as complex as modern computer logic, this is astonishingly fast.

When I worked at Intel, there was a poster in one conference room that diagrammed all the steps involved in making changes to a chip after tape-out. The shortest possible trip through the process was over 1,000 steps long.
posted by nomisxid at 3:04 PM on January 31, 2011 [1 favorite]

Sand is often a component of concrete.

Hyperbole is often a component of humor.
posted by Cool Papa Bell at 3:08 PM on January 31, 2011 [9 favorites]

Metafilter: converting the very weedy internal signals to beefy thrusts of electrons that can survive in the outside world.

posted by lalochezia at 3:26 PM on January 31, 2011 [3 favorites]

code-named Cougar Point, and has implemented a silicon fix

When will cougars realize that being a good conversationalist can attract the young dudes, too: it's not all about your, er, motherboard.
posted by anothermug at 3:59 PM on January 31, 2011 [2 favorites]

Well, that puts off plans for trying to build a computer for a while. My little old laptop will have to wheeze its way through the next couple months, I guess.
posted by Ghidorah at 3:59 PM on January 31, 2011

If the fix involves a metal layer then I would speculate that the problem was metal migration. You can think of this as a fuse that blows very very slowly. A metal line gradually thins over time because it's too small for the amount of current it is being asked to conduct. If the metal line is supplying power to a high performance circuit (like a SATA IO driver), then gradually the resistance will increase and the supply voltage to the circuit will drop, lowering its performance, until eventually the voltage is too low for operation.

This is a very old problem in semiconductors. Usually the design rules protect you from making a line too small. It will be interesting to hear the details if they ever come out.
posted by Long Way To Go at 4:12 PM on January 31, 2011
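
The "slow fuse" picture in that comment is usually quantified with Black's equation, MTTF = A * J^-n * exp(Ea / (k * T)), where J is current density and T is temperature. A minimal sketch follows, comparing a line's expected lifetime against a reference line. The exponent n = 2 and the 0.7 eV activation energy are common textbook values picked for illustration, not Intel's process numbers, and the function name is made up for this example.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(current_density_ratio: float,
                  n: float = 2.0,            # assumed current-density exponent
                  ea_ev: float = 0.7,        # assumed activation energy, eV
                  t_ref_k: float = 358.15,   # reference temperature (85 C)
                  t_k: float = 358.15) -> float:
    """Lifetime relative to a reference line, per Black's equation.

    current_density_ratio is J / J_ref. The prefactor A cancels out
    because only the ratio of two lifetimes is computed.
    """
    current_term = current_density_ratio ** (-n)
    thermal_term = math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_k - 1.0 / t_ref_k))
    return current_term * thermal_term


if __name__ == "__main__":
    # A line carrying twice the reference current density, same temperature.
    print(relative_mttf(2.0))
    # Same current density, but running 20 K hotter.
    print(relative_mttf(1.0, t_k=378.15))
```

With n = 2, doubling the current density in an undersized line cuts its expected lifetime to a quarter, which is why design rules set a minimum line width for a given current.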

Long Way To Go, if you ever find these details, please put them in a hyperlink named "[via]"
posted by 7segment at 4:26 PM on January 31, 2011 [2 favorites]

Fresh from Anandtech, The Source of Intel's Cougar Point SATA Bug:

The problem in the chipset was traced back to a transistor in the 3Gbps PLL clocking tree. The aforementioned transistor has a very thin gate oxide, which allows you to turn it on with a very low voltage. Unfortunately in this case Intel biased the transistor with too high of a voltage, resulting in higher than expected leakage current. Depending on the physical characteristics of the transistor the leakage current here can increase over time which can ultimately result in this failure on the 3Gbps ports. The fact that the 3Gbps and 6Gbps circuits have their own independent clocking trees is what ensures that this problem is limited to only ports 2 - 5 off the controller.
posted by Mister Fabulous at 4:29 PM on January 31, 2011
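
In practical terms, the quote means that only drives plugged into ports 2 through 5 of the chipset's own controller are at risk. A trivial helper, assuming zero-based port numbering as in the quote; the constant and function names are made up for this example, and ports driven by third-party controllers on the board are out of scope.

```python
# Ports 2-5 are the 3Gbps ports named in the quote; ports 0-1 are 6Gbps.
AFFECTED_PORTS = range(2, 6)


def is_port_affected(port: int) -> bool:
    """True if the chipset SATA port uses the faulty 3Gbps clocking tree."""
    return port in AFFECTED_PORTS


if __name__ == "__main__":
    for port in range(6):
        status = "affected" if is_port_affected(port) else "not affected"
        print(f"port {port}: {status}")
```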

So, nuttin to do with thermals and IO - voltage too high for a gate. And it didn't get caught, which is interesting in its own right. That article continues that as the transistor wasn't actually used (left over from a previous version), it was just disconnected. Like cutting a PCB track with a scalpel, I guess, only not.

Sounds like the blame might not be too thinly spread, as well.
posted by Devonian at 6:16 PM on January 31, 2011 [1 favorite]

A pretty surreal week: last week Intel tries to be hip and announces Will.i.am as "Director of Creative Innovation" and this week falls back to the language of arcane and incomprehensible details of chip architecture.
posted by marvin at 7:56 PM on January 31, 2011

Pretty clearly this is will.i.am's fault. He must've been screwing around with the PLL clocking trees just to see what they do on his first day on the job, and who can really blame him?
posted by felix at 6:48 AM on February 1, 2011 [2 favorites]

The company says that the chips will gradually go wrong under some circumstances, and that under ten percent will be affected over the three year lifespan most PCs have.

I am quite far downstream from all of this—we last bought new PCs for our center in 2003—I'll find out about this problem in 2017.
posted by beelzbubba at 9:50 AM on February 1, 2011
