Comments on: Breaking the Coppersmith-Winograd barrier

Breaking the Coppersmith-Winograd barrier

albrecht — Tue, 29 Nov 2011 13:48:43 -0800

For twenty years, the fastest known algorithm to multiply two n-by-n matrices, due to Coppersmith and Winograd, took a leisurely O(n^2.376) steps. Last year, though, buried deep in his PhD thesis, Andy Stothers discussed an improvement to O(n^2.374) steps. And today, Virginia Vassilevska Williams of Berkeley and Stanford, released a breakthrough paper [pdf] that improves the matrix-multiplication time to a lightning-fast O(n^2.373) steps. [via]

First section of the Introduction copied from the paper: The product of two matrices is one of the most basic operations in mathematics and computer science. Many other essential matrix operations can be efficiently reduced to it, such as Gaussian elimination, LUP decomposition, the determinant or the inverse of a matrix [1]. Matrix multiplication is also used as a subroutine in many computational problems that, on the face of it, have nothing to do with matrices. As a small sample illustrating the variety of applications, there are faster algorithms relying on matrix multiplication for graph transitive closure (see e.g. [1]), context free grammar parsing [19], and even learning juntas [12]. Until the late 1960s it was believed that computing the product C of two n x n matrices requires essentially a cubic number of operations, as the fastest algorithm known was the naive algorithm which indeed runs in O(n^3) time. In 1969, Strassen [18] excited the research community by giving the ﬁrst subcubic time algorithm for matrix multiplication, running in O(n^2.808) time. This amazing discovery spawned a long line of research which gradually reduced the matrix multiplication exponent omega over time. In 1978, Pan [13] showed omega < 2.796. The following year, Bini et al. [4] introduced the notion of border rank and obtained omega < 2.78. Schonhage [16] generalized this notion in 1981, proved his tau-theorem (also called the asymptotic sum inequality), and showed that omega < 2.548. In the same paper, combining his work with ideas by Pan, he also showed omega < 2.522. The following year, Romani [14] found that omega < 2.517. The ﬁrst result to break 2.5 was by Coppersmith and Winograd [9] who obtained omega < 2.496. In 1986, Strassen introduced his laser method which allowed for an entirely new attack on the matrix multiplication problem. He also decreased the bound to omega < 2.479. Three years later, Coppersmith and Winograd [10] combined Strassen's technique with a novel form of analysis based on large sets avoiding arithmetic progressions and obtained the famous bound of omega < 2.376 which has remained unchanged for more than twenty years. In 2003, Cohn and Umans [8] introduced a new, group-theoretic framework for designing and analyzing matrix multiplication algorithms. In 2005, together with Kleinberg and Szegedy [7], they obtained several novel matrix multiplication algorithms using the new framework, however they were not able to beat 2.376. Many researchers believe that the true value of omega is 2. In fact, both Coppersmith and Winograd [10] and Cohn et al. [7] presented conjectures which if true would imply omega = 2. Recently, Alon, Shpilka and Umans [2] showed that both the Coppersmith-Winograd conjecture and one of the Cohn et al. [7] conjectures contradict a variant of the widely believed sunﬂower conjecture of Erdos and Rado [11]. Nevertheless, it could be that at least the remaining Cohn et al. conjecture could lead to a proof that omega = 2.

By: NerdcoreRising

NerdcoreRising — Tue, 29 Nov 2011 13:53:43 -0800

Aren't the static coefficients on the algorithms huge? What I'm getting at is, isn't calling them "fast" a bit of a mislabel for normal sized matrices? They are lower asymptotically, yes, but there would need to be astronomically sized matrices for it to be any faster in practice than algorithms with a higher Big-O? Probably a nitty comment, but I think calling them "fast" indicates a misunderstanding of the basic concepts discussed here.

By: the painkiller

the painkiller — Tue, 29 Nov 2011 13:54:21 -0800

I can honestly say with all sincerity that I never saw this coming, not in a million years.

By: beagle

beagle — Tue, 29 Nov 2011 13:56:41 -0800

Question: other than the fact that the previous omega had remained in place for over 20 years, why is it a breakthrough to go from 2.376 to 2.373, which is an improvement of less than 0.13 percent? If the potential target is 2, wouldn't a breakthrough be something that moved the mark a lot closer to 2?

By: 0xFCAF

0xFCAF — Tue, 29 Nov 2011 13:57:36 -0800

I was going to ask about static coefficients as well. I don't have the exact quote, but I think Knuth said something like "O(kⁿ) is generally worse than O(n^k), but in this case, k is 64".

By: theodolite

theodolite — Tue, 29 Nov 2011 13:58:38 -0800

The epigraph on that Stothers thesis is pretty heavy duty. I was hoping there would be another one at the beginning of every section, and then at the end of the paper it turns out that matrix multiplication disproves the possibility of G > 1 wherein G represents the number of deities found in the set of real numbers.

By: GallonOfAlan

GallonOfAlan — Tue, 29 Nov 2011 13:59:11 -0800

What direct implications will this have in computing?

By: yoink

yoink — Tue, 29 Nov 2011 14:00:26 -0800

So...um...why is this important, exactly? I'm not doubting its importance, I hasten to add, but if someone could point out some of the practical applications (or even theoretical applications) for those of us who haven't a clue it'd be nice.

By: jedicus

jedicus — Tue, 29 Nov 2011 14:02:32 -0800

other than the fact that the previous omega had remained in place for over 20 years, why is it a breakthrough to go from 2.376 to 2.373, which is an improvement of less than 0.13 percent? Check out this graph of the difference between x^2.374 and x^2.373. It doesn't take a very large x (in computer science terms) before the efficiency gain becomes very significant.

By: aubilenon

aubilenon — Tue, 29 Nov 2011 14:03:44 -0800

beagle: Well, it's an exponent. So, disregarding all lower order coefficients, if n=1000, it's a full 2.1% faster. And for n=1000000, it's 4.1%. And so on. But yeah, you need a pretty enormous data set to actually make a big difference.

By: qxntpqbbbqxl

qxntpqbbbqxl — Tue, 29 Nov 2011 14:04:31 -0800

What direct implications will this have in computing? None. Like NerdcoreRising said, it's probably not practical. You know how it is when a new Olympic sprinting record is set? This is like that. It will not make a difference for ordinary joggers.

By: jepler

jepler — Tue, 29 Nov 2011 14:08:03 -0800

this blog on the "breakthrough" posits that the new algorithm doesn't do better than the old algorithm for matrices smaller than n✕n where n=2¹⁰⁰⁰, in other words not in any real problems where the new method runs faster than the old one:

This value is far above the Bekenstein Bound for the number of particles that could be fit into a volume the size of the observable universe without collapsing it into a black hole. In this sense the algorithm itself is even beyond galactic.

By: aspo

aspo — Tue, 29 Nov 2011 14:08:06 -0800

jedicus: This the more interesting version of that graph. That's not even 1% faster at 10k. Even at 1m it's still only about 1.3% faster.

By: jepler

jepler — Tue, 29 Nov 2011 14:09:25 -0800

errr, fail on my part: 2¹⁰⁰⁰ is the matrix size where the new method is twice as fast as the old best method.

By: TheJoven

TheJoven — Tue, 29 Nov 2011 14:17:17 -0800

Almost all engineering computational analysis is based on matrix math. This will make a huge difference on computational fluid dynamics as well as finite element analysis. On my desktop computer I do FEA analysis with 100,000 node systems. This will speed all of this up significantly, especially the supercomputer based systems.

By: RobotVoodooPower

RobotVoodooPower — Tue, 29 Nov 2011 14:18:02 -0800

I think jepler's right, and this comment seems to indicate that these recent improvements aren't breakthroughs but just ricing* the CW algorithm -- which no one uses in practice either. Mere mortals like you and I will keep on multiplying our matrixes with rusty tire irons, as always. *apologies for using 'rice' as a verb in a mathematical context, but it was funny

By: dephlogisticated

dephlogisticated — Tue, 29 Nov 2011 14:19:40 -0800

Meanwhile, the fastest known Al Gore rhythm clocks in at a tepid 60 BPM, and is frankly just embarrassing to listen to.

By: Jon_Evil

Jon_Evil — Tue, 29 Nov 2011 14:19:59 -0800

Computer science types: how big do ns get for things that we do, like encoding video, or transmitting game data on the internet? Moreover, maybe there's cool stuff that previously wasn't practical because it involved multiplying huge matrices, but now that it's easier to do, we can rely on it more for computing stuff.

By: homotopy

homotopy — Tue, 29 Nov 2011 14:27:10 -0800

This will speed all of this up significantly, especially the supercomputer based systems. No it won't. Coppersmith-Winograd (and this improvement) is never used in practice, because the matrices on which it performs better are so large that they can't be stored on modern computer hardware. These algorithms are of theoretical interest, and that is it. As NerdcoreRising mentioned right at the top of the thread, there are coefficients in front of the n^omega term, and they are HUGE.

By: sixohsix

sixohsix — Tue, 29 Nov 2011 14:35:27 -0800

Here's something embarrassing: my training as an engineer taught me that matrix multiplication was O(n^3) and I never had any reason to doubt it. Anyway, these algorithms may be impractical for most systems, but advancing new mathematical arguments that lead to theoretically-better algorithms can help solve practical problems in other domains. It's also good when work brings a problem closer to a theoretical bound. Mathematicians are "pretty sure" that the perfect algorithm is O(n^2) but they can't prove it. Similarly, mathematicians were "pretty sure" that x^n + y^n /= z^n for all n > 2 but it took a hundred years of work to prove it. The theorems discovered along the way were tremendously valuable. Math is not always about eureka moments. Incremental advances are worth noticing.

By: louigi

louigi — Tue, 29 Nov 2011 15:01:02 -0800

I wasn't expecting this. Even had I been, I would not have expected to learn about it on Metafilter, but on the arXiv. Thanks.

By: atrazine

atrazine — Tue, 29 Nov 2011 15:07:18 -0800

Here's something embarrassing: my training as an engineer taught me that matrix multiplication was O(n^3) and I never had any reason to doubt it. To be fair, the entire mathematical and CS establishment was quite sure of that until the 60's. I'm not sure if the actual algorithms used in practice are the naive O(n^3) ones, but they don't use these asymptotically fast ones. For multiplication of numbers on the other hand, the naive 'long multiplication' method is not used. Instead, modern maths libraries use methods based on fast fourier transforms.

By: ryanrs

ryanrs — Tue, 29 Nov 2011 15:10:49 -0800

RobotVoodooPower, you realize that's a racial slur, yes?

By: delmoi

delmoi — Tue, 29 Nov 2011 15:13:33 -0800

Does anyone use the better then O(n³) algorithms in practice?

By: Wolfdog

Wolfdog — Tue, 29 Nov 2011 15:19:18 -0800

Many undergraduates use a well-known O(n^2) algorithm, regarding the increased speed (and simplicity) as a favorable tradeoff for getting the wrong answer.

By: scose

scose — Tue, 29 Nov 2011 15:23:04 -0800

delmoi: Strassen's algorithm (the n^2.78 one) is faster IRL for large matrices, but it's less numerically stable.

By: fistynuts

fistynuts — Tue, 29 Nov 2011 15:40:03 -0800

it'll be fine, just throw more sand at it interesting post thanks .. having a fully geeked up day ... brilliant

By: formless

formless — Tue, 29 Nov 2011 15:57:24 -0800

but if someone could point out some of the practical applications (or even theoretical applications) for those of us who haven't a clue it'd be nice. In addition to the fluid dynamics and other apps others have mentioned, matrix and vector multiplication is also very important in many modern social algorithms. For example, apps like PageRank for computing relevancy in search. Recommender Systems like Amazon's and Netflix's use matrix operations to compute recommendations. Anytime you're dealing with large graphs, chances are you're going to boil it down to a matrix problem. And the web and social relationships on social networking sites are really just several giant graphs / networks.

By: Pyry

Pyry — Tue, 29 Nov 2011 16:08:57 -0800

Those tend to be very sparse matrices, though, which have their own specialized sets of algorithms.

By: monkeymadness

monkeymadness — Tue, 29 Nov 2011 16:11:25 -0800

I definitely understand why folks are wondering why this matters in practice, but to a Discrete Mathematician this sort of news is awesome. Algorithms have been around far longer than computers, and algorithmic complexity/asymptotics are really cool to theoreticians. There is some algorithm that does this in the fastest possible way for arbitrarily large matrices. We just haven't found it yet. This algorithm is closer than we've ever gotten, and that's cool.

By: Wolfdog

Wolfdog — Tue, 29 Nov 2011 16:15:03 -0800

There is some algorithm that does this in the fastest possible way for arbitrarily large matrices. Just in the interest of hair-splitting - in the spirit of the thread about a 0.001 improvement in an asymptotic constant - that's not necessarily true. There might be O(n^(2+e)) algorithms for every e>0 but no O(n^2) algorithm.

By: monkeymadness

monkeymadness — Tue, 29 Nov 2011 16:17:41 -0800

...and for a perspective on something similar, I was at a talk recently by Donald Knuth. Someone in attendance asked him if he thinks P=NP. He said he supposes the answer is `yes' but we'll probably never actually find a polynomial time algorithm for an NP-complete problem. We'll just be able to show one exists. It might not be practically useful, but it would be extremely awesome.

By: monkeymadness

monkeymadness — Tue, 29 Nov 2011 16:24:49 -0800

Wolfdog: I'm not sure. I mean, we're not looking for a smallest number strictly greater than x here. We're looking for an algorithm that is `the best'. If there is a requirement that an algorithm be expressible in, say, at most a billion kazillion words, then among all of those possible there is a best one. But, yeah, who knows? Now you have me wondering if for every e there could be an algorithm in O(n^(2+e)).

By: feloniousmonk

feloniousmonk — Tue, 29 Nov 2011 16:43:48 -0800

I don't know about large matrices being transmitted, Jon_Evil, but in terms of their use in computation, where something like this might theoretically be pretty useful, n will vary widely in size, but generally speaking, to a computer, a million is not a big number. I think something like this sort of improvement, which only provides benefit under certain circumstances, is less likely to have a direct impact on workloads on a home PC or gaming console. I am not a scientific computing guy, but this seems like something that would be useful in things like weather simulation.

By: ryanrs

ryanrs — Tue, 29 Nov 2011 16:44:28 -0800

you have me wondering if for every e there could be an algorithm in O(n^(2+e)) Yeah, bubble sort with strategically placed nops.

By: delmoi

delmoi — Tue, 29 Nov 2011 16:44:52 -0800

Just in the interest of hair-splitting - in the spirit of the thread about a 0.001 improvement in an asymptotic constant - that's not necessarily true. There might be O(n^(2+e)) algorithms for every e>0 but no O(n^2) algorithm.

Well, first of all monkeymadness was talking about hypothetical unknown algorithms. (actually the algorithm in the paper is just a re-application of the same algorithm 8 times, previously it was proven that re-applying it twice made it faster but unknown if more re-applications made it go faster. It would be impossible to practically measure it) The second thing is that we are talking about is 'arbitrary', that is the algorithm is faster on arbitrary matrices even if it's not faster on 'practical' matrices or 'matrices that could be encoded given all the mater in the universe'

By: sixohsix

sixohsix — Tue, 29 Nov 2011 16:53:10 -0800

> For multiplication of numbers on the other hand, the naive 'long multiplication' method is not used. Instead, modern maths libraries use methods based on fast fourier transforms. atrazine, I think you just blew my mind. I really should have taken some computational algorithm courses.

By: Canageek

Canageek — Tue, 29 Nov 2011 17:02:29 -0800

When I took linear algebra back in my first year of Uni my prof mentioned that he used to work for a bank on gigantic matrices that they had to do all sorts of diagonalizations and such to work with: Would this help that type of work? Also, isn't pagerank based on matrices? What about scientific software such as Guassian or other quantum mechanical simulations?

By: qxntpqbbbqxl

qxntpqbbbqxl — Tue, 29 Nov 2011 18:41:07 -0800

For multiplication of numbers on the other hand, the naive 'long multiplication' method is not used. Instead, modern maths libraries use methods based on fast fourier transforms. atrazine, I think you just blew my mind. I really should have taken some computational algorithm courses. My BS degree has "computational mathematical sciences" in its title, and I didn't know that. Neat.

By: delmoi

delmoi — Tue, 29 Nov 2011 18:45:55 -0800

My BS degree has "computational mathematical sciences" in its title, and I didn't know that. Neat.

Woah, me either. Of course for most mathematics you're going to use hardware, this stuff would only be used for arbitrary precision digits. Still interesting.

By: Pyry

Pyry — Tue, 29 Nov 2011 19:00:49 -0800

It's because multiplying two numbers is like a convolution in their digits, and the FFT allows large convolutions to be done efficiently.

By: indubitable

indubitable — Tue, 29 Nov 2011 19:03:14 -0800

Those tend to be very sparse matrices, though, which have their own specialized sets of algorithms. So are CFD and FEA systems, for that matter.

By: quanta and qualia

quanta and qualia — Tue, 29 Nov 2011 19:26:03 -0800

I'll wait for the quantum computers to arrive.

By: pwnguin

pwnguin — Tue, 29 Nov 2011 21:22:52 -0800

Pyry: "It's because multiplying two numbers is like a convolution in their digits, and the FFT allows large convolutions to be done efficiently" I never thought I'd have the opportunity to discuss this on mefi, but convolutions baffle me. Is there any intuitive notion of the concept, or have I simply reached my limit in mathematics?

By: Harvey Kilobit

Harvey Kilobit — Tue, 29 Nov 2011 21:24:46 -0800

The best remark I can imagine on this result involves reusing a quote from an interview with Donald Knuth:

A lot of the recent literature is academic one-upmanship of limited interest to me; authors these days often introduce arcane methods that outperform the simpler techniques only when the problem size exceeds the number of protons in the universe. Such algorithms could never be important in a real computer application. I read hundreds of such papers to see if they might contain nuggets for programmers, but most of them wind up getting short shrift.

By: twoleftfeet

twoleftfeet — Tue, 29 Nov 2011 21:45:03 -0800

For people joining the conversation late, I will attempt to summarize. I believe that "big O" refers to orgasms and that some nerds have figured out how to improve these. Let's get more people through the door! Free cocaine!

By: troublesome

troublesome — Tue, 29 Nov 2011 22:16:04 -0800

My immediate mental reaction upon reading this headline.

By: jepler

jepler — Wed, 30 Nov 2011 07:43:29 -0800

Computer science types: how big do ns get for things that we do, like encoding video, or transmitting game data on the internet?"

In 3D graphics, the most frequent matrix multiplication is with n=4, because points in 3d space are represented as (x,y,z,1) and the matrices that operate on them are 4x4. (Introducing a 4th value makes the 3d coordinates into homogeneous coordinates, which "are ubiquitous in computer graphics because they allow common operations such as translation, rotation, scaling and perspective projection to be implemented as matrix operations")

By: IAmBroom

IAmBroom — Wed, 30 Nov 2011 11:37:00 -0800

I never thought I'd have the opportunity to discuss this on mefi, but convolutions baffle me. Is there any intuitive notion of the concept, or have I simply reached my limit in mathematics? pwnguin, I'm an optical engineer, so my metaphor to make it "intuitive" will come from that field. Feel free to propose your own analogy... Imagine two pictures, printed on B&W slide film (that is, "white" areas are transparent and "black" areas are opaque). Place one picture on top of photographic paper, turn on the exposure light, and move the other picture across at a steady rate. Turn off the light, and develop the paper. The resulting "picture" is the convolution of the two images, left-to-right. A different convolution image would occur if you moved the second picture top-to-bottom. It should be obvious that right-to-left is the same as left-to-right, and fairly clear that picture 2 stable while picture 1 moves is no different (convolution is commutative). If picture two has a white line across its center, any white spots on the center of picture one will show up as white. Anything less (a broken white line, or only a white spot) will result in a less-than-perfectly-white spot on the convolution image (mathematically, you are representing infinity here - the integral of 1x1 over all infinity, where "infinity" = (left side -> right side)). Analogously, a black line across either picture will result all black spots on the other picture showing up as "true black", instead of dark gray: the integral of 0 x f(x) over all infinity = 0 (where f(x) is real). The animations atop this Wikipedia page demonstrate this, for a 1-D case.

By: wjzeng

wjzeng — Thu, 01 Dec 2011 02:56:28 -0800

Let's not forget that behind every theorem (or algorithm improvement) lies a human story. A check back at the original blog post contains an update and cautionary warning from Andrew Stother's experience. On the other hand, I warned Andrew that his LinkedIn profile, which unselfconsciously mentions improvements to his Word and Excel skills as one of the benefits of his PhD research breaching the Coppersmith-Winograd barrier, might have earned him a place in scientific folklore forever! - Scott Aaronson

By: Jpfed

Jpfed — Mon, 05 Dec 2011 21:33:11 -0800

I never thought I'd have the opportunity to discuss this on mefi, but convolutions baffle me. Is there any intuitive notion of the concept, or have I simply reached my limit in mathematics? posted by pwnguin at 23:22 on November 29 [+] [!] There are a couple intuitive ways you can approach convolution. But they both have the same math behind them. Let's say you're singing into a microphone in an anechoic chamber. There's the pressure waves coming out of your mouth; the pressure at the microphone as a result of those waves forms a function of time we'll call V(t) (for Voice). Ok, now let's have you produce the same pressure waves with your mouth, but in St. Peter's Basilica, which is a huge cathedral with crazy acoustics. There are all sorts of echoes and stuff. The microphone is getting a new pressure signal now, which we'll call B(t) (for Basilica). Now, we're getting a very fast stream of fluctuating pressure readings from that microphone, but there's a certain regularity to the signal. We can explain why the B(t) signal is the way it is. But we're going to need one more function to help us figure it out- a function that encodes the characteristics of the space. Let's say that, for where we're standing relative to the microphone inside the cathedral, there's a function E(t) that encodes the strength of the echoes. If we made a very short impulse at our mouth, a little "pop" sound, E(t) would have a spike at the beginning and decaying spikes later on as progressively more- and more-damped echoes reached the microphone. Ok, now we can explain the values of B(t). B(t) gets some direct contribution from V(t) - our voice going straight into the microphone. But it could also get some contribution from V(t-1) (some audio you emitted in the past) that took a millisecond to echo back. We know the strength of the echo at 1 ms; that's E(1). The total contribution of your echoed voice from 1ms ago is V(t-1)*E(1). But there will also be echoes from further in the past; an echo from 2ms ago would be V(t-2)*E(2), and 3ms ago would be V(t-3)*E(3), etc. To cover all the bases, we should sum up all of the echoes that could have occurred since the moment we started singing (t = 0). To write this concisely, I'm going to say that B(T) = sum t from 0 to T of V(T-t)*E(t). Well, looky there, the formula for convolution! Yet more concisely, B = V * E ----------------------------------- Another fun example of convolution would be finding the probability distribution of the sum of two independent random variables (e.g. die rolls). Let's say we have two independent random variables X and Y; we'll say that X(v) is the probability that X takes on the value v, and Y(v) is the probability that Y takes on the value v. What is the probability distribution S(v) where S = X + Y? In other words, for any given value v, what is the probability that X and Y sum to v? Well, for any given value v that S could take on, there are a bunch of scenarios that could have resulted in that value. Maybe X = v and Y = 0. Maybe X = (v-1) and Y = 1. Or X = (v-2) and Y = 2. And so on. Each of these scenarios is mutually exclusive, so we can just add up their probabilities to find the probability that S = v. Since X and Y are independent, the probability that X = v and Y = 0 is just X(v) * Y(0); the probabilities for the other scenarios can be computed analogously (just taking the product of their constituent probabilities). So we would express this as S(v) = X(v) * Y(0) + X(v-1) * Y(1) + X(v-2) * Y(2) ... etc. Equivalently, S(v) = sum of t from 0 to v of X(v-t) * Y(t) BAM! Convolution, baby.