The Real Black Box Was the Friends We Made Along the Way
October 12, 2021 9:39 AM

A New Link to an Old Model Could Crack the Mystery of Deep Learning. A number of researchers are showing that idealized versions of artificial neural networks (ANNs) are mathematically equivalent to older, simpler machine learning models called kernel machines. If this equivalence can be extended beyond idealized neural networks, it may explain how practical ANNs achieve their astonishing results.

A little intro and summary for those who might appreciate it:
  • Parametric and non-parametric models are two ways of describing data.
  • Parametric models describe the distribution of data by relying on a bunch of numbers (parameters), where each number tends to correspond to some feature of the data (e.g. its average or spread). These numbers are adjusted to best match the data using numerical procedures like gradient descent.
  • Non-parametric models tend to separate data into classes by blowing it up into a super high-dimensional space, and dividing it up with hyperplanes. Defining these hyperplanes does not in general require numerical methods.
  • Although neural networks are parametric (defined by tons of numbers), and require sophisticated numerical methods to optimize, it turns out they may have more in common with non-parametric methods.
  • This helps explain why the parameters of neural networks are so hard to interpret: they were never really parameters in the probabilistic sense, but rather just a means to approximate a non-parametric object.
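A minimal sketch of the contrast, for the curious (nothing here is from the article; the toy data, the RBF kernel, and every constant are illustrative): a parametric line fit by gradient descent, versus a non-parametric kernel method that fits the same data by solving one linear system over the training points rather than iterating on parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40)
y = np.sin(X) + 0.1 * rng.standard_normal(40)   # noisy samples of sin(x)

# Parametric: fit y ~ w*x + b by gradient descent on mean squared error.
# The two parameters are directly interpretable (slope, intercept).
w, b = 0.0, 0.0
for _ in range(2000):
    resid = w * X + b - y
    w -= 0.05 * 2 * np.mean(resid * X)
    b -= 0.05 * 2 * np.mean(resid)

# Non-parametric: kernel ridge regression with an RBF kernel. No iterative
# optimization -- the model is a weighted sum of kernel bumps centered on
# the training points, with weights found by one linear solve.
def rbf(A, B, gamma=1.0):
    d = A[:, None] - B[None, :]
    return np.exp(-gamma * d ** 2)

K = rbf(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)
kernel_pred = K @ alpha
```

The kernel model has no "parameters" describing the data in the probabilistic sense, only one coefficient per training point, which is exactly the flavor of object the article says idealized neural nets secretly are.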
posted by Alex404 (13 comments total) 32 users marked this as a favorite
 
An interesting theory... but as usual one that needs practical demonstration. Certainly the existing trend of enormous networks isn't sustainable in the long run. Something like GPT-3 or its inevitable successor is far too cumbrous to exist as anything other than a hosted service, and it seems to be a universal goal to bring the user-facing process (i.e. not the training and dataset) to edge devices.

We're definitely in an era of rapid change in these technologies and it would be silly to hold onto any particular approach as anything but today's best guess at how it all works. That's a good thing, since it means there's huge room for improvement and also efficiency gains that put powerful tools in reach of local machines, where companies like Google etc have little or no agency.
posted by BlackLeotardFront at 11:16 AM on October 12 [1 favorite]


as usual one that needs practical demonstration

Well, it's only been proved for infinitely-wide neural nets, so it's not practical even in theory yet. Actually proving it for finite neural nets will probably be hard.
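One face of that infinite-width limit can be glimpsed numerically without training anything (a rough sketch, not the full neural-tangent-kernel result; the vectors and the width are arbitrary): inner products of random ReLU features converge, as the layer gets wider, to a fixed closed-form kernel, the degree-1 arc-cosine kernel of Cho & Saul.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 3, 200_000

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.6, 0.8, 0.0])   # both unit vectors, angle arccos(0.6)

# One very wide hidden layer of ReLU features with Gaussian weights.
W = rng.standard_normal((width, d))
phi1 = np.maximum(W @ x1, 0.0)
phi2 = np.maximum(W @ x2, 0.0)
empirical = phi1 @ phi2 / width   # empirical kernel value

# Closed-form infinite-width limit (arc-cosine kernel, degree 1).
theta = np.arccos(np.clip(x1 @ x2, -1.0, 1.0))
limit = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
```

At this width the empirical value sits within a couple of percent of the closed form; the "proved for infinitely-wide nets" results work in that limit, where the random kernel stops fluctuating at all.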
posted by BungaDunga at 11:52 AM on October 12


Wait, proving a finite is harder than infinite?
posted by sammyo at 1:43 PM on October 12


Throw in some modulo and you might be surprised... Finite Fields & Return of The Parker Square - Numberphile - YouTube.
posted by zengargoyle at 2:03 PM on October 12


Wait, proving a finite is harder than infinite?

That's my vague impression from reading about a number of equations used in various disciplines. Often the infinite version can be mapped to an equation that's solvable with calculus, giving you a nice simple tidy equation as your end result to describe the infinite version. The finite version, by contrast, gets you into all sorts of complications.

To get the nice smooth curves of the Hardy-Weinberg model in biology, for example, you have to assume infinite population size. The smaller your population, the less smooth and predictable the outcome. In economics, I believe that perfect competition is a lot easier to do math for if part of the perfection is an infinitely large number of producers and consumers. In most engineering, the discrete particles of the subatomic world are completely ignored because assuming infinitely divisible continuous materials makes the math (and the proofs) much easier.
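The Hardy-Weinberg example is easy to poke at numerically. A toy sketch (a standard Wright-Fisher resampling model; the population sizes, horizon, and seed are all arbitrary) showing the infinite-population frequency staying flat while finite populations drift, and drift more the smaller they are:

```python
import numpy as np

rng = np.random.default_rng(1)
p0, generations = 0.5, 200

# Infinite population: the allele frequency is a constant (Hardy-Weinberg).
infinite = np.full(generations, p0)

# Finite population (Wright-Fisher): each generation redraws 2N alleles
# binomially, so the frequency random-walks and eventually fixes at 0 or 1.
def wright_fisher(N, p0, generations, rng):
    p, traj = p0, []
    for _ in range(generations):
        p = rng.binomial(2 * N, p) / (2 * N)
        traj.append(p)
    return np.array(traj)

small = wright_fisher(20, p0, generations, rng)       # very noisy
large = wright_fisher(20_000, p0, generations, rng)   # close to the ideal curve
```

The infinite case is one line of algebra; the finite case is a stochastic process you mostly have to simulate or bound, which is the general pattern being described.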
posted by clawsoon at 2:34 PM on October 12 [4 favorites]


It's a curious point because even with a finite-sized neural net, you have parameters of arbitrary precision or magnitude, and isn't that part of why neural nets can be Turing complete? And Turing machines have an infinite tape.

A different thing I noticed from the article was the researchers refer to "reduction":

During training, the evolution of the function represented by the infinite-width neural network matches the evolution of the function represented by the kernel machine

But this is reducing a kernel machine to a neural net, not (as the author wrote a couple of times) the other way around.
posted by polymodus at 3:11 PM on October 12


Wait, proving a finite is harder than infinite?

Often, yes. A physics professor of mine once told us "Tables with one leg are trivial, tables with two legs are easy, tables with three legs are very hard, and tables with four or more legs are often impossible -- but tables with infinitely many legs are usually easy." I'm not really sure why she framed it in terms of table legs, but despite that confusing metaphor I've generally found it to be true.
posted by biogeo at 4:43 PM on October 12 [7 favorites]


It's pretty trivial to prove that infinitely many colors are sufficient to color a map, as one obvious example.
posted by NMcCoy at 8:41 PM on October 12


Wait, proving a finite is harder than infinite?

Infinity is a liquid: fills the volume (roughly) it is set to occupy.
posted by JoeXIII007 at 9:04 PM on October 12 [4 favorites]


Infinity is a liquid: fills the volume (roughly) it is set to occupy.

Infinity is pudding; the finite adds raisins.
posted by clawsoon at 11:54 PM on October 12 [3 favorites]


There appears to be a relationship between solution sparsity and training effort - RVM > SVM > old-school ANN with back-propagation > RBM, etc. - that reflects something fundamental about the nature of information. In principle the raw data alone tells you everything; it's the process of condensing what's relevant that's computationally expensive.
posted by memetoclast at 8:29 PM on October 15


DeepMind aims to marry deep learning and classic algorithms
Algorithms are a really good example of something we all use every day, Blundell noted. In fact, he added, there aren’t many algorithms out there. If you look at standard computer science textbooks, there’s maybe 50 or 60 algorithms that you learn as an undergraduate. And everything people use to connect over the internet, for example, is using just a subset of those.
DeepMind is developing one algorithm to rule them all
There’s algorithms ‘left, right and center inside the reinforcement learning pipeline,’ as Veličković put it.

Blundell on his part reiterated that there aren’t that many algorithms out there. So the question is, can we learn all of them? If you can have a single network that is able to execute any one of the algorithms that you know already, then if you get that network to plug those algorithms together, you start to form really quite complicated processing pipelines, or programs. And if it’s all done with gradients going through it, then you start to learn programs:
“If you really take this to its limit, then you start to really learn algorithms that learn. That becomes very interesting because one of the limitations in deep learning is the algorithms that we have to learn. There hasn’t been that much change in the best optimizers we use or how we update the weights in a neural network during training for quite a long time.

There’s been a little research over different architectures and so forth. But they haven’t always found the next breakthrough. The question is, is this a different way of looking at that, where we can start to find new learning algorithms?

Learning algorithms are just algorithms, and maybe what’s missing from them is this whole basis that we have for other algorithms that we’re using. So we need a slightly more universal algorithm executor to use as the basis for better methods for ML.”
Deac also noted she would like to pursue a network which tries multiple algorithms — all algorithms, if possible. She and some of her MILA colleagues have taken some steps in that direction. They are doing some transfer learning, chaining a couple of algorithms together and seeing if they can transfer between one algorithm, making it easier to learn a separate related algorithm, she said.
posted by kliuless at 2:48 AM on October 18 [1 favorite]


Latest Neural Nets Solve World's Hardest Equations Faster Than Ever Before - "Two new approaches allow deep neural networks to solve entire families of partial differential equations, making it easier to model complicated systems and to do so orders of magnitude faster."
Last year, Anandkumar and her colleagues at Caltech and Purdue University built a deep neural network, called the Fourier neural operator (FNO), with a different architecture that they claim is faster. Their network also maps functions to functions, from infinite-dimensional space to infinite-dimensional space, and they tested their neural net on PDEs. “We chose PDEs because PDEs are immediate examples where you go from functions to functions,” said Kamyar Azizzadenesheli of Purdue.

At the heart of their solution is something called a Fourier layer. Basically, before they push their training data through a single layer of a neural network, they subject it to a Fourier transform; then when the layer has processed that data via a linear operation, they perform an inverse Fourier transform, converting it back to the original format. (This transform is a well-known mathematical operation that decomposes a continuous function into multiple sinusoidal functions.) The entire neural network is made of a handful of such Fourier layers.

This process turns out to be much more computationally straightforward than DeepONet’s and is akin to solving a PDE by performing a hairy mathematical operation called a convolution between the PDE and some other function. But in the Fourier domain, a convolution involves a simple multiplication, which is equivalent to passing the Fourier-transformed data through one layer of artificial neurons (with the exact weights learned during training) and then doing the inverse Fourier transform. So, again, the end result is that the FNO learns the operator for an entire family of PDEs, mapping functions to functions.
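The "convolution becomes multiplication" step is easy to check numerically. A toy sketch of a single Fourier layer (the weights here are random rather than learned, and a real FNO also truncates to a few low-frequency modes and adds a pointwise nonlinearity, both omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)   # a discretized input function
w = rng.standard_normal(n)   # a stand-in for a learned filter

# Fourier layer in one line: transform, multiply by weights in the
# frequency domain, transform back.
def fourier_layer(x, w_hat):
    return np.fft.ifft(np.fft.fft(x) * w_hat).real

w_hat = np.fft.fft(w)
y = fourier_layer(x, w_hat)

# Convolution theorem: the same result as the "hairy" explicit circular
# convolution done point by point in the original domain.
circ = np.array([sum(x[j] * w[(i - j) % n] for j in range(n))
                 for i in range(n)])
```

The pointwise multiply costs O(n) (plus O(n log n) for the FFTs) where the explicit convolution costs O(n²), which is where the speedup in the article comes from.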

“It’s a very neat architecture,” said Mishra.

It also provides solutions at dramatically improved speeds. In one relatively simple example that required 30,000 simulations, involving solutions of the infamous Navier-Stokes equation, the FNO took fractions of a second for each simulation (comparable to DeepONet’s speed, had it been tested on this problem), for a total of 2.5 seconds; the traditional solver in this case would have taken 18 hours... From a computational perspective, however, there’s more good news. Mishra’s team has shown that the new techniques don’t suffer from the curse of dimensionality.
also btw...
How Wavelets Allow Researchers to Transform, and Understand, Data - "Built upon the ubiquitous Fourier transform, the mathematical tools known as wavelets allow unprecedented analysis and understanding of continuous signals."
posted by kliuless at 10:32 PM on October 18 [1 favorite]



