Music for 18 CNN layers
May 9, 2017 9:58 PM   Subscribe

This is so good!
posted by iloveit at 10:43 PM on May 9, 2017

Very cool.

(But Steve Reich jumped the shark and has yet to land...)
posted by Joseph Gurl at 10:55 PM on May 9, 2017 [1 favorite]

Music for 18 musicians is my favorite song.

That's it. My favorite. Anything in any way related to it can only be elevated.
posted by fnerg at 11:11 PM on May 9, 2017 [1 favorite]

Thank you for sharing this! It's beautiful! Brings to mind a sort of post-digital reinterpretation of some of the visuals Baillie was working with in Castro Street: trains, defamiliarized. Here, mediated by CNN, there mediated by glass and silhouettes. It's doing something different than Castro Street, of course, but enough visual similarity for me to make the connection.
posted by Alterscape at 11:13 PM on May 9, 2017

Needs more Star Guitar.
posted by progosk at 11:45 PM on May 9, 2017 [1 favorite]

As Damien Henry says, it perfectly captures that feeling of half dozing on a train journey. I also like the detail that the train algorithm seems to have elected to depict movement at a constant speed. If we make some working assumptions about the relative speed of the nearby objects, then we can work out how far we have travelled during Reich's piece - about 75 miles I think.
posted by rongorongo at 12:35 AM on May 10, 2017

Ah, so train videos, not Train videos.
posted by RobotVoodooPower at 3:35 AM on May 10, 2017 [4 favorites]

Rather surprised he didn't use Different Trains instead.
posted by solarion at 3:38 AM on May 10, 2017 [3 favorites]

Damien Henry trained a neural network on train videos and used it to generate a Steve Reich video.

Trained it to do what?
posted by sour cream at 5:20 AM on May 11, 2017

Yes, quite. Recognise things, I guess, and then... uh... smear them into an overlay?

Something that recognised things as potential pattern elements and then matched them to music, like the Star Guitar video (which I love) - that would be an interesting and intriguing idea. But I don't quite get what's going on here.
posted by Devonian at 9:55 AM on May 11, 2017

Sorry, the description was a little skimpy. Here's how this works (I'm going to explain in an unusual order, so that hopefully the reader can bail once they have enough detail to satisfy themselves).

Imagine you have a black box with knobs on it. The black box can take one picture and print out a new one, modified in some way depending on exactly what the knobs are set to. This is the neural network.

Training the network on a video means separating the video into many, many pictures (frames). Feed the first frame into the machine and see what picture it prints; we treat that printout as a guess at what the next frame will be. Then, once for each knob, nudge that knob a little bit and feed the first frame in again, getting one printout per knob. Now, we actually have the real next frame from the original video, so we can figure out which nudges - say, to the 4th knob or the 7th knob - changed the output of the machine in a favorable way: a way that makes the predicted next frame closer to the real next frame. We call the relationship of the knobs to the observed changes the "derivative", and the most favorable combination of knob-changes the "gradient". Training the network on a video means doing this knob-nudging ("gradient-descending") step many times for many pairs of successive frames of video.
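If you like code, here's a toy sketch of that knob-nudging step. This is nothing like the real network's internals - the "black box" here is just one small table of knobs mixing a few pixel values together - but the nudge-and-compare logic is the same idea:

```python
import numpy as np

def predict(params, frame):
    # A stand-in black box: one table of knobs (a matrix) that mixes
    # the input pixels together to guess the next frame.
    return params @ frame

def nudge_step(params, frame, next_frame, eps=1e-4, lr=0.05):
    """One training step: nudge each knob, keep the favorable changes."""
    base_error = np.sum((predict(params, frame) - next_frame) ** 2)
    grad = np.zeros_like(params)
    for i in range(params.shape[0]):
        for j in range(params.shape[1]):
            nudged = params.copy()
            nudged[i, j] += eps  # nudge one knob a little bit
            error = np.sum((predict(nudged, frame) - next_frame) ** 2)
            # Did nudging this knob move the guess closer to the real frame?
            grad[i, j] = (error - base_error) / eps
    # Move every knob a little in the favorable direction.
    return params - lr * grad
```

Run that in a loop over every pair of successive frames and the guesses steadily get closer to the real next frames - that's the whole training process.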

"Trained it to do what?" - to predict, as accurately as possible, one frame of video given the previous frame.

In this project, the author trained the network using videos shot from trains. Then, they took the first frame from one such video and used the trained network to guess what the next frame would be. Then, they took that guess as if it were the second frame and fed it back into the network to make a guess for the third frame, and so on until they had enough frames of video for the song they wanted to set the video to (the song itself was not used during the training or generation processes).
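That feed-the-guess-back-in loop is easy to sketch (here predict_next is any stand-in for the trained network, not Damien Henry's actual model):

```python
def generate_video(predict_next, first_frame, n_frames):
    """Feed each guessed frame back in to guess the frame after it."""
    frames = [first_frame]
    while len(frames) < n_frames:
        frames.append(predict_next(frames[-1]))
    return frames
```

You just pick n_frames to be the song's length times the frame rate; the music never enters the loop at all.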

Okay, so now that you have a notion of what training does... what's inside the box? Well, each image is composed of pixels, and each pixel is composed of color components (such as red, green, and blue, though compressed formats often use something else, like lightness, redness minus greenness, and blueness minus yellowness). Each color component of each pixel of each frame is represented by a number. The black box is there to multiply and add those numbers around with a bunch of tables of numbers. Each number in those tables is a knob ("parameter") that can be affected by training.
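Concretely, "an image is just numbers" looks like this (a made-up 2x2 image, three color components per pixel - far smaller than anything the real project used):

```python
import numpy as np

# A tiny 2x2 image: height x width x (red, green, blue), values 0-255.
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.float64)

# Flatten it into one long list of numbers - this is what the
# black box actually works on.
pixels = image.reshape(-1)         # 12 numbers

# "Multiply and add those numbers around with a table of numbers":
# here the table is the identity, so it just copies input to output,
# but every entry in it is a trainable knob.
table = np.eye(12)
offsets = np.zeros(12)
output = table @ pixels + offsets
```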

Now, there are some relationships between input and output that are consistent no matter where in the image you are. In the linked example, there's consistent motion from right to left, and that is true no matter where or when you are in the video. It would be a shame to have to represent this information with separate parameters for every (right,left) pair of pixels! That would be waaay too many parameters to have to find good values for. So instead, we cheat and have some parameters that represent relationships between all pairs of pixels that are separated by a common direction and distance. The action of this kind of parameter is called "convolution" (in the title, that's the C in "CNN layers"). This drastically cuts down on the number of parameters that need to be learned, which makes the whole learning process achieve good results faster.
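To get a feel for how big that saving is, count knobs for a hypothetical 64x64 one-channel frame (numbers picked just for illustration):

```python
h, w = 64, 64

# Dense: one knob for every (input pixel, output pixel) pair.
dense_params = (h * w) ** 2    # 16,777,216 knobs

# Convolution: one small shared table of knobs, say 3x3, applied at
# every position in the image (the same "shift things left" rule
# holds everywhere).
conv_params = 3 * 3            # 9 knobs

print(dense_params // conv_params)   # sharing cuts knobs nearly two-million-fold
```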

Now, this fixed relationship between neighboring pixels is only a simplified (not perfectly accurate) representation of how things move in naturally-recorded video, which is why instead of looking like a perfectly-reproduced video it gets all smudgy.
posted by Jpfed at 11:07 AM on May 11, 2017 [5 favorites]

Cool, thanks for the explanation!
posted by sour cream at 4:54 AM on May 12, 2017

