Man in black shirt is playing guitar
November 23, 2014 10:53 AM   Subscribe

Deep Visual-Semantic Alignments for Generating Image Descriptions. A model that generates free-form natural language descriptions of image regions. Holy crap.
posted by signal (36 comments total) 28 users marked this as a favorite
 
He's really tuning it, isn't he? Also that shirt may be dark blue.
posted by thelonius at 11:18 AM on November 23, 2014 [1 favorite]


Yeah, but: these were done automatically? That's incredible, far beyond what I'd thought image processing was capable of.
posted by valrus at 11:22 AM on November 23, 2014 [1 favorite]



Code
Coming soon
Our Full Predictions
Coming soon
Region Annotations
Coming soon
See many more examples on our demo page. [Coming soon]


Just sayin...

(Seriously, I saw a talk on the state of the art of this stuff [from different researchers] a few weeks ago and, well, it's come a long way in those three or so weeks.)
posted by advil at 11:23 AM on November 23, 2014 [1 favorite]


valrus: Yes, they were. This is the real thing. Was wondering when it would show up on mefi. Here's the nytimes article.

(So, advil, it's only been more like two weeks.)
posted by tss at 11:29 AM on November 23, 2014


I wonder what their starting salary at google will be.
posted by el io at 11:32 AM on November 23, 2014 [1 favorite]


We are now officially living in the future.
posted by Mr. Justice at 11:47 AM on November 23, 2014


The applications of this are amazing. For example, automatic timeline annotations in a video editor, and a major leap in accessibility for the web (or even the real world) for people with vision impairment.
posted by idiopath at 11:50 AM on November 23, 2014 [1 favorite]




We are now officially living in the future.

"Alan Johnson resisting arrest at pipeline protest."
posted by Dr. Send at 11:58 AM on November 23, 2014 [2 favorites]


This is the real thing.

And how, exactly, do you know that without the code, non-cherry-picked examples, or a live demo? The actual numbers in the paper showed incremental improvement on basically their own tasks against five other models: two of which were labeled as "our own" model, two of which come out of the same lab with one or both of the authors as a co-author despite the lack of such a label, and one of which was from the Google team also in the NYTimes article.

Look, I actually think this work is good, and deep neural nets are magical*, but the PR spin this is getting is really inaccurate relative to what they did (and also relative to what quite a few other, unmentioned, research groups are doing). For example, this system most certainly does not produce "free-form natural language descriptions"; it is likely to be highly constrained in the kinds of images it can handle; the performance is nowhere near human (and arguably their evaluation metric is not great for evaluating against human performance, something they acknowledge); etc. This is basically all acknowledged in the paper, especially section 4.3, though you may need to know a bit about statistical natural language processing / computational vision to decode precisely what they are saying there. E.g. "A more sensible approach might be to use multiple saccades around the image to identify all entities, their mutual interactions and wider context before generating a description." ... yes, that would be more sensible, but it's also the part of the problem that is hard.

* I don't actually believe that but many people do.
posted by advil at 12:06 PM on November 23, 2014 [8 favorites]


advil: valrus asked if the captions were done automatically. I said yes.

I'm not claiming that this solves computer vision, but I don't think "near human" performance is necessary for this work to be significant. Outcomes shown here on only 10% of the input images would still be remarkable.

I have some additional familiarity with this line of work that I'm not disclosing here, but "the real thing" simply refers to the fact that images go in, sentences come out automatically.
posted by tss at 12:20 PM on November 23, 2014 [1 favorite]


Also that shirt may be dark blue.

Skynet doesn't need to know whether you shop at L.L. Bean or American Apparel in order to mark you as a target for termination.
posted by coolxcool=rad at 12:24 PM on November 23, 2014 [2 favorites]


This is inevitable and incredibly disquieting.
posted by johnnydummkopf at 12:24 PM on November 23, 2014 [1 favorite]


There is some serious nit-picking going on in here, which I appreciate and it's one of the reasons I love MeFi. But seriously, for those of us who are tech-savvy but not in the industry, this looks pretty amazing. I had no idea the state of the art was this far along.
posted by The Bellman at 12:30 PM on November 23, 2014


coolxcool=rad: Skynet doesn't need to know whether you shop at L.L. Bean or American Apparel in order to mark you as a target for termination.

People who shop at the Banana Republic, however, are automatically targeted for termination.
posted by surazal at 12:37 PM on November 23, 2014 [3 favorites]


I'm in the midst of finalizing a taxonomy before I begin to upload and tag thousands of images into a digital asset management system. To have a tool like this that would generate the descriptions of images on the fly, even if only 80% accurate, would be an incredible time saver.
posted by furtive at 12:41 PM on November 23, 2014


"Boot is stamping on human face forever." One suspects the real value of this is that you can combine it with facial recognition, and then lay off half of the stormtroopers you hired to screen all those surveillance videos. Think of the savings...
posted by Sing Or Swim at 12:42 PM on November 23, 2014 [9 favorites]


Anything like this will have error rates and bloopers which may or may not make it useless except as a starting point for humans to fill in. Even a 1% error rate can be horrible. Still, if computers can handle 90% of the easy stuff it makes the human job easier. A few successful captions don't say much about how well it does with thousands of real-world pictures. It may be fixable with crowdsourcing, the same way Google Translate works.
posted by stbalbach at 12:56 PM on November 23, 2014


While I was driving to work the other day I was thinking about google-car and google-glass and the sorts of things vehicles need to discriminate and identify. I'm not quite sure what, exactly, I was thinking, but having a strong ontology database as a reference was one thing that popped into my head. Noticing that these sentences are very clear in their structure Adjectived Noun Verbs Adverbedly Adjectived Noun... I can see how the database could plug objects and context into those sentence forms.

It seemed way more profound, thinking of the ontological web and its evolution... I think AI has to come out of the network and be embodied in the real world. I was expressing these thoughts much better on my way to work that morning. Not so much tonight. I'm not high (yet).
posted by symbioid at 1:02 PM on November 23, 2014 [1 favorite]
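The template idea symbioid describes can be sketched as slot-filling: detected objects and their context get plugged into a fixed sentence frame. This is purely illustrative (the function and vocabulary are invented, not from the paper):

```python
# Toy slot-filling caption generator for the frame
# "Adjectived Noun Verbs Adverbedly Adjectived Noun".
# All names and detector outputs here are invented for illustration.

def caption(subject_adj, subject, verb, adverb, object_adj, obj):
    """Fill the sentence frame with detected objects/attributes."""
    sentence = f"{subject_adj} {subject} {verb} {adverb} {object_adj} {obj}"
    return sentence.capitalize()

# Pretend these slots came from an object detector plus an ontology lookup:
print(caption("black-shirted", "man", "plays", "intently", "acoustic", "guitar"))
# Black-shirted man plays intently acoustic guitar
```

Real caption models (including the one in the post) generate words with a learned language model rather than rigid templates, which is part of why their output looks more "free-form".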


Also - yes - how much do these researchers worry about inhumane use of this technology. I would imagine they have to be concerned somewhat with the ramification. But clearly it's also very fascinating and fun to work on as an area of knowledge. There's nothing inherently wrong or evil about this, but rather, the people and systems who will use this, alas.
posted by symbioid at 1:03 PM on November 23, 2014


Adjectived Noun Verbs Adverbedly Adjectived Noun

Thanks for the sockpuppet name!
posted by Jpfed at 1:04 PM on November 23, 2014


Today, this is vaporware.
Eventually, it will kill us all.
posted by MeanwhileBackAtTheRanch at 2:38 PM on November 23, 2014 [1 favorite]


Real or not, several of those descriptions are wrong. It seems to see wakeboards and swings where there are none and cannot detect the colour of cats.
posted by Sys Rq at 2:41 PM on November 23, 2014 [2 favorites]


What Sys Rq said. I suspect that their network is actually matching details to user-descriptions of other photos, and it doesn't "know" what color the cat is: it just responds to a cat-thing on a square-thing, and it assembles its description from a database that happened to have a lot of strings that said something like "black cat sitting on a ...."

Similarly, the "boy is doing backflip on wakeboard" was presumably composed by matching things like "person tumbling", "flat shape at bottom", and "water at horizon". In this case I think it's someone tumbling on a trampoline in front of a lake.
posted by Joe in Australia at 3:02 PM on November 23, 2014
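The retrieval behaviour Joe in Australia suspects can be sketched as a toy nearest-neighbour lookup: match an image's detected "details" against the details of previously captioned photos and reuse the closest caption. All data below is made up, and this is a hypothesis about the failure mode, not the paper's actual architecture:

```python
# Toy caption-retrieval sketch: reuse the stored caption whose detail
# set overlaps most with the query image's details (Jaccard similarity).
# The database contents and detail vocabulary are invented.

database = [
    ({"cat", "box", "indoors"}, "black cat sitting on a cardboard box"),
    ({"person", "tumbling", "water"}, "boy is doing backflip on wakeboard"),
    ({"dog", "grass", "ball"}, "dog playing fetch in a park"),
]

def retrieve_caption(details):
    """Return the stored caption with the highest Jaccard overlap."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(database, key=lambda entry: jaccard(details, entry[0]))[1]

# A trampoline-by-the-lake scene shares "person tumbling near water",
# so it gets the wakeboard caption -- a plausible miss, as described.
print(retrieve_caption({"person", "tumbling", "trampoline", "water"}))
# boy is doing backflip on wakeboard
```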


Sure, a full half of the descriptions are wrong, but (at least from my point of view) that's really cool! The errors, more so than the correct descriptions, highlight just how amazing this is. I mean, brown blob with blue stripes overhead and tan splotches arrayed underneath... is a boy? Upside down! So he's doing a trick. In front of water. So, sure, wakeboarding. It's easy to overlook how much perceptual processing we take utterly for granted, and the fact that pink-blob at an angle + triangular frame-type thing = girl on a swing is pretty astonishing pattern matching for an artificial system.
posted by not the fingers, not the fingers at 3:17 PM on November 23, 2014


symbioid: I'm not quite sure what, exactly I was thinking, but having a strong ontology database as a reference was one thing that popped into my head.
I've been nibbling around the edges of the problem of The Ontology of Everything for 20 years. I definitely agree that having an ontology that could meaningfully describe all the articles in, say, The Concise Columbia Encyclopedia, would be a huge step forward in having useful artificial intelligence.

Instead, while brilliant, what we have is FreeBase, because movies, music and celebrities are most of what people search for. Yes, much of Wikipedia is there, but that still mostly covers major people, places, and things. Meanwhile, despite the efforts of the IASB and the XBRL consortium, there's not even an ontology that usefully describes the concepts embodied in ERP software.
posted by ob1quixote at 3:52 PM on November 23, 2014


"Alan Johnson resisting arrest at pipeline protest."

It wasn't my fault that Buttle's heart condition didn't appear on Tuttle's Google+ site...
posted by a lungful of dragon at 4:16 PM on November 23, 2014 [4 favorites]


I wonder how much Toys 'R' Us would pay for a database listing, e.g., "everyone who has appeared in the same photo as an infant in the last month"? And I wonder how much the DHC would pay for a database listing "everyone who has appeared in the same photo as an assault rifle, ever"?
posted by Joe in Australia at 5:34 PM on November 23, 2014 [1 favorite]



Yeah, but: these were done automatically? That's incredible, far beyond what I'd thought image processing was capable of.



It seems like it might be based on a large sample of humans describing images under a very strict set of instructions, or identifying particular objects in an image. It might be able to do it in a way that feels automatic, but it requires a lot of human data in order to teach it how to do it that way.
posted by louche mustachio at 6:49 PM on November 23, 2014


One of the defining features of neural networks is that their knowledge is "learned" rather than programmed. So yes, the model almost certainly had to be trained first using a large data set of input and correct output.
posted by dephlogisticated at 7:14 PM on November 23, 2014


I suspect that their network is actually matching details to user-descriptions of other photos, and it doesn't "know" what color the cat is: it just responds to a cat-thing on a square-thing...

I've picked up some Japanese words just by having watched enough subtitled videos, and whatever algorithm my mind has for learning things feels like what you describe the neural network as doing -- pattern-match repeated correlations of sound X with situation Y, and conclude that "X" means Y. Human understanding is deeper than state-of-the-art computers' only because we are simultaneously pattern-matching situation Y with related situations, other humans' reactions to the situation, our emotions while experiencing the situation.... Basically our concept maps are a lot more detailed, but in principle I think an artificial neural network could achieve human-level understanding of a subject -- although whether it could also have "subjective experience" (qualia) is up in the air, since nobody even knows what qualia are.
posted by Rangi at 7:41 PM on November 23, 2014 [2 favorites]


Ignoring the creative ways this program could be abused, this is soooooo cool. I like to try to be organized with the images I save but it's pretty much impossible to put all the information I'd like into the filename. Taxonomy was brought up by furtive, and along with that for myself to give a very specific example: If I had this system I would use it to categorize the photographs I have saved from fan photographers of Korean pop groups. "141123 Z Idol performing the song "Y" at X City Autumn Festival by fansite W." That's the information that rolls through my head when I see any (fansite) photograph, not to mention the side information that happened during x event (groups collaboration, wardrobe/stage malfunction, attempted abduction, bug invasion, fireworks, etc). But. Omg. *___*

Honestly, I may regret posting this ridiculous example. But whatever because wow! TECHNOLOGY!
posted by one teak forest at 9:07 PM on November 23, 2014


"Boot Rubber-coated steel leg pod from bigdog robot is stamping on human face forever."
posted by mcrandello at 1:39 AM on November 25, 2014


How long before blind people can wander around with a Google Glass providing audio narration of the view?

"There is a field with a line of trees on the left. To the right is a man in a black shirt playing a guitar".

If you could map different kinds of objects to musical instruments, you could process the contents of the view into a kind of soundscape, where the line of trees would be represented by a distant row of trumpeters, and the man playing a guitar by a man playing a guitar.
posted by emilyw at 2:56 AM on November 25, 2014
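emilyw's soundscape idea boils down to a mapping from detected object classes to sound cues, positioned by where the object sits in the view. A toy sketch (the class names, instrument choices, and bearing convention are all invented):

```python
# Toy object-to-instrument mapping for an audio "soundscape" of a scene.
# Classes, instruments, and the bearing convention are invented.

INSTRUMENT_FOR = {
    "tree": "trumpet",
    "man playing guitar": "guitar",
    "water": "cymbal wash",
}

def soundscape(detections):
    """detections: list of (object_class, bearing_in_degrees) pairs.
    Returns (sound, bearing) cues; unknown classes get a soft drone."""
    return [(INSTRUMENT_FOR.get(obj, "soft drone"), bearing)
            for obj, bearing in detections]

print(soundscape([("tree", -40), ("man playing guitar", 30)]))
# [('trumpet', -40), ('guitar', 30)]
```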


Surely blind people would already know that someone is playing a guitar?
posted by Joe in Australia at 3:11 AM on November 25, 2014 [1 favorite]


I actually worked on something like this back in grad school. Our approach was a bit different; we were using massive self-organizing maps instead of deep neural nets, the rationale being that images with similar visual content would also have similar semantic content. It would probably never work out so well as the model in the link, and anyway we were bumping up against some serious hardware limitations. This is back when a 100gb hard drive was still kind of a big deal. We backed up data on Zip disks, ffs.

But, anyway, it was a multi-university collaboration, including a couple folks from Brown, one from UC Irvine, and, for some reason, a group located at some university in Hungary. So at one point we hosted one of the researchers from the Hungary group locally, and he basically rewrote our code from scratch as soon as he got here. We went from getting, like, 1% accuracy to 5%, which is obviously pretty crappy compared to the results in the link, but given the limitations we were working under, it was pretty impressive. Even when the model missed, it missed plausibly. If we counted those plausible misses as hits, we were probably up around 20%. We were pretty enthusiastic, and my graduate advisor at the time was already spending the money from getting the grant renewed.

The thing is, sometimes when it missed, it really missed. Things like mistaking a Christmas tree for "Man is holding a knife" or a Buick Skylark for "Children are standing in a circle looking up." Part of my project was to figure out why the model was making these really odd choices - demonstrating the model to our funding agency when it was liable to off-the-cuff decide that a flying kite was "Meteor is hitting a city" spelled almost certain doom for a renewal. After the Hungarian researcher had left, I was the only one with the time to go through the code to find the problem, and naturally it was all commented in Hungarian (where it was commented at all), so this involved a bunch of late nights in the lab trying to figure out what the code was doing, and it put a huge strain on what was already a fairly tenuous relationship.

One night I was in the lab at like 2AM, feeling pretty excited because I thought I had it nailed; the plausible accuracy was around 30%, actual accuracy around 7%, and nothing weird had cropped up yet. My girlfriend had called around 10, sounding sad and asking me to come over, but there wasn't really any way I could drop what I was doing when I felt this close. Around 2:30 I had exhausted the image dataset we stored on the servers, so I tried testing it on live video through the webcam. "Man is sitting at computer." "Man is playing guitar." (I kept a guitar in the lab in order to have something to do when the code was compiling.) And then: "Woman is taking a bath." "Woman is bleeding." "Woman is staring at ceiling." "Woman is staring at ceiling."

I switched labs a couple weeks later. My former advisor kept trying to get the model to work right, but I don't think he had much luck. When he died a couple years ago, they found him in front of his computer, an empty bottle of scotch next to him, and the words "Man is sleeping" on the computer monitor.
posted by logicpunk at 1:10 PM on November 25, 2014 [4 favorites]




This thread has been archived and is closed to new comments