

When we look at photographs, we look for patterns and objects. We identify a photograph that is 10% brown and 90% green as a brown horse in a grassy field, so when searching for similar images we would not be confused by a photograph of a green river in which fallen brown tree branches make up 10% of the frame. But general-purpose VSEs could rate the horse in the field and the branches in the river as very similar, because they treat the image as one big undifferentiated group of RGB values. (See also: this Corbis search intro [pdf] and these Wired articles.)
An object-based VSE like eVision first tries to identify the objects in an image before doing a comparative search. While it can't attribute the meaning "horse" to the brown object, it can say that the photo is composed of two distinct regions: a brown one with a particular shape and a green background. Then it runs visual comparisons against other images based on these regions.
For example, given the photo of a horse in a green field and a color similarity search, a general-purpose VSE would say "This photo has 90% green in it and 10% brown, so find photos that have this same proportion of colors." eVision would say "This photo has two objects in it; one object is 100% green and the other is 100% brown, so find photos that contain a 100% green object and a 100% brown object." The non-object approach would return horses in fields, a forest (trunks are brown), brown scum in a green river, a green lawn with 10% of it covered in dog droppings, etc. With eVision you would get horses in fields, a horse-sized dog on a grassy background, etc. The latter matches are certainly closer to the sample image and much more like the way humans see things. We see objects, not distributions of color.
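To make the difference concrete, here is a toy sketch in Python. It is not eVision's actual algorithm, which isn't described here; it just assumes an "image" is a small grid of color labels ("G" for green, "B" for brown) and that an "object" is a 4-connected blob of one color. The helper names (`make_image`, `color_histogram`, `regions`) and the sample grids are made up for illustration.

```python
from collections import Counter, deque


def make_image(size, brown_cells):
    """Build a size x size grid of 'G' cells with 'B' at the given (row, col) spots."""
    brown = set(brown_cells)
    return [["B" if (r, c) in brown else "G" for c in range(size)]
            for r in range(size)]


def color_histogram(image):
    """Whole-image view: fraction of cells per color, ignoring where they are."""
    counts = Counter(cell for row in image for cell in row)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


def regions(image):
    """Object view: sorted (color, area fraction) for each 4-connected blob."""
    rows, cols = len(image), len(image[0])
    seen = set()
    found = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen:
                continue
            color = image[r][c]
            queue = deque([(r, c)])
            seen.add((r, c))
            area = 0
            # Breadth-first flood fill over same-colored neighbors.
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and image[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            found.append((color, area / (rows * cols)))
    return sorted(found)


# A "horse": one solid 2x5 brown blob on green (10% brown).
horse = make_image(10, [(r, c) for r in (4, 5) for c in range(3, 8)])
# A "horse-sized dog": same-sized blob somewhere else in the frame.
dog = make_image(10, [(r, c) for r in (2, 3) for c in range(2, 7)])
# "Branches in a river": ten scattered single brown cells (also 10% brown).
branches = make_image(10, [(1, 1), (1, 4), (1, 7), (3, 2), (3, 5),
                           (3, 8), (5, 1), (5, 4), (5, 7), (7, 3)])

print(color_histogram(horse) == color_histogram(branches))  # same color mix
print(len(regions(horse)), len(regions(branches)))          # different blob counts
print(regions(horse) == regions(dog))                       # same objects
```

By color proportions alone, the horse and the branches are indistinguishable. Once the grid is split into blobs, the horse and the dog both come out as one big green region plus one brown region of the same size, while the branches come out as a green region plus a scatter of tiny brown ones. A real system would also compare shape, texture and so on; this only shows why splitting into regions first changes which images count as "similar."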
posted by Rothko at 9:52 PM on January 2, 2006