Google Unveils Neural Network with “Superhuman” Ability to Determine the Location of Almost Any Image
Guessing the location of a randomly chosen Street View image is hard, even for well-traveled humans. But Google’s latest artificial-intelligence machine manages it with relative ease.
By Emerging Technology from the arXiv
Feb 24 2017
Here’s a tricky task. Pick a photograph from the Web at random. Now try to work out where it was taken using only the image itself. If the image shows a famous building or landmark, such as the Eiffel Tower or Niagara Falls, the task is straightforward. But the job becomes significantly harder when the image lacks specific location cues or is taken indoors or shows a pet or food or some other detail.
Nevertheless, humans are surprisingly good at this task. To help, they bring to bear all kinds of knowledge about the world such as the type and language of signs on display, the types of vegetation, architectural styles, the direction of traffic, and so on. Humans spend a lifetime picking up these kinds of geolocation cues.
So it’s easy to think that machines would struggle with this task. And indeed, they have.
Today, that changes thanks to the work of Tobias Weyand, a computer vision specialist at Google, and a couple of pals. These guys have trained a deep-learning machine to work out the location of almost any photo using only the pixels it contains.
Their new machine significantly outperforms humans and can even use a clever trick to determine the location of indoor images and pictures of specific things such as pets, food, and so on that have no location cues.
Their approach is straightforward, at least in the world of machine learning. Weyand and co begin by dividing the world into a grid consisting of over 26,000 squares of varying size that depend on the number of images taken in that location.
So big cities, which are the subjects of many images, have a more fine-grained grid structure than more remote regions where photographs are less common. Indeed, the Google team ignored areas like oceans and the polar regions, where few photographs have been taken.
Next, the team created a database of geolocated images from the Web and used the location data to determine the grid square in which each image was taken. This data set is huge, consisting of 126 million images along with their accompanying Exif location data.
Weyand and co used 91 million of these images to teach a powerful neural network to work out the grid location using only the image itself. Their idea is to input an image into this neural net and get as the output a particular grid location or a set of likely candidates.
They then validated the neural network using the remaining 34 million images in the data set. Finally they tested the network—which they call PlaNet—in a number of different ways to see how well it works.
The results make for interesting reading. To measure the accuracy of their machine, they fed it 2.3 million geotagged images from Flickr to see whether it could correctly determine their location. “PlaNet is able to localize 3.6 percent of the images at street-level accuracy and 10.1 percent at city-level accuracy,” say Weyand and co. What’s more, the machine determines the country of origin in a further 28.4 percent of the photos and the continent in 48.0 percent of them.
That’s pretty good. But to show just how good, Weyand and co put PlaNet through its paces in a test against 10 well-traveled humans. For the test, they used an online game that presents a player with a random view taken from Google Street View and asks him or her to pinpoint its location on a map of the world.