Given a large repository of geo-tagged imagery, we seek to automatically find visual elements, for example windows, balconies, and street signs, that are most distinctive for a certain geo-spatial area, for example the city of Paris. This is a tremendously difficult task as the visual features distinguishing architectural elements of different places can be very subtle. In addition, we face a hard search problem: given all possible patches in all images, which of them are both frequently occurring and geographically informative? To address these issues, we propose to use a discriminative clustering approach able to take into account the weak geographic supervision. We show that geographically representative image elements can be discovered automatically from Google Street View imagery in a discriminative manner. We demonstrate that these elements are visually interpretable and perceptually geo-informative. The discovered visual elements can also support a variety of computational geography tasks, such as mapping architectural correspondences and influences within and across cities, finding representative elements at different geo-spatial scales, and geographically informed image retrieval.
Figure 1. These two photos might seem nondescript, but each contains hints about which city it might belong to. Given a large image database of a given city, our algorithm is able to automatically discover the geographically informative elements (patch clusters to the right of each photo) that help in capturing its "look and feel." On the left, the emblematic street sign, a balustrade window, and the balcony support are all very indicative of Paris, while on the right, the neoclassical columned entryway sporting a balcony, a Victorian window, and, of course, the cast-iron railings are very much features of London.
1. Introduction
Consider the two photographs in Figure 1, both downloaded from Google Street View. One comes from Paris, the other from London. Can you tell which is which? Surprisingly, even for these nondescript street scenes, people who have been to Europe tend to do quite well on this task. In an informal survey, we presented 11 subjects with 100 random Street View images, of which 50% were from Paris and the rest from eleven other cities. We instructed the subjects (who have all been to Paris) to try to ignore any text in the photos, and collected their binary forced-choice responses (Paris/Not Paris). On average, subjects were correct 79% of the time (std = 6.3), with chance at 50% (when allowed to scrutinize the text, performance for some subjects went up as high as 90%). This suggests that people are remarkably sensitive to the geographically informative features within the visual environment. But what are those features? In informal debriefings, our subjects suggested that for most images, a few localized, distinctive elements "immediately gave it away." For example, for Paris, things like windows with railings, the particular style of balconies, the distinctive doorways, the traditional blue/green/white street signs, etc., were particularly helpful. Finding those features can be difficult, though, since every image can contain more than 25,000 candidate patches, and only a tiny fraction will be truly distinctive.
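To make the scale of that search concrete, the following back-of-the-envelope sketch counts the windows of a dense multi-scale sliding-window scan in Python. The image dimensions, patch size, stride, and pyramid parameters are illustrative assumptions chosen for this sketch, not values taken from the paper.

def count_candidate_patches(width=936, height=537, patch=80, stride=8,
                            scale=0.9, min_side=200):
    """Count sliding-window patches over an image pyramid.

    All parameter values here are hypothetical; they only illustrate
    why a single street-level photo yields tens of thousands of
    candidate patches.
    """
    total = 0
    w, h = width, height
    while min(w, h) >= min_side:              # stop once the image is too small
        nx = (int(w) - patch) // stride + 1   # windows that fit per row
        ny = (int(h) - patch) // stride + 1   # windows that fit per column
        total += nx * ny
        w, h = w * scale, h * scale           # next (smaller) pyramid level
    return total

print(count_candidate_patches())  # ~26,000 windows for this one image

With these made-up settings, a single image already produces over 25,000 candidate windows, and a city-scale database multiplies that by tens of thousands of images.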
In this work, we want to find such local geo-informative features automatically, directly from a large database of photographs from a particular place, such as a city. Specifically, given tens of thousands of geo-localized images of some geographic region R, we aim to find a few hundred visual elements that are both: (1) repeating, that is, they occur often in R, and (2) geographically discriminative, that is, they occur much more often in R than in R^C (the rest of the world). Figure 1 shows sample output of our algorithm: for each photograph we show three of the most geo-informative visual elements that were automatically discovered. For the Paris scene (left), the street sign, the window with railings, and the balcony support are all flagged as informative.
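To illustrate criterion (2) concretely, here is a minimal sketch of a nearest-neighbor discriminativeness score: a candidate patch counts as geo-informative if most of its closest matches in the combined database come from R. This is a simplified stand-in under stated assumptions, not the authors' actual pipeline; the function name, the use of scikit-learn, the feature dimensionality, and the random stand-in descriptors (real ones would be, e.g., HOG-plus-color features) are all ours.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def geo_informativeness(candidates, database, db_labels, k=20):
    """Score each candidate patch by the fraction of its k nearest
    database neighbors that were sampled inside region R."""
    nn = NearestNeighbors(n_neighbors=k).fit(database)
    _, idx = nn.kneighbors(candidates)     # (n_candidates, k) neighbor indices
    return db_labels[idx].mean(axis=1)     # fraction of neighbors labeled R

# Toy usage with random stand-in descriptors (hypothetical values):
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 128))    # descriptors of all database patches
db_labels = rng.integers(0, 2, size=1000)  # 1 = inside R (Paris), 0 = elsewhere
candidates = rng.normal(size=(50, 128))    # candidate patches sampled from R
scores = geo_informativeness(candidates, database, db_labels)
top10 = np.argsort(-scores)[:10]           # most geo-informative candidates

A score near 1 means a patch's visual neighborhood is dominated by region R, satisfying discriminativeness; frequency (criterion 1) would additionally require that such close matches are numerous, not just R-dominated.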
But why is this topic important for modern computer graphics? (1) Scientifically, the goal of understanding which visual elements are fundamental to our perception of a complex visual concept, such as a place, is an interesting and useful one. Our paper shares this motivation with a number of other recent works that do not actually synthesize new visual imagery, but rather propose ways of finding and visualizing existing image data in better ways, be it selecting candid portraits from a video stream,5 summarizing a scene from photo collections,19 finding iconic images of an object,1 etc. (2) More practically, one possible future application of the ideas presented here might be to help CG modelers by generating the so-called "reference art" for a city. For instance, when modeling Paris for Pixar's Ratatouille, the co-director Jan Pinkava faced exactly this problem: "The basic question for us was: 'what would Paris look like as a model of Paris?', that is, what are the main things that give the city its unique look?"14 Their solution was to "run around Paris for a week like mad tourists, just looking at things, talking about them, and taking lots of pictures" not just of the Eiffel Tower but of the many stylistic Paris details, such as signs, doors, etc.14 (see photos on pp. 120–121). But if going "on location" is not feasible, our approach could serve as the basis for a detail-centric reference art retriever, which would let artists focus their attention on the most statistically significant stylistic elements of the city. (3) And finally, more philosophically, our ultimate goal is to provide a stylistic narrative for the visual experience of a place. Such a narrative, once established, can be related to others in a kind of geo-cultural visual reference graph, highlighting similarities and differences between regions. For example, one could imagine finding a visual appearance "trail" from Greece, through Italy and Spain, and into Latin America. In this work, we take only the first steps in this direction: connecting visual appearance across cities, finding similarities within a continent, and differences between neighborhoods. But we hope that our work might act as a catalyst for research in this new area, which might be called computational geo-cultural modeling.
2. Prior Work
In the field of architectural history, descriptions of urban and regional architectural styles and their elements are well established. Such local elements and rules for combining them have been used in computer systems for procedural modeling of architecture to generate 3D models of entire cities at an astonishing level of detail, for example, Mueller et al.,12 or to parse images of facades, for example, Teboul et al.22 However, such systems require significant manual effort from an expert to specify the appropriate elements and rules for each architectural style.
At the other end of the spectrum, data-driven approaches have been leveraging the huge datasets of geo-tagged images that have recently become available online. For example, Crandall et al.2 use the GPS locations of 35,000 consumer photos from Flickr to plot photographer-defined frequency maps of cities and countries. Geo-tagged datasets have also been used for place recognition8, 17 including famous landmarks.10, 11 Our work is particularly related to Schindler et al.17 and Knopp et al.,8 where geo-tags are also used as a supervisory signal to find sets of image features discriminative for a particular place. While these approaches can work very well, their image features typically cannot generalize beyond matching specific buildings imaged from different viewpoints. Alternatively, global image representations from scene recognition, such as the GIST descriptor,13 have been used for geolocalization of generic scenes on the global Earth scale.6, 7 There, too, reasonable recognition performance has been achieved, but the use of global descriptors makes it hard for a human to interpret why a given image gets assigned to a certain location...