2014-01-05

¡Buscamos Nueva York!

The Internet Archive has published an interesting map geocoding the locations of the place names mentioned in US television news (actually, San Francisco and Washington, DC television news) over several years.

Parsing the data to extract place names, and then geocoding them, must have been quite an effort. However, as the authors themselves have noted, there is plenty of room for improvement.

It seems that once you go outside of the US borders, it's not "a lot of errors" (as mentioned in the original post); rather, almost everywhere, the noise greatly outweighs the signal! This is not entirely surprising, of course: San Francisco or Washington, DC, television programming has little reason to mention small towns in China or El Salvador, so pretty much every time the map has a small dot in one of those countries, it is the result of misattribution.

The gazetteer for El Salvador, for example, must have included lots of towns (and villages) whose names are common Spanish nouns ("La Nueva", "La Union", "Los Campos", "Libertad", "La Puerta", "La Reina", "Los Canales", "Los Blancos") ... as well as a town named "Nueva York" (guess what place that name usually refers to in Spanish!) and another one called "Chiapas", which must have absorbed some hits meant for a state in Mexico. The same situation prevails in neighboring Guatemala and Honduras. In Spain, references to "Madrid" are (hopefully) genuine, but as for almost everything else ("Los Alamos", "La Copa", "El Canal", "Las Cuevas"), you must have guessed the source by now. Even the Albanian capital Tirana reveals itself, most of the time, as a typo for "Tehran"!

In China, it seems, the gazetteer includes lots of small places which are homonymous with personal names occurring in the news reports, and that's how they got mapped.

In Russia, it's a rather interesting mixed bag of misattributions. The comparatively large dot on Nizhny Novgorod looks quite reasonable, until you click on it and see that most occurrences came from phrases such as "Gorky Park" (Nizhny Novgorod was called Gorky from 1932 to 1990).

In a lot of cases, it seems, the system makes the wrong disambiguation choice, assigning a hit for X to location X1 even when there is a much better known place X2 with the same name. For example, Italy gets hits for "Monte Carlo"; Russia, for "Balkan", "Bogot" (Bogotá?), as well as for "Yalta" and "Rovno", and for "Strasb[o]urg" too!
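To make the X1-versus-X2 point concrete, here is a minimal sketch of the opposite policy: among gazetteer entries sharing a name, prefer the best-known one. Everything in it is an assumption for illustration. The `Candidate` class, the `prominence` field and the numbers are made up, and nothing here reflects how the Internet Archive's system actually works.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One gazetteer entry for a place name (hypothetical structure)."""
    name: str
    country: str
    prominence: float  # made-up "how well known is it" score, higher = better known


def pick_best_known(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the highest prominence, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.prominence)


# Illustrative, invented scores: a small Italian locality vs. the famous one in Monaco.
monte_carlo = [
    Candidate("Monte Carlo", "Italy", prominence=1.0),
    Candidate("Monte Carlo", "Monaco", prominence=50.0),
]

best = pick_best_known(monte_carlo)
print(best.country if best else "no match")
```

A real system would also want to look at the surrounding text, since the best-known place is not always the one meant, but even this crude rule would move the "Monte Carlo" hits out of Italy.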

Trying to improve the name identification quality in a real-life system like this could be an interesting topic for a computer science student's term paper (and maybe even for a master's thesis). Some Bayesian statistics could be helpful. For example, for words like "la nueva", "el canal" or "la unión", one can estimate the prior probability of them occurring in a text as a place name vs. a common noun or adjective. (As a rough idea of the likelihood of the former, one can look at the size of the Wikipedia article for a place with a given name; if there is no article for a given place name, then, most likely, it is not going to appear in TV news.)
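The Bayesian idea can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `half_point` constant and the way article size is turned into a prior are all invented here, and the likelihoods of the contextual evidence would have to be estimated from real data.

```python
def prior_from_article_size(article_bytes: int, half_point: float = 30000.0) -> float:
    """Map Wikipedia article size to a prior that a mention is a place name.

    Hypothetical heuristic: no article gives a prior of 0, and an article of
    `half_point` bytes gives 0.5. The functional form is an assumption.
    """
    if article_bytes <= 0:
        return 0.0
    return article_bytes / (article_bytes + half_point)


def posterior_place(prior_place: float,
                    p_evidence_given_place: float,
                    p_evidence_given_common: float) -> float:
    """Bayes' rule for two hypotheses: "place name" vs. "common noun or adjective".

    The evidence could be, say, the phrase following a locative preposition.
    """
    numerator = p_evidence_given_place * prior_place
    denominator = numerator + p_evidence_given_common * (1.0 - prior_place)
    if denominator == 0.0:
        return 0.0
    return numerator / denominator


# Invented numbers: a phrase whose namesake village has a tiny article,
# seen in a context that is four times likelier for places than for common nouns.
prior = prior_from_article_size(1500)
print(posterior_place(prior, p_evidence_given_place=0.8, p_evidence_given_common=0.2))
```

With a prior of zero (no article at all), no amount of contextual evidence can rescue the place-name reading, which matches the rule of thumb above. A gentler version would use a small non-zero floor instead.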
