April 16, 2010

Internationalization is also intranationalization

There are currently four big, more-or-less homogeneous markets. By population, they are China, India, the EU, and the USA. With EU and Indian data, one quickly becomes aware of the need to detect language early in data processing. While I am not certain about the situation in China, the USA is the "big market" that tends to get unwarranted assumptions made about it. This note covers a "gotcha" about US addresses.

When processing US addresses, you might be tempted to assume that a street has a name, like "First", followed perhaps by a type, like "Street" or "Avenue". If you have a DBA-like denormalization instinct, then you'll be tempted to peel off the street name into one field and keep a table of street types. In fact, a lot of address validation software does this automatically.

And assuming that all the inputs are valid, USPS-deliverable addresses, you should get valid outputs, right?

Let's momentarily ignore the "obvious" challenges ... PO Boxes... APO/FPO... highway contract, rural route... after all, the commercial software does this pretty well. For this example, we'll even cheat, and not really check for deviations from expected formatting. We can just glance at the first few ZIP codes. In 00677, we find one of the hosts of the Rincón International Film Festival at an address that is very typical for Rincón, PR:
Casa Isleña Inn & Tapas Bar
Barrio Puntas
Carr. Int. 413 km. 4.8
Rincón, PR 00677
Or the Starbucks, a very anglophile environment:
1417-1425 Ave Ashford, San Juan 00907, Puerto Rico 
Note the word order. This is not on "Ashford Avenue", but on Avenue Ashford. A more local establishment would be:
Cafe con Leche, 1-9 Calle Bajada Matadero,
San Juan 00901, Puerto Rico
Make certain that your data parsing software can deal with all of these as well. "Calle de" as a street "name" has been the most common language difficulty for US address standardization that I have seen, with significantly fewer problems coming from other local language uses, like the "Rue" prefix in southern Louisiana, native names in tribal areas, and the use of Hawaiian.
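To make the word-order problem concrete, here is a minimal Python sketch of a street-line splitter that looks for the street type in front of the name as well as behind it. Everything in it is an assumption for illustration: the function name split_street_line, the StreetLine tuple, and the two tiny lookup tables are made up here, and they are nowhere near a complete or official list of street types.

```python
import re
from typing import NamedTuple, Optional

# Illustrative, deliberately incomplete tables (assumptions, not USPS data).
SUFFIX_TYPES = {"street", "st", "avenue", "ave", "road", "rd", "boulevard", "blvd"}
PREFIX_TYPES = {"calle", "ave", "avenida", "carr", "carretera", "rue", "paseo"}


class StreetLine(NamedTuple):
    number: Optional[str]         # house number or range, e.g. "1417-1425"
    name: str                     # whatever is left after removing the type
    street_type: Optional[str]    # the type token as written, e.g. "Calle"
    type_position: Optional[str]  # "prefix", "suffix", or None


def _norm(token: str) -> str:
    """Lower-case a token and drop a trailing period ("Carr." -> "carr")."""
    return token.rstrip(".").lower()


def split_street_line(line: str) -> StreetLine:
    """Split one street line into number, name, and type (hypothetical helper)."""
    tokens = line.replace(",", " ").split()

    # A leading house number or number range is optional.
    number = None
    if tokens and re.fullmatch(r"\d+(-\d+)?", tokens[0]):
        number = tokens.pop(0)

    if not tokens:
        return StreetLine(number, "", None, None)

    # Check the prefix position first: "Ave Ashford", "Calle Bajada Matadero".
    if len(tokens) > 1 and _norm(tokens[0]) in PREFIX_TYPES:
        return StreetLine(number, " ".join(tokens[1:]), tokens[0], "prefix")

    # Then the familiar suffix position: "First Street".
    if len(tokens) > 1 and _norm(tokens[-1]) in SUFFIX_TYPES:
        return StreetLine(number, " ".join(tokens[:-1]), tokens[-1], "suffix")

    # No recognizable type: keep the whole line as the name.
    return StreetLine(number, " ".join(tokens), None, None)


if __name__ == "__main__":
    for example in ("123 First Street",
                    "1417-1425 Ave Ashford",
                    "1-9 Calle Bajada Matadero",
                    "Carr. Int. 413 km. 4.8"):
        print(split_street_line(example))
```

Checking the prefix position before the suffix position is a judgment call, and it will misfire on some streets. The kilometer marker in the Rincón address is simply left inside the name, so a real parser would need a separate rule for highway addresses. The point is only that the street type cannot be assumed to sit at the end.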

The takeaway is that even in a "homogeneous" market, a certain amount of "internationalization" is still needed, even if you somehow manage to avoid foreign data entirely, which is hard if the internet is involved. In the US, the selection of an official language is a power left to the states and territories, and last I knew, at least three included languages other than English (Puerto Rico, New Mexico, and Hawai'i).
