April 16, 2010

Internationalization is also intranationalization

There are currently four big, more-or-less homogeneous markets. By population: China, India, the EU, and the USA. Working with EU or Indian data, one quickly learns the need to detect language early in data processing. While I am not certain about the situation in China, the remaining "big market", the USA, tends to have certain assumptions made about it. This note covers a "gotcha" in its addresses.

When processing US addresses, you might be tempted to assume that there is a name, like "First", and then perhaps a type, like "Street" or "Avenue". If you have a DBA's normalization instinct, you'll be tempted to peel the street name off into one field and keep a lookup table of street types. In fact, a lot of address validation software does this automatically.

And assuming that all the inputs are valid USPS deliverable addresses, you should get valid outputs, right?

Let's momentarily ignore the "obvious" challenges: PO Boxes, APO/FPO, highway contract and rural routes; after all, the commercial software handles those pretty well. For this example, we'll even cheat and not really check for deviations from expected formatting. We can just glance at the first few ZIP codes. In 00677, we find one of the hosts of the Rincón International Film Festival at an address very typical for Rincón, PR:
Casa Isleña Inn & Tapas Bar
Barrio Puntas
Carr. Int. 413 km. 4.8
Rincón, PR 00677
Or the Starbucks, a very anglophile environment:
1417-1425 Ave Ashford, San Juan 00907, Puerto Rico 
Note the word order. This is not on "Ashford Avenue", but on Avenue Ashford. A more local establishment would be:
Cafe con Leche, 1-9 Calle Bajada Matadero,
San Juan 00901, Puerto Rico
Make certain that your data parsing software can deal with all of these as well. "Calle de" as a street "name" has been the most common language difficulty for US address standardization that I have seen, with significantly fewer problems coming from other local language uses, like the "Rue" prefix in southern Louisiana, native names in tribal areas, and the use of Hawai'ian.
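The failure mode is easy to reproduce. Here is a minimal sketch of the suffix-peeling approach described above, and why it breaks on Spanish (and French) word order, where the street type comes first. The parser and its word lists are hypothetical illustrations, not any particular vendor's software.

```python
# Hypothetical street-type lists for illustration; real standardization
# software uses much larger tables (e.g. USPS suffix abbreviations).
STREET_TYPES = {"street", "st", "avenue", "ave", "road", "rd", "boulevard", "blvd"}
PREFIX_TYPES = {"calle", "ave", "avenida", "carr", "rue"}

def parse_street(street_line):
    """Naive split into (name, type), assuming English word order: type last."""
    words = street_line.rstrip(".").split()
    if words and words[-1].lower() in STREET_TYPES:
        return " ".join(words[:-1]), words[-1]
    return street_line, None  # no recognized type suffix

def parse_street_i18n(street_line):
    """Also handle type-first word order, as in 'Ave Ashford' or 'Calle ...'."""
    words = street_line.split()
    if words and words[0].lower().rstrip(".") in PREFIX_TYPES:
        return " ".join(words[1:]), words[0]
    return parse_street(street_line)

print(parse_street("1st Street"))        # ('1st', 'Street')
print(parse_street("Ave Ashford"))       # ('Ave Ashford', None) -- type missed
print(parse_street_i18n("Ave Ashford"))  # ('Ashford', 'Ave')
print(parse_street_i18n("Calle Bajada Matadero"))  # ('Bajada Matadero', 'Calle')
print(parse_street_i18n("Carr. Int. 413 km. 4.8"))
```

The naive parser silently returns "Ave Ashford" with no type at all, which is exactly the kind of quiet data corruption that then propagates into the denormalized tables downstream.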

The takeaway is that even in a "homogeneous" market, a certain amount of "internationalization" is still needed, even if you somehow manage to avoid any foreign data entirely, which is hard if the internet is involved. In the US, the selection of an official language is a power left to the states and territories, and last I knew, at least three included languages other than English (Puerto Rico, New Mexico, and Hawai'i).

April 13, 2010

Quick Check: water surface temperature

The check: daily average water surface temperature is usually well correlated with the lagging three-day average of the daily maximum air temperature. This is robust enough to serve as the basis for an interpolation model, and it outperforms the same lag applied to the daily average air temperature.

Yes, this summary does skip a lot of analytical details: the tables of correlations between site observations for multiple sites, the various linear models fit and tested, and the analysis of the overlap of gaps in data. This method proved particularly robust, since daily high/low temperatures are frequently available from multiple sources at a research site. A perfect model for filling in missing data does not help if the data it needs is missing at the same time!
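The check itself is simple enough to sketch in a few lines. Here is a minimal version, using synthetic made-up temperatures purely to show the mechanics of the trailing three-day mean and the correlation, not real site data.

```python
from statistics import mean

def trailing_mean(series, window):
    """Trailing (lagging) moving average; None until enough history exists."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(series[i - window + 1 : i + 1]))
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Synthetic example: daily max air temps and daily mean water temps.
air_max = [18, 21, 25, 24, 22, 26, 28, 27, 23, 20]
water = [15.0, 15.5, 16.2, 17.1, 17.5, 17.3, 17.9, 18.6, 18.8, 18.1]

# Pair each day's water temp with the trailing 3-day mean of max air temp,
# dropping the first days where no full window exists yet.
lagged = trailing_mean(air_max, 3)
pairs = [(a, w) for a, w in zip(lagged, water) if a is not None]
r = pearson([a for a, _ in pairs], [w for _, w in pairs])
print(f"correlation with 3-day trailing max-air mean: {r:.2f}")
```

In practice you would fit and cross-validate the linear model on real observations, and check the gap-overlap issue mentioned above before trusting the interpolation.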

Both daytime air temperature and water warming seem likely to covary with daytime insolation, a major source of heating. Water loses a lot of heat through evaporation, which slows when the air is cool. The air cools primarily by radiative cooling, which also happens to water, though perhaps not at the rate of evaporative cooling.

I have measured a clear blue sky with my little non-contact thermometer at 4F (-16C) in mid-afternoon in Colorado on an 80F day. There's a lot of potential in radiative cooling.

April 9, 2010

Data Sources and Software lists

In an effort to simplify things for myself, I am making a "data sources" page to merge a large, scattered set of references into one place. If you find it helpful, I'd be pleased to hear what it helped you with, and perhaps see a pointer to the results.

In a similar vein, I have started a "software" page. There are so many moving parts in a complete analytical system that I find it helpful to have a list of which pieces can go where. From ETL (extract-transform-load) to presentation, with useful stops at analysis, programming, data management, and even operating systems and environments. And side notes on geodata/GIS capability if that needs to be in the system too. I hope to have a post at some point giving a couple of recommended flows through these, for people of various means and needs.

It has already made one thing clear: SAS, C/C++, SQL, C#, perl, Python, and friends cover a lot, at least up to the tens of terabytes, but I see I still have some gaps at the low-cost petabyte scale. I might do well to sharpen my Java skills and add Pig (the Hadoop data-flow language).

April 8, 2010

What is a Data Gorilla?

Data Gorilla (noun) - a Data Monkey that has grown up and learned to use power tools.

This particular data gorilla uses a lot of windage, common sense, and "let's ask what they really meant by this" calls as data power tools, with backup from SAS, R, C++, C, Python, perl, and whatever else is handy. Applied previously in contexts of climate research, consumer modeling, behavioral forecasting, machine learning, error analysis, and plain old sanity checks.