## July 27, 2010

### Scientific Grammar

Scientific writing is sometimes hard to read because of bad grammar, even more than because of strange abbreviations and technical terminology. This is sadly expected in journal articles, even though clear writing makes it more likely that someone will read far enough through your research to use it and cite it. It is also the reason that so many whitepapers are written by non-experts. The sponsoring organization wants people to read them, not fear them.

"The Science of Scientific Writing" (Gopen and Swan, American Scientist, Nov-Dec 1990) is an article that does a great job of documenting these problems and showing how to fix them. The article stresses a simple pattern: start with the familiar, end with the new. As they put it:

"In our experience, the misplacement of old and new information turns out to be the No. 1 problem in American professional writing today."

Gopen and Swan back their thesis up with "worked examples." Taking passages from published articles, they show how to revise them for clarity.

The article ends with seven rules to summarize what they have found. I am putting the rules here to remind me of them, and to entice readers unfamiliar with them to visit the original article and learn what they are reminders for.

1. Follow a grammatical subject as soon as possible with its verb.
2. Place in the stress position the "new information" you want the reader to emphasize.
3. Place the person or thing whose "story" a sentence is telling at the beginning of the sentence, in the topic position.
4. Place appropriate "old information" (material already stated in the discourse) in the topic position for linkage backward and contextualization forward.
5. Articulate the action of every clause or sentence in its verb.
6. In general, provide context for your reader before asking that reader to consider anything new.
7. In general, try to ensure that the relative emphases of the substance coincide with the relative expectations for emphasis raised by the structure.

This article won't replace "The Elements of Style" by Strunk and White, but it is a useful addendum for the scientific writer.

## July 26, 2010

### Missing people in US phone surveys

It is true that people often like to denigrate statistics derived from survey data, but the reason that I hear most frequently - "But five thousand is less than 0.1% of 300 million!" - is not actually a significant source of error. The error to watch for more carefully is sampling bias.
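A quick sketch of why the sample fraction matters so little. For a simple random sample, the margin of error on a proportion depends almost entirely on the sample size, and the finite population correction barely moves it. The function name, the worst-case p of 0.5, and the 95% z-value of 1.96 are my choices for illustration. The calculation assumes simple random sampling, which is exactly the assumption that sampling bias breaks.

```python
import math

def margin_of_error(n, population, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction: only matters when n is a large share of the population.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc

# Same sample of 5,000 drawn from very different population sizes.
for population in (100_000, 1_000_000, 300_000_000):
    moe = margin_of_error(5_000, population)
    print(f"population {population:>11,}: +/- {100 * moe:.2f} percentage points")
```

The three lines printed differ only in the second decimal place, which is why "it's a tiny fraction of the country" is not the objection to worry about.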

For a long time, "random dialing" has been a great way to get a random sample that could include about 99% of the population (the 1% without phones were generally considered not to need consideration for most purposes - everyone imagined jails and wilderness hermits in cabins). While incoming calls were free to telephone subscribers, and most of the population had phones, this was almost ideal. It wasn't actually sampling people, but households, since there was generally one phone per house. However, with this sampling frame it is possible to cleanly stratify by household size, and get back to estimating individuals fairly easily. Early in the 20th century, the number of households with no telephone service was a real problem, but as this shrank toward 1%, the error became negligible.

In the early 21st century, we now have a different problem: mobile phones often charge for incoming calls, so they are not allowed to be "random dialed", and more and more households are relying on them exclusively. How many? Well... I'll show you a picture.

You can clearly see a fairly dramatic age bias. It appears that 25-year-olds are at about 50% reachability by "random dialing", while other age groups may be as high as 90% reachable. Less obviously, that 50% has a pretty good chance of being correlated with other aspects of their lives. Can anything be done? Maybe. At the very least, stratify your sample by age and use data like that in the above chart to correct for reachability by age group.
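Here is a minimal sketch of that age correction, done as plain post-stratification: weight each age group's result by its share of the population instead of its share of the people you managed to reach. Every number below is invented for illustration (the age bands, the population shares, the reachability rates, and the survey answers), and the names are mine. It corrects only for age, not for differences between reachable and unreachable people within an age group.

```python
# All numbers are made up for illustration; substitute real survey and census figures.
population_share = {"18-29": 0.22, "30-49": 0.35, "50-64": 0.25, "65+": 0.18}
reachable = {"18-29": 0.50, "30-49": 0.75, "50-64": 0.85, "65+": 0.90}
group_mean = {"18-29": 0.60, "30-49": 0.50, "50-64": 0.45, "65+": 0.40}

def weighted_mean(values, weights):
    """Average the per-group values using the given per-group weights."""
    total = sum(weights[group] for group in values)
    return sum(values[group] * weights[group] for group in values) / total

# A random-dialed sample over-represents the groups that are easy to reach.
sample_weight = {g: population_share[g] * reachable[g] for g in population_share}

raw_estimate = weighted_mean(group_mean, sample_weight)
adjusted_estimate = weighted_mean(group_mean, population_share)

print(f"unadjusted: {raw_estimate:.3f}")
print(f"age-adjusted: {adjusted_estimate:.3f}")
```

With these invented inputs the young group answers differently and is under-reached, so the unadjusted figure is pulled toward the older groups.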

If you have another sampling technique available, you can try to use it to infer differences within an age group between with-house-phone and without-house-phone, and then adjust your results appropriately for that. If it's at all controversial though, brace yourself. Even though you should now have more valid results, the people who agree with the groups that were originally overrepresented will now direct very pointed charges at you of "lying with statistics." Even though all you actually did was "question with statistics to the best of your ability."

There are many biases that will be harder to find. The most usual one: the survey was planned by people who wanted to show "regular people", that is, people like themselves. So a sampling plan was drawn up that seemed like a good way to meet "random" people. Often it is the same way the planners meet random people - where they met the people they now hang out with - people they are generally similar to and agree with. It's a very confirmatory feeling to have your survey agree with you, so people who have executed a survey this way may be a bit defensive if you suggest that it is perhaps just maybe a little tiny bit biased. Unfortunately, these surveys seem to be some of the most common ones in politics.
Sigh.

Update: The National Marine Fisheries Service is trialling a switch to postal surveys in light of the increasing problems with telephone surveys.

## July 22, 2010

From the realm of hard to interpret statistics based on easy to get data with unknowable biases:

From Google searching, with English language set:

 "single mother" 3,030,000 "single father" 568,000 "single parent" 5,400,000

 "single mother" 24,000,000 "single father" 12,800,000 "single parent" 3,670,000

Major news stories have been based on less than this. Easily accessible and reliable data is great, but when it's not, the easy should not replace the reliable. Check your data before taking it seriously.

The serious data, for the US, from the US Census Bureau (Jan, 2010 press release):

In 2009, 12 percent of the 1.7 million father-only family groups with children under 18 were maintained by an unemployed father, compared with 7 percent in 2007. Of the 9.9 million mother-only family groups, 10 percent were unemployed in 2009 compared with 6 percent in 2007.

Or reformatted, the 2009 Census data:

 "single mother" 9,900,000 "single father" 1,700,000 sum, single parents 11,600,000

And a lot of unemployment.

PS: The pattern in the unemployment numbers is recurring for US data. Women have greater unemployment than men when unemployment is low, and men have greater unemployment than women in times of high unemployment. All kinds of odd questions are suggested by this: Do women have more stable jobs? Is gender-correlated pay inequality causally related to apparent gender-correlated job security? And if so, which way? Would low pay cause secure work, or secure work cause low pay?

PPS: There's no immediately obvious link on Googlefight explaining how they come up with their numbers. Anyone know why they are so different from what I see "fresh from Google"?

## July 18, 2010

### Mathematical Statistics (1) in Public: indecency, or pearls before swine?

This post is a re-edit of a half serious note that I was inspired to write by the use of words in a snippet of tongue in cheek conversation about professional statistics. Hopefully this will be found to be an enjoyable read for at least a few people.

The inspiration follows, with no apologies for the really awful pun. Stop reading this post now if you can't suffer punning. That's the mood I was in, so it is required for accurately setting the scene.

Sean:

[...]In statistics one should be opinionated. After all, it is generally not considered a field of discreet mathematics. This is a conclusion most often reached when indiscretion prevails, no?

Anna:

I disagree - in statistics, one should not be opinionated, one should be *right*. And clear, succinct, and to the point.

Well... I pretty much completely agreed with her, and felt that her rebuttal didn't really address what I had said. So already being in a very silly mood, I wrote some explanations of how I thought the words used applied to statistics.

Result:

Discretion:
Advised in survey preparation and use, and in development of techniques which will be subject to intellectual property protections (2). In announcing results, discretion greatly increases the chance that someone else will get credit for the work (3).

Opinionated:
Use at exactly the same time as being discreet, at all times in the planning stage of a study, and in the execution and data preparation. Stop immediately upon reaching EDA or analysis. If you are not strongly opinionated in the pre-analysis, you get a bad formulation of a poorly thought out question, a biased crappy design, crappy biased data, results that are pure crap, and a bunch of people saying "One time, at band camp, I read 'How to Lie with Statistics'...". During analysis of the data, shut up and let the data and the PLANNED analysis have their turn. If you want to outsmart the data to support an opinion during the analysis phase, quit mathematics, go back to school for an MBA, and sign up as a derivatives trader or "investment banker". (4)

Rightness/correctness:

Clear:
Look at "rightness". See that part about the question? You better know as exactly as you can what question you answered or you will quickly leave the field of mathematics and enter the field of mumbling. Beyond that, "clear" is a communication issue. (8)

Succinct:
The less succinct the design and analysis, the less likely you or anyone else is to understand what was done and what happened. Suspect anything that requires too much explanation or is excessively sophisticated (9) of being misunderstood and misapplied.

To the point:
As phrased, this is not part of mathematical statistics, but part of communications. See succinct for a related topic in mathematics. (10)

Notes:

(1) My latest job search has revealed to me the semantic difference between mathematical statistics and statistics. "Statistics" is what a manager with an MBA and an Excel plug-in does and "has 10+ years of experience" in. "Mathematical Statistics" is the kind you study in school and have publications in.

(2)Carefully note that pure mathematics is unpatentable, categorized as properly being either a discovered work of God, or a discovered element of Nature and in either case a discovery, not an invention. Applied math has fewer such protections, so long as the application of the math(s) is part of the patent application.

(3)Think of it this way, if the results generate scandal with your name on it, then 'everyone knows' it's your work!

(4)Yes, this does lead to the conclusion that one should announce results in an unopinionated but indiscreet manner. It's no longer opinion, so announce the discovery with great excitement, not with great opinion.

(5)Capitalization used to indicate the common name of a species.

(6)Seriously, of "description", "methods", "analysis/results", and "discussion", only "methods" and "results" really matter. "Description" and "discussion" are opinion pieces about what the author thinks was asked, and what they think the results might mean in relation to what they think was asked. Check them with propaganda filters fully powered.

(7) The choice of which question should be asked seems to be a religious issue among mathematicians. And like most religions, they have sophisticated definitions for what they actually measure, that when carefully evaluated can only be answered if you have access to all possible universes, meaning that humanity can never actually test against their "gold standards" and just has to depend on the chosen method as a matter of faith. Like the "do parallel lines in this universe actually stay the same distance apart" problem, but with even less hope of ever having a usable answer. (11)

(8)Only use it when you want to be taken seriously.

(9)Check the pre-nineteenth century definitions for "sophisticated" to really understand how much to trust sophistication. While the dictionary might say that these definitions are obsolete, using them seems to have a tendency to make sentences with "sophisticated" in them more correct than the "modern" definition does.

(10) By the way, "to the point" is a highly advisable communications strategy: people have short attention spans. Two hundred words for a press release, five hundred for a press article, as few as possible for a published paper. When two papers about the same discovery are published, observation has led me to suspect that the shorter one both gets read more and gets more credit. Make it shorter than what I wrote here, capisce?

(11) For any English as a second language readers, yes, it really is "a usable" and not "an usable". _Spoken_ English governs the use of 'a' versus 'an', and 'usable' is pronounced as though there is a 'y' at the beginning. Just like saying "an 's.d.r.'," in which 's' sounds as though it starts with an 'e', but vice versa. Too many misuses in academic papers, making them noticeably harder to read smoothly, have made this a pet peeve of mine.

## July 14, 2010

At some point, academics have to give presentations to classes, whether as students, as lecturers, or just to raise interest in our work. Here are a couple of notes for those terrified of the day they present their research.

If using a slide presentation program, have a backup copy on a USB stick. At least with a math audience, there's going to be another working computer in the room. It might not be running the same system, so a PDF-based backup is recommended - almost every laptop you meet will have a PDF reader that can be set to full screen to display the slides.

Find out the location, and check the room before you present. This way you can make certain to have the right "dongle" for your computer that day. And even though it seems like half of the time the room is changed at the last minute, usually the new room has a similar technical setup.

Avoid ever giving a speech using new software or a new computer.

Try to give your speech to an empty room at least once before you "go public". You'll unconsciously improve your presentation, which is great. But the main thing you want to do is time how long it is - for most people, it's longer than the time allotted. Now you can cut out the less important bits instead of chopping off the end of the speech. The end is usually where you put what you want people to remember walking out, making that a particularly painful thing to have to do.

There is always a chance that this will be the day that a virus swarm takes down all the department computers, the microphone fails, and the power goes out halfway through your talk. Practice your material enough that you don't have to see the slides, or have a printout with you if you do have to see them.

If you know this will be central to your career - either you will be teaching, or the job postings you apply to mention "PowerPoint skills desired" - then get substantial practice. Take a public speaking class, join Toastmasters, give presentations at schools/libraries/wherever, take a Carnegie class, join an improv theater... *something* to give you some comfort in front of an audience, because that little bit more comfort will make you look a LOT better to the audience.

Toastmasters "ten tips for public speaking"

"How to Give an Academic Talk" - from a humanities prof, but relevant points for all academia.

"The Art of Public Speaking" by J. Esenwein and Dale Carnegie
- eBook formats, at manybooks.net
- Gutenberg project (text, good source for text-to-speech)

The 7 Easy Steps to Becoming a Public Speaking Failure, because not everyone has the same goals.

## July 13, 2010

### Why Mathematicians use LaTeX

LaTeX is a typesetting system that mathematicians are expected to be able to use. In a class I am taking, we had a presentation on it, and my response was too long to reasonably post for the class. So I decided to put it here instead. If you ever wonder why people who live by publishing papers might use something other than Word or another word processor, here's why...

Why does LaTeX exist, and what does it do?

Because in the late 1970s Don Knuth, a perfectionist, wanted to publish a book (it turned out to be several volumes) about theoretical computer science. He didn't want a human typesetter who didn't know what the symbols meant to goof up his work, and desktop publishing hadn't happened yet. So he wrote TeX, a system that he could use to typeset the book for publishing. LaTeX is the macro package that Leslie Lamport built on top of TeX in the 1980s to make it easier to use. Thus, LaTeX is very good for publishing books about topics that involve a lot of math. (Other people, these days, use programs like Quark - which is not very friendly for mathematical notation.)

Why do I not want to use it?

It is not good for writing a note to ask for a meeting time. The setup time can be long, even though it tends to be a one-time job. It does not make pasting gifs from the internet into your document easy. It will try to format your document into a structured whole, even if it is too small to be "structured".

Why do I want to use it?

Because you want to write a topic once, publish it many ways (thesis, reviewed paper, whitepaper, lecture slides, poster, book chapter), and have it look really good every time. Paste the "chapter" preamble on last summer's "paper" content, and submit. Another entry on your publication list. And you never have to wonder about the location of "the equation on the previous page" again, thanks to the automatic reference system, which will also rebuild your table of contents for you. And it can handle ALL the math symbols! And BibTeX, and the AMS style sheets, and... essentially, all of the normal and specialty features that mathematicians (including statisticians) need are freely available for it on the web.

How hard is it?

The learning curve is strange, since you can copy and paste from your previous work for most things, and never actually learn it. You do end up learning the special symbols, like \nabla and \pi and \rightarrow. On the upside, these are very handy for understanding other mathematicians in text-only emails and discussion groups. You can even text them!
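For a taste of what the copy-and-paste material looks like, here is a small sketch that uses a few of those symbols together with a labeled equation and an automatic reference to it. The label name `eq:closed-path` and the example itself are mine, chosen only for illustration. It assumes the `amsmath` package, which provides `\eqref`.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}

% A numbered equation with a label that can be referenced from anywhere.
\begin{equation}\label{eq:closed-path}
  \oint_{C} \nabla f \cdot d\mathbf{r} = 0
\end{equation}

% The reference below is renumbered automatically if equations are added or moved.
Equation~\eqref{eq:closed-path} holds for any smooth $f$ and closed curve $C$.
A circle of radius $r$ has area $\pi r^{2}$, and $1/n \rightarrow 0$ as $n \rightarrow \infty$.

\end{document}
```

Run LaTeX twice on a file like this so the reference number gets resolved.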

Why is this important?

Because this is the established format for math publishing. No platform incompatibilities, no version incompatibilities (even though Knuth says the lowercase delta in the 1992 version is better than the "ugly" 1986 version, and is still trying to get people to upgrade), text reflow happens in the expected way, and local formatting is harder than using the appropriate style (local formatting usually breaks the design and layout of publications longer than a brochure). It can produce arbitrarily high resolution graphs from the PS/EPS/PDF graphics output from programs like R (or Adobe Illustrator if you make Tufte-esque graphs the hard way), so your poster presentation can look awesome.

What do I use?

I like the look of LaTeX (it doesn't look like it's from the marketing department, and thus gets a more serious treatment sometimes), and it gradually becomes the easiest way to write math, so I'm actually taking class notes with it. But I'm not "writing LaTeX". I use LyX, which looks like a text editor, has multilingual spellcheck built in (handy for some citations and quotes), can run on both the "numbers" machine and the "office" machine, and has a menu where I can look up the proper name for "the integral symbol with the circle on it" when I forget that it's "\oint" (but \varointctrclockwise if you need a little counter-clockwise arrow instead of a circle).

Math symbols http://en.wikibooks.org/wiki/LaTeX/Mathematics

"Ugly delta" http://www-cs-faculty.stanford.edu/~uno/cm.html

"Handwriting recognition" for LaTeX: http://detexify.kirelabs.org/classify.html

## June 17, 2010

### Good Slideshows: includes Good Powerpoint, Good Keynote, Good Impress

PowerPoint/ Keynote/ Impress/ other slideshow, or (even better because of wider compatibility) a pdf formatted for projection, is great for PowerPoint sized ideas. Note that Keynote does not have a freely available viewer (thus, no link). If you use it, you are depending on your machine working, or other people having Keynote on their machines. Slide projectors are also rare these days, so be similarly wary of them. PowerPoint and Impress have similar risks, but free availability of a viewer for the former, and the program itself for the latter, ameliorates this risk. PDF viewers are, however, extremely common. At full screen, they are great at showing slide shows. A PDF of the presentation on a USB stick is the current state of the art for a portable slideshow format. Have a second USB stick with a backup copy if it's a really important show.

I've seen a lot of positions that require "good PowerPoint skills". I can't tell if they think that "PowerPoint" is a synonym for "Public Speaking", or if they are just PP-dependent and need someone to assemble slide decks for them. I know that for any of these positions that I consider, I will be asking if that requirement is because A: they assume that good PowerPoint is good speaking, B: PowerPoint is actually a good format for the kinds of results their company finds and/or needs, or C: the audience requires PowerPoint. In case A, I will point out that while I can put together PowerPoint slideshows, 4 years of Toastmasters, with a CC (Competent Communicator, formerly Competent Toastmaster) and CL (Competent Leader), may be a better indicator of my speaking ability. In case C, I'll be very discreet about the follow-up inquiries, because audiences that require PowerPoints are a red flag to me. Do they want only soundbites? Can they read technical reports when there are interlinkages and not just bullet points? Are they literate/numerate?

The points from "The Seven Habits of Highly Effective Pirates" (from Schlock Mercenary) are excellent examples of PowerPoint-able ideas. I recommend one per slide, like panel three of the linked comic. It is an almost perfect slide. A picture demonstrating the rule is sometimes stronger, if it doesn't upset the stomachs of your audience. Oh wait - that's panel four, but panel four has the demonstration first, and the rule second. Wrong order for audiences whose attention span is shorter than their reading speed.

Also note, clipping this panel and pasting a comic under the bullet point with the text of the rule does not work well. It is just bad design, and makes you look like a combination of stodgy (must have bullet point) and uncreative (particularly if you use a Dilbert clipping - in PP they are now trite, not ironic). If you're going to take someone else's work like that, do three things: get permission, make it as big as possible on the slide, and put a little bullet under it giving attribution to the artist. This shows that you understand how it highlights your words (which is the part that is your creativity), and are willing to share the credit where due (the picture itself is not your creativity). A highlight is best done large, right?

There are two other cases that are very closely related to each other: pictures and graphs. A picture of a hazard is much better than any technical description, and a florid description that replaces the picture would probably be discredited as "unprofessional". Compare the florid description: "there's a bunch of round rocks held together by mud at the top of a dirt cliff above a building site, they'll fall on you if you sneeze too loud", with the technical presentation description: "there is a rather steep, poorly vegetated slope above the site that is capped by a loosely consolidated conglomerate of riverstone."

Graphs are awesome for presentations. Use them! But don't use all of them, or use them all the time. Really - use graphs to show the patterns, use words to give your presentation. Only show the graphs you want people to remember, those that are important to your presentation. And turn the projector off when you need people to pay attention to what you are saying. If you can't turn it off, put either blank slides or uninteresting, seen-them-too-many-times-to-care, plain-background corporate logo slides between your graphs and photos. This way the audience will listen to you when needed, and see graphs and charts and photos when they need to learn through their eyes. In dire straits, put a business card in front of the projector when you need eyes back on you.

In a twist that won't surprise anyone who has sat through "just a quick PowerPoint on our idea" or a 45-minute "highlights of our wedding" slideshow, slideshows are also useful for another situation: the presentation of "nothing". Best explained at the end of a New York Times article about the use of PowerPoint in the military:

Senior officers say the program does come in handy when the goal is not imparting information, as in briefings for reporters.

The news media sessions often last 25 minutes, with 5 minutes left at the end for questions from anyone still awake. Those types of PowerPoint presentations, Dr. Hammes said, are known as “hypnotizing chickens.”

So, PowerPoint and other slideshows do have good uses. But please, try not to hypnotize the chickens unless you really mean to!

## June 4, 2010

Anscombe (1973)

Possibly one of the most famous collections of artificial data sets. And rightly so. If you're not familiar with "Anscombe's Quartet", follow the above link.

Now that we've had that graphic reminder to LOOK AT THE DATA, I'm going to skip the usual next step of writing about what to do, and instead refer you, kind reader, to the discussion of this data at the Princeton Office of Population Research. The discussion is located at this link.
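For anyone who wants to poke at the quartet directly, here is a small sketch that computes the usual summary statistics for each of the four sets. The values are typed in from the commonly reproduced table of Anscombe's data, so check them against the original paper before relying on them. The function and variable names are mine.

```python
import math

# Anscombe's four data sets; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return mean of x, mean of y, least-squares slope and intercept, and correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return mean_x, mean_y, slope, intercept, r

for name, (x, y) in quartet.items():
    mean_x, mean_y, slope, intercept, r = summarize(x, y)
    print(f"{name:>3}: mean x {mean_x:.2f}, mean y {mean_y:.2f}, "
          f"fit y = {intercept:.2f} + {slope:.3f} x, r = {r:.3f}")
```

The four printed lines agree to about two decimal places, which is the whole point: the numbers cannot tell the sets apart, and a scatterplot can.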

## June 2, 2010

### Lost in log-space: residuals of log-transformed data

Yesterday I was helping Jim with a model for predicting non-profit registration by county, and a little problem came up about how to explain some data. He had found a really strong relation, ran the model, and made a pretty choropleth of the results. But the legend had entries like "-0.500 standard deviation to 0.500 standard deviation". And the underlying data had been log transformed. He didn't have time to rebuild all of his maps, so I was asked this question: "What do I tell them this means?"

Good question! Log transformation of data is a common technique to deal with several problems: typically anything where the scatterplot of a relationship looks log-normal, "looks like an exponential curve", has "too long" of a rightward tail, or numerous other things. I will leave aside the question of whether the log transform is the right one in a particular instance, and proceed directly to today's (well, yesterday's) question:

When analyzing the residual deviations after the model is fit, what does the deviation of the log of the variable of interest mean? And make it friendly, statistics guy!

Well... ouch. I don't like the implication of "y'all are hard to understand". But then again, a lot of people feel that way, and that's why analysts are paid to be analysts. So what to do? What do "regular people" want to see, that is also a good representation of the reality?

Percent deviation from predicted.

And here's what they don't want to hear (but you want to do):

• If you have "standardized residuals" (how very statistical of you!), first multiply by the RMSE to get actual deviations.
• Exponentiate the deviations (raise e to each one), and respect the sign of the deviation!
• Now you have the ratio of measured over predicted. If it's greater than 1, then subtract one and multiply by 100 to get the percent "high". If it's less than one, then subtract the ratio from one and multiply by 100 to get the percent "low".

Use those numbers for your labels, don't explain the funky transformation details to a lay audience, and enjoy the lack of flustered puzzlement on the audience's faces!

Example, just for practice:
Say your RMSE is 0.5. Then one deviation high (positive) is e to the 0.5: about 1.65; and one deviation low is e to the -0.5: about .61.

So one up is about 65% high, and one down is about 39% low.
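The same recipe as a small sketch, for relabeling a whole legend at once. The function name is mine, and it assumes the model was fit on natural logs (as in the example above) and that the residuals were standardized by dividing by the RMSE.

```python
import math

def percent_label(standardized_residual, rmse):
    """Turn a standardized residual from a natural-log model into a lay-friendly label."""
    deviation = standardized_residual * rmse   # back to log units
    ratio = math.exp(deviation)                # measured / predicted
    if ratio >= 1:
        return f"{(ratio - 1) * 100:.0f}% high"
    return f"{(1 - ratio) * 100:.0f}% low"

# Legend breakpoints in standard deviations, relabeled for an RMSE of 0.5.
for z in (-1.0, -0.5, 0.5, 1.0):
    print(f"{z:+.1f} sd -> {percent_label(z, 0.5)}")
```

Note that the labels are not symmetric: one deviation up and one deviation down give different percentages, which is expected on a log scale.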

One additional note here: symmetrical percent bands, like 10% to 40%  high or low, will be misleading because the "high" band is actually expected to have smaller counts than the "low" band if a log model is correct. But this is the way that people have become accustomed to having data presented, and people think of it as "fair", despite this potential bias.

## April 16, 2010

### Internationalization is also intranationalization

There are currently four big more-or-less homogeneous markets. By population: China, India, the EU, and the USA. With EU and Indian data, one is quickly aware of the need to detect language early in data processing. While I am not certain about the situation in China, the other "big market" gets certain assumptions made about it. This note covers a "gotcha" about the addresses.

When processing US addresses, you might be tempted to assume that there is a name - like "first" - and then maybe a type - like "street" or "avenue". If you have a DBA-like normalization instinct, then you'll be tempted to peel off the street name to one field, and have a table of street types. In fact, this is automatic for a lot of address validation software.

And assuming that all the inputs are valid USPS deliverable addresses, you should get valid outputs, right?

Let's momentarily ignore the "obvious" challenges ... PO Boxes... APO/FPO... highway contract, rural route... after all, the commercial software does this pretty well. For this example, we'll even cheat, and not really check for deviations from expected formatting. We can just glance at the first few ZIP codes. In 00677, we find one of the hosts of the Rincón International Film Festival at the very typical, for Rincón, PR:

Casa Isleña Inn & Tapas Bar
Barrio Puntas
Carr. Int. 413 km. 4.8
Rincón, PR 00677

Or the Starbucks, a very anglophile environment:

1417-1425 Ave Ashford, San Juan 00907, Puerto Rico

Note the word order. This is not on "Ashford Avenue", but on Avenue Ashford. A more local establishment would be:

Cafe con Leche, 1-9 Calle Bajada Matadero,
San Juan 00901, Puerto Rico

Make certain that your data parsing software can deal with all of these as well. "Calle de" as a street "name" has been the most common language difficulty for US address standardization that I have seen, with significantly fewer problems coming from other local language uses, like the "Rue" prefix in southern Louisiana, native names in tribal areas, and the use of Hawai'ian.
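As a rough sketch of the word-order point, a parser can at least check both ends of the street string for a type word instead of assuming it is always last. The lookup set and the function name here are hypothetical and deliberately tiny. Real address standardization needs far more than this (think of streets named "Avenue B", or the "de" in "Calle de ...").

```python
# Illustrative lookup only; a real list of street types would be much longer.
STREET_TYPES = {"ave", "avenue", "st", "street", "calle", "rue", "carr"}

def split_street(street):
    """Split a street string into (type, name), looking for the type at either end."""
    words = street.split()
    if not words:
        return (None, "")
    first = words[0].rstrip(".").lower()
    last = words[-1].rstrip(".").lower()
    if first in STREET_TYPES:
        # Type-first order, as in "Ave Ashford" or "Calle Bajada Matadero".
        return (words[0], " ".join(words[1:]))
    if last in STREET_TYPES:
        # Type-last order, as in "Ashford Avenue".
        return (words[-1], " ".join(words[:-1]))
    return (None, street)

for example in ("Ashford Avenue", "Ave Ashford", "Calle Bajada Matadero", "Carr. Int. 413"):
    print(example, "->", split_street(example))
```

Checking the prefix before the suffix is an arbitrary choice in this sketch; either order will misfire on some real street names.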

The takeaway is that even in the "homogeneous" market, a certain amount of "internationalization" is still needed, even if you somehow manage to totally avoid having any foreign data - hard if the internet is involved. In the US, the selection of an official language is a power left to the states and territories, and last I knew, at least three included languages other than English. (Puerto Rico, New Mexico, and Hawai'i)

## April 13, 2010

### Quick Check: water surface temperature

The check: Daily average water surface temperature is usually well correlated with the lagging three day average of the daily maximum air temperature. This is robust enough that it can be used as the basis for an interpolation model, and outperforms a lag of the average air temperature.

Yes, this summary does skip a lot of analytical details: the tables of correlations between site observations for multiple sites, the various linear models fit and tested, and the analysis of the overlap of gaps in data. This method was found to be a particularly robust method, since daily high/low temperatures are frequently available from multiple sources at a research site. A perfect model for filling in missing data does not help if the data it needs is missing at the same time!
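A minimal sketch of the idea, not the analysis described above: compute the three-day average of daily maximum air temperature, fit a straight line to the days where water temperature was observed, and use that line to fill the gaps. The function names and the toy data are invented for illustration. I am assuming the three-day window ends on the current day; shift it if your "lag" means the three days before.

```python
def trailing_mean(values, window=3):
    """Mean of the current and previous (window - 1) values; None until enough history."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fill_gaps(water, air_max, window=3):
    """Replace missing (None) water temperatures with predictions from lagged air maxima."""
    lagged = trailing_mean(air_max, window)
    pairs = [(a, w) for a, w in zip(lagged, water) if a is not None and w is not None]
    intercept, slope = fit_line([a for a, _ in pairs], [w for _, w in pairs])
    filled = []
    for a, w in zip(lagged, water):
        if w is not None:
            filled.append(w)          # keep real observations
        elif a is None:
            filled.append(None)       # not enough air history to predict
        else:
            filled.append(intercept + slope * a)
    return filled

# Toy data, made up for illustration: daily max air temperature and water temperature.
air_max = [20, 22, 24, 26, 28, 30]
water = [None, None, 12.0, 13.0, None, 15.0]
print(fill_gaps(water, air_max))
```

A real version would also want to check the fit on held-out days before trusting the filled values.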

Supposition:
Both daytime temperature and water warming seem likely to covary with daytime insolation - a major source of heating.  Water cools off a lot by evaporative heat loss, which will drop when the air is cool. The air primarily cools off by radiative cooling, which also happens to water but perhaps not at the rate that evaporative cooling happens.

Aside:
I have measured clear blue sky with my little non-contact thermometer as being 4F (about -16C) in mid-afternoon in Colorado on an 80F day. There's a lot of potential in radiative cooling.

## April 9, 2010

### Data Sources and Software lists

In an effort to simplify things for me, I am making a "data sources" page to merge a large scattered set of references into one place. If you find this helpful for you, I'd be pleased to hear what it helped you with, and perhaps a pointer to the results.

In a similar vein, I have started a "software" page. There are so many moving parts in a complete analytical system that I find it helpful to have a list of which pieces can go where. From ETL (extract-transform-load) to presentation, with useful stops at analysis, programming, data management, and even operating systems and environments. And side notes on geodata/GIS capability if that needs to be in the system too. I hope to have a post at some point giving a couple of recommended flows through these, for people of various means and needs.

It has already made one thing clear: SAS, C/C++, SQL, C#, perl, Python, and friends cover a lot - at least up to the tens of terabytes - but I see I still have some gaps at the low-cost petabyte scale. I might do well to sharpen my Java skills, and add Pig (the programming language) skills.

## April 8, 2010

### What is a Data Gorilla?

Data Gorilla (noun) - a Data Monkey that has grown up and learned to use power tools.

This particular data gorilla uses a lot of windage, common sense and "let's ask what they really meant by this" calls as data power tools, with backup from SAS, R, C++, C, Python, perl, and whatever else is handy. Done previously in contexts of climate research, consumer modeling, behavioral forecasting, machine learning, error analysis, and plain old sanity checks.