July 27, 2010

Scientific Grammar

Scientific writing is sometimes hard to read because of bad grammar, even more than because of strange abbreviations and technical terminology. This is sadly expected in journal articles, even though clear writing makes it more likely that someone will read far enough into your research to use it and cite it. It is also the reason that so many whitepapers are written by non-experts: the sponsoring organization wants people to read them, not fear them.

"The Science of Scientific Writing" (Gopen and Swan, American Scientist, Nov-Dec 1990) is an article that does a great job at documenting these problems and showing how to fix them. The article stresses a simple pattern: start with the familiar: end with the new. As they put it:

"In our experience, the misplacement of old and new information turns out to be the No. 1 problem in American professional writing today."

Gopen and Swan back their thesis up with "worked examples." Taking passages from published articles, they show how to revise them for clarity.

The article ends with seven rules to summarize what they have found. I am putting the rules here to remind myself of them, and to entice the reader unfamiliar with them to visit the original article and learn what they are reminders for.
  1. Follow a grammatical subject as soon as possible with its verb.
  2. Place in the stress position the "new information" you want the reader to emphasize.
  3. Place the person or thing whose "story" a sentence is telling at the beginning of the sentence, in the topic position.
  4. Place appropriate "old information" (material already stated in the discourse) in the topic position for linkage backward and contextualization forward.
  5. Articulate the action of every clause or sentence in its verb.
  6. In general, provide context for your reader before asking that reader to consider anything new.
  7. In general, try to ensure that the relative emphases of the substance coincide with the relative expectations for emphasis raised by the structure.
This article won't replace "The Elements of Style" by Strunk and White, but it is a useful addendum for the scientific writer.

July 26, 2010

Missing people in US phone surveys

It is true that people often like to denigrate statistics derived from survey data, but the reason that I hear most frequently - "But five thousand is less than 0.1% of 300 million!" - is not actually a significant source of error. The error to watch for more carefully is sampling bias.

For a long time, "random dialing" has been a great way to get a random sample covering about 99% of the population (the 1% without phones were generally considered ignorable for most purposes - everyone imagined jails and wilderness hermits in cabins). As long as incoming calls were free to telephone subscribers, and most of the population had phones, this was almost ideal. It wasn't actually sampling people, but households, since there was generally one phone per house. However, with this sampling frame it is possible to stratify cleanly by household size, and get back to estimating individuals fairly easily. Early in the 20th century there was some problem with the number of households that had no telephone service, but as this shrank toward 1%, the error became negligible.
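Here is a minimal sketch of that household-to-person adjustment, assuming one completed interview per sampled household; the data and the yes/no question are invented purely for illustration.

    # Sketch: moving from a household sample to a person-level estimate.
    # With one interview per household, a person in an n-adult household had
    # a 1/n chance of being the one interviewed, so weight each respondent
    # by the number of adults in the household. All data below are made up.
    respondents = [
        {"adults_in_household": 1, "answer": 1},
        {"adults_in_household": 2, "answer": 0},
        {"adults_in_household": 2, "answer": 1},
        {"adults_in_household": 4, "answer": 0},
    ]

    weighted_yes = sum(r["adults_in_household"] * r["answer"] for r in respondents)
    total_weight = sum(r["adults_in_household"] for r in respondents)

    # Person-level estimate of the proportion answering "yes".
    print(f"estimated proportion: {weighted_yes / total_weight:.2f}")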

In the early 21st century, we now have a different problem: mobile phones often charge for incoming calls, so they are not allowed to be "random dialed", and more and more households are relying on them exclusively. How many? Well... I'll show you a picture.

[Chart: reachability by landline "random dialing", by age group]

You can clearly see a fairly dramatic age bias. It appears that 25-year-olds are at about 50% reachability by "random dialing", while other age groups may be as high as 90% reachable. Less obviously, that 50% has a pretty good chance of being correlated with other aspects of their lives. Can anything be done? Maybe. At the very least, stratify your sample by age and use data like that in the above chart to correct for reachability by age group.
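Here is a minimal sketch of that correction (post-stratification by age); the age brackets, shares, and response rates are all made up for illustration, not read off the chart.

    # Sketch of post-stratification by age: reweight each age group's mean
    # response to the group's population share, instead of its (biased) share
    # of the completed interviews. All numbers below are illustrative.

    # Hypothetical population shares by age bracket (sum to 1).
    population_share = {"18-29": 0.22, "30-49": 0.36, "50-64": 0.25, "65+": 0.17}

    # Hypothetical share of completed interviews falling in each bracket.
    sample_share = {"18-29": 0.11, "30-49": 0.35, "50-64": 0.30, "65+": 0.24}

    # Hypothetical mean response (proportion answering "yes") within each bracket.
    group_mean = {"18-29": 0.61, "30-49": 0.48, "50-64": 0.42, "65+": 0.39}

    # Unweighted estimate: the sample average as collected.
    unweighted = sum(sample_share[g] * group_mean[g] for g in group_mean)

    # Post-stratified estimate: each group counted at its population share.
    weighted = sum(population_share[g] * group_mean[g] for g in group_mean)

    print(f"unweighted: {unweighted:.3f}, post-stratified: {weighted:.3f}")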

If you have another sampling technique available, you can try to use it to infer differences within an age group between households with a landline and households without one, and then adjust your results for that. If the topic is at all controversial, though, brace yourself. Even though you should now have more valid results, the people who agree with the groups that were originally overrepresented will direct very pointed charges of "lying with statistics" at you, even though all you actually did was "question with statistics to the best of your ability."

There are many biases that will be harder to find. The most usual one: the survey was planned by people who wanted to hear from "regular people", that is, people like themselves. A sampling plan was then drawn up that seemed like a good way to meet "random" people - often the same way they meet random people, which is where they met the people they now hang out with: people they are generally similar to and agree with. It's a very confirmatory feeling to have your survey agree with you, so people who have executed a survey this way may be a bit defensive if you suggest that it is perhaps just maybe a little tiny bit biased. Unfortunately, these surveys seem to be some of the most common ones in politics.
Sigh.

Update: The National Marine Fisheries Service is trialling a switch to postal surveys in light of the increasing problems with telephone surveys.

July 22, 2010

Parenting alone: the googlefight

From the realm of hard-to-interpret statistics based on easy-to-get data with unknowable biases:

From a Google search, with the language restricted to English:

"single mother"
3,030,000
"single father"
568,000
"single parent"
5,400,000


From"googlefight.com":

"single mother"
24,000,000
"single father"
12,800,000
"single parent"
3,670,000


Major news stories have been based on less than this. Data that is both easily accessible and reliable is great, but when the accessible data is not reliable, the easy should not replace the reliable. Check your data before taking it seriously.

The serious data, for the US, from the US Census Bureau (Jan, 2010 press release):

In 2009, 12 percent of the 1.7 million father-only family groups with children under 18 were maintained by an unemployed father, compared with 7 percent in 2007. Of the 9.9 million mother-only family groups, 10 percent were unemployed in 2009 compared with 6 percent in 2007.

Or reformatted, the 2009 Census data:

"single mother"
9,900,000
"single father"
1,700,000
sum,
single parents
11,600,000

And a lot of unemployment.
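Back-of-the-envelope, from the Census percentages quoted above (my arithmetic, not the Bureau's):

    # Rough scale of the unemployment behind those percentages (2009 figures).
    single_fathers = 1_700_000   # father-only family groups with children under 18
    single_mothers = 9_900_000   # mother-only family groups

    unemployed_fathers = 0.12 * single_fathers   # about 204,000
    unemployed_mothers = 0.10 * single_mothers   # about 990,000

    print(f"roughly {unemployed_fathers + unemployed_mothers:,.0f} unemployed single parents")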

PS: The pattern in the unemployment numbers recurs in US data: women have higher unemployment than men when overall unemployment is low, and men have higher unemployment than women when overall unemployment is high. All kinds of odd questions are suggested by this: Do women have more stable jobs? Is gender-correlated pay inequality causally related to apparent gender-correlated job security? And if so, which way - would low pay cause secure work, or secure work cause low pay?

PPS: There's no immediately obvious link on Googlefight to find out how they get their numbers. Anyone know why they're so different from what I see "fresh from Google"?

July 18, 2010

Mathematical Statistics (1) in Public: indecency, or pearls before swine?

This post is a re-edit of a half-serious note that I was inspired to write by the use of words in a snippet of tongue-in-cheek conversation about professional statistics. Hopefully at least a few people will find it an enjoyable read.

The inspiration follows, with no apologies for the really awful pun - stop reading this post now if you can't suffer punning. That's the mood I was in, so the pun is required for accurately setting the scene.

Sean:

[...]In statistics one should be opinionated. After all, it is generally not considered a field of discreet mathematics. This is a conclusion most often reached when indiscretion prevails, no?

Anna:

I disagree - in statistics, one should not be opinionated, one should be *right*. And clear, succinct, and to the point.

Well... I pretty much completely agreed with her, and felt that her rebuttal didn't really address what I had said. So already being in a very silly mood, I wrote some explanations of how I thought the words used applied to statistics.

Result:

Discretion:
Advised in survey preparation and use, and in development of techniques which will be subject to intellectual property protections (2). In announcing results, discretion greatly increases the chance that someone else will get credit for the work (3).


Opinionated:
Use at exactly the same time as being discreet, at all times in the planning stage of a study, and in the execution and data preparation. Stop immediately upon reaching EDA or analysis. If you are not strongly opinionated in the pre-analysis, you get a bad formulation of a poorly thought out question, a biased crappy design, crappy biased data, results that are pure crap, and a bunch of people saying "One time, at band camp, I read 'How to Lie with Statistics'...". During analysis of the data, shut up and let the data and the PLANNED analysis have their turn. If you want to outsmart the data to support an opinion during the analysis phase, quit mathematics, go back to school for an MBA, and sign up as a derivatives trader or "investment banker". (4)


Rightness/correctness:
There's the question - there in methods; here's the answer - here in results. Statistics are a measure, like the time or the mass of the moon, and thus have an accuracy and an error instead of a right and a wrong. Beliefs and opinions, even informed opinions, theses, hypotheses, and other vanities of Man (5) can be right and/or wrong... so of course we may have a wrong idea of what question we thought we answered. Re-check the question... is that what you thought you asked? (6) Bad execution could lead to a "wrong" answer, but we can find out what you did correctly answer by reading your notes. You have notes on the process, right?? As to having a "right" answer, meaning an accurate one, well... we have methods for two questions (7): how confident am I that this is the right answer, and how right do I think the answer is. Bayes or non-Bayes is the choice; "rightness" is then *measured*.


Clear:
Look at "rightness". See that part about the question? You better know as exactly as you can what question you answered or you will quickly leave the field of mathematics and enter the field of mumbling. Beyond that, "clear" is a communication issue. (8)


Succinct:
The less succinct the design and analysis, the less likely you or anyone else is to understand what was done and what happened. Suspect anything that requires too much explanation or is excessively sophisticated (9) of being misunderstood and misapplied.


To the point:
As phrased, this is not part of mathematical statistics, but part of communications. See succinct for a related topic in mathematics. (10)

Notes:

(1) My latest job search has revealed to me the semantic difference between mathematical statistics and statistics. "Statistics" is what a manager with an MBA and an Excel plug-in does and "has 10+ years of experience" in. "Mathematical Statistics" is the kind you study in school and have publications in.

(2) Carefully note that pure mathematics is unpatentable, categorized as properly being either a discovered work of God or a discovered element of Nature - in either case a discovery, not an invention. Applied math has fewer such protections, so long as the application of the math(s) is part of the patent application.

(3) Think of it this way: if the results generate a scandal with your name on it, then 'everyone knows' it's your work!

(4) Yes, this does lead to the conclusion that one should announce results in an unopinionated but indiscreet manner. It's no longer opinion, so announce the discovery with great excitement, not with great opinion.

(5) Capitalization used to indicate the common name of a species.

(6) Seriously, of "description", "methods", "analysis/results", and "discussion", only "methods" and "results" really matter. "Description" and "discussion" are opinion pieces about what the author thinks was asked, and what they think the results might mean in relation to what they think was asked. Check them with propaganda filters fully powered.

(7) The choice of which question should be asked seems to be a religious issue among mathematicians. And like most religions, they have sophisticated definitions for what they actually measure - definitions that, when carefully evaluated, can only be answered if you have access to all possible universes, meaning that humanity can never actually test against their "gold standards" and just has to depend on the chosen method as a matter of faith. Like the "do parallel lines in this universe actually stay the same distance apart" problem, but with even less hope of ever having a usable answer. (11)

(8) Only use it when you want to be taken seriously.

(9) Check the pre-nineteenth-century definitions of "sophisticated" to really understand how much to trust sophistication. While the dictionary might say that these definitions are obsolete, using them seems to make sentences with "sophisticated" in them more correct than the "modern" definition does.

(10) By the way, "to the point" is a highly advisable communications strategy; people have short attention spans. Two hundred words for a press release, five hundred for a press article, as few as possible for a published paper. When two papers about the same discovery are published, observation has led me to suspect that the shorter one both gets read more and gets more credit. Make it shorter than what I wrote here, capisce?

(11) For any English-as-a-second-language readers: yes, it really is "a usable" and not "an usable". _Spoken_ English governs the use of 'a' versus 'an', and 'usable' is pronounced as though it begins with a 'y'. It's just like saying "an 's.d.r.'", in which 's' sounds as though it starts with an 'e' - but vice versa. Too many misuses in academic papers, making them noticeably harder to read smoothly, have made this a pet peeve of mine.

July 14, 2010

Academic Presentation Hints

At some point, we academics have to give presentations - to classes, as students or lecturers, or simply to raise interest in our work. Here are a couple of notes for those terrified of the day they present their research.

If using a slide presentation program, have a backup copy on a USB stick. At least with a math audience, there's going to be another working computer in the room. It might not be running the same system, so a PDF-based backup is recommended - almost every laptop you meet will have a PDF reader that can be set to full screen to display slides.

Find out the location, and check the room before you present. This way you can make certain to have the right "dongle" for your computer that day. And even though it seems like half of the time the room is changed at the last minute, usually the new room has a similar technical setup.

Avoid ever giving a speech using new software or a new computer.

Try to give your speech to an empty room at least once before you "go public". You'll unconsciously improve your presentation, which is great. But the main thing you want to do is time how long it is - for most people, it's longer than the time allotted. Now you can cut out the less important bits instead of chopping off the end of the speech. The end is usually where you put what you want people to remember walking out, making that a particularly painful thing to have to do.

There is always a chance that this will be the day that a virus swarm takes down all the department computers, the microphone fails, and the power goes out halfway through your talk. Practice your material enough that you don't have to see the slides, or have a printout with you if you do have to see them.

If you know this will be central to your career - either you will be teaching, or the job postings you apply to mention "PowerPoint skills desired" - then get substantial practice. Take a public speaking class, join Toastmasters, give presentations at schools/libraries/wherever, take a Carnegie class, join an improv theater... *something* to give you some comfort in front of an audience, because that little bit more comfort will make you look a LOT better to the audience.

Links to relevant free downloads:

Toastmasters "ten tips for public speaking"

"How to Give an Academic Talk" - from a humanities prof, but relevant points for all academia.

"The Art of Public Speaking" by J. Esenwein and Dale Carnegie
- Google book
- eBook formats, at manybooks.net
- Gutenberg project (text, good source for text-to-speech):

The 7 Easy Steps to Becoming a Public Speaking Failure, because not everyone has the same goals.

July 13, 2010

Why Mathematicians use LaTeX

LaTeX is a typesetting system that mathematicians are expected to be able to use. In a class I am taking, we had a presentation about it, and my response was too long to reasonably post for the class, so I decided to put it here instead. If you ever wonder why people who live by publishing papers might use something other than Word or another word processor, here's why...

Why does LaTeX exist, and what does it do?

Because in the late 1970s Don Knuth, a perfectionist, was writing a book (it turned out to be several volumes) about theoretical computer science. He didn't want a human typesetter who didn't know what the symbols meant to goof up his work, and desktop publishing hadn't happened yet, so he wrote TeX, a system he could use to typeset the book for publication himself. LaTeX, built on top of TeX by Leslie Lamport in the early 1980s, adds the higher-level document structure most people actually write in. Thus, LaTeX is very good for publishing books about topics that involve a lot of math. (Other people, these days, use programs like Quark - which is not very friendly for mathematical notation.)

Why do I not want to use it?

It is not good for writing a note to ask for a meeting time. The setup time can be long, even though it tends to be a one-time job. It does not make pasting gifs from the internet into your document easy. It will try to format your document into a structured whole, even if it is too small to be "structured".

Why do I want to use it?

Because you want to write a topic once, publish it many ways (thesis, reviewed paper, whitepaper, lecture slides, poster, book chapter), and have it look really good every time. Paste the "chapter" preamble onto last summer's "paper" content, and submit. Another entry on your publication list. And never wonder about the location of "the equation on the previous page" again, thanks to the automatic reference system, which will also rebuild your table of contents for you. And it can handle ALL the math symbols! And BibTeX, and the AMS style sheets, and... essentially, all of the normal and specialty features that mathematicians (including statisticians) need are freely available for it on the web.
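For a feel of what that looks like, here is a tiny, purely illustrative document (no particular journal or thesis class assumed) showing the automatic numbering and cross-referencing:

    % A minimal illustrative document: automatic numbering and cross-references.
    \documentclass{article}   % swap in a thesis, book, or slide class; the body stays the same
    \usepackage{amsmath}      % AMS math environments and \eqref

    \begin{document}

    \section{Background}\label{sec:background}

    \begin{equation}\label{eq:gauss}
      \int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}
    \end{equation}

    As Equation~\eqref{eq:gauss} in Section~\ref{sec:background} shows,
    every number is filled in (and renumbered when things move) at compile time.

    \end{document}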

How hard is it?

The learning curve is strange, since you can copy and paste from your previous work for most things, and never actually learn it. You do end up learning the special symbols, like \nabla and \pi and \rightarrow. On the upside, these are very handy for understanding other mathematicians in text-only emails and discussion groups. You can even text them!

Why is this important?

Because this is the established format for math publishing. No platform incompatibilities, no version incompatibilities (even though Knuth says the lowercase delta in the 1992 version is better than the "ugly" 1986 version, and is still trying to get people to upgrade), text reflow happens in the expected way, and local formatting is harder than using the appropriate style (local formatting usually breaks the design and layout of publications longer than a brochure). It can include, at arbitrarily high resolution, the PS/EPS/PDF graphics output from programs like R (or Adobe Illustrator if you make Tufte-esque graphs the hard way), so your poster presentation can look awesome.
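As a sketch of that graphics path (the file name below is hypothetical): export a plot from R as PDF, then include it with the graphicx package, and it scales cleanly up to poster size.

    % graphicx provides \includegraphics; load it in the preamble.
    \usepackage{graphicx}

    % In the body: a vector PDF exported from R, scaled without losing resolution.
    \begin{figure}[htbp]
      \centering
      \includegraphics[width=0.8\textwidth]{my-r-plot.pdf}  % hypothetical file name
      \caption{A graph exported from R as PDF.}
      \label{fig:rplot}
    \end{figure}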

What do I use?

I like the look of LaTeX (it doesn't look like it's from the marketing department, and thus sometimes gets a more serious treatment), and the gradually-becomes-the-easiest way to write math, so I'm actually taking class notes with it. But I'm not "writing LaTeX"; I use LyX, which looks like a text editor, has multilingual spellcheck built in (handy for some citations and quotes), can run on both the "numbers" machine and the "office" machine, and has a menu where I can look up the proper name for "the integral symbol with the circle on it" when I forget that it's "\oint" (but \varointctrclockwise if you need a little counter-clockwise arrow instead of a circle).

links:

Math symbols http://en.wikibooks.org/wiki/LaTeX/Mathematics

"Ugly delta" http://www-cs-faculty.stanford.edu/~uno/cm.html

"Handwriting recognition" for LaTeX: http://detexify.kirelabs.org/classify.html