Note that when I provide links on this page, it is usually to the home page of the project producing the software, or whatever page they have that is most useful to me. It is usually easier to get the software as packaged in a repository for your production environment, however. Windows and commercial-only software are the exceptions - Windows doesn't have an effective repository/package system, and commercial software typically uses specialized installation methods. On a related note, for any commercial software, don't forget to double-check (before purchasing a license) that it will *both* install and run on your platform!

Primary statistical systems - these are general programs that can carry you through a lot of your analysis. They either are a programming language or can be programmed as part of one, so that they are truly general purpose packages. With the exception of R, these are all "Big Data" capable - they are not limited by your computer's RAM, but by the disk space available. R is a special case - it is just too useful to leave off, even though the default setup is RAM limited. Note that there are both special packages and special distributions of R that get around these limits.
  • R System
  • SAS
  • SPSS
  • Exastat (uses C++ for programming; Windows developed: claims to run under Wine)
  • PSPP
  • DAP (uses C for programming)
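
To make the RAM-versus-disk distinction concrete, here is a minimal Python sketch of a statistic computed in a single pass over a file, so memory use stays constant however many rows sit on disk. This is only an illustration of the idea, not a description of how any of the packages above work internally. The names `streaming_mean_var` and `read_numbers`, and the one-number-per-line file layout, are assumptions made for the example; the update rule is Welford's algorithm.

```python
import math
from typing import Iterable, Iterator, Tuple


def streaming_mean_var(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) using one pass over values."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # Sample variance is undefined for fewer than two observations.
    variance = m2 / (n - 1) if n > 1 else math.nan
    return n, mean, variance


def read_numbers(path: str) -> Iterator[float]:
    """Yield one float per non-blank line (hypothetical file layout).

    Lines are read lazily, so the whole file is never held in memory.
    """
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield float(line)
```

Calling `streaming_mean_var(read_numbers("data.txt"))` (with `data.txt` standing in for your own file) would then summarise a file far larger than RAM, at the cost of only being able to compute statistics that have a one-pass form.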

Data management. Mostly database software or GIS software. While the statistical packages above have data management features, those features are often hard to access from outside the software (notably, SAS is trying to be an exception to this rule - but the exception is expensive). It is much more useful to house the data in software that thinks in terms of "serving data", since all of the packages above are also very adept at accessing external data.

ETL. Data management is much easier once the data is loaded. Custom scrapers are easy to put together, but ETL tools are easier when you need to put together numerous sources. Really, they are! Numerous one-off scripts, when all written by the same person, will become a half-ETL system. Usually no "L". Perl and awk are the canonical tools for creating these one-offs, with Python rapidly joining.
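
For a sense of what one of those one-off scripts looks like when it does include the "L", here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption for illustration: the `name` and `value` column names, the `measurements` table, and the choice of SQLite as the target are hypothetical, not taken from any particular project.

```python
import csv
import io
import sqlite3
from typing import Iterable, Iterator, Tuple


def extract(source: Iterable[str]) -> Iterator[dict]:
    """Extract: read rows from CSV text that starts with a header line."""
    yield from csv.DictReader(source)


def transform(rows: Iterable[dict]) -> Iterator[Tuple[str, float]]:
    """Transform: tidy the name, parse the value, skip rows that do not parse."""
    for row in rows:
        try:
            yield row["name"].strip().lower(), float(row["value"])
        except (KeyError, TypeError, ValueError, AttributeError):
            # Missing column, short row, or non-numeric value: drop the row.
            continue


def load(conn: sqlite3.Connection, records: Iterable[Tuple[str, float]]) -> None:
    """Load: append the cleaned records to a (hypothetical) measurements table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements (name TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO measurements VALUES (?, ?)", records)
    conn.commit()
```

The three stages chain together as `load(conn, transform(extract(open_file)))`, and because each stage is a generator the rows stream through one at a time. Once you have a dozen of these with slightly different `transform` functions, a real ETL tool starts to look attractive.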

GIS software. A lot of geographic analysis is still done in separate packages from the statistical analysis. While both SAS and R have add-on packages that give them significant geographic abilities, they have the respective drawbacks of being expensive or piecemeal (too many overlapping choices make setup a pain). So... GIS software still has a definite place in the data gorilla toolbox. It is very convenient to have a GIS-extended database to back it up, like PostGIS or the Oracle GIS extensions.
  • QGIS
  • ESRI (ARC/many things)
  • SEXTANTE - appears to keep the analysis UI as simple as mapping.
  • Manifold - cheap commercial-grade GIS, Windows only.

Architecture. Yes, architecture - what do these things run on, over, and under. Excepting Netezza (which comes in its own hardware), and Exastat (which seriously needs portability), all of the above software is multi-platform. This does not mean, however, that the platform does not matter. Running the same software (in C++), compiled with the same compiler (gcc), on the same hardware spec, gave a 16-times difference in the amount of data that can be processed in-memory and, in a different case, a 4-times difference in the CPU time used for a process on an unloaded machine (really: one OS only used a total of 20% to 25% of the CPU while running the analysis).
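
A rough way to check the CPU-share side of this on your own platforms is to compare the CPU time a process consumed against the wall-clock time it took. The Python sketch below is only that, a sketch: the helper name `cpu_utilization` is made up for this example, it does not reproduce the C++ comparison described above, and the numbers it gives will vary with the OS, the scheduler, and whatever else the machine is doing.

```python
import time
from typing import Callable, Tuple


def cpu_utilization(work: Callable[[], object]) -> Tuple[float, float, float]:
    """Run work() once and return (wall seconds, CPU seconds, CPU/wall ratio).

    time.process_time() counts user plus system CPU time of this process;
    time.perf_counter() measures elapsed wall-clock time.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    ratio = cpu_used / wall_used if wall_used > 0 else float("nan")
    return wall_used, cpu_used, ratio
```

A ratio near 1.0 means a single-threaded job kept one core busy the whole time; a ratio well below that on an otherwise idle machine suggests the job spent much of its time waiting rather than computing. Running the same workload through this on two operating systems gives a first, crude comparison.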

Compilers and development environments
  • gcc (link to documentation)
  • mingw (reliable gcc on Windows)
  • Eclipse
  • Visual Studio
  • EMACS Old skool! (Link to manual.)

Source control. Source control is awesome for the real world!
  • CVS
  • SVN
  • Mercurial
  • Bazaar
  • Git
  • Visual SourceSafe

Useful languages and language references. Excluding the ones listed under "statistical systems".

Useful libraries
  • GSL
  • Numpy/SciPy

Presentation. Because humans want to see a pretty picture, or at least a report!
  • OpenOffice - with OpenOffice Base (both ODBC and JDBC) & Python
  • MS Office - ODBC, Visual Basic, and COM automation
  • Google Docs - they have a pretty good API, and great ability to distribute!
  • R - Trellis graphics, RODBC, and an outrageous number of specialized output packages! Maps, Chernoff faces, violin plots, rug charts, if you've seen it, it's probably available.
  • SAS - between ODS, SAS integration tools for MS-Office, SAS/Graph, SAS/Map, and friends, you can get the output you want.
  • HTML - Java, C++, Python, PHP, ... all sorts of stuff is available to make cool web output of reports. Paper-ready is a bit harder.
  • World Wind - Need to look at a planet? (original NASA software)
  • Google Earth - Like World Wind, but closed source and promoted by Google. Popular for looking at things, particularly in combination with SketchUp.

Virtualization. Lastly and leastly... usually I find portable software easier than trying to make an OS portable, but there are corner cases. And the quick ability to pick up shop when a piece of hardware crashes is kind of nice now, isn't it? Of course, a good version control system can ease a lot of the pain too, but here's a list of virtual machines that I consider. Note: cross-platform compatibility is a must on my list. Why have a VM that depends on a particular machine?
  • QEMU (emulator, VM... similar in many ways)
  • VirtualBox
  • Xen
  • Parallels
  • VMware
