Primary statistical systems - these are general programs that can carry you through a lot of your analysis. They either are a programming language or can be programmed as part of one, so they are truly general-purpose packages. With the exception of R, these are all "Big Data" capable - they are limited not by your computer's RAM, but by the disk space available. R is a special case - it is just too useful to leave off, even though the default setup is RAM limited. Note that there are both special packages and special distributions of R that get around these limits.
- R System
- SAS
- SPSS
- Exastat (uses C++ for programming; Windows developed: claims to run under Wine)
- PSPP
- DAP (uses C for programming)
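To make the RAM-versus-disk distinction concrete, here is a minimal sketch of a one-pass (streaming) calculation. It is not how any of the packages above work internally. It only illustrates the idea that a statistic can be computed while holding a single record in memory, using Welford's algorithm. The function name and the sample data are made up for illustration.

```python
import io
import math


def streaming_mean_sd(lines):
    """One-pass mean and sample standard deviation (Welford's algorithm).

    `lines` is any iterable of text lines holding one number each, so an
    open file handle works and only the current line is kept in memory.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x = float(line)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    sd = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, sd


# Hypothetical stand-in for a file far larger than RAM.
sample = io.StringIO("2\n4\n4\n4\n5\n5\n7\n9\n")
n, mean, sd = streaming_mean_sd(sample)
print(n, mean, round(sd, 4))
```

Swapping the `io.StringIO` object for `open("some_big_file.txt")` (a hypothetical path) would process a file of any size with the same memory footprint.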
Data management. Mostly database software or GIS software. While the statistical software above has data management features, the data it manages is often hard to access outside of the software (notably, SAS is trying to be an exception to this rule - but the exception is expensive). It is much more useful to house the data in software that thinks in terms of "serving data", since all of the packages above are also very adept at accessing external data.
- PostgreSQL (The elephant before Hadoop)
- MySQL
- SAP DB
- Oracle
- Adabas
- DB2
- Teradata
- Netezza (linking to hardware is hard)
- Cassandra (newish distributed database)
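As a rough sketch of what "serving data" buys you, the example below lets the database do the aggregation so the analysis side only receives a small summary. Python's built-in sqlite3 module stands in for a real server such as the ones listed above. The table name, column names, and values are all hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in for a real database server; the table and
# column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements (site, value) VALUES (?, ?)",
    [
        ("north", 1.0),
        ("north", 3.0),
        ("south", 10.0),
        ("south", 20.0),
        ("south", 30.0),
    ],
)
conn.commit()


def site_summary(connection):
    """Let the database aggregate, returning one row per site."""
    cursor = connection.execute(
        "SELECT site, COUNT(*), AVG(value) FROM measurements "
        "GROUP BY site ORDER BY site"
    )
    return cursor.fetchall()


summary = site_summary(conn)
for site, count, average in summary:
    print(site, count, average)
```

The same query could be issued from R, SAS, or any other client with a database driver, which is the point of keeping the data outside the statistical package.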
ETL. Data management is much easier once the data is loaded. Custom scrapers are easy to put together, but ETL tools are easier when you need to put together numerous sources. Really, they are! Numerous one-off scripts, when all written by the same person, will become a half-ETL system, usually with no "L" (load) step. Perl and awk are the canonical tools for creating these one-offs, with Python rapidly joining.
- YALE / RapidMiner
- Talend Open Studio (Spatial Data Integrator for geo)
- Kettle/Pentaho (GeoKettle for geo)
- Apatar
- Sqoop (Hadoop specific)
- Ab Initio
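For comparison with the tools above, here is a minimal sketch of a one-off script that does include the "L". It is an illustration only: the input text, field names, and table name are invented, and sqlite3 again stands in for whichever database you actually load into.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or scraper.
RAW = """name,temp_f
Alpha, 68
Beta,not recorded
Gamma,212
"""


def extract(text):
    """Extract: parse CSV text into dictionaries keyed by the header row."""
    return csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: drop unparseable readings, convert Fahrenheit to Celsius."""
    for row in rows:
        try:
            temp_f = float(row["temp_f"])
        except ValueError:
            continue  # skip rows such as "not recorded"
        yield row["name"].strip(), round((temp_f - 32.0) * 5.0 / 9.0, 2)


def load(connection, records):
    """Load: write the cleaned records into a table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (name TEXT, temp_c REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?)", records)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
loaded = conn.execute(
    "SELECT name, temp_c FROM readings ORDER BY name"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions is a design choice: when the second and third sources show up, only `extract` usually needs to change.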
GIS software. A lot of geographic analysis is still done in separate packages from the statistical analysis. While both SAS and R have add-on packages that give them significant geographic abilities, they have the respective drawbacks of being expensive or piecemeal (too many overlapping choices make setup a pain). So... GIS software still has a definite place in the data gorilla toolbox. It is very convenient with a GIS-extended database to back it up, like PostGIS or the Oracle GIS extensions.
- GRASS
- QGIS
- ESRI (ARC/many things)
- SEXTANTE - appears to keep the analysis UI as simple as mapping.
- Manifold - cheap commercial-grade GIS, Windows only.
Architecture. Yes, architecture - what do these things run on, over, and under? Excepting Netezza (which comes in its own hardware) and Exastat (which seriously needs portability), all of the above software is multi-platform. This does not mean, however, that the platform does not matter. Running the same software (in C++), compiled with the same compiler (gcc), on the same hardware spec, gave a 16-times difference in the amount of data that can be processed in-memory and, in a different case, a 4-times difference in the CPU time used for a process on an unloaded machine (really: one OS only used a total of 20% to 25% of the CPU while running the analysis).
- Hadoop
- Solaris (link goes to OpenSolaris)
- Linux/Cygwin. Some major "distros" here; Scibuntu and Scientific Linux may be simpler to set up. Cygwin runs as a process under Windows.
- Windows/Wine. Wine runs as a process under Linux. (Windows itself is not readily downloaded.)
- BSD: FreeBSD and OS X (or Darwin) are the most commonly run versions.
- VM and MVS. Sometimes Hercules is useful for checking some of the software for this platform.
- JVM
Compilers and development environments
- gcc (Link goes to documentation.)
- mingw (reliable gcc on Windows)
- Eclipse
- Visual Studio
- Emacs - Old skool! (Link goes to manual.)
Source control. Source control is awesome for the real world!
- CVS
- SVN
- Mercurial
- Bazaar
- Git
- Visual Source Safe
Useful languages and language references. Excluding the ones listed under "statistical systems".
Useful libraries
- LAPACK
- GSL
- NumPy/SciPy
Presentation. Because humans want to see a pretty picture, or at least a report!
- OpenOffice.org - With OpenOffice Base (both ODBC and JDBC) & Python
- MS Office - ODBC, Visual Basic, and COM automation
- Google Docs - they have a pretty good API, and great ability to distribute!
- R - Trellis graphics, RODBC, and an outrageous number of specialized output packages! Maps, Chernoff faces, violin plots, rug charts, if you've seen it, it's probably available.
- SAS - between ODS, SAS integration tools for MS-Office, SAS/Graph, SAS/Map, and friends, you can get the output you want.
- HTML - Java, C++, Python, PHP, ... all sorts of stuff is available to make cool web output of reports. Paper-ready is a bit harder.
- World Wind - Need to look at a planet? (original NASA software)
- Google Earth - Like World Wind, but closed source and promoted by Google. Popular for looking at things, particularly in combination with SketchUp.
Virtualization. Lastly and leastly... usually I find portable software easier than trying to make an OS portable, but there are corner cases. And the quick ability to pick up shop when a piece of hardware crashes is kind of nice now, isn't it? Of course, a good version control system can ease a lot of the pain too, but here's a list of virtual machines that I consider. Note: cross-platform compatibility is a must on my list. Why have a VM that depends on a particular machine?
- QEMU (emulator, VM... similar in many ways)
- VirtualBox
- Xen
- Parallels
- VMware