R is perhaps one of the most powerful and most popular platforms for statistical programming and applied machine learning.
R is an open-source environment for statistical programming and visualization.
R is many things, which might be confusing at first:
- R is a computer language.
- R is an interpreter.
- R is a platform.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, as an implementation of the S programming language. Development started in 1993. A version was made available via FTP in 1995, released under the GNU GPL. The larger core group and open-source project were set up in 1997.
It started as an experiment by the authors to implement a statistical test bed in Lisp using a syntax like that provided in S. As it developed, it took on more of the syntax and features of S, eventually surpassing it in capability and scope.
R is a tool to use when you need to analyze data, plot data, or build a statistical model for data. It is ideal for one-off analyses, prototyping, and academic work but not suited to building models to be deployed in scalable or operational environments.
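As a rough illustration of those three tasks (analyze, plot, model), here is a minimal sketch using `mtcars`, a small example dataset that ships with R. The choice of dataset and of the columns `mpg` and `wt` is an assumption made purely for illustration.

```r
# Built-in example data: fuel economy figures for 32 cars.
data(mtcars)

# Analyze: summary statistics for two columns.
summary(mtcars[, c("mpg", "wt")])

# Plot: miles per gallon against weight.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Model: simple linear regression of mpg on weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  # intercept and slope of the fitted line
```

Each step is a single function call, which is a large part of why R works well for one-off analyses and prototyping.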
The three key benefits of using R:
- Open Source: R is free and open-source. You can download it right now and start using it. You can read the source code, learn from it, and modify it to meet your needs.
- Packages: R is popular because it has a vast number of very powerful algorithms implemented as third-party libraries called packages. It is common for academics in statistical fields to release their methods as R packages, meaning that you have direct access to some state-of-the-art methods.
- Maturity: R is inspired by the proprietary statistical language S, using and improving the idioms and metaphors useful for statistical computing, like working in matrices, vectors, and data frames.
For more information about R packages, check out CRAN (the Comprehensive R Archive Network). The Machine Learning & Statistical Learning task view, which lists packages for machine learning, will be of great interest.
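The vectors, matrices, and data frames mentioned above are available in base R without any packages. The following is a small sketch of each; the variable names and values (`heights`, `people`, and so on) are made up for illustration.

```r
# A numeric vector: the basic building block in R.
heights <- c(1.62, 1.75, 1.80)

# Arithmetic is vectorized: it applies to every element at once.
heights_cm <- heights * 100

# A matrix: a two-dimensional array of one type, filled column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)
gram <- t(m) %*% m  # matrix product of the transpose with m (3 x 3)

# A data frame: a table whose columns can have different types.
people <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  height = heights
)
tall <- people[people$height > 1.7, ]  # keep rows where height exceeds 1.7
mean(people$height)                    # column-wise summary
```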
There are three key difficulties with using R:
- Inconsistency: Each algorithm is implemented with its own parameters and naming conventions. This can be very frustrating and requires deep reading of the documentation with each new package that you use.
- Documentation: There is a lot of documentation, but it is generally direct and terse. The built-in help is rarely helpful, driving you constantly to the web for complete working examples from which you must derive your use case.
- Scalability: R is intended for use on data that fit into memory on one machine. It is not intended for use with streaming data, big data, or working across multiple machines.
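To make the inconsistency point concrete, here is a small sketch comparing two functions from R's own `stats` package. The choice of `lm()` and `kmeans()`, and of the `mtcars` columns, is an assumption for illustration; third-party packages vary at least as much.

```r
data(mtcars)

# lm() takes a formula plus a data frame.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# kmeans() takes a numeric matrix and a number of clusters instead.
set.seed(1)  # kmeans picks random starting centers
fit_km <- kmeans(as.matrix(mtcars[, c("wt", "hp")]), centers = 2)

# Even extracting results differs between the two:
coef(fit_lm)     # accessor function for the linear model
fit_km$centers   # list element for the clustering result
```

Neither interface is wrong, but moving between them means checking the help page for each function's argument names and return structure.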
The language is a little obscure, but as a programmer, you will have little difficulty in picking it up and adapting examples to your needs.
Commercial companies now support R. For example, Revolution R is a commercially supported version of R with enterprise-oriented extensions, such as an IDE. Oracle, IBM, Mathematica, MATLAB, SPSS, SAS, and others provide integration between R and their platforms.
The Kaggle platform for data science competitions and the KDnuggets polls both point to R as the most popular platform among successful practicing data scientists.