…and now help me in my day-to-day work
I started my career as an academic researcher in the areas of Ecology and Invasion Biology. That involved lots of fieldwork and lab experiments, followed by statistical analyses of the collected data and publishing in peer-reviewed journals. In addition, I taught several data-analysis-heavy courses to university undergraduates and Master's students, including Biostatistics, Population Ecology, and Ecological Modeling.
All of that experience became extremely useful when I decided to leave academia and apply my data analysis skills to solving business problems. Believe it or not, my decision at the time was triggered by the famous Harvard Business Review article that declared Data Scientist the sexiest job of the 21st century. Since then, I have had the opportunity to work for a number of companies in industries as diverse as Chemicals, Telecommunications, Online Gaming, and Insurance.
Time and time again while working as a Data Scientist, I have found myself referring to a select few books on statistical techniques, programming tricks, and project management. These books helped me become who I am today in my profession. I thought it might be of interest to my fellow Data Scientists, aspiring and seasoned alike, if I gave a brief overview of these books. So here it goes, grouped by topic.
Zar J.H. (1999) Biostatistical Analysis, 4th edition. — Pearson Education Inc.
I use this classic text as the main reference when it comes to the internals of standard statistical methods. The book covers commonly used probability distributions, descriptive statistics, one-, two- and multi-sample hypothesis testing, linear and polynomial regression, etc. One particularly interesting topic that is not often seen in other textbooks is “data on a circular scale” (think of time, compass directions, etc.). Numerous and very detailed examples help the reader truly understand how the respective methods work.
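To see why circular data need special treatment, consider a toy example of my own (not from the book): the arithmetic mean of the compass directions 350° and 20° is 185°, pointing in nearly the opposite direction, whereas the circular mean, computed from the sine and cosine components, gives the sensible answer of 5°. A minimal Python sketch:

```python
import numpy as np

def circular_mean(angles_deg):
    """Mean direction of angles (in degrees) measured on a circular scale."""
    radians = np.deg2rad(angles_deg)
    # Average the sine and cosine components, then convert back to an angle
    mean_sin = np.mean(np.sin(radians))
    mean_cos = np.mean(np.cos(radians))
    return np.rad2deg(np.arctan2(mean_sin, mean_cos)) % 360

print(np.mean([350, 20]))        # 185.0 -- misleading for compass directions
print(circular_mean([350, 20]))  # ~5.0  -- the sensible "average" direction
```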
This book is an encyclopaedia of such methods as factor analysis, multiple regression, multiple discriminant analysis, multivariate analysis of variance, cluster analysis, multidimensional scaling, correspondence analysis, and structural equation modeling. But it is not just theory: the book also offers detailed walk-throughs based on real-world datasets, as well as step-by-step instructions and flow diagrams on how to choose and apply these methods in practice.
Dalgaard P. (2008) Introductory Statistics with R, 2nd edition. — Springer
If you want to learn R and use it for statistical analysis, look no further! Written by Prof. Peter Dalgaard, one of the R Core Team members, this book is hands down the best introductory text on this language. After describing the basics of R, the author demonstrates how to use it to implement a wide range of statistical techniques, similar to those discussed in Zar (1999).
Many real-world datasets contain repeated measurements made on the same “experimental units” (e.g., individuals, objects, locations), or observations that are “nested” in some way (i.e., subgroups within groups). Due to the temporal or spatial correlation inherently present in such observations, standard statistical methods do not apply, and one has to employ alternative techniques, such as mixed effects models. Zuur et al. (2009) is an excellent and accessible introduction to this otherwise sophisticated class of models. Although the authors use case studies from Ecology, all examples are easy to follow and understand.
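The book's own examples are written in R; as a minimal Python flavour of the same idea, the sketch below fits a random-intercept model with statsmodels' MixedLM on made-up grouped data (the dataset, group structure, and effect sizes are invented purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up repeated-measures data: several observations per site (the grouping unit)
rng = np.random.default_rng(42)
n_sites, n_obs = 10, 8
site = np.repeat(np.arange(n_sites), n_obs)
x = rng.normal(size=n_sites * n_obs)
site_effect = rng.normal(scale=2.0, size=n_sites)[site]   # random intercept per site
y = 1.5 + 0.8 * x + site_effect + rng.normal(size=n_sites * n_obs)
df = pd.DataFrame({"y": y, "x": x, "site": site})

# Random-intercept mixed model: fixed effect of x, observations grouped within sites
model = smf.mixedlm("y ~ x", data=df, groups=df["site"])
result = model.fit()
print(result.summary())
```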
Gelman A., Hill J. (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models. — Cambridge University Press
This is another great text on mixed effects models (a.k.a. “multilevel” and “hierarchical” models), which emphasises their Bayesian nature. Of particular value in this book are the sections discussing sample size and power calculations, as well as simulation-based model checking and comparison. Numerous examples are illustrated with R and Bugs code.
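To give a flavour of the simulation-based approach, here is a small sketch of my own (not taken from the book) that estimates the power of a two-group comparison by repeatedly simulating the experiment; the assumed effect size, noise level, and choice of a t-test are illustrative:

```python
import numpy as np
from scipy import stats

def simulated_power(n_per_group, effect=0.5, sd=1.0, alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of a two-sample t-test by simulating the experiment many times."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sd, n_per_group)
        treatment = rng.normal(effect, sd, n_per_group)
        _, p_value = stats.ttest_ind(control, treatment)
        rejections += p_value < alpha
    return rejections / n_sims

# How does power change with sample size, assuming an effect of 0.5 SD?
for n in (20, 50, 100):
    print(n, round(simulated_power(n), 3))
```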
Singer J.D., Willett J.B. (2003) Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. — Oxford University Press
This widely known book focuses on datasets that contain observations repeatedly taken from the same experimental units over time (a.k.a. “panel data”). In addition to multilevel models for individual change, it also provides an in-depth description of survival, or time-to-event, models. The authors offer step-by-step analyses of published datasets. Thanks to a companion page on the UCLA website, readers can replicate the examples using a variety of software, including R, SAS, Stata, SPSS, and Mplus.
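For readers who work in Python rather than the languages covered on the companion page, here is a minimal taste of time-to-event analysis using the lifelines package; the churn durations, censoring flags, and variable names below are made up for illustration:

```python
import pandas as pd
from lifelines import KaplanMeierFitter

# Made-up time-to-event data: months until customer churn, with right-censoring
df = pd.DataFrame({
    "duration": [5, 8, 12, 12, 3, 20, 7, 15, 9, 18],  # months under observation
    "event":    [1, 1,  0,  1, 1,  0, 1,  0, 1,  1],  # 1 = churned, 0 = censored
})

kmf = KaplanMeierFitter()
kmf.fit(durations=df["duration"], event_observed=df["event"])

print(kmf.median_survival_time_)  # estimated median time to churn
print(kmf.survival_function_)     # the Kaplan-Meier survival curve
```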
This is the best introductory book on Bayesian inference and thinking that I know of. In contrast to many other math-heavy texts on Bayesian statistics, this book is practice-oriented. It will be of particular interest to Data Science practitioners involved in A/B testing. All examples are implemented using the Python-based probabilistic programming framework PyMC.
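As a small taste of what the book covers, below is a minimal Bayesian A/B test in the style of its PyMC examples; the visitor counts and conversions are made up, and the code assumes a recent PyMC version:

```python
import pymc as pm

# Made-up A/B test results: visitors and conversions per variant
n_a, conv_a = 1000, 52
n_b, conv_b = 1000, 68

with pm.Model():
    # Flat priors on the two conversion rates
    p_a = pm.Beta("p_a", alpha=1, beta=1)
    p_b = pm.Beta("p_b", alpha=1, beta=1)
    delta = pm.Deterministic("delta", p_b - p_a)

    # Observed conversions for each variant
    pm.Binomial("obs_a", n=n_a, p=p_a, observed=conv_a)
    pm.Binomial("obs_b", n=n_b, p=p_b, observed=conv_b)

    trace = pm.sample(2000, tune=1000, progressbar=False)

# Posterior probability that variant B converts better than variant A
prob_b_better = (trace.posterior["delta"] > 0).mean().item()
print(f"P(B > A) = {prob_b_better:.3f}")
```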
Being a “lighter” version of the classic text “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, this book is ideal for beginners in the world of Machine Learning. Using a clear and concise style, it covers linear and nonlinear regression, tree-based methods, support vector machines, resampling techniques, principal components analysis, cluster analysis, and other methods. Each chapter ends with a set of well-selected practical exercises (using R).
Personally, I think this is one of the best and most important texts on Machine Learning ever written. And I am proud to be its official translator into Russian (published by the Moscow-based DMK Press in 2016).
This book is a comprehensive overview of the practice of predictive modeling. Lots of attention is paid to data preprocessing and feature engineering, overfitting and hyperparameter tuning, performance metrics, class imbalance in classification problems, predictor importance, and other topics. Although examples are implemented using the R package caret, the book is a great general guide on the process of model building.
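The caret-style workflow the book teaches translates naturally to Python; the sketch below (mine, not from the book) uses scikit-learn to chain preprocessing, cross-validated hyperparameter tuning, and held-out evaluation; the dataset, model choice, and parameter grid are purely illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Preprocessing and model bundled into one pipeline to avoid data leakage
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Cross-validated hyperparameter tuning; ROC AUC is robust to moderate class imbalance
param_grid = {"model__n_estimators": [100, 300], "model__max_depth": [None, 5, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

print(search.best_params_)
print("Held-out ROC AUC:", search.score(X_test, y_test))
```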
Hyndman R.J., Athanasopoulos G. (2014) Forecasting: Principles and Practice. — OTexts
Good applied books on forecasting are rare. However, this one is a real gem. Using numerous real-world examples, it covers a range of standard (exponential smoothing, moving average, ARIMA, etc.) and advanced forecasting techniques (dynamic regression models, neural net-based models, hierarchical and grouped time series, etc.). Although here I am citing the hard-copy edition from 2014, there is a newer online version of this book. If you are interested in all things forecasting, I would also strongly recommend subscribing to Prof. Hyndman’s blog “Hyndsight”.
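The book's code examples are in R; as a rough Python counterpart, here is a minimal Holt-Winters (triple exponential smoothing) sketch using statsmodels on a made-up monthly series with trend and yearly seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Made-up monthly series: upward trend plus yearly seasonality plus noise
rng = np.random.default_rng(1)
months = pd.date_range("2018-01-01", periods=60, freq="MS")
values = (100 + 0.5 * np.arange(60)
          + 10 * np.sin(2 * np.pi * np.arange(60) / 12)
          + rng.normal(scale=2, size=60))
series = pd.Series(values, index=months)

# Additive Holt-Winters model, then forecast the next 12 months
fit = ExponentialSmoothing(series, trend="add", seasonal="add",
                           seasonal_periods=12).fit()
print(fit.forecast(12))
```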
This is one of the books that I use particularly often in my work. And that is not because it provides an excellent overview of commonly used Machine Learning techniques (although it does!). I refer to this book again and again because it is filled with practical advice on how to use Data Science to solve business problems. For example, the “expected value framework”, one of the key ideas in the book, helps me structure my projects around the commercial outcomes expected by stakeholders and clients and then build my solutions accordingly.
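To make the expected value idea concrete (with numbers of my own, not the book's): the decision to target a customer with an offer boils down to whether the response probability times the value of a response outweighs the cost of making the contact.

```python
# Expected value of targeting one customer with an offer (illustrative numbers)
p_response = 0.05          # model-estimated probability the customer responds
value_if_response = 120.0  # profit if the customer responds
cost_of_contact = 2.0      # cost of making the offer

expected_value = p_response * value_if_response - cost_of_contact
print(expected_value)  # 4.0 -> positive, so targeting this customer pays off on average
```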
This is another book that I regularly use in my work. It is a collection of brilliant practical tips on how to develop software pragmatically and efficiently (including recommendations on project planning and execution, stakeholder management, documentation, etc.). Thanks to the natural overlaps between Data Science and software development, many of these tips are directly applicable to Data Science projects. For example, this book inspired my article on gathering project requirements.
If you decide to buy a copy of “The Pragmatic Programmer”, I would recommend getting its most recent, 20th anniversary edition.