Note. This is an update of the article I originally published in 2017 and revised in 2019. So many things have changed, and I have learned so many new things, that the article needs some refreshing to reflect those new experiences.
First of all, I am enclosing “Petroleum” in brackets because that was the original title of the article. But I came to realize that it goes beyond a narrow engineering field; it may be useful to other engineers as well.
These are not absolute truths written in stone. I am sharing experiences and recommendations that have worked for me and others. Some of them come from notes I have shared with colleagues when they asked me:
- “Alfonso, how do I start with Data Science?”,
- “Is there any value in learning Data Science?”,
- “What impact could Data Science bring to my engineering job to improve processes, increase revenues, or reduce costs?”
It is difficult to find out if you don’t apply it in the real world. I was a skeptic until I decided to seriously learn it. I don’t mean one hour a week or a day; I mean living it and making it part of my engineering process at work. It is funny: at first my team was a bit reluctant to change the classic way of solving well models to optimize well production. Later, we wouldn’t dare to analyze an oil field without first applying statistics, creating datasets from input files, simulation, or solver outputs, and looking at the resulting data with a new perspective.
It resembles a scientific approach to things. Mind you, the word “science” is not in “data science” gratuitously. Data science comes from the world of statistics; computer science made it a practical tool for making discoveries from data. We owe Data Science, first and foremost, to statisticians. Besides, this new industrial revolution we are living in, based on data, requires a new set of lenses to understand and discover things that are not immediately evident using the classical methods. Of course it will be difficult, and no doubt you will meet resistance, but what worthwhile human endeavor isn’t?
You may take these as some recipes to start your transformation into a Data Science wizard:
- Complete any of the Python or R online courses on Data Science. My favorites are the ones from Johns Hopkins and the University of Michigan on Coursera (Data Science specializations in R or Python). Don’t be mistaken: the data science specialization in R is a high-quality course, and it will sometimes make you feel like you are going through a PhD program. You will need a firm commitment, and you will have to set aside time for lectures, quizzes, and project assignments. You could complement it with DataCamp short workshops. For instance, I started, a few years ago, with the two-hour “Introduction to R” quick course. There are other online institutions such as edX, Udacity, Udemy, etc. You will also be able to find online courses from reputable universities such as Stanford, MIT, or Harvard. If you don’t have previous programming experience, start with Python; if you feel confident about your programming skills and would like to break the barrier between engineering and science, go full throttle with R.
- Start using Git as much as possible in all your projects. It is useful for sharing and maintaining code, working in teams, synchronizing projects, working on different computers, bringing reproducibility to your data science projects, etc. To access Git in the cloud you may use GitHub, Bitbucket, or GitLab. Don’t be frustrated if you don’t get it or understand it at first; everybody struggles with Git, even PhDs, who, by the way, have written the best tutorials on it. So, you are not alone.
- Learn the basics of the Unix terminal. It is useful for many things that Windows doesn’t do (and might never do); it is even relevant on Linux and Macs, which are Unix-based. With the Unix terminal you can write automation scripts that serve many data-oriented activities: operations on huge datasets, deployment, backups, file transfer, managing remote computers, secure transfer, low-level settings, version control with Git, etc. If you are a Windows user, get familiar with Unix through hybrid Windows applications such as Git-Bash, MSYS2, or Cygwin. There is no question that you have to know the Unix terminal. It makes your data science much more powerful and reproducible, and it also gives you avenues for deployment. I am finding more and more articles where people have managed to read and transform terabyte-size datasets on laptops using combinations of Unix utilities like grep, awk, and sed, along with data.frame and data.table structures; no need for big-data computer clusters with Hadoop or Spark, which are much more difficult to handle.
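The trick that lets those Unix utilities chew through huge files on a laptop is streaming: they process one line at a time, so memory use stays flat. Here is a minimal sketch of the same idea in plain Python; the file name, log format, and pattern are hypothetical illustrations, not from any real tool.

```python
# Minimal sketch: a grep-like streaming filter in pure Python.
# The file is read one line at a time, so memory use stays flat
# even for very large files. File name and pattern are made up.
import re

def stream_matches(path, pattern):
    """Yield lines matching a regex, one at a time (constant memory)."""
    rx = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if rx.search(line):
                yield line.rstrip("\n")

# Example usage on a hypothetical simulator log:
# n_errors = sum(1 for _ in stream_matches("simulation.log", r"ERROR"))
```

Because `stream_matches` is a generator, you can chain it with other generators, much as you would pipe `grep` into `awk` on the terminal.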
- As soon as you have installed R, Rtools, and RStudio on your computer, start using Markdown. In R it is called Rmarkdown, which is widely used in science for generating documentation, papers, citations, booklets, manuals, tutorials, schematics, diagrams, web pages, blogs, slides, etc. Make a habit of using Markdown. If possible, during engineering work, avoid Word, which generates mostly binary files. Working with Markdown makes it easier to do revision control, and it is reproducible; both are key to reliable, testable, traceable, repeatable data science. With Markdown you can also embed LaTeX equations alongside text, code, and calculations. Besides, you gain an additional ecosystem to run code and tools from the LaTeX universe, which is enormous.
- Strive to publish your engineering results using Markdown. It will complement your efforts in batch automation, data science, and machine learning. Combine calculations with code and text using Rmarkdown notebooks in R. Essentially, any document can be written by mixing text, graphics, and calculations with R or Python. Even though I am originally a Python guy (10+ years), I do not strongly recommend Python notebooks, or Jupyter, because they are not 100% human-readable text (Jupyter uses JSON), so you may find it difficult to apply version control and reproducible practices, or to use them with Git. I have probably built more than a thousand Jupyter notebooks, but when I learned Rmarkdown, it was like stepping into another dimension.
- Start bringing your data into datasets with the assistance of R or Python. Build your favorite collections of datasets. Share them with colleagues in the office and discuss the challenges of making raw data tidy. Generate tables, plots, and statistical reports to come up with discoveries. Use Markdown to document the variables or features (columns). If you want to share the data while keeping confidentiality, learn how to anonymize it with R or Python cryptographic or scrambling packages.
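To make the last point concrete, here is a small sketch of both ideas: a tidy, one-row-per-observation dataset, and an identifier column anonymized with a one-way hash before sharing. The well names, rates, and salt string are invented for illustration.

```python
# Sketch: a tidy dataset plus a hashed (anonymized) identifier column.
# All data below is made up; the salt would be a project secret.
import csv, hashlib, io

rows = [
    {"well": "WELL-A", "date": "2020-01-01", "oil_rate_bpd": 850},
    {"well": "WELL-B", "date": "2020-01-01", "oil_rate_bpd": 1200},
]

def anonymize(name, salt="project-secret"):
    """One-way scramble of a well name so the data can be shared."""
    return hashlib.sha256((salt + name).encode()).hexdigest()[:10]

for r in rows:
    r["well"] = anonymize(r["well"])

# Write the tidy, anonymized table as CSV text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["well", "date", "oil_rate_bpd"])
writer.writeheader()
writer.writerows(rows)
```

The hash is deterministic, so the same well always maps to the same code, which preserves joins across datasets while hiding the real name.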
- Start solving daily engineering problems with R or Python by incorporating them into your workflow. Avoid Excel or Excel-VBA if possible. VBA was not designed for version control, reproducibility, or data science, much less machine learning. Sticking to Office tools may keep you stuck with outdated practices, or unable to perform much richer and more productive data science. One more thing you may have noticed: Excel plots are very simplistic (they go back to techniques from 30 years ago), and you run the risk of dumbing down your analysis, missing discoveries in your data, or failing to show a compelling story, which is the purpose of data science anyway.
- Learn and apply statistics everywhere, every time you can, in all the petroleum engineering activities you perform. Find what no other person can by using math, physics, and statistics. Data science is about making discoveries and answering questions with data. Data science was invented by statisticians, who at the time called it “data analysis”. An article I never get tired of reading and re-reading is “50 Years of Data Science” by David Donoho. Please, read it. It explains statistics and its tempestuous, albeit tight, relationship with data science.
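Applying statistics can start very small. The sketch below, using only the Python standard library, computes descriptive statistics on a made-up series of daily oil rates and flags unusual days with a simple two-standard-deviation screening rule; the numbers are invented for illustration.

```python
# Sketch: descriptive statistics on (made-up) daily oil rates,
# using only the Python standard library.
import statistics as st

rates_bpd = [980, 1010, 995, 1040, 870, 1005, 990, 1020]

summary = {
    "n": len(rates_bpd),
    "mean": st.mean(rates_bpd),
    "median": st.median(rates_bpd),
    "stdev": st.stdev(rates_bpd),
}

# A simple screening rule: flag days more than 2 standard deviations
# from the mean as candidates for investigation.
outliers = [r for r in rates_bpd
            if abs(r - summary["mean"]) > 2 * summary["stdev"]]
# Here the 870 bpd day stands out from the rest of the series.
```

Even this tiny habit, summarizing and screening every series you touch, starts to change how you look at engineering data.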
- Read what other disciplines outside yours are doing in data science and machine learning. Look at bioscience, biostatistics, genetics, robotics, medicine, cancer research, psychology, biology, ecology, automotive, finance, etc.
- Read articles on the net about data science. It doesn’t matter if they use Python or R. You just have to learn what data science is about and how it could bring value to your everyday workflow. They may give you ideas for applications involving data in your engineering area of expertise. They may not be data science per se now, but they most likely could be the next stepping stone. Additionally, most of the articles are free, as are hundreds of books, booklets, tutorials, and papers. We never had the chance to learn so much for so little. Somebody has called this the era of the democratization of knowledge and information. What you have to invest is time.
- The next stepping stone while learning data science is machine learning. Start inquiring about what machine learning is. The same goes for artificial intelligence. There is nothing better than knowing, at least, the fundamentals of what others are trying to sell you. There is so much noise and snake-oil marketing nowadays surrounding the words “machine learning” and “artificial intelligence”. Three books I would recommend, off the top of my head, on artificial intelligence: “Artificial Intelligence: A New Synthesis” by Nils Nilsson; “Computational Intelligence: A Logical Approach” by David Poole, Alan Mackworth, and Randy Goebel; and “Artificial Intelligence: A Modern Approach” by Russell and Norvig. You will find that AI is not what you read about in newspapers or articles.
- Review C++ and Fortran scientific code. I don’t mean to say that you need to learn another programming language, but knowing what they can do will add power to your toolbox, especially at deployment time. Sooner or later you will need Fortran, C, or C++ for reasons of efficiency and speed. Not for nothing do the best-in-class simulators and optimizers of today have plenty of Fortran routines under the hood.
- Learn how to read from different file formats. The variety of file formats in which you may find raw data is amazing. There is a lot of value that you could bring to your daily activities by automating your data analysis workflow with R or Python. Also, ask which data formats are used in your company for storing data, and get familiar with them. If you are in petroleum engineering, try reading some chunks of that data: try logs, seismic, well tests, buildups, drilling reports, deviation surveys, geological data, process data, simulation output, etc. Create tidy datasets out of them. Explore the data. Embark on finding and discovering things.
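Many of those engineering files are whitespace-separated or fixed-width text rather than clean CSV. As a hedged sketch, here is a tiny parser that turns such a file into a list of tidy records; the column names and the deviation-survey-style content are hypothetical, not from any real format specification.

```python
# Sketch: parsing a whitespace-delimited text file (a common raw-data
# shape in engineering) into tidy records. The layout is invented.
raw = """\
MD      TVD     INC
0.0     0.0     0.0
500.0   499.8   2.1
1000.0  996.5   5.4
"""

def parse_table(text):
    """Turn a header line plus numeric rows into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        values = [float(v) for v in line.split()]
        records.append(dict(zip(header, values)))
    return records

survey = parse_table(raw)
```

From a list of dicts like `survey`, going to a CSV, a dataframe, or a plot is one short step in either R or Python.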
- Something more challenging is learning how to read and transform unstructured data, meaning data that is not in a row-column (rectangular) format. The typical cases close to us are the text outputs from simulators, optimizers, and stimulation or well design tools. This is some of the most difficult data to operate on, and it is where learning “regex” really pays off. There is an even more complex side of unstructured data: video, sound, and images! Today there are plenty of algorithms available that deal with that kind of data, whether in Matlab, Python, or R.
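To show what “regex paying off” looks like, here is a minimal sketch that pulls structured values out of simulator-style text output with a regular expression. The report format below is invented for illustration; real simulator output will need its own patterns.

```python
# Sketch: extracting well rates from unstructured, simulator-style
# text output using a regular expression. The report is made up.
import re

report = """
TIME = 365.0 DAYS
  WELL PROD-1  OIL RATE = 850.5 STB/D
  WELL PROD-2  OIL RATE = 1203.0 STB/D
"""

pattern = re.compile(r"WELL\s+(\S+)\s+OIL RATE\s*=\s*([\d.]+)")
rates = {name: float(rate) for name, rate in pattern.findall(report)}
```

Once the values are captured into a dict like `rates`, the rest is ordinary tidy-data work: tables, plots, and statistics.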
- Learn something about virtual machines with VirtualBox or VMware. It is very useful to have several operating systems working at the same time on your PC: Windows, Linux, macOS. There is a lot of good data science and machine learning stuff in Linux packaged as VMs, which can be run under Windows very easily. These are applications that are ready to run without the need to install anything on the physical machine. A few months ago, I was able to download a couple of Linux VMs with a whole bunch of machine learning and artificial intelligence applications and test them with minimum effort. I have other VMs from Cloudera and Hortonworks where I was able to run big-data applications such as Hadoop, Spark, etc. Another virtualization tool that you may want to learn is Docker containers. The concept is similar to that of virtual machines but lighter and less resource-intensive. Yet another tool you may want to explore in virtualization is Vagrant, which is an advanced combination of virtual machines and containers. These tools will make your data science even more reproducible and help it stand the test of time.
- Note. For those who have asked me whether I recommend a formal data science degree at a university: what I tell them is to try online courses first and see if it is for you.
Alfonso R. Reyes
Houston, Texas. 2020
Copyright Alfonso. R. Reyes 2020 Oil Gains Analytics LLC