Application of Machine Learning in Sensor Based Water Quality Early Warning System
Written by Justin Lam, Yin Lin, Jian Peng and Wendy Ye
Project Description
In this project, our team applied machine learning techniques to develop a model that detects potential water quality or pollution events from water probe and sensor data.
Objective
Currently, most pollution or water quality events go undetected because there is no infrastructure in place to support detection. A water quality sensor deployed in a pilot program in San Diego Creek transmits real-time readings for environmental scientists to monitor, but there is no process in place to analyze them.
In most cases, events are detected after the fact, as water quality samples must be taken on site and sent to a lab for analysis. The goal of this project is to delve into the sensor data and see whether there are any patterns that can help spot potentially problematic events early on.
The Data
Our project is specific to the Orange County watershed, where OC Environmental Resources recently piloted a water quality sensor in San Diego Creek near the University of California, Irvine campus. This sensor records readings every 15 minutes across eight different attributes: rainfall, discharge rate, temperature, turbidity, conductivity, pH, dissolved solids, and dissolved oxygen. Our data set consists of a year’s worth of readings, for a total of ~3.6 million data points.
Methodology
A significant amount of effort went into cleaning the data set prior to any deeper analysis. Over 25% of the readings had to be discarded due to missing or invalid values, the result of the sensor either malfunctioning or being taken offline for maintenance. Calculated attributes, such as percentage change over intervals of time, were also added to give the data additional dimensions. Because each attribute was recorded in a different unit of measure, all remaining values were normalized before any model was trained on them.
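The cleaning steps described above can be sketched roughly as follows. This is an illustration only, not our actual pipeline: the invalid-value sentinel, the choice of the first column for the percent-change attribute, and the array layout are all assumptions.

```python
import numpy as np

def clean_and_normalize(readings, invalid=-9999.0):
    """Drop rows with missing/invalid values, add a percent-change
    attribute, and z-score normalize each column.

    readings: 2D array, one row per 15-minute reading,
              one column per sensor attribute.
    """
    # Discard rows containing NaN or the sensor's invalid sentinel.
    valid = ~np.any(np.isnan(readings) | (readings == invalid), axis=1)
    data = readings[valid]

    # Derived attribute: percent change between consecutive readings
    # of the first column (e.g. turbidity), with 0 for the first row.
    pct = np.zeros(len(data))
    prev = np.where(data[:-1, 0] == 0, 1, data[:-1, 0])
    pct[1:] = np.diff(data[:, 0]) / prev
    data = np.column_stack([data, pct])

    # Normalize: zero mean, unit variance per column.
    mean = data.mean(axis=0)
    std = np.where(data.std(axis=0) == 0, 1, data.std(axis=0))
    return (data - mean) / std
```

Normalizing last, after rows are dropped and derived columns are added, keeps every attribute on a comparable scale for the clustering step.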
Our team initially had several techniques in mind when starting this project. Our objective of identifying quality and pollution issues made classification methods like logistic regression and decision trees the obvious choice. However, because our data is the result of a pilot program, we had no readings for actual water quality events that occurred at this particular location in San Diego Creek within our time frame. Since our data was primarily numerical in nature, we opted instead to perform a cluster analysis to identify segments that may carry signals of potential issues. We also conducted a time series analysis for each attribute to ascertain any trends or seasonality in the data.
Key Findings
Our team chose simple K-means clustering as our model of choice. Several combinations of centroid counts and included fields were run until we settled on 4 centroids using only the hourly average of each attribute. Based on our teammate’s professional experience, elevated turbidity was the best indicator of potential quality issues among the readings available. We found that two clusters captured very clearly the timeframes of high turbidity, and to some degree spikes in conductivity. These would be solid leads to investigate further, to see whether a rule-based model like a decision tree could be formed to create a classifier using all available attributes.
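A minimal from-scratch sketch of the K-means procedure is shown below for illustration; in practice a library implementation such as scikit-learn's `KMeans` would be used, and the deterministic initialization here is a simplification, not our actual configuration.

```python
import numpy as np

def kmeans(X, k=4, iters=100):
    """Plain K-means. X is an (n_samples, n_features) array of
    normalized hourly averages; returns (centroids, labels)."""
    # Simple deterministic init: k evenly spaced rows of X.
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points,
        # keeping the old centroid if a cluster ends up empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

Trying several values of `k` and inspecting the resulting clusters, as we did, is a common way to settle on the centroid count when no labeled events exist to validate against.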
A time series analysis of turbidity and conductivity concluded that there is no obvious trend or seasonality in these two variables. Some seasonality was detected in discharge rate and, most clearly, in water temperature, which was most likely driven by atmospheric temperature on a diurnal cycle.
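One simple way to check for a diurnal cycle like the one seen in water temperature is to measure the autocorrelation of the series at a one-day lag. This is a sketch under the assumption of readings every 15 minutes (96 per day), not our exact analysis code:

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation of a 1D series at a given lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.dot(x[:-lag], x[lag:]) / denom if denom else 0.0

# Readings every 15 minutes -> 96 readings per day.
READINGS_PER_DAY = 96
# A strongly positive autocorrelation at lag 96 (one day) suggests a
# diurnal cycle; values near zero suggest no daily seasonality.
```

Applied to water temperature this statistic would be close to 1 at the one-day lag, while for turbidity and conductivity it would stay near zero, consistent with what our time series analysis found.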
Challenges
While our team had hoped for more conclusive results, we were hampered by several key issues. As this water quality program is just a pilot, data was only available from a single stationary sensor in a downstream area, whereas there are dozens if not hundreds of miles of upstream points where events could occur. A major problem that changed the course of this project was the lack of suspected incidents to validate our model against. As mentioned previously, most incidents are detected after the fact, and sensor readings for this location only go back about a year.
However, these challenges are being addressed by the OC Environmental Resources team that manages this program. A budget has been proposed to deploy more sensors across the region as well as cameras at key junctions to aid in earlier detection of incidents.
Conclusion
Our team thought this was a good exercise in applying machine learning to a real-world problem. As demonstrated by our rather difficult-to-interpret data set, not all business problems come clearly defined with a structured data set ready to analyze. Projects like this will likely need to involve several cross-functional teams to set up the infrastructure and processes needed to make this system a reality.