Learning from non-stationary distributions

Time-varying data characteristics

One of the most critical assumptions in ML data modeling is that the train and test datasets come from the same, or at least a very similar, distribution. This assumption is what makes generalization of an ML solution possible.

Based on the generalization property, a machine learning model learns the association between the independent features and the target variable from the train data and then predicts the target for unseen data (we will call it test data in the rest of the article).

Note that the train data is used as a reference to estimate the target value for the test data. In other words, the ML solution is probabilistic: its predictions are estimates inferred from past data, not guarantees.

Now, let’s discuss the following business case, understand the issues in it, and look at the approaches and methods that could help the learning.

Business Case:

The case is inspired by the Stack Exchange question linked in the reference at the end of this article.

Learn a model to predict user behavior using the features F1, F2, …, Fₙ.
Let’s consider two different train-test splits:

a) Time based 70–30% split

b) Random train test split with test size = 0.3

The model trained on the time-based split had poor accuracy because, under that split, the train and test data follow different distributions.
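To make the effect of the two splits concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: a single feature x, a decision threshold that drifts linearly over time, and a simple cut-off classifier standing in for the real model. It is not the setup from the original case.

```python
import random

def make_drifting_data(n=2000, seed=0):
    """Rows ordered by time; the true decision threshold drifts from -0.5 to 0.5."""
    rng = random.Random(seed)
    data = []
    for t in range(n):
        x = rng.uniform(-1.0, 1.0)
        threshold = -0.5 + t / (n - 1)  # hypothetical drift schedule
        data.append((x, int(x > threshold)))
    return data

def accuracy(cutoff, rows):
    """Share of rows where the rule 'x > cutoff' matches the label."""
    return sum(int(x > cutoff) == y for x, y in rows) / len(rows)

def fit_threshold(train):
    """Pick the cut-off on a fixed grid that maximizes training accuracy."""
    candidates = [i / 50 - 1.0 for i in range(101)]
    return max(candidates, key=lambda c: accuracy(c, train))

data = make_drifting_data()
split = len(data) * 7 // 10

# a) Time-based 70-30 split: train on the past, test on the future.
time_train, time_test = data[:split], data[split:]
time_acc = accuracy(fit_threshold(time_train), time_test)

# b) Random split with test size 0.3: past and future rows are mixed.
shuffled = data[:]
random.Random(1).shuffle(shuffled)
rand_train, rand_test = shuffled[:split], shuffled[split:]
rand_acc = accuracy(fit_threshold(rand_train), rand_test)

print(f"time-based split accuracy: {time_acc:.3f}")
print(f"random split accuracy:     {rand_acc:.3f}")
```

In this toy setup the time-based split should score noticeably lower. Note that the random split looks better only because it mixes future rows into the training set; it hides the drift rather than fixing it.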

Now that we are acquainted with the problem, let’s see three main approaches that work in such a scenario:


1) Online Learning:

It generates predictions with the goal of minimizing regret relative to a set of experts, that is, keeping the cumulative loss of the learner close to that of the best expert in hindsight. An expert could be built on a single feature or on a group of features. The algorithm sees a data point, makes a prediction, observes its mistake, and updates the weight of each expert accordingly. This way, the algorithm keeps adjusting the weights and outputs relatively more accurate predictions as the distribution of the input data changes.
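The weight update described above can be sketched with the Hedge (exponentially weighted average) forecaster. The article does not name a specific algorithm, so the choice of Hedge, the learning rate eta, and the two hand-made experts are assumptions for illustration only.

```python
import math

def hedge(expert_predictions, outcomes, eta=0.5):
    """Exponentially weighted average forecaster (Hedge).

    expert_predictions[t][i] is the prediction of expert i in [0, 1] at round t,
    outcomes[t] is the true label in {0, 1}.
    Returns the final weights and the cumulative absolute loss of the learner.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0 / n_experts] * n_experts
    learner_loss = 0.0
    for preds, y in zip(expert_predictions, outcomes):
        # Predict with the weighted average of the experts, then see the truth.
        forecast = sum(w * p for w, p in zip(weights, preds))
        learner_loss += abs(forecast - y)
        # Shrink each expert's weight according to its own mistake, then renormalize.
        weights = [w * math.exp(-eta * abs(p - y)) for w, p in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights, learner_loss

# Hypothetical drift: expert 0 is right for the first 40 rounds, expert 1 afterwards.
outcomes = [1] * 100
expert_predictions = [(1, 0) if t < 40 else (0, 1) for t in range(100)]

weights, learner_loss = hedge(expert_predictions, outcomes)
expert_losses = [
    sum(abs(preds[i] - y) for preds, y in zip(expert_predictions, outcomes))
    for i in range(2)
]
print("final weights:", [round(w, 4) for w in weights])
print("learner loss:", round(learner_loss, 2), "expert losses:", expert_losses)
```

By the end, most of the weight sits on expert 1, the one that is right after the change. Plain Hedge reacts to a change only after the accumulated losses cross over, so variants designed for tracking a changing best expert (for example, fixed-share updates) may adapt faster.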

2) Domain Adaptation:

It is used when the features differ between the train and test data. It uses unlabelled test data to assign weights to the training data.

If the F1 feature occurs more often in the train data than in the test data, then it is suggested to lower the weight of F1. The motivation is to make the model depend less on F1, since it is less likely to see F1 in unseen data.

The above works when the feature set differs between the two datasets. If instead the distribution of the features itself changes over time, the training features are weighted by their likelihood of occurring in the test data.
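One common way to turn this reweighting idea into numbers is importance weighting: each training row gets the ratio of how often its feature value occurs in the test data to how often it occurs in the train data. The sketch below assumes a single categorical feature F1 and add-one smoothing; the function name and the toy data are hypothetical.

```python
from collections import Counter

def importance_weights(train_values, test_values, smoothing=1.0):
    """Weight each training row by p_test(value) / p_train(value).

    Only unlabelled feature values are needed from the test data.
    Smoothing keeps the ratio finite for values unseen on one side.
    """
    train_counts = Counter(train_values)
    test_counts = Counter(test_values)
    vocab = set(train_counts) | set(test_counts)
    n_train = len(train_values) + smoothing * len(vocab)
    n_test = len(test_values) + smoothing * len(vocab)
    ratio = {}
    for value in vocab:
        p_train = (train_counts[value] + smoothing) / n_train
        p_test = (test_counts[value] + smoothing) / n_test
        ratio[value] = p_test / p_train
    return [ratio[value] for value in train_values]

# Hypothetical data: F1 is present in 80% of train rows but only 30% of test rows.
train_f1 = [1] * 80 + [0] * 20
test_f1 = [1] * 30 + [0] * 70

sample_weights = importance_weights(train_f1, test_f1)
print("weight for rows with F1 present:", round(sample_weights[0], 3))
print("weight for rows with F1 absent: ", round(sample_weights[-1], 3))
```

Rows where F1 is present get a weight below 1 and rows where it is absent get a weight above 1, so a learner that accepts per-row weights would lean less on F1 during training.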

3) Reinforcement Learning:

It differs from supervised learning in that it does not need labeled input/output pairs. Instead, an agent continuously interacts with the environment, learns from the rewards it receives, and focuses on finding the balance between exploration and exploitation. The environment is typically stated in the form of a Markov Decision Process.
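The exploration and exploitation trade-off can be illustrated with an epsilon-greedy agent on a two-action bandit, which is the simplest special case of a Markov Decision Process (a single state). This is only a sketch under assumed settings: the reward probabilities, the change point, epsilon, and the step size are all made up for the example.

```python
import random

def run_bandit(n_steps=2000, change_at=500, epsilon=0.1, step_size=0.1, seed=0):
    """Epsilon-greedy agent on a two-action bandit whose rewards change over time."""
    rng = random.Random(seed)
    q = [0.0, 0.0]  # running value estimate for each action
    actions = []
    for t in range(n_steps):
        # Hypothetical environment: action 0 pays off often at first, rarely later.
        reward_prob = [0.8 if t < change_at else 0.2, 0.5]
        if rng.random() < epsilon:
            action = rng.randrange(2)                   # explore: try a random action
        else:
            action = max(range(2), key=lambda i: q[i])  # exploit: best current estimate
        reward = 1.0 if rng.random() < reward_prob[action] else 0.0
        # A constant step size gives more weight to recent rewards than to old ones.
        q[action] += step_size * (reward - q[action])
        actions.append(action)
    return q, actions

q, actions = run_bandit()
early_share = actions[:500].count(0) / 500
late_share = actions[-500:].count(1) / 500
print(f"share of action 0 in the first 500 steps: {early_share:.2f}")
print(f"share of action 1 in the last 500 steps:  {late_share:.2f}")
```

Because the agent keeps exploring and its estimates forget old rewards, it can notice that action 0 stopped paying off and move to action 1, without ever being given a labeled example of the right choice.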

There are various research papers listed in the source link mentioned in the reference below. If you wish to learn more about these approaches, feel free to go through those research papers.

I will also write a detailed explanation of each of these approaches in the next post. Stay tuned, and keep learning!

Thanks for reading.

Reference:

https://stats.stackexchange.com/questions/142315/learning-user-behavior-that-changes-over-time
