- Business Problem
- Dataset Analysis
- Mapping the Real World Problem into ML problem
- Performance Metric
- Exploratory Data Analysis
- Feature Engineering
- Data preprocessing
- Future improvements
This problem comes from a Kaggle competition organized by Booz Allen Hamilton, a firm that has been solving problems for business, government, and military leaders for over 100 years. The competition is about reducing aviation fatalities: we have to predict the state of the pilot from the physiological data provided. So let’s start exploring the problem!
Most flight fatalities are due to the loss of airplane state awareness. Safety is one of the most important considerations for today’s travelers, since a flight accident can cost many lives. So our challenge is to build a model that detects troubling events from the aircrew’s physiological data.
The most frequent causes for these aviation accidents include:
- Pilot error
- Mechanical error
- Runway problems
- Air traffic failure
- Climate problems
In this problem we focus mainly on the first cause, i.e., aviation accidents caused by pilot error, and on how to reduce them.
Air travel is one of the most important modes of transport, so passenger safety is a primary concern for most airlines. As part of that, pilots undergo rigorous training to handle different situations, and an important ability a pilot should possess is multitasking.
Most flight fatalities and accidents caused by pilot error are due to the loss of airplane state awareness. Airplane state awareness (ASA) is a pilot performance attribute: the pilot should be able to recognize and respond quickly to any change in the state of the airplane. Loss of ASA can lead to many dangerous situations and may result in loss of airplane control. It is mainly due to a loss of attention on the part of pilots who may be distracted, sleepy, or in other dangerous cognitive states. Because flying is a stressful environment, loss of awareness is a common risk.
In this dataset, you are provided with real physiological data from eighteen pilots who were subjected to various distracting events. The benchmark training set is comprised of a set of controlled experiments collected in a non-flight environment, outside of a flight simulator. The test set (abbreviated LOFT = Line Oriented Flight Training) consists of a full flight (take off, flight, and landing) in a flight simulator.
The pilots experienced distractions intended to induce one of the following three cognitive states:
- Channelized Attention (CA) is, roughly speaking, the state of being focused on one task to the exclusion of all others. This is induced in benchmarking by having the subjects play an engaging puzzle-based video game.
- Diverted Attention (DA) is the state of having one’s attention diverted by actions or thought processes associated with a decision. This is induced by having the subjects perform a display monitoring task. Periodically, a math problem showed up which had to be solved before returning to the monitoring task.
- Startle/Surprise (SS) is induced by having the subjects watch movie clips with jump scares.
The aim is to build a model that can estimate the pilot's state of mind in real time from the given physiological data. When the pilot enters any one of the above-mentioned dangerous cognitive states, he or she should be alerted, thereby preventing a possible accident.
The dataset is provided as CSV files (both the train and test sets).
Now, let’s analyze each attribute in the dataset.
The main sensors used for collecting the physiological data are EEG, ECG, respiration, and galvanic skin response.
id– (test.csv and sample_submission.csv only) A unique identifier for a crew + time combination. You must predict probabilities for each id.
crew– a unique id for a pair of pilots. There are 9 crews in the data.
experiment– One of CA, DA, SS, or LOFT. The first 3 comprise the training set. The last is the test set. (The training data consist of three experiments: CA, DA, and SS. The output is one of four labels: Baseline (no event), CA, DA, or SS. For example, if the experiment is CA, the output is either CA or Baseline (no event). The test data is taken from a full flight simulator. Here the experiment is called LOFT, or Line Oriented Flight Training, where the pilot is trained in a flight simulator that artificially recreates the environment of a real flight. In the test data, the experiment is given as LOFT and the output can be any one of the four states at a given time.)
time– seconds into the experiment
seat– is the pilot in the left (0) or right (1) seat
eeg_fp1, eeg_f7, eeg_f8, eeg_t4, eeg_t6, eeg_t5, eeg_t3, eeg_fp2, eeg_o1, eeg_p3, eeg_pz, eeg_f3, eeg_fz, eeg_f4, eeg_c4, eeg_p4, eeg_poz, eeg_c3, eeg_cz, eeg_o2– EEG recordings, one column per channel.
ecg– 3-point Electrocardiogram signal. The sensor had a resolution/bit of .012215 µV and a range of -100mV to +100mV. The data are provided in microvolts.
r– Respiration, a measure of the rise and fall of the chest. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
gsr– Galvanic Skin Response, a measure of electrodermal activity. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
event– The state of the pilot at the given time: one of A= baseline, B= SS, C= CA, D= DA
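The relationship between the experiment column and the event column can be checked with a few lines of code. The sketch below is only an illustration: the sample rows are made up, the helper name `inconsistent_rows` is hypothetical, and it assumes the training file uses the column names listed above.

```python
import csv
import io

# Event codes as listed in the data description above.
EVENT_NAMES = {"A": "baseline", "B": "SS", "C": "CA", "D": "DA"}

# In a training experiment the label is either baseline (A)
# or the state that the experiment is designed to induce.
ALLOWED_EVENTS = {"CA": {"A", "C"}, "DA": {"A", "D"}, "SS": {"A", "B"}}


def inconsistent_rows(rows):
    """Return the rows whose event is not allowed for their experiment."""
    bad = []
    for row in rows:
        allowed = ALLOWED_EVENTS.get(row["experiment"])
        if allowed is None or row["event"] not in allowed:
            bad.append(row)
    return bad


# Tiny made-up sample standing in for the training CSV (illustrative values only).
SAMPLE_CSV = """crew,experiment,time,seat,event
1,CA,0.01,1,A
1,CA,0.02,1,C
1,DA,0.01,0,D
1,SS,0.01,0,C
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
bad_rows = inconsistent_rows(rows)
print(len(rows), "rows,", len(bad_rows), "inconsistent")
```

On the made-up sample, the last row is flagged because event C (CA) should not appear in an SS experiment. Running the same check on the real training file would be a quick sanity test before modeling.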
This is a multiclass classification problem with classes A, B, C, and D. For each id, we need to predict the probability that the pilot's state belongs to each of the four classes.
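To make the prediction target concrete, here is a minimal sketch of turning four model scores into per-class probabilities for one id. The helper names are hypothetical, the scores are made up, and the row layout (an id followed by one column per class) is an assumption about the submission file rather than something confirmed here.

```python
CLASSES = ["A", "B", "C", "D"]  # A = baseline, B = SS, C = CA, D = DA


def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def most_likely_state(probs):
    """Return the class label with the highest probability."""
    best = max(range(len(CLASSES)), key=lambda j: probs[j])
    return CLASSES[best]


def submission_row(row_id, probs):
    """Build one row: the id plus one probability per class (assumed layout)."""
    row = {"id": row_id}
    row.update(zip(CLASSES, probs))
    return row


raw_scores = [6.0, 1.0, 2.0, 1.0]  # made-up model scores for a single id
probs = normalize(raw_scores)
row = submission_row(0, probs)
print(row, most_likely_state(probs))
```

Predicting a full probability vector, rather than a single label, matters because the competition metric scores the probabilities themselves.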
The problem we are handling is a multiclass classification problem with 4 classes.
- The evaluation metric used in this competition is multiclass log loss:
$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$
where $N$ is the total number of data points and $M$ is the number of classes.
$y_{ij}$ is 1 if data point $i$ actually belongs to class $j$, and 0 otherwise.
$p_{ij}$ is the predicted probability that data point $i$ belongs to class $j$.
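Multiclass log loss averages the negative log of the probability that the model assigned to the true class of each data point. A minimal pure-Python sketch is shown below. The function name and the clipping constant `eps` are assumptions of this sketch (clipping only guards against taking log(0)), and the labels and probabilities are made up.

```python
import math


def multiclass_log_loss(y_true, probs, classes=("A", "B", "C", "D"), eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true: list of true labels, one per data point.
    probs:  list of probability lists, ordered like `classes`.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        j = classes.index(label)
        # y_ij is 1 only for the true class, so only that term contributes.
        p_ij = min(max(p[j], eps), 1.0 - eps)
        total += math.log(p_ij)
    return -total / len(y_true)


y_true = ["A", "C"]
probs = [
    [0.70, 0.10, 0.10, 0.10],  # fairly confident and correct
    [0.25, 0.25, 0.25, 0.25],  # uniform guess
]
loss = multiclass_log_loss(y_true, probs)
uniform_loss = multiclass_log_loss(["A"], [[0.25, 0.25, 0.25, 0.25]])
print(round(loss, 4), round(uniform_loss, 4))
```

A model that always predicts 0.25 for every class scores log(4), about 1.386, so any useful model should come in below that value.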
The first step in exploring the train data is to check whether the classes are balanced or imbalanced. For this purpose we used a countplot of the event column.
From the plot we can see that the train data is imbalanced.
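The same balance check can also be done numerically, without a plot. The sketch below uses made-up labels, not the actual competition counts, and the helper name `class_distribution` is hypothetical.

```python
from collections import Counter


def class_distribution(events):
    """Return the fraction of data points in each class."""
    counts = Counter(events)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Made-up event labels standing in for the event column of the train data.
events = ["A"] * 6 + ["C"] * 3 + ["D"] * 1
dist = class_distribution(events)

# Ratio between the most and least frequent classes; 1.0 means perfectly balanced.
imbalance_ratio = max(dist.values()) / min(dist.values())
print(dist, imbalance_ratio)
```

A ratio well above 1 points to the kind of imbalance seen in the plot, which is worth keeping in mind when choosing a validation scheme.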