How to Get a Job at Google?
This notebook walks through Exploratory Data Analysis (EDA) to build a model.
We need to learn more about our data before we can make predictions and decide on a model. With EDA, we will figure out the details of our data and build a model accordingly.
Data
EDA varies considerably with each data set. Today, we will be looking at data to see what skills are required to get a job at Google. You can find the dataset here: https://www.kaggle.com/niyamatalmass/google-job-skills
Our data has the following features:
- Company: The company posting the job
- Title: The title of the job
- Category: Category of the job
- Location: Location of the job
- Responsibilities: Responsibilities for the job
- Minimum Qualifications: Minimum qualifications for the job
- Preferred Qualifications: Preferred qualifications for the job
Importing the Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Importing our Data
df = pd.read_csv('/Users/delaldeniztomruk/Desktop/job_skills.csv')  # adjust this path to wherever you saved the CSV
Describing / Understanding the Data
df.head()
We see how our data looks in general.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Company                   1250 non-null   object
 1   Title                     1250 non-null   object
 2   Category                  1250 non-null   object
 3   Location                  1250 non-null   object
 4   Responsibilities          1235 non-null   object
 5   Minimum Qualifications    1236 non-null   object
 6   Preferred Qualifications  1236 non-null   object
dtypes: object(7)
memory usage: 68.5+ KB
Here we can see that our data types are all objects. Since we want to fit our data to a model, we should note that our values are categorical rather than numerical. In other words, our data does not contain numbers, so it must be encoded into numbers before fitting, as models only accept numerical values.
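As a minimal sketch of what that encoding could look like (using the Category column as an example; the right encoding depends on the column and the model), one-hot encoding with pandas might look like this:

# One-hot encode the Category column: each category becomes a 0/1 indicator column.
# This is only an illustration; free-text columns like Minimum Qualifications
# would need a different treatment (e.g. a bag-of-words or TF-IDF representation).
category_encoded = pd.get_dummies(df['Category'], prefix='cat')
category_encoded.head()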
df.describe()
df.isna().sum()
Company                      0
Title                        0
Category                     0
Location                     0
Responsibilities            15
Minimum Qualifications      14
Preferred Qualifications    14
dtype: int64
We see that our data has missing values in the following columns: Responsibilities, Minimum Qualifications, and Preferred Qualifications. Our model will not accept missing values, so we should handle them as well.
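As a sketch of two common ways to handle this (which one is appropriate depends on the model; the variable names below are just illustrative):

# Option 1: drop the handful of rows that have any missing values
df_clean = df.dropna()

# Option 2: since these are text columns, replace missing values with an empty string
df_filled = df.fillna('')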
df["Location"].value_counts()Mountain View, CA, United States 190
Sunnyvale, CA, United States 155
Dublin, Ireland 87
New York, NY, United States 70
London, United Kingdom 62
...
Nairobi, Kenya 1
Kraków, Poland 1
Kyiv, Ukraine 1
Moscow, ID, United States 1
Lisbon, Portugal 1
Name: Location, Length: 92, dtype: int64
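The postings span 92 locations but are heavily concentrated around Google's California offices. A quick way to see this (a small sketch using the libraries we already imported) is to plot the ten most frequent locations:

# Plot the ten locations with the most postings as a horizontal bar chart
top_locations = df['Location'].value_counts().head(10)
top_locations.plot(kind='barh')
plt.xlabel('Number of postings')
plt.gca().invert_yaxis()  # most frequent location on top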
Creating Test and Training Sets
You might be asking: why are we separating these sets at the very beginning? We want our model to learn the patterns in the data and give us a good result. However, if we, and consequently our model, see the test data beforehand, we introduce data snooping bias: our estimates will track the test data and look better than they really are. Thus, we need to set part of the dataset aside now to get an honest measure of performance without overfitting.
from sklearn.model_selection import train_test_split

# split the dataset: 80% for training, 20% held out for testing
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

len(train_set), len(test_set)
(1000, 250)
Now we will set our test data aside and do our calculations based on the training data.
Visualizing the Data
First, I will create a copy of my training data to make sure that my computations do not modify the original.
df_copy = train_set.copy()  # .copy() so that changes to df_copy do not touch train_set
df_copy.head()
Now, let's see the education requirements for the positions (if specified).
degrees = ['BA', 'BS', 'BA/BS', 'MBA', 'Master', 'PhD']
degree_calc = dict((x, 0) for x in degrees)

for i in degrees:
    # Count postings whose minimum qualifications mention this degree
    # (na=False treats the missing entries as non-matches)
    degree_calc[i] = df['Minimum Qualifications'].str.contains(i, na=False).sum()

degree_calc
{'BA': 909, 'BS': 879, 'BA/BS': 835, 'MBA': 71, 'Master': 81, 'PhD': 8}

plt.bar(range(len(degree_calc)), degree_calc.values(), align='center');
plt.xticks(range(len(degree_calc)), degree_calc.keys());
Here, we see that most positions mention at least a BA (keep in mind that a plain substring search means every 'BA/BS' posting is also counted under 'BA' and 'BS'). Good news: you might not need a PhD to get into Google!
Let’s check which programming languages are desired.
prog_lang = ['Java', 'Python', 'SQL', 'Ruby', 'C', 'PHP']
lang_calc = dict((x, 0) for x in prog_lang)

for i in prog_lang:
    lang_calc[i] = df['Minimum Qualifications'].str.contains(i, na=False).sum()

lang_calc
{'Java': 97, 'Python': 96, 'SQL': 75, 'Ruby': 14, 'C': 459, 'PHP': 7}

plt.bar(range(len(lang_calc)), lang_calc.values(), align='center');
plt.xticks(range(len(lang_calc)), lang_calc.keys());
Looks like you should be learning C 🙂 (take that count with a grain of salt, though: searching for the single letter 'C' also matches it inside words like 'C++' or 'Computer Science', so it is inflated).
Question: We have obviously limited our search to the languages in the prog_lang list. What can we do to capture every language mentioned in the job postings?
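One possible direction (a sketch under stated assumptions: the vocabulary below is incomplete, and the regex is just one way to do word-boundary matching) is to search a larger list of known languages and require word boundaries, so that 'C' stops matching inside 'Computer Science':

import re

# A larger (still incomplete) vocabulary of languages to search for
known_langs = ['Java', 'JavaScript', 'Python', 'SQL', 'Ruby', 'C', 'C++',
               'C#', 'PHP', 'Go', 'Swift', 'Kotlin', 'Scala', 'R', 'Perl']

lang_counts = {}
for lang in known_langs:
    # re.escape handles special characters like '+' and '#';
    # the lookarounds act as word boundaries that still allow 'C++' and 'C#'
    pattern = r'(?<![\w+#])' + re.escape(lang) + r'(?![\w+#])'
    lang_counts[lang] = df['Minimum Qualifications'].str.contains(
        pattern, regex=True, na=False).sum()

lang_counts

Short names such as 'Go' and 'R' will still produce false positives ('Go' the verb, 'R&D'), so a curated list plus manual inspection of the matches is usually needed to get a truly complete picture.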