NEO Share

Sharing The Latest Tech News
Introduction to EDA

January 16, 2021

How to Get a Job at Google?

Delal Tomruk

This notebook walks through Exploratory Data Analysis (EDA) as a first step toward building a model.

Photo by visuals on Unsplash

We need to learn more about our data before we can make predictions and decide on a model. With EDA, we will figure out the details of our data and choose a model accordingly.

Data

EDA varies considerably with each dataset. Today, we will be looking at job postings to see which skills are required to get a job at Google. You can find the dataset here: https://www.kaggle.com/niyamatalmass/google-job-skills

Our data has the following features:

  • Company: The company posting the Job
  • Title: The title of the Job
  • Category: Category of the Job
  • Location: Location of the Job
  • Responsibilities: Responsibilities for the Job
  • Minimum Qualifications: Minimum Qualifications for the Job
  • Preferred Qualifications: Preferred Qualifications for the Job

Importing the Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing our Data

# adjust this path to wherever you saved job_skills.csv
df = pd.read_csv('/Users/delaldeniztomruk/Desktop/job_skills.csv')

Describing / Understanding the Data

df.head()

We see how our data looks in general.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Company                   1250 non-null   object
 1   Title                     1250 non-null   object
 2   Category                  1250 non-null   object
 3   Location                  1250 non-null   object
 4   Responsibilities          1235 non-null   object
 5   Minimum Qualifications    1236 non-null   object
 6   Preferred Qualifications  1236 non-null   object
dtypes: object(7)
memory usage: 68.5+ KB

Here we can see that our data types are all objects. Since we want to fit our data to a model, we should note that our values are categorical rather than numerical. In other words, our data does not consist of numbers, and because models only accept numerical values, it will have to be encoded as numbers before fitting.
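As a preview of that encoding step, here is a minimal sketch of two common pandas options. It uses a small toy frame standing in for our data, since the point is the mechanics, not the real values:

```python
import pandas as pd

# Toy stand-in for our all-object-dtype frame (illustrative values, not the real data)
toy = pd.DataFrame({
    "Category": ["Sales", "Engineering", "Sales", "Legal"],
})

# Option 1: integer-encode a column with factorize
codes, uniques = pd.factorize(toy["Category"])
toy["Category_code"] = codes
print(toy["Category_code"].tolist())   # → [0, 1, 0, 2]

# Option 2: one-hot encode with get_dummies (one 0/1 column per category)
dummies = pd.get_dummies(toy["Category"], prefix="Category")
print(list(dummies.columns))           # → ['Category_Engineering', 'Category_Legal', 'Category_Sales']
```

One-hot encoding is usually the safer choice for unordered categories like these, since integer codes can suggest an ordering that does not exist.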

df.describe()

df.isna().sum()

Company                      0
Title                        0
Category                     0
Location                     0
Responsibilities            15
Minimum Qualifications      14
Preferred Qualifications    14
dtype: int64

We see that our data has missing values in three columns: Responsibilities, Minimum Qualifications, and Preferred Qualifications. Our model will not accept missing values, so we should handle them as well.
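There are two common ways to handle those gaps. A minimal sketch, on a toy frame with gaps like ours (illustrative values, not the real data):

```python
import pandas as pd

# Toy frame with missing values like those in our qualification columns
toy = pd.DataFrame({
    "Title": ["SWE", "PM", "Analyst"],
    "Responsibilities": ["Write code", None, "Analyze data"],
})

# Option 1: drop the rows that have any missing value
dropped = toy.dropna()
print(len(dropped))                         # → 2

# Option 2: fill the gaps with a placeholder string
filled = toy.fillna("Not specified")
print(filled["Responsibilities"].tolist())  # → ['Write code', 'Not specified', 'Analyze data']
```

With only ~15 missing rows out of 1250, dropping them would cost us little data; filling keeps every posting at the price of a made-up value.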

df["Location"].value_counts()

Mountain View, CA, United States    190
Sunnyvale, CA, United States        155
Dublin, Ireland                      87
New York, NY, United States          70
London, United Kingdom               62
...
Nairobi, Kenya                        1
Kraków, Poland                        1
Kyiv, Ukraine                         1
Moscow, ID, United States             1
Lisbon, Portugal                      1
Name: Location, Length: 92, dtype: int64
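With 92 distinct locations, a bar chart of the most common ones is easier to read than the full list. A minimal sketch, using a toy Series in place of df["Location"] (illustrative values only):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for df["Location"]
locations = pd.Series([
    "Mountain View, CA, United States", "Mountain View, CA, United States",
    "Mountain View, CA, United States", "Dublin, Ireland",
    "Dublin, Ireland", "Sunnyvale, CA, United States",
])

top = locations.value_counts().head(3)
top.plot(kind="barh")             # horizontal bars keep the long city names readable
plt.xlabel("Number of postings")
plt.tight_layout()
```

On the real data, `df["Location"].value_counts().head(10)` plotted the same way shows at a glance how strongly the postings cluster in the Bay Area.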

Creating Test and Training Sets

You might be asking: why are we splitting these sets at the very beginning? We want our model to learn the patterns in the data and generalize well. However, if we (and consequently our model) see the test data beforehand, we introduce data snooping bias: our modeling decisions end up tuned to the test data, and the evaluation looks better than it really is. Thus, we separate the dataset now to get an honest estimate of performance without overfitting.

from sklearn.model_selection import train_test_split

# split the dataset
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

len(train_set), len(test_set)

(1000, 250)

Now, we will set our test data aside and will do our calculations based on the training data.

Visualizing the Data

First, I will create a copy of my training data so that my computations do not modify the original.

df_copy = train_set.copy()
df_copy.head()

Now, let’s look at the education requirements for positions (where specified).

degrees = ['BA', 'BS', 'BA/BS', 'MBA', 'Master', 'PhD']

degree_calc = dict((x, 0) for x in degrees)

for i in degrees:
    # na=False counts rows with a missing qualification as non-matches
    degree_calc[i] = df['Minimum Qualifications'].str.contains(i, na=False).sum()

degree_calc

{'BA': 909, 'BS': 879, 'BA/BS': 835, 'MBA': 71, 'Master': 81, 'PhD': 8}

plt.bar(range(len(degree_calc)), degree_calc.values(), align='center');
plt.xticks(range(len(degree_calc)), degree_calc.keys());

Here, we see that most positions require at least a BA. (Keep in mind that substring matching inflates these counts: every 'BA/BS' posting also matches 'BA' and 'BS'.) Good news: you might not need a PhD to get into Google!

Let’s check which programming languages are desired.

prog_lang = ['Java', 'Python', 'SQL', 'Ruby', 'C', 'PHP']

lang_calc = dict((x, 0) for x in prog_lang)

for i in prog_lang:
    # na=False counts rows with a missing qualification as non-matches
    lang_calc[i] = df['Minimum Qualifications'].str.contains(i, na=False).sum()

lang_calc

{'Java': 97, 'Python': 96, 'SQL': 75, 'Ruby': 14, 'C': 459, 'PHP': 7}

plt.bar(range(len(lang_calc)), lang_calc.values(), align='center');
plt.xticks(range(len(lang_calc)), lang_calc.keys());

Looks like you should be learning C 🙂 Be careful, though: str.contains('C') matches any text containing a capital C, including "Computer Science", so this count is heavily inflated.

Question: It is obvious that we have limited our language criteria to the prog_lang list. What can we do to capture every language mentioned in the job postings?
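One possible approach is sketched below on toy qualification strings (illustrative, not from the real dataset): match whole words instead of substrings to fix the inflated counts, and inspect raw token frequencies to discover candidate languages instead of assuming a fixed list:

```python
import re
from collections import Counter

# Toy stand-in for df['Minimum Qualifications']
quals = [
    "Experience in Java and Python.",
    "BS in Computer Science; coding in Python or Go.",
    "Experience with SQL and C.",
]

def count_exact(term, texts):
    # Word-boundary-style matching: 'C' no longer matches the C in "Computer"
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
    return sum(bool(pattern.search(t)) for t in texts)

print(count_exact("C", quals))   # → 1 (a naive substring check would also count "Computer")

# Discover candidate tokens instead of assuming a list, then eyeball the frequent ones
tokens = Counter(tok for t in quals for tok in re.findall(r"[A-Za-z+#]+", t))
print(tokens.most_common(5))
```

The token-frequency pass still needs a human to pick out the language names from the noise; a curated list or a named-entity approach would be the next step up.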

Filed Under: Machine Learning
