Develop a basic data science project with TensorFlow
Today I found myself learning new skills with TensorFlow, and I thought about how I could build a project with this helpful open-source library. When I learn something, I want to work with real datasets, because this is the best way to consolidate what I have learned.
1- Finding a real dataset
Many platforms nowadays share free data sets, but here I'll talk about the most popular one, Kaggle. You can find a wide variety of data on Kaggle. If you want to take a look, visit the Kaggle website. If you are interested in machine learning or deep learning, you should absolutely know this website.
In this project, I chose the Amazon's Top 50 Bestselling Books between 2009 and 2019 dataset.
2-Advantages and Definition of TensorFlow
Before going through TensorFlow's advantages, I will give you a quick description of TensorFlow.
TensorFlow is currently one of the most widely used deep learning frameworks in the world. It is a free and open-source software library for dataflow and differentiable programming across a range of tasks, and it is used to train ML models. Using TensorFlow is very simple, as you will see in the example below. We will examine how it works step by step in a more detailed example.
Unlike traditional numerical libraries, TensorFlow uses a data flow graph, a common programming model in cloud computing and machine learning, to express and organize the computational workflow, and then maps the mathematical operations in the graph to different computing devices (e.g. GPUs, TPUs, and CPUs).
This architecture provides a uniform API that makes low-level modules and devices transparent to users. This not only saves us from the tedious and demanding tasks of parallel programming but also makes it possible to move the application from one computing platform to another with virtually no change.
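To make the data flow graph idea more concrete, here is a toy sketch in plain Python. It is not how TensorFlow is implemented; the Node class and the constant and evaluate helpers are hypothetical names made up for this illustration. The point is only that the graph is described first and computed later.

```python
import operator


class Node:
    """One operation in a toy data flow graph (hypothetical, not a TensorFlow class)."""

    def __init__(self, op, *inputs):
        self.op = op          # function to apply, or None for a constant
        self.inputs = inputs  # upstream nodes whose outputs feed this node
        self.value = None     # only used by constant nodes


def constant(value):
    """Create a node that simply holds a value."""
    node = Node(None)
    node.value = value
    return node


def evaluate(node):
    """Evaluate a node by first evaluating everything it depends on."""
    if node.op is None:
        return node.value
    args = [evaluate(n) for n in node.inputs]
    return node.op(*args)


# Describe the graph for (a * b) + c; nothing is computed at this point.
a, b, c = constant(2.0), constant(3.0), constant(4.0)
product = Node(operator.mul, a, b)
total = Node(operator.add, product, c)

# Only now does the computation actually run.
result = evaluate(total)
print(result)  # 10.0
```

Because the whole computation is known before it runs, a framework is free to decide where each node executes, which is the property that lets TensorFlow map operations to different devices.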
Let’s look at the main advantages:
- Quick Model Creation
- Scalability
- Robust Machine Learning Generation
- Pipelining
- Community Support etc.
3.1-Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
Now we can read and inspect our data set
df = pd.read_excel("bestsellers.xlsx")
df.head()
df.describe()
3.2-Visualization
I saw that my data was loaded and available. Next I will use Seaborn. Seaborn is an amazing visualization library for statistical graphics plotting in Python. It provides beautiful default styles and color palettes to make statistical plots more attractive. It is built on top of the matplotlib library and is closely integrated with pandas data structures.
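sbn.countplot(x=df["Price"])
plt.figure(figsize=(7,5))
# distplot is deprecated in recent Seaborn releases; histplot with kde=True draws a similar chart
sbn.histplot(df["Price"], kde=True)
sbn.scatterplot(x="Reviews",y="Price",data=df)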
If we examine the last chart, there is no clear relationship between the number of reviews and the price.
In the above examples, we saw how to plot the distribution of prices with Seaborn, and as you can see it is very easy and quick.
3.3- Training and Testing Data
Dataset splitting with the Sklearn train_test_split function
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=10)
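The call above assumes that x and y already exist. The post does not show how they were built, so the following is only a sketch under assumptions: it uses a tiny made-up DataFrame, takes "Price" as the target (the value predicted later on), and treats "User Rating", "Reviews" and "Year" as hypothetical feature columns. Adjust the column names to whatever your file really contains.

```python
import pandas as pd

# Made-up rows standing in for the real file; the numbers are not real data.
toy_df = pd.DataFrame({
    "User Rating": [4.5, 4.1, 4.9, 4.3],
    "Reviews": [1200, 350, 9800, 4100],
    "Price": [11, 25, 14, 7],
    "Year": [2010, 2012, 2015, 2019],
})

# Target: the column we want to predict.
y = toy_df["Price"].to_numpy()

# Features: every assumed numeric column except the target.
x = toy_df.drop(columns=["Price"]).to_numpy()

print(x.shape, y.shape)  # (4, 3) (4,)
```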
Sklearn (or Scikit-learn) is a Python library that offers various features for data processing that can be used for classification, clustering, and model selection.
model_selection is the scikit-learn module that contains tools for splitting data and for choosing and evaluating models. Model selection itself means setting up a blueprint to analyze data and then using it to measure new data. Selecting a proper model allows you to generate accurate results when making a prediction.
To do that, you need to train your model by using a specific dataset. Then, you test the model against another dataset.
The train_test_split function is for splitting a single dataset for two different purposes: training and testing. The training subset is for building your model. The testing subset is for using the model on unknown data to evaluate the performance of the model.
len(x_train)
output: 109
len(x_test)
output: 47
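The two lengths follow from test_size=0.3. As far as I know, scikit-learn rounds the test share up and gives the rest to training, so 156 rows (109 + 47, the total implied by the output above) split as shown below. The helper name split_sizes is made up for this illustration and is not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(156, 0.3)
print(n_train, n_test)  # 109 47
```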
I checked the sizes of x_train and x_test and found no problem, so I moved on to the preprocessing step with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler transforms features by scaling each feature to a given range (0 to 1 by default).
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
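With the default range of 0 to 1, min-max scaling comes down to (value - min) / (max - min), column by column. The NumPy sketch below uses made-up numbers to show why the scaler is fitted on the training data only and then reused on the test data: the test set is scaled with the training minimum and maximum, so its values can land slightly outside 0 to 1.

```python
import numpy as np

# Made-up feature matrices: two columns with very different scales.
train = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])
test = np.array([[2.5, 600.0]])

# "fit": learn each column's minimum and maximum from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": apply the same formula to both sets.
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)  # [[0.75 1.25]]
```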
Now it is time to create my model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
You can create a Sequential model by passing a list of layers to the Sequential constructor:
# uses the Sequential and Dense names imported above
model = Sequential(
    [
        Dense(2, activation="relu"),
        Dense(3, activation="relu"),
        Dense(4),
    ]
)
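To see what "one input tensor and one output tensor" per layer means, here is a rough NumPy sketch of a forward pass through the three-layer stack above. The weights are random placeholders and the dense helper is a made-up name. Keras creates and trains its own weights, so this only illustrates the shapes and the relu(x · W + b) arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(inputs, units, activation=None):
    """Toy fully connected layer: inputs @ W + b, optionally followed by ReLU."""
    weights = rng.normal(size=(inputs.shape[1], units))  # random placeholder weights
    bias = np.zeros(units)
    outputs = inputs @ weights + bias
    if activation == "relu":
        outputs = np.maximum(outputs, 0.0)
    return outputs


batch = rng.normal(size=(5, 3))          # 5 samples, 3 features each
h1 = dense(batch, 2, activation="relu")  # shape (5, 2)
h2 = dense(h1, 3, activation="relu")     # shape (5, 3)
out = dense(h2, 4)                       # shape (5, 4), no activation

print(h1.shape, h2.shape, out.shape)  # (5, 2) (5, 3) (5, 4)
```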
You can also create a Sequential model incrementally via the add() method.
I chose the add() method for this project.
model = Sequential()
model.add(Dense(12,activation="relu"))
model.add(Dense(12,activation="relu"))
model.add(Dense(12,activation="relu"))
model.add(Dense(12,activation="relu"))
model.add(Dense(1))
# pass optimizer by name: default parameters will be used
model.compile(optimizer="adam",loss="mse")
An optimizer is one of the two arguments required for compiling a Keras model; the other is the loss function.
Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data.
Optimizer: str (name of optimizer) or optimizer object
Loss: str (name of objective function) or objective function
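For the curious, the update rule behind Adam fits in a few lines. The sketch below follows the published algorithm with the beta and epsilon values suggested in the original paper, and minimizes a toy function (w - 3)^2 in plain Python. It is not the Keras implementation; the function, the starting point and the learning rate of 0.1 (larger than the Keras default, so the toy converges quickly) are assumptions made for the example.

```python
import math


def adam_minimize(grad_fn, w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam updates and return the final parameter value."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# The gradient of (w - 3)^2 is 2 * (w - 3); the minimum sits at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(round(w_final, 3))
```

Passing optimizer="adam" to compile() asks Keras to do this kind of update for every weight in the network, using its own default settings.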
3.4-Train the model for a fixed number of epochs
model.fit(x=x_train, y = y_train,validation_data=(x_test,y_test),batch_size=250,epochs=300)
- batch_size: int. Number of samples per gradient update
- validation_data: tuple (X, y) to be used as held-out validation data. Will override validation_split
- epochs: integer, total number of passes over the training data (called nb_epoch in older Keras versions)
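One consequence of these settings is easy to miss: batch_size=250 is larger than the 109 training rows, so every epoch consists of a single gradient update over the whole training set. The arithmetic, as a quick sketch:

```python
import math

n_train = 109      # len(x_train) reported earlier
batch_size = 250   # value passed to model.fit
epochs = 300       # value passed to model.fit

# Batches (gradient updates) needed to see every training row once.
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)  # 1 300
```

A smaller batch size would give more updates per epoch; for example, 32 would mean 4 updates per epoch on the same data.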
lossData = pd.DataFrame(model.history.history)
lossData.plot()
The fit() call returns a History object, which is also available afterwards as model.history. Its history attribute is a record of training loss values at successive epochs, as well as validation loss values (if applicable).
3.5-Prediction Series Plotting
from sklearn.metrics import mean_squared_error, mean_absolute_error
- mean_absolute_error: Mean absolute error regression loss
- mean_squared_error: Mean squared error regression loss
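Both metrics are simple averages over the prediction errors. The sketch below computes them by hand on made-up numbers so the formulas are visible. On real data you would instead pass the true and predicted values to the two scikit-learn functions imported above.

```python
# Made-up true prices and predictions, purely for illustration.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]

errors = [p - t for p, t in zip(y_pred, y_true)]  # [2.0, -2.0, 3.0, 0.0]

# Mean absolute error: average size of the errors, in the same unit as the price.
mae = sum(abs(e) for e in errors) / len(errors)

# Mean squared error: average of the squared errors, which punishes large misses more.
mse = sum(e * e for e in errors) / len(errors)

print(mae, mse)  # 1.75 4.25
```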
pred = model.predict(x_test)
plt.scatter(y_test,pred)
plt.plot(y_test,y_test,"g-*")
A scatter plot is a diagram where each value in the data set is represented by a dot.
Matplotlib has a function for drawing scatter plots. It needs two arrays of the same length, one for the values on the x-axis and one for the values on the y-axis:
The x array is y_test, the true prices, in our code
The y array is pred, the predicted book prices
The green line drawn by plt.plot marks where the prediction equals the true value, so the closer the dots are to that line, the better the model.
In this basic project, I discovered how I can make regression predictions with TensorFlow, with help from the scikit-learn Python library for splitting and scaling the data.