In Part 1 of 3 of Data Wrangling, we load all required libraries/packages for our project & read in our data file. We also check the dataset for parsing problems & find none.
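```{r DataWrangling1}
library(tidyverse) # attaches readr, dplyr, tidyr and the rest of the tidyverse
library(readr)
library(dplyr)
# plyr is not loaded: attaching it after dplyr masks dplyr's rename(), summarise() and count()
cc1 <- read_csv("BankChurners.csv") # original dataset
#----------------------------
problems(cc1) # no problems with cc1
head(cc1)
dim(cc1) # returns dimensions: 10127 rows, 23 columns
cc1 %>% filter(!is.na(Income_Category)) # rows with a non-missing Income_Category
colSums(is.na(cc1)) # number of NA values per column
glimpse(cc1)
```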
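A quick way to confirm that a data frame has no missing values is to count the NA entries per column. The following is a minimal sketch on a made-up data frame (the name toy and its values are assumptions for illustration only, not rows from BankChurners.csv):

```{r NACheckSketch}
# Toy stand-in for cc1 (hypothetical values, not taken from the real file)
toy <- data.frame(
  Customer_Age    = c(45, 49, NA, 40),
  Income_Category = c("$60K - $80K", "Unknown", "Less than $40K", NA),
  Credit_Limit    = c(12691, 8256, 3418, 3313),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() collapses it to one count per column
na_per_column <- colSums(is.na(toy))
na_per_column

# anyNA() answers the yes/no question for the whole data frame
has_na <- anyNA(toy)
has_na
```

Note that the string "Unknown" is not counted here: it is an ordinary value as far as R is concerned, so it has to be filtered separately.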
In Part 2 of 3 of Data Wrangling, we keep only the columns we want & remove NA & Unknown values from our data. We also examine the dimensions & the unique values of our discrete variables.
6 distinct discrete types for Income_Category: $60K - $80K, Less than $40K, $80K - $120K, $40K - $60K, $120K +, Unknown
4 distinct discrete types for Marital_Status: Married, Single, Divorced, Unknown
4 distinct discrete types for Card_Category: Blue, Gold, Silver, Platinum
Note: We will also remove any rows/entries with an “Unknown” or NA value in Marital_Status, Income_Category or Education_Level.
We initially have 10,127 rows & 23 columns, which we cut down to 8,348 rows & 9 columns.
```{r DataWrangling2}
# select the columns we care about and drop any rows containing NA
cc2 <- cc1 %>%
  select(Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status,
         Income_Category, Card_Category, Credit_Limit, Attrition_Flag) %>%
  drop_na()
# base R alternative: cc2 <- na.omit(cc2)
# see the head of it
head(cc2)
dim(cc2) # dimensions: 10127 rows, 9 columns
#----------------------------
cc2 %>% group_by(Income_Category, Marital_Status)
#----------------------------
# Let's see which distinct types there are
distinct(cc2, Income_Category) # 6 types: $60K - $80K, Less than $40K, $80K - $120K, $40K - $60K, $120K +, Unknown
distinct(cc2, Marital_Status) # 4 types: Married, Single, Divorced, Unknown
distinct(cc2, Card_Category) # 4 types: Blue, Gold, Silver, Platinum
#----------------------------
# Drop all the "Unknown" rows from Marital_Status, Income_Category & Education_Level
# 82x9, 82 rows must remove these rows
cc3 <- cc2 %>% filter(Marital_Status != "Unknown", Income_Category != "Unknown", Education_Level != "Unknown")
#----------------------------
head(cc3)
dim(cc3) # 8348 rows by 9 cols
#----------------------------
```
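To see what the two cleaning steps do row by row, here is a small sketch on a made-up tibble. The name toy and every value in it are assumptions chosen for illustration; only the column names and the "Unknown" marker come from the dataset described above.

```{r DropUnknownSketch}
library(dplyr)
library(tidyr)

# Hypothetical 5-row table: one clean row, three "Unknown" rows, one NA row
toy <- tibble(
  Marital_Status  = c("Married", "Unknown", "Single", NA, "Divorced"),
  Income_Category = c("$60K - $80K", "Less than $40K", "Unknown", "$120K +", "$40K - $60K"),
  Education_Level = c("Graduate", "High School", "College", "Graduate", "Unknown")
)

toy_clean <- toy %>%
  drop_na() %>%                              # removes the row containing NA
  filter(Marital_Status != "Unknown",        # then removes rows where any of
         Income_Category != "Unknown",       # the three columns holds the
         Education_Level != "Unknown")       # literal string "Unknown"

rows_removed <- nrow(toy) - nrow(toy_clean)
dim(toy_clean)
rows_removed
```

Comparing nrow() before and after each step is a cheap sanity check that the filter removed what was expected and nothing more.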
In Part 3 of 3 of Data Wrangling, we rename our label column Attrition_Flag to Exited_Flag. We also rename the binary values of this label from Existing Customer/Attrited Customer to Current/Exited, respectively. Lastly, we look at the count of each discrete feature value against our discrete label.
```{r DataWrangling3}
#----------------------------
#install.packages("dplyr")
library(dplyr)
# dplyr:: prefixes make sure dplyr's versions are used even if another attached package (e.g. plyr) masks them
# Rename label column to Exited_Flag
dataCC4 <- cc3 %>% dplyr::rename(Exited_Flag = Attrition_Flag)
#----------------------------
# Rename values
dataCC4$Exited_Flag[dataCC4$Exited_Flag == "Existing Customer"] <- "Current"
dataCC4$Exited_Flag[dataCC4$Exited_Flag == "Attrited Customer"] <- "Exited"
#----------------------------
# Mean age, dependent count and credit limit for each label value
dataCC4 %>%
  group_by(Exited_Flag) %>%
  dplyr::summarize(meanAge = mean(Customer_Age), meanDependent = mean(Dependent_count), meanCreditLim = mean(Credit_Limit))
# AKA:
summarise_mean <- function(data, vars) {
  data %>% dplyr::summarise(n = n(), across({{ vars }}, mean))
}
# dataCC4 %>%
#   group_by(Exited_Flag) %>%
#   summarise_mean(where(is.numeric))
#----------------------------
# see the count of each
dataCC4 %>% select(Gender, Exited_Flag) %>% group_by(Gender) %>% dplyr::count(Exited_Flag)
dataCC4 %>% group_by(Education_Level) %>% dplyr::count(Exited_Flag)
dataCC4 %>% group_by(Marital_Status) %>% dplyr::count(Exited_Flag)
dataCC4 %>% group_by(Income_Category) %>% dplyr::count(Exited_Flag)
dataCC4 %>% group_by(Card_Category) %>% dplyr::count(Exited_Flag)
summary(dataCC4)
```
Above, we can see that current customers had a higher mean credit limit than customers who exited.
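For readers who want to check the rename, relabel & group-mean steps by hand, here is a minimal sketch on four made-up rows. The name toy and the credit limits are assumptions picked so the arithmetic is easy to verify; they say nothing about the real dataset.

```{r GroupedMeansSketch}
library(dplyr)

# Hypothetical rows: two existing customers, two attrited customers
toy <- tibble(
  Attrition_Flag = c("Existing Customer", "Existing Customer",
                     "Attrited Customer", "Attrited Customer"),
  Credit_Limit   = c(10000, 8000, 9000, 5000)
)

toy_means <- toy %>%
  rename(Exited_Flag = Attrition_Flag) %>%                  # new_name = old_name
  mutate(Exited_Flag = if_else(Exited_Flag == "Existing Customer",
                               "Current", "Exited")) %>%    # relabel the two values
  group_by(Exited_Flag) %>%
  summarise(n = n(), meanCreditLim = mean(Credit_Limit))    # one row per label value

toy_means
```

In this toy table the Current group averages (10000 + 8000) / 2 = 9000 & the Exited group averages (9000 + 5000) / 2 = 7000, which is the same kind of comparison the grouped summary makes on the full data.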