Step 1 — Data exploration
The dataset contains 961 instances of masses detected in mammograms, each described by the following attributes:
- BI-RADS assessment (ordinal) — Assessment of how confident the severity classification is, ranked from 1 to 5.
- Age (integer) — Patient’s age in years.
- Mass shape (nominal) — round=1; oval=2; lobular=3; irregular=4
- Mass margin (nominal) — circumscribed=1; micro-lobulated=2; obscured=3; ill-defined=4; spiculated=5
- Mass density (ordinal) — high=1; iso=2; low=3; fat-containing=4
- Severity (binomial) — benign=0 or malignant=1
Here are some statistics of each feature:
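As a minimal sketch of how such statistics can be produced, assuming the data is loaded with pandas: the column names below are hypothetical, the raw file is assumed to have no header row and to mark missing values with "?", and the handful of rows is only illustrative, not the full dataset.

```python
import io

import pandas as pd

# Hypothetical column names (the raw file is assumed to have no header row).
COLUMNS = ["BI-RADS", "age", "shape", "margin", "density", "severity"]

# A few illustrative rows standing in for the real file; "?" marks a missing value.
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
"""

# na_values="?" turns the "?" markers into NaN so pandas treats them as missing.
df = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS, na_values="?")

# Summary statistics; a "count" lower than the number of rows signals missing values.
stats = df.describe()
print(stats)

# Explicit count of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

With the real file, you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.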
Step 2 — Handling missing values
In Figure 1, we can observe there are quite a few missing values in the dataset (2 for “BI-RADS”, 5 for “age”, 31 for “shape”, 48 for “margin” and 76 for “density”).
Before dropping every row that’s missing data, it is important to make sure we don’t bias our data by doing so.
Let’s look at how missing values are distributed (Fig. 2 shows the distribution of missing values for “age”). If missingness appears to correlate with the values of other fields, dropping those rows would bias the data, and we would instead have to impute the missing values with a suitable method (e.g. KNN, MICE).
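One rough way to run this check, sketched here with pandas, is to compare complete and incomplete rows on the other fields. The column names and sample rows are assumptions for illustration only, and a comparison of means is a quick sanity check rather than a formal test of randomness.

```python
import io

import pandas as pd

# Hypothetical column names and illustrative rows; "?" marks a missing value.
COLUMNS = ["BI-RADS", "age", "shape", "margin", "density", "severity"]
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
"""
df = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS, na_values="?")

# Label each row by whether it has at least one missing field.
row_status = df.isna().any(axis=1).map({True: "incomplete", False: "complete"})

# Compare average age and malignancy rate between the two groups.
# Large differences would suggest the values are not missing at random.
comparison = df.groupby(row_status)[["age", "severity"]].mean()
print(comparison)
```

On a sample this small the numbers mean little; on the full dataset, similar averages in both groups would support dropping the incomplete rows.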
In our case, missing data seems randomly distributed. We can therefore move on and drop rows containing missing values:
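A minimal sketch of this step, assuming the data sits in a pandas DataFrame (the column names and sample rows are hypothetical stand-ins for the real file):

```python
import io

import pandas as pd

# Hypothetical column names and illustrative rows; "?" marks a missing value.
COLUMNS = ["BI-RADS", "age", "shape", "margin", "density", "severity"]
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
"""
df = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS, na_values="?")

# Drop every row that has at least one missing value.
df_clean = df.dropna()

print(f"{len(df)} rows before, {len(df_clean)} rows after dropping missing values")
```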
Step 3 — Feature selection
Now the data must be split into two arrays:
- A two-dimensional input array (X) containing the values of the features used to predict the output. In our case, the relevant features are age, shape, margin and density. The BI-RADS attribute (an assessment of how confident the severity classification is) is dropped because it is not a “predictive” attribute.
- A 1D array (Y) containing classification data (values of the feature ‘severity’).
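The split described above could look like the following sketch, again assuming pandas and NumPy arrays. The column names are hypothetical and the sample rows are illustrative only.

```python
import io

import pandas as pd

# Hypothetical column names and illustrative rows; "?" marks a missing value.
COLUMNS = ["BI-RADS", "age", "shape", "margin", "density", "severity"]
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
"""
df = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS, na_values="?").dropna()

# BI-RADS is left out on purpose: it is not a predictive attribute.
feature_columns = ["age", "shape", "margin", "density"]

X = df[feature_columns].to_numpy()  # 2D array: one row per mass, one column per feature
Y = df["severity"].to_numpy()       # 1D array: 0 = benign, 1 = malignant

print(X.shape, Y.shape)
```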
Step 4 — Normalization
Finally, some models require normalized input data, so let’s go ahead and normalize our matrix X:
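The text does not say which normalization is used, so the sketch below assumes standardization (zero mean, unit variance per feature), written with plain NumPy on a few made-up rows. scikit-learn's `StandardScaler` applies the same kind of transformation.

```python
import numpy as np

# Made-up feature rows in the order: age, shape, margin, density.
X = np.array(
    [
        [67, 3, 5, 3],
        [58, 4, 5, 3],
        [28, 1, 1, 3],
        [57, 1, 5, 2],
    ],
    dtype=float,
)

# Per-feature mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize each column. This assumes no column is constant;
# a constant column would have std == 0 and cause a division by zero.
X_scaled = (X - mean) / std

print(X_scaled.round(2))
```

After this step every feature is on a comparable scale, which matters for models that are sensitive to feature magnitude, such as those based on distances or gradient descent.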