Machine Learning Applied to Mammogram Classification

Step 1 — Data exploration

The dataset contains 961 instances of masses detected in mammograms, each described by the following attributes:

BI-RADS assessment (ordinal) — Assessment of how confident the severity classification is, ranked from 1 to 5.
Age (integer) — Patient’s age in years.
Mass shape (nominal) — round=1; oval=2; lobular=3; irregular=4
Mass margin (nominal) — circumscribed=1; micro-lobulated=2; obscured=3; ill-defined=4; spiculated=5
Mass density (ordinal) — high=1; iso=2; low=3; fat-containing=4
Severity (binomial) — benign=0 or malignant=1

Here are some summary statistics for each feature:

Fig. 1 — Features information
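As a rough sketch, statistics like these could be produced with pandas. The column names and the handful of rows below are made up for illustration, and the sketch assumes missing entries are marked with "?" in the raw file; adapt both to the actual data.

```python
import io

import pandas as pd

# Hypothetical column names; the article does not state the ones it uses.
COLUMNS = ["bi_rads", "age", "shape", "margin", "density", "severity"]

# Made-up rows standing in for the real file; "?" marks a missing value (assumed).
SAMPLE_CSV = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
"""

# Passing names= tells pandas the file has no header row.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), names=COLUMNS, na_values="?")

# Count, mean, std, min, quartiles and max for every column.
print(df.describe())

# Number of missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)
```

With the real file, `io.StringIO(SAMPLE_CSV)` would simply be replaced by the path to the data.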

Step 2 — Handling missing values

Fig. 1 shows quite a few missing values in the dataset (2 for “BI-RADS”, 5 for “age”, 31 for “shape”, 48 for “margin” and 76 for “density”).

Before dropping every row that’s missing data, it is important to make sure we don’t bias our data by doing so.

Let’s look at how the missing values are distributed (Fig. 2 shows the distribution for “age”). If the missing fields appear to be correlated with particular kinds of records, we’d have to impute that data with a suitable method (e.g. KNN, MICE) instead of dropping the rows.

Fig. 2 — “Age” missing values distribution
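One possible way to run this check, sketched with pandas on made-up rows: flag the rows where "age" is missing and compare the share of malignant cases in both groups. The column names are hypothetical, and a sample this small proves nothing by itself; it only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NaN marks a missing value.
df = pd.DataFrame({
    "age":      [67, 43, np.nan, 28, 74, np.nan, 70, 42],
    "severity": [1, 1, 1, 0, 1, 0, 1, 0],
})

# True for the rows where "age" is missing.
age_missing = df["age"].isna()

# Share of benign (0) and malignant (1) cases within each group.
# Each row of the table sums to 1 because of normalize="index".
table = pd.crosstab(age_missing, df["severity"], normalize="index",
                    rownames=["age missing"])
print(table)
```

If the two rows of the table differed strongly on the full dataset, that would hint that the values are not missing at random, and imputation would be the safer choice.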

In our case, missing data seems randomly distributed. We can therefore move on and drop rows containing missing values:

Fig. 3 — Features information (missing values dropped)
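Dropping the incomplete rows is a one-liner in pandas. Here is a minimal sketch on made-up rows with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NaN marks a missing value.
df = pd.DataFrame({
    "age":      [67, 43, np.nan, 28, 74],
    "shape":    [3, 1, 4, 1, np.nan],
    "margin":   [5, 1, 5, 1, 5],
    "density":  [3, np.nan, 3, 3, 3],
    "severity": [1, 1, 1, 0, 1],
})

# Keep only the rows without any missing value and renumber them.
clean = df.dropna().reset_index(drop=True)

dropped = len(df) - len(clean)
print(f"{dropped} of {len(df)} rows dropped")
print(clean.describe())
```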

Step 3 — Feature selection

Now, data must be split into two arrays:

A multi-dimensional input array (X) containing the values of the features relevant for predicting the output. In our case, the relevant features are age, shape, margin and density. The BI-RADS attribute (assessment of how confident the severity classification is) is dropped because it is not a “predictive” attribute.
A 1D array (Y) containing classification data (values of the feature ‘severity’).

Fig. 4 — Input data matrix (X) and classification data matrix (Y)
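The split described above can be sketched as follows, again on made-up rows and with hypothetical column names:

```python
import pandas as pd

# Made-up, already cleaned rows for illustration only.
clean = pd.DataFrame({
    "bi_rads":  [5, 4, 5, 4],
    "age":      [67, 28, 58, 57],
    "shape":    [3, 1, 4, 1],
    "margin":   [5, 1, 5, 5],
    "density":  [3, 3, 2, 3],
    "severity": [1, 0, 1, 1],
})

# BI-RADS is left out on purpose: it is not a predictive attribute.
FEATURES = ["age", "shape", "margin", "density"]

X = clean[FEATURES].to_numpy()    # 2D array: one row per mass, one column per feature
y = clean["severity"].to_numpy()  # 1D array: 0 = benign, 1 = malignant

print(X.shape, y.shape)
```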

Step 4 — Normalization

Finally, some models require input data to be normalized, so let’s normalize our matrix X:

Fig. 5 — Normalized input data matrix (X)
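The article does not say which normalization was applied. The sketch below assumes z-score standardization (zero mean and unit variance per column), computed with NumPy on a made-up matrix:

```python
import numpy as np

# Made-up input matrix for illustration: columns are age, shape, margin, density.
X = np.array([
    [67, 3, 5, 3],
    [28, 1, 1, 3],
    [58, 4, 5, 2],
    [57, 1, 5, 3],
], dtype=float)

# Per-column mean and (population) standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize each column; this assumes no column is constant (std > 0).
X_scaled = (X - mean) / std

print(X_scaled.round(2))
```

scikit-learn's `StandardScaler` applies the same transformation and also remembers the mean and standard deviation, which is handy when the same scaling has to be applied to new data later.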
