For this project, we use machine learning, specifically k-modes clustering, for blood donor segmentation. This Jupyter notebook walks through the step-by-step analysis and data wrangling; the segmented data is then visualized in Power BI. The dataset comes from Taguig Pateros District Hospital.
The study revolves around the data science question: how can we increase the number of donors in blood donation drives? Guided by the team's hypothesis, "by knowing the right people to invite," the following objectives/solutions set the direction of the study: 1) blood donor segmentation, 2) a data dashboard, and 3) sharing EDA insights and proposing initiatives.
This topic was chosen in the hope of contributing to the still-limited body of studies on blood donation initiatives in the Philippines. Moreover, the team believes this analysis can help Filipinos and save lives.
Below are links to the supporting materials which will explain the study further:
Proponents of the project are Tine Celestial, Pamy Longos, Karen Salas, and Maico Rebong. This capstone project was created as a completion requirement for Analytiks Inc. bootcamp.
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Importing all required packages
import numpy as np
import pandas as pd
# Data viz lib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from matplotlib.pyplot import xticks
df = pd.read_csv(r"C:\Users\ACER\Desktop\Maico - Files\Data Science\Bootcamp Materials\Capstone\Blood_Donation_2019.csv", index_col = 'Donor ID')
Attribute Information (Categorical)
# Check if the Data loaded correctly
df.head()
df.tail()
# Check the dimensions
df.shape
# Check the features
df.info()
# Check for Column Names
df.columns
# Get the general description of the data
df.describe(include='all')
# Keep only the categorical columns for clustering
df_donor = df[['Gender', 'Blood Type', 'Barangay', 'District', 'Network', 'Month', 'Blood Drive Location']]
df_donor.head()
df_donor.tail()
df_donor.shape
df_donor.columns
df_donor.info()
# Check the null values
df_donor.isnull().sum()
# Drop rows with null values
df_donor = df_donor.dropna()
# Drop the District column
df_donor = df_donor.drop(['District'], axis=1)
df_donor.shape
df_donor.info()
# Keep a copy of the data before encoding
df_donor_copy = df_donor.copy()
from sklearn import preprocessing
# Label-encode each categorical column into integer codes
le = preprocessing.LabelEncoder()
df_donor = df_donor.apply(le.fit_transform)
df_donor.head()
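One caveat worth noting: passing a single `LabelEncoder` through `df.apply(le.fit_transform)` refits the encoder on every column, so after the call `le` only remembers the last column's classes and the codes cannot be decoded later. A minimal sketch of keeping one encoder per column instead, using a hypothetical toy frame (the column names mirror the dataset, the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df_donor (hypothetical values)
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Blood Type': ['O+', 'A+', 'O+'],
})

# Fit and keep one encoder per column so codes can be decoded later
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}
encoded = toy.apply(lambda s: encoders[s.name].transform(s))

# Decode the integer codes back to the original labels
decoded = encoded.apply(lambda s: encoders[s.name].inverse_transform(s))
```

This makes it possible to translate cluster centroids (which come back as integer codes) into readable category names.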
This project uses k-modes as its partitioning algorithm since we are dealing with categorical data. The Cao initialization was used "to select initial cluster centers by considering the distance between objects and the density of each object."
from kmodes.kmodes import KModes
# Fit k-modes with 3 clusters using Cao initialization
km_cao = KModes(n_clusters=3, init="Cao", n_init=10, verbose=1)
fitClusters_cao = km_cao.fit_predict(df_donor)
# Predicted Clusters
fitClusters_cao
clusterCentroidsDf = pd.DataFrame(km_cao.cluster_centroids_)
clusterCentroidsDf.columns = df_donor.columns
# Mode of the clusters
clusterCentroidsDf
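Because the data was label-encoded before fitting, the centroid table above contains integer codes, not category names. A sketch of mapping codes back to readable labels, using hypothetical centroids and code-to-label maps (in practice these maps would come from the fitted encoders):

```python
import pandas as pd

# Hypothetical centroid codes (one row per cluster) and code→label maps
centroids = pd.DataFrame([[0, 1], [1, 0]], columns=['Gender', 'Blood Type'])
maps = {'Gender': {0: 'F', 1: 'M'}, 'Blood Type': {0: 'A+', 1: 'O+'}}

# Translate each column's codes through its own map
readable = centroids.apply(lambda s: s.map(maps[s.name]))
```

The resulting frame reads directly as "cluster 0's typical donor is F with O+", which is far easier to turn into a persona.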
kmodes = km_cao.cluster_centroids_
shape = kmodes.shape
shape
# Elbow curve: fit k-modes for k = 1..4 and record the clustering cost
cost = []
for num_clusters in range(1, 5):
    kmode = KModes(n_clusters=num_clusters, init="Cao", n_init=10, verbose=1)
    kmode.fit_predict(df_donor)
    cost.append(kmode.cost_)
y = np.arange(1, 5)
plt.plot(y, cost)
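Reading the elbow off the plot is a judgment call; one simple numeric heuristic is to look at how much cost each extra cluster saves and pick the k after which the savings fall off most sharply. A sketch with hypothetical cost values (the real ones come from the loop above):

```python
import numpy as np

# Hypothetical k-modes costs for k = 1..4
cost = [120.0, 80.0, 60.0, 55.0]

drops = -np.diff(cost)          # cost saved by adding each cluster: [40, 20, 5]
ratio = drops[:-1] / drops[1:]  # how sharply the savings fall off
k = int(np.argmax(ratio)) + 2   # +2 because ratio[0] compares the 1→2 and 2→3 drops
```

With these numbers the drop from 2 to 3 clusters (20) dwarfs the drop from 3 to 4 (5), so the heuristic lands on k = 3, consistent with the `n_clusters=3` choice above.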
# Restore the unencoded data and attach the predicted cluster labels
df_donor = df_donor_copy.reset_index()
clustersDf = pd.DataFrame(fitClusters_cao)
clustersDf.columns = ['cluster_predicted']
combinedDf = pd.concat([df_donor, clustersDf], axis=1).reset_index()
combinedDf = combinedDf.drop(['index'], axis=1)
combinedDf.head()
label = km_cao.labels_
combinedDf['cluster_name'] = ['cluster1' if x == 0 else
                              'cluster2' if x == 1 else
                              'cluster3' if x == 2 else
                              'cluster4' for x in label]
plt.style.use('ggplot')
sns.countplot(x='cluster_name', data=combinedDf,
              order=combinedDf['cluster_name'].value_counts().index,
              palette='rainbow')
plt.title('Cluster distribution')
plt.ylabel(None)
plt.xlabel(None)
cluster_1 = combinedDf[combinedDf['cluster_predicted'] == 0]
cluster_2 = combinedDf[combinedDf['cluster_predicted'] == 1]
cluster_3 = combinedDf[combinedDf['cluster_predicted'] == 2]
cluster_1.info()
cluster_1.head()
cluster_2.info()
cluster_2.head()
cluster_3.info()
cluster_3.head()
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Gender'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Blood Type'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,12))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Barangay'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Network'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,10))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Month'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,10))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Blood Drive Location'])
plt.show()
# Extract clustered data to csv
combinedDf.to_csv(r'C:\Users\ACER\Desktop\Maico - Files\Data Science\Bootcamp Materials\Capstone\Cao_v1_n3.csv')
Analyzing the segmented data, we were able to identify personas for each cluster, based on the dominant features (blood type, gender, barangay, donation facility, month, network) of each group as well as the donors' context.
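The "dominant feature per cluster" reading above can be computed directly: take the mode of every categorical column within each cluster. A sketch with a hypothetical stand-in for `combinedDf` (made-up values, same column names):

```python
import pandas as pd

# Toy stand-in for combinedDf (hypothetical values)
toy = pd.DataFrame({
    'cluster_name': ['cluster1', 'cluster1', 'cluster2'],
    'Gender':       ['F', 'F', 'M'],
    'Blood Type':   ['O+', 'O+', 'A+'],
})

# Mode of every feature within each cluster = that persona's dominant traits
personas = toy.groupby('cluster_name').agg(lambda s: s.mode().iloc[0])
```

Each row of `personas` is a compact persona summary, which is what the countplots above show visually.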
Data-driven recommendations
Ways of working recommendations
# Combined charts: one countplot per feature, split by cluster
for col in combinedDf:
    plt.style.use('ggplot')
    plt.subplots(figsize=(15, 10))
    sns.countplot(x='cluster_name', hue=col, data=combinedDf)
    plt.show()