For this project, we use machine learning, specifically k-modes clustering, for blood donor segmentation. This Jupyter notebook walks through the step-by-step analysis and data wrangling; the segmented data is then visualized in Power BI. The dataset comes from Taguig Pateros District Hospital.
The study revolves around the data science question: how can we increase the number of donors in blood donation drives? Guided by the team's hypothesis, "by knowing the right people to invite," the following objectives/solutions set the direction of the study: 1) blood donor segmentation, 2) a data dashboard, and 3) sharing EDA insights and proposing initiatives.
This topic was chosen in the hope of contributing to the still-limited body of studies on blood donation initiatives in the Philippines. Moreover, the team believes this analysis can help Filipinos and save lives.
Below are links to the supporting materials which will explain the study further:
Proponents of the project are Tine Celestial, Pamy Longos, Karen Salas, and Maico Rebong. This capstone project was created as a completion requirement for Analytiks Inc. bootcamp.
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Importing all required packages
import numpy as np
import pandas as pd
# Data viz lib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from matplotlib.pyplot import xticks
df = pd.read_csv(r"C:\Users\ACER\Desktop\Maico - Files\Data Science\Bootcamp Materials\Capstone\Blood_Donation_2019.csv", index_col = 'Donor ID')
Attribute Information (Categorical)
# Check if the Data loaded correctly
df.head()
df.tail()
# Check the dimensions
df.shape
# Check the features
df.info()
# Check for Column Names
df.columns
# Get the general description of the data
df.describe(include='all')
# Keep only the categorical columns for clustering
df_donor = df[['Gender', 'Blood Type', 'Barangay', 'District', 'Network', 'Month', 'Blood Drive Location']]
df_donor.head()
df_donor.tail()
df_donor.shape
df_donor.columns
df_donor.info()
# Check the null values
df_donor.isnull().sum()
# Drop rows with null values
df_donor = df_donor.dropna()
# Drop the District column
df_donor = df_donor.drop(['District'], axis=1)
df_donor.shape
df_donor.info()
# Keep a copy of the data before encoding
df_donor_copy = df_donor.copy()
from sklearn import preprocessing
# Label-encode each categorical column into integer codes
le = preprocessing.LabelEncoder()
df_donor = df_donor.apply(le.fit_transform)
df_donor.head()
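One caveat worth noting: passing a single `LabelEncoder` through `df.apply(le.fit_transform)` refits the encoder on every column, so after the call `le` only remembers the last column's classes and the codes cannot be decoded later. A minimal sketch of keeping one encoder per column instead, using a hypothetical toy frame (the column names mirror the dataset, the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df_donor (hypothetical values)
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Blood Type': ['O+', 'A+', 'O+'],
})

# Fit and keep one encoder per column so codes can be decoded later
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}
encoded = toy.apply(lambda s: encoders[s.name].transform(s))

# Decode the integer codes back to the original labels
decoded = encoded.apply(lambda s: encoders[s.name].inverse_transform(s))
```

This makes it possible to translate cluster centroids (which come back as integer codes) into readable category names.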
This project uses k-modes as its partitioning algorithm since we are dealing with categorical data. The Cao initialization was used "to select initial cluster centers by considering the distance between objects and the density of each object."
from kmodes.kmodes import KModes
# Fit k-modes with 3 clusters using Cao initialization
km_cao = KModes(n_clusters=3, init="Cao", n_init=10, verbose=1)
fitClusters_cao = km_cao.fit_predict(df_donor)
# Predicted Clusters
fitClusters_cao
clusterCentroidsDf = pd.DataFrame(km_cao.cluster_centroids_)
clusterCentroidsDf.columns = df_donor.columns
# Mode of the clusters
clusterCentroidsDf
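Because the data was label-encoded before fitting, the centroid table above contains integer codes, not category names. A sketch of mapping codes back to readable labels, using hypothetical centroids and code-to-label maps (in practice these maps would come from the fitted encoders):

```python
import pandas as pd

# Hypothetical centroid codes (one row per cluster) and code→label maps
centroids = pd.DataFrame([[0, 1], [1, 0]], columns=['Gender', 'Blood Type'])
maps = {'Gender': {0: 'F', 1: 'M'}, 'Blood Type': {0: 'A+', 1: 'O+'}}

# Translate each column's codes through its own map
readable = centroids.apply(lambda s: s.map(maps[s.name]))
```

The resulting frame reads directly as "cluster 0's typical donor is F with O+", which is far easier to turn into a persona.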
kmodes = km_cao.cluster_centroids_
shape = kmodes.shape
shape
# Elbow curve: fit k-modes for k = 1..4 and record the clustering cost
cost = []
for num_clusters in range(1, 5):
    kmode = KModes(n_clusters=num_clusters, init="Cao", n_init=10, verbose=1)
    kmode.fit_predict(df_donor)
    cost.append(kmode.cost_)
y = np.arange(1, 5)
plt.plot(y, cost)
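Reading the elbow off the plot is a judgment call; one simple numeric heuristic is to look at how much cost each extra cluster saves and pick the k after which the savings fall off most sharply. A sketch with hypothetical cost values (the real ones come from the loop above):

```python
import numpy as np

# Hypothetical k-modes costs for k = 1..4
cost = [120.0, 80.0, 60.0, 55.0]

drops = -np.diff(cost)          # cost saved by adding each cluster: [40, 20, 5]
ratio = drops[:-1] / drops[1:]  # how sharply the savings fall off
k = int(np.argmax(ratio)) + 2   # +2 because ratio[0] compares the 1→2 and 2→3 drops
```

With these numbers the drop from 2 to 3 clusters (20) dwarfs the drop from 3 to 4 (5), so the heuristic lands on k = 3, consistent with the `n_clusters=3` choice above.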
# Restore the unencoded data and attach the predicted cluster labels
df_donor = df_donor_copy.reset_index()
clustersDf = pd.DataFrame(fitClusters_cao)
clustersDf.columns = ['cluster_predicted']
combinedDf = pd.concat([df_donor, clustersDf], axis=1).reset_index()
combinedDf = combinedDf.drop(['index'], axis=1)
combinedDf.head()
label = km_cao.labels_
combinedDf['cluster_name'] = ['cluster1' if x == 0 else
                              'cluster2' if x == 1 else
                              'cluster3' if x == 2 else
                              'cluster4' for x in label]
plt.style.use('ggplot')
sns.countplot(x='cluster_name', data=combinedDf,
              order=combinedDf['cluster_name'].value_counts().index,
              palette='rainbow')
plt.title('Cluster distribution')
plt.ylabel(None)
plt.xlabel(None)
cluster_1 = combinedDf[combinedDf['cluster_predicted'] == 0]
cluster_2 = combinedDf[combinedDf['cluster_predicted'] == 1]
cluster_3 = combinedDf[combinedDf['cluster_predicted'] == 2]
cluster_1.info()
cluster_1.head()
cluster_2.info()
cluster_2.head()
cluster_3.info()
cluster_3.head()
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Gender'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Blood Type'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,12))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Barangay'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Network'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,10))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Month'])
plt.show()
plt.style.use('ggplot')
plt.subplots(figsize = (15,10))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Blood Drive Location'])
plt.show()
# Extract clustered data to csv
combinedDf.to_csv(r'C:\Users\ACER\Desktop\Maico - Files\Data Science\Bootcamp Materials\Capstone\Cao_v1_n3.csv')
Analyzing the segmented data, we were able to identify personas for each cluster, based on the dominant features (blood type, gender, barangay, donation facility, month, network) of each group as well as the donors' context.
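The "dominant feature per cluster" reading above can be computed directly: take the mode of every categorical column within each cluster. A sketch with a hypothetical stand-in for `combinedDf` (made-up values, same column names):

```python
import pandas as pd

# Toy stand-in for combinedDf (hypothetical values)
toy = pd.DataFrame({
    'cluster_name': ['cluster1', 'cluster1', 'cluster2'],
    'Gender':       ['F', 'F', 'M'],
    'Blood Type':   ['O+', 'O+', 'A+'],
})

# Mode of every feature within each cluster = that persona's dominant traits
personas = toy.groupby('cluster_name').agg(lambda s: s.mode().iloc[0])
```

Each row of `personas` is a compact persona summary, which is what the countplots above show visually.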
Data-driven recommendations
Ways of working recommendations
# Combined charts: one countplot per feature, split by cluster
for col in combinedDf:
    plt.style.use('ggplot')
    plt.subplots(figsize=(15, 10))
    sns.countplot(x='cluster_name', hue=col, data=combinedDf)
    plt.show()