Identifying Blood Donor Segments and Implementing Management Dashboard using Machine Learning and Data Visualization


For this project, we use machine learning, specifically K-modes clustering, for blood donor segmentation. This Jupyter notebook shows the step-by-step analysis and data wrangling. The segmented data is then visualized using Power BI. The dataset comes from Taguig Pateros District Hospital (TPDH).

The study revolves around the data science question: How can we increase the number of donors in blood donation drives? Guided by the team's hypothesis, "By knowing the right people to invite," the study pursues three objectives: 1) blood donor segmentation, 2) a data dashboard, and 3) sharing EDA insights and proposing initiatives.

This topic was chosen in the hope of contributing to the still-limited number of studies on blood donation initiatives in the Philippines. The team also believes the analysis can help Filipinos and, ultimately, save lives.

Below are links to the supporting materials which will explain the study further:

Proponents of the project are Tine Celestial, Pamy Longos, Karen Salas, and Maico Rebong. This capstone project was created as a completion requirement for Analytiks Inc. bootcamp.

KMode Reference

Import Libraries

In [1]:
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing all required packages
import numpy as np
import pandas as pd

# Data viz lib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from matplotlib.pyplot import xticks

Load the Dataset

In [2]:
df = pd.read_csv(r"C:\Users\ACER\Desktop\Maico - Files\Data Science\Bootcamp Materials\Capstone\Blood_Donation_2019.csv", index_col = 'Donor ID')

Attribute Information (Categorical)

  • Donor ID: Unique ID assigned to the donor
  • Gender: Male or Female
  • Blood Type: O+, A+, B+, AB+
  • Barangay: 28 barangays in Taguig
  • District: Congressional district, 1 or 2 only
  • Network: Communications provider
  • Quarter: Q1 to Q4 (full year 2019)
  • Month: January to December (full year 2019)
  • Day of the Week: Sunday to Saturday
  • Donor Count: Volume of blood donated per donor (450 cc)
  • Blood Drive Location: Facility
  • Data: TPDH

Exploratory Data Analysis (EDA)

In [3]:
# Check if the Data loaded correctly
df.head()
Out[3]:
Gender Blood Type Barangay District Network Quarter Month Day of the Week Donor Count Blood Drive Location
Donor ID
P1 Female O+ Western Bicutan 2nd TNT Q1 January Wed 450 cc TPDH
P2 Female A+ Hagonoy 1st Others Q1 January Wed 450 cc TPDH
P3 Female B+ Tanyag 2nd Globe/TM Q1 January Wed 450 cc TPDH
P4 Male O+ Western Bicutan 2nd Globe/TM Q1 January Wed 450 cc TPDH
P5 Female B+ Upper Bicutan 2nd Globe/TM Q1 January Wed 450 cc TPDH
In [4]:
df.tail()
Out[4]:
Gender Blood Type Barangay District Network Quarter Month Day of the Week Donor Count Blood Drive Location
Donor ID
P2008 Male B+ Lower Bicutan 1st TNT Q4 December Fri 450 cc TPDH
P2009 Male O+ Western Bicutan 2nd Others Q2 May Thu 450 cc Taguig ARMY Signal
P2010 Male B+ Lower Bicutan 1st Others Q3 July Wed 450 cc TPDH
P2011 Male B+ Upper Bicutan 2nd Smart Q4 October Wed 450 cc TPDH
P2012 Female A+ Western Bicutan 2nd Others Q3 August Mon 450 cc TPDH
In [5]:
# Check the dimensions
df.shape
Out[5]:
(1969, 10)
In [6]:
# Check the features
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1969 entries, P1 to P2012
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Gender                1969 non-null   object
 1   Blood Type            1969 non-null   object
 2   Barangay              1969 non-null   object
 3   District              1819 non-null   object
 4   Network               1969 non-null   object
 5   Quarter               1969 non-null   object
 6   Month                 1969 non-null   object
 7   Day of the Week       1969 non-null   object
 8   Donor Count           1969 non-null   object
 9   Blood Drive Location  1969 non-null   object
dtypes: object(10)
memory usage: 169.2+ KB
In [7]:
# Check for Column Names
df.columns
Out[7]:
Index(['Gender', 'Blood Type', 'Barangay', 'District', 'Network', 'Quarter',
       'Month', 'Day of the Week', 'Donor Count', 'Blood Drive Location'],
      dtype='object')
In [8]:
# Get the general description of the data
df.describe(include='all')
Out[8]:
Gender Blood Type Barangay District Network Quarter Month Day of the Week Donor Count Blood Drive Location
count 1969 1969 1969 1819 1969 1969 1969 1969 1969 1969
unique 2 4 42 2 6 4 12 7 1 9
top Male O+ Western Bicutan 2nd Globe/TM Q4 June Wed 450 cc TPDH
freq 1474 899 617 1140 898 583 281 1382 1969 1344
In [9]:
# Select the categorical columns used for clustering
df_donor = df[['Gender', 'Blood Type', 'Barangay', 'District', 'Network', 'Month', 'Blood Drive Location']]

Data Inspection

In [10]:
df_donor.head()
Out[10]:
Gender Blood Type Barangay District Network Month Blood Drive Location
Donor ID
P1 Female O+ Western Bicutan 2nd TNT January TPDH
P2 Female A+ Hagonoy 1st Others January TPDH
P3 Female B+ Tanyag 2nd Globe/TM January TPDH
P4 Male O+ Western Bicutan 2nd Globe/TM January TPDH
P5 Female B+ Upper Bicutan 2nd Globe/TM January TPDH
In [11]:
df_donor.tail()
Out[11]:
Gender Blood Type Barangay District Network Month Blood Drive Location
Donor ID
P2008 Male B+ Lower Bicutan 1st TNT December TPDH
P2009 Male O+ Western Bicutan 2nd Others May Taguig ARMY Signal
P2010 Male B+ Lower Bicutan 1st Others July TPDH
P2011 Male B+ Upper Bicutan 2nd Smart October TPDH
P2012 Female A+ Western Bicutan 2nd Others August TPDH
In [12]:
df_donor.shape
Out[12]:
(1969, 7)
In [13]:
df_donor.columns
Out[13]:
Index(['Gender', 'Blood Type', 'Barangay', 'District', 'Network', 'Month',
       'Blood Drive Location'],
      dtype='object')
In [14]:
df_donor.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1969 entries, P1 to P2012
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Gender                1969 non-null   object
 1   Blood Type            1969 non-null   object
 2   Barangay              1969 non-null   object
 3   District              1819 non-null   object
 4   Network               1969 non-null   object
 5   Month                 1969 non-null   object
 6   Blood Drive Location  1969 non-null   object
dtypes: object(7)
memory usage: 123.1+ KB

Data Cleaning

In [15]:
# Check the null values
df_donor.isnull().sum()
Out[15]:
Gender                    0
Blood Type                0
Barangay                  0
District                150
Network                   0
Month                     0
Blood Drive Location      0
dtype: int64
In [16]:
# Drop rows with a missing District, then drop the District column itself
df_donor = df_donor.dropna()
df_donor = df_donor.drop(['District'], axis = 1)
In [17]:
df_donor.shape
Out[17]:
(1819, 6)
In [18]:
df_donor.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1819 entries, P1 to P2012
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Gender                1819 non-null   object
 1   Blood Type            1819 non-null   object
 2   Barangay              1819 non-null   object
 3   Network               1819 non-null   object
 4   Month                 1819 non-null   object
 5   Blood Drive Location  1819 non-null   object
dtypes: object(6)
memory usage: 99.5+ KB

Model Building

In [19]:
# First we will keep a copy of data
df_donor_copy = df_donor.copy()

Data Preparation

In [20]:
from sklearn import preprocessing
# Encode each categorical column as integer codes (the encoder is refit per column)
le = preprocessing.LabelEncoder()
df_donor = df_donor.apply(le.fit_transform)
df_donor.head()
Out[20]:
Gender Blood Type Barangay Network Month Blood Drive Location
Donor ID
P1 0 3 27 5 4 3
P2 0 0 6 2 4 3
P3 0 2 22 1 4 3
P4 1 3 27 1 4 3
P5 0 2 24 1 4 3

Additional library

This project uses k-modes as its partitioning algorithm because the data are categorical. The Cao approach was used "to select initial cluster centers by considering the distance between objects and the density of each object."
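To make the idea concrete, here is a minimal sketch of the two building blocks k-modes relies on: a simple matching dissimilarity (the number of attributes on which two records differ) and a per-attribute mode that serves as the cluster center. The helper names and the toy donors are illustrative assumptions; this is not the kmodes library's implementation and it does not reproduce the Cao initialization.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attribute positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Most frequent value of each attribute across the records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Toy donors: (gender, blood type, network). Illustrative values only.
donors = [
    ("Male", "O+", "Globe/TM"),
    ("Male", "O+", "Smart"),
    ("Female", "O+", "Globe/TM"),
]

center = column_modes(donors)
distances = [matching_dissimilarity(d, center) for d in donors]
total_cost = sum(distances)  # cost of treating all three donors as one cluster

print(center)
print(distances, total_cost)
```

Summing the dissimilarities between each record and its cluster mode, as `total_cost` does here for a single cluster, is the kind of quantity that the cost values in the training logs measure.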

In [21]:
from kmodes.kmodes import KModes

Using K-Modes with "Cao" initialization

In [22]:
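# n_init=10 is requested, but Cao initialization is deterministic,
# so only a single run is performed (see the log below)
km_cao = KModes(n_clusters=3, init = "Cao", n_init = 10, verbose=1)
fitClusters_cao = km_cao.fit_predict(df_donor)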
Initialization method and algorithm are deterministic. Setting n_init to 1.
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 102, cost: 4863.0
In [23]:
# Predicted Clusters
fitClusters_cao
Out[23]:
array([0, 2, 0, ..., 1, 1, 2], dtype=uint16)
In [24]:
clusterCentroidsDf = pd.DataFrame(km_cao.cluster_centroids_)
clusterCentroidsDf.columns = df_donor.columns
In [25]:
# Mode of the clusters
clusterCentroidsDf
Out[25]:
Gender Blood Type Barangay Network Month Blood Drive Location
0 1 3 27 1 6 3
1 1 2 10 3 10 3
2 1 0 27 5 9 3
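The centroid table is shown in encoded form. As a rough sketch of how such codes can be mapped back to labels: LabelEncoder numbers the sorted unique values of a column starting from 0, so an equivalent mapping can be rebuilt from the original values. The helper names below are hypothetical, and the example uses only the four blood types listed in the attribute table.

```python
def fit_codes(values):
    """Assign integer codes to labels in sorted order, starting at 0."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def decode(code, codes):
    """Look up the label that was assigned a given integer code."""
    inverse = {c: label for label, c in codes.items()}
    return inverse[code]

# The four blood types from the attribute list, in arbitrary order
blood_types = ["O+", "A+", "B+", "O+", "AB+"]
codes = fit_codes(blood_types)

print(codes)
# Blood Type codes of the three centroids in the table above
print([decode(c, codes) for c in (3, 2, 0)])
```

Under this mapping, the Blood Type codes 3, 2, and 0 read as O+, B+, and A+, which agrees with the encoded preview earlier (P1, an O+ donor, was encoded as 3). In the notebook itself, keeping one fitted encoder per column (for example in a dict) would allow the same decoding through `inverse_transform`.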
In [26]:
kmodes = km_cao.cluster_centroids_
shape = kmodes.shape
shape
Out[26]:
(3, 6)

Choosing K by comparing Cost against each K

In [27]:
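# Fit k-modes for K = 1..4 and record the clustering cost of each fit.
# Cao initialization is deterministic, so the ten runs logged for each K are identical.
cost = []
for num_clusters in range(1, 5):
    kmode = KModes(n_clusters=num_clusters, init = "cao", n_init = 10, verbose=1)
    kmode.fit_predict(df_donor)
    cost.append(kmode.cost_)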
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 5763.0
[Runs 2 to 10 repeat Run 1: moves: 0, cost: 5763.0]
Best run was number 1
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 5170.0
[Runs 2 to 10 repeat Run 1: moves: 0, cost: 5170.0]
Best run was number 1
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 102, cost: 4863.0
[Runs 2 to 10 repeat Run 1: moves: 102, cost: 4863.0]
Best run was number 1
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 100, cost: 4658.0
[Runs 2 to 10 repeat Run 1: moves: 100, cost: 4658.0]
Best run was number 1
In [28]:
# Plot cost against the number of clusters (elbow curve)
k_values = np.arange(1, 5)
plt.plot(k_values, cost)
Out[28]:
[<matplotlib.lines.Line2D at 0x24112adde08>]
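Since the plot itself is not reproduced here, the short sketch below works through the same comparison numerically, using the four cost values printed in the log above. It shows how much the cost falls with each additional cluster; reading the point where the drops start to flatten is one common way to pick K, and is offered here as an illustration rather than as the team's exact selection rule.

```python
# Costs reported in the log above for K = 1..4
costs = {1: 5763.0, 2: 5170.0, 3: 4863.0, 4: 4658.0}

ks = sorted(costs)
# Absolute and relative cost reduction gained by going from K-1 to K clusters
drops = {k: costs[k - 1] - costs[k] for k in ks[1:]}
pct = {k: round(100 * drops[k] / costs[k - 1], 1) for k in drops}

for k in drops:
    print(f"K={k}: cost falls by {drops[k]:.0f} ({pct[k]}%)")
```

The reductions are 593, 307, and 205, so each extra cluster buys less than the one before it.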

Combining the predicted clusters with the original DF

In [29]:
df_donor = df_donor_copy.reset_index()
In [30]:
clustersDf = pd.DataFrame(fitClusters_cao, columns=['cluster_predicted'])
# Both frames share a fresh 0..n-1 index, so rows line up positionally
combinedDf = pd.concat([df_donor, clustersDf], axis = 1)
In [31]:
combinedDf.head()
Out[31]:
Donor ID Gender Blood Type Barangay Network Month Blood Drive Location cluster_predicted
0 P1 Female O+ Western Bicutan TNT January TPDH 0
1 P2 Female A+ Hagonoy Others January TPDH 2
2 P3 Female B+ Tanyag Globe/TM January TPDH 0
3 P4 Male O+ Western Bicutan Globe/TM January TPDH 0
4 P5 Female B+ Upper Bicutan Globe/TM January TPDH 0
In [32]:
label = km_cao.labels_
In [33]:
# Name clusters 1-based: label 0 -> 'cluster1', label 1 -> 'cluster2', ...
combinedDf['cluster_name'] = ['cluster' + str(x + 1) for x in label]
In [34]:
plt.style.use('ggplot')
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,palette = 'rainbow')
plt.title('Cluster distribution')
plt.ylabel(None)
plt.xlabel(None)
Out[34]:
Text(0.5, 0, '')

Cluster Identification

In [35]:
cluster_1 = combinedDf[combinedDf['cluster_predicted'] == 0]
cluster_2 = combinedDf[combinedDf['cluster_predicted'] == 1]
cluster_3 = combinedDf[combinedDf['cluster_predicted'] == 2]
In [36]:
cluster_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1211 entries, 0 to 1815
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Donor ID              1211 non-null   object
 1   Gender                1211 non-null   object
 2   Blood Type            1211 non-null   object
 3   Barangay              1211 non-null   object
 4   Network               1211 non-null   object
 5   Month                 1211 non-null   object
 6   Blood Drive Location  1211 non-null   object
 7   cluster_predicted     1211 non-null   uint16
 8   cluster_name          1211 non-null   object
dtypes: object(8), uint16(1)
memory usage: 87.5+ KB
In [37]:
cluster_1.head()
Out[37]:
Donor ID Gender Blood Type Barangay Network Month Blood Drive Location cluster_predicted cluster_name
0 P1 Female O+ Western Bicutan TNT January TPDH 0 cluster1
2 P3 Female B+ Tanyag Globe/TM January TPDH 0 cluster1
3 P4 Male O+ Western Bicutan Globe/TM January TPDH 0 cluster1
4 P5 Female B+ Upper Bicutan Globe/TM January TPDH 0 cluster1
5 P6 Male B+ Tanyag Globe/TM January TPDH 0 cluster1
In [38]:
cluster_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 371 entries, 18 to 1817
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Donor ID              371 non-null    object
 1   Gender                371 non-null    object
 2   Blood Type            371 non-null    object
 3   Barangay              371 non-null    object
 4   Network               371 non-null    object
 5   Month                 371 non-null    object
 6   Blood Drive Location  371 non-null    object
 7   cluster_predicted     371 non-null    uint16
 8   cluster_name          371 non-null    object
dtypes: object(8), uint16(1)
memory usage: 26.8+ KB
In [39]:
cluster_2.head()
Out[39]:
Donor ID Gender Blood Type Barangay Network Month Blood Drive Location cluster_predicted cluster_name
18 P19 Female B+ Tuktukan TNT January TPDH 1 cluster2
21 P22 Male B+ Western Bicutan Smart January TPDH 1 cluster2
22 P23 Male B+ Maharlika Village Smart January TPDH 1 cluster2
24 P25 Male B+ Western Bicutan Smart January TPDH 1 cluster2
30 P31 Male B+ Western Bicutan Smart January TPDH 1 cluster2
In [40]:
cluster_3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 237 entries, 1 to 1818
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Donor ID              237 non-null    object
 1   Gender                237 non-null    object
 2   Blood Type            237 non-null    object
 3   Barangay              237 non-null    object
 4   Network               237 non-null    object
 5   Month                 237 non-null    object
 6   Blood Drive Location  237 non-null    object
 7   cluster_predicted     237 non-null    uint16
 8   cluster_name          237 non-null    object
dtypes: object(8), uint16(1)
memory usage: 17.1+ KB
In [41]:
cluster_3.head()
Out[41]:
Donor ID Gender Blood Type Barangay Network Month Blood Drive Location cluster_predicted cluster_name
1 P2 Female A+ Hagonoy Others January TPDH 2 cluster3
10 P11 Male A+ Western Bicutan Smart January TPDH 2 cluster3
20 P21 Male A+ Bagumbayan TNT January TPDH 2 cluster3
26 P27 Male A+ Western Bicutan Smart January TPDH 2 cluster3
39 P40 Male A+ Western Bicutan TNT January TPDH 2 cluster3
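The three `info()` outputs above give the cluster sizes. As a quick worked check, the sketch below confirms that they add up to the 1,819 cleaned records and expresses each cluster as a share of the total.

```python
# Entry counts taken from the cluster_1/2/3 info() outputs above
sizes = {"cluster1": 1211, "cluster2": 371, "cluster3": 237}

total = sum(sizes.values())
shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}

print(total)
print(shares)
```

Cluster 1 holds roughly two thirds of the donors, which is worth keeping in mind when comparing raw counts across clusters in the charts that follow.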
In [42]:
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Gender'])
plt.show()
In [43]:
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Blood Type'])
plt.show()
In [44]:
plt.style.use('ggplot')
plt.subplots(figsize = (15,12))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Barangay'])
plt.show()
In [45]:
plt.style.use('ggplot')
plt.subplots(figsize = (15,5))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Network'])
plt.show()
In [46]:
plt.style.use('ggplot')
plt.subplots(figsize = (15,10))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Month'])
plt.show()
In [47]:
plt.style.use('ggplot')
plt.subplots(figsize = (15,10))
sns.countplot(x=combinedDf['cluster_name'],order=combinedDf['cluster_name'].value_counts().index,hue=combinedDf['Blood Drive Location'])
plt.show()
In [48]:
# Extract clustered data to csv
combinedDf.to_csv(r'C:\Users\ACER\Desktop\Maico - Files\Data Science\Bootcamp Materials\Capstone\Cao_v1_n3.csv')

Transform segmented data to Power BI visualization

[Image: Power BI GIF.gif]

Generate insights

Analyzing the segmented data, we were able to identify personas for each cluster. These are based on the dominant features (blood type, gender, barangay, donation facility, month, network) per grouping as well as the context of the donors.

  • Cluster 1: Key group of new donors
  • Cluster 2: Regular donors (Returnees)
  • Cluster 3: New donors
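The "dominant features" behind these personas can be tallied programmatically. The sketch below is a minimal illustration of that step using only the standard library and nine rows copied from the `head()` outputs earlier in the notebook; three rows per cluster are far too few to be representative, so the printed values should not be read as the study's findings. On the full data, the same tally could be run over `combinedDf` grouped by cluster.

```python
from collections import Counter, defaultdict

# Rows copied from the head() outputs above:
# (cluster_name, Gender, Blood Type, Network)
rows = [
    ("cluster1", "Female", "O+", "TNT"),
    ("cluster1", "Female", "B+", "Globe/TM"),
    ("cluster1", "Male", "O+", "Globe/TM"),
    ("cluster2", "Female", "B+", "TNT"),
    ("cluster2", "Male", "B+", "Smart"),
    ("cluster2", "Male", "B+", "Smart"),
    ("cluster3", "Female", "A+", "Others"),
    ("cluster3", "Male", "A+", "Smart"),
    ("cluster3", "Male", "A+", "Smart"),
]
features = ("Gender", "Blood Type", "Network")

# Group the feature values by cluster
by_cluster = defaultdict(list)
for cluster, *values in rows:
    by_cluster[cluster].append(values)

# For each cluster, take the most frequent value of every feature
dominant = {
    cluster: {
        name: Counter(col).most_common(1)[0][0]
        for name, col in zip(features, zip(*records))
    }
    for cluster, records in by_cluster.items()
}

for cluster, profile in dominant.items():
    print(cluster, profile)
```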

[Image: Personas.PNG]

Recommendations

Data-driven recommendations

  • Gather more demographic information (e.g., sex, age, blood volume donated (cc), employment status) that can help improve segmentation.
  • Conduct surveys to observe the behavioral patterns of donors (e.g., motivation to donate).
  • Assign a health coordinator to recruit blood donors in the local areas / barangays with the lowest number of donors.
  • Continue retention initiatives, such as recognizing commitment to voluntary blood donation and partnering with companies, in the local areas / barangays with the highest number of donors.
  • Increase the frequency of mobile blood drives during July and August, targeting areas / barangays far from the in-house facility.

Ways of working recommendations

  • Automate follow-up invites for blood donation by creating a scheduling tool based on the date of the last donation + 3 months and the contact number provided at the initial donation.
  • Digitalize donor data collection through the use of a registration laptop and an online survey. To decentralize, you may opt to provide a QR code linking to the survey and inspect the “completed/submitted results” before proceeding.
  • Advocacy talks: innovate the materials used (e.g., redesign slides, use videos/testimonials) and tie up with barangays for awareness sessions.
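As a rough sketch of the follow-up scheduling idea in the first recommendation, the snippet below computes the next invite date as the last donation date plus three months. The donor records, contact placeholders, and the `add_months` helper are hypothetical; the dataset used in this notebook does not contain donation dates or contact numbers.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

FOLLOW_UP_MONTHS = 3  # interval named in the recommendation

# Hypothetical records: (donor ID, contact placeholder, last donation date)
donors = [
    ("P1", "contact-1", date(2019, 1, 9)),
    ("P2", "contact-2", date(2019, 11, 30)),
]

schedule = [
    (donor_id, contact, add_months(last_donation, FOLLOW_UP_MONTHS))
    for donor_id, contact, last_donation in donors
]

for donor_id, contact, invite_on in schedule:
    print(donor_id, contact, invite_on.isoformat())
```

Clamping to the end of the month keeps a 30 November donation from producing an invalid 30 February date; a real tool would also need to decide how to handle weekends and donors who have opted out.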

Appendix

In [49]:
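# Combined charts: one count plot per feature, split by cluster
feature_cols = ['Gender', 'Blood Type', 'Barangay', 'Network', 'Month', 'Blood Drive Location']
plt.style.use('ggplot')
for col in feature_cols:
    plt.subplots(figsize = (15,10))
    sns.countplot(x='cluster_name', hue=col, data = combinedDf)
    plt.show()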
    plt.show()