The World Happiness Report ranks countries by how happy their citizens perceive themselves to be, drawing on each country's standing in areas such as the economy, health, social support, and trust in government. Because it spans several factors, it can serve as a guide for gauging policy effectiveness and good governance and for comparing quality of life across countries.
In this project, we explore and visualize the distribution of happiness scores among countries, identify the strongest determinants and how the factors relate to one another, and apply linear regression to test the model used for the report.
Here is the link to the slide pack used: https://drive.google.com/file/d/15q6585HzYGlcFx3PbSPpw5spmRCw_Ozx/view?usp=sharing
Proponents of the project are Janine Cheong, Raymund Norada, Karen Salas, and Maico Rebong. This final project serves as a requirement in our Diploma Course in Foundations in Data Science under De La Salle University. This study was awarded Best in Data Analysis among the thirteen presentations.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# For visualizations
import plotly.graph_objs as go # plotly graphical object
import chart_studio
chart_studio.tools.set_config_file(world_readable=False, sharing='private')
from string import ascii_letters
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
from sklearn.cluster import KMeans
from scipy.stats import norm
from scipy import stats
# For scaling dataset
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Regression Models
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings("ignore")
plt.style.use('tableau-colorblind10') #seaborn-whitegrid #tableau-colorblind10 #fivethirtyeight
textColor = '#006680'
highlightColor = '#3b738f'
# Load 2019 deep dive data
df = pd.read_csv(r'C:\Users\ACER\Desktop\DLSU\[WIP] Data Sci Project\[WIP] Data Sci Project\2019.csv')
df['Year'] = df['Year'].astype(object)
df.rename(columns = {'Happiness Rank' : 'Happiness_Rank',
'Happiness Score' : 'Happiness_Score',
'Standard Error' : 'Standard_Error',
'Economy (GDP per Capita)' : 'Economy',
'Health (Life Expectancy)' : 'Health',
'Trust (Government Corruption)' : 'Government Trust',
'Dystopia Residual' : 'Dystopia_Residual',
'Social Support': 'Social_Support'}, inplace = True)
df.head()
df.describe()
df.isna().sum()
# Histogram of happiness scores with the mean and median marked
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
ax = df['Happiness_Score'].hist(edgecolor='w', figsize=(12, 7), color='#E7E4CB')
ax.set_title('Happiness Distribution of Countries', color='#006680', fontsize=20, fontweight='bold')
ax.set_xlabel('Happiness Score', color='#006680', fontsize=15, fontweight='bold')
ax.set_ylabel('No. of Countries', color='#006680', fontsize=15, fontweight='bold');
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.xaxis.grid(False)
ax.yaxis.grid(False)
ax.axvline(x=df['Happiness_Score'].mean(), color='#164066', linestyle='--', label='mean')
ax.axvline(x=df['Happiness_Score'].median(), color='#25C5BF', linewidth=4, label='median')
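# Label the lines with values computed from the data instead of hard-coded numbers
plt.text(5.5, 25, f"Mean = {df['Happiness_Score'].mean():.2f}", fontweight='bold', fontsize=15, color='#164066')
plt.text(5.5, 23, f"Median = {df['Happiness_Score'].median():.2f}", fontweight='bold', fontsize=15, color='#25C5BF')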
ax.legend(fontsize=14);
# Create a boxplot
plt.style.use('tableau-colorblind10')
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
sns.boxplot(data=df, x='Continent', y='Happiness_Score', ax=ax)
ax.set_title('Distribution of Happiness Scores per Continent', color='#006680', fontweight='bold', fontsize=20,)
ax.set_ylabel('Happiness Score', color='#006680', fontsize=15, fontweight='bold');
ax.set_xlabel('Continent', color='#006680', fontsize=15, fontweight='bold');
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(fontsize=13, rotation=0)
plt.yticks(fontsize=13, rotation=0)
plt.show()
We also used a box-and-whisker plot to show the distribution of scores per continent. Here we can see that European countries score higher, whereas African countries score lower.
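As a rough illustration of what a box plot summarizes, the sketch below computes the minimum, quartiles, and maximum per group with the standard library. The scores are made-up values for illustration, not the 2019 data, and `five_number_summary` is a hypothetical helper name.

```python
from statistics import quantiles

# Hypothetical scores grouped by continent (made-up values, not the 2019 data).
scores_by_continent = {
    "Europe": [7.8, 7.6, 7.5, 6.9, 6.2, 5.9],
    "Africa": [4.9, 4.5, 4.3, 3.9, 3.4, 3.2],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # n=4 splits the data into quartiles; "inclusive" interpolates linearly
    # between the observed data points.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


summaries = {
    name: five_number_summary(values)
    for name, values in scores_by_continent.items()
}
for name, summary in summaries.items():
    print(name, summary)
```

Comparing the medians and the spread between Q1 and Q3 across groups is the same comparison the reader makes visually on the box plot.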
The next choropleth map supports the observation from the previous figure: Europe appears in green and Africa in red.
data = [dict(
type='choropleth',
colorscale = 'RdYlGn', #algae #darkmint #RdYlGn
locations = df['Country'],
z = df['Happiness_Score'],
locationmode = 'country names',
text = df['Country'],
marker_line_color='darkgray',
marker_line_width=0.5,
colorbar = dict(
title = 'Happiness Score',
titlefont=dict(size=13),
tickfont=dict(size=13))
)]
layout = dict(
title = 'Happiness Score',
titlefont = dict(size=20),
geo = dict(
showframe = True,
showcoastlines = True,
projection = dict(type = 'equirectangular')
)
)
choromap = go.Figure(data = data, layout = layout)
choromap
Economy, Social Support, and Healthy Life Expectancy have a strong degree of relationship with the Happiness Score. As each of these factors increases, the overall Happiness Score increases as well.
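Factor-to-factor analysis shows a strong degree of relationship between Economy and Social Support, Economy and Healthy Life Expectancy, and Social Support and Healthy Life Expectancy.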
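# Drop columns that are not happiness factors before computing correlations
df4 = df.drop(columns=['Happiness_Rank', 'Year'])
fig_dims = (11, 9)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
cmap = sns.diverging_palette(220, 10, as_cmap=True) # diverging color map for the heatmap
# numeric_only=True (pandas 1.5+) skips text columns such as Country and Continent
sns.heatmap(df4.corr(numeric_only=True), annot=True, cmap=cmap)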
ax.set_title('Correlation Matrix of the Happiness Factors', color='#006680', fontweight='bold', fontsize=20,)
plt.yticks(fontsize=13, rotation=0)
plt.xticks(fontsize=13, rotation=45)
plt.show()
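The values annotated in a correlation heatmap like the one above are Pearson correlation coefficients (the default method of `DataFrame.corr`). A minimal sketch of that calculation follows, using made-up numbers rather than the report data; `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five countries (illustration only).
economy = [0.3, 0.6, 0.9, 1.2, 1.5]
generosity = [0.25, 0.10, 0.30, 0.15, 0.20]
score = [3.5, 4.4, 5.1, 6.3, 7.0]

r_econ = pearson_r(economy, score)
r_gen = pearson_r(generosity, score)
print(f"Economy vs score: {r_econ:.2f}")
print(f"Generosity vs score: {r_gen:.2f}")
```

A coefficient near 1 means the two variables rise together, while a value near 0 means there is little linear relationship.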
For the three key factors, we also drew a scatterplot of each against the Happiness Score, and each shows a strong relationship. The points are colored by continent. Consistently, the blue dots (African countries) fall in the lower-left quadrant, whereas the light gray dots (European countries) fall in the upper-right quadrant.
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
plt.style.use('tableau-colorblind10') #style.available
groups = df4.groupby("Continent")
for name, group in groups:
    plt.plot(group["Economy"], group["Happiness_Score"], marker="o", linestyle='', label=name, ms=10, alpha=0.9)
ax.set_xlabel('GDP per capita', color='#006680', fontsize=15, fontweight='bold')
ax.set_ylabel('Happiness Score', color='#006680', fontsize=15, fontweight='bold')
ax.set_title('GDP per capita against Happiness Score per Continent', color='#006680', fontweight='bold', fontsize=20);
plt.legend();
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
plt.style.use('tableau-colorblind10') #style.available
groups = df4.groupby("Continent")
for name, group in groups:
    plt.plot(group["Social_Support"], group["Happiness_Score"], marker="o", linestyle='', label=name, ms=10, alpha=0.9)
ax.set_xlabel('Social Support', color='#006680', fontsize=15, fontweight='bold')
ax.set_ylabel('Happiness Score', color='#006680', fontsize=15, fontweight='bold')
ax.set_title('Social Support Status against Happiness Score per Continent', color='#006680', fontweight='bold', fontsize=20);
plt.legend();
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
plt.style.use('tableau-colorblind10') #style.available
groups = df4.groupby("Continent")
for name, group in groups:
    plt.plot(group["Health"], group["Happiness_Score"], marker="o", linestyle='', label=name, ms=10, alpha=0.9)
ax.set_xlabel('Healthy Life Expectancy', color='#006680', fontsize=15, fontweight='bold')
ax.set_ylabel('Happiness Score', color='#006680', fontsize=15, fontweight='bold')
ax.set_title('Healthy Life Expectancy against Happiness Score per Continent', color='#006680', fontweight='bold', fontsize=20);
plt.legend();
Regression plots were also made for the high factor-to-factor correlations to observe the values. The regression line in each plot trends upward, which is consistent with a strong positive correlation.
plt.style.use('tableau-colorblind10')
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
sns.regplot(x='Economy',y='Social_Support', data=df4)
sns.set(style="white")
plt.xticks(rotation=90)
ax.set_title('Correlation between Economy and Social Support', color='#006680', fontweight='bold', fontsize=20)
ax.set_xlabel('Economy', color='#006680', fontsize=15, fontweight='bold');
ax.set_ylabel('Social Support', color='#006680', fontsize=15, fontweight='bold');
plt.xticks(rotation=0)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.style.use('tableau-colorblind10')
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
sns.regplot(x='Economy',y='Health', data=df4)
sns.set(style="white")
plt.xticks(rotation=90)
ax.set_title('Correlation between Economy and Healthy Life Expectancy', color='#006680', fontweight='bold', fontsize=20)
ax.set_xlabel('Economy', color='#006680', fontsize=15, fontweight='bold');
ax.set_ylabel('Health', color='#006680', fontsize=15, fontweight='bold');
plt.xticks(rotation=0)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.style.use('tableau-colorblind10')
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
sns.regplot(x='Social_Support',y='Health', data=df4)
sns.set(style="white")
plt.xticks(rotation=90)
ax.set_title('Correlation between Social Support and Healthy Life Expectancy', color='#006680', fontweight='bold', fontsize=20)
ax.set_xlabel('Social Support', color='#006680', fontsize=15, fontweight='bold');
ax.set_ylabel('Health', color='#006680', fontsize=15, fontweight='bold');
plt.xticks(rotation=0)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
We also examined the factors at the country level. While Economy is the biggest factor in the overall analysis, it is a different story for the top 10 and bottom 10 countries.
For the top 10, the highest factor is Social Support and the lowest is Generosity. For the bottom 10, the highest factors are Healthy Life Expectancy and Social Support, while the lowest is Economy.
This suggests that our sense of belonging affects overall happiness regardless of whether one lives in a rich or a poor country.
df2 = df.drop(columns=['Year'])
df2.head()
# Flag the ten highest-ranked countries (1 = top 10, 0 = otherwise)
df2['Top_10'] = (df2['Happiness_Rank'] <= 10).astype(int)
df2.head(10)
df2.tail(10)
df_sort_happiness = df2.sort_values(by = ["Happiness_Score"])
top10_countries = df_sort_happiness["Country"].tail(10).values
bottom10_countries = df_sort_happiness["Country"].head(10).values
# Normalize to make variables comparable
min_max_scaler = preprocessing.MinMaxScaler()
columns = ['Economy', 'Social_Support', 'Health', 'Freedom', 'Generosity', 'Government Trust']
df_sort_happiness = df_sort_happiness[columns]
df_sort_happiness = df_sort_happiness.dropna()
df_sort_happiness = pd.DataFrame(min_max_scaler.fit_transform(df_sort_happiness[columns]), columns = columns)
df_sort_happiness.columns = ['Economy', 'Social_Support', 'Health', 'Freedom', 'Generosity', 'Government Trust']
df_sort_happiness.shape
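For reference, min-max scaling maps each column onto the range 0 to 1 using that column's own minimum and maximum. The sketch below shows the arithmetic on made-up values; `min_max_scale` is a hypothetical helper, not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread, so map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical factor values for three countries (illustration only).
economy = [0.2, 0.8, 1.4]
government_trust = [0.05, 0.10, 0.45]

economy_scaled = min_max_scale(economy)
trust_scaled = min_max_scale(government_trust)
print(economy_scaled)
print(trust_scaled)
```

Because the minimum and maximum come from the whole column, a scaled value describes a country's position relative to all countries in the data, which is what makes factors with different units comparable in one heatmap.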
df_top10 = df_sort_happiness.tail(10)
Index = top10_countries
Cols = list(df_top10.columns)
fig_dims = (8, 11)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
df_top10_heat = pd.DataFrame(df_top10.values,index = Index, columns = Cols)
sns.heatmap(df_top10_heat, cbar = True, square = True, annot=True, cmap='GnBu', linewidths = .5)
plt.yticks(rotation=0)
ax.set_title('Top 10 Countries', color='#006680', fontweight='bold', fontsize=20)
plt.yticks(fontsize=13, rotation=0)
plt.xticks(fontsize=13, rotation=45)
plt.savefig('Top10_Happiness.png')
df_bottom10 = df_sort_happiness.head(10)
Index = bottom10_countries
Cols = list(df_bottom10.columns)
fig_dims = (8, 11)
fig, ax = plt.subplots(figsize=fig_dims, dpi=80)
df_bottom10_heat = pd.DataFrame(df_bottom10.values,index = Index, columns = Cols)
sns.heatmap(df_bottom10_heat, cbar = True, square = True, annot=True, cmap='GnBu', linewidths = .5)
plt.yticks(rotation=0)
ax.set_title('Bottom 10 Countries', color='#006680', fontweight='bold', fontsize=20)
plt.yticks(fontsize=13, rotation=0)
plt.xticks(fontsize=13, rotation=45)
plt.savefig('Bottom10_Happiness.png')
df2[['Happiness_Score', 'Economy','Social_Support', 'Health', 'Freedom', 'Generosity','Government Trust']].describe()
cols = ['Economy', 'Social_Support', 'Health', 'Freedom', 'Generosity', 'Government Trust']
print("Frequency and Distribution of Data")
plt.figure(figsize = (20, 15))
for num in range(len(cols)):
    plt.subplot(3, 3, num+1)
    plt.hist(df2[cols[num]])
    plt.ylabel('Frequency')
    plt.xlabel(cols[num])
plt.show()
We also used regression to analyze the six independent variables of the dataset. The following visualizations show how each factor relates to the Happiness Score. Economy, Social Support, and Healthy Life Expectancy show a strong linear relationship with the Happiness Score. Perception of Corruption (Government Trust) and Generosity show no notable relationship with it, whereas Freedom to Make Life Choices shows a mild linear relationship.
X = df2[['Happiness_Score','Economy','Social_Support','Health','Freedom', 'Generosity','Government Trust']] #Subsetting the data
Y = X #Subsetting for future use
sns.set_style("white")
cols = ['Economy','Social_Support', 'Health', 'Freedom', 'Generosity', 'Government Trust']
plt.figure(figsize = (20, 15))
for num in range(len(cols)):
    plt.subplot(3, 3, num+1)
    plt.scatter(df2[cols[num]], df2['Happiness_Score'])
    plt.ylabel('Score')
    plt.xlabel(cols[num])
# Save before plt.show(); saving afterwards can write out an empty figure
plt.savefig('LinearRegression_Happiness.png')
plt.show()
Single-variable linear regression models for Economy, Social Support, and Health are consistent with the correlation matrix results. The R-squared values computed by the following code quantify the trends visible in the scatterplots above.
# get R^2, fitness of model
for var in cols:
    lin_reg = LinearRegression().fit(df2[[var]], df2['Happiness_Score'])
    print("R-squared for {}: {}".format(var, round(lin_reg.score(df2[[var]], df2['Happiness_Score']), 4)))
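To make the R-squared figure concrete, here is a minimal sketch of a one-variable least-squares fit and its R-squared, written without scikit-learn. The data are made-up values, and `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical factor and score values (illustration only).
economy = [0.3, 0.6, 0.9, 1.2, 1.5]
score = [3.5, 4.4, 5.1, 6.3, 7.0]

slope, intercept = fit_line(economy, score)
fitted = [slope * x + intercept for x in economy]
r2 = r_squared(score, fitted)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R-squared={r2:.4f}")
```

An R-squared close to 1 means the fitted line accounts for most of the variation in the score, and a value close to 0 means it accounts for very little.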
The three key factors with the highest correlation with the Happiness Score are Economy, Social Support, and Healthy Life Expectancy.
These three key factors also show a positive linear relationship with the Happiness Score. We use them as the independent variables in the linear model we build, with the Happiness Score as the dependent variable.
To start building the model, we first drop the variables that are least correlated with the Happiness Score.
We plot a histogram to check the distribution of the data, and it shows that the scores are approximately normally distributed.
We can now proceed to build the linear regression model.
#Model Prediction
model = df2[['Happiness_Score','Economy','Social_Support','Health','Freedom', 'Generosity', 'Government Trust']] #Subsetting the data
plt.figure(figsize=(10,10))
sns.distplot(Y['Happiness_Score'], fit = norm)
plt.title("Score Distribution Plot",size=15, weight='bold')
plt.savefig('NormalDistribution_Happiness.png')
plt.show()
We split the data into training and test sets: 70% of the data for training and 30% for testing. The linear model is trained on the training set and then used to predict the Happiness Score for the test set.
# Split the dataset to training and testing set
df2_train, df2_test = train_test_split(df2, test_size=0.3, random_state=1)
x_train = df2_train[['Economy','Social_Support', 'Health']]
y_train = df2_train['Happiness_Score']
x_test = df2_test[['Economy','Social_Support', 'Health']]
y_test = df2_test['Happiness_Score']
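Conceptually, the split shuffles the rows with a fixed seed and holds out a fraction of them. The sketch below illustrates that idea with the standard library only; `split_train_test` is a hypothetical helper and does not reproduce the exact row order chosen by scikit-learn's `train_test_split`.

```python
import math
import random


def split_train_test(rows, test_size=0.3, seed=1):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical row labels standing in for countries.
countries = [f"Country {i}" for i in range(1, 11)]
train_rows, test_rows = split_train_test(countries)
print(len(train_rows), len(test_rows))  # prints "7 3"
```

Fixing the seed is what makes the split repeatable, so the reported metrics can be reproduced on a later run.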
#NOTE: Only scale the predictor variables, NOT the target variable
#Instantiate the Scaler
scaler = StandardScaler()
#Fit to the TRAIN set
scaler.fit(x_train)
#Apply to the TRAIN set
x_train_s = scaler.transform(x_train)
#Apply to the TEST set
x_test_s = scaler.transform(x_test)
#Optional:
#Convert to DataFrame for viewing
x_train_sdf = pd.DataFrame(x_train_s, columns=x_train.columns, index=x_train.index)
#Convert to DataFrame for viewing
x_test_sdf = pd.DataFrame(x_test_s, columns=x_test.columns, index=x_test.index)
x_train_sdf.head()
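The reason for fitting the scaler on the training set only is that the test set should be transformed with statistics it did not contribute to. A minimal sketch of that idea follows, using made-up values and hypothetical helper names; it uses the population standard deviation, as `StandardScaler` does.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)


def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]


# Hypothetical values of one predictor (illustration only).
train = [0.4, 0.8, 1.0, 1.4]
test = [0.6, 1.2]

mean, std = fit_standardizer(train)       # learned from TRAIN only
train_s = standardize(train, mean, std)   # mean 0, standard deviation 1
test_s = standardize(test, mean, std)     # same shift and scale, not re-fitted
print(train_s)
print(test_s)
```

The scaled test values generally do not have mean 0 or standard deviation 1, and that is expected: they are expressed on the scale learned from the training data.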
#Instantiate the Linear Regression Algorithm
linreg = linear_model.LinearRegression()
#Train the Model
linreg.fit(x_train_sdf, y_train)
pd.DataFrame(linreg.coef_, index=x_train.columns, columns=['Coef'])
# Predict the values
y_pred = linreg.predict(x_test_sdf)
#Measure the performance of the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"r2: {r2:.2f}")
print(f"mae: {mae:.2f}")
print(f"mse: {mse:.2f}")
print(f"rmse: {rmse:.2f}")
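For readers who want to see what these four metrics measure, the sketch below computes them by hand on made-up actual and predicted scores. `regression_metrics` is a hypothetical helper name.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R-squared for paired lists of values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n   # average absolute error
    mse = sum(e ** 2 for e in errors) / n   # average squared error
    rmse = sqrt(mse)                        # back in the units of the score
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - (mse * n) / ss_tot             # 1 - SS_res / SS_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual and predicted scores (illustration only).
actual = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.0, 5.5, 7.0]

results = regression_metrics(actual, predicted)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are in the same units as the Happiness Score, so they can be read as a typical prediction error, while R-squared is unitless.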
lm_results = pd.DataFrame(y_test)
lm_results["Predicted"] = y_pred
lm_results.head()
#Plotting the linear regression model
plt.figure(figsize=(16,8))
sns.regplot(x=y_pred, y=y_test)  # x and y must be passed by keyword in seaborn 0.12+
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.title("Linear Model Predictions", color='#006680', fontweight='bold', fontsize=20)
plt.grid(False)
plt.savefig('LinearModelPrediction.png')
plt.show()
Most of the data points lie close to the regression line, indicating a good fit. This is also consistent with a positive relationship between the Happiness Score and the three key factors. With an R-squared value of 0.71, the regression model performed well on the happiness dataset.
#Visualize the Results
fig = plt.figure(figsize=(25,15))
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
ax3 = ax2.twiny()
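# Each series is sorted independently, so the lines plotted below pair sorted factor values
# with sorted predictions rather than matching rows country by country
x_test_sort_economy = x_test.Economy.sort_values()
x_test_sort_social_support = x_test.Social_Support.sort_values()
x_test_sort_health = x_test.Health.sort_values()
predict_sort = pd.Series(y_pred, index = x_test.index).sort_values()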
ax1.scatter(df2["Economy"], df2["Happiness_Score"],edgecolor="navy", alpha=0.5, linewidths=8)
ax2.scatter(df2["Social_Support"], df2["Happiness_Score"], edgecolor="orange", alpha=0.5, linewidths=8)
ax3.scatter(df2["Health"], df2["Happiness_Score"], edgecolor="green", alpha=0.5, linewidths=8)
ax1.plot(x_test_sort_economy, predict_sort, color="r", label = "Predicted Score on Economy", linewidth=3)
ax1.plot(x_test_sort_social_support, predict_sort, color="m", label = "Predicted Score on Social Support", linewidth=3)
ax1.plot(x_test_sort_health, predict_sort, color="k", label = "Predicted Score on Health", linewidth=3)
ax1.set_xlabel("Economy")
ax1.set_ylabel("Happiness Score")
ax2.set_xlabel("Social Support")
ax3.set_xlabel("Health")
ax1.legend(loc='lower right')
plt.title('Predicted Score on Economy, Social Support and Health', color='#006680', fontweight='bold', fontsize=20,)
plt.savefig('PredictedScores_Happiness.png')
plt.show()
To test the model, we plotted the key x variables against the y variable, and the result showed a positive linear relationship.
Predicting Happiness Scores with the linear model produced a positive result.