Analysing Socioeconomic and Health Factors Using PCA
- Vusi Kubheka
- Nov 21, 2024
- 3 min read
When tasked with helping an international humanitarian NGO allocate $10 million in aid, our goal as analysts is to identify the countries most in need of support. Using a dataset sourced from Kaggle, we applied data analysis techniques to classify countries based on socio-economic and health factors. This post explains how we prepared the data, explored it visually, and used Principal Component Analysis (PCA) to uncover meaningful patterns.
Step 1: Loading and Exploring the Data
The analysis began with importing the required libraries, particularly Pandas, to load and preview the dataset. The dataset provides several socio-economic and health metrics for each country, such as GDP, child mortality, and income inequality. After loading the data, we displayed the first few rows to get an initial understanding of its structure.
import pandas as pd
data = pd.read_csv('/content/Country-data.csv')
data.head()
This step ensures familiarity with the dataset and allows us to identify key columns relevant to the analysis.
Step 2: Preparing the Data for Analysis
The dataset included a column for the country names, which wasn’t needed for numerical computations. To simplify the analysis, we removed this column, leaving only numerical features.
data_1 = data.drop('country', axis=1)
data_1.head()
I then used the .describe() and .info() methods to summarise the data, check for missing values, and understand the range of each variable.
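For completeness, a minimal sketch of those summary checks (method calls only; the printed output is omitted here):
# Summary statistics (count, mean, std, min, quartiles, max) for each numerical feature
data_1.describe()
# Column data types and non-null counts, useful for spotting missing values
data_1.info()
# Explicit count of missing values per column
data_1.isnull().sum()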
Step 3: Visualising the Data
Understanding the dataset’s distribution is crucial before applying machine learning techniques. Using Matplotlib and Seaborn, we plotted histograms for each feature. These plots reveal the spread of data and help detect potential outliers.
import matplotlib.pyplot as plt
import seaborn as sns
for col in data_1.columns:
    plt.figure()
    plt.title(f'Histogram of {col}')
    sns.histplot(data_1, x=col, kde=True)
    plt.savefig(f'histogram_{col}.png')
The inclusion of kernel density estimates (KDE) provides insights into the underlying probability distribution, smoothing over the raw data's variation.
Additionally, we used a pairplot to visualise relationships between variables:
sns.pairplot(data_1)
plt.savefig('pairplot.png')
This step provides a glimpse of correlations between features, identifying trends that may guide subsequent steps.

Step 4: Correlation Heatmap
To further analyse relationships between variables, I created a correlation matrix heatmap. This technique highlights features with strong positive or negative correlations, helping to decide which features may influence clustering results.
sns.heatmap(data_1.corr(), annot=True, cmap='rocket_r')
plt.savefig('heatmap.png')

Such visualisation is essential for determining how variables are interlinked, aiding the decision-making process for dimensionality reduction.
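As a complement to the heatmap, the strongest pairwise correlations can also be listed programmatically. A small sketch of one way to do this (the 0.7 threshold is an arbitrary choice for illustration, not part of the original analysis):
# Flatten the correlation matrix into (feature_a, feature_b) pairs
corr_pairs = (
    data_1.corr()
    .stack()                      # long format: one row per feature pair
    .reset_index(name='corr')
)
# Drop self-correlations and keep each unordered pair only once
corr_pairs = corr_pairs[corr_pairs['level_0'] < corr_pairs['level_1']]
# Show the pairs with the strongest absolute correlation, e.g. above 0.7
strong = corr_pairs[corr_pairs['corr'].abs() > 0.7].sort_values('corr', key=abs, ascending=False)
print(strong)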
Step 5: Applying Principal Component Analysis (PCA)

Before clustering, we used PCA to reduce the dataset's dimensionality. This technique simplifies the data by combining variables into a smaller number of "principal components" while retaining most of the variance. To begin, we scaled the data using StandardScaler, ensuring each feature contributed equally to the analysis:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_1)
pca = PCA()
pca.fit(data_scaled)
Step 6: Determining the Optimal Number of Components
To decide how many principal components to retain, we plotted the cumulative explained variance ratio:
import numpy as np
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components')
plt.show()

This plot demonstrates how much variance each component captures, enabling us to select the number of components that best balance simplicity and information retention.
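As an illustration, the choice can also be made programmatically by keeping enough components to cover a chosen share of the variance; in this minimal sketch the 90% threshold is our own assumption, not a result from the original analysis:
# Cumulative proportion of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components that captures at least 90% of the variance
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(f'{n_components_90} components explain {cumulative[n_components_90 - 1]:.1%} of the variance')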
Step 7: Analysing PCA Outputs
We summarised the PCA results in a DataFrame, including the eigenvalues, explained variance percentages, and cumulative variance. This information highlights the relative importance of each principal component.
pca_informations = pd.DataFrame({
    'Components': [f'PC {i}' for i in range(data_scaled.shape[1])],
    'eigen values': pca.explained_variance_,
    'explained variance (%)': pca.explained_variance_ratio_ * 100,
    'explained variance cumulative (%)': np.cumsum(pca.explained_variance_ratio_) * 100})
pca_informations
Components | Eigen values | Explained variance (%) | Explained variance cumulative (%)
PC 0 | 4.1606 | 45.95 | 45.95
PC 1 | 1.5557 | 17.18 | 63.13
PC 2 | 1.1774 | 13.00 | 76.14
PC 3 | 1.0008 | 11.05 | 87.19
PC 4 | 0.6646 | 7.34 | 94.53
PC 5 | 0.2249 | 2.48 | 97.02
PC 6 | 0.1141 | 1.26 | 98.28
PC 7 | 0.0888 | 0.98 | 99.26
PC 8 | 0.0673 | 0.74 | 100.00
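To see which original variables drive each component, the loadings can also be inspected. A brief sketch (labelling the rows with data_1.columns is the only assumption here):
# Loadings: how strongly each original feature contributes to each principal component
loadings = pd.DataFrame(
    pca.components_.T,
    index=data_1.columns,
    columns=[f'PC {i}' for i in range(pca.components_.shape[0])]
)
# Features with the largest absolute weight on the first component
loadings['PC 0'].abs().sort_values(ascending=False)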
Step 8: Visualising Principal Components
After selecting two principal components, we projected the dataset into this new two-dimensional space for visualisation. The scatterplot revealed clusters or patterns that could inform decision-making.
final_pca = PCA(n_components=2)
final_result = final_pca.fit_transform(data_scaled)
plt.figure(figsize=(12, 4), dpi=100)
sns.scatterplot(x=final_result[:, 0], y=final_result[:, 1], s=60)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.axhline(y=0, ls='--', c='red')
plt.axvline(x=0, ls='--', c='red')

PCA has simplified the dataset while retaining meaningful patterns, making it ready for k-means clustering. In the next steps, I will implement k-means to group countries into clusters and identify those most in need of aid.
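As a preview of that next step, a minimal sketch of how k-means could be applied to the two-component projection (the choice of three clusters is purely illustrative and not a result of this analysis):
from sklearn.cluster import KMeans
# Fit k-means on the 2-D PCA projection; the number of clusters is illustrative only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(final_result)
# Attach the cluster labels back to the original countries for inspection
clustered = data.copy()
clustered['cluster'] = labels
clustered[['country', 'cluster']].head()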