Analysing Socioeconomic and Health Factors Using PCA
- Vusi Kubheka
- Nov 21, 2024
- 3 min read
When tasked with helping an international humanitarian NGO allocate $10 million in aid, our goal as analysts is to identify the countries most in need of support. Using a dataset sourced from Kaggle, we applied data analysis techniques to classify countries based on socio-economic and health factors. This post explains how we prepared the data, explored it visually, and used Principal Component Analysis (PCA) to uncover meaningful patterns.
Step 1: Loading and Exploring the Data
The analysis began with importing the required libraries, particularly Pandas, to load and preview the dataset. The dataset provides several socio-economic and health metrics for each country, such as GDP, child mortality, and income inequality. After loading the data, we displayed the first few rows to get an initial understanding of its structure.
import pandas as pd
data = pd.read_csv('/content/Country-data.csv')
data.head()
This step ensures familiarity with the dataset and allows us to identify key columns relevant to the analysis.
Step 2: Preparing the Data for Analysis
The dataset included a column for the country names, which wasn’t needed for numerical computations. To simplify the analysis, we removed this column, leaving only numerical features.
data_1 = data.drop('country', axis=1)
data_1.head()
I then used the .describe() and .info() methods to summarise the data, check for missing values, and understand the range of each variable.
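For completeness, a minimal sketch of those summary checks (method calls only; the printed output is omitted here):
# Summary statistics (count, mean, std, min, quartiles, max) for each numerical feature
data_1.describe()
# Column data types and non-null counts, useful for spotting missing values
data_1.info()
# Explicit count of missing values per column
data_1.isnull().sum()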
Step 3: Visualising the Data
Understanding the dataset’s distribution is crucial before applying machine learning techniques. Using Matplotlib and Seaborn, we plotted histograms for each feature. These plots reveal the spread of data and help detect potential outliers.
import matplotlib.pyplot as plt
import seaborn as sns
for col in data_1.columns:
    plt.figure()
    plt.title(f'Histogram of {col}')
    sns.histplot(data_1, x=col, kde=True)
    plt.savefig(f'histogram_{col}.png')
The inclusion of kernel density estimates (KDE) provides insights into the underlying probability distribution, smoothing over the raw data's variation.
Additionally, we used a pairplot to visualise relationships between variables:
sns.pairplot(data_1)
plt.savefig('pairplot.png')
This step provides a glimpse of correlations between features, identifying trends that may guide subsequent steps.

Step 4: Correlation Heatmap
To further analyse relationships between variables, I created a correlation matrix heatmap. This technique highlights features with strong positive or negative correlations, helping to decide which features may influence clustering results.
sns.heatmap(data_1.corr(), annot=True, cmap='rocket_r')
plt.savefig('heatmap.png')

Such visualisation is essential for determining how variables are interlinked, aiding the decision-making process for dimensionality reduction.
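As a complement to the heatmap, the strongest pairwise correlations can also be listed programmatically. A small sketch of one way to do this (the 0.7 threshold is an arbitrary choice for illustration, not part of the original analysis):
# Flatten the correlation matrix into (feature_a, feature_b) pairs
corr_pairs = (
    data_1.corr()
    .stack()                      # long format: one row per feature pair
    .reset_index(name='corr')
)
# Drop self-correlations and keep each unordered pair only once
corr_pairs = corr_pairs[corr_pairs['level_0'] < corr_pairs['level_1']]
# Show the pairs with the strongest absolute correlation, e.g. above 0.7
strong = corr_pairs[corr_pairs['corr'].abs() > 0.7].sort_values('corr', key=abs, ascending=False)
print(strong)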
Step 5: Applying Principal Component Analysis (PCA)

Before clustering, we used PCA to reduce the dataset's dimensionality. This technique simplifies the data by combining variables into a smaller number of "principal components" while retaining most of the variance. To begin, we scaled the data using StandardScaler, ensuring each feature contributed equally to the analysis:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_1)
pca = PCA()
pca.fit(data_scaled)
Step 6: Determining the Optimal Number of Components
To decide how many principal components to retain, we plotted the cumulative explained variance ratio:
import numpy as np
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components')
plt.show()

This plot demonstrates how much variance each component captures, enabling us to select the number of components that best balance simplicity and information retention.
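As an illustration, the choice can also be made programmatically by keeping enough components to cover a chosen share of the variance; in this minimal sketch the 90% threshold is our own assumption, not a result from the original analysis:
# Cumulative proportion of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components that captures at least 90% of the variance
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(f'{n_components_90} components explain {cumulative[n_components_90 - 1]:.1%} of the variance')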
Step 7: Analysing PCA Outputs
We summarised the PCA results in a DataFrame, including the eigenvalues, explained variance percentages, and cumulative variance. This information highlights the relative importance of each principal component.
pca_informations = pd.DataFrame({
    'Components': [f'PC {i}' for i in range(data_scaled.shape[1])],
    'eigen values': pca.explained_variance_,
    'explained variance (%)': pca.explained_variance_ratio_ * 100,
    'explained variance cumulative (%)': np.cumsum(pca.explained_variance_ratio_) * 100})
pca_informations
Components | Eigen values | Explained variance (%) | Explained variance cumulative (%)
PC 0 | 4.1606 | 45.95 | 45.95
PC 1 | 1.5557 | 17.18 | 63.13
PC 2 | 1.1774 | 13.00 | 76.14
PC 3 | 1.0008 | 11.05 | 87.19
PC 4 | 0.6646 | 7.34 | 94.53
PC 5 | 0.2249 | 2.48 | 97.02
PC 6 | 0.1141 | 1.26 | 98.28
PC 7 | 0.0888 | 0.98 | 99.26
PC 8 | 0.0673 | 0.74 | 100.00
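To see which original variables drive each component, the loadings can also be inspected. A brief sketch (labelling the rows with data_1.columns is the only assumption here):
# Loadings: how strongly each original feature contributes to each principal component
loadings = pd.DataFrame(
    pca.components_.T,
    index=data_1.columns,
    columns=[f'PC {i}' for i in range(pca.components_.shape[0])]
)
# Features with the largest absolute weight on the first component
loadings['PC 0'].abs().sort_values(ascending=False)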
Step 8: Visualising Principal Components
After selecting two principal components, we projected the dataset into this new two-dimensional space for visualisation. The scatterplot revealed clusters or patterns that could inform decision-making.
final_pca = PCA(n_components=2)
final_result = final_pca.fit_transform(data_scaled)
plt.figure(figsize=(12, 4), dpi=100)
sns.scatterplot(x=final_result[:, 0], y=final_result[:, 1], s=60)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.axhline(y=0, ls='--', c='red')
plt.axvline(x=0, ls='--', c='red')

PCA has simplified the dataset while retaining meaningful patterns, making it ready for k-means clustering. In the next steps, I will implement k-means to group countries into clusters and identify those most in need of aid.
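As a preview of that next step, a minimal sketch of how k-means could be applied to the two-component projection (the choice of three clusters is purely illustrative and not a result of this analysis):
from sklearn.cluster import KMeans
# Fit k-means on the 2-D PCA projection; the number of clusters is illustrative only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(final_result)
# Attach the cluster labels back to the original countries for inspection
clustered = data.copy()
clustered['cluster'] = labels
clustered[['country', 'cluster']].head()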