
# Modeling life satisfaction with a linear regressor

```python
from os.path import join
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display
from scipy.stats import normaltest
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

sns.set_style("whitegrid")
sns.set_palette("viridis")
```
```python
data_root = "https://github.com/ageron/data/raw/main/"
lifesat = pd.read_csv(join(data_root, "lifesat", "lifesat.csv"))
lifesat.sort_values(by="Life satisfaction", ascending=False)
```
|    | Country        | GDP per capita (USD) | Life satisfaction |
|----|----------------|---------------------:|------------------:|
| 25 | Denmark        | 55938.212809 | 7.6 |
| 17 | Finland        | 47260.800458 | 7.6 |
| 23 | Iceland        | 52279.728851 | 7.5 |
| 24 | Netherlands    | 54209.563836 | 7.4 |
| 16 | Canada         | 45856.625626 | 7.4 |
| 15 | New Zealand    | 42404.393738 | 7.3 |
| 20 | Sweden         | 50683.323510 | 7.3 |
| 19 | Australia      | 48697.837028 | 7.3 |
| 11 | Israel         | 38341.307570 | 7.2 |
| 22 | Austria        | 51935.603862 | 7.1 |
| 21 | Germany        | 50922.358023 | 7.0 |
| 26 | United States  | 60235.728492 | 6.9 |
| 18 | Belgium        | 48210.033111 | 6.9 |
| 13 | United Kingdom | 41627.129269 | 6.8 |
| 14 | France         | 42025.617373 | 6.5 |
| 8  | Spain          | 36215.447591 | 6.3 |
| 6  | Poland         | 32238.157259 | 6.1 |
| 12 | Italy          | 38992.148381 | 6.0 |
| 10 | Lithuania      | 36732.034744 | 5.9 |
| 9  | Slovenia       | 36547.738956 | 5.9 |
| 3  | Latvia         | 29932.493910 | 5.9 |
| 0  | Russia         | 26456.387938 | 5.8 |
| 7  | Estonia        | 35638.421351 | 5.7 |
| 4  | Hungary        | 31007.768407 | 5.6 |
| 2  | Turkey         | 28384.987785 | 5.5 |
| 1  | Greece         | 27287.083401 | 5.4 |
| 5  | Portugal       | 32181.154537 | 5.4 |
```python
X = lifesat[["GDP per capita (USD)"]].values
y = lifesat[["Life satisfaction"]].values
X.shape, y.shape
```
    ((27, 1), (27, 1))
```python
lifesat.describe()
```
|       | GDP per capita (USD) | Life satisfaction |
|-------|---------------------:|------------------:|
| count | 27.000000    | 27.000000 |
| mean  | 41564.521771 | 6.566667  |
| std   | 9631.452319  | 0.765607  |
| min   | 26456.387938 | 5.400000  |
| 25%   | 33938.289305 | 5.900000  |
| 50%   | 41627.129269 | 6.800000  |
| 75%   | 49690.580269 | 7.300000  |
| max   | 60235.728492 | 7.600000  |
```python
xmin, xmax = 23500, 62500
ymin, ymax = 4, 9

markdown = f"""
These are the boundaries of the axes we are going to use:

$
xmin, \ xmax = {xmin}, \ {xmax}
$
\\
$
ymin, \ ymax = {ymin}, \ {ymax}
$
"""

display(Markdown(markdown))
```

These are the boundaries of the axes we are going to use:

$ xmin, \ xmax = 23500, \ 62500 $
$ ymin, \ ymax = 4, \ 9 $

If both GDP per capita and Life satisfaction follow a normal distribution, it makes sense to compute the Pearson correlation between them. We can check whether a feature follows a Gaussian distribution by plotting its distribution or by running a statistical test.

```python
fig, axes = plt.subplots(1, 2, figsize=(9, 3))

sns.kdeplot(X.ravel(), ax=axes[0])
axes[0].set_title("GDP per capita (USD)")

sns.kdeplot(y.ravel(), ax=axes[1])
axes[1].set_title("Life satisfaction")

plt.tight_layout()
plt.show()
```

*Figure: kernel density estimates of GDP per capita (USD) and Life satisfaction.*

$H_0$, the null hypothesis of `normaltest`, assumes that the sample comes from a normal distribution. A typical criterion for rejecting the null hypothesis is a p-value smaller than 0.05, i.e., $p < 0.05$.
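
To make that decision rule explicit, here is a minimal sketch (the `is_normal` helper is not part of the original code) that wraps `normaltest` and compares the p-value against a significance level:

```python
def is_normal(sample, alpha=0.05):
    # Fail to reject H0 when p >= alpha: the sample is compatible
    # with a normal distribution at the given significance level.
    _, p = normaltest(np.ravel(sample))
    return p >= alpha
```

With `alpha=0.05`, `is_normal(X)` and `is_normal(y)` reproduce the conclusions drawn below.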

```python
print("GDP per capita (USD)")
normaltest(X)
```
    GDP per capita (USD)
    NormaltestResult(statistic=array([3.26038932]), pvalue=array([0.19589144]))
```python
print("Life satisfaction")
normaltest(y)
```
    Life satisfaction
    NormaltestResult(statistic=array([14.71794254]), pvalue=array([0.00063685]))

From the visual inspection and the statistical normality test we can conclude that only GDP per capita follows a normal distribution. This tells us that computing the Pearson correlation and using a linear regressor is probably not the best approach for drawing generalizations from this data. Even so, the linear correlation points out that there is a strong linear relationship between the independent and the dependent variable.

```python
corr = lifesat[["GDP per capita (USD)", "Life satisfaction"]].corr()
corr
```
|                      | GDP per capita (USD) | Life satisfaction |
|----------------------|---------------------:|------------------:|
| GDP per capita (USD) | 1.000000 | 0.852796 |
| Life satisfaction    | 0.852796 | 1.000000 |

Modeling with a linear regressor means fitting a line that minimizes the distance to the samples.

$Life \ satisfaction = \theta_{0} + \theta_{1} * GDP$

```python
def life_satisfaction(x, t0, t1):
    return t0 + t1 * x
```
```python
model = LinearRegression()
model.fit(X, y)

t0 = round(model.intercept_[0], 3)
t1 = model.coef_[0][0]

t0, t1
```
    (3.749, 6.778899694341222e-05)
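
As a sanity check (not in the original post), the same parameters can be recovered in closed form with the normal equation, $\theta = (X_b^\top X_b)^{-1} X_b^\top y$, where $X_b$ is $X$ with a prepended bias column of ones:

```python
# Prepend a column of ones so theta[0] plays the role of the intercept.
X_b = np.c_[np.ones((len(X), 1)), X]
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
theta.ravel()  # should match (model.intercept_, model.coef_)
```
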
```python
markdown = f"""
The linear equation that minimizes the error on the whole dataset is the following:

$Life \ satisfaction = {t0} + {t1} * GDP$
"""

display(Markdown(markdown))
```

The linear equation that minimizes the error on the whole dataset is the following:

$Life \ satisfaction = 3.749 + 6.778899694341222e-05 * GDP$

```python
yhat = life_satisfaction(X, t0, t1)
yhat[:5]
```
    array([[5.542452  ],
           [5.59876401],
           [5.67318985],
           [5.77809374],
           [5.85098552]])
```python
yhat_m = model.predict(X)
yhat_m[:5]
```
    array([[5.54250143],
           [5.59881344],
           [5.67323928],
           [5.77814317],
           [5.85103495]])

The values predicted by the model and by the `life_satisfaction` function are very similar, as you can see by computing the mean absolute error based on the Manhattan distance.

\[error = \frac{1}{n}\sum_{i=1}^{n} | y_i - \hat{y}_i |\]
```python
err = np.sum(np.abs(yhat - yhat_m)) / len(yhat)
err
```
    4.942737690907909e-05
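
This is exactly the mean absolute error, so the same number can be cross-checked (this snippet is not in the original post) with scikit-learn's built-in metric:

```python
from sklearn.metrics import mean_absolute_error

mean_absolute_error(yhat, yhat_m)
```
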
```python
fig, ax = plt.subplots(1, 1, figsize=(6, 4))

sns.lineplot(x=X.ravel(), y=yhat.ravel(), color="#d69cff")
sns.scatterplot(x=X.ravel(), y=y.ravel())

ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
ax.set_xlabel("GDP per capita (USD)")
ax.set_ylabel("Life satisfaction")

plt.show()
```

*Figure: regression line over the full dataset, Life satisfaction vs. GDP per capita (USD).*

As you probably noticed, this model is of limited use: it is nice for explaining the maths, but we make no predictions on unseen data. So let's simulate that we have two datasets, one for training the model and one for actually testing it, and let's see how it performs.

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
    ((21, 1), (6, 1), (21, 1), (6, 1))
```python
fig, ax = plt.subplots(1, 1, figsize=(6, 4))

sns.scatterplot(x=X_train.ravel(), y=y_train.ravel())
sns.scatterplot(x=X_test.ravel(), y=y_test.ravel(), color="orange", marker=r"$\circ$")

ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
ax.set_xlabel("GDP per capita (USD)")
ax.set_ylabel("Life satisfaction")

plt.show()
```

*Figure: training samples (filled) and test samples (hollow orange circles).*

```python
model = LinearRegression()
model.fit(X_train, y_train)

t0 = round(model.intercept_[0], 3)
t1 = model.coef_[0][0]

t0, t1
```
    (3.533, 7.18650303768337e-05)
```python
markdown = f"""
The linear equation that minimizes the error on the training dataset is the following:

$Life \ satisfaction = {t0} + {t1} * GDP$
"""

display(Markdown(markdown))
```

The linear equation that minimizes the error on the training dataset is the following:

$Life \ satisfaction = 3.533 + 7.18650303768337e-05 * GDP$

```python
yhat_train = model.predict(X_train)
yhat_train[:5]
```
    array([[6.8281986 ],
           [6.92910967],
           [6.33488274],
           [7.42848276],
           [5.49369788]])
```python
fig, ax = plt.subplots(1, 1, figsize=(6, 4))

sns.scatterplot(x=X_test.ravel(), y=y_test.ravel(), color="orange", marker=r"$\circ$")
sns.scatterplot(x=X_train.ravel(), y=y_train.ravel())
sns.lineplot(x=X_train.ravel(), y=yhat_train.ravel(), color="#d69cff")

ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
ax.set_xlabel("GDP per capita (USD)")
ax.set_ylabel("Life satisfaction")

plt.show()
```

*Figure: regression line fitted on the training samples, with test samples shown as hollow orange circles.*

```python
yhat_test = model.predict(X_test)
yhat_test[:5]
```
    array([[6.13533505],
           [6.52424572],
           [6.15921518],
           [7.19224761],
           [5.43399993]])
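
To put a number on the performance on unseen data (a short sketch, not part of the original post), we can compute the mean absolute error and the coefficient of determination $R^2$ on the held-out test samples:

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Compare the real test targets against the model's predictions.
mae = mean_absolute_error(y_test, yhat_test)
r2 = r2_score(y_test, yhat_test)
mae, r2
```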

In the following figure we can see the predicted values, projected onto the regression line, versus the real values given by the test dataset.

```python
fig, ax = plt.subplots(1, 1, figsize=(6, 4))

sns.lineplot(x=X_train.ravel(), y=yhat_train.ravel(), color="#d69cff")
sns.scatterplot(x=X_train.ravel(), y=y_train.ravel())

# Draw a vertical segment between each real test value and its prediction
for i, x in enumerate(y_test):
    stack_y = np.vstack((x, yhat_test[i]))
    stack_x = np.repeat(X_test[i], len(stack_y))

    sns.lineplot(x=stack_x.ravel(), y=stack_y.ravel(), color="orange")

sns.scatterplot(x=X_test.ravel(), y=y_test.ravel(), color="orange", marker=r"$\circ$")
sns.scatterplot(x=X_test.ravel(), y=yhat_test.ravel(), color="orange")

ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
ax.set_xlabel("GDP per capita (USD)")
ax.set_ylabel("Life satisfaction")

plt.show()
```

*Figure: predicted test values projected onto the regression line, connected to the real test values by orange segments.*

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.scatterplot(x=X.ravel(), y=y.ravel(), ax=axes[0])
sns.lineplot(x=X.ravel(), y=yhat.ravel(), color="#d69cff", ax=axes[0])

sns.lineplot(x=X_train.ravel(), y=yhat_train.ravel(), color="#d69cff", ax=axes[1])
sns.scatterplot(x=X_train.ravel(), y=y_train.ravel(), ax=axes[1])

# Draw a vertical segment between each real test value and its prediction
for i, x in enumerate(y_test):
    stack_y = np.vstack((x, yhat_test[i]))
    stack_x = np.repeat(X_test[i], len(stack_y))
    sns.lineplot(x=stack_x.ravel(), y=stack_y.ravel(), color="orange", ax=axes[1])

sns.scatterplot(x=X_test.ravel(), y=y_test.ravel(), color="orange", marker=r"$\circ$", ax=axes[1])
sns.scatterplot(x=X_test.ravel(), y=yhat_test.ravel(), color="orange", ax=axes[1])

for ax in axes:
    ax.set_xlim([xmin, xmax])
    ax.set_ylim([ymin, ymax])
    ax.set_xlabel("GDP per capita (USD)")
    ax.set_ylabel("Life satisfaction")

axes[0].set_title("Linear model with the whole dataset")
axes[1].set_title("Linear model with train and test dataset")

plt.show()
```

*Figure: linear model fitted on the whole dataset (left) vs. the model fitted on the training dataset with test errors highlighted (right).*

This post is licensed under CC BY 4.0 by the author.