Blog Post 1 - Data Visualization with Matplotlib

In this blog post, we will explore the Palmer Penguins dataset through several visualizations.

§0. Import Data

import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

Let’s take a look at the first few rows of our data.

penguins.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
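The preview already shows missing values: row 3 has no measurements at all. Below is a minimal sketch of how one might count such gaps before plotting. The `sample` frame is a hypothetical stand-in built from a few values in rows 0, 1, and 3 above, not the full dataset; with the real data, the same calls would be applied to `penguins`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `penguins`: a few columns from rows 0, 1 and 3 of the preview.
sample = pd.DataFrame({
    "Species": ["Adelie Penguin (Pygoscelis adeliae)"] * 3,
    "Culmen Length (mm)": [39.1, 39.5, np.nan],
    "Body Mass (g)": [3750.0, 3800.0, np.nan],
    "Sex": ["MALE", "FEMALE", np.nan],
})

# Number of missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(len(rows_with_gaps))
```

Matplotlib's `hist` and `scatter` generally cope with `NaN` entries, but knowing how many rows are incomplete helps when reading the plots later.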

§1. Exploring Data

# shorten the species name to its first word
penguins["Species"] = penguins["Species"].str.split().str.get(0)
penguins.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
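To see what the shortening step does, here is a small sketch on a stand-alone Series. The Adelie string is copied from the preview above; the Chinstrap and Gentoo strings are assumed forms that do not appear in the preview.

```python
import pandas as pd

# The first entry comes from the preview; the other two are assumed for illustration.
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)",
])

# .str.split() turns each string into a list of whitespace-separated words,
# and .str.get(0) picks the first word of each list.
words = species.str.split()
short = words.str.get(0)
print(short.tolist())
```

Because only the first word is kept, this works as long as the first word alone identifies the species.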

To learn more about how the features relate to penguin species, we will write a function that computes the median of each chosen feature within each group.

def penguin_summary_table(group_cols, value_cols):
    """Return the median of value_cols for each group in group_cols, rounded to 2 decimals."""
    return penguins.groupby(group_cols)[value_cols].median().round(2)
penguin_summary_table(["Species", "Sex", "Island"], 
                      ["Culmen Length (mm)", "Body Mass (g)", 
                       "Culmen Depth (mm)", "Flipper Length (mm)"])
                            Culmen Length (mm)  Body Mass (g)  Culmen Depth (mm)  Flipper Length (mm)
Species   Sex    Island
Adelie    FEMALE Biscoe                  37.75         3375.0              17.70                187.0
                 Dream                   36.80         3400.0              17.80                188.0
                 Torgersen               37.60         3400.0              17.45                189.0
          MALE   Biscoe                  40.80         4000.0              18.90                191.0
                 Dream                   40.25         3987.5              18.65                190.5
                 Torgersen               41.10         4000.0              19.20                195.0
Chinstrap FEMALE Dream                   46.30         3550.0              17.65                192.0
          MALE   Dream                   50.95         3950.0              19.30                200.5
Gentoo    .      Biscoe                  44.50         4875.0              15.70                217.0
          FEMALE Biscoe                  45.50         4700.0              14.25                212.0
          MALE   Biscoe                  49.50         5500.0              15.70                221.0

As shown in the table, only Adelie penguins live on Torgersen island. On Biscoe island, there are Gentoo and Adelie penguins. On Dream island, there are Chinstrap and Adelie penguins.
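The island observations can also be checked directly rather than read off the summary table. Here is a minimal sketch using `pd.crosstab`; the `toy` frame is a hypothetical five-row example that mirrors the species and island combinations in the table, not the real data.

```python
import pandas as pd

# Hypothetical rows covering each species/island combination seen in the table above.
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo"],
    "Island": ["Torgersen", "Biscoe", "Dream", "Dream", "Biscoe"],
})

# Count how many rows fall into each (species, island) pair; absent pairs show as 0.
counts = pd.crosstab(toy["Species"], toy["Island"])
print(counts)

# Alternatively, list the distinct species recorded on each island.
species_per_island = toy.groupby("Island")["Species"].unique()
print(species_per_island)
```

Applied to `penguins`, the same two calls would give the number of sampled penguins per species on each island.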

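Within each species, males have larger median values than females for all four features. The gap is modest for culmen depth and flipper length but more pronounced for body mass (for example, 5500 g versus 4700 g for Gentoo).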

The median culmen length of Adelie penguins is noticeably shorter than that of Chinstrap and Gentoo penguins, and the median body mass of Gentoo penguins is noticeably larger than that of Adelie and Chinstrap penguins.

1.1 Inspect individual features with respect to species

Let’s start by creating a single histogram to compare the culmen length (mm) among the species. We can do so with the hist function from Python's matplotlib package.

from matplotlib import pyplot as plt
for s in penguins["Species"].unique():
    # select the rows with species == s
    df = penguins[penguins["Species"] == s]
    # create histogram
    plt.hist(df["Culmen Length (mm)"], label = s, alpha = 0.5)

# add legend
plt.legend()

# add x-axis
plt.xlabel("Culmen Length (mm)")

# add y-axis
plt.ylabel("Frequency")

Text(0, 0.5, 'Frequency')

single-hist
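One detail worth knowing about overlaid histograms: by default, each `hist` call picks its own 10 bins over the range of the data it receives, so bars from different species may have different widths. A possible variation, sketched here with a small made-up `toy` frame (hypothetical values, not the real dataset), is to compute one set of bin edges and pass it to every call.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "Species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "Culmen Length (mm)": [36.7, 39.1, 39.5, 40.3, 44.5, 45.5, 47.0, 49.5],
})

# One shared set of edges: 11 edges give 10 equal-width bins over the full range.
values = toy["Culmen Length (mm)"].dropna()
bins = np.linspace(values.min(), values.max(), 11)

fig, ax = plt.subplots()
for s in toy["Species"].unique():
    subset = toy[toy["Species"] == s]
    # Passing the same `bins` to every call keeps the bars aligned across species.
    ax.hist(subset["Culmen Length (mm)"], bins=bins, label=s, alpha=0.5)

ax.set(xlabel="Culmen Length (mm)", ylabel="Frequency")
ax.legend()
```

With shared bins, bar heights of different species are directly comparable because every bar covers the same interval width.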

Great! Now, we can create histograms to visualize how Culmen Length (mm), Body Mass (g), Culmen Depth (mm), and Flipper Length (mm) values differ for each species of penguin in our data set.

fig, ax = plt.subplots(1,4, figsize = (13,3), sharey = True)
ax[0].set(ylabel = "Number of penguins")
features = ["Culmen Length (mm)", "Body Mass (g)", 
            "Culmen Depth (mm)","Flipper Length (mm)"]

for i in range(0,len(features)):
    for s in penguins["Species"].unique():
        df = penguins[penguins["Species"] == s]
        ax[i].hist(df[features[i]], label = s, alpha = 0.3)
        ax[i].set(xlabel = features[i])
        
plt.tight_layout()
plt.legend()
<matplotlib.legend.Legend at 0x26128b416a0>

histo
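Note that `plt.legend()` attaches the legend to the most recently used axes, which here is the last panel. If a single legend for the whole figure is preferred, one option is `fig.legend`, sketched below on a small made-up `toy` frame (hypothetical values) with two panels instead of four.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "Species": ["Adelie"] * 3 + ["Gentoo"] * 3,
    "Culmen Length (mm)": [39.1, 39.5, 40.3, 44.5, 45.5, 49.5],
    "Body Mass (g)": [3750.0, 3800.0, 3250.0, 4875.0, 4700.0, 5500.0],
})
features = ["Culmen Length (mm)", "Body Mass (g)"]

fig, ax = plt.subplots(1, len(features), figsize=(7, 3), sharey=True)
ax[0].set(ylabel="Number of penguins")
for axis, feature in zip(ax, features):
    for s in toy["Species"].unique():
        subset = toy[toy["Species"] == s]
        axis.hist(subset[feature], label=s, alpha=0.3)
    axis.set(xlabel=feature)

# Every panel uses the same species labels, so the handles from the first panel suffice.
handles, labels = ax[0].get_legend_handles_labels()
fig.legend(handles, labels, loc="upper right")
fig.tight_layout()
```

Iterating with `zip(ax, features)` also avoids indexing both lists by position, though the index-based loop above works just as well.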

From the histograms, the values of body mass, culmen depth, and flipper length for Chinstrap penguins don’t differ much from those of Adelie penguins, but the culmen lengths of Chinstrap and Gentoo penguins are clearly different from those of Adelie penguins.

1.2 Inspect correlations between features

Now, we want to see whether there are relationships between features. We can do so with scatterplots, using the scatter function from Python's matplotlib package.

First, let’s create a single scatterplot to observe the relationship between culmen length and culmen depth.

for s in penguins["Species"].unique():
    # select the rows with species == s
    df = penguins[penguins["Species"] == s]
    # create scatter
    plt.scatter(df["Culmen Length (mm)"], df["Culmen Depth (mm)"], label = s)

# add legend
plt.legend()

# add x-axis
plt.xlabel("Culmen Length (mm)")

# add y-axis
plt.ylabel("Culmen Depth (mm)")

Text(0, 0.5, 'Culmen Depth (mm)')

single-scatter

Now, we can create multiple scatterplots.

x = "Culmen Length (mm)"
y = ["Body Mass (g)", "Culmen Depth (mm)","Flipper Length (mm)"]
marker = {"Adelie"   : ".",
          "Chinstrap": "^",
          "Gentoo"   : "*"}
fig, ax = plt.subplots(1,3, figsize = (12,4))
for i in range(3):
    for s in penguins["Species"].unique():
        df = penguins[penguins["Species"] == s]
        ax[i].scatter(df[x], df[y[i]], label = s, marker = marker[s])
        ax[i].set(xlabel = x, ylabel = y[i])

plt.tight_layout()
plt.legend()
<matplotlib.legend.Legend at 0x2612adadf70>

scatterplot

As the scatterplots show, for Adelie penguins, both body mass and flipper length appear positively correlated with culmen length. For Chinstrap and Gentoo penguins, body mass, culmen depth, and flipper length all appear positively correlated with culmen length.
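Visual impressions of correlation can be backed up with numbers. The following is a minimal sketch that computes the Pearson correlation between culmen length and body mass within each species. The `toy` frame and the helper `culmen_corr` are hypothetical; the values are made up for illustration and are not results from the real dataset.

```python
import pandas as pd

# Made-up values for illustration; not taken from the penguins data.
toy = pd.DataFrame({
    "Species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "Culmen Length (mm)": [36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 48.0, 50.0],
    "Body Mass (g)": [3300.0, 3500.0, 3800.0, 4000.0, 4600.0, 4900.0, 5200.0, 5600.0],
})

def culmen_corr(group, other="Body Mass (g)"):
    """Pearson correlation between culmen length and another column of `group`."""
    return group["Culmen Length (mm)"].corr(group[other])

# One coefficient per species; values near +1 indicate a strong positive linear relationship.
corr_by_species = {s: culmen_corr(g) for s, g in toy.groupby("Species")}
print(corr_by_species)
```

Computing the coefficient separately for each species matters here: pooling all species together can give a very different picture from the within-species relationship.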

Written on April 4, 2021