Blog Post 1 - Data Visualization with Matplotlib

In this blog post, we will explore Palmer’s Penguin dataset with multiple visualizations.

§0. Import Data

import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

Let’s take a look at the first few rows of our data.

penguins.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/07	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/07	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	11/16/07	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	11/16/07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	11/16/07	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN

§1. Exploring Data

# shortern the species name
penguins["Species"] = penguins["Species"].str.split().str.get(0)
penguins.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/07	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/07	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	11/16/07	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	11/16/07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	11/16/07	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN

To learn more about the relationships between penguin species and different features, we will write a function to see the different median values of different features among different species.

def penguin_summary_table(group_cols, value_cols):
    return penguins.groupby(group_cols)[value_cols].median().round(2)

penguin_summary_table(["Species", "Sex", "Island"], 
                      ["Culmen Length (mm)", "Body Mass (g)", 
                       "Culmen Depth (mm)", "Flipper Length (mm)"])

			Culmen Length (mm)	Body Mass (g)	Culmen Depth (mm)	Flipper Length (mm)
Species	Sex	Island
Adelie	FEMALE	Biscoe	37.75	3375.0	17.70	187.0
		Dream	36.80	3400.0	17.80	188.0
		Torgersen	37.60	3400.0	17.45	189.0
	MALE	Biscoe	40.80	4000.0	18.90	191.0
		Dream	40.25	3987.5	18.65	190.5
		Torgersen	41.10	4000.0	19.20	195.0
Chinstrap	FEMALE	Dream	46.30	3550.0	17.65	192.0
Chinstrap	MALE	Dream	50.95	3950.0	19.30	200.5
Gentoo	.	Biscoe	44.50	4875.0	15.70	217.0
	FEMALE	Biscoe	45.50	4700.0	14.25	212.0
	MALE	Biscoe	49.50	5500.0	15.70	221.0

As shown in the table, only Adelie penguins live on Torgersen island. On Biscoe island, there are Gentoo and Adelie penguins. On Dream island, there are Chinstrap and Adelie penguins.

With respect to each species, the values for each feature for male and female don’t differ too much.

The culmen length of Adelie is significantly shorter than Chinstrap and Gentoo, and the body mass of Gentoo is significantly larger than Adelie and Chinstrap.

1.1 Inspect individual features with respect to species

Let’s start from creating one histogram to compare the culmen length (mm) among different species. We can do do with hist function in the matplotlib package in python.

from matplotlib import pyplot as plt

for s in penguins["Species"].unique():
    # select the rows with species == s
    df = penguins[penguins["Species"] == s]
    # create histogram
    plt.hist(df["Culmen Length (mm)"], label = s, alpha = 0.5)

# add legend
plt.legend()

# add x-axis
plt.xlabel("Culmen Length (mm)")

# add y-axis
plt.ylabel("Frequency")

Text(0, 0.5, 'Frequency')

single-hist

Great! Now, we can create histograms to visualize how Culmen Length (mm), Body Mass (g), Culmen Depth (mm), and Flipper Length (mm) values differ for each species of penguin in our data set.

fig, ax = plt.subplots(1,4, figsize = (13,3), sharey = True)
ax[0].set(ylabel = "Number of penguins")
features = ["Culmen Length (mm)", "Body Mass (g)", 
            "Culmen Depth (mm)","Flipper Length (mm)"]

for i in range(0,len(features)):
    for s in penguins["Species"].unique():
        df = penguins[penguins["Species"] == s]
        ax[i].hist(df[features[i]], label = s, alpha = 0.3)
        ax[i].set(xlabel = features[i])
        
plt.tight_layout()
plt.legend()

<matplotlib.legend.Legend at 0x26128b416a0>

histo

From the histograms, values of body mass, culmen length, and flipper length for Chinstrap penguins don’t differ much from those of Adelie penguins, but the culmen lengths for Chinstrap and Gentoo penguins are significantly different from those of Adelie penguins.

1.2 Inspect correlations between features

Now, we want to see if there’s some relationship between features. We can do so with the help of scatterplots. We can do do with scatter function in the matplotlib package in python.

First, let’s create a single scatterplot to observe the relationship between culmen length and culmen depth.

for s in penguins["Species"].unique():
    # select the rows with species == s
    df = penguins[penguins["Species"] == s]
    # create scatter
    plt.scatter(df["Culmen Length (mm)"], df["Culmen Depth (mm)"], label = s)

# add legend
plt.legend()

# add x-axis
plt.xlabel("Culmen Length (mm)")

# add y-axis
plt.ylabel("Culmen Depth (mm)")

Text(0, 0.5, 'Culmen Depth (mm)')

single-scatter

Now, we can create multiple scatterplots.

x = "Culmen Length (mm)"
y = ["Body Mass (g)", "Culmen Depth (mm)","Flipper Length (mm)"]
marker = {"Adelie"   : ".",
          "Chinstrap": "^",
          "Gentoo"   : "*"}
fig, ax = plt.subplots(1,3, figsize = (12,4))
for i in range(3):
    for s in penguins["Species"].unique():
        df = penguins[penguins["Species"] == s]
        ax[i].scatter(df[x], df[y[i]], label = s, marker = marker[s])
        ax[i].set(xlabel = x, ylabel = y[i])

plt.tight_layout()
plt.legend()

<matplotlib.legend.Legend at 0x2612adadf70>

scatterplot

As we can see from the scatterplots, for Adelie, both body mass and flipper length are positively correlated with culmen length. For Chinstrap and Gentoo, all of body mass, culmen depth, and flipper length are positively correlated with culmen length.

Written on April 4, 2021