Blog Post 2 - Data Query and Interactive Visualization with Plotly

In this blog, we will learn how to create a database and use data query to extract information. We will also learn to make interactive plots using plotly.

§1. Create a Database

First, let’s import important packages.

import pandas as pd
import seaborn as sns 
from matplotlib import pyplot as plt
import numpy as np
import sqlite3
conn = sqlite3.connect("temps.db") # create a database in current directory called temps.db

We will write a function to clean our data before incorporating it into our database.

def prepare_df(df):
    '''
    this function takes a dataframe with months as columns names with temperature as values
    and returns a dataframe with month and temperature as column names 
    '''
    # convert all the columns that we don't want to stack into a multi-index for the data frame
    df = df.set_index(keys=["ID", "Year"])
    # stacking
    df = df.stack()
    # recover ID and Year columns
    df = df.reset_index()
    # rename columns to make them more readable
    df = df.rename(columns = {"level_2"  : "Month" , 0 : "Temp"})
    # extract the integer part to be month value
    df["Month"] = df["Month"].str[5:].astype(int)
    df["Temp"]  = df["Temp"] / 100
    # this will be used later for joining
    df["FIPS 10-4"] = df["ID"].str[0:2]
    return(df)

Now, let’s add data to our database!

# temperature table 
df_iter = pd.read_csv("temps.csv", chunksize = 100000)
for df in df_iter:
    df = prepare_df(df)
    df.to_sql("temperatures", conn, if_exists = "append", index = False)
C:\Users\35132\anaconda3\envs\PIC16B\lib\site-packages\pandas\core\generic.py:2779: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  sql.to_sql(
df.head(5)
ID Year Month Temp FIPS 10-4
0 USW00014924 2016 1 -13.69 US
1 USW00014924 2016 2 -8.40 US
2 USW00014924 2016 3 -0.20 US
3 USW00014924 2016 4 3.21 US
4 USW00014924 2016 5 13.85 US
# stations table
station_url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(station_url)
stations.to_sql("stations", conn, if_exists = "replace", index = False)
stations.head(5)
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR
# countries table 
country_url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(country_url)
countries.to_sql("countries", conn, if_exists = "replace", index = False)
C:\Users\35132\anaconda3\envs\PIC16B\lib\site-packages\pandas\core\generic.py:2779: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  sql.to_sql(
countries.head(5)
FIPS 10-4 ISO 3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa

Let’s check if we’ve correctly added data to our database with cursor.

cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]

Great! We’ve correctly added three tables to our database. It’s also a good idea to check what’s in each table.

cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")

for result in cursor.fetchall():
    print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
  "Year" INTEGER,
  "Month" INTEGER,
  "Temp" REAL,
  "FIPS 10-4" TEXT
)
CREATE TABLE "stations" (
"ID" TEXT,
  "LATITUDE" REAL,
  "LONGITUDE" REAL,
  "STNELEV" REAL,
  "NAME" TEXT
)
CREATE TABLE "countries" (
"FIPS 10-4" TEXT,
  "ISO 3166" TEXT,
  "Name" TEXT
)
# close the database connection
conn.close()

§2. Write a Query Function

conn = sqlite3.connect("temps.db")

The SQL syntax is based on a similar idea as pandas.

  • SELECT, like the syntax [], controls which column(s) will be returned.
  • FROM tells us which table to return columns from.
  • WHERE is like the Boolean index [temperatures["year"] == 1990]. Only rows in which this criterion is satisfied will be returned.
def query_climate_database(country, year_begin, year_end, month):
    '''
    this function gives the temperature of a country within specified year range in the specific month.
    inputs: 
    country, a string giving the name of a country for which data should be returned.
    year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
    month, an integer giving the month of the year for which should be returned.
    ouput:
    a Pandas dataframe of temperature readings for a country within specified year range in the specific month
    '''
    
    cmd = \
    """
    SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON T.'fips 10-4'= C.'fips 10-4'
    WHERE T.year BETWEEN ? AND ?
    AND T.month = ?
    AND C.name = ?
    """
    
    df = pd.read_sql_query(cmd, conn, params = (year_begin, year_end, month, country))
    # rename the column of country name to Country
    df.rename(columns={"Name": "Country"},inplace = True)
    return(df)
query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
1 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
2 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
3 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
4 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
12603 DARJEELING 27.050 88.270 India 1983 1 5.10
12604 DARJEELING 27.050 88.270 India 1986 1 6.90
12605 DARJEELING 27.050 88.270 India 1994 1 8.10
12606 DARJEELING 27.050 88.270 India 1995 1 5.60
12607 DARJEELING 27.050 88.270 India 1997 1 5.70

12608 rows × 7 columns

§3. Write a Geographic Scatter Function for Yearly Temperature Increases

In this section, we will use package plotly from module plotly to create interative visualizations! We want to explore how the average yearly change in temperature vary within a given country.

from plotly import express as px
import calendar #convert number to month name 

To quantify average yearly change, we will compute the first coefficient of a linear regression model at that station.

from sklearn.linear_model import LinearRegression

def coef(data_group):
    '''
    this function computes the first coefficient of the linear model Year vs. Temp
    '''
    x = data_group[["Year"]] # 2 brackets because X should be a df
    y = data_group["Temp"]   # 1 bracket because y should be a series
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]

Now, we might want to have different colors for climate stations according to average yearly change in temperature, which can be computed using the function we just wrote above. A good idea is to use geographic scatterplot.

def temperature_coefficient_plot(country,year_begin, year_end, month, min_obs, **kwargs):
    '''
    this functions generates an interactive geographic scatterplot to visualize yearly average change in temperature 
    in a specific month in a given country during specified years.
    Inputs:
    country, a string giving the name of a country for which data should be returned.
    year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
    month, an integer giving the month of the year for which should be returned.
    min_obs: the minimum required number of years of data for any given station.
    output:
    a geographic scatterplot
    '''
    
    # create a dataframe containing all the information needed
    df = query_climate_database(country, year_begin, year_end, month)
    # count the number of years of data for each temperature station
    df['freq'] = df.groupby('NAME')['NAME'].transform('count')
    # only keep data for stations with at least min_obs years worth of data  
    df = df[df['freq']>= min_obs]
    c = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef).reset_index()    
    # add a new column to store coefficients round up to 4 decimal places
    z = "Estimated Yearly Increase " + u"(\N{DEGREE SIGN}C)"
    c[z] = c[0].round(decimals = 4)
    
    title = "Estimte of yearly increase in temperature in "+ list(calendar.month_name)[month] +" <br>for stations in "+ country +" , years "+ str(year_begin)+" - "+str(year_end)
    fig = px.scatter_mapbox(c, 
                        lat = "LATITUDE",
                        lon = "LONGITUDE", 
                        hover_name = "NAME",
                        color = z,
                        title = title,
                        **kwargs)
    return(fig)
color_map = px.colors.diverging.RdGy_r # choose a colormap

fig1 = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_midpoint = 0, # 0 corresponds to the center of colorbar
                                   color_continuous_scale=color_map)

fig1.show()
from plotly.io import write_html
write_html(fig1, "geo_scatter.html")

The colors get brighter when the temperature increases more. To learn about the information for a temperature station, we can simply put our cursor on the dot representing that station, then we will be able to read its latitude, longitude, and estimated yearly increase in Celcius.

§4. Create Two More Interesting Figures

Is there a relationship between the elevation of a temperature station and its yearly increase in temperature? To explore this question, it’s a good idea to control variables. To be specific, we want to control latitude and month (we assume temperature change is not closely related to longitude). As we did above, we will first write a query function to create the dataframe desired.

def elevation_query(latitude_min, latitude_max, month):
    '''
    this function gives the temperature of countries within specified latitude range in the specific month.
    inputs: 
    latitude_min and latitude_max, two doubles giving the smallest and largest latitude.
    month, an integer giving the month of the year for which should be returned.
    ouput:
    a Pandas dataframe of temperature readings for countrys within specified latitude range in the specific month
    '''
    
    cmd = \
    """
    SELECT S.name, T.year, T.month, T.temp, S.STNELEV, C.name
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON T.'fips 10-4'= C.'fips 10-4'
    WHERE S.latitude BETWEEN ? AND ?
    AND T.month = ?
    """

    df = pd.read_sql_query(cmd, conn, params = (latitude_min, latitude_max, month))
    # rename the column containing country names to avoid confusion
    df.rename(columns={"Name": "Country"},inplace = True)
    return(df)

scatterplot can help us visualize correlations easily.We want to examine whether there’s an correlation between elevation and yearly change in temperature within each country when latitude and month is constant.

def elevation_temp_scatter(latitude_min, latitude_max, month, min_obs, **kwargs):
    '''
    this functions generates an interactive scatterplot to visualize yearly average change in temperature v.s.elevation
    in a specific month for countries within specified latitude range.
    Inputs:
    latitude_min and latitude_max, two doubles giving the smallest and largest latitude.
    month, an integer giving the month of the year for which should be returned.
    min_obs: the minimum required number of stations for any given country.
    output:
    a scatterplot
    '''
    
    # create a dataframe containing all the information needed
    df = elevation_query(latitude_min, latitude_max, month)
    # obtain the yearly change in temeprature
    c = df.groupby(["NAME","STNELEV", "Country"]).apply(coef).reset_index()
    # rename columns for convenience
    z = "Yearly change in Temperatures " + u"(\N{DEGREE SIGN}C)"
    c.rename(columns = {"NAME": "Station", 0:z, "STNELEV": "Elevation"},inplace = True)
    # compute the number of different stations within each country and store the info as a new col
    c['freq'] = c.groupby('Country')['Country'].transform('count')
    # only keep data for countries with at least min_obs many stations 
    c = c[c['freq']>= min_obs] 
    # name for y_axis
    y_axis = "Yearly change in Temperatures " + u"(\N{DEGREE SIGN}C) between latitude "+ str(latitude_min)+" and "+str(latitude_max)+ " in "+ list(calendar.month_name)[month]

    fig = px.scatter(data_frame = c, 
                     x = "Elevation", 
                     y = z,
                     labels = {"Elevation" : "Elevation",
                               z : y_axis,
                               'Country' : 'Countries'},
                     hover_name = "Station",
                     hover_data = ["Elevation", z],
                     color = "Country",
                     **kwargs)
    
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    return(fig)
fig2 = elevation_temp_scatter(10,15,1, min_obs = 10)
fig2.show()
write_html(fig2, "scatter.html")

This graph could look overwhelming at the first sight, and it’s hard to tell any patterns when all colors of dots are kind of just clustered. Don’t worry! If you double click on the country you are interested in on the lengend, the plot will immediately give you only the dots that are under that country. For example, if we double click on Colombia, we can see that there is no obvious pattern between elevation and yearly change in temperature. This conclusion actually applies to almost all the countries for our selection. Therefore, further inquery is needed to figure out the relationship between elevation and yearly temperature change.

In the next visualization, I want to explore if there’s difference in yearly change in temperature between two countries that are pretty close to each other.

def country_query(country1, country2):
    '''
    this function gives the temperature of two chosen countries
    inputs: 
    country 1 and country2: two strings that gives the names of two different countries
    ouput:
    a Pandas dataframe of temperature readings for the two chosen countrys
    '''
    
    cmd = \
    """
    SELECT S.name, T.year, T.month, T.temp, C.name
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON T.'fips 10-4'= C.'fips 10-4'
    WHERE C.name IN (?,?)    
    """

    df = pd.read_sql_query(cmd, conn, params = (country1, country2))
    # rename the column containing country names to avoid confusion
    df.rename(columns={"Name": "Country"},inplace = True)
    return(df)

Boxplot presents a nice summary of our measurement of interest. With boxplot, we can easily compare the median, max, min value of the measurement between two countries.

def country_comparison(country1, country2, **kwargs):
    '''
    this functions generates an interactive boxplot to visualize yearly average change in temperature 
    in two given countries for each month.
    Inputs:
    country 1 and country2: two strings that gives the names of two different countries
    output:
    a boxplot
    '''
    df = country_query(country1, country2)
    # yearly change in temperature over years for each month
    c = df.groupby(["NAME", "Month", "Country"]).apply(coef).reset_index()
    z = "Yearly change in Temperatures " + u"(\N{DEGREE SIGN}C)"
    c.rename(columns = {"NAME": "Station", 0: z},inplace = True)
    fig = px.box(c, 
                 x = "Month", 
                 y = z,
                 color = "Country")
    return(fig)
fig3 = country_comparison("India", "Afghanistan")
fig3.show()
write_html(fig3, "boxplot.html")

From the plot, we can easily see that Idian’s temperature is quite stable throughout the whole year for each year compared with Afghanistan.

conn.close()
Written on April 12, 2021