Exploring NOAA Climate Data: SQL Queries, Trend Analysis, and Advanced Visualizations with Plotly

SQL
Plotly
Pandas
NumPy
Author

Emil Blaignan

Published

February 1, 2025

3D Scatter Plot of Monthly Average Temperature by Latitude for Australia

Introduction

The NOAA Climate Data provides an extensive record of global temperature trends, collected from thousands of weather stations across various geographic regions. This dataset is important for analyzing long-term climate patterns, regional warming trends, and the impacts of climate change. In this blog, we will explore how to construct and query an SQL database to efficiently retrieve climate data, apply filtering techniques, and compute meaningful temperature trends. Additionally, we will leverage Plotly to create complex visualizations, including scatter map boxes, 3D plots, and faceted box plots, to uncover insights into global and regional temperature changes over time.

Creating an SQL Database

Our first objective is to create an SQL database containing three data frames for temperatures, stations, and countries. We will begin with the temperature data as it is the most difficult.

Downloading NOAA Temperature Data

We start by creating a folder, titled “datafiles,” in the same location as our ipynb to store our CSV files:

import os
# create folder named "datafiles" if it does not exist
if not os.path.exists("datafiles"): 
    os.mkdir("datafiles")

For our purpose, we will load climate data from 1901-2020. We first need to download the raw data from the GitHub URL containing the NOAA climate data. The dataset is split into decadal CSV files (e.g., 1901-1910.csv, 1911-1920.csv, etc). The script constructs URLs and downloads each file (saving it to datafiles/ folder):

import urllib.request
# List comprehension to construct decadal files
intervals = [f"{i}-{i+9}" for i in range(1901, 2020, 10)] # 1901-1910 to 2011-2020
# Cycles through and downloads each decade associated CSV
for interval in intervals:
    url = f"https://raw.githubusercontent.com/PIC16B-ucla/24F/main/datasets/noaa-ghcn/decades/{interval}.csv"
    urllib.request.urlretrieve(url, f"datafiles/{interval}.csv")

Once all our CSVs are downloaded into the datafiles folder, we are ready to create the database.

Creating Our SQL Database

We first want to import the necessary packages:

import sqlite3
import pandas as pd
import numpy as np

Next, we will establish the database titled temps.db:

# Establish a connection or (in this case) to create the SQLite database named "temps.db"
conn = sqlite3.connect("temps.db")

Before inserting the temperatures df into our database we need to write a function that will properly format the data frame. The raw NOAA dataset is stored in a wide format, where each row contains temperature readings for all 12 months of a given year. However, for effective querying and visualization, we need a long format where each row represents a single temperature reading for a specific month and year.

We can write the following function that melts the data into long-format:

def prepare_df(df):
    """
    prepares a piece of wide format dataframe into a long format data frame
    """
    # "Melt" converts monthly temperature columns into a single column
    df = df.melt(
        # Keep Station ID and Year as identifiers
        id_vars=["ID", "Year"],
        # Select VALUE1 to VALUE12 (monthly temperature columns)
        value_vars=[f"VALUE{i}" for i in range(1, 13)],
        var_name="Month",  # Column for month names
        value_name="Temp"  # Column for temperature values
    )
    # Clean the Month column (remove "VALUE" prefix and convert to integer)
    df["Month"] = df["Month"].str[5:].astype(int)
    # Convert temperature values to Celsius
    df["Temp"] = df["Temp"] / 100  
    # Remove NaN values (invalid temperature readings)
    df = df[~np.isnan(df["Temp"])]

    return df

Now that we have a callable function that will prepare our decadal temperature data frames, we are ready to add them to our temps database under a single table temperatures.

The code below adds all data to the database (from individual CSV files):

# Once again we define our intervals with list-comprehension
intervals = [f"{i}-{i+9}" for i in range(1901, 2020, 10)]
# For-loop interates every decadal CSV file
for i, interval in enumerate(intervals):
    filepath = f"datafiles/{interval}.csv"
    # Reads in CSV file as pandas df
    df = pd.read_csv(filepath)
    # Calls function we defined above, converting each decadal df into long-format 
    df = prepare_df(df)
    # Store data in the SQLite database 
    df.to_sql("temperatures", conn, 
              if_exists = "replace" if i == 0 else "append", index = False)

Thus, our temps database has one table temperatures with all our decade intervals.

We continue by adding the stations’ data next. The station data is much simpler, containing only one CSV file. Therefore, all we need to do is read the CSV as a pandas data frame, then add the data frame into our temps database as follows:

# Stores link to CSV data
filename = "https://raw.githubusercontent.com/PIC16B-ucla/25W/refs/heads/main/datasets/noaa-ghcn/station-metadata.csv"
# Reads in CSV
stations = pd.read_csv(filename)
# Adds the pandas df into the temps database as "stations"
stations.to_sql("stations", conn, if_exists = "replace", index=False)
27585

We can repeat this process with the country data:

# Stores link to CSV data
filename = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
# Reads in CSV
countries = pd.read_csv(filename)
# Adds the pandas df into the temps database as "countries"
countries.to_sql("countries", conn, if_exists = "replace", index=False)
279

At this point, we should have successfully added all three data frames to the temps database. To make sure that we did, we can run the following to see the contents of our database:

cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]

As we can see, we have three tables: temperatures, stations, and countries so we are ready to proceed.

Writing a Reusable SQL Query Function

Now that our climate data is stored in an SQLite database, we need a way to retrieve specific temperature records based on user-defined filters. Instead of writing SQL queries manually every time, we’ll create a function that allows us to query data dynamically.

In a seperate python file, climate_database.py, we will write a function called query_climate_database() which accepts five arguments:

  • db_file, the file name for the database.
  • country, a string giving the name of a country for which data should be returned.
  • year_begin and year_end, two integers giving the earliest and latest years for which should be returned (inclusive).
  • month, an integer giving the month of the year for which should be returned.

After our function is written in climate_database.py, we may import the function and check if it is visible as follows:

from climate_database import query_climate_database
import inspect
print(inspect.getsource(query_climate_database))
def query_climate_database(db_file, country, year_begin, year_end, month):
    """
    Query the climate database for temperature readings in a specified country, date range, and month.

    Parameters:
    - db_file (str): The file name of the SQLite database (i.e. temps.db).
    - country (str): Target country.
    - year_begin (int): Start year.
    - year_end (int): End year.
    - month (int): The month of the year for which data should be returned (1-12).

    Returns:
    - pd.DataFrame: A data frame containing the 
    station name, latitude, longitude, country, year, month, and temperature.
    """
    # Connect to the database
    conn = sqlite3.connect(db_file)
    
    # Writes the SQL query using f-strings
    query = f"""
        SELECT 
            S.name AS NAME,
            S.latitude AS LATITUDE,
            S.longitude AS LONGITUDE,
            C.name AS Country,
            T.year AS Year,
            T.month AS Month,
            T.temp AS Temp
        FROM 
            temperatures T
        JOIN 
            stations S ON T.id = S.id
        JOIN 
            countries C ON SUBSTR(S.id, 1, 2) = C."FIPS 10-4" 
        WHERE 
            C.name = '{country}' AND
            T.year BETWEEN '{year_begin}' AND '{year_end}' AND
            T.month = '{month}'
        ORDER BY
            S.name ASC
    """
    # Matches stations ID from the temps and stations tables
    # Matches using countries: FIPS 10-4 with first two letters of ID from Stations
    # Filters by country name, start/end year, and month
    # Orders the station names in alphabetical order 
    df = pd.read_sql_query(query, conn)
    # Close database
    conn.close()
    
    return df

To check if our query is properly working we can run the following:

query_climate_database(db_file = "temps.db",
                       country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 AGARTALA 23.883 91.250 India 1980 1 18.21
1 AGARTALA 23.883 91.250 India 1981 1 18.25
2 AGARTALA 23.883 91.250 India 1982 1 19.31
3 AGARTALA 23.883 91.250 India 1985 1 19.25
4 AGARTALA 23.883 91.250 India 1988 1 19.54
... ... ... ... ... ... ... ...
3147 VISHAKHAPATNAM 17.717 83.233 India 2016 1 25.09
3148 VISHAKHAPATNAM 17.717 83.233 India 2017 1 23.90
3149 VISHAKHAPATNAM 17.717 83.233 India 2018 1 22.65
3150 VISHAKHAPATNAM 17.717 83.233 India 2019 1 22.20
3151 VISHAKHAPATNAM 17.717 83.233 India 2020 1 23.75

3152 rows × 7 columns

Our query function is working correctly, thus we are ready to visualize the data.

Visualizing the Data Using Plotly

Using the NOAA climate database and query function, we can generate various visualizations with Plotly. Visualizations are often created to address specific research questions and challenges. Therefore, choosing the type of visualization to represent data is critical for interpreting data within the context of specific niches.

Geographic Scatter Plot for Yearly Temperature Increases

The first question we will address is:

  • How does the average yearly change in temperature vary within a given country?

To explore this question, we will use the scatter_mapbox function from Plotly Express. This will allow us to visualize the stations on an interactive map. The interactive data points will reveal how certain stations report higher or lower average yearly temperature changes. We will also wrap the visualization code in a function so that users can generate visualizations for different parts of the data by providing different arguments (e.g., different dates, countries, etc.).

First, let’s import Plotly Express:

from plotly import express as px

We want the function to accept six explicit arguments and an undetermined number of keyword arguments:

  • db_file, country, year_begin, year_end, and month should remain as previously defined.
  • min_obs, the minimum required number of years of data for any given station.
  • **kwargs, additional keyword arguments passed to px.scatter_mapbox(). These can control the colormap used, the mapbox style, etc.

The following function creates the scatter_mapbox according to user inputted specs:

def temperature_coefficient_plot(db_file, country, year_begin, 
                                 year_end, month, min_obs, **kwargs):
    """
    Creates an interactive geographic scatterplot showing yearly 
    temperature change (°C) for stations in a specified country.
    
    Parameters:
    - db_file (str): The file name of the SQLite database (temps.db).
    - country (str): Target country.
    - year_begin (int): Start year.
    - year_end (int): End year.
    - month (int): Target month (1-12).
    - min_obs (int): Minimum years of data required per station.
    - **kwargs: Additional keyword arguments for Plotly.
    
    Returns:
    - plotly.graph_objs._figure.Figure: Interactive map.
    """
    
    # Fetch data
    df = query_climate_database(db_file, country, year_begin, year_end, month)
    
    # Filter stations with observations >= min_obs
    df['count'] = df.groupby(['NAME', 'LATITUDE', 
                              'LONGITUDE'])['Year'].transform('count')
    df_filtered = df[df['count'] >= min_obs]
    
    # Function to calculate yearly temperature change coefficients
    def coeff_func(group):
        years = group['Year'].values
        temps = group['Temp'].values
        return np.polyfit(years, temps, 1)[0]  # Slope = yearly change (°C/year)
    
    # Group data and get coeff using .apply() method
    coeffs = df_filtered.groupby(['NAME', 'LATITUDE', 
                                  'LONGITUDE', 'Country']).apply(coeff_func)
    coeffs = coeffs.reset_index(name='Coefficient').dropna()
    # Makes sure coeff are rounded to a sober number of significant figures
    coeffs = round(coeffs, 4)
    
    # Converts int month (1-12) into associated str month name
    def month_converter(month):
        month_names = [
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"]
        
        return month_names[month - 1]
    
    # Configures plot to meet standarized look
    fig = px.scatter_mapbox(
    coeffs,
    lat="LATITUDE",
    lon="LONGITUDE",
    color="Coefficient",  # Maps the color to the coeff column
    hover_name="NAME",
    labels={
        # Rename Coefficient in hover data
        "Coefficient": "Estimated Yearly Increase (°C)"
    }, **kwargs)
    
    # Continued plot config
    fig.update_layout(
        width=700,
        height=400,
    margin={"r": 140, "t": 60, "l": 0, "b": 0},
    title={ # Title config with f-string to update based on input country and dates
        "text": f"""The Estimated Yearly Increase in Temperature<br>
                    for Stations in {country} in {month_converter(month)}
                    Between {year_begin}-{year_end}.""",
        "x": 0.5,  
        "xanchor": "center",  
        "yanchor": "top"},  
    coloraxis_colorbar={
            "title": "Estimated Yearly Increase (°C)",  # Set the color bar title
            "title_side": "right",
            "x": 1.0},  
    coloraxis={  # Color scale range config
            "cmin": -0.13,  
            "cmax": 0.13})
    
    return fig

After running the cell with our temperature_coefficient_plot() function, we can test it by calling it with the same specs that we fed the query function to create a plot of estimated yearly increases in temperature during the month of January, in the interval 1980-2020, in India, as follows:

# assumes you have imported necessary packages
color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot(db_file = "temps.db",
                                   country = "India", 
                                   year_begin = 1980, 
                                   year_end = 2020, 
                                   month = 1, 
                                   min_obs = 10,
                                   zoom = 2, # Sets the initial camera 
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

To double-check that our temperature_coefficient_plot() function works properly and not just for India during the specified date range, we can try creating the Mapbox plot for Australia:

color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot(db_file = "temps.db",
                                   country = "Australia", # Different country
                                   year_begin = 2000, # Different start date
                                   year_end = 2020, 
                                   month = 1, 
                                   min_obs = 10,
                                   zoom = 2, # Sets the initial camera 
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

It looks like our temperature_coefficient_plot() function works properly. Upon analyzing the plot for India and Australia, one can infer that climate temperatures are increasing more rapidly in more densely populated areas. This is likely due to increased levels of pollution which are expediting the effects of global warming. To further explore this, we can use a boxplot graph to address a question that would build upon our inference.

Mapping Monthly Average Temperature by Latitude with 3D Scatterplots

If we want to take a look at the bigger picture and see how latitude affects temperature, and how that temperature changes throughout the year in a given country, we can ask:

  • How does the average monthly temperature vary across latitudes over time? Is there a relationship?

We will use the px.scatter_3d() function from Plotly Express to construct an interactive three-dimensional graph with the month (1-12) on the x-axis, station latitude on the y-axis, and average monthly temperature (°C) on the z-axis. This will help identify any broad trends in both seasonal temperature and geographic factors.

Before we can construct a function that will plot our 3D graph, we must define a new query function in climate_database.py to handle and prep our data. Our first query function retrieved station-specific data, but this second query focuses on latitude-based temperature trends across months. Let’s call it second_query_climate_database().

The function will take the four following arguments for flexible user customization:

  • db_file, the file name for the database.
  • country, a string giving the name of a country for which data should be returned.
  • year_begin and year_end, two integers giving the earliest and latest years for which should be returned (inclusive).

After our function is written in climate_database.py, we may import the function and check if it is visible as follows:

from climate_database import second_query_climate_database
import inspect
print(inspect.getsource(second_query_climate_database))
def second_query_climate_database(db_file, country, year_begin, year_end):
    """
    Query the climate database for temperature data across months and latitudes.

    Parameters:
    - db_file (str): The file name of the SQLite database (i.e. temps.db).
    - country (str): Target country.
    - year_begin (int): Start year.
    - year_end (int): End year.

    Returns:
    - pd.DataFrame: A data frame containing month, latitude, and average temperature.
    """
    # Connect to the database
    conn = sqlite3.connect(db_file)
    
    # Write the SQL query using f-strings
    query = f"""
        SELECT 
            T.month AS Month,
            S.latitude AS LATITUDE,
            ROUND(AVG(T.temp),2) AS Avg_Temp
        FROM 
            temperatures T
        JOIN 
            stations S ON T.id = S.id
        JOIN 
            countries C ON SUBSTR(S.id, 1, 2) = C."FIPS 10-4"
        WHERE 
            C.name = '{country}' AND
            T.year BETWEEN {year_begin} AND {year_end}
        GROUP BY 
            T.month, S.latitude
        ORDER BY 
            T.month, S.latitude;
    """
    # Takes average temperature for each month and latitude
    # Execute the query and load the results into a DataFrame
    df = pd.read_sql_query(query, conn)
    
    # Close the database connection
    conn.close()
    
    return df

With our query ready to go, we can write the function that will plot our 3D graph.

We can construct the 3D visualization as follows:

def plot_3d_temperature(db_file, country, year_begin, year_end):
    """
    Create a 3D scatter plot for temperature data.

    Parameters:
    - db_file (str): The file name of the SQLite database (temps.db).
    - country (str): Target country.
    - year_begin (int): Start year.
    - year_end (int): End year.

    Returns:
    - plotly.graph_objects.Figure: The 3D scatter plot.
    """
    # Fetch data
    df = second_query_climate_database(db_file, country, year_begin, year_end)
    
    # Create 3D scatter plot with color scale to demark higher and lower temps
    fig = px.scatter_3d(df, x='Month', y='LATITUDE', z='Avg_Temp',
                        color='Avg_Temp',
                        title='temp_name',
                        labels={'Month': 'Month', 'Latitude': 'Latitude',
                                'Avg_Temp': 'Average Temperature (°C)'},
                        color_continuous_scale='thermal',
                        width=800,
                        height=500)
    
    fig.update_traces(marker=dict(size=5))
    
    fig.update_layout(
        margin={"r": 120, "t": 60, "l": 0, "b": 0},
        title={ # Title config with f-string to update based on input country and dates
        "text": f"""Monthly Average Temperature by Latitude in {country}<br>
        Between {year_begin}-{year_end}.""",
        "x": 0.5,  
        "xanchor": "center",  
        "yanchor": "top"},
        coloraxis_colorbar={
            "title": "Average Temp. (°C)",  # Set the color bar title
            "title_side": "right",
            "x": 0.85},
    ) 
    
    return fig

After running our 3D scatter function successfully, we can call it for Australia with the same inputs as made previously, omitting the months key since we are now plotting values for all months.

The following function calls plot_3d_temperature:

fig = plot_3d_temperature(db_file = "temps.db", 
                          country = "Australia", 
                          year_begin = 2000, 
                          year_end = 2020)

fig.show()

Through the three plots we explored, we uncovered a nuanced view of Australia’s climate. This analysis reinforces the relationship between latitude, seasonal changes, and long-term warming trends.