Table of Contents

  • 1  Test Calcs (Single Year)

  • 2  Process All Years

  • 3  Create a subset of the data

  • 4  Write the subset out to a csv file

Process Daily Data

Concatenate the raw data csv files into a single dataframe for all years, doing some quality control. We’ll be focussing on the YVR airport station.

  • Start with a single year to develop the code

  • Write a function so the same processing can be applied to every year

[1]:
from pathlib import Path

import pandas as pd
[2]:
import context
in context.py, setting root_dir to /Users/phil/repos/eosc213_students/notebooks/pandas
******************************
context imported. Front of path:
/Users/phil/repos/eosc213_students/notebooks/pandas
/Users/phil/mini37/lib/python36.zip
******************************

Test Calcs (Single Year)

[3]:
station = "YVR"
stn_id = 51442
year = 2013
datafile = context.raw_dir / Path(f"weather_daily_{station}_{stn_id}_{year}.csv")
[4]:
df_in = pd.read_csv(datafile, skiprows=24, index_col=0, parse_dates=True)
df_in.head()
[4]:
Year Month Day Data Quality Max Temp (°C) Max Temp Flag Min Temp (°C) Min Temp Flag Mean Temp (°C) Mean Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
2013-01-01 2013 1 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-02 2013 1 2 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-03 2013 1 3 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-04 2013 1 4 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-05 2013 1 5 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 26 columns

Some years, such as 2013, have a chunk of missing data at the beginning or end because the data is split into two different files where the Environment Canada station ID changed partway through the year.

To make sure these rows don’t mess things up for concatenation later, we will simply discard any rows where all the measurements are missing (the measurements go from column 'Data Quality' to the final column).

Note that we are specifying the ‘Data Quality’ column by its name, not by its column number.

[5]:
# Create a series which is True for any row that has at least one non-null measurement
# and False otherwise

not_null = df_in.loc[:, "Data Quality":].notnull().any(axis=1)
not_null.head()
[5]:
Date/Time
2013-01-01    False
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-05    False
dtype: bool
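As an aside, the same slice could be taken positionally with iloc instead of by label; a minimal sketch, assuming ‘Data Quality’ is the fourth column (position 3), as in the table above. The label-based version is preferred because it keeps working if columns are added or reordered.

# Positional equivalent (less robust): slice from column position 3 onward
not_null_pos = df_in.iloc[:, 3:].notnull().any(axis=1)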
[6]:
# Extract the subset of df_in where the not_null Series is True
data = df_in[not_null]
data.head()
[6]:
Year Month Day Data Quality Max Temp (°C) Max Temp Flag Min Temp (°C) Min Temp Flag Mean Temp (°C) Mean Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
2013-06-13 2013 6 13 NaN 19.3 NaN 8.9 NaN 14.1 NaN ... 0.0 NaN 0.0 NaN NaN NaN 20.0 NaN 33 NaN
2013-06-14 2013 6 14 NaN 18.3 NaN 11.5 NaN 14.9 NaN ... 0.0 NaN 0.4 NaN NaN NaN 12.0 NaN 35 NaN
2013-06-15 2013 6 15 NaN 19.9 NaN 7.8 NaN 13.9 NaN ... 0.0 NaN 0.0 NaN NaN NaN NaN NaN <31 NaN
2013-06-16 2013 6 16 NaN 22.4 NaN 13.2 NaN 17.8 NaN ... 0.0 NaN 0.0 NaN NaN NaN 20.0 NaN 39 NaN
2013-06-17 2013 6 17 NaN 22.8 NaN 14.1 NaN 18.5 NaN ... 0.0 NaN 0.0 NaN NaN NaN 21.0 NaN 44 NaN

5 rows × 26 columns

To make life easier, let’s also remove the degree symbol from the column names, using the string method replace and a list comprehension.

[7]:
# Create list of column names with degree symbols removed
columns = [nm.replace("\xb0", "") for nm in data.columns]
columns
[7]:
['Year',
 'Month',
 'Day',
 'Data Quality',
 'Max Temp (C)',
 'Max Temp Flag',
 'Min Temp (C)',
 'Min Temp Flag',
 'Mean Temp (C)',
 'Mean Temp Flag',
 'Heat Deg Days (C)',
 'Heat Deg Days Flag',
 'Cool Deg Days (C)',
 'Cool Deg Days Flag',
 'Total Rain (mm)',
 'Total Rain Flag',
 'Total Snow (cm)',
 'Total Snow Flag',
 'Total Precip (mm)',
 'Total Precip Flag',
 'Snow on Grnd (cm)',
 'Snow on Grnd Flag',
 'Dir of Max Gust (10s deg)',
 'Dir of Max Gust Flag',
 'Spd of Max Gust (km/h)',
 'Spd of Max Gust Flag']
[8]:
# Update the column names in the DataFrame
data.columns = columns
data.head()
[8]:
Year Month Day Data Quality Max Temp (C) Max Temp Flag Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
2013-06-13 2013 6 13 NaN 19.3 NaN 8.9 NaN 14.1 NaN ... 0.0 NaN 0.0 NaN NaN NaN 20.0 NaN 33 NaN
2013-06-14 2013 6 14 NaN 18.3 NaN 11.5 NaN 14.9 NaN ... 0.0 NaN 0.4 NaN NaN NaN 12.0 NaN 35 NaN
2013-06-15 2013 6 15 NaN 19.9 NaN 7.8 NaN 13.9 NaN ... 0.0 NaN 0.0 NaN NaN NaN NaN NaN <31 NaN
2013-06-16 2013 6 16 NaN 22.4 NaN 13.2 NaN 17.8 NaN ... 0.0 NaN 0.0 NaN NaN NaN 20.0 NaN 39 NaN
2013-06-17 2013 6 17 NaN 22.8 NaN 14.1 NaN 18.5 NaN ... 0.0 NaN 0.0 NaN NaN NaN 21.0 NaN 44 NaN

5 rows × 26 columns
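As an aside, the same renaming can be done without a list comprehension by using pandas’ vectorized string methods directly on the column index; a minimal sketch of the equivalent one-liner (same result, different idiom):

# Strip the degree symbol from every column name at once
# (regex=False treats the pattern as a literal string)
data.columns = data.columns.str.replace("\xb0", "", regex=False)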

Process All Years

Let’s consolidate the above code into a function and then use it to process each year and concatenate the years into a single DataFrame.

[9]:
def process_data(datafile, skiprows=24):
    """Process data for a single year."""
    df_in = pd.read_csv(datafile, skiprows=skiprows, index_col=0, parse_dates=True)

    # Create a series which is True for any row that has at least one
    # non-null measurement and False otherwise
    not_null = df_in.loc[:, "Data Quality":].notnull().any(axis=1)

    # Extract the subset of df_in where the not_null Series is True
    data = df_in[not_null]

    # Create list of column names with degree symbols removed
    columns = [nm.replace("\xb0", "") for nm in data.columns]

    # Update the column names in the DataFrame
    data.columns = columns

    return data
[10]:
# Test the function on the data file from above
test = process_data(datafile)
test.head()
[10]:
Year Month Day Data Quality Max Temp (C) Max Temp Flag Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
2013-06-13 2013 6 13 NaN 19.3 NaN 8.9 NaN 14.1 NaN ... 0.0 NaN 0.0 NaN NaN NaN 20.0 NaN 33 NaN
2013-06-14 2013 6 14 NaN 18.3 NaN 11.5 NaN 14.9 NaN ... 0.0 NaN 0.4 NaN NaN NaN 12.0 NaN 35 NaN
2013-06-15 2013 6 15 NaN 19.9 NaN 7.8 NaN 13.9 NaN ... 0.0 NaN 0.0 NaN NaN NaN NaN NaN <31 NaN
2013-06-16 2013 6 16 NaN 22.4 NaN 13.2 NaN 17.8 NaN ... 0.0 NaN 0.0 NaN NaN NaN 20.0 NaN 39 NaN
2013-06-17 2013 6 17 NaN 22.8 NaN 14.1 NaN 18.5 NaN ... 0.0 NaN 0.0 NaN NaN NaN 21.0 NaN 44 NaN

5 rows × 26 columns

  • First, concatenate all the data from the first set of years.

    • Note: The YVR data from 1937 is a bit wonky, so we’ll exclude it here.

  • Then, concatenate the data from the recent years.

[11]:
# Initialize an empty DataFrame
data_all = pd.DataFrame()

# Early data (1938 to mid 2013)
stn_id_early = 889
years_early = range(1938, 2014)

# Recent data (mid 2013 to 2019)
stn_id_recent = 51442
years_recent = range(2013, 2020)

# Loop over station IDs and years, using Python's zip function
for stn, years_list in zip([stn_id_early, stn_id_recent], [years_early, years_recent]):
    for year in years_list:
        filename = context.raw_dir / Path(f"weather_daily_{station}_{stn}_{year}.csv")
        data_in = process_data(filename)

        # Use the append method to append the new data
        data_all = data_all.append(data_in)
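Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same loop that collects each year in a list and calls pd.concat once (also faster, since the frames are copied only once):

# Collect each year's DataFrame in a list, then concatenate once at the end
frames = []
for stn, years_list in zip([stn_id_early, stn_id_recent], [years_early, years_recent]):
    for year in years_list:
        filename = context.raw_dir / Path(f"weather_daily_{station}_{stn}_{year}.csv")
        frames.append(process_data(filename))
data_all = pd.concat(frames)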
[12]:
data_all.head(2)
[12]:
Year Month Day Data Quality Max Temp (C) Max Temp Flag Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
1938-01-01 1938 1 1 NaN 9.4 NaN -0.6 NaN 4.4 NaN ... NaN M 0.3 NaN NaN NaN NaN NaN NaN NaN
1938-01-02 1938 1 2 NaN 7.2 NaN 1.7 NaN 4.5 NaN ... NaN M 0.5 NaN NaN NaN NaN NaN NaN NaN

2 rows × 26 columns

[13]:
data_all.tail(2)
[13]:
Year Month Day Data Quality Max Temp (C) Max Temp Flag Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... Total Snow (cm) Total Snow Flag Total Precip (mm) Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
Date/Time
2019-03-17 2019 3 17 NaN 11.9 NaN 1.4 NaN 6.7 NaN ... 0.0 NaN 0.0 NaN NaN NaN NaN NaN NaN M
2019-03-18 2019 3 18 NaN 14.0 NaN 1.2 NaN 7.6 NaN ... 0.0 NaN 0.0 NaN NaN NaN NaN M NaN M

2 rows × 26 columns

[14]:
data_all.shape
[14]:
(29632, 26)
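Because the 2013 data is split across two files, it’s worth a quick sanity check that no dates were duplicated during concatenation; a minimal sketch (not part of the original processing):

# Each date should appear at most once in the concatenated index
# If this returns False, the offending dates can be listed with:
#   data_all.index[data_all.index.duplicated()]
data_all.index.is_unique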
[15]:
# Save the full set of concatenated data to file
year_min, year_max = data_all["Year"].min(), data_all["Year"].max()
savefile = (context.processed_dir
             / Path(f"weather_daily_{station}_{year_min}-{year_max}_all.csv"))
print(f"Saving to {savefile}")
data_all.to_csv(savefile)
Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_daily_YVR_1938-2019_all.csv

Create a subset of the data

For demos we’ll use a subset of the data columns and rename some of them for convenience.

[16]:
# Extract subset with columns of interest and rename some columns

columns = [
    "Year",
    "Month",
    "Day",
    "Mean Temp (C)",
    "Max Temp (C)",
    "Min Temp (C)",
    "Total Rain (mm)",
    "Total Snow (cm)",
    "Total Precip (mm)",
]

cols_dict = {
    "Mean Temp (C)": "T_mean (C)",
    "Max Temp (C)": "T_high (C)",
    "Min Temp (C)": "T_low (C)",
    "Total Rain (mm)": "Rain (mm)",
    "Total Snow (cm)": "Snow (cm)",
}
data_subset = data_all[columns].rename(columns=cols_dict)
data_subset.index.name = "Date"
data_subset.head()
[16]:
Year Month Day T_mean (C) T_high (C) T_low (C) Rain (mm) Snow (cm) Total Precip (mm)
Date
1938-01-01 1938 1 1 4.4 9.4 -0.6 NaN NaN 0.3
1938-01-02 1938 1 2 4.5 7.2 1.7 NaN NaN 0.5
1938-01-03 1938 1 3 1.7 7.2 -3.9 0.0 0.0 0.0
1938-01-04 1938 1 4 2.2 7.2 -2.8 0.0 0.0 0.0
1938-01-05 1938 1 5 2.2 7.2 -2.8 0.0 0.0 0.0

Write the subset out to a csv file

[17]:
savefile2 = context.processed_dir / Path(f"weather_{station}.csv")
print(f"Saving to {savefile2}")
data_subset.to_csv(savefile2)
Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_YVR.csv
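To use the subset later, it can be read back with the ‘Date’ column restored as a datetime index; a minimal sketch:

# Read the processed subset back in, parsing the Date column as the index
df_check = pd.read_csv(savefile2, index_col="Date", parse_dates=True)
df_check.head()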