Table of Contents
1 Test Calcs (Single Year)
2 Process All Years
3 Create a subset of the data
4 Write the subset out to a csv file
Process Daily Data
Concatenate the raw data csv files into a single dataframe for all years, doing some quality control. We’ll be focusing on the YVR airport station.
Start with a single year to develop the code.
Write a function so that the same processing can be applied to every year.
[1]:
from pathlib import Path
import pandas as pd
[2]:
import context
in context.py, setting root_dir to /Users/phil/repos/eosc213_students/notebooks/pandas
******************************
context imported. Front of path:
/Users/phil/repos/eosc213_students/notebooks/pandas
/Users/phil/mini37/lib/python36.zip
******************************
Test Calcs (Single Year)
[3]:
station = "YVR"
stn_id = 51442
year = 2013
datafile = context.raw_dir / Path(f"weather_daily_{station}_{stn_id}_{year}.csv")
[4]:
df_in = pd.read_csv(datafile, skiprows=24, index_col=0, parse_dates=True)
df_in.head()
[4]:
Year | Month | Day | Data Quality | Max Temp (°C) | Max Temp Flag | Min Temp (°C) | Min Temp Flag | Mean Temp (°C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-01-01 | 2013 | 1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-02 | 2013 | 1 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-03 | 2013 | 1 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-04 | 2013 | 1 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-05 | 2013 | 1 | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 26 columns
Some years, such as 2013, have a chunk of missing data at the beginning or end because the data is split into two different files where the Environment Canada station ID changed partway through the year.
To make sure these rows don’t cause problems during concatenation later, we will simply discard any rows where all the measurements are missing (the measurements run from the 'Data Quality'
column to the final column).
Note that we are specifying the ‘Data Quality’ column by its name, not its column number.
[5]:
# Create a series which is True for any row that has at least one non-null measurement
# and False otherwise
not_null = df_in.loc[:, "Data Quality":].notnull().any(axis=1)
not_null.head()
[5]:
Date/Time
2013-01-01 False
2013-01-02 False
2013-01-03 False
2013-01-04 False
2013-01-05 False
dtype: bool
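As a quick illustration of how this mask works, here is the same `notnull().any(axis=1)` pattern applied to a tiny made-up frame (toy data, not the YVR file):

```python
import pandas as pd
import numpy as np

# Toy frame: row 1 has one real measurement, rows 0 and 2 are all-null
toy = pd.DataFrame(
    {"Data Quality": [np.nan, "ok", np.nan],
     "Max Temp (C)": [np.nan, 5.0, np.nan]},
    index=pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
)
# True wherever at least one measurement column is non-null
mask = toy.loc[:, "Data Quality":].notnull().any(axis=1)
print(mask.tolist())  # [False, True, False]
```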
[6]:
# Extract the subset of df_in where the not_null Series is True
# (take a copy so that modifying its columns later doesn't trigger
# a SettingWithCopyWarning)
data = df_in[not_null].copy()
data.head()
[6]:
Year | Month | Day | Data Quality | Max Temp (°C) | Max Temp Flag | Min Temp (°C) | Min Temp Flag | Mean Temp (°C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-06-13 | 2013 | 6 | 13 | NaN | 19.3 | NaN | 8.9 | NaN | 14.1 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 33 | NaN |
2013-06-14 | 2013 | 6 | 14 | NaN | 18.3 | NaN | 11.5 | NaN | 14.9 | NaN | ... | 0.0 | NaN | 0.4 | NaN | NaN | NaN | 12.0 | NaN | 35 | NaN |
2013-06-15 | 2013 | 6 | 15 | NaN | 19.9 | NaN | 7.8 | NaN | 13.9 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | <31 | NaN |
2013-06-16 | 2013 | 6 | 16 | NaN | 22.4 | NaN | 13.2 | NaN | 17.8 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 39 | NaN |
2013-06-17 | 2013 | 6 | 17 | NaN | 22.8 | NaN | 14.1 | NaN | 18.5 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 21.0 | NaN | 44 | NaN |
5 rows × 26 columns
To make life easier, let’s also remove the degree symbol from the column names, using the string method replace and a list comprehension.
[7]:
# Create list of column names with degree symbols removed
columns = [nm.replace("\xb0", "") for nm in data.columns]
columns
[7]:
['Year',
'Month',
'Day',
'Data Quality',
'Max Temp (C)',
'Max Temp Flag',
'Min Temp (C)',
'Min Temp Flag',
'Mean Temp (C)',
'Mean Temp Flag',
'Heat Deg Days (C)',
'Heat Deg Days Flag',
'Cool Deg Days (C)',
'Cool Deg Days Flag',
'Total Rain (mm)',
'Total Rain Flag',
'Total Snow (cm)',
'Total Snow Flag',
'Total Precip (mm)',
'Total Precip Flag',
'Snow on Grnd (cm)',
'Snow on Grnd Flag',
'Dir of Max Gust (10s deg)',
'Dir of Max Gust Flag',
'Spd of Max Gust (km/h)',
'Spd of Max Gust Flag']
[8]:
# Update the column names in the DataFrame
data.columns = columns
data.head()
[8]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-06-13 | 2013 | 6 | 13 | NaN | 19.3 | NaN | 8.9 | NaN | 14.1 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 33 | NaN |
2013-06-14 | 2013 | 6 | 14 | NaN | 18.3 | NaN | 11.5 | NaN | 14.9 | NaN | ... | 0.0 | NaN | 0.4 | NaN | NaN | NaN | 12.0 | NaN | 35 | NaN |
2013-06-15 | 2013 | 6 | 15 | NaN | 19.9 | NaN | 7.8 | NaN | 13.9 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | <31 | NaN |
2013-06-16 | 2013 | 6 | 16 | NaN | 22.4 | NaN | 13.2 | NaN | 17.8 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 39 | NaN |
2013-06-17 | 2013 | 6 | 17 | NaN | 22.8 | NaN | 14.1 | NaN | 18.5 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 21.0 | NaN | 44 | NaN |
5 rows × 26 columns
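As an aside, pandas can also do this renaming with the vectorized string methods on the column Index; this sketch (with made-up column names) is equivalent to the list comprehension above:

```python
import pandas as pd

# Index.str.replace does the same job as the list comprehension,
# but operates on the whole Index at once
cols = pd.Index(["Max Temp (\xb0C)", "Min Temp (\xb0C)", "Year"])
cleaned = cols.str.replace("\xb0", "", regex=False)
print(list(cleaned))  # ['Max Temp (C)', 'Min Temp (C)', 'Year']
```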
Process All Years
Let’s consolidate the above code into a function and then use it to process each year and concatenate the years into a single DataFrame.
[9]:
def process_data(datafile, skiprows=24):
    """Process the daily weather data for a single year."""
    df_in = pd.read_csv(datafile, skiprows=skiprows, index_col=0, parse_dates=True)
    # Create a series which is True for any row that has at least one
    # non-null measurement and False otherwise
    not_null = df_in.loc[:, "Data Quality":].notnull().any(axis=1)
    # Extract the subset of df_in where the not_null Series is True
    # (copy so we can safely modify the column names below)
    data = df_in[not_null].copy()
    # Create list of column names with degree symbols removed
    columns = [nm.replace("\xb0", "") for nm in data.columns]
    # Update the column names in the DataFrame
    data.columns = columns
    return data
[10]:
# Test the function on the data file from above
test = process_data(datafile)
test.head()
[10]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-06-13 | 2013 | 6 | 13 | NaN | 19.3 | NaN | 8.9 | NaN | 14.1 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 33 | NaN |
2013-06-14 | 2013 | 6 | 14 | NaN | 18.3 | NaN | 11.5 | NaN | 14.9 | NaN | ... | 0.0 | NaN | 0.4 | NaN | NaN | NaN | 12.0 | NaN | 35 | NaN |
2013-06-15 | 2013 | 6 | 15 | NaN | 19.9 | NaN | 7.8 | NaN | 13.9 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | <31 | NaN |
2013-06-16 | 2013 | 6 | 16 | NaN | 22.4 | NaN | 13.2 | NaN | 17.8 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 39 | NaN |
2013-06-17 | 2013 | 6 | 17 | NaN | 22.8 | NaN | 14.1 | NaN | 18.5 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 21.0 | NaN | 44 | NaN |
5 rows × 26 columns
First, concatenate all the data from the early set of years.
Note: The YVR data from 1937 is a bit wonky, so we’ll exclude it here.
Then, concatenate the data from the recent years.
[11]:
# Initialize an empty DataFrame
data_all = pd.DataFrame()
# Early data (1938 to mid 2013)
stn_id_early = 889
years_early = range(1938, 2014)
# Recent data (mid 2013 to 2019)
stn_id_recent = 51442
years_recent = range(2013, 2020)
# Loop over station IDs and years, using Python's zip function
for stn, years_list in zip([stn_id_early, stn_id_recent], [years_early, years_recent]):
    for year in years_list:
        filename = context.raw_dir / Path(f"weather_daily_{station}_{stn}_{year}.csv")
        data_in = process_data(filename)
        # Use the append method to append the new data
        data_all = data_all.append(data_in)
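A caution if you are running a newer pandas: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. The equivalent (and faster) pattern is to collect the per-year frames in a list and call `pd.concat` once at the end. A sketch, using toy frames standing in for the `process_data` results:

```python
import pandas as pd

# Collect the per-year frames in a list, then concatenate once --
# repeated appends copy the accumulated data every iteration,
# while a single concat copies it only once.
frames = [
    pd.DataFrame({"Year": [2013]}, index=pd.to_datetime(["2013-06-13"])),
    pd.DataFrame({"Year": [2014]}, index=pd.to_datetime(["2014-01-01"])),
]
data_all = pd.concat(frames)
print(data_all.shape)  # (2, 1)
```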
[12]:
data_all.head(2)
[12]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
1938-01-01 | 1938 | 1 | 1 | NaN | 9.4 | NaN | -0.6 | NaN | 4.4 | NaN | ... | NaN | M | 0.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1938-01-02 | 1938 | 1 | 2 | NaN | 7.2 | NaN | 1.7 | NaN | 4.5 | NaN | ... | NaN | M | 0.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 26 columns
[13]:
data_all.tail(2)
[13]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2019-03-17 | 2019 | 3 | 17 | NaN | 11.9 | NaN | 1.4 | NaN | 6.7 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | M |
2019-03-18 | 2019 | 3 | 18 | NaN | 14.0 | NaN | 1.2 | NaN | 7.6 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | M | NaN | M |
2 rows × 26 columns
[14]:
data_all.shape
[14]:
(29632, 26)
[15]:
# Save the full set of concatenated data to file
year_min, year_max = data_all["Year"].min(), data_all["Year"].max()
savefile = (context.processed_dir
            / Path(f"weather_daily_{station}_{year_min}-{year_max}_all.csv"))
print(f"Saving to {savefile}")
data_all.to_csv(savefile)
Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_daily_YVR_1938-2019_all.csv
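When we want this file back later, note that `to_csv` writes the DatetimeIndex as the first column, so it reads back with the same `index_col=0, parse_dates=True` arguments we used above. A round-trip sketch with toy data (the real file is at the path printed above):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a named DatetimeIndex, mimicking the structure of data_all
df = pd.DataFrame({"Year": [1938]}, index=pd.to_datetime(["1938-01-01"]))
df.index.name = "Date/Time"
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.csv"
    df.to_csv(path)
    # Read it back the same way we read the raw files
    df2 = pd.read_csv(path, index_col=0, parse_dates=True)
print(df2.index[0])  # 1938-01-01 00:00:00
```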
Create a subset of the data
For demos we’ll use a subset of the data columns and rename some of them for convenience.
[16]:
# Extract subset with columns of interest and rename some columns
columns = [
    "Year",
    "Month",
    "Day",
    "Mean Temp (C)",
    "Max Temp (C)",
    "Min Temp (C)",
    "Total Rain (mm)",
    "Total Snow (cm)",
    "Total Precip (mm)",
]
cols_dict = {
    "Mean Temp (C)": "T_mean (C)",
    "Max Temp (C)": "T_high (C)",
    "Min Temp (C)": "T_low (C)",
    "Total Rain (mm)": "Rain (mm)",
    "Total Snow (cm)": "Snow (cm)",
}
data_subset = data_all[columns].rename(columns=cols_dict)
data_subset.index.name = "Date"
data_subset.head()
[16]:
Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
1938-01-01 | 1938 | 1 | 1 | 4.4 | 9.4 | -0.6 | NaN | NaN | 0.3 |
1938-01-02 | 1938 | 1 | 2 | 4.5 | 7.2 | 1.7 | NaN | NaN | 0.5 |
1938-01-03 | 1938 | 1 | 3 | 1.7 | 7.2 | -3.9 | 0.0 | 0.0 | 0.0 |
1938-01-04 | 1938 | 1 | 4 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
1938-01-05 | 1938 | 1 | 5 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
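The select-then-rename pattern above chains nicely; here is the same idea on a one-row toy frame (made-up values):

```python
import pandas as pd

# Select a column subset, then rename in one chain, as in the cell above
df = pd.DataFrame({"Mean Temp (C)": [4.4], "Extra": [0]})
subset = df[["Mean Temp (C)"]].rename(columns={"Mean Temp (C)": "T_mean (C)"})
print(list(subset.columns))  # ['T_mean (C)']
```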
Write the subset out to a csv file
[17]:
savefile2 = context.processed_dir / Path(f"weather_{station}.csv")
print(f"Saving to {savefile2}")
data_subset.to_csv(savefile2)
Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_YVR.csv
[ ]: