Table of Contents
1 Test Calcs (Single Year)
2 Process All Years
3 Create a subset of the data
4 Write the subset out to a csv file
Process Daily Data
Concatenate the raw data csv files into a single dataframe for all years, doing some quality control. We’ll be focusing on the YVR airport station.
Start with a single year to develop the code.
Write a function so that the same processing can be applied to every year.
[1]:
from pathlib import Path
import pandas as pd
[2]:
import context
in context.py, setting root_dir to /Users/phil/repos/eosc213_students/notebooks/pandas
******************************
context imported. Front of path:
/Users/phil/repos/eosc213_students/notebooks/pandas
/Users/phil/mini37/lib/python36.zip
******************************
Test Calcs (Single Year)
[3]:
station = "YVR"
stn_id = 51442
year = 2013
datafile = context.raw_dir / Path(f"weather_daily_{station}_{stn_id}_{year}.csv")
[4]:
df_in = pd.read_csv(datafile, skiprows=24, index_col=0, parse_dates=True)
df_in.head()
[4]:
Year | Month | Day | Data Quality | Max Temp (°C) | Max Temp Flag | Min Temp (°C) | Min Temp Flag | Mean Temp (°C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-01-01 | 2013 | 1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-02 | 2013 | 1 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-03 | 2013 | 1 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-04 | 2013 | 1 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2013-01-05 | 2013 | 1 | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 26 columns
Some years, such as 2013, have a chunk of missing data at the beginning or end because the data is split into two different files where the Environment Canada station ID changed partway through the year.
To make sure these rows don’t cause problems during concatenation later, we will simply discard any rows where all the measurements are missing (the measurements run from the 'Data Quality'
column to the final column).
Note that we are specifying the ‘Data Quality’ column by its name, not its column number.
[5]:
# Create a series which is True for any row that has at least one non-null measurement
# and False otherwise
not_null = df_in.loc[:, "Data Quality":].notnull().any(axis=1)
not_null.head()
[5]:
Date/Time
2013-01-01 False
2013-01-02 False
2013-01-03 False
2013-01-04 False
2013-01-05 False
dtype: bool
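As a quick illustration of how this mask works, here is the same `notnull().any(axis=1)` pattern applied to a tiny made-up frame (toy data, not the YVR file):

```python
import pandas as pd
import numpy as np

# Toy frame: row 1 has one real measurement, rows 0 and 2 are all-null
toy = pd.DataFrame(
    {"Data Quality": [np.nan, "ok", np.nan],
     "Max Temp (C)": [np.nan, 5.0, np.nan]},
    index=pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
)
# True wherever at least one measurement column is non-null
mask = toy.loc[:, "Data Quality":].notnull().any(axis=1)
print(mask.tolist())  # [False, True, False]
```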
[6]:
# Extract the subset of df_in where the not_null Series is True
# (take a copy so that modifying its columns later doesn't trigger
# a SettingWithCopyWarning)
data = df_in[not_null].copy()
data.head()
[6]:
Year | Month | Day | Data Quality | Max Temp (°C) | Max Temp Flag | Min Temp (°C) | Min Temp Flag | Mean Temp (°C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-06-13 | 2013 | 6 | 13 | NaN | 19.3 | NaN | 8.9 | NaN | 14.1 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 33 | NaN |
2013-06-14 | 2013 | 6 | 14 | NaN | 18.3 | NaN | 11.5 | NaN | 14.9 | NaN | ... | 0.0 | NaN | 0.4 | NaN | NaN | NaN | 12.0 | NaN | 35 | NaN |
2013-06-15 | 2013 | 6 | 15 | NaN | 19.9 | NaN | 7.8 | NaN | 13.9 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | <31 | NaN |
2013-06-16 | 2013 | 6 | 16 | NaN | 22.4 | NaN | 13.2 | NaN | 17.8 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 39 | NaN |
2013-06-17 | 2013 | 6 | 17 | NaN | 22.8 | NaN | 14.1 | NaN | 18.5 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 21.0 | NaN | 44 | NaN |
5 rows × 26 columns
To make life easier, let’s also remove the degree symbol from the column names, using the string method replace and a list comprehension.
[7]:
# Create list of column names with degree symbols removed
columns = [nm.replace("\xb0", "") for nm in data.columns]
columns
[7]:
['Year',
'Month',
'Day',
'Data Quality',
'Max Temp (C)',
'Max Temp Flag',
'Min Temp (C)',
'Min Temp Flag',
'Mean Temp (C)',
'Mean Temp Flag',
'Heat Deg Days (C)',
'Heat Deg Days Flag',
'Cool Deg Days (C)',
'Cool Deg Days Flag',
'Total Rain (mm)',
'Total Rain Flag',
'Total Snow (cm)',
'Total Snow Flag',
'Total Precip (mm)',
'Total Precip Flag',
'Snow on Grnd (cm)',
'Snow on Grnd Flag',
'Dir of Max Gust (10s deg)',
'Dir of Max Gust Flag',
'Spd of Max Gust (km/h)',
'Spd of Max Gust Flag']
[8]:
# Update the column names in the DataFrame
data.columns = columns
data.head()
[8]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-06-13 | 2013 | 6 | 13 | NaN | 19.3 | NaN | 8.9 | NaN | 14.1 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 33 | NaN |
2013-06-14 | 2013 | 6 | 14 | NaN | 18.3 | NaN | 11.5 | NaN | 14.9 | NaN | ... | 0.0 | NaN | 0.4 | NaN | NaN | NaN | 12.0 | NaN | 35 | NaN |
2013-06-15 | 2013 | 6 | 15 | NaN | 19.9 | NaN | 7.8 | NaN | 13.9 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | <31 | NaN |
2013-06-16 | 2013 | 6 | 16 | NaN | 22.4 | NaN | 13.2 | NaN | 17.8 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 39 | NaN |
2013-06-17 | 2013 | 6 | 17 | NaN | 22.8 | NaN | 14.1 | NaN | 18.5 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 21.0 | NaN | 44 | NaN |
5 rows × 26 columns
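As an aside, pandas can also do this renaming with the vectorized string methods on the column Index; this sketch (with made-up column names) is equivalent to the list comprehension above:

```python
import pandas as pd

# Index.str.replace does the same job as the list comprehension,
# but operates on the whole Index at once
cols = pd.Index(["Max Temp (\xb0C)", "Min Temp (\xb0C)", "Year"])
cleaned = cols.str.replace("\xb0", "", regex=False)
print(list(cleaned))  # ['Max Temp (C)', 'Min Temp (C)', 'Year']
```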
Process All Years
Let’s consolidate the above code into a function and then use it to process each year and concatenate the years into a single DataFrame.
[9]:
def process_data(datafile, skiprows=24):
    """Process the daily weather data for a single year."""
    df_in = pd.read_csv(datafile, skiprows=skiprows, index_col=0, parse_dates=True)
    # Create a series which is True for any row that has at least one
    # non-null measurement and False otherwise
    not_null = df_in.loc[:, "Data Quality":].notnull().any(axis=1)
    # Extract the subset of df_in where the not_null Series is True
    # (copy so we can safely modify the column names below)
    data = df_in[not_null].copy()
    # Create list of column names with degree symbols removed
    columns = [nm.replace("\xb0", "") for nm in data.columns]
    # Update the column names in the DataFrame
    data.columns = columns
    return data
[10]:
# Test the function on the data file from above
test = process_data(datafile)
test.head()
[10]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2013-06-13 | 2013 | 6 | 13 | NaN | 19.3 | NaN | 8.9 | NaN | 14.1 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 33 | NaN |
2013-06-14 | 2013 | 6 | 14 | NaN | 18.3 | NaN | 11.5 | NaN | 14.9 | NaN | ... | 0.0 | NaN | 0.4 | NaN | NaN | NaN | 12.0 | NaN | 35 | NaN |
2013-06-15 | 2013 | 6 | 15 | NaN | 19.9 | NaN | 7.8 | NaN | 13.9 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | <31 | NaN |
2013-06-16 | 2013 | 6 | 16 | NaN | 22.4 | NaN | 13.2 | NaN | 17.8 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 20.0 | NaN | 39 | NaN |
2013-06-17 | 2013 | 6 | 17 | NaN | 22.8 | NaN | 14.1 | NaN | 18.5 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 21.0 | NaN | 44 | NaN |
5 rows × 26 columns
First, concatenate all the data from the early set of years.
Note: The YVR data from 1937 is a bit wonky, so we’ll exclude it here.
Then, concatenate the data from the recent years.
[11]:
# Initialize an empty DataFrame
data_all = pd.DataFrame()
# Early data (1938 to mid 2013)
stn_id_early = 889
years_early = range(1938, 2014)
# Recent data (mid 2013 to 2019)
stn_id_recent = 51442
years_recent = range(2013, 2020)
# Loop over station IDs and years, using Python's zip function
for stn, years_list in zip([stn_id_early, stn_id_recent], [years_early, years_recent]):
    for year in years_list:
        filename = context.raw_dir / Path(f"weather_daily_{station}_{stn}_{year}.csv")
        data_in = process_data(filename)
        # Use the append method to append the new data
        data_all = data_all.append(data_in)
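A caution if you are running a newer pandas: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. The equivalent (and faster) pattern is to collect the per-year frames in a list and call `pd.concat` once at the end. A sketch, using toy frames standing in for the `process_data` results:

```python
import pandas as pd

# Collect the per-year frames in a list, then concatenate once --
# repeated appends copy the accumulated data every iteration,
# while a single concat copies it only once.
frames = [
    pd.DataFrame({"Year": [2013]}, index=pd.to_datetime(["2013-06-13"])),
    pd.DataFrame({"Year": [2014]}, index=pd.to_datetime(["2014-01-01"])),
]
data_all = pd.concat(frames)
print(data_all.shape)  # (2, 1)
```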
[12]:
data_all.head(2)
[12]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
1938-01-01 | 1938 | 1 | 1 | NaN | 9.4 | NaN | -0.6 | NaN | 4.4 | NaN | ... | NaN | M | 0.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1938-01-02 | 1938 | 1 | 2 | NaN | 7.2 | NaN | 1.7 | NaN | 4.5 | NaN | ... | NaN | M | 0.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 26 columns
[13]:
data_all.tail(2)
[13]:
Year | Month | Day | Data Quality | Max Temp (C) | Max Temp Flag | Min Temp (C) | Min Temp Flag | Mean Temp (C) | Mean Temp Flag | ... | Total Snow (cm) | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2019-03-17 | 2019 | 3 | 17 | NaN | 11.9 | NaN | 1.4 | NaN | 6.7 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | M |
2019-03-18 | 2019 | 3 | 18 | NaN | 14.0 | NaN | 1.2 | NaN | 7.6 | NaN | ... | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | M | NaN | M |
2 rows × 26 columns
[14]:
data_all.shape
[14]:
(29632, 26)
[15]:
# Save the full set of concatenated data to file
year_min, year_max = data_all["Year"].min(), data_all["Year"].max()
savefile = (context.processed_dir
            / Path(f"weather_daily_{station}_{year_min}-{year_max}_all.csv"))
print(f"Saving to {savefile}")
data_all.to_csv(savefile)
Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_daily_YVR_1938-2019_all.csv
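When we want this file back later, note that `to_csv` writes the DatetimeIndex as the first column, so it reads back with the same `index_col=0, parse_dates=True` arguments we used above. A round-trip sketch with toy data (the real file is at the path printed above):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a named DatetimeIndex, mimicking the structure of data_all
df = pd.DataFrame({"Year": [1938]}, index=pd.to_datetime(["1938-01-01"]))
df.index.name = "Date/Time"
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.csv"
    df.to_csv(path)
    # Read it back the same way we read the raw files
    df2 = pd.read_csv(path, index_col=0, parse_dates=True)
print(df2.index[0])  # 1938-01-01 00:00:00
```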
Create a subset of the data
For demos we’ll use a subset of the data columns and rename some of them for convenience.
[16]:
# Extract subset with columns of interest and rename some columns
columns = [
    "Year",
    "Month",
    "Day",
    "Mean Temp (C)",
    "Max Temp (C)",
    "Min Temp (C)",
    "Total Rain (mm)",
    "Total Snow (cm)",
    "Total Precip (mm)",
]
cols_dict = {
    "Mean Temp (C)": "T_mean (C)",
    "Max Temp (C)": "T_high (C)",
    "Min Temp (C)": "T_low (C)",
    "Total Rain (mm)": "Rain (mm)",
    "Total Snow (cm)": "Snow (cm)",
}
data_subset = data_all[columns].rename(columns=cols_dict)
data_subset.index.name = "Date"
data_subset.head()
[16]:
Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
1938-01-01 | 1938 | 1 | 1 | 4.4 | 9.4 | -0.6 | NaN | NaN | 0.3 |
1938-01-02 | 1938 | 1 | 2 | 4.5 | 7.2 | 1.7 | NaN | NaN | 0.5 |
1938-01-03 | 1938 | 1 | 3 | 1.7 | 7.2 | -3.9 | 0.0 | 0.0 | 0.0 |
1938-01-04 | 1938 | 1 | 4 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
1938-01-05 | 1938 | 1 | 5 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
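The select-then-rename pattern above chains nicely; here is the same idea on a one-row toy frame (made-up values):

```python
import pandas as pd

# Select a column subset, then rename in one chain, as in the cell above
df = pd.DataFrame({"Mean Temp (C)": [4.4], "Extra": [0]})
subset = df[["Mean Temp (C)"]].rename(columns={"Mean Temp (C)": "T_mean (C)"})
print(list(subset.columns))  # ['T_mean (C)']
```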
Write the subset out to a csv file
[17]:
savefile2 = context.processed_dir / Path(f"weather_{station}.csv")
print(f"Saving to {savefile2}")
data_subset.to_csv(savefile2)
Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_YVR.csv
[ ]: