{ "cells": [ { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "# Process Daily Data\n", "\n", "Concatenate the raw data csv files into a single dataframe for all years,\n", "doing some quality control. We'll be focussing on the YVR airport station.\n", "\n", "* Start with a single year to develop the code\n", "\n", "* Write a function so" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "in context.py, setting root_dir to /Users/phil/repos/eosc213_students/notebooks/pandas\n", "******************************\n", "context imported. Front of path:\n", "/Users/phil/repos/eosc213_students/notebooks/pandas\n", "/Users/phil/mini37/lib/python36.zip\n", "******************************\n", "\n" ] } ], "source": [ "import context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test Calcs (Single Year)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "station = \"YVR\"\n", "stn_id = 51442\n", "year = 2013\n", "datafile = context.raw_dir / Path(f\"weather_daily_{station}_{stn_id}_{year}.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMonthDayData QualityMax Temp (°C)Max Temp FlagMin Temp (°C)Min Temp FlagMean Temp (°C)Mean Temp Flag...Total Snow (cm)Total Snow FlagTotal Precip (mm)Total Precip FlagSnow on Grnd (cm)Snow on Grnd FlagDir of Max Gust (10s deg)Dir of Max Gust FlagSpd of Max Gust (km/h)Spd of Max Gust Flag
Date/Time
2013-01-01201311NaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2013-01-02201312NaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2013-01-03201313NaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2013-01-04201314NaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2013-01-05201315NaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 26 columns

\n", "
" ], "text/plain": [ " Year Month Day Data Quality Max Temp (°C) Max Temp Flag \\\n", "Date/Time \n", "2013-01-01 2013 1 1 NaN NaN NaN \n", "2013-01-02 2013 1 2 NaN NaN NaN \n", "2013-01-03 2013 1 3 NaN NaN NaN \n", "2013-01-04 2013 1 4 NaN NaN NaN \n", "2013-01-05 2013 1 5 NaN NaN NaN \n", "\n", " Min Temp (°C) Min Temp Flag Mean Temp (°C) Mean Temp Flag ... \\\n", "Date/Time ... \n", "2013-01-01 NaN NaN NaN NaN ... \n", "2013-01-02 NaN NaN NaN NaN ... \n", "2013-01-03 NaN NaN NaN NaN ... \n", "2013-01-04 NaN NaN NaN NaN ... \n", "2013-01-05 NaN NaN NaN NaN ... \n", "\n", " Total Snow (cm) Total Snow Flag Total Precip (mm) \\\n", "Date/Time \n", "2013-01-01 NaN NaN NaN \n", "2013-01-02 NaN NaN NaN \n", "2013-01-03 NaN NaN NaN \n", "2013-01-04 NaN NaN NaN \n", "2013-01-05 NaN NaN NaN \n", "\n", " Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag \\\n", "Date/Time \n", "2013-01-01 NaN NaN NaN \n", "2013-01-02 NaN NaN NaN \n", "2013-01-03 NaN NaN NaN \n", "2013-01-04 NaN NaN NaN \n", "2013-01-05 NaN NaN NaN \n", "\n", " Dir of Max Gust (10s deg) Dir of Max Gust Flag \\\n", "Date/Time \n", "2013-01-01 NaN NaN \n", "2013-01-02 NaN NaN \n", "2013-01-03 NaN NaN \n", "2013-01-04 NaN NaN \n", "2013-01-05 NaN NaN \n", "\n", " Spd of Max Gust (km/h) Spd of Max Gust Flag \n", "Date/Time \n", "2013-01-01 NaN NaN \n", "2013-01-02 NaN NaN \n", "2013-01-03 NaN NaN \n", "2013-01-04 NaN NaN \n", "2013-01-05 NaN NaN \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_in = pd.read_csv(datafile, skiprows=24, index_col=0, parse_dates=True)\n", "df_in.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some years, such as 2013, have a chunk of missing data at the beginning or end because the data is split into two different files where the Environment Canada station ID changed partway through the year.\n", "\n", "To make sure these rows don't mess things up for concatenation later, we will simply discard any rows where all the measurements are missing (the measurements go from column `'Data Quality'` to the final column).\n", "\n", "Note that we are specifying the 'Data Quality' column by its name, not its column number" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date/Time\n", "2013-01-01 False\n", "2013-01-02 False\n", "2013-01-03 False\n", "2013-01-04 False\n", "2013-01-05 False\n", "dtype: bool" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a series which is True for any row that has at least one non-null measurement\n", "# and False otherwise\n", "\n", "not_null = df_in.loc[:, \"Data Quality\":].notnull().any(axis=1)\n", "not_null.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMonthDayData QualityMax Temp (°C)Max Temp FlagMin Temp (°C)Min Temp FlagMean Temp (°C)Mean Temp Flag...Total Snow (cm)Total Snow FlagTotal Precip (mm)Total Precip FlagSnow on Grnd (cm)Snow on Grnd FlagDir of Max Gust (10s deg)Dir of Max Gust FlagSpd of Max Gust (km/h)Spd of Max Gust Flag
Date/Time
2013-06-132013613NaN19.3NaN8.9NaN14.1NaN...0.0NaN0.0NaNNaNNaN20.0NaN33NaN
2013-06-142013614NaN18.3NaN11.5NaN14.9NaN...0.0NaN0.4NaNNaNNaN12.0NaN35NaN
2013-06-152013615NaN19.9NaN7.8NaN13.9NaN...0.0NaN0.0NaNNaNNaNNaNNaN<31NaN
2013-06-162013616NaN22.4NaN13.2NaN17.8NaN...0.0NaN0.0NaNNaNNaN20.0NaN39NaN
2013-06-172013617NaN22.8NaN14.1NaN18.5NaN...0.0NaN0.0NaNNaNNaN21.0NaN44NaN
\n", "

5 rows × 26 columns

\n", "
" ], "text/plain": [ " Year Month Day Data Quality Max Temp (°C) Max Temp Flag \\\n", "Date/Time \n", "2013-06-13 2013 6 13 NaN 19.3 NaN \n", "2013-06-14 2013 6 14 NaN 18.3 NaN \n", "2013-06-15 2013 6 15 NaN 19.9 NaN \n", "2013-06-16 2013 6 16 NaN 22.4 NaN \n", "2013-06-17 2013 6 17 NaN 22.8 NaN \n", "\n", " Min Temp (°C) Min Temp Flag Mean Temp (°C) Mean Temp Flag ... \\\n", "Date/Time ... \n", "2013-06-13 8.9 NaN 14.1 NaN ... \n", "2013-06-14 11.5 NaN 14.9 NaN ... \n", "2013-06-15 7.8 NaN 13.9 NaN ... \n", "2013-06-16 13.2 NaN 17.8 NaN ... \n", "2013-06-17 14.1 NaN 18.5 NaN ... \n", "\n", " Total Snow (cm) Total Snow Flag Total Precip (mm) \\\n", "Date/Time \n", "2013-06-13 0.0 NaN 0.0 \n", "2013-06-14 0.0 NaN 0.4 \n", "2013-06-15 0.0 NaN 0.0 \n", "2013-06-16 0.0 NaN 0.0 \n", "2013-06-17 0.0 NaN 0.0 \n", "\n", " Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag \\\n", "Date/Time \n", "2013-06-13 NaN NaN NaN \n", "2013-06-14 NaN NaN NaN \n", "2013-06-15 NaN NaN NaN \n", "2013-06-16 NaN NaN NaN \n", "2013-06-17 NaN NaN NaN \n", "\n", " Dir of Max Gust (10s deg) Dir of Max Gust Flag \\\n", "Date/Time \n", "2013-06-13 20.0 NaN \n", "2013-06-14 12.0 NaN \n", "2013-06-15 NaN NaN \n", "2013-06-16 20.0 NaN \n", "2013-06-17 21.0 NaN \n", "\n", " Spd of Max Gust (km/h) Spd of Max Gust Flag \n", "Date/Time \n", "2013-06-13 33 NaN \n", "2013-06-14 35 NaN \n", "2013-06-15 <31 NaN \n", "2013-06-16 39 NaN \n", "2013-06-17 44 NaN \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract the subset of df_in where the not_null Series is True\n", "data = df_in[not_null]\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make life easier, let's also remove the degree symbol from the column names, using the string method `replace` and a [list comprehension](https://jakevdp.github.io/WhirlwindTourOfPython/11-list-comprehensions.html)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Year',\n", " 'Month',\n", " 'Day',\n", " 'Data Quality',\n", " 'Max Temp (C)',\n", " 'Max Temp Flag',\n", " 'Min Temp (C)',\n", " 'Min Temp Flag',\n", " 'Mean Temp (C)',\n", " 'Mean Temp Flag',\n", " 'Heat Deg Days (C)',\n", " 'Heat Deg Days Flag',\n", " 'Cool Deg Days (C)',\n", " 'Cool Deg Days Flag',\n", " 'Total Rain (mm)',\n", " 'Total Rain Flag',\n", " 'Total Snow (cm)',\n", " 'Total Snow Flag',\n", " 'Total Precip (mm)',\n", " 'Total Precip Flag',\n", " 'Snow on Grnd (cm)',\n", " 'Snow on Grnd Flag',\n", " 'Dir of Max Gust (10s deg)',\n", " 'Dir of Max Gust Flag',\n", " 'Spd of Max Gust (km/h)',\n", " 'Spd of Max Gust Flag']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create list of column names with degree symbols removed\n", "columns = [nm.replace(\"\\xb0\", \"\") for nm in data.columns]\n", "columns" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMonthDayData QualityMax Temp (C)Max Temp FlagMin Temp (C)Min Temp FlagMean Temp (C)Mean Temp Flag...Total Snow (cm)Total Snow FlagTotal Precip (mm)Total Precip FlagSnow on Grnd (cm)Snow on Grnd FlagDir of Max Gust (10s deg)Dir of Max Gust FlagSpd of Max Gust (km/h)Spd of Max Gust Flag
Date/Time
2013-06-132013613NaN19.3NaN8.9NaN14.1NaN...0.0NaN0.0NaNNaNNaN20.0NaN33NaN
2013-06-142013614NaN18.3NaN11.5NaN14.9NaN...0.0NaN0.4NaNNaNNaN12.0NaN35NaN
2013-06-152013615NaN19.9NaN7.8NaN13.9NaN...0.0NaN0.0NaNNaNNaNNaNNaN<31NaN
2013-06-162013616NaN22.4NaN13.2NaN17.8NaN...0.0NaN0.0NaNNaNNaN20.0NaN39NaN
2013-06-172013617NaN22.8NaN14.1NaN18.5NaN...0.0NaN0.0NaNNaNNaN21.0NaN44NaN
\n", "

5 rows × 26 columns

\n", "
" ], "text/plain": [ " Year Month Day Data Quality Max Temp (C) Max Temp Flag \\\n", "Date/Time \n", "2013-06-13 2013 6 13 NaN 19.3 NaN \n", "2013-06-14 2013 6 14 NaN 18.3 NaN \n", "2013-06-15 2013 6 15 NaN 19.9 NaN \n", "2013-06-16 2013 6 16 NaN 22.4 NaN \n", "2013-06-17 2013 6 17 NaN 22.8 NaN \n", "\n", " Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... \\\n", "Date/Time ... \n", "2013-06-13 8.9 NaN 14.1 NaN ... \n", "2013-06-14 11.5 NaN 14.9 NaN ... \n", "2013-06-15 7.8 NaN 13.9 NaN ... \n", "2013-06-16 13.2 NaN 17.8 NaN ... \n", "2013-06-17 14.1 NaN 18.5 NaN ... \n", "\n", " Total Snow (cm) Total Snow Flag Total Precip (mm) \\\n", "Date/Time \n", "2013-06-13 0.0 NaN 0.0 \n", "2013-06-14 0.0 NaN 0.4 \n", "2013-06-15 0.0 NaN 0.0 \n", "2013-06-16 0.0 NaN 0.0 \n", "2013-06-17 0.0 NaN 0.0 \n", "\n", " Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag \\\n", "Date/Time \n", "2013-06-13 NaN NaN NaN \n", "2013-06-14 NaN NaN NaN \n", "2013-06-15 NaN NaN NaN \n", "2013-06-16 NaN NaN NaN \n", "2013-06-17 NaN NaN NaN \n", "\n", " Dir of Max Gust (10s deg) Dir of Max Gust Flag \\\n", "Date/Time \n", "2013-06-13 20.0 NaN \n", "2013-06-14 12.0 NaN \n", "2013-06-15 NaN NaN \n", "2013-06-16 20.0 NaN \n", "2013-06-17 21.0 NaN \n", "\n", " Spd of Max Gust (km/h) Spd of Max Gust Flag \n", "Date/Time \n", "2013-06-13 33 NaN \n", "2013-06-14 35 NaN \n", "2013-06-15 <31 NaN \n", "2013-06-16 39 NaN \n", "2013-06-17 44 NaN \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Update the column names in the DataFrame\n", "data.columns = columns\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Process All Years\n", "\n", "Let's consolidate the above code into a function and then use it to process each year and concatenate the years into a single DataFrame." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def process_data(datafile, skiprows=24):\n", " \"\"\"Process data for a single year.\"\"\"\n", " df_in = pd.read_csv(datafile, skiprows=24, index_col=0, parse_dates=True)\n", "\n", " # Create a series which is True for any row that has at least one\n", " # non-null measurement and False otherwise\n", " not_null = df_in.loc[:, \"Data Quality\":].notnull().any(axis=1)\n", "\n", " # Extract the subset of df_in where the not_null Series is True\n", " data = df_in[not_null]\n", "\n", " # Create list of column names with degree symbols removed\n", " columns = [nm.replace(\"\\xb0\", \"\") for nm in data.columns]\n", "\n", " # Update the column names in the DataFrame\n", " data.columns = columns\n", "\n", " return data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMonthDayData QualityMax Temp (C)Max Temp FlagMin Temp (C)Min Temp FlagMean Temp (C)Mean Temp Flag...Total Snow (cm)Total Snow FlagTotal Precip (mm)Total Precip FlagSnow on Grnd (cm)Snow on Grnd FlagDir of Max Gust (10s deg)Dir of Max Gust FlagSpd of Max Gust (km/h)Spd of Max Gust Flag
Date/Time
2013-06-132013613NaN19.3NaN8.9NaN14.1NaN...0.0NaN0.0NaNNaNNaN20.0NaN33NaN
2013-06-142013614NaN18.3NaN11.5NaN14.9NaN...0.0NaN0.4NaNNaNNaN12.0NaN35NaN
2013-06-152013615NaN19.9NaN7.8NaN13.9NaN...0.0NaN0.0NaNNaNNaNNaNNaN<31NaN
2013-06-162013616NaN22.4NaN13.2NaN17.8NaN...0.0NaN0.0NaNNaNNaN20.0NaN39NaN
2013-06-172013617NaN22.8NaN14.1NaN18.5NaN...0.0NaN0.0NaNNaNNaN21.0NaN44NaN
\n", "

5 rows × 26 columns

\n", "
" ], "text/plain": [ " Year Month Day Data Quality Max Temp (C) Max Temp Flag \\\n", "Date/Time \n", "2013-06-13 2013 6 13 NaN 19.3 NaN \n", "2013-06-14 2013 6 14 NaN 18.3 NaN \n", "2013-06-15 2013 6 15 NaN 19.9 NaN \n", "2013-06-16 2013 6 16 NaN 22.4 NaN \n", "2013-06-17 2013 6 17 NaN 22.8 NaN \n", "\n", " Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... \\\n", "Date/Time ... \n", "2013-06-13 8.9 NaN 14.1 NaN ... \n", "2013-06-14 11.5 NaN 14.9 NaN ... \n", "2013-06-15 7.8 NaN 13.9 NaN ... \n", "2013-06-16 13.2 NaN 17.8 NaN ... \n", "2013-06-17 14.1 NaN 18.5 NaN ... \n", "\n", " Total Snow (cm) Total Snow Flag Total Precip (mm) \\\n", "Date/Time \n", "2013-06-13 0.0 NaN 0.0 \n", "2013-06-14 0.0 NaN 0.4 \n", "2013-06-15 0.0 NaN 0.0 \n", "2013-06-16 0.0 NaN 0.0 \n", "2013-06-17 0.0 NaN 0.0 \n", "\n", " Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag \\\n", "Date/Time \n", "2013-06-13 NaN NaN NaN \n", "2013-06-14 NaN NaN NaN \n", "2013-06-15 NaN NaN NaN \n", "2013-06-16 NaN NaN NaN \n", "2013-06-17 NaN NaN NaN \n", "\n", " Dir of Max Gust (10s deg) Dir of Max Gust Flag \\\n", "Date/Time \n", "2013-06-13 20.0 NaN \n", "2013-06-14 12.0 NaN \n", "2013-06-15 NaN NaN \n", "2013-06-16 20.0 NaN \n", "2013-06-17 21.0 NaN \n", "\n", " Spd of Max Gust (km/h) Spd of Max Gust Flag \n", "Date/Time \n", "2013-06-13 33 NaN \n", "2013-06-14 35 NaN \n", "2013-06-15 <31 NaN \n", "2013-06-16 39 NaN \n", "2013-06-17 44 NaN \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Test the function on the data file from above\n", "test = process_data(datafile)\n", "test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- First, concatenate all the data from the first set of years.\n", " - *Note: The YVR data from 1937 is a bit wonky, so we'll exclude it here.*\n", "- Then, concatenate the data from recent years" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Initialize an empty DataFrame\n", "data_all = pd.DataFrame()\n", "\n", "# Early data (1938 to mid 2013)\n", "stn_id_early = 889\n", "years_early = range(1938, 2014)\n", "\n", "# Recent data (mid 2013 to 2017)\n", "stn_id_recent = 51442\n", "years_recent = range(2013, 2020)\n", "\n", "# Loop over station IDs and years, using Python's zip function\n", "for stn, years_list in zip([stn_id_early, stn_id_recent], [years_early, years_recent]):\n", " for year in years_list:\n", " filename = context.raw_dir / Path(f\"weather_daily_{station}_{stn}_{year}.csv\")\n", " data_in = process_data(filename)\n", "\n", " # Use the append method to append the new data\n", " data_all = data_all.append(data_in)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMonthDayData QualityMax Temp (C)Max Temp FlagMin Temp (C)Min Temp FlagMean Temp (C)Mean Temp Flag...Total Snow (cm)Total Snow FlagTotal Precip (mm)Total Precip FlagSnow on Grnd (cm)Snow on Grnd FlagDir of Max Gust (10s deg)Dir of Max Gust FlagSpd of Max Gust (km/h)Spd of Max Gust Flag
Date/Time
1938-01-01193811NaN9.4NaN-0.6NaN4.4NaN...NaNM0.3NaNNaNNaNNaNNaNNaNNaN
1938-01-02193812NaN7.2NaN1.7NaN4.5NaN...NaNM0.5NaNNaNNaNNaNNaNNaNNaN
\n", "

2 rows × 26 columns

\n", "
" ], "text/plain": [ " Year Month Day Data Quality Max Temp (C) Max Temp Flag \\\n", "Date/Time \n", "1938-01-01 1938 1 1 NaN 9.4 NaN \n", "1938-01-02 1938 1 2 NaN 7.2 NaN \n", "\n", " Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... \\\n", "Date/Time ... \n", "1938-01-01 -0.6 NaN 4.4 NaN ... \n", "1938-01-02 1.7 NaN 4.5 NaN ... \n", "\n", " Total Snow (cm) Total Snow Flag Total Precip (mm) \\\n", "Date/Time \n", "1938-01-01 NaN M 0.3 \n", "1938-01-02 NaN M 0.5 \n", "\n", " Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag \\\n", "Date/Time \n", "1938-01-01 NaN NaN NaN \n", "1938-01-02 NaN NaN NaN \n", "\n", " Dir of Max Gust (10s deg) Dir of Max Gust Flag \\\n", "Date/Time \n", "1938-01-01 NaN NaN \n", "1938-01-02 NaN NaN \n", "\n", " Spd of Max Gust (km/h) Spd of Max Gust Flag \n", "Date/Time \n", "1938-01-01 NaN NaN \n", "1938-01-02 NaN NaN \n", "\n", "[2 rows x 26 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_all.head(2)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMonthDayData QualityMax Temp (C)Max Temp FlagMin Temp (C)Min Temp FlagMean Temp (C)Mean Temp Flag...Total Snow (cm)Total Snow FlagTotal Precip (mm)Total Precip FlagSnow on Grnd (cm)Snow on Grnd FlagDir of Max Gust (10s deg)Dir of Max Gust FlagSpd of Max Gust (km/h)Spd of Max Gust Flag
Date/Time
2019-03-172019317NaN11.9NaN1.4NaN6.7NaN...0.0NaN0.0NaNNaNNaNNaNNaNNaNM
2019-03-182019318NaN14.0NaN1.2NaN7.6NaN...0.0NaN0.0NaNNaNNaNNaNMNaNM
\n", "

2 rows × 26 columns

\n", "
" ], "text/plain": [ " Year Month Day Data Quality Max Temp (C) Max Temp Flag \\\n", "Date/Time \n", "2019-03-17 2019 3 17 NaN 11.9 NaN \n", "2019-03-18 2019 3 18 NaN 14.0 NaN \n", "\n", " Min Temp (C) Min Temp Flag Mean Temp (C) Mean Temp Flag ... \\\n", "Date/Time ... \n", "2019-03-17 1.4 NaN 6.7 NaN ... \n", "2019-03-18 1.2 NaN 7.6 NaN ... \n", "\n", " Total Snow (cm) Total Snow Flag Total Precip (mm) \\\n", "Date/Time \n", "2019-03-17 0.0 NaN 0.0 \n", "2019-03-18 0.0 NaN 0.0 \n", "\n", " Total Precip Flag Snow on Grnd (cm) Snow on Grnd Flag \\\n", "Date/Time \n", "2019-03-17 NaN NaN NaN \n", "2019-03-18 NaN NaN NaN \n", "\n", " Dir of Max Gust (10s deg) Dir of Max Gust Flag \\\n", "Date/Time \n", "2019-03-17 NaN NaN \n", "2019-03-18 NaN M \n", "\n", " Spd of Max Gust (km/h) Spd of Max Gust Flag \n", "Date/Time \n", "2019-03-17 NaN M \n", "2019-03-18 NaN M \n", "\n", "[2 rows x 26 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_all.tail(2)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(29632, 26)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_all.shape" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_daily_YVR_1938-2019_all.csv\n" ] } ], "source": [ "# Save the full set of concatenated data to file\n", "year_min, year_max = data_all[\"Year\"].min(), data_all[\"Year\"].max()\n", "savefile = (context.processed_dir \n", " / Path(f\"weather_daily_{station}_{year_min}-{year_max}_all.csv\"))\n", "print(f\"Saving to {savefile}\")\n", "data_all.to_csv(savefile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a subset of the data\n", "\n", "For demos we'll use a subset of the data columns and rename some of them for convenience." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMonthDayT_mean (C)T_high (C)T_low (C)Rain (mm)Snow (cm)Total Precip (mm)
Date
1938-01-011938114.49.4-0.6NaNNaN0.3
1938-01-021938124.57.21.7NaNNaN0.5
1938-01-031938131.77.2-3.90.00.00.0
1938-01-041938142.27.2-2.80.00.00.0
1938-01-051938152.27.2-2.80.00.00.0
\n", "
" ], "text/plain": [ " Year Month Day T_mean (C) T_high (C) T_low (C) Rain (mm) \\\n", "Date \n", "1938-01-01 1938 1 1 4.4 9.4 -0.6 NaN \n", "1938-01-02 1938 1 2 4.5 7.2 1.7 NaN \n", "1938-01-03 1938 1 3 1.7 7.2 -3.9 0.0 \n", "1938-01-04 1938 1 4 2.2 7.2 -2.8 0.0 \n", "1938-01-05 1938 1 5 2.2 7.2 -2.8 0.0 \n", "\n", " Snow (cm) Total Precip (mm) \n", "Date \n", "1938-01-01 NaN 0.3 \n", "1938-01-02 NaN 0.5 \n", "1938-01-03 0.0 0.0 \n", "1938-01-04 0.0 0.0 \n", "1938-01-05 0.0 0.0 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract subset with columns of interest and rename some columns\n", "\n", "columns = [\n", " \"Year\",\n", " \"Month\",\n", " \"Day\",\n", " \"Mean Temp (C)\",\n", " \"Max Temp (C)\",\n", " \"Min Temp (C)\",\n", " \"Total Rain (mm)\",\n", " \"Total Snow (cm)\",\n", " \"Total Precip (mm)\",\n", "]\n", "\n", "cols_dict = {\n", " \"Mean Temp (C)\": \"T_mean (C)\",\n", " \"Max Temp (C)\": \"T_high (C)\",\n", " \"Min Temp (C)\": \"T_low (C)\",\n", " \"Total Rain (mm)\": \"Rain (mm)\",\n", " \"Total Snow (cm)\": \"Snow (cm)\",\n", "}\n", "data_subset = data_all[columns].rename(columns=cols_dict)\n", "data_subset.index.name = \"Date\"\n", "data_subset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write the subset out to a csv file" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saving to /Users/phil/repos/eosc213_students/notebooks/pandas/data/processed/weather_YVR.csv\n" ] } ], "source": [ "savefile2 = context.processed_dir / Path(f\"weather_{station}.csv\")\n", "print(f\"Saving to {savefile2}\")\n", "data_subset.to_csv(savefile2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "jupytext": { "cell_metadata_filter": "all", "notebook_metadata_filter": "-language_info", "text_representation": { "extension": ".py", "format_name": "percent", "format_version": "1.2", "jupytext_version": "1.0.3" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "nbsphinx": { "execute": "never" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }