{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import numpy.random as rn\n", "import numpy.testing as nt\n", "import datetime as dt\n", "from datetime import timezone as tz\n", "import pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataframe practice\n", "\n", "Learning objectives\n", "\n", "- Be able to use the apply method to execute a function on every row of a dataframe\n", "- Be able to add a column to a data frame and use groupby with that column to \n", " group dataframe rows into subset dataframes\n", "- Be able to do simple statistics (mean, median, max, min, summary) on dataframes and dataframe series\n", "- Be able to construct dataframes from lists of tuples, lists of dictionaries, or numpy arrays using\n", " from_records member function\n", "- Be able to create and work with python datetime objects\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a simulated dataset\n", "\n", "This cell creates 3 columns: column 0 is temperatures\n", "in the range 263-283 K, column 1 is veloc in the range\n", "-100 to +100 dm/s, column 2 is mass in the range " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seed = 5\n", "rn.seed(my_seed)\n", "uni_dist = rn.rand(25,3)\n", "uni_dist[:,:2] = (uni_dist[:,:2] - 0.5)*2.\n", "temps = uni_dist[:,0]*10. + 273.\n", "veloc = uni_dist[:,1]*100.\n", "uni_dist[:,0]=temps\n", "uni_dist[:,1]=veloc\n", "print(uni_dist[:5,:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert to a dataframe\n", "\n", "with columns=['temperature','veloc','mass']\n", "\n", "### Q1: Turn uni_dist into a dataframe called df\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "#Q1 answer\n", "columns=['temperature','veloc','mass']\n", "df=pd.DataFrame(uni_dist,columns=columns)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It should pass this test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nt.assert_allclose(df['veloc'][1],-2.3,rtol=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a datetime for each row\n", "\n", "Assume that each row represents a measurement, and that\n", "the measurements happened every 10 minutes atarting at\n", "5 am UCT on March 25, 2019" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time=dt.datetime(2019,3,25,5,0,0,tzinfo=tz.utc)\n", "interval=dt.timedelta(minutes=10)\n", "print(f'start and time interval: {start_time} -- {interval}')\n", "time_list=[start_time]\n", "for tick in range(len(df)-1):\n", " time_list.append(time_list[tick] + interval)\n", "pprint.pprint(time_list[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make time_list the index to the dataframe\n", "\n", "### Add it as a new column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['date']=time_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2 -- set this column as an index\n", "\n", "google pandas.DataFrame.set_index for method arguments" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "#Q2 answer\n", "df.set_index('date',inplace=True)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your dataframe should pass this assert." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_row=start_time=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)\n", "nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframe grouping\n", "\n", "### Q3 write a function that rounds a column value to 1 decimal\n", "\n", "i.e. complete this function\n", "\n", "```\n", "def coarse_grain(row,colname):\n", " #\n", " # input -- dataframe row called row and string colname\n", " #\n", " # returns -- row[colname] rounded to 1 decimal\n", " return col_val\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 1 }, "outputs": [], "source": [ "# Q3 answer\n", "def coarse_grain(row,colname):\n", " col_val = np.around(row[colname],decimals=1)\n", " return col_val" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your function should pass this assert" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "the_row = df.iloc[2]\n", "the_round = coarse_grain(the_row,'mass')\n", "np.testing.assert_allclose(the_round,0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q4 Add a new column called \"mass_tenths\" using coarse_grain\n", "\n", "i.e. use df.apply to run coarse_grain on the dataframe and add the result\n", "as the 'mass_tenths' column\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "#Q4 answer\n", "df['mass_tenths']=\\\n", " df.apply(coarse_grain,args=('mass',),axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You dataframe should now pass the following assert" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nt.assert_allclose(df['mass_tenths'][3],0.7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q5: Group the measurements on mass_tenths" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Use groups=df.groupby to create a DataFrameGroupBy object\n", "called groups indexed on mass_tenths\n", "\n", "Then find the median temperature, speed and mass of each group\n", "using the the_median=groups.median() method\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Q5 answer\n", "groups=df.groupby('mass_tenths')\n", "the_median=groups.median()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nt.assert_allclose(the_median.loc[0.2]['veloc'],1.49,rtol=0.1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "groups.median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframe input/output\n", "\n", "### Q6 Round trip\n", "\n", "Add a cell that fixes the following problem -- new_df is not\n", "the same as our original df after writing it out to csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.to_csv('test.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_df=pd.read_csv('test.csv')\n", "new_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "#Q6 answer\n", "new_df=pd.read_csv('test.csv',index_col='date')\n", "new_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_row=start_time=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)\n", "nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "all", "notebook_metadata_filter": "all,-language_info" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "nbsphinx": { "execute": "never" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "273.438px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }