Table of Contents

  • 1  Create a simulated dataset

  • 2  Convert to a dataframe

    • 2.1  Q1: Turn uni_dist into a dataframe called df

  • 3  Create a datetime for each row

  • 4  Make time_list the index to the dataframe

    • 4.1  Add it as a new column

    • 4.2  Q2 – set this column as an index

  • 5  Dataframe grouping

    • 5.1  Q3 write a function that rounds a column value to 1 decimal

    • 5.2  Q4 Add a new column called “mass_tenths” using coarse_grain

    • 5.3  Q5: Group the measurements on mass_tenths

  • 6  Dataframe input/output

    • 6.1  Q6 Round trip

[ ]:
import pandas as pd
import numpy as np
import numpy.random as rn
import numpy.testing as nt
import datetime as dt
from datetime import timezone as tz
import pprint

Dataframe practice

Learning objectives

  • Be able to use the apply method to execute a function on every row of a dataframe

  • Be able to add a column to a data frame and use groupby with that column to group dataframe rows into subset dataframes

  • Be able to do simple statistics (mean, median, max, min, summary) on dataframes and dataframe series

  • Be able to construct dataframes from lists of tuples, lists of dictionaries, or numpy arrays using from_records member function

  • Be able to create and work with python datetime objects

Create a simulated dataset

This cell creates 3 columns: column 0 is temperatures in the range 263-283 K, column 1 is veloc in the range -100 to +100 dm/s, column 2 is mass in the range

[ ]:
my_seed = 5
rn.seed(my_seed)
uni_dist = rn.rand(25,3)
uni_dist[:,:2] = (uni_dist[:,:2] - 0.5)*2.
temps = uni_dist[:,0]*10. + 273.
veloc = uni_dist[:,1]*100.
uni_dist[:,0]=temps
uni_dist[:,1]=veloc
print(uni_dist[:5,:])

Convert to a dataframe

with columns=[‘temperature’,‘veloc’,‘mass’]

Q1: Turn uni_dist into a dataframe called df

[ ]:
#Q1 answer
columns=['temperature','veloc','mass']
df=pd.DataFrame(uni_dist,columns=columns)
df.head()

It should pass this test

[ ]:
nt.assert_allclose(df['veloc'][1],-2.3,rtol=0.1)

Create a datetime for each row

Assume that each row represents a measurement, and that the measurements happened every 10 minutes atarting at 5 am UCT on March 25, 2019

[ ]:
start_time=dt.datetime(2019,3,25,5,0,0,tzinfo=tz.utc)
interval=dt.timedelta(minutes=10)
print(f'start and time interval: {start_time} -- {interval}')
time_list=[start_time]
for tick in range(len(df)-1):
    time_list.append(time_list[tick] + interval)
pprint.pprint(time_list[:5])

Make time_list the index to the dataframe

Add it as a new column

[ ]:
df['date']=time_list

Q2 – set this column as an index

google pandas.DataFrame.set_index for method arguments

[ ]:
#Q2 answer
df.set_index('date',inplace=True)
df.head()

Your dataframe should pass this assert.

[ ]:
target_row=start_time=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)
nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)

Dataframe grouping

Q3 write a function that rounds a column value to 1 decimal

i.e. complete this function

def coarse_grain(row,colname):
    #
    # input -- dataframe row called row and string colname
    #
    # returns -- row[colname] rounded to 1 decimal
    return col_val
[ ]:
# Q3 answer
def coarse_grain(row,colname):
    col_val = np.around(row[colname],decimals=1)
    return col_val

Your function should pass this assert

[ ]:
the_row = df.iloc[2]
the_round = coarse_grain(the_row,'mass')
np.testing.assert_allclose(the_round,0.3)

Q4 Add a new column called “mass_tenths” using coarse_grain

i.e. use df.apply to run coarse_grain on the dataframe and add the result as the ‘mass_tenths’ column

[ ]:
#Q4 answer
df['mass_tenths']=\
    df.apply(coarse_grain,args=('mass',),axis=1)

You dataframe should now pass the following assert

[ ]:
nt.assert_allclose(df['mass_tenths'][3],0.7)

Q5: Group the measurements on mass_tenths

Use groups=df.groupby to create a DataFrameGroupBy object called groups indexed on mass_tenths Then find the median temperature, speed and mass of each group using the the_median=groups.median() method
[ ]:
#Q5 answer
groups=df.groupby('mass_tenths')
the_median=groups.median()
[ ]:
nt.assert_allclose(the_median.loc[0.2]['veloc'],1.49,rtol=0.1)
[ ]:
groups.median()

Dataframe input/output

Q6 Round trip

Add a cell that fixes the following problem – new_df is not the same as our original df after writing it out to csv

[ ]:
df.to_csv('test.csv')
[ ]:
new_df=pd.read_csv('test.csv')
new_df.head()
[ ]:
#Q6 answer
new_df=pd.read_csv('test.csv',index_col='date')
new_df.head()
[ ]:
target_row=start_time=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)
nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)