Table of Contents

  • 1  Dataframe practice

    • 1.1  Create a simulated dataset

    • 1.2  Convert to a dataframe

      • 1.2.1  Q1: Turn uni_dist into a dataframe called df

    • 1.3  Create a datetime for each row

    • 1.4  Make time_list the index to the dataframe

      • 1.4.1  Add it as a new column

      • 1.4.2  Q2 – set this column as an index

      • 1.4.3  Q3 write a function that rounds a column value to 1 decimal

      • 1.4.4  Q4 Add a new column called “mass_tenths” using coarse_grain

      • 1.4.5  Q5: Group the measurements on mass_tenths

      • 1.4.6  Q6 Round trip

[1]:
import pandas as pd
import numpy as np
import numpy.random as rn
import numpy.testing as nt
import datetime as dt
from datetime import timezone as tz
import pprint

Dataframe practice

Learning objectives

  • Be able to use the apply method to execute a function on every row of a dataframe

  • Be able to add a column to a data frame and use groupby with that column to group dataframe rows into subset dataframes

  • Be able to do simple statistics (mean, median, max, min, summary) on dataframes and dataframe series

  • Be able to construct dataframes from lists of tuples, lists of dictionaries, or numpy arrays using the from_records member function (see the sketch after this list)

  • Be able to create and work with python datetime objects
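
The exercises below build their dataframe with the pd.DataFrame constructor; for the from_records objective, here is a minimal sketch with made-up records (the values are illustrative only, not part of the exercise):

import pandas as pd

# hypothetical records: the same data expressed as dicts and as tuples
dict_records = [{'temperature': 265.1, 'veloc': 12.0, 'mass': 0.41},
                {'temperature': 272.3, 'veloc': -8.5, 'mass': 0.77}]
tuple_records = [(265.1, 12.0, 0.41), (272.3, -8.5, 0.77)]

# from_records accepts either form; tuples need explicit column names
df_from_dicts = pd.DataFrame.from_records(dict_records)
df_from_tuples = pd.DataFrame.from_records(tuple_records,
                                           columns=['temperature', 'veloc', 'mass'])
print(df_from_dicts)
print(df_from_tuples)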

Create a simulated dataset

This cell creates 3 columns: column 0 is temperature in the range 263-283 K, column 1 is veloc in the range -100 to +100 dm/s, and column 2 is mass in the range 0-1.

[2]:
my_seed = 5
rn.seed(my_seed)
uni_dist = rn.rand(25,3)
uni_dist[:,:2] = (uni_dist[:,:2] - 0.5)*2.
temps = uni_dist[:,0]*10. + 273.
veloc = uni_dist[:,1]*100.
uni_dist[:,0]=temps
uni_dist[:,1]=veloc
print(uni_dist[:5,:])
[[ 2.67439863e+02  7.41464612e+01  2.06719155e-01]
 [ 2.81372218e+02 -2.31776224e+00  6.11743863e-01]
 [ 2.78318157e+02  3.68359757e+00  2.96800502e-01]
 [ 2.66754425e+02 -8.38517462e+01  7.38440296e-01]
 [ 2.71826184e+02 -6.83380265e+01  8.79937031e-01]]
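
As a quick sanity check (not part of the original exercise), the claimed ranges can be confirmed from the column minima and maxima:

# column 0 should lie in 263-283, column 1 in -100 to 100, column 2 in 0-1
print(uni_dist.min(axis=0))
print(uni_dist.max(axis=0))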

Convert to a dataframe

with columns=['temperature','veloc','mass']

Q1: Turn uni_dist into a dataframe called df

[3]:
#Q1 answer
columns=['temperature','veloc','mass']
df=pd.DataFrame(uni_dist,columns=columns)
df.head()
[3]:
temperature veloc mass
0 267.439863 74.146461 0.206719
1 281.372218 -2.317762 0.611744
2 278.318157 3.683598 0.296801
3 266.754425 -83.851746 0.738440
4 271.826184 -68.338026 0.879937

It should pass this test

[4]:
nt.assert_allclose(df['veloc'][1],-2.3,rtol=0.1)

Create a datetime for each row

Assume that each row represents a measurement, and that the measurements happened every 10 minutes starting at 5 am UTC on March 25, 2019.

[5]:
start_time=dt.datetime(2019,3,25,5,0,0,tzinfo=tz.utc)
interval=dt.timedelta(minutes=10)
print(f'start and time interval: {start_time} -- {interval}')
time_list=[start_time]
for tick in range(len(df)-1):
    time_list.append(time_list[tick] + interval)
pprint.pprint(time_list[:5])
start and time interval: 2019-03-25 05:00:00+00:00 -- 0:10:00
[datetime.datetime(2019, 3, 25, 5, 0, tzinfo=datetime.timezone.utc),
 datetime.datetime(2019, 3, 25, 5, 10, tzinfo=datetime.timezone.utc),
 datetime.datetime(2019, 3, 25, 5, 20, tzinfo=datetime.timezone.utc),
 datetime.datetime(2019, 3, 25, 5, 30, tzinfo=datetime.timezone.utc),
 datetime.datetime(2019, 3, 25, 5, 40, tzinfo=datetime.timezone.utc)]
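
The explicit loop above is fine for this exercise; as an aside, pandas can build the same ten-minute sequence directly with date_range (a sketch, not required here):

# equivalent timestamp sequence generated by pandas
time_index = pd.date_range(start='2019-03-25 05:00', periods=len(df),
                           freq='10min', tz='UTC')
print(time_index[:5])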

Make time_list the index to the dataframe

Add it as a new column

[6]:
df['date']=time_list

Q2 – set this column as an index

Google pandas.DataFrame.set_index for the method arguments.

[7]:
#Q2 answer
df['date']=time_list
df.set_index('date',inplace=True)
df.head()
print(df.head().to_latex())
\begin{tabular}{lrrr}
\toprule
{} &  temperature &      veloc &      mass \\
date                      &              &            &           \\
\midrule
2019-03-25 05:00:00+00:00 &   267.439863 &  74.146461 &  0.206719 \\
2019-03-25 05:10:00+00:00 &   281.372218 &  -2.317762 &  0.611744 \\
2019-03-25 05:20:00+00:00 &   278.318157 &   3.683598 &  0.296801 \\
2019-03-25 05:30:00+00:00 &   266.754425 & -83.851746 &  0.738440 \\
2019-03-25 05:40:00+00:00 &   271.826184 & -68.338026 &  0.879937 \\
\bottomrule
\end{tabular}

Your dataframe should pass this assert.

[8]:
target_row=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)
nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)

Q3 write a function that rounds a column value to 1 decimal

i.e. complete this function

def coarse_grain(row,colname):
    #
    # input -- dataframe row called row and string colname
    #
    # returns -- row[colname] rounded to 1 decimal
    return col_val
[9]:
# Q3 answer
def coarse_grain(row,colname):
    col_val = np.around(row[colname],decimals=1)
    return col_val

Your function should pass this assert

[10]:
the_row = df.iloc[2]
the_round = coarse_grain(the_row,'mass')
np.testing.assert_allclose(the_round,0.3)

Q4 Add a new column called “mass_tenths” using coarse_grain

i.e. use df.apply to run coarse_grain on the dataframe and add the result as the ‘mass_tenths’ column

[11]:
#Q4 answer
df['mass_tenths']=\
    df.apply(coarse_grain,args=('mass',),axis=1)

Your dataframe should now pass the following assert

[12]:
nt.assert_allclose(df['mass_tenths'][3],0.7)
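
Q4 specifically asks for apply, but rounding is also available as a vectorized Series operation; a sketch of the equivalent one-liner, which should produce the same column:

# vectorized equivalent of the apply call above
mass_tenths_fast = df['mass'].round(1)
nt.assert_allclose(mass_tenths_fast.values, df['mass_tenths'].values)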

Q5: Group the measurements on mass_tenths

Use groups=df.groupby to create a DataFrameGroupBy object called groups, grouped on mass_tenths. Then find the median temperature, velocity, and mass of each group using the_median=groups.median().
[13]:
#Q5 answer
groups=df.groupby('mass_tenths')
the_median=groups.median()
[14]:
nt.assert_allclose(the_median.loc[0.2]['veloc'],1.49,rtol=0.1)
[15]:
groups.median()
[15]:
temperature veloc mass
mass_tenths
0.0 280.409662 84.549923 0.002214
0.1 276.975578 -14.607772 0.090364
0.2 274.237067 1.489661 0.188541
0.3 268.481729 -17.152996 0.296080
0.4 272.389767 96.293748 0.398945
0.5 270.220968 -92.898194 0.546358
0.6 275.635321 3.094522 0.599929
0.7 265.120278 -71.470318 0.719142
0.8 279.274650 -48.180481 0.795087
0.9 273.766654 -2.978774 0.903310
1.0 265.229076 -49.750978 0.964915
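
The same DataFrameGroupBy object also supports the other simple statistics mentioned in the learning objectives; a short sketch:

# other per-group statistics on the same groupby object
print(groups.mean())
print(groups['temperature'].max())
print(groups['veloc'].min())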

Q6 Round trip

Add a cell that fixes the following problem: after writing df out to csv and reading it back in, new_df is not the same as our original df because the date index is not restored

[16]:
df.to_csv('test.csv')
[20]:
new_df=pd.read_csv('test.csv')
new_df.head()
print(df.head().to_latex())
\begin{tabular}{lrrrr}
\toprule
{} &  temperature &      veloc &      mass &  mass\_tenths \\
date                      &              &            &           &              \\
\midrule
2019-03-25 05:00:00+00:00 &   267.439863 &  74.146461 &  0.206719 &          0.2 \\
2019-03-25 05:10:00+00:00 &   281.372218 &  -2.317762 &  0.611744 &          0.6 \\
2019-03-25 05:20:00+00:00 &   278.318157 &   3.683598 &  0.296801 &          0.3 \\
2019-03-25 05:30:00+00:00 &   266.754425 & -83.851746 &  0.738440 &          0.7 \\
2019-03-25 05:40:00+00:00 &   271.826184 & -68.338026 &  0.879937 &          0.9 \\
\bottomrule
\end{tabular}

[18]:
#Q6 answer
new_df=pd.read_csv('test.csv',index_col='date')
new_df.head()
[18]:
temperature veloc mass mass_tenths
date
2019-03-25 05:00:00+00:00 267.439863 74.146461 0.206719 0.2
2019-03-25 05:10:00+00:00 281.372218 -2.317762 0.611744 0.6
2019-03-25 05:20:00+00:00 278.318157 3.683598 0.296801 0.3
2019-03-25 05:30:00+00:00 266.754425 -83.851746 0.738440 0.7
2019-03-25 05:40:00+00:00 271.826184 -68.338026 0.879937 0.9
[19]:
target_row=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)
nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)
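
Note that with index_col alone the dates come back as plain strings, so new_df.loc with a datetime object would still fail. If you also want a true datetime index on the re-read frame, read_csv can parse the dates on the way in; a sketch using the same test.csv:

# re-read with parse_dates so the index is datetime64 rather than strings
new_df = pd.read_csv('test.csv', index_col='date', parse_dates=True)
print(new_df.index.dtype)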