Table of Contents
1 Dataframe practice
1.1 Create a simulated dataset
1.2 Convert to a dataframe
1.2.1 Q1: Turn uni_dist into a dataframe called df
1.3 Create a datetime for each row
1.4 Make time_list the index to the dataframe
1.4.1 Add it as a new column
1.4.2 Q2 – set this column as an index
1.4.3 Q3 write a function that rounds a column value to 1 decimal
1.4.4 Q4 Add a new column called “mass_tenths” using coarse_grain
1.4.5 Q5: Group the measurements on mass_tenths
1.4.6 Q6 Round trip
[1]:
import pandas as pd
import numpy as np
import numpy.random as rn
import numpy.testing as nt
import datetime as dt
from datetime import timezone as tz
import pprint
Dataframe practice¶
Learning objectives
Be able to use the apply method to execute a function on every row of a dataframe
Be able to add a column to a data frame and use groupby with that column to group dataframe rows into subset dataframes
Be able to do simple statistics (mean, median, max, min, summary) on dataframes and dataframe series
Be able to construct dataframes from lists of tuples, lists of dictionaries, or numpy arrays using from_records member function
Be able to create and work with python datetime objects
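As a warm-up for the construction objective, `DataFrame.from_records` accepts a list of tuples (or dictionaries); a minimal sketch with made-up values:

```python
import pandas as pd

# each tuple is one row; columns= names the fields
records = [(273.1, 5.0, 0.2), (275.6, -3.2, 0.8)]
demo = pd.DataFrame.from_records(records,
                                 columns=['temperature', 'veloc', 'mass'])
print(demo.shape)  # (2, 3)
```

The same call works with a list of dicts, in which case the column names come from the dict keys.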
Create a simulated dataset¶
This cell creates 3 columns: column 0 is temperature in the range 263-283 K, column 1 is veloc in the range -100 to +100 dm/s, and column 2 is mass in the range 0-1.
[2]:
my_seed = 5
rn.seed(my_seed)
uni_dist = rn.rand(25,3)
uni_dist[:,:2] = (uni_dist[:,:2] - 0.5)*2.
temps = uni_dist[:,0]*10. + 273.
veloc = uni_dist[:,1]*100.
uni_dist[:,0]=temps
uni_dist[:,1]=veloc
print(uni_dist[:5,:])
[[ 2.67439863e+02 7.41464612e+01 2.06719155e-01]
[ 2.81372218e+02 -2.31776224e+00 6.11743863e-01]
[ 2.78318157e+02 3.68359757e+00 2.96800502e-01]
[ 2.66754425e+02 -8.38517462e+01 7.38440296e-01]
[ 2.71826184e+02 -6.83380265e+01 8.79937031e-01]]
Convert to a dataframe¶
with columns=['temperature','veloc','mass']
Q1: Turn uni_dist into a dataframe called df¶
[3]:
#Q1 answer
columns=['temperature','veloc','mass']
df=pd.DataFrame(uni_dist,columns=columns)
df.head()
[3]:
|   | temperature | veloc | mass |
|---|---|---|---|
| 0 | 267.439863 | 74.146461 | 0.206719 |
| 1 | 281.372218 | -2.317762 | 0.611744 |
| 2 | 278.318157 | 3.683598 | 0.296801 |
| 3 | 266.754425 | -83.851746 | 0.738440 |
| 4 | 271.826184 | -68.338026 | 0.879937 |
It should pass this test
[4]:
nt.assert_allclose(df['veloc'][1],-2.3,rtol=0.1)
Create a datetime for each row¶
Assume that each row represents a measurement, and that the measurements happened every 10 minutes starting at 5 am UTC on March 25, 2019
[5]:
start_time=dt.datetime(2019,3,25,5,0,0,tzinfo=tz.utc)
interval=dt.timedelta(minutes=10)
print(f'start and time interval: {start_time} -- {interval}')
time_list=[start_time]
for tick in range(len(df)-1):
    time_list.append(time_list[tick] + interval)
pprint.pprint(time_list[:5])
start and time interval: 2019-03-25 05:00:00+00:00 -- 0:10:00
[datetime.datetime(2019, 3, 25, 5, 0, tzinfo=datetime.timezone.utc),
datetime.datetime(2019, 3, 25, 5, 10, tzinfo=datetime.timezone.utc),
datetime.datetime(2019, 3, 25, 5, 20, tzinfo=datetime.timezone.utc),
datetime.datetime(2019, 3, 25, 5, 30, tzinfo=datetime.timezone.utc),
datetime.datetime(2019, 3, 25, 5, 40, tzinfo=datetime.timezone.utc)]
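The loop above can also be collapsed into a single call to `pd.date_range`, which builds the same evenly spaced timestamps directly (a sketch, not part of the original notebook):

```python
import pandas as pd

# 25 timestamps, 10 minutes apart, starting 2019-03-25 05:00 UTC
times = pd.date_range('2019-03-25 05:00', periods=25, freq='10min', tz='UTC')
print(times[0])   # 2019-03-25 05:00:00+00:00
print(times[-1])  # 2019-03-25 09:00:00+00:00
```

This returns a `DatetimeIndex`, which can be assigned to a dataframe column or used as an index just like `time_list`.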
Make time_list the index to the dataframe¶
Add it as a new column¶
[6]:
df['date']=time_list
Q2 – set this column as an index¶
google pandas.DataFrame.set_index for method arguments
[7]:
#Q2 answer
df['date']=time_list
df.set_index('date',inplace=True)
df.head()
print(df.head().to_latex())
\begin{tabular}{lrrr}
\toprule
{} & temperature & veloc & mass \\
date & & & \\
\midrule
2019-03-25 05:00:00+00:00 & 267.439863 & 74.146461 & 0.206719 \\
2019-03-25 05:10:00+00:00 & 281.372218 & -2.317762 & 0.611744 \\
2019-03-25 05:20:00+00:00 & 278.318157 & 3.683598 & 0.296801 \\
2019-03-25 05:30:00+00:00 & 266.754425 & -83.851746 & 0.738440 \\
2019-03-25 05:40:00+00:00 & 271.826184 & -68.338026 & 0.879937 \\
\bottomrule
\end{tabular}
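Note that `set_index` can also return a new dataframe rather than modifying in place, which is handy for method chaining; an equivalent sketch on a tiny stand-in frame:

```python
import pandas as pd
import datetime as dt
from datetime import timezone as tz

# tiny stand-in for df with a 'date' column
demo = pd.DataFrame({
    'date': [dt.datetime(2019, 3, 25, 5, 0, tzinfo=tz.utc)],
    'temperature': [267.4],
})
demo2 = demo.set_index('date')  # returns a copy; demo keeps its 'date' column
print(demo2.index.name)  # date
```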
Your dataframe should pass this assert.
[8]:
target_row=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)
nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)
Q3 write a function that rounds a column value to 1 decimal¶
i.e. complete this function
def coarse_grain(row,colname):
    #
    # input -- dataframe row called row and string colname
    #
    # returns -- row[colname] rounded to 1 decimal
    return col_val
[9]:
# Q3 answer
def coarse_grain(row,colname):
    col_val = np.around(row[colname],decimals=1)
    return col_val
Your function should pass this assert
[10]:
the_row = df.iloc[2]
the_round = coarse_grain(the_row,'mass')
np.testing.assert_allclose(the_round,0.3)
Q4 Add a new column called “mass_tenths” using coarse_grain¶
i.e. use df.apply to run coarse_grain on the dataframe and add the result as the ‘mass_tenths’ column
[11]:
#Q4 answer
df['mass_tenths'] = df.apply(coarse_grain, args=('mass',), axis=1)
Your dataframe should now pass the following assert
[12]:
nt.assert_allclose(df['mass_tenths'][3],0.7)
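Row-wise `apply` works here, but for a single-column rounding a vectorized `Series.round` gives the same result with less machinery; a sketch on stand-in data:

```python
import pandas as pd

demo = pd.DataFrame({'mass': [0.206719, 0.611744, 0.296801]})
# vectorized: rounds the whole column at once, no per-row function call
demo['mass_tenths'] = demo['mass'].round(1)
print(demo['mass_tenths'].tolist())  # [0.2, 0.6, 0.3]
```

`apply` with `axis=1` is still the right tool when the per-row computation involves several columns at once.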
Q5: Group the measurements on mass_tenths¶
Use groups=df.groupby to create a DataFrameGroupBy object called groups indexed on mass_tenths. Then find the median temperature, speed and mass of each group using the_median=groups.median()
[13]:
#Q5 answer
groups=df.groupby('mass_tenths')
the_median=groups.median()
[14]:
nt.assert_allclose(the_median.loc[0.2]['veloc'],1.49,rtol=0.1)
[15]:
groups.median()
[15]:
| mass_tenths | temperature | veloc | mass |
|---|---|---|---|
| 0.0 | 280.409662 | 84.549923 | 0.002214 |
| 0.1 | 276.975578 | -14.607772 | 0.090364 |
| 0.2 | 274.237067 | 1.489661 | 0.188541 |
| 0.3 | 268.481729 | -17.152996 | 0.296080 |
| 0.4 | 272.389767 | 96.293748 | 0.398945 |
| 0.5 | 270.220968 | -92.898194 | 0.546358 |
| 0.6 | 275.635321 | 3.094522 | 0.599929 |
| 0.7 | 265.120278 | -71.470318 | 0.719142 |
| 0.8 | 279.274650 | -48.180481 | 0.795087 |
| 0.9 | 273.766654 | -2.978774 | 0.903310 |
| 1.0 | 265.229076 | -49.750978 | 0.964915 |
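Beyond `median`, a `DataFrameGroupBy` supports other reducers (`mean`, `min`, `max`, ...) and `agg` to compute several at once; a sketch with toy data:

```python
import pandas as pd

demo = pd.DataFrame({'mass_tenths': [0.2, 0.2, 0.6],
                     'temperature': [267.0, 281.0, 275.0]})
groups = demo.groupby('mass_tenths')
# agg produces one column per statistic, indexed by group key
summary = groups['temperature'].agg(['mean', 'min', 'max'])
print(summary.loc[0.2, 'mean'])  # 274.0
```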
Q6 Round trip¶
Add a cell that fixes the following problem: new_df is not the same as our original df after writing it out to csv.
[16]:
df.to_csv('test.csv')
[20]:
new_df=pd.read_csv('test.csv')
new_df.head()
print(df.head().to_latex())
\begin{tabular}{lrrrr}
\toprule
{} & temperature & veloc & mass & mass\_tenths \\
date & & & & \\
\midrule
2019-03-25 05:00:00+00:00 & 267.439863 & 74.146461 & 0.206719 & 0.2 \\
2019-03-25 05:10:00+00:00 & 281.372218 & -2.317762 & 0.611744 & 0.6 \\
2019-03-25 05:20:00+00:00 & 278.318157 & 3.683598 & 0.296801 & 0.3 \\
2019-03-25 05:30:00+00:00 & 266.754425 & -83.851746 & 0.738440 & 0.7 \\
2019-03-25 05:40:00+00:00 & 271.826184 & -68.338026 & 0.879937 & 0.9 \\
\bottomrule
\end{tabular}
[18]:
#Q6 answer
new_df=pd.read_csv('test.csv',index_col='date')
new_df.head()
[18]:
| date | temperature | veloc | mass | mass_tenths |
|---|---|---|---|---|
| 2019-03-25 05:00:00+00:00 | 267.439863 | 74.146461 | 0.206719 | 0.2 |
| 2019-03-25 05:10:00+00:00 | 281.372218 | -2.317762 | 0.611744 | 0.6 |
| 2019-03-25 05:20:00+00:00 | 278.318157 | 3.683598 | 0.296801 | 0.3 |
| 2019-03-25 05:30:00+00:00 | 266.754425 | -83.851746 | 0.738440 | 0.7 |
| 2019-03-25 05:40:00+00:00 | 271.826184 | -68.338026 | 0.879937 | 0.9 |
[19]:
target_row=dt.datetime(2019,3,25,5,40,0,tzinfo=tz.utc)
nt.assert_allclose(df.loc[target_row]['temperature'],271.8,rtol=1.e-2)
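One caveat with `index_col='date'` alone: the restored index values come back as strings, not datetimes. Passing `parse_dates` to `read_csv` converts them; a self-contained sketch of the full round trip (the filename `demo.csv` is just an illustration):

```python
import pandas as pd

# build a small datetime-indexed frame and round-trip it through csv
demo = pd.DataFrame({'temperature': [267.4, 281.4]},
                    index=pd.date_range('2019-03-25 05:00', periods=2,
                                        freq='10min', tz='UTC'))
demo.index.name = 'date'
demo.to_csv('demo.csv')

# parse_dates turns the string index back into real timestamps
back = pd.read_csv('demo.csv', index_col='date', parse_dates=['date'])
print(type(back.index))
```

With the index parsed back into a `DatetimeIndex`, lookups like `df.loc[target_row]` work on the reloaded frame too.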