Table of Contents
1 Learning objectives
2 Folder setup
2.1 Subfolder creation
3 Pandas dataframes vs. numpy arrays
4 Intro to Pandas
5 Why Pandas?
6 Reading a CSV file
6.1 The pathlib module
7 Data at a Glance
7.1 Number of rows and columns:
7.2 Simple Summary Statistics
8 Exercise
8.1 Data Overview
8.2 Bonus exercises
Reading & Summarizing CSV Data¶
Credit: the notebooks in this folder are lightly modified versions of work by Jennifer Walker presented at the EOAS python workshop in October, 2018: https://github.com/jenfly/eoas-python
[1]:
from pathlib import Path
import pandas
Learning objectives¶
Learn how to read, process and write data in csv/xlsx/tabular format using pandas
In parts 2 and parts 3:
Learn how to download tabular data from websites with a “restful api”:
Learn how to clean data by filtering missing values, renaming columns, and writing out processed files for further work
Folder setup¶
As our project grows more complicated, it’s good to have a central module that keeps track of important files and sets your scripts up so that they can import functions and classes from you modules. If you were planning to distribute your project using conda, then you would need to write an installation script, which is a fair amount of work. At this stage, it’s easier and more flexible to store that information in a file that travels along with your notebook. We set the “context” for this notebook by importing: context_pandas1.py
[2]:
import context_pandas1
in context_pandas1.py, setting root_dir to
/Users/phil/repos/eosc213_students/notebooks/pandas
setting data_dir to
/Users/phil/repos/eosc213_students/notebooks/pandas/data
Subfolder creation¶
Subsequent notebooks are going to need a place to put raw and processed data. We create those folders in the cell below
[3]:
processed_dir = context_pandas1.data_dir / "processed"
raw_dir = context_pandas1.data_dir / "raw"
processed_dir.mkdir(parents=True, exist_ok=True)
raw_dir.mkdir(parents=True, exist_ok=True)
Pandas dataframes vs. numpy arrays¶
Dataframes are column oriented, arrays are row oriented
Array items are all of the same dtype (i.e. numpy.float32), dataframe columns can have different types (e.g.strings vs. integers)
Dataframe columns can be indexed by name (e.g. “Total area of basin”) or by integer index
Dataframe rows can be indexed by number of by a special index (e.g. postal code)
Dataframe objects have dozens of methods to summarize and manipulate the data they hold, making them similar in features to a lightweight relational database.
Intro to Pandas¶
pandas
= Python Data Analysis LibraryBest book: Python for data analysis by Wes McKinney
Jennifer Walker’s Pandas cheatsheet
Library for working with labelled tabular data (1-D and 2-D)
Data formats include: comma separated values (CSV) and other text files, Excel spreadsheets, HDF5, and others
With
pandas
you can do pretty much everything you would in a spreadsheet, plus a whole lot more!If you’re working with higher dimensional data and/or netCDF files, check out the excellent xarray library, which brings the labelled data power of
pandas
to N-dimensional arrays
Why Pandas?¶
Working with large data files and complex calculations
Dealing with messy and missing data
Merging data from multiple files
Timeseries analysis
Automate repetitive tasks
Combine with other Python libraries to create beautiful and fully customized visualizations
Reading a CSV file¶
We’ll be working with the file weather_YVR.csv
in the data
sub-folder. - Environment Canada daily weather measurements at Vancouver Airport from 1938-2017.
Now let’s read the CSV file into our notebook with the function read_csv
from the pandas
library. - To access functions in the library, we use dot notation again: pandas.read_csv()
- Our input to the read_csv function is the file path as a string: 'data/weather_YVR.csv'
We’ll store the data as a dataframe called weather
.
The pathlib module¶
The cell below constructs a Path object. Note the direction of the “/” separator. This would not be the way that we would specify a file on windows (that would be ’data:nbsphinx-math:weather_YVR.csv’ – Path objects hide this complexity by understanding whether we are working on windows, linux or macos and just doing the right thing.
[4]:
weather_file = context_pandas1.data_dir / "weather_YVR.csv"
[6]:
weather
[6]:
Date | Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1938-01-01 | 1938 | 1 | 1 | 4.4 | 9.4 | -0.6 | NaN | NaN | 0.3 |
1 | 1938-01-02 | 1938 | 1 | 2 | 4.5 | 7.2 | 1.7 | NaN | NaN | 0.5 |
2 | 1938-01-03 | 1938 | 1 | 3 | 1.7 | 7.2 | -3.9 | 0.0 | 0.0 | 0.0 |
3 | 1938-01-04 | 1938 | 1 | 4 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
4 | 1938-01-05 | 1938 | 1 | 5 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
5 | 1938-01-06 | 1938 | 1 | 6 | 2.3 | 5.6 | -1.1 | 0.0 | 0.0 | 0.0 |
6 | 1938-01-07 | 1938 | 1 | 7 | 2.2 | 4.4 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 1938-01-08 | 1938 | 1 | 8 | 2.5 | 6.7 | -1.7 | NaN | NaN | 0.0 |
8 | 1938-01-09 | 1938 | 1 | 9 | 4.2 | 5.6 | 2.8 | NaN | NaN | 4.3 |
9 | 1938-01-10 | 1938 | 1 | 10 | 5.8 | 7.2 | 4.4 | NaN | NaN | 3.6 |
10 | 1938-01-11 | 1938 | 1 | 11 | 4.2 | 6.7 | 1.7 | NaN | NaN | 0.0 |
11 | 1938-01-12 | 1938 | 1 | 12 | 7.8 | 10.6 | 5.0 | NaN | NaN | 5.3 |
12 | 1938-01-13 | 1938 | 1 | 13 | 6.1 | 8.3 | 3.9 | NaN | NaN | 28.7 |
13 | 1938-01-14 | 1938 | 1 | 14 | 6.7 | 8.3 | 5.0 | NaN | NaN | 18.3 |
14 | 1938-01-15 | 1938 | 1 | 15 | 4.5 | 8.3 | 0.6 | 0.0 | 0.0 | 0.0 |
15 | 1938-01-16 | 1938 | 1 | 16 | 4.5 | 6.7 | 2.2 | NaN | NaN | 6.6 |
16 | 1938-01-17 | 1938 | 1 | 17 | 2.8 | 7.2 | -1.7 | 0.0 | 0.0 | 0.0 |
17 | 1938-01-18 | 1938 | 1 | 18 | 4.8 | 7.8 | 1.7 | NaN | NaN | 5.8 |
18 | 1938-01-19 | 1938 | 1 | 19 | 3.4 | 6.7 | 0.0 | 0.0 | 0.0 | 0.0 |
19 | 1938-01-20 | 1938 | 1 | 20 | 2.8 | 6.1 | -0.6 | NaN | NaN | 8.9 |
20 | 1938-01-21 | 1938 | 1 | 21 | 5.0 | 8.3 | 1.7 | NaN | NaN | 2.5 |
21 | 1938-01-22 | 1938 | 1 | 22 | 3.1 | 8.3 | -2.2 | NaN | NaN | 3.8 |
22 | 1938-01-23 | 1938 | 1 | 23 | 2.8 | 7.8 | -2.2 | 0.0 | 0.0 | 0.0 |
23 | 1938-01-24 | 1938 | 1 | 24 | 3.4 | 6.7 | 0.0 | NaN | NaN | 0.0 |
24 | 1938-01-25 | 1938 | 1 | 25 | 4.8 | 7.8 | 1.7 | 0.0 | 0.0 | 0.0 |
25 | 1938-01-26 | 1938 | 1 | 26 | 0.3 | 3.9 | -3.3 | 0.0 | 0.0 | 0.0 |
26 | 1938-01-27 | 1938 | 1 | 27 | 2.8 | 6.1 | -0.6 | 0.0 | 0.0 | 0.0 |
27 | 1938-01-28 | 1938 | 1 | 28 | 4.5 | 6.7 | 2.2 | NaN | NaN | 8.9 |
28 | 1938-01-29 | 1938 | 1 | 29 | 1.7 | 7.8 | -4.4 | 0.0 | 0.0 | 0.0 |
29 | 1938-01-30 | 1938 | 1 | 30 | -2.3 | 1.1 | -5.6 | NaN | NaN | 5.6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
29160 | 2017-12-02 | 2017 | 12 | 2 | 5.3 | 6.8 | 3.7 | 5.2 | 0.0 | 5.2 |
29161 | 2017-12-03 | 2017 | 12 | 3 | 3.6 | 6.0 | 1.1 | 0.0 | 0.0 | 0.0 |
29162 | 2017-12-04 | 2017 | 12 | 4 | 3.5 | 5.8 | 1.1 | 0.0 | 0.0 | 0.0 |
29163 | 2017-12-05 | 2017 | 12 | 5 | 2.6 | 6.3 | -1.2 | 0.0 | 0.0 | 0.0 |
29164 | 2017-12-06 | 2017 | 12 | 6 | 2.2 | 5.8 | -1.5 | 0.0 | 0.0 | 0.0 |
29165 | 2017-12-07 | 2017 | 12 | 7 | 2.9 | 4.2 | 1.5 | 0.0 | 0.0 | 0.0 |
29166 | 2017-12-08 | 2017 | 12 | 8 | 1.9 | 3.5 | 0.2 | 0.0 | 0.0 | 0.0 |
29167 | 2017-12-09 | 2017 | 12 | 9 | 1.9 | 4.5 | -0.8 | 0.0 | 0.0 | 0.0 |
29168 | 2017-12-10 | 2017 | 12 | 10 | 4.1 | 7.7 | 0.5 | 0.0 | 0.0 | 0.0 |
29169 | 2017-12-11 | 2017 | 12 | 11 | 3.8 | 8.9 | -1.3 | 0.0 | 0.0 | 0.0 |
29170 | 2017-12-12 | 2017 | 12 | 12 | 4.8 | 7.6 | 1.9 | 0.0 | 0.0 | 0.0 |
29171 | 2017-12-13 | 2017 | 12 | 13 | 4.4 | 7.9 | 0.8 | 0.0 | 0.0 | 0.0 |
29172 | 2017-12-14 | 2017 | 12 | 14 | 3.0 | 6.5 | -0.5 | 0.0 | 0.0 | 0.0 |
29173 | 2017-12-15 | 2017 | 12 | 15 | 4.2 | 6.9 | 1.4 | 0.8 | 0.0 | 0.8 |
29174 | 2017-12-16 | 2017 | 12 | 16 | 4.1 | 6.5 | 1.6 | 5.8 | 0.0 | 5.8 |
29175 | 2017-12-17 | 2017 | 12 | 17 | 6.0 | 7.4 | 4.6 | 40.6 | 0.0 | 40.6 |
29176 | 2017-12-18 | 2017 | 12 | 18 | 6.2 | 7.3 | 5.0 | 3.0 | 0.0 | 3.0 |
29177 | 2017-12-19 | 2017 | 12 | 19 | 2.8 | 5.4 | 0.1 | 42.6 | 1.0 | 44.2 |
29178 | 2017-12-20 | 2017 | 12 | 20 | 1.7 | 5.0 | -1.7 | 0.0 | 0.0 | 0.0 |
29179 | 2017-12-21 | 2017 | 12 | 21 | -0.2 | 2.3 | -2.7 | 0.0 | 0.0 | 0.0 |
29180 | 2017-12-22 | 2017 | 12 | 22 | 1.5 | 5.5 | -2.6 | 0.0 | 0.0 | 0.0 |
29181 | 2017-12-23 | 2017 | 12 | 23 | -1.5 | 2.8 | -5.8 | 0.0 | 0.0 | 0.0 |
29182 | 2017-12-24 | 2017 | 12 | 24 | -2.0 | 0.9 | -4.9 | 0.0 | 0.0 | 0.0 |
29183 | 2017-12-25 | 2017 | 12 | 25 | -0.3 | 1.0 | -1.5 | 0.0 | 0.0 | 0.0 |
29184 | 2017-12-26 | 2017 | 12 | 26 | -0.4 | 0.7 | -1.4 | 0.0 | 0.0 | 0.0 |
29185 | 2017-12-27 | 2017 | 12 | 27 | 0.0 | 1.9 | -1.9 | 3.0 | 0.2 | 3.2 |
29186 | 2017-12-28 | 2017 | 12 | 28 | 3.4 | 5.7 | 1.1 | 20.8 | 0.0 | 20.8 |
29187 | 2017-12-29 | 2017 | 12 | 29 | 2.8 | 4.7 | 0.9 | 27.6 | 0.0 | 27.6 |
29188 | 2017-12-30 | 2017 | 12 | 30 | 1.3 | 4.3 | -1.8 | 2.2 | 0.0 | 2.2 |
29189 | 2017-12-31 | 2017 | 12 | 31 | -0.1 | 3.8 | -3.9 | 0.0 | 0.0 | 0.0 |
29190 rows × 10 columns
Only the first 30 and last 30 rows are displayed (but the data is all there in our
weather
variable)You may notice some weird
NaN
values—these represent missing data (NaN
= “not a number”)
What type of object is weather
?
[7]:
type(weather)
[7]:
pandas.core.frame.DataFrame
weather
is a DataFrame, a data structure from thepandas
libraryA DataFrame is a 2-dimensional array (organized into rows and columns, like a table in a spreadsheet)
When we display
weather
, the integer numbers in bold on the left are the DataFrame’s indexIn this case, the index is simply a range of integers corresponding with the row numbers
[8]:
weather
[8]:
Date | Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1938-01-01 | 1938 | 1 | 1 | 4.4 | 9.4 | -0.6 | NaN | NaN | 0.3 |
1 | 1938-01-02 | 1938 | 1 | 2 | 4.5 | 7.2 | 1.7 | NaN | NaN | 0.5 |
2 | 1938-01-03 | 1938 | 1 | 3 | 1.7 | 7.2 | -3.9 | 0.0 | 0.0 | 0.0 |
3 | 1938-01-04 | 1938 | 1 | 4 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
4 | 1938-01-05 | 1938 | 1 | 5 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
5 | 1938-01-06 | 1938 | 1 | 6 | 2.3 | 5.6 | -1.1 | 0.0 | 0.0 | 0.0 |
6 | 1938-01-07 | 1938 | 1 | 7 | 2.2 | 4.4 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 1938-01-08 | 1938 | 1 | 8 | 2.5 | 6.7 | -1.7 | NaN | NaN | 0.0 |
8 | 1938-01-09 | 1938 | 1 | 9 | 4.2 | 5.6 | 2.8 | NaN | NaN | 4.3 |
9 | 1938-01-10 | 1938 | 1 | 10 | 5.8 | 7.2 | 4.4 | NaN | NaN | 3.6 |
10 | 1938-01-11 | 1938 | 1 | 11 | 4.2 | 6.7 | 1.7 | NaN | NaN | 0.0 |
11 | 1938-01-12 | 1938 | 1 | 12 | 7.8 | 10.6 | 5.0 | NaN | NaN | 5.3 |
12 | 1938-01-13 | 1938 | 1 | 13 | 6.1 | 8.3 | 3.9 | NaN | NaN | 28.7 |
13 | 1938-01-14 | 1938 | 1 | 14 | 6.7 | 8.3 | 5.0 | NaN | NaN | 18.3 |
14 | 1938-01-15 | 1938 | 1 | 15 | 4.5 | 8.3 | 0.6 | 0.0 | 0.0 | 0.0 |
15 | 1938-01-16 | 1938 | 1 | 16 | 4.5 | 6.7 | 2.2 | NaN | NaN | 6.6 |
16 | 1938-01-17 | 1938 | 1 | 17 | 2.8 | 7.2 | -1.7 | 0.0 | 0.0 | 0.0 |
17 | 1938-01-18 | 1938 | 1 | 18 | 4.8 | 7.8 | 1.7 | NaN | NaN | 5.8 |
18 | 1938-01-19 | 1938 | 1 | 19 | 3.4 | 6.7 | 0.0 | 0.0 | 0.0 | 0.0 |
19 | 1938-01-20 | 1938 | 1 | 20 | 2.8 | 6.1 | -0.6 | NaN | NaN | 8.9 |
20 | 1938-01-21 | 1938 | 1 | 21 | 5.0 | 8.3 | 1.7 | NaN | NaN | 2.5 |
21 | 1938-01-22 | 1938 | 1 | 22 | 3.1 | 8.3 | -2.2 | NaN | NaN | 3.8 |
22 | 1938-01-23 | 1938 | 1 | 23 | 2.8 | 7.8 | -2.2 | 0.0 | 0.0 | 0.0 |
23 | 1938-01-24 | 1938 | 1 | 24 | 3.4 | 6.7 | 0.0 | NaN | NaN | 0.0 |
24 | 1938-01-25 | 1938 | 1 | 25 | 4.8 | 7.8 | 1.7 | 0.0 | 0.0 | 0.0 |
25 | 1938-01-26 | 1938 | 1 | 26 | 0.3 | 3.9 | -3.3 | 0.0 | 0.0 | 0.0 |
26 | 1938-01-27 | 1938 | 1 | 27 | 2.8 | 6.1 | -0.6 | 0.0 | 0.0 | 0.0 |
27 | 1938-01-28 | 1938 | 1 | 28 | 4.5 | 6.7 | 2.2 | NaN | NaN | 8.9 |
28 | 1938-01-29 | 1938 | 1 | 29 | 1.7 | 7.8 | -4.4 | 0.0 | 0.0 | 0.0 |
29 | 1938-01-30 | 1938 | 1 | 30 | -2.3 | 1.1 | -5.6 | NaN | NaN | 5.6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
29160 | 2017-12-02 | 2017 | 12 | 2 | 5.3 | 6.8 | 3.7 | 5.2 | 0.0 | 5.2 |
29161 | 2017-12-03 | 2017 | 12 | 3 | 3.6 | 6.0 | 1.1 | 0.0 | 0.0 | 0.0 |
29162 | 2017-12-04 | 2017 | 12 | 4 | 3.5 | 5.8 | 1.1 | 0.0 | 0.0 | 0.0 |
29163 | 2017-12-05 | 2017 | 12 | 5 | 2.6 | 6.3 | -1.2 | 0.0 | 0.0 | 0.0 |
29164 | 2017-12-06 | 2017 | 12 | 6 | 2.2 | 5.8 | -1.5 | 0.0 | 0.0 | 0.0 |
29165 | 2017-12-07 | 2017 | 12 | 7 | 2.9 | 4.2 | 1.5 | 0.0 | 0.0 | 0.0 |
29166 | 2017-12-08 | 2017 | 12 | 8 | 1.9 | 3.5 | 0.2 | 0.0 | 0.0 | 0.0 |
29167 | 2017-12-09 | 2017 | 12 | 9 | 1.9 | 4.5 | -0.8 | 0.0 | 0.0 | 0.0 |
29168 | 2017-12-10 | 2017 | 12 | 10 | 4.1 | 7.7 | 0.5 | 0.0 | 0.0 | 0.0 |
29169 | 2017-12-11 | 2017 | 12 | 11 | 3.8 | 8.9 | -1.3 | 0.0 | 0.0 | 0.0 |
29170 | 2017-12-12 | 2017 | 12 | 12 | 4.8 | 7.6 | 1.9 | 0.0 | 0.0 | 0.0 |
29171 | 2017-12-13 | 2017 | 12 | 13 | 4.4 | 7.9 | 0.8 | 0.0 | 0.0 | 0.0 |
29172 | 2017-12-14 | 2017 | 12 | 14 | 3.0 | 6.5 | -0.5 | 0.0 | 0.0 | 0.0 |
29173 | 2017-12-15 | 2017 | 12 | 15 | 4.2 | 6.9 | 1.4 | 0.8 | 0.0 | 0.8 |
29174 | 2017-12-16 | 2017 | 12 | 16 | 4.1 | 6.5 | 1.6 | 5.8 | 0.0 | 5.8 |
29175 | 2017-12-17 | 2017 | 12 | 17 | 6.0 | 7.4 | 4.6 | 40.6 | 0.0 | 40.6 |
29176 | 2017-12-18 | 2017 | 12 | 18 | 6.2 | 7.3 | 5.0 | 3.0 | 0.0 | 3.0 |
29177 | 2017-12-19 | 2017 | 12 | 19 | 2.8 | 5.4 | 0.1 | 42.6 | 1.0 | 44.2 |
29178 | 2017-12-20 | 2017 | 12 | 20 | 1.7 | 5.0 | -1.7 | 0.0 | 0.0 | 0.0 |
29179 | 2017-12-21 | 2017 | 12 | 21 | -0.2 | 2.3 | -2.7 | 0.0 | 0.0 | 0.0 |
29180 | 2017-12-22 | 2017 | 12 | 22 | 1.5 | 5.5 | -2.6 | 0.0 | 0.0 | 0.0 |
29181 | 2017-12-23 | 2017 | 12 | 23 | -1.5 | 2.8 | -5.8 | 0.0 | 0.0 | 0.0 |
29182 | 2017-12-24 | 2017 | 12 | 24 | -2.0 | 0.9 | -4.9 | 0.0 | 0.0 | 0.0 |
29183 | 2017-12-25 | 2017 | 12 | 25 | -0.3 | 1.0 | -1.5 | 0.0 | 0.0 | 0.0 |
29184 | 2017-12-26 | 2017 | 12 | 26 | -0.4 | 0.7 | -1.4 | 0.0 | 0.0 | 0.0 |
29185 | 2017-12-27 | 2017 | 12 | 27 | 0.0 | 1.9 | -1.9 | 3.0 | 0.2 | 3.2 |
29186 | 2017-12-28 | 2017 | 12 | 28 | 3.4 | 5.7 | 1.1 | 20.8 | 0.0 | 20.8 |
29187 | 2017-12-29 | 2017 | 12 | 29 | 2.8 | 4.7 | 0.9 | 27.6 | 0.0 | 27.6 |
29188 | 2017-12-30 | 2017 | 12 | 30 | 1.3 | 4.3 | -1.8 | 2.2 | 0.0 | 2.2 |
29189 | 2017-12-31 | 2017 | 12 | 31 | -0.1 | 3.8 | -3.9 | 0.0 | 0.0 | 0.0 |
29190 rows × 10 columns
For large DataFrames, it’s often useful to display just the first few or last few rows:
[9]:
weather.head()
[9]:
Date | Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1938-01-01 | 1938 | 1 | 1 | 4.4 | 9.4 | -0.6 | NaN | NaN | 0.3 |
1 | 1938-01-02 | 1938 | 1 | 2 | 4.5 | 7.2 | 1.7 | NaN | NaN | 0.5 |
2 | 1938-01-03 | 1938 | 1 | 3 | 1.7 | 7.2 | -3.9 | 0.0 | 0.0 | 0.0 |
3 | 1938-01-04 | 1938 | 1 | 4 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
4 | 1938-01-05 | 1938 | 1 | 5 | 2.2 | 7.2 | -2.8 | 0.0 | 0.0 | 0.0 |
The head
method returns a new DataFrame consisting of the first n
rows (default 5)
Pro Tips! - To display the documentation for this method, you can run the command
weather.head?
in your Jupyter notebook - To see other methods available for the DataFrame, typeweather.
followed by Tab for auto-complete options
First two rows:
[10]:
weather.head(2)
[10]:
Date | Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1938-01-01 | 1938 | 1 | 1 | 4.4 | 9.4 | -0.6 | NaN | NaN | 0.3 |
1 | 1938-01-02 | 1938 | 1 | 2 | 4.5 | 7.2 | 1.7 | NaN | NaN | 0.5 |
[11]:
# Last four rows:
[12]:
weather.tail(4)
[12]:
Date | Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|---|
29186 | 2017-12-28 | 2017 | 12 | 28 | 3.4 | 5.7 | 1.1 | 20.8 | 0.0 | 20.8 |
29187 | 2017-12-29 | 2017 | 12 | 29 | 2.8 | 4.7 | 0.9 | 27.6 | 0.0 | 27.6 |
29188 | 2017-12-30 | 2017 | 12 | 30 | 1.3 | 4.3 | -1.8 | 2.2 | 0.0 | 2.2 |
29189 | 2017-12-31 | 2017 | 12 | 31 | -0.1 | 3.8 | -3.9 | 0.0 | 0.0 | 0.0 |
Data at a Glance¶
pandas
provides many ways to quickly and easily summarize your data: - How many rows and columns are there? - What are all the column names and what type of data is in each column?
Numerical data: What is the average and range of the values?
Text data: What are the unique values and how often does each occur?
How many values are missing in each column or row?
Number of rows and columns:¶
[13]:
weather.shape
[13]:
(29190, 10)
[14]:
# - The DataFrame `weather` has 29190 rows and 10 columns
# - The index does not count as a column
# - Notice there are no parentheses at the end of `weather.shape`
# - `shape` is a **data attribute** of the variable `weather`
[15]:
type(weather.shape)
[15]:
tuple
[16]:
# The data in the `shape` attribute is stored as a **tuple**, which is similar to a list.
#
# - Items in a tuple are enclosed in `()` instead of `[]`
# - Tuples are immutable - you can't modify individual items inside a tuple
Within a column of a DataFrame, the data must all be of the same type
We can find out the names and data types of each column from the
dtypes
attribute:
[17]:
weather.dtypes
[17]:
Date object
Year int64
Month int64
Day int64
T_mean (C) float64
T_high (C) float64
T_low (C) float64
Rain (mm) float64
Snow (cm) float64
Total Precip (mm) float64
dtype: object
In a pandas
DataFrame, a column containing text data (or containing a mix of text and numbers) is assigned a dtype
of object
and is treated as a column of strings.
int64
and float64
are integer and float, respectively. - The 64 at the end means that they are stored as 64-bit numbers in memory - These data types are equivalent to int
and float
in Python (pandas
is a just a bit more explicit in how it names them)
If we just want a list of the column names, we can use the columns
attribute:
[18]:
weather.columns
[18]:
Index(['Date', 'Year', 'Month', 'Day', 'T_mean (C)', 'T_high (C)', 'T_low (C)',
'Rain (mm)', 'Snow (cm)', 'Total Precip (mm)'],
dtype='object')
Simple Summary Statistics¶
The describe
method computes simple summary statistics and returns them as a DataFrame:
[19]:
weather.describe()
[19]:
Year | Month | Day | T_mean (C) | T_high (C) | T_low (C) | Rain (mm) | Snow (cm) | Total Precip (mm) | |
---|---|---|---|---|---|---|---|---|---|
count | 29190.000000 | 29190.000000 | 29190.000000 | 29167.000000 | 29169.000000 | 29180.000000 | 29126.000000 | 29138.000000 | 29166.000000 |
mean | 1977.479753 | 6.522336 | 15.729222 | 10.133692 | 13.780767 | 6.440062 | 2.985731 | 0.121779 | 3.107351 |
std | 23.093891 | 3.448584 | 8.801167 | 5.803549 | 6.408815 | 5.514129 | 6.231114 | 1.146298 | 6.325946 |
min | 1938.000000 | 1.000000 | 1.000000 | -14.500000 | -11.100000 | -17.800000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1957.000000 | 4.000000 | 8.000000 | 5.900000 | 8.900000 | 2.600000 | 0.000000 | 0.000000 | 0.000000 |
50% | 1977.000000 | 7.000000 | 16.000000 | 10.000000 | 13.300000 | 6.700000 | 0.000000 | 0.000000 | 0.000000 |
75% | 1997.750000 | 10.000000 | 23.000000 | 15.000000 | 18.900000 | 11.100000 | 3.200000 | 0.000000 | 3.400000 |
max | 2017.000000 | 12.000000 | 31.000000 | 28.400000 | 34.400000 | 22.400000 | 91.600000 | 41.000000 | 91.600000 |
The describe
method is a way to quickly summarize the averages, extremes, and variability of each numerical data column.
You can look at each statistic individually with methods such as mean
, median
, min
, max
,std
, and count
[20]:
weather.mean()
[20]:
Year 1977.479753
Month 6.522336
Day 15.729222
T_mean (C) 10.133692
T_high (C) 13.780767
T_low (C) 6.440062
Rain (mm) 2.985731
Snow (cm) 0.121779
Total Precip (mm) 3.107351
dtype: float64
Exercise¶
For this exercise, we will explore data about countries around the world, combined from multiple sources by the Gapminder foundation.
Gapminder is an independent Swedish foundation that fights devastating misconceptions about global development and promotes a fact-based world view through the production of free teaching and data exploration resources.
Data Overview¶
The columns of data/gapminder_world_data_2018.csv
are:
Column |
Description |
---|---|
country |
Country name |
population |
Population in the country |
region |
Continent the country belongs to |
sub_region |
Sub regions as defined by |
income_group |
Income group as specified by the world bank |
life_expectancy |
The average number of years a newborn child would live if mortality patterns were to stay the same |
gdp_per_capita |
GDP per capita (in USD) adjusted for differences in purchasing power |
children_per_woman |
Number of children born to each woman |
child_mortality |
Deaths of children under 5 years of age per 1000 live births |
pop_density |
Average number of people per km\(^2\) |
You may want to refer to the Pandas cheatsheet as you work through the exercises.
a) Read the file context_pandas1.data_dir / "gapminder_world_data_2018.csv"
into a new DataFrame world
and display the first 10 rows.
b) How many rows and columns does world
have?
c) Display the names and data types of each column.
d) Display summary statistics with the describe
method. What are the lowest and highest populations? How about lowest/highest population densities? Any guesses which countries these might be? (We’ll find out the answers in Lesson 4!)
Bonus exercises¶
e) Data wrangling - dealing with header rows
The file raw_dir / "weather_YVR_1938.csv"
contains the daily weather data for 1938, in the original format downloaded from Environment Canada. Open this file in cocalc.
Note that a CSV viewer wouldn’t able to parse the data correctly because of the extra header rows at the beginning.
Now try reading the file into your notebook with
pandas.read_csv
and see what happens.
If you look at the documentation for pandas.read_csv
, you’ll see a skiprows
input buried amongst a few dozen other inputs for this function. This input tells read_csv
how many rows to skip at the beginning of the file. - Try reading data/raw/weather_YVR_1938.csv
again, but this time using a value of 24
for the skiprows
keyword argument, and display the first 5 rows of the resulting DataFrame.