Table of Contents

  • 1  Exploring data using NumPy

  • 2  What is NumPy?

  • 3  Reading a data file with pandas

    • 3.1  Importing NumPy and pandas

    • 3.2  Reading data

    • 3.3  Reading a data file

    • 3.4  Converting to a numpy ndarray

  • 4  Exploring our dataset

    • 4.1  Checking the array data type

    • 4.2  Checking the data array type

    • 4.3  Checking the size of the dataset

  • 5  Working with our data - Index slicing

    • 5.1  Subsetting a set of rows in an array

      • 5.1.1  Differences from matlab

    • 5.2  Slicing our data into columns

    • 5.3  Checking the data in memory

  • 6  Basic data calculations in NumPy

      • 6.0.1  Data type conversions

Exploring data using NumPy

Our first task in this week’s lesson is to learn how to read and explore data files using NumPy. We will also need another module, pandas, to get a CSV file into NumPy. We are not going to discuss pandas in this lesson, but you’ll hear more about it later in the course.

What is NumPy?

NumPy is a library for Python designed for efficient scientific (numerical) computing. It is an essential library in Python that is used under the hood in many other modules (including Pandas). Here, we will get a sense of a few things NumPy can do.

Reading a data file with pandas

Importing NumPy and pandas

Now we’re ready to read in our temperature data file. First, we need to import the NumPy and pandas modules.

[ ]:
import numpy as np  # noqa
import pandas as pd

That’s it! NumPy is now ready to use. Notice that we have imported the NumPy module with the name np.

Reading data

Below we will read the file data/Kumpula-June-2016-w-metadata.txt.

Use your CoCalc file browser to open this file in the data folder. You should see this:

# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki
# for June 1-30, 2016
# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
# Data processing: Extracted temperatures from raw data file, converted to
# comma-separated format
#
# David Whipp - 02.10.2017

YEARMODA,TEMP,MAX,MIN
20160601,65.5,73.6,54.7
20160602,65.8,80.8,55.0
...

Note the format – there is some header information we want to skip, the column names on line 9 and then rows of data. We will get a DataFrame with 4 columns – the year/month/day, temperature on that day, and the historical maximum and minimum.

Reading a data file

Now we’ll read the file data into a pandas DataFrame object. Remember that Python counts from 0, so the header on line 9 is at row 8 in Python.

[ ]:
data_frame = pd.read_csv(
    "data/Kumpula-June-2016-w-metadata.txt", header=[8], skip_blank_lines=False
)
data_frame.head()
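To see how the header row number works without needing the data file, here is a minimal sketch that builds a small stand-in for the file in memory. The text, the metadata lines and the name sample_frame are made up for illustration; only the column names and the two data rows come from the file listing above. The sketch passes header=8 as a plain integer rather than the list [8] used in the cell above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the data file: seven comment lines, one blank
# line, then the column names on line 9 (row 8 when counting from 0).
metadata_lines = [f"# metadata line {i}" for i in range(1, 8)]
lines = metadata_lines + [
    "",
    "YEARMODA,TEMP,MAX,MIN",
    "20160601,65.5,73.6,54.7",
    "20160602,65.8,80.8,55.0",
]
text = "\n".join(lines) + "\n"

# skip_blank_lines=False makes the blank line count as a row, so the
# column names really are at row 8.
sample_frame = pd.read_csv(io.StringIO(text), header=8, skip_blank_lines=False)
print(sample_frame)
```

If skip_blank_lines were left at its default of True, the blank line would not be counted and the header row number would have to change accordingly.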

Converting to a numpy ndarray

Since this notebook is focussed on NumPy, not pandas, we want to get the data out of the pandas DataFrame into a NumPy array to work with. That’s done using the values attribute.

[ ]:
data = data_frame.values

Note, here we are doing two things:

  1. Creating the new object data. Unlike C, where variables are explicitly declared before they are used, here we create it dynamically (on the fly).

  2. Taking the values from the DataFrame and assigning them to data.

Also note that Jupyter does not print anything when we assign a variable using an equals sign. Only the value of the last expression in a cell is displayed automatically. Printing anything else requires the print() function.
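As a rough illustration of the conversion, the sketch below builds a tiny DataFrame by hand and pulls out its values. The names small_frame and small_array are hypothetical; the numbers are the first two rows of the data file. Because an ndarray holds a single data type, mixing an integer column with a floating point column gives an array of floats.

```python
import numpy as np
import pandas as pd

# A hand-made DataFrame with one integer column and one float column.
small_frame = pd.DataFrame(
    {"YEARMODA": [20160601, 20160602], "TEMP": [65.5, 65.8]}
)

# The values attribute returns the contents as a NumPy array.
small_array = small_frame.values

# The assignment above displays nothing, so print explicitly.
print(small_array)
print(small_array.dtype)  # float64: the integers were converted to floats
```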

Exploring our dataset

A normal first step when you load new data is to explore the dataset a bit to understand what is there and its format.

Checking the array data type

Perhaps as a first step, we can check the type of data we have in our NumPy array data. Try using google to find out how to check an object type in python.

[ ]:


You should confirm that it’s a NumPy ndarray; not much of a surprise here.

Checking the data array type

The ndarray is a container: it essentially points to a region of computer memory that holds a sequence of numbers represented as bit patterns. Those bits can be interpreted as floating point, integer, or boolean values, and each number may occupy 8, 16, 32 or 64 bits. The ndarray object stores that information in an attribute called its dtype, or data type.

Let’s now have a look at the data type in our ndarray. We can find this in the dtype attribute, which every ndarray has automatically.

To summarize – all python objects have a type. Objects of type ndarray hold data with a particular dtype. Every data element in an ndarray has the same dtype.
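The distinction between an object's type and an array's dtype can be sketched on a few made-up arrays (ints, floats and mixed are hypothetical names, not part of the lesson data):

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.5, 3.0])
# Mixing integers, floats and booleans: NumPy picks one dtype for all elements.
mixed = np.array([1, 2.5, True])

print(type(floats))     # the Python type of the container
print(ints.dtype)       # an integer dtype (its width depends on the platform)
print(floats.dtype)     # float64
print(mixed.dtype)      # float64
print(floats.itemsize)  # bytes per element: 8 bytes = 64 bits
```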

Print out the dtype of the data ndarray below.

[ ]:
# YOUR CODE HERE
raise NotImplementedError()

Here we see the data are floating point values with 64-bit precision.

Note:

There are some exceptions, but normal NumPy arrays will all have the same data type.

Checking the size of the dataset

We can also check to see how many rows and columns we have in the dataset using the shape attribute.

[ ]:
# YOUR CODE HERE
raise NotImplementedError()

Here we see that there are 30 rows of data and 4 columns.
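As a small sketch of how shape reads in general, here is a made-up array of zeros (grid is a hypothetical name). The shape is a tuple with the number of rows first and the number of columns second, so it can be unpacked into two variables.

```python
import numpy as np

# A made-up array with 3 rows and 2 columns.
grid = np.zeros((3, 2))

print(grid.shape)  # (3, 2)

# The shape tuple can be unpacked into separate row and column counts.
n_rows, n_cols = grid.shape
print(n_rows, n_cols)
```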

Working with our data - Index slicing

Let’s have another quick look at our data.

Enter the name of the variable, data, and run the cell to see the array.

[ ]:


This is OK, but we can certainly make it easier to work with. We can start by slicing our data up into different columns and creating new variables with the column data. Slices from our array can be extracted using the index values. In our case, we have two indices in our 2D data structure. For example, the index values [2,0]

[ ]:


give us the value at row 2, column 0.

We can also select ranges of rows and columns using the : character. To get, for example, the values from the first 3 columns of row 12, we could type

data[12, 0:3]
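To make the indexing rules concrete, here is a short sketch on a made-up 5 x 4 array of consecutive integers (toy is a hypothetical name, not the lesson data), where it is easy to see which element each index picks out.

```python
import numpy as np

# Rows are [0 1 2 3], [4 5 6 7], [8 9 10 11], [12 13 14 15], [16 17 18 19].
toy = np.arange(20).reshape(5, 4)

# A single element: row 2, column 0.
print(toy[2, 0])  # 8

# A range of columns from one row: columns 0, 1 and 2 of row 3.
print(toy[3, 0:3])  # [12 13 14]
```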

Below, enter the code to get the first 5 rows of values in column zero.

[ ]:


Not bad, right?

In fact, we don’t even necessarily need the lower bound for this slice of data because NumPy will assume it for us if we don’t list it. Let’s see another example.

Subsetting a set of rows in an array

Can you figure out how to see the first 5 rows of data (all columns)? Note, if you put only the colon : in the array index, Python will give you all rows or columns. E.g., data[1, :] will give you all columns from row 1.
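The effect of leaving out slice bounds can be sketched on a made-up 4 x 3 array (grid is a hypothetical name, not the lesson data):

```python
import numpy as np

# Rows are [0 1 2], [3 4 5], [6 7 8], [9 10 11].
grid = np.arange(12).reshape(4, 3)

# A bare colon selects everything along that axis: all columns of row 1.
print(grid[1, :])  # [3 4 5]

# Omitting the lower bound means "start at 0", so these two are the same.
print(grid[:2, :])
print(grid[0:2, :])
```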

Differences from matlab

Note a couple of important differences from MATLAB. In Python:

  1. The first item in an array is at index 0, not 1.

  2. The slice [0:5] has 5 elements: 0, 1, 2, 3, 4. The upper bound is not included.

  3. Index numbers can be negative: if an array a contains [0, 1, 2, 3, 4], then a[-1] contains 4, a[-2] contains 3, etc.

Try creating an array with numpy.arange and accessing various elements in the cell below. What happens if you try an index beyond the end of the array?

[ ]:
import numpy
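One possible way to explore these rules is sketched below, assuming NumPy is imported as np as at the start of the lesson. A single index past the end raises an IndexError, while a slice that runs past the end is simply cut short.

```python
import numpy as np

a = np.arange(5)  # contains 0, 1, 2, 3, 4

print(a[-1])  # 4, the last element
print(a[-2])  # 3, the second to last element

# A slice that extends past the end is truncated rather than rejected.
print(a[3:10])  # [3 4]

# A single index beyond the end is an error.
try:
    a[5]
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```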

In a slice such as data[:5, :], the lower bound of the index range (0) is assumed, and we get all rows up to (but not including) index 5.

Slicing our data into columns

Now let’s use the ideas of index slicing to cut our 2D data into 4 separate columns that will be easier to work with.

As you might recall from the header of the file, we have 4 data values: YEARMODA, TEMP, MAX, and MIN. We can extract all of the values from a given column by not listing an upper or lower bound for the row index slice.

Create a new array called dateYMD from the first column of the data.

[ ]:


Note, as mentioned above, python does not print out anything when we assign a variable with an equals sign.

[ ]:


OK, this looks promising. Let’s quickly handle the others.

[ ]:
temp = data[:, 1]
temp_max = data[:, 2]
temp_min = data[:, 3]

Now we have 4 variables, one for each column in the data file. This should make life easier when we want to perform some calculations on our data.
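The same column slicing can be tried on a small hand-made array. In this sketch, table is a hypothetical stand-in holding only the first two rows of the data file; each extracted column is a one-dimensional array with one value per row.

```python
import numpy as np

# First two rows of the data file: YEARMODA, TEMP, MAX, MIN.
table = np.array(
    [
        [20160601, 65.5, 73.6, 54.7],
        [20160602, 65.8, 80.8, 55.0],
    ]
)

# A bare colon for the rows selects every row of the chosen column.
first_column = table[:, 0]
second_column = table[:, 1]

print(first_column.shape)  # (2,): one value per row
print(second_column)       # [65.5 65.8]
```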

Checking the data in memory

We can list the variables we have defined at this point, together with their types and some information about their contents, using the %whos magic command. This is quite handy. For technical reasons we need to comment out the line below when we generate this notebook. Uncomment it and run the cell to see what you get.

[ ]:
#%whos

Basic data calculations in NumPy

NumPy ndarrays have a set of attributes they know about themselves and methods they can use to make calculations using the data. Useful methods include mean(), min(), max(), and std() (the standard deviation). The general format to use a method is objectname.method(). Our objects include the numpy arrays that we created above.
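The objectname.method() pattern can be sketched on a short made-up array (sample is a hypothetical name and the numbers are not from the data file). Note that std() with its default settings computes the population standard deviation, dividing by the number of values.

```python
import numpy as np

sample = np.array([60.0, 65.0, 70.0, 75.0])

sample_mean = sample.mean()  # (60 + 65 + 70 + 75) / 4 = 67.5
sample_min = sample.min()    # 60.0
sample_max = sample.max()    # 75.0
sample_std = sample.std()    # sqrt(31.25), about 5.59

print(sample_mean, sample_min, sample_max, sample_std)
```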

For example, can you find the mean temperature in the dataset? Do that in the cell below and save it in a variable called mean_temp.

[ ]:
# YOUR CODE HERE
raise NotImplementedError()
[ ]:

How about the highest of the daily maximum temperatures? Remember that you’ve read the daily climatological maximum from column 2 into temp_max above. Save the maximum of that column as the variable temp_max below. Note that this replaces the temp_max array with a single value.

[ ]:
# YOUR CODE HERE
raise NotImplementedError()
[ ]:

Lastly, we can convert our ndarray data from one data type to another using the astype() method. As we see in the output from %whos above, our date array contains float64 data, but it was simply integer values in the data file.

For example, to convert a NumPy array named varname1 to float64 and assign the result to varname2, use varname2 = varname1.astype("float64"). Note that astype() returns a new array and leaves the original unchanged. If varname1 and varname2 are the same name, the net effect is that varname1 now refers to the converted data.
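A minimal sketch of this behaviour, using made-up arrays (whole, whole_as_int and fractions are hypothetical names), is shown below. It also shows that converting floats to integers drops the fractional part rather than rounding.

```python
import numpy as np

whole = np.array([1.0, 2.0, 3.0])

# astype() returns a new array; the original keeps its dtype.
whole_as_int = whole.astype("int64")
print(whole.dtype)         # float64
print(whole_as_int.dtype)  # int64

# Assigning back to the same name replaces the old array with the new one.
whole = whole.astype("int64")
print(whole.dtype)  # int64

# Fractional parts are truncated toward zero, not rounded.
fractions = np.array([1.9, -1.9])
print(fractions.astype("int64"))  # [ 1 -1]
```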

Can you figure out how to convert dateYMD to integers using astype()?

[ ]:





Check the dtype attribute to confirm.

[ ]:


Now we see our dates as whole integer values.

[ ]:
dateYMD
[ ]: