{ "cells": [ { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "toc": true }, "source": [ "


\n", "
" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "# Exploring data using NumPy\n", "\n", "Our first task in this week's lesson is to learn how to read and explore data files using [NumPy](http://www.numpy.org/).\n", "We are also going to need another module called [Pandas](https://pandas.pydata.org/) to get a csv file into numpy. We are\n", "not going discuss pandas in this lesson, but you'll hear more about it later in the course.\n", "\n", "# What is NumPy?\n", "\n", "NumPy is a library for Python designed for efficient scientific (numerical) computing.\n", "It is an essential library in Python that is used under the hood in many other modules (including [Pandas](https://pandas.pydata.org/)).\n", "Here, we will get a sense of a few things NumPy can do." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "# Reading a data file with pandas\n", "\n", "## Importing NumPy and pandas\n", "\n", "Now we're ready to read in our temperature data file.\n", "First, we need to import the NumPy and pandas module." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "import numpy as np # noqa\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it!\n", "NumPy is now ready to use.\n", "Notice that we have imported the NumPy module with the name `np`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "\n", "Below we will read the file data/Kumpula-June-2016-w-metadata.txt\n", "\n", "Use your cocalc file browser to open the file data/Kumpula-June-2016-w-metadata.txt in the data folder. 
You should see this:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki\n", "# for June 1-30, 2016\n", "# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND\n", "# Data processing: Extracted temperatures from raw data file, converted to\n", "# comma-separated format\n", "#\n", "# David Whipp - 02.10.2017\n", "\n", "YEARMODA,TEMP,MAX,MIN\n", "20160601,65.5,73.6,54.7\n", "20160602,65.8,80.8,55.0\n", "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the format -- there is some header information we want to skip, then the column names on line 9,\n", "and then rows of data. We will get a DataFrame with 4 columns -- the year/month/day, the daily mean temperature, and the daily maximum and minimum temperatures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading a data file\n", "\n", "Now we'll read the file data into a pandas `DataFrame` object. Remember that Python counts from 0, so the header\n", "on line 9 is at row 8 in Python. Passing `skip_blank_lines=False` keeps the blank line 8 in the count, so the row numbers match the file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "data_frame = pd.read_csv(\n", " \"data/Kumpula-June-2016-w-metadata.txt\", header=[8], skip_blank_lines=False\n", ")\n", "data_frame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting to a numpy ndarray\n", "\n", "Since this notebook is focussed on NumPy, not pandas, we want to get \n", "the data out of the pandas `DataFrame` into a NumPy array to work with. That's done using\n", "the `values` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "data = data_frame.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, here we are doing two things: \n", "\n", "1. 
Creating the new object `data`\n", " (unlike C, where variables must be explicitly declared before they are\n", " used, here we create it dynamically, or on the fly).\n", "\n", "2. Taking the values from the DataFrame and assigning them to `data`.\n", "\n", "Also note that Jupyter does not print anything when we\n", "assign a variable using an equals sign. Only the value of the last expression in a\n", "cell is displayed automatically. Displaying anything else requires\n", "passing it to the `print()` function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring our dataset\n", "\n", "A normal first step when you load new data is to explore the dataset a bit to understand what is there and its format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking the array data type\n", "\n", "Perhaps as a first step, we can check what type of object our NumPy array `data` is. Try using Google to find out how to check an object's type in Python. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should confirm that it's a NumPy *ndarray*; not much of a surprise here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking the data array type\n", "\n", "The ndarray is a container -- it essentially points to a region of computer memory that holds a sequence of numbers represented as bit patterns. Those bits can be interpreted as floating point, integer, or boolean values, and each number may occupy 8, 16, 32 or 64 bits. 
The ndarray object stores that information in an [attribute](http://blog.thedigitalcatonline.com/blog/2015/01/12/accessing-attributes-in-python/) called its `dtype`, or data type.\n", "\n", "Let’s now have a look at the data type in our ndarray.\n", "We can find this in the `dtype` attribute, which every ndarray carries automatically.\n", "\n", "To summarize -- all Python objects have a type. Objects of type ndarray hold data with a particular dtype. Every data element in an ndarray has the same dtype." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print out the `dtype` of the `data` ndarray below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "lines_to_next_cell": 2, "nbgrader": { "checksum": "2cadd4c7ca4e8a6459448932e321495f", "grade": true, "grade_id": "cell-f6b1242c9096947a", "locked": false, "points": 1, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see the data are floating point values with 64-bit precision." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Note:**\n", "\n", "There are some exceptions, but in a normal NumPy array all elements have the same data type.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking the size of the dataset\n", "\n", "We can also check to see how many rows and columns we have in the dataset using the `shape` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "lines_to_next_cell": 2, "nbgrader": { "checksum": "87c950fae79acb78e5080e8c51e7c832", "grade": true, "grade_id": "cell-8e8a48fd40a92660", "locked": false, "points": 1, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see how there are 30 rows of data and 4 columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with our data - Index slicing\n", "\n", "Let's have another quick look at our data.\n", "\n", "Need an explicit instruction. Maybe\n", "\n", "Just enter the name of the variable, `data` and run the cell to see the array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is OK, but we can certainly make it easier to work with.\n", "We can start by slicing our data up into different columns and creating new variables with the column data.\n", "Slices from our array can be extracted using the *index values*.\n", "In our case, we have two indices in our 2D data structure.\n", "For example, the index values `[2,0]`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "give us the value at row 2, column 0.\n", "\n", "We can also use ranges of rows and columns using the `:` character. 
To get, for example,\n", "the values from the first 3 columns of row 12, we could type\n", "\n", "`data[12, 0:3]`\n", "\n", "Below, enter the code to get the first 5 rows of values in column zero." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not bad, right?\n", "\n", "In fact, we don't even necessarily need the lower bound for this slice of data because NumPy will assume it for us if we don't list it.\n", "Let's see another example. \n", "\n", "## Subsetting a set of rows in an array\n", "\n", "Can you figure out how to see the first 5 rows of data (all columns)? Note, if you put the colon `:` alone in the array index, Python will give you all rows or columns. E.g., `data[1,:]` will give you all columns from row 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Differences from MATLAB\n", "\n", "Note a few important differences from MATLAB. In Python:\n", " \n", "1. The first item in an array is at index 0, not 1.\n", "\n", "1. The slice `[0:5]` selects 5 elements, at indices 0, 1, 2, 3, 4; the upper bound is excluded. \n", " \n", "1. Index numbers can be negative: if an array `a` contains `[0,1,2,3,4]`,\n", " then `a[-1]` is 4, `a[-2]` is 3, etc. Try creating an\n", " array with `numpy.arange` and accessing various elements in the\n", " cell below. What happens if you try an index beyond the end of the\n", " array?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "import numpy \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Returning to the first 5 rows of `data`: in a slice such as `data[:5, :]`, the lower bound of the index range, `0`, is assumed, and we get all rows up to (but not including) index `5`."
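, "\n", "\n", "The slicing rules above can be tried on a small stand-in array. This is only a minimal sketch: the arrays `a` and `grid` are made-up examples built with `numpy.arange`, not the Kumpula data.

```python
import numpy as np

# Hypothetical stand-in arrays, not the temperature data
a = np.arange(5)                    # array([0, 1, 2, 3, 4])
grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

first_three = a[0:3]    # elements at indices 0, 1, 2
same_slice = a[:3]      # an omitted lower bound defaults to 0
last_item = a[-1]       # negative indices count from the end
row_one = grid[1, :]    # all columns of row 1
top_rows = grid[:2, :]  # first 2 rows, all columns

# Indexing past the end raises IndexError instead of returning a value
try:
    a[5]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first_three, same_slice, last_item, row_one, top_rows, out_of_range)
```

The same patterns should carry over to `data`, with the row index first and the column index second."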
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slicing our data into columns\n", "\n", "Now let's use the ideas of index slicing to cut our 2D data into 4 separate columns that will be easier to work with.\n", "\n", "As you might recall from the header of the file, we have 4 columns of data values: `YEARMODA`, `TEMP`, `MAX`, and `MIN`.\n", "We can extract all of the values from a given column by not listing an upper or lower bound for the index slice.\n", "\n", "Create a new array called `dateYMD` from the first column of `data`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, as mentioned above, Python does not print out anything when we assign a variable with an equals sign. Enter `dateYMD` in the cell below to see its values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, this looks promising.\n", "Let's quickly handle the others." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp = data[:, 1]\n", "temp_max = data[:, 2]\n", "temp_min = data[:, 3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have 4 variables, one for each column in the data file.\n", "This should make life easier when we want to perform some calculations on our data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking the data in memory\n", "\n", "We can list the variables we have defined at this point, along with their types, using the `%whos` magic command.\n", "This is quite handy. Enter the command below and see what you get.\n", "For technical reasons we need to comment out the line below when we generate this notebook. Uncomment it and run the cell."
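, "\n", "\n", "The `%whos` magic only exists inside IPython and Jupyter, so in a plain Python script a similar overview has to be printed by hand. The sketch below is one rough way to do that, under stated assumptions: `sample` is a stand-in for `data` holding only the two rows shown in the file listing earlier, and the dictionary keys are hypothetical names chosen for illustration.

```python
import numpy as np

# Stand-in for `data`: the two rows shown in the file listing (date, mean, max, min)
sample = np.array([
    [20160601.0, 65.5, 73.6, 54.7],
    [20160602.0, 65.8, 80.8, 55.0],
])

# Slice each column with a bare colon for the rows (hypothetical names)
columns = {
    'date_col': sample[:, 0],
    'temp_col': sample[:, 1],
    'max_col': sample[:, 2],
    'min_col': sample[:, 3],
}

# Print name, object type, dtype and shape for each column
for name, values in columns.items():
    print(name, type(values).__name__, values.dtype, values.shape)
```

Each column comes back as a one-dimensional ndarray with the same dtype as the array it was sliced from."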
] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "#%whos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic data calculations in NumPy\n", "\n", "NumPy ndarrays have attributes that describe them and methods that perform calculations on their data.\n", "Useful methods include `mean()`, `min()`, `max()`, and `std()` (the standard deviation). The general format to use a method is `objectname.method()`. Our objects include the NumPy arrays that we created above.\n", "\n", "For example, can you find the mean temperature in the dataset? Do that in the cell below and save it in a variable called `mean_temp`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "lines_to_next_cell": 2, "nbgrader": { "checksum": "43b79b746157c9914dc1f1b22b944e4d", "grade": false, "grade_id": "cell-30fa61c4d9885e32", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "139becefcdbacb3e90022117e1e369e6", "grade": true, "grade_id": "cell-2228c250f4ae5124", "locked": true, "points": 2, "schema_version": 1, "solution": false } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about the highest daily maximum temperature? Remember that you've read\n", "the daily maximum temperatures from column 2 above. Save the maximum\n", "of that column as the variable `temp_max` below. Note that this replaces the\n", "column array of the same name with a single number."
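, "\n", "\n", "The `objectname.method()` pattern can be sketched on a small made-up array before applying it to the real columns. The array `values` and the result names below are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical example values, not taken from the data file
values = np.array([2.0, 4.0, 6.0, 8.0])

mean_value = values.mean()  # arithmetic mean
min_value = values.min()    # smallest element
max_value = values.max()    # largest element
std_value = values.std()    # population standard deviation (ddof=0 by default)

print(mean_value, min_value, max_value, std_value)
```

Each method returns a single number here because the array is one-dimensional and no `axis` argument is given."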
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "checksum": "1a0c8f566a4d13940d33d1e24521a023", "grade": false, "grade_id": "cell-82f413cb6f6dd708", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "fa61a0331a955bffa4a1001fa405a468", "grade": true, "grade_id": "cell-d02f83d1a2eb8333", "locked": true, "points": 2, "schema_version": 1, "solution": false } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data type conversions\n", "\n", "Lastly, we can convert our ndarray data from one data type to another using the `astype()` method. \n", "As we see in the output from `%whos` above, our `dateYMD` array contains `float64` data, but the values were simply integers in the data file. \n", "\n", "For example, to convert a NumPy array named `varname1` to `float64` and assign the result to `varname2`, use \n", "`varname2 = varname1.astype(\"float64\")`. Note that `astype()` returns a new array rather than changing the original. If `varname1` and `varname2`\n", "are the same name, the net effect is that `varname1` now refers to the converted array.\n", "\n", "Can you figure out how to convert `dateYMD` to integers using `astype()`? " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Check the `dtype` attribute to confirm." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we see our dates as whole integer values."
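, "\n", "\n", "A minimal sketch of the same kind of conversion on a stand-in array: `dates_float` and `dates_int` are hypothetical names, and the two values are the `YEARMODA` entries shown in the file listing earlier.

```python
import numpy as np

# Two YEARMODA values stored as floats, as they are after reading the file
dates_float = np.array([20160601.0, 20160602.0])

# astype() returns a new array; the original keeps its dtype
dates_int = dates_float.astype('int64')

print(dates_float.dtype, dates_int.dtype)
print(dates_int)
```

Assigning the result back to the original name would make that name refer to the integer array instead."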
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dateYMD" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n" ] } ], "metadata": { "jupytext": { "formats": "ipynb", "metadata_filter": { "cells": { "additional": "all" }, "notebook": { "additional": "all" } }, "text_representation": { "extension": ".py", "format_name": "percent", "format_version": "1.2", "jupytext_version": "0.8.6" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "nbsphinx": { "execute": "never" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "474px", "left": "119px", "top": "245px", "width": "200.594px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }