{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Importing-pandas-Library\" data-toc-modified-id=\"Importing-pandas-Library-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Importing <code>pandas</code> Library</a></span></li><li><span><a href=\"#DataFrames-and-Series\" data-toc-modified-id=\"DataFrames-and-Series-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>DataFrames and Series</a></span></li><li><span><a href=\"#Reading-a-CSV-File\" data-toc-modified-id=\"Reading-a-CSV-File-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Reading a CSV File</a></span></li><li><span><a href=\"#Saving-to-a-CSV-File\" data-toc-modified-id=\"Saving-to-a-CSV-File-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Saving to a CSV File</a></span></li><li><span><a href=\"#Quick-and-Easy-Summaries-of-a-DataFrame\" data-toc-modified-id=\"Quick-and-Easy-Summaries-of-a-DataFrame-5\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>Quick and Easy Summaries of a DataFrame</a></span><ul class=\"toc-item\"><li><span><a href=\"#Summary-Statistics\" data-toc-modified-id=\"Summary-Statistics-5.1\"><span class=\"toc-item-num\">5.1&nbsp;&nbsp;</span>Summary Statistics</a></span></li></ul></li><li><span><a href=\"#Working-with-DataFrame-Columns\" data-toc-modified-id=\"Working-with-DataFrame-Columns-6\"><span class=\"toc-item-num\">6&nbsp;&nbsp;</span>Working with DataFrame Columns</a></span><ul class=\"toc-item\"><li><span><a href=\"#Single-Columns\" data-toc-modified-id=\"Single-Columns-6.1\"><span class=\"toc-item-num\">6.1&nbsp;&nbsp;</span>Single Columns</a></span></li><li><span><a href=\"#Multiple-Columns\" data-toc-modified-id=\"Multiple-Columns-6.2\"><span class=\"toc-item-num\">6.2&nbsp;&nbsp;</span>Multiple Columns</a></span></li></ul></li><li><span><a href=\"#Plots\" data-toc-modified-id=\"Plots-7\"><span class=\"toc-item-num\">7&nbsp;&nbsp;</span>Plots</a></span></li><li><span><a href=\"#Unique-Values-&amp;-Counting\" data-toc-modified-id=\"Unique-Values-&amp;-Counting-8\"><span class=\"toc-item-num\">8&nbsp;&nbsp;</span>Unique Values &amp; Counting</a></span></li><li><span><a href=\"#Sorting\" data-toc-modified-id=\"Sorting-9\"><span class=\"toc-item-num\">9&nbsp;&nbsp;</span>Sorting</a></span></li><li><span><a href=\"#Filtering\" data-toc-modified-id=\"Filtering-10\"><span class=\"toc-item-num\">10&nbsp;&nbsp;</span>Filtering</a></span></li><li><span><a href=\"#Aggregation\" data-toc-modified-id=\"Aggregation-11\"><span class=\"toc-item-num\">11&nbsp;&nbsp;</span>Aggregation</a></span></li><li><span><a href=\"#Setting-the-Index-of-a-DataFrame\" data-toc-modified-id=\"Setting-the-Index-of-a-DataFrame-12\"><span class=\"toc-item-num\">12&nbsp;&nbsp;</span>Setting the Index of a DataFrame</a></span></li><li><span><a href=\"#Anatomy-of-a-DataFrame\" data-toc-modified-id=\"Anatomy-of-a-DataFrame-13\"><span class=\"toc-item-num\">13&nbsp;&nbsp;</span>Anatomy of a DataFrame</a></span></li><li><span><a href=\"#Selection-Methods\" data-toc-modified-id=\"Selection-Methods-14\"><span class=\"toc-item-num\">14&nbsp;&nbsp;</span>Selection Methods</a></span><ul class=\"toc-item\"><li><span><a href=\"#1.-Using-only-square-brackets\" data-toc-modified-id=\"1.-Using-only-square-brackets-14.1\"><span class=\"toc-item-num\">14.1&nbsp;&nbsp;</span>1. Using only square brackets</a></span></li><li><span><a href=\"#2.-Positional-indexing-with-iloc\" data-toc-modified-id=\"2.-Positional-indexing-with-iloc-14.2\"><span class=\"toc-item-num\">14.2&nbsp;&nbsp;</span>2. Positional indexing with <code>iloc</code></a></span></li><li><span><a href=\"#3.-Label-based-indexing-with-loc\" data-toc-modified-id=\"3.-Label-based-indexing-with-loc-14.3\"><span class=\"toc-item-num\">14.3&nbsp;&nbsp;</span>3. Label-based indexing with <code>loc</code></a></span></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Pandas Cheatsheet\n",
    "\n",
    "Downloaded from https://github.com/jenfly/eoas-python by Jennifer Walker\n",
    "\n",
    "### Importing `pandas` Library\n",
    "```python\n",
    "import pandas\n",
    "```\n",
    "> In general, it's good practice to collect all your `import` commands together and put them at the start of the notebook.\n",
    "\n",
    "### DataFrames and Series\n",
    "\n",
    "Data in `pandas` is organized into DataFrames and Series.\n",
    "\n",
    "- **DataFrame:** 2-dimensional array, like a table in a spreadsheet\n",
    "  - The rows are axis 0\n",
    "  - The columns are axis 1\n",
    "- **Series:** 1-dimensional array, like a single column or row in a spreadsheet\n",
    "  - Each individual column or row of a DataFrame is represented as a Series\n",
    "\n",
    "### Reading a CSV File\n",
    "\n",
    "To read a CSV file and store it as a DataFrame variable:\n",
    "```python\n",
    "df = pandas.read_csv('some_cool_data.csv')\n",
    "```\n",
    "\n",
    "Missing data in a DataFrame or Series is represented as `NaN` (\"not a number\").\n",
    "\n",
    "### Saving to a CSV File\n",
    "\n",
    "To save a DataFrame to a CSV file: \n",
    "```python\n",
    "df.to_csv('cool_output.csv', index=False)\n",
    "```\n",
    "- To include the DataFrame's index as a column in the CSV file, omit the `index=False` keyword argument."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Quick and Easy Summaries of a DataFrame\n",
    "\n",
    "|**feature** |dataframe attribute/method|\n",
    "|---|---|\n",
    "|**Useful Attributes** |\n",
    "|Number of rows and columns (rows first, columns second) | `df.shape` |\n",
    "|Names and data types of each column |  `df.dtypes` \n",
    "|Just the names of each column | `df.columns` \n",
    "|**Rows at a Glance** |\n",
    "|First `n` rows (default 5) |`df.head(n)`\n",
    "|Last `n` rows (default 5) | `df.tail(n)`\n",
    "|A random sampling of `n` rows (default 1) | `df.sample(n)`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Summary Statistics\n",
    "\n",
    "Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:\n",
    "```python\n",
    "df.describe()\n",
    "```\n",
    "\n",
    "Mean value of each column:\n",
    "```python\n",
    "df.mean()\n",
    "```\n",
    "\n",
    "And similarly for other summary statistics: `df.min()`, `df.max()`, `df.median()`, `df.std()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Working with DataFrame Columns\n",
    "\n",
    "#### Single Columns\n",
    "\n",
    "Each column of a DataFrame is a Series.\n",
    "```python\n",
    "series_X = df['X']\n",
    "```\n",
    "\n",
    "Most DataFrame methods can be applied to a Series, for example:\n",
    "```python\n",
    "df['X'].head()\n",
    "df['X'].max()\n",
    "```\n",
    "\n",
    "Basic calculations with a Series and adding a new column to a DataFrame: \n",
    "```python\n",
    "df['Double X'] = 2 * df['X']\n",
    "```\n",
    "\n",
    "#### Multiple Columns\n",
    "\n",
    "Use a list of column names to select several columns of a DataFrame, in a specified order:\n",
    "```python\n",
    "df_subset = df[['E', 'A', 'C']]\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Plots\n",
    "\n",
    "Create quick and easy plots of Series and DataFrames with the `plot` method, for example:\n",
    "- Line plot of one column of a DataFrame:\n",
    "```python\n",
    "df['A'].plot()\n",
    "```\n",
    "- Line plot of column `'B'` vs. column `'A'` of a DataFrame:\n",
    "```python\n",
    "df.plot(x='A', y='B')\n",
    "```\n",
    "- Histogram of one column of a DataFrame:\n",
    "```python\n",
    "df['A'].plot(kind='hist', bins=30)\n",
    "```\n",
    "- Plot all columns of a DataFrame, with each column as a line on the same plot:\n",
    "```python\n",
    "df.plot()\n",
    "```\n",
    "- Plot bar charts of each column of a DataFrame, with a separate subplot for each column:\n",
    "```python\n",
    "df.plot(kind='bar', subplots=True)\n",
    "```\n",
    "\n",
    "Plots can be adjusted with keyword arguments such as `figsize`, `fontsize`, `title`, `colormap`. \n",
    "- To adjust the size of a graph, use the `figsize` keyword argument to the `plot` method, where `figsize` is a tuple of (width, height). For example to create 6\" wide by 8\" tall figure with subplots:\n",
    "```python\n",
    "df.plot(kind='bar', subplots=True, figsize=(6, 8))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Unique Values & Counting\n",
    "\n",
    "For a column `df['A']` which contains many repeated values (such as categories), some useful summary methods are:\n",
    "\n",
    "|**feature**|method|\n",
    "|---|---|\n",
    "|Unique values | `df['A'].unique()`|\n",
    "|Number of unique values | `df['A'].nunique()`|\n",
    "|Counts of each unique value | `df['A'].value_counts()`|\n",
    "\n",
    "\n",
    "> Note: The `unique`, `nunique`, and `value_counts` methods can only be applied to a Series (not a DataFrame)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Sorting\n",
    "\n",
    "Sorting a DataFrame based on the values in the column `'B'`:\n",
    "```python\n",
    "df.sort_values('B')\n",
    "```\n",
    "To sort in descending order, use the keyword argument `ascending=False`.\n",
    "\n",
    "Sorting a Series:\n",
    "```python\n",
    "series_a.sort_values()\n",
    "```\n",
    "\n",
    "### Filtering\n",
    "\n",
    "To select a subset of rows with a filter:\n",
    "  - Create a filter (Boolean Series) using a comparison operator or other functions (such as the `isnull` method)\n",
    "  - Use the filter to extract the desired rows from the DataFrame\n",
    "\n",
    "Example: comparison operator\n",
    "```python\n",
    "warm = weather_data['Temperature (C)'] > 25\n",
    "weather_warm = weather_data[warm]\n",
    "```\n",
    "\n",
    "You can also use Boolean Series and DataFrames to count occurrences of specified criteria, for example: count the missing values in each column of a DataFrame with `df.isnull().sum()`\n",
    "\n",
    "\n",
    "### Aggregation\n",
    "\n",
    "For basic aggregation operations, use the `groupby` method chained with an aggregation method (e.g., `sum`, `mean`, `sum`, `max`, etc.).\n",
    "\n",
    "For example, to find the sum totals of column `'population'` grouped by column `'region'`: `\n",
    "```python\n",
    "world.groupby('region')['population'].sum()\n",
    "```\n",
    "\n",
    "You can also group by multiple columns:\n",
    "```python\n",
    "world.groupby(['region', 'income_group'])['population'].sum()\n",
    "```\n",
    "\n",
    "For more complex aggregations, you can use the `agg` method.\n",
    "- Specify a list of aggregation statistics, for example: \n",
    "```python\n",
    "world.groupby('region')['population'].agg(['sum', 'min', 'max'])\n",
    "```\n",
    "- Use a dictionary to specify different aggregation statistics for different columns, for example:\n",
    "\n",
    "```python\n",
    "agg_dict = {'population' : 'sum', \n",
    "            'life_expectancy' : ['min', 'max']}\n",
    "world.groupby('region').agg(agg_dict)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "---\n",
    "### Setting the Index of a DataFrame\n",
    "\n",
    "The index of a DataFrame can be set with the `set_index` method. For example, to set the column `'country'` as the index of DataFrame `world` and save the output as a new DataFrame:\n",
    "```python\n",
    "world_new = world.set_index('country')\n",
    "```\n",
    "\n",
    "### Anatomy of a DataFrame\n",
    "\n",
    "![dataframe3](img/dataframe3.png)\n",
    "\n",
    "- Each cell in a DataFrame is associated with:\n",
    "  - A row position and column position (integers, starting from 0), and\n",
    "  - A row label and column label\n",
    "- Row and column labels are part of the DataFrame's metadata, stored in attributes:\n",
    "  - `df.index` contains the labels for the DataFrame's rows\n",
    "  - `df.columns` contains the labels for the DataFrame's columns\n",
    "- `df.values` contains the underlying data array\n",
    "\n",
    "### Selection Methods\n",
    "\n",
    "There are three main ways of selecting subsets of a DataFrame (or Series):\n",
    "<ol>\n",
    "    <li>Using only square brackets `[]`</li>\n",
    "    <li>Positional indexing with `.iloc[]`</li>\n",
    "    <li>Label-based indexing with `.loc[]`</li>\n",
    "</ol>\n",
    "\n",
    "\n",
    "#### 1. Using only square brackets\n",
    "|Bracket indexing||\n",
    "|---|----|\n",
    "|Use a Boolean Series to select rows matching some criteria | `df[bool_series]`|\n",
    "|Use a list of column names to select several columns in a specified order | `df[['E', 'A', 'C']]`|\n",
    "|Select a single column (as a Series) | `series_A = df['A']`|\n",
    "|Select a single column (as a DataFrame) | `df_A = df[['A']]`|\n",
    "\n",
    "\n",
    "#### 2. Positional indexing with `iloc`\n",
    "\n",
    "With `iloc`, we specify the row selection first, followed by column selection, separated by a comma. For example, selecting a single cell at row 2, column 4:\n",
    "```python\n",
    "df.iloc[2, 4]\n",
    "```\n",
    "\n",
    "With `iloc` can select:\n",
    "- A single cell\n",
    "- A single row or column\n",
    "- Slices of rows and/or columns\n",
    "  - As with Python lists, `iloc` slices are inclusive of the start bound and exclusive of the stop bound\n",
    "- Multiple rows and/or columns in arbitrary order, using a list of positions\n",
    "\n",
    "A few examples:\n",
    "```python\n",
    "df.iloc[100:201:2, [5, 2, 4]]\n",
    "df.iloc[[4, 2], 3:7]\n",
    "df.iloc[175:, :]\n",
    "```\n",
    "\n",
    "\n",
    "#### 3. Label-based indexing with `loc`\n",
    "\n",
    "As with `iloc`, we specify the row selection first, followed by column selection, separated by a comma, but using the row and column labels instead of their positions. For example, selecting a single cell at a row labelled `'Algeria'` and a column labelled `'life_expectancy'`:\n",
    "```python\n",
    "world_new.loc['Algeria', 'life_expectancy']\n",
    "```\n",
    "\n",
    "Similar to `iloc`, with `loc` we can select:\n",
    "- A single cell\n",
    "- A single row or column\n",
    "- Slices of rows and/or columns\n",
    "  - However, unlike `iloc` and Python list slicing, `loc` slices are inclusive of the start bound but also **inclusive** of the stop bound.\n",
    "  - Example:\n",
    "```python\n",
    "world_new.loc['Canada':'Denmark', 'income_group':'gdp_per_capita']\n",
    "```\n",
    "- Multiple rows and/or columns in arbitrary order, using a list of labels\n",
    "\n",
    "We can also select rows and/or columns with a Boolean Series, and mix and match these with the other selection options listed above, for example:\n",
    "```python\n",
    "densely_populated = world_new['pop_density'] > 500\n",
    "world_new.loc[densely_populated, ['region', 'pop_density', 'population']]\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "all",
   "notebook_metadata_filter": "all",
   "text_representation": {
    "extension": ".py",
    "format_name": "percent",
    "format_version": "1.2",
    "jupytext_version": "1.0.3"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  },
  "nbsphinx": {
   "execute": "never"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}