{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pandas Cheatsheet\n", "\n", "Downloaded from https://github.com/jenfly/eoas-python by Jennifer Walker\n", "\n", "### Importing `pandas` Library\n", "```python\n", "import pandas\n", "```\n", "> In general, it's good practice to collect all your `import` commands together and put them at the start of the notebook.\n", "\n", "### DataFrames and Series\n", "\n", "Data in `pandas` is organized into DataFrames and Series.\n", "\n", "- **DataFrame:** 2-dimensional array, like a table in a spreadsheet\n", " - The rows are axis 0\n", " - The columns are axis 1\n", "- **Series:** 1-dimensional array, like a single column or row in a spreadsheet\n", " - Each individual column or row of a DataFrame is represented as a Series\n", "\n", "### Reading a CSV File\n", "\n", "To read a CSV file and store it as a DataFrame variable:\n", "```python\n", "df = pandas.read_csv('some_cool_data.csv')\n", "```\n", "\n", "Missing data in a DataFrame or Series is represented as `NaN` (\"not a number\").\n", "\n", "### Saving to a CSV File\n", "\n", "To save a DataFrame to a CSV file: \n", "```python\n", "df.to_csv('cool_output.csv', index=False)\n", "```\n", "- To include the DataFrame's index as a column in the CSV file, omit the `index=False` keyword argument." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Quick and Easy Summaries of a DataFrame\n", "\n", "|**feature** |dataframe attribute/method|\n", "|---|---|\n", "|**Useful Attributes** |\n", "|Number of rows and columns (rows first, columns second) | `df.shape` |\n", "|Names and data types of each column | `df.dtypes` \n", "|Just the names of each column | `df.columns` \n", "|**Rows at a Glance** |\n", "|First `n` rows (default 5) |`df.head(n)`\n", "|Last `n` rows (default 5) | `df.tail(n)`\n", "|A random sampling of `n` rows (default 1) | `df.sample(n)`\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Summary Statistics\n", "\n", "Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:\n", "```python\n", "df.describe()\n", "```\n", "\n", "Mean value of each column:\n", "```python\n", "df.mean()\n", "```\n", "\n", "And similarly for other summary statistics: `df.min()`, `df.max()`, `df.median()`, `df.std()`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Working with DataFrame Columns\n", "\n", "#### Single Columns\n", "\n", "Each column of a DataFrame is a Series.\n", "```python\n", "series_X = df['X']\n", "```\n", "\n", "Most DataFrame methods can be applied to a Series, for example:\n", "```python\n", "df['X'].head()\n", "df['X'].max()\n", "```\n", "\n", "Basic calculations with a Series and adding a new column to a DataFrame: \n", "```python\n", "df['Double X'] = 2 * df['X']\n", "```\n", "\n", "#### Multiple Columns\n", "\n", "Use a list of column names to select several columns of a DataFrame, in a specified order:\n", "```python\n", "df_subset = df[['E', 'A', 'C']]\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Plots\n", "\n", "Create quick and easy plots of Series and DataFrames with the `plot` method, for example:\n", "- Line plot of one column of a DataFrame:\n", "```python\n", "df['A'].plot()\n", "```\n", "- Line plot of column `'B'` vs. column `'A'` of a DataFrame:\n", "```python\n", "df.plot(x='A', y='B')\n", "```\n", "- Histogram of one column of a DataFrame:\n", "```python\n", "df['A'].plot(kind='hist', bins=30)\n", "```\n", "- Plot all columns of a DataFrame, with each column as a line on the same plot:\n", "```python\n", "df.plot()\n", "```\n", "- Plot bar charts of each column of a DataFrame, with a separate subplot for each column:\n", "```python\n", "df.plot(kind='bar', subplots=True)\n", "```\n", "\n", "Plots can be adjusted with keyword arguments such as `figsize`, `fontsize`, `title`, `colormap`. \n", "- To adjust the size of a graph, use the `figsize` keyword argument to the `plot` method, where `figsize` is a tuple of (width, height). For example to create 6\" wide by 8\" tall figure with subplots:\n", "```python\n", "df.plot(kind='bar', subplots=True, figsize=(6, 8))\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Unique Values & Counting\n", "\n", "For a column `df['A']` which contains many repeated values (such as categories), some useful summary methods are:\n", "\n", "|**feature**|method|\n", "|---|---|\n", "|Unique values | `df['A'].unique()`|\n", "|Number of unique values | `df['A'].nunique()`|\n", "|Counts of each unique value | `df['A'].value_counts()`|\n", "\n", "\n", "> Note: The `unique`, `nunique`, and `value_counts` methods can only be applied to a Series (not a DataFrame)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Sorting\n", "\n", "Sorting a DataFrame based on the values in the column `'B'`:\n", "```python\n", "df.sort_values('B')\n", "```\n", "To sort in descending order, use the keyword argument `ascending=False`.\n", "\n", "Sorting a Series:\n", "```python\n", "series_a.sort_values()\n", "```\n", "\n", "### Filtering\n", "\n", "To select a subset of rows with a filter:\n", " - Create a filter (Boolean Series) using a comparison operator or other functions (such as the `isnull` method)\n", " - Use the filter to extract the desired rows from the DataFrame\n", "\n", "Example: comparison operator\n", "```python\n", "warm = weather_data['Temperature (C)'] > 25\n", "weather_warm = weather_data[warm]\n", "```\n", "\n", "You can also use Boolean Series and DataFrames to count occurrences of specified criteria, for example: count the missing values in each column of a DataFrame with `df.isnull().sum()`\n", "\n", "\n", "### Aggregation\n", "\n", "For basic aggregation operations, use the `groupby` method chained with an aggregation method (e.g., `sum`, `mean`, `sum`, `max`, etc.).\n", "\n", "For example, to find the sum totals of column `'population'` grouped by column `'region'`: `\n", "```python\n", "world.groupby('region')['population'].sum()\n", "```\n", "\n", "You can also group by multiple columns:\n", "```python\n", "world.groupby(['region', 'income_group'])['population'].sum()\n", "```\n", "\n", "For more complex aggregations, you can use the `agg` method.\n", "- Specify a list of aggregation statistics, for example: \n", "```python\n", "world.groupby('region')['population'].agg(['sum', 'min', 'max'])\n", "```\n", "- Use a dictionary to specify different aggregation statistics for different columns, for example:\n", "\n", "```python\n", "agg_dict = {'population' : 'sum', \n", " 'life_expectancy' : ['min', 'max']}\n", "world.groupby('region').agg(agg_dict)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "---\n", "### Setting the Index of a DataFrame\n", "\n", "The index of a DataFrame can be set with the `set_index` method. For example, to set the column `'country'` as the index of DataFrame `world` and save the output as a new DataFrame:\n", "```python\n", "world_new = world.set_index('country')\n", "```\n", "\n", "### Anatomy of a DataFrame\n", "\n", "![dataframe3](img/dataframe3.png)\n", "\n", "- Each cell in a DataFrame is associated with:\n", " - A row position and column position (integers, starting from 0), and\n", " - A row label and column label\n", "- Row and column labels are part of the DataFrame's metadata, stored in attributes:\n", " - `df.index` contains the labels for the DataFrame's rows\n", " - `df.columns` contains the labels for the DataFrame's columns\n", "- `df.values` contains the underlying data array\n", "\n", "### Selection Methods\n", "\n", "There are three main ways of selecting subsets of a DataFrame (or Series):\n", "
    \n", "
  1. Using only square brackets `[]`
  2. \n", "
  3. Positional indexing with `.iloc[]`
  4. \n", "
  5. Label-based indexing with `.loc[]`
  6. \n", "
\n", "\n", "\n", "#### 1. Using only square brackets\n", "|Bracket indexing||\n", "|---|----|\n", "|Use a Boolean Series to select rows matching some criteria | `df[bool_series]`|\n", "|Use a list of column names to select several columns in a specified order | `df[['E', 'A', 'C']]`|\n", "|Select a single column (as a Series) | `series_A = df['A']`|\n", "|Select a single column (as a DataFrame) | `df_A = df[['A']]`|\n", "\n", "\n", "#### 2. Positional indexing with `iloc`\n", "\n", "With `iloc`, we specify the row selection first, followed by column selection, separated by a comma. For example, selecting a single cell at row 2, column 4:\n", "```python\n", "df.iloc[2, 4]\n", "```\n", "\n", "With `iloc` can select:\n", "- A single cell\n", "- A single row or column\n", "- Slices of rows and/or columns\n", " - As with Python lists, `iloc` slices are inclusive of the start bound and exclusive of the stop bound\n", "- Multiple rows and/or columns in arbitrary order, using a list of positions\n", "\n", "A few examples:\n", "```python\n", "df.iloc[100:201:2, [5, 2, 4]]\n", "df.iloc[[4, 2], 3:7]\n", "df.iloc[175:, :]\n", "```\n", "\n", "\n", "#### 3. Label-based indexing with `loc`\n", "\n", "As with `iloc`, we specify the row selection first, followed by column selection, separated by a comma, but using the row and column labels instead of their positions. For example, selecting a single cell at a row labelled `'Algeria'` and a column labelled `'life_expectancy'`:\n", "```python\n", "world_new.loc['Algeria', 'life_expectancy']\n", "```\n", "\n", "Similar to `iloc`, with `loc` we can select:\n", "- A single cell\n", "- A single row or column\n", "- Slices of rows and/or columns\n", " - However, unlike `iloc` and Python list slicing, `loc` slices are inclusive of the start bound but also **inclusive** of the stop bound.\n", " - Example:\n", "```python\n", "world_new.loc['Canada':'Denmark', 'income_group':'gdp_per_capita']\n", "```\n", "- Multiple rows and/or columns in arbitrary order, using a list of labels\n", "\n", "We can also select rows and/or columns with a Boolean Series, and mix and match these with the other selection options listed above, for example:\n", "```python\n", "densely_populated = world_new['pop_density'] > 500\n", "world_new.loc[densely_populated, ['region', 'pop_density', 'population']]\n", "```" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "all", "notebook_metadata_filter": "all", "text_representation": { "extension": ".py", "format_name": "percent", "format_version": "1.2", "jupytext_version": "1.0.3" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "nbsphinx": { "execute": "never" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }