"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Pandas Cheatsheet\n",
"\n",
"Downloaded from https://github.com/jenfly/eoas-python by Jennifer Walker\n",
"\n",
"### Importing `pandas` Library\n",
"```python\n",
"import pandas\n",
"```\n",
"> In general, it's good practice to collect all your `import` commands together and put them at the start of the notebook.\n",
"\n",
"### DataFrames and Series\n",
"\n",
"Data in `pandas` is organized into DataFrames and Series.\n",
"\n",
"- **DataFrame:** 2-dimensional array, like a table in a spreadsheet\n",
" - The rows are axis 0\n",
" - The columns are axis 1\n",
"- **Series:** 1-dimensional array, like a single column or row in a spreadsheet\n",
" - Each individual column or row of a DataFrame is represented as a Series\n",
"\n",
"### Reading a CSV File\n",
"\n",
"To read a CSV file and store it as a DataFrame variable:\n",
"```python\n",
"df = pandas.read_csv('some_cool_data.csv')\n",
"```\n",
"\n",
"Missing data in a DataFrame or Series is represented as `NaN` (\"not a number\").\n",
"\n",
"### Saving to a CSV File\n",
"\n",
"To save a DataFrame to a CSV file: \n",
"```python\n",
"df.to_csv('cool_output.csv', index=False)\n",
"```\n",
"- To include the DataFrame's index as a column in the CSV file, omit the `index=False` keyword argument."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Quick and Easy Summaries of a DataFrame\n",
"\n",
"|**feature** |dataframe attribute/method|\n",
"|---|---|\n",
"|**Useful Attributes** |\n",
"|Number of rows and columns (rows first, columns second) | `df.shape` |\n",
"|Names and data types of each column | `df.dtypes` \n",
"|Just the names of each column | `df.columns` \n",
"|**Rows at a Glance** |\n",
"|First `n` rows (default 5) |`df.head(n)`\n",
"|Last `n` rows (default 5) | `df.tail(n)`\n",
"|A random sampling of `n` rows (default 1) | `df.sample(n)`\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Summary Statistics\n",
"\n",
"Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:\n",
"```python\n",
"df.describe()\n",
"```\n",
"\n",
"Mean value of each column:\n",
"```python\n",
"df.mean()\n",
"```\n",
"\n",
"And similarly for other summary statistics: `df.min()`, `df.max()`, `df.median()`, `df.std()`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Working with DataFrame Columns\n",
"\n",
"#### Single Columns\n",
"\n",
"Each column of a DataFrame is a Series.\n",
"```python\n",
"series_X = df['X']\n",
"```\n",
"\n",
"Most DataFrame methods can be applied to a Series, for example:\n",
"```python\n",
"df['X'].head()\n",
"df['X'].max()\n",
"```\n",
"\n",
"Basic calculations with a Series and adding a new column to a DataFrame: \n",
"```python\n",
"df['Double X'] = 2 * df['X']\n",
"```\n",
"\n",
"#### Multiple Columns\n",
"\n",
"Use a list of column names to select several columns of a DataFrame, in a specified order:\n",
"```python\n",
"df_subset = df[['E', 'A', 'C']]\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Plots\n",
"\n",
"Create quick and easy plots of Series and DataFrames with the `plot` method, for example:\n",
"- Line plot of one column of a DataFrame:\n",
"```python\n",
"df['A'].plot()\n",
"```\n",
"- Line plot of column `'B'` vs. column `'A'` of a DataFrame:\n",
"```python\n",
"df.plot(x='A', y='B')\n",
"```\n",
"- Histogram of one column of a DataFrame:\n",
"```python\n",
"df['A'].plot(kind='hist', bins=30)\n",
"```\n",
"- Plot all columns of a DataFrame, with each column as a line on the same plot:\n",
"```python\n",
"df.plot()\n",
"```\n",
"- Plot bar charts of each column of a DataFrame, with a separate subplot for each column:\n",
"```python\n",
"df.plot(kind='bar', subplots=True)\n",
"```\n",
"\n",
"Plots can be adjusted with keyword arguments such as `figsize`, `fontsize`, `title`, `colormap`. \n",
"- To adjust the size of a graph, use the `figsize` keyword argument to the `plot` method, where `figsize` is a tuple of (width, height). For example to create 6\" wide by 8\" tall figure with subplots:\n",
"```python\n",
"df.plot(kind='bar', subplots=True, figsize=(6, 8))\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Unique Values & Counting\n",
"\n",
"For a column `df['A']` which contains many repeated values (such as categories), some useful summary methods are:\n",
"\n",
"|**feature**|method|\n",
"|---|---|\n",
"|Unique values | `df['A'].unique()`|\n",
"|Number of unique values | `df['A'].nunique()`|\n",
"|Counts of each unique value | `df['A'].value_counts()`|\n",
"\n",
"\n",
"> Note: The `unique`, `nunique`, and `value_counts` methods can only be applied to a Series (not a DataFrame)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Sorting\n",
"\n",
"Sorting a DataFrame based on the values in the column `'B'`:\n",
"```python\n",
"df.sort_values('B')\n",
"```\n",
"To sort in descending order, use the keyword argument `ascending=False`.\n",
"\n",
"Sorting a Series:\n",
"```python\n",
"series_a.sort_values()\n",
"```\n",
"\n",
"### Filtering\n",
"\n",
"To select a subset of rows with a filter:\n",
" - Create a filter (Boolean Series) using a comparison operator or other functions (such as the `isnull` method)\n",
" - Use the filter to extract the desired rows from the DataFrame\n",
"\n",
"Example: comparison operator\n",
"```python\n",
"warm = weather_data['Temperature (C)'] > 25\n",
"weather_warm = weather_data[warm]\n",
"```\n",
"\n",
"You can also use Boolean Series and DataFrames to count occurrences of specified criteria, for example: count the missing values in each column of a DataFrame with `df.isnull().sum()`\n",
"\n",
"\n",
"### Aggregation\n",
"\n",
"For basic aggregation operations, use the `groupby` method chained with an aggregation method (e.g., `sum`, `mean`, `sum`, `max`, etc.).\n",
"\n",
"For example, to find the sum totals of column `'population'` grouped by column `'region'`: `\n",
"```python\n",
"world.groupby('region')['population'].sum()\n",
"```\n",
"\n",
"You can also group by multiple columns:\n",
"```python\n",
"world.groupby(['region', 'income_group'])['population'].sum()\n",
"```\n",
"\n",
"For more complex aggregations, you can use the `agg` method.\n",
"- Specify a list of aggregation statistics, for example: \n",
"```python\n",
"world.groupby('region')['population'].agg(['sum', 'min', 'max'])\n",
"```\n",
"- Use a dictionary to specify different aggregation statistics for different columns, for example:\n",
"\n",
"```python\n",
"agg_dict = {'population' : 'sum', \n",
" 'life_expectancy' : ['min', 'max']}\n",
"world.groupby('region').agg(agg_dict)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"---\n",
"### Setting the Index of a DataFrame\n",
"\n",
"The index of a DataFrame can be set with the `set_index` method. For example, to set the column `'country'` as the index of DataFrame `world` and save the output as a new DataFrame:\n",
"```python\n",
"world_new = world.set_index('country')\n",
"```\n",
"\n",
"### Anatomy of a DataFrame\n",
"\n",
"![dataframe3](img/dataframe3.png)\n",
"\n",
"- Each cell in a DataFrame is associated with:\n",
" - A row position and column position (integers, starting from 0), and\n",
" - A row label and column label\n",
"- Row and column labels are part of the DataFrame's metadata, stored in attributes:\n",
" - `df.index` contains the labels for the DataFrame's rows\n",
" - `df.columns` contains the labels for the DataFrame's columns\n",
"- `df.values` contains the underlying data array\n",
"\n",
"### Selection Methods\n",
"\n",
"There are three main ways of selecting subsets of a DataFrame (or Series):\n",
"\n",
"
Using only square brackets `[]`
\n",
"
Positional indexing with `.iloc[]`
\n",
"
Label-based indexing with `.loc[]`
\n",
"\n",
"\n",
"\n",
"#### 1. Using only square brackets\n",
"|Bracket indexing||\n",
"|---|----|\n",
"|Use a Boolean Series to select rows matching some criteria | `df[bool_series]`|\n",
"|Use a list of column names to select several columns in a specified order | `df[['E', 'A', 'C']]`|\n",
"|Select a single column (as a Series) | `series_A = df['A']`|\n",
"|Select a single column (as a DataFrame) | `df_A = df[['A']]`|\n",
"\n",
"\n",
"#### 2. Positional indexing with `iloc`\n",
"\n",
"With `iloc`, we specify the row selection first, followed by column selection, separated by a comma. For example, selecting a single cell at row 2, column 4:\n",
"```python\n",
"df.iloc[2, 4]\n",
"```\n",
"\n",
"With `iloc` can select:\n",
"- A single cell\n",
"- A single row or column\n",
"- Slices of rows and/or columns\n",
" - As with Python lists, `iloc` slices are inclusive of the start bound and exclusive of the stop bound\n",
"- Multiple rows and/or columns in arbitrary order, using a list of positions\n",
"\n",
"A few examples:\n",
"```python\n",
"df.iloc[100:201:2, [5, 2, 4]]\n",
"df.iloc[[4, 2], 3:7]\n",
"df.iloc[175:, :]\n",
"```\n",
"\n",
"\n",
"#### 3. Label-based indexing with `loc`\n",
"\n",
"As with `iloc`, we specify the row selection first, followed by column selection, separated by a comma, but using the row and column labels instead of their positions. For example, selecting a single cell at a row labelled `'Algeria'` and a column labelled `'life_expectancy'`:\n",
"```python\n",
"world_new.loc['Algeria', 'life_expectancy']\n",
"```\n",
"\n",
"Similar to `iloc`, with `loc` we can select:\n",
"- A single cell\n",
"- A single row or column\n",
"- Slices of rows and/or columns\n",
" - However, unlike `iloc` and Python list slicing, `loc` slices are inclusive of the start bound but also **inclusive** of the stop bound.\n",
" - Example:\n",
"```python\n",
"world_new.loc['Canada':'Denmark', 'income_group':'gdp_per_capita']\n",
"```\n",
"- Multiple rows and/or columns in arbitrary order, using a list of labels\n",
"\n",
"We can also select rows and/or columns with a Boolean Series, and mix and match these with the other selection options listed above, for example:\n",
"```python\n",
"densely_populated = world_new['pop_density'] > 500\n",
"world_new.loc[densely_populated, ['region', 'pop_density', 'population']]\n",
"```"
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "all",
"notebook_metadata_filter": "all",
"text_representation": {
"extension": ".py",
"format_name": "percent",
"format_version": "1.2",
"jupytext_version": "1.0.3"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"nbsphinx": {
"execute": "never"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": true,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 2
}