{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture07 Manipulating Data and Joins" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Includes and Standard Magic...\n", "### Standard Magic and startup initializers.\n", "\n", "# Load Numpy\n", "import numpy as np\n", "# Load MatPlotLib\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "# Load Pandas\n", "import pandas as pd\n", "\n", "# This lets us show plots inline and also save PDF plots if we want them\n", "%matplotlib inline\n", "from matplotlib.backends.backend_pdf import PdfPages\n", "matplotlib.style.use('fivethirtyeight')\n", "\n", "# These two things are for Pandas, it widens the notebook and lets us display data easily.\n", "from IPython.core.display import display, HTML\n", "display(HTML(\"\"))\n", "\n", "# Show a ludicrus number of rows and columns\n", "pd.set_option('display.max_rows', 500)\n", "pd.set_option('display.max_columns', 500)\n", "pd.set_option('display.width', 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importance of using np.nan\n", "\n", "Note that we have to import numpy.nan but once we do we can use the groupby and other methods without having to worry about what to do with missing data.\n", "\n", "For the example below, what happens when we don't use NaNs in the data table?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make a data frame from a lists\n", "# Try replacing 'XX' with various values..\n", "\n", "df = pd.DataFrame({'age': [12.2, 11.0, 15.6, '--'],\n", " 'wgt_kg': [42.3, 40.8, 65.3, 84.2],\n", " 'hgt_cm': [145.1, 143.8, 165.3, 185.8],\n", " 'sex': ['male', 'female', 'male', 'male'],\n", " 'country': ['USA', 'AUS', 'EU', 'USA']})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['age'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['age'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Groupby Command\n", "\n", "Below we see what happens in Pandas when we use the groupby command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({'age': [12.2, 11.0, 15.6, 35.1],\n", " 'wgt_kg': [42.3, 40.8, 65.3, 84.2],\n", " 'hgt_cm': [145.1, 143.8, 165.3, 185.8],\n", " 'sex': ['male', 'female', 'male', 'male'],\n", " 'country': ['USA', 'AUS', 'EU', 'USA']})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby(['sex']).describe()\n", "\n", "# ?? Try describing it..?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can pass operators to groupby to get better results.\n", "df.groupby(['sex']).sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Can also group by multiple columns\n", "df.groupby(['sex', 'country']).describe()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = df.groupby(['sex', 'country'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get access to a group that we made, use the [get_group() command](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.get_group.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped.groups" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped.get_group(('female', 'AUS'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also get access to the actual tuples and they're types that are generated by the `groupby` commands" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "grouped.groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joins in Pandas\n", "\n", "Let's see a few [joins in Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) and how to execute them.\n", "\n", "**Note:** We can also use the [merge command in pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) to joins as well. We'll explore this more fully in the lab. The key idea is that merge let's us pick columns to do the merge/join while join will always join on the index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Careful here! We have to set the index!\n", "\n", "df1 = pd.DataFrame([(1, 'foo', 3), (2, 'bar', 2), (3, 'foo', 4), (4, 'foo', 3)],\n", " columns=['ID','A', 'B'])\n", "df1.set_index('ID', inplace=True)\n", "display(df1)\n", "\n", "df2 = pd.DataFrame([(1, 1.2), (2, 2.5), (3, 2.3), (5, 8.0)], \n", " columns=['ID','C'])\n", "df2.set_index('ID', inplace=True)\n", "display(df2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inner join\n", "df1.join(df2, how='inner')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Can do inner, outter, left, right.\n", "df1.join(df2, how='right')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we don't have the same column names we can tell Pandas that we want to join on a specific column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame([(1, 'foo', 3), (2, 'bar', 2), (3, 'foo', 4), (4, 'foo', 3)],\n", " columns=['leftid','A', 'B'])\n", "display(df1)\n", "\n", "df2 = pd.DataFrame([(1, 1.2), (2, 2.5), (3, 2.3), (5, 8.0)], \n", " columns=['rightid','C'])\n", "df2.set_index('rightid', inplace=True)\n", "display(df2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inner join\n", "df1.join(df2, how='inner')\n", "# Is this what we wanted?!?!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We could have set the index but we can also tell Pandas to do it for us.\n", "df1.join(df2, how='inner', on='leftid')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas Join\n", "\n", "Or we can skip all this foolishness by using the [pandas merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) command..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Or we can skip all this foolishness by using the merge command...\n", "df1 = pd.DataFrame([(1, 'foo', 3), (2, 'bar', 2), (3, 'foo', 4), (4, 'foo', 3)],\n", " columns=['ID','A', 'B'])\n", "display(df1)\n", "\n", "df2 = pd.DataFrame([(1, 1.2), (2, 2.5), (3, 2.3), (5, 8.0)], \n", " columns=['ID','C'])\n", "display(df2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df1.merge(df2, left_on=\"ID\", right_on=\"ID\", how='inner')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }