Introduction to Data Science Fall 2019

CMPS-3660-02: Introduction to Data Science - Fall 2019 - 3 Credit Hours


Time and Location

Instructor and TA Information

Instructor: Dr. Nicholas Mattei

Teaching Assistant: Arie Glazier

Catalog / Course Description

Technically this course is CMPS 3660: Special Topics in Computer Science (1-3 Credit Hours)

This course varies from time to time, focusing on topics of interest to the faculty and students.

But in reality this course is: CMPS 3660-02: Introduction to Data Science

Prerequisite: Concurrent with CMPS 2200 or consent of instructor.

Course Objectives and Overview:

Data Science is an interdisciplinary set of topics that includes everything you need to create data-driven answers and solutions to specific business, scientific, or sociological questions. The goal of data science is to improve decision making based on insights from data. As a field, Data Science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from datasets.

This course will cover:

  1. Data management systems;
  2. Exploratory and statistical data analysis;
  3. Data and information visualization; and
  4. Presentation and communication of results.

The course will use Python and be largely project- and case-study-driven, with students expected to analyze real datasets and post an analysis/tutorial publicly on GitHub at the end of the course.

Students should be comfortable programming in at least one language (preferably Python) and have had a reasonable amount of math, including some linear algebra and algorithms. We’re going to dissect some algorithms later in the class, including PCA, so you should know what a matrix is and be comfortable with everything in CMPS/MATH 2170.

We’ll be using Anaconda and a number of packages including NumPy, SciPy, scikit-learn, and pandas. You’ll be turning in projects using GitHub and Git, so make sure you have an account!

Course Learning Outcomes:

At the conclusion of this course students will be able to:

Note to Students: This is a brand new course and we’re trying lots of new things! Please work with us on making this course a success!

Program-Level Outcomes:

This course fulfills the requirement of one of the CMPS 3000-level or above courses required for the coordinate major in computer science. Students need to complete three such courses in order to complete the requirements for a coordinate major. For more information on the coordinate major, please see the requirements at the Registrar’s Website.

Required and Suggested Student Resources

There is no required textbook for this course. However, we will make extensive use of online textbooks and articles for the required reading that you will be quizzed on. You will also need access to a computer on which you can install the required software (Anaconda, Docker). If you do not have access to a computer, please see me ASAP.

Online Books:

Evaluation Procedures and Grading Criteria

This course will consist of four distinct grading areas. Note that all point values described below for individual assignments are subject to change; the area percentages will remain the same.

Weekly Questions: I will post a multiple choice and short answer quiz (about) every Tuesday on Canvas which will be due the following Tuesday before midnight. The questions will cover various topics discussed in class as well as items from the readings that will be assigned. (~10 points each)

Mini-Projects: There will be between 4 and 6 “mini-projects” assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The projects will be posted at https://github.com/TulaneIntroDataScience/fall2019 and will be assigned in Canvas. The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry. Posting solutions publicly online without the staff’s express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted. (~25 points for the first project, ~100 for all others)

Labs: We will run labs in class a couple of times throughout the semester to give you an opportunity to work through problems hands-on with me and the TAs. On these days it will be important to bring a laptop to class to participate in the work. Labs will be worth ~10 points each and graded based on completion.

Midterm Exam: This will be a written, closed-book, in-class exam. You are allowed one hand-written cheat sheet, front and back, on 8.5x11 inch paper. The cheat sheet will be turned in with your exam.

Final Project: In the interest of building students’ public portfolios, and in the spirit of “learning by doing”, students will create a self-contained online tutorial to be posted publicly and a 5-minute presentation in class. This tutorial can be created individually or in a small group (Max 2 people). This assignment will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data. We will have several milestones associated with the final project including the following.

  1. Identifying a dataset and establishing a GitHub.io Site. (~25 points)
  2. Extract, Transform, and Load (ETL) + Exploratory Data Analysis (EDA): Your notebook from Part 1, expanded to load the data and show that you have figured out how to get the data into your system. In addition, include some graphs, visualizations, and stats that show you can manipulate your data and understand the dataset you are working with. (~50 points)
  3. A final, in-class presentation. (~50 points)
  4. A final tutorial and website. (~75 points)
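
As a rough illustration of the kind of work Milestone 2 asks for, here is a minimal sketch of loading a small dataset, cleaning it, and computing a few summary statistics. The data, the column names (`station`, `temp_f`), and the helper names (`load_rows`, `summarize`) are all made up for illustration. The sketch uses only the Python standard library so it runs anywhere; in the course you would more likely do the same steps with pandas.

```python
import csv
import io
import statistics

# Made-up toy data standing in for a real dataset you would scrape or download.
RAW_CSV = """station,temp_f
A,70
B,80
C,
D,90
"""


def load_rows(text):
    """Extract + transform: parse CSV text, drop missing readings, cast to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if record["temp_f"] == "":
            continue  # skip rows with a missing temperature
        rows.append({"station": record["station"], "temp_f": float(record["temp_f"])})
    return rows


def summarize(rows):
    """Explore: compute a few basic summary statistics of the temperature column."""
    temps = [row["temp_f"] for row in rows]
    return {
        "count": len(temps),
        "mean": statistics.mean(temps),
        "min": min(temps),
        "max": max(temps),
    }


rows = load_rows(RAW_CSV)
print(summarize(rows))
```

A real milestone notebook would add plots and a short discussion of what the numbers mean for your dataset.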

Late Work Policy: All work must be turned in on time unless explicit consent for extenuating circumstances is given beforehand (or, in the case of illness, with a documented absence afterward). Any late work will be penalized at 10% of the total assignment value per day for up to 5 days late, after which it will not be accepted. The exception to this rule is when solutions are presented in class (as is the case with labs and quizzes). After solutions are presented in class, no late work will be accepted.
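
To make the late penalty arithmetic concrete, here is a small sketch of how a late score could be computed under this policy. The function name `late_score` and its parameters are hypothetical, and the sketch assumes that lateness is counted in whole days and that a penalized score never drops below zero; neither assumption is stated in the policy itself.

```python
def late_score(raw_score, max_points, days_late, solutions_presented=False):
    """Return the score after applying the late penalty.

    Assumptions (not from the syllabus): days_late is a whole number of days,
    and the penalized score is floored at zero.
    """
    if days_late <= 0:
        return float(raw_score)  # on time, no penalty
    if solutions_presented or days_late > 5:
        return 0.0  # work is no longer accepted
    # 10% of the total assignment value per day late
    penalty = max_points * 10 * days_late / 100
    return max(0.0, raw_score - penalty)


# A 100-point project that earned 90 points but was turned in 2 days late.
print(late_score(90, 100, 2))
```

In this example the penalty is 20 points (2 days at 10% of 100 points), so the recorded score would be 70.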

Final Grade Policy: The weighted average will determine your letter grade roughly as follows; +/- grades will be given for borderline cases.

All grades will be posted on Canvas throughout the semester.

Attendance

Students are required to attend all classes and labs unless they are ill or prevented from attending by exceptional circumstances and with a valid excuse note. Students are responsible for notifying instructors about absences that result from serious illnesses, injuries, or critical personal problems. Students with frequent absences will be reported and/or removed from the course according to university policy.

Use of Electronic Devices

Please silence your cellphones during class. If you want to use a laptop or other device with a large screen for note taking, please sit in the back rows of the classroom, since screens are distracting to other students (see https://www.scientificamerican.com/article/students-are-better-off-without-a-laptop-in-the-classroom/).

Note: There will be a few “interactive lab days” where we will explicitly encourage you to bring your laptop (if available) and follow along with the tutorial.

Schedule

This schedule is subject to change throughout the semester, so please check it often. This class and the assignments are maintained in a public GitHub repository, so you can access any of these whenever you want.

| Week | Date | Topic / Slides | Extra Resources | Required Readings | Assignments |
|---|---|---|---|---|---|
| 1 | 8/27 | What is Data Science | | Economist Article on Python; FiveThirtyEight - What the Fox Knows | Project 0 Posted - Setting up your Environment; Question Set 1 Out (Canvas) |
| | 8/29 | Tools & Python | Basic Notebook and Markdown | Getting Started with Anaconda | |
| 2 | 9/3 | What is Data? / Intro to Notebooks | Simple Data and Graphing Notebook | | Question Set 1 Due; Question Set 2 Out (Canvas) |
| | 9/5 | Intro to Git | | Git Workflows Overview; Intro to Docker | Project 0 Due (Canvas) |
| 3 | 9/10 | Lab Day: Hands on Pandas | Lab 1; Lab 2 | Introduction to Pandas | Question Set 2 Due; 3 Out (Canvas); Tutorial Milestone 1 Out |
| | 9/12 | Scraping Data | Scraping Notebook; Tutorial on Beautiful Soup | What happens when you type google.com into your browser’s address box and press enter? | Lab 1 + 2 Due (Canvas) |
| 4 | 9/17 | Visualizing Data - Prof. Summa | Lab 3 | | Question Set 3 Due; 4 Out |
| | 9/19 | Lab Day: Manipulating and Filtering Data (Arie Glazier) | Lab 4; Lab 5 | | Lab 3 Due (Canvas); Project 1 Posted - Fly Me To The Moon |
| 5 | 9/24 | Lab Day: Filtering Data / Review Old Labs | | | Question Set 4 Due; 5 Out (Canvas); Lab 4 + 5 Due (Canvas) |
| | 9/27 | Munging and Tidy Data I | Munging and Tidy Data Notebook; Hadley Wickham. “Tidy Data.” | Hould, Tidy Data in Python | |
| 6 | 10/1 | Munging and Tidy Data II + Lab Day! | Lab 6; Pandas Tutorials | | Question Set 5 Due |
| | 10/3 | Munging and Tidy Data III + Lab Day! | Lab 7; pandassql; SQLite | | Lab 6 + Lab 7 Due (Sunday!) (Canvas) |
| 7 | 10/8 | Midterm Exam - In Class | | | |
| | 10/10 | Fall Break - No Class | | | |
| 8 | 10/15 | Test Review + Ethical and Legal Issues - Prof. Bock, Tulane Law. | | What Does GDPR Mean For Me? | Tutorial Milestone 1 Due (Canvas); Question Set 6 Out (Canvas); Tutorial Milestone 2 Out |
| | 10/17 | Missing Data and Entity Resolution I | Entity Resolution Tutorial, VLDB 2012 | | Project 1 Due (Canvas); Project 2 Posted - Moneyball |
| 9 | 10/22 | Entity Resolution + Lab Review | | | Question Set 6 Due; 7 Out (Canvas) |
| | 10/24 | Missing Data | Missing Data and Linear Regression Notebook | Five Thirty Eight: Science Isn’t Broken | |
| 10 | 10/29 | Missing Data II | | | Question Set 7 Due; 8 Out (Canvas) |
| | 10/31 | Lab Day: Relationships Between Variables and Observations | Lab 8 + 9 | | |
| 11 | 11/5 | Summary Stats I | | | Question Set 8 Due; 9 Out (Canvas) |
| | 11/7 | Summary Stats II / Machine Learning I | | Introduction to Machine Learning (From: A Course in Machine Learning by Hal Daumé III) | Project 2 Due (Canvas); Project 3 Posted - Data and Maps!; Grad Project 4 Posted - Contextualizing Data Science! |
| 12 | 11/12 | Machine Learning II | | | Question Set 9 Due; 10 Out (Canvas); Tutorial Milestone 2 Due (Canvas) |
| | 11/14 | Lab Day: Hands on Machine Learning | Lab 10 + 11 | | |
| 13 | 11/19 | Decision Trees | | | Question Set 10 Due |
| | 11/21 | Decision Trees II | | | |
| 14 | 11/26 | Catchup / Course Wrap Up | Bonus Lab 12 Posted | | Project 3 Due (Canvas) |
| | 11/28 | Thanksgiving Break - No Class | | | |
| 15 | 12/3 | Final Presentations I | | | Question Set 11 Due; Final Tutorial Rubric |
| | 12/3 | Final Presentations II | | | Tutorial Slides Due (Canvas) |
| Final | 12/8 | Final Links Due | | | Final Tutorial Due (Canvas); Final Grad Writeups Due (Canvas) |

Additional Resources for Data Science Students

Over the course of the semester I will update this section with additional links and resources that may be useful for you.

These books come from a more statistical background and are mainly taught in R; however, they are considered to be some of the best texts for Statistics and Data Science. The first is an introduction; the second is appropriate for a graduate course.

Some fun readings about Data Science and some key figures.

A large debt for this course is owed to John P. Dickerson at UMD and his course (http://jpdickerson.com/).

ADA / Accessibility Statement

Any students with disabilities or other needs, who need special accommodations in this course, are invited to share these concerns or requests with the instructor and should contact Goldman Center for Student Accessibility: http://accessibility.tulane.edu or 504.862.8433.

Code of Academic Conduct and Academic Integrity

This course will follow Tulane’s Code of Academic Conduct. Cheating will be reported to the Associate Dean of Newcomb-Tulane College. Discussion is encouraged. However, what you turn in must be your own. You may not read another classmate’s solutions or copy a solution from the web. I will be running plagiarism checks on the code turned in. If plagiarism is detected, the minimum penalty is a 0 on the assignment and a report to the Associate Dean; however, you may automatically fail this course at my discretion.

The Code of Academic Conduct applies to all undergraduate students, full-time and part-time, at Tulane University. Tulane University expects and requires behavior compatible with its high standards of scholarship. By accepting admission to the university, a student accepts its regulations (i.e., Code of Academic Conduct and the Code of Student Conduct) and acknowledges the right of the university to take disciplinary action, including suspension or expulsion, for conduct judged unsatisfactory or disruptive.

Title IX

Tulane University recognizes the inherent dignity of all individuals and promotes respect for all people. As such, Tulane is committed to providing an environment free of all forms of discrimination including sexual and gender-based discrimination, harassment, and violence like sexual assault, intimate partner violence, and stalking. If you (or someone you know) has experienced or is experiencing these types of behaviors, know that you are not alone. Resources and support are available: you can learn more at http://allin.tulane.edu. Any and all of your communications on these matters will be treated as either “Confidential” or “Private” as explained in the chart below. Please know that if you choose to confide in me I am mandated by the university to report to the Title IX Coordinator, as Tulane and I want to be sure you are connected with all the support the university can offer. You do not need to respond to outreach from the university if you do not want to. You can also make a report yourself, including an anonymous report, through the form at http://tulane.edu/concerns.

| Confidential | Private |
|---|---|
| Except in extreme circumstances, involving imminent danger to one’s self or others, nothing will be shared without your explicit permission. | Conversations are kept as confidential as possible, but information is shared with key staff members so the University can offer resources and accommodations and take action if necessary for safety reasons. |
| Counseling and Psychological Services (CAPS): (504) 314-2277 or The Line (24/7): (504) 264-6074 | Case Management and Victim Support Services: (504) 314-2160 or srss@tulane.edu |
| Student Health Center: (504) 865-5255 | Tulane University Police (TUPD): Uptown - (504) 865-5911. Downtown - (504) 988-5531 |
| Sexual Aggression Peer Hotline and Education (SAPHE): (504) 654-9543 | Title IX Coordinator: (504) 314-2160 or msmith76@tulane.edu |