Introduction to Data Science Fall 2019
CMPS-3660-02: Introduction to Data Science - Fall 2019 - 3 Credit Hours
Time and Location
- Lectures: Tuesdays and Thursdays
- Room: Stanley Thomas 302 (Building 10)
- Time: 1100 - 1215
- Webpage: https://rebrand.ly/TUDataScience
Instructor and TA Information
Instructor: Dr. Nicholas Mattei
- Email: nsmattei@tulane.edu
- Office: Stanley Thomas 402B
- Office Hours: T 1400-1500 and R 1600-1700
Teaching Assistant: Arie Glazier
- Email: aglazier@tulane.edu
- Office: Stanley Thomas 309
- Office Hours: MW 1400-1500
Catalog / Course Description
Technically this course is CMPS 3660: Special Topics in Computer Science (1-3 Credit Hours)
This course varies from time to time, focusing on topics of interest to the faculty and students.
But in reality this course is: CMPS 3660-02: Introduction to Data Science
Prerequisite: Concurrent with CMPS 2200 or consent of instructor.
Course Objectives and Overview:
Data Science is an interdisciplinary set of topics that includes everything you need to create data-driven answers and solutions to specific business, scientific, or sociological questions. The goal of data science is to improve decision making based on insights from data. As a field, Data Science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from datasets.
This course will cover:
- Data management systems;
- Exploratory and statistical data analysis;
- Data and information visualization; and
- Presentation and communication of results.
The course will use Python and be largely project and case study driven with students expected to analyze real datasets and post an analysis/tutorial publicly on GitHub at the end of the course.
Students should be comfortable programming in at least one language (preferably Python) and should have had a reasonable amount of math, including some linear algebra and algorithms. We’re going to dissect some algorithms later in the class, including PCA, so you should know what a matrix is and be comfortable with everything in CMPS/MATH 2170.
We’ll be using Anaconda and a number of packages, including NumPy, SciPy, scikit-learn, and pandas. You’ll be turning in projects using Git and GitHub, so make sure you have a GitHub account!
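As a quick sanity check after installing Anaconda, a short script along these lines can confirm that the packages are importable. This is a minimal sketch, not part of the official setup assignment: the function name `check_packages` is hypothetical, and it assumes the "SciKit" package refers to scikit-learn, whose import name is `sklearn`.

```python
import importlib.util

# Import names for the packages named in the syllabus.
# Assumption: the SciKit package is scikit-learn, imported as "sklearn".
COURSE_PACKAGES = ["numpy", "scipy", "sklearn", "pandas"]


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(COURSE_PACKAGES).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any package reported as missing can usually be installed from the Anaconda distribution before the first assignment.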
Course Learning Outcomes:
At the conclusion of this course students will be able to:
- Open, load, and manipulate data using industry standard tools.
- Have a basic understanding of data management and storage systems.
- Be able to clean data and perform basic statistical analysis of the data including visualization.
- Have an understanding of statistical hypothesis testing including t-tests and bootstrapped confidence intervals.
- Be able to use one or more machine learning algorithms for classification and regression.
- Be able to present the results of a complete data analysis in written, visual, and oral form.
Note to Students: This is a brand new course and we’re trying lots of new things! Please work with us on making this course a success!
Program-Level Outcomes:
This course counts as one of the CMPS 3000-level or above courses required for the coordinate major in computer science. Students need to complete three such courses to fulfill the requirements for a coordinate major. For more information on the coordinate major, please see the requirements at the Registrar’s website.
Required and Suggested Student Resources
There is no required textbook for this course. However, we will make extensive use of online textbooks and articles for the required reading that you will be quizzed on. You will also need access to a computer on which you can install the required software (Anaconda, Docker). If you do not have access to a computer, please see me ASAP.
Online Books:
- Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas. O’Reilly Media Inc., 2016. Available online for free at: https://github.com/jakevdp/PythonDataScienceHandbook
- The entire book is also available as notebooks, with examples, on this GitHub page.
- Computational and Inferential Thinking: The Foundations of Data Science, Ani Adhikari and John DeNero. A free online textbook that includes interactive Jupyter notebooks and public data sets for all examples at: https://www.inferentialthinking.com/chapters/intro
Evaluation Procedures and Grading Criteria
This course will consist of four distinct grading areas. Note that all point values described below for individual assignments are subject to change; the area percentages will remain the same.
- 15% - Weekly Questions
- 40% - Mini-Projects and Labs
- 15% - Midterm Exam
- 30% - Final Project
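The area percentages above combine into a single weighted average. The sketch below is illustrative only: it assumes each area's score is expressed as a fraction of the points available in that area (points earned divided by points possible), which the syllabus does not spell out, and the names `WEIGHTS` and `weighted_average` are hypothetical.

```python
# Area weights copied from the list above; they sum to 1.0.
WEIGHTS = {
    "weekly_questions": 0.15,
    "mini_projects_and_labs": 0.40,
    "midterm_exam": 0.15,
    "final_project": 0.30,
}


def weighted_average(area_scores):
    """Combine per-area scores (fractions in [0, 1]) into a course percentage."""
    missing = set(WEIGHTS) - set(area_scores)
    if missing:
        raise ValueError(f"missing areas: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[area] * area_scores[area] for area in WEIGHTS)


# Made-up scores for illustration, not real student data.
example_scores = {
    "weekly_questions": 0.90,
    "mini_projects_and_labs": 0.85,
    "midterm_exam": 0.80,
    "final_project": 0.95,
}
print(round(weighted_average(example_scores), 2))
```

Because the weights are fixed, changing the point value of an individual assignment only shifts its share within its own area.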
Weekly Questions: I will post a multiple choice and short answer quiz (about) every Tuesday on Canvas which will be due the following Tuesday before midnight. The questions will cover various topics discussed in class as well as items from the readings that will be assigned. (~10 points each)
Mini-Projects: There will be between 4 and 6 “mini-projects” assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The projects will be posted at https://github.com/TulaneIntroDataScience/fall2019 and will be assigned in Canvas. The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry. Posting solutions publicly online without the staff’s express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted. (~25 points for the first project, ~100 for all others.)
Labs: We will run labs in class a couple of times throughout the semester to give you an opportunity to work through problems hands-on with me and the TAs. On these days it will be important to bring a laptop to class to participate in the work. Labs will be worth ~10 points each and graded based on completion.
Midterm Exam: This will be a written, closed-book, in-class exam. You are allowed one hand-written cheat sheet, front and back, on 8.5x11 inch paper. The cheat sheet will be turned in with your exam.
Final Project: In the interest of building students’ public portfolios, and in the spirit of “learning by doing”, students will create a self-contained online tutorial to be posted publicly and a 5-minute presentation in class. This tutorial can be created individually or in a small group (Max 2 people). This assignment will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data. We will have several milestones associated with the final project including the following.
- Identifying a dataset and establishing a GitHub.io Site. (~25 points)
- Extract, Transform, and Load (ETL) + Exploratory Data Analysis (EDA): Your notebook from Part 1, expanded to include the data being loaded and to show that you have figured out how to get the data into your system. In addition, include some graphs, visualizations, and stats that show you can manipulate your data and understand the dataset you are working with. (~50 points)
- A final, in-class presentation. (~50 points)
- A final tutorial and website. (~75 points)
Late Work Policy: All work must be turned in on time unless explicit consent for extenuating circumstances is given beforehand (or, in the case of illness, with a documented absence afterward). Any late work will be penalized at 10% of the total assignment value per day up to 5 days late, after which it will not be accepted. The exception to this rule is when solutions are presented in class (as is the case with labs and quizzes). After the solutions are presented in class, no late work will be accepted.
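The late penalty arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the official grading script: it assumes lateness is counted in whole days, it does not model the in-class-solutions exception, and the function name `late_score` is hypothetical.

```python
MAX_DAYS_LATE = 5  # after this many days the work is not accepted


def late_score(raw_points, max_points, days_late):
    """Apply a penalty of 10% of the assignment's total value per day late.

    Assumption: days_late is a whole number of days (0 means on time).
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > MAX_DAYS_LATE:
        return 0.0  # too late: the work is not accepted
    penalty = max_points * days_late / 10
    return max(0.0, raw_points - penalty)


# An assignment worth 100 points, earning 90 before the penalty, turned in 2 days late.
print(late_score(90, 100, 2))
```

Note that the penalty is taken from the assignment's total value, not from the points earned, so a low-scoring submission can reach zero before the 5-day cutoff.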
Final Grade Policy: The weighted average will determine your letter grade roughly as follows; +/- grades will be given for borderline cases.
- A >= 90%
- B >= 80%
- C >= 70%
- D >= 60%
- F < 60%
All grades will be posted on Canvas throughout the semester.
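The letter-grade cutoffs above can be sketched as a simple lookup. This is a rough illustration only: the policy says grades are assigned "roughly" by these cutoffs, and the +/- adjustment for borderline cases is at the instructor's discretion, so it is not modeled here. The names `CUTOFFS` and `letter_grade` are hypothetical.

```python
# (minimum percentage, letter) pairs from the Final Grade Policy, highest first.
CUTOFFS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]


def letter_grade(percent):
    """Map a weighted course percentage to a base letter grade (no +/-)."""
    for cutoff, letter in CUTOFFS:
        if percent >= cutoff:
            return letter
    return "F"


for pct in (93.5, 80.0, 59.9):
    print(pct, letter_grade(pct))
```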
Attendance
Students are required to attend all classes and labs unless they are ill or prevented from attending by exceptional circumstances and with a valid excuse note. Students are responsible for notifying instructors about absences that result from serious illnesses, injuries, or critical personal problems. Students with frequent absences will be reported and/or removed from the course according to university policy.
Use of Electronic Devices
Please silence your cellphones during class. If you want to use a laptop or other device with a large screen for note taking, please sit in the back rows of the classroom, because screens are distracting to other students: https://www.scientificamerican.com/article/students-are-better-off-without-a-laptop-in-the-classroom/
Note: There will be a few “interactive lab days” where we will explicitly encourage you to bring your laptop (if available) and follow along with the tutorial.
Schedule
This schedule is subject to change throughout the semester, so please check it often. This class and the assignments are maintained in a public GitHub repository, so you can access any of these whenever you want.
- Link to Slides Directory
- Link to Demonstration Notebooks
- Links to Labs, Projects, and Final Tutorial
Additional Resources for Data Science Students
Over the course of the semester I will update this section with additional links and resources that may be useful for you.
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython Wes McKinney, O’Reilly Media, 2017. Code and notebooks but not text available on GitHub
- Data Science from Scratch: First Principles with Python, Joel Grus. O’Reilly Media, 2015. Code but not text available on GitHub
- Introduction to Machine Learning with Python: A Guide for Data Scientists, Andreas C. Müller and Sarah Guido, O’Reilly Media, 2016. Code but not text available on GitHub
These books come from a more statistical background and are mainly taught in R; however, they are considered to be some of the best texts for Statistics and Data Science. The first is an introduction; the second is appropriate for a graduate course.
- An Introduction to Statistical Learning with Applications in R. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Springer. Available online for free at the authors’ website.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer. Available online for free at the authors’ website.
Some fun readings about Data Science and some of its key figures.
- John W. Tukey: His Life and Professional Contributions. David R. Brillinger. The Annals of Statistics Vol 30, No. 6, 2002.
- 50 Years of Data Science. David Donoho. Manuscript Based on Invited Talk, Princeton 2015.
A large debt for this course is owed to John P. Dickerson at UMD and his course (http://jpdickerson.com/).
- John P. Dickerson’s DS Class at UMD
- Dennis Sun’s DS Course at CalPoly
- Data8 Resources (Berkeley Data Science Course)
- Other Berkeley Resources:
- Course and textbook from Alan Downey at Olin College.
- Zico Kolter’s course at Carnegie Mellon University
- Setting up a simple GitHub Pages website with Markdown
- Some random nice websites with tutorials and examples for DS.
- A couple of Data Science and Machine Learning interview questions. The course covers almost the entire set of DS questions (minus the R questions).
ADA / Accessibility Statement
Any students with disabilities or other needs who require special accommodations in this course are invited to share these concerns or requests with the instructor and should contact the Goldman Center for Student Accessibility: http://accessibility.tulane.edu or 504.862.8433.
Code of Academic Conduct and Academic Integrity
This course will follow Tulane’s Code of Academic Conduct. Cheating will be reported to the Associate Dean of Newcomb-Tulane College. Discussion is encouraged. However, what you turn in must be your own. You may not read another classmate’s solutions or copy a solution from the web. I will be running plagiarism checks on the code turned in. If plagiarism is detected, the minimum penalty is a 0 on the assignment and a report to the Associate Dean; in addition, you may automatically fail this course at my discretion.
The Code of Academic Conduct applies to all undergraduate students, full-time and part-time, at Tulane University. Tulane University expects and requires behavior compatible with its high standards of scholarship. By accepting admission to the university, a student accepts its regulations (i.e., Code of Academic Conduct and the Code of Student Conduct) and acknowledges the right of the university to take disciplinary action, including suspension or expulsion, for conduct judged unsatisfactory or disruptive.
Title IX
Tulane University recognizes the inherent dignity of all individuals and promotes respect for all people. As such, Tulane is committed to providing an environment free of all forms of discrimination including sexual and gender-based discrimination, harassment, and violence like sexual assault, intimate partner violence, and stalking. If you (or someone you know) has experienced or is experiencing these types of behaviors, know that you are not alone. Resources and support are available: you can learn more at http://allin.tulane.edu. Any and all of your communications on these matters will be treated as either “Confidential” or “Private” as explained in the chart below. Please know that if you choose to confide in me I am mandated by the university to report to the Title IX Coordinator, as Tulane and I want to be sure you are connected with all the support the university can offer. You do not need to respond to outreach from the university if you do not want. You can also make a report yourself, including an anonymous report, through the form http://tulane.edu/concerns.
Confidential | Private |
---|---|
Except in extreme circumstances, involving imminent danger to one’s self or others, nothing will be shared without your explicit permission. | Conversations are kept as confidential as possible, but information is shared with key staff members so the University can offer resources and accommodations and take action if necessary for safety reasons. |
Counseling and Psychological Services (CAPS): (504) 314-2277 or The Line (24/7): (504) 264-6074 | Case Management and Victim Support Services: (504) 314-2160 or srss@tulane.edu |
Student Health Center: (504) 865-5255 | Tulane University Police (TUPD): Uptown - (504) 865-5911. Downtown – (504) 988-5531 |
Sexual Aggression Peer Hotline and Education (SAPHE): (504) 654-9543 | Title IX Coordinator: (504) 314-2160 or msmith76@tulane.edu |