Data Science with Real Data

Introductory Data Science and Machine Learning courses crave real world datasets to enhance student interest and enrich their learning experience. However, identifying, accessing and preparing real data can be a painstaking task. As a result, several foundational courses tend to rely on a similar subset of datasets. We hope to demonstrate that Data Commons can help increase the diversity of real world data used in such foundational courses taught across the world and enrich students’ (and instructors’) experience.

We make available a growing sample of data science course assignments, each built to illustrate key concepts at an introductory college level. In addition to revolving around core data science ideas, we use real world data provided by the Data Commons APIs with the aim of enhancing the pedagogical goals of each topic. Each assignment is implemented as a Python notebook. These notebooks are not teaching notes; they serve as self-contained templates for implementing, interpreting and/or analyzing a subset of core concepts. Each assignment revolves around a publicly available dataset, most often accessed directly through the Data Commons APIs.

Each assignment notebook should ideally be adapted to suit the needs of your curriculum and serve the needs of a complete and coherent course. We intend for them to serve as examples (templates) for you to customize extensively. We encourage course instructors and teaching assistants to use different datasets (and variables) for each iteration of their course. Luckily, Data Commons makes this easy.

All material is provided publicly and free of charge, under a Creative Commons license (CC BY). If you end up finding any of this material useful and would like to be notified of updates, do drop us a line.

FERPA Compliance

Data Commons collects no personally identifiable information (PII), records, or private information from users and can be used in compliance with FERPA. For specific questions about FERPA compliance, please contact your organization’s legal counsel for advice.

Why use this?

These materials were designed to:

  • Use Real Data. Via the Data Commons API, students engage with real world data from the get-go: no more stale, synthetic datasets.

  • Be Interactive. Each concept is illustrated with examples that instructors and teaching assistants can tinker with at minimal effort, allowing students to learn in a hands-on way.

  • Be Easy to Adapt. The notebook format and the Data Commons Python API make everything modular and easy to edit.

Who is this for?

Teachers, professors, instructors, teaching assistants, and anyone else developing and teaching data science curricula. We also believe early practitioners can benefit greatly from the exercises.

As an example, MIT’s large Introduction to Machine Learning course has adapted several of the examples covered in these notebooks to suit their pedagogical needs. From using the same datasets to dive deeper into the material, to modifying the data/variables to illustrate a similar effect, the adaptations span a wide spectrum.

How can these be used?

We strongly encourage you to change and adapt these notebooks to fit your needs! You can download any notebook in either .ipynb or .py format by opening its link and choosing File > Download.

Datasets can be changed by editing the list of variables queried (see the “Data Commons for Data Science” tutorial for more on this); editing framing and questions is as easy as editing text cells.
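As a rough illustration of what "editing the list of variables queried" can look like, here is a minimal sketch. It assumes the `datacommons_pandas` package and its `build_multivariate_dataframe(places, stat_vars)` helper; the notebooks themselves may use a different client or version, and the names `STAT_VARS`, `PLACES`, `fetch_with_data_commons`, and `load_dataset` are hypothetical, chosen only for this example.

```python
from typing import Callable, List

# Statistical variables to query, given by their Data Commons identifiers.
# Editing this list is what changes the dataset; these three are
# illustrative choices, not the ones used in any particular notebook.
STAT_VARS: List[str] = [
    "Count_Person",
    "Median_Income_Person",
    "UnemploymentRate_Person",
]

# Places to query, also given by identifier (here: California and Texas).
PLACES: List[str] = ["geoId/06", "geoId/48"]


def fetch_with_data_commons(places: List[str], stat_vars: List[str]):
    """Fetch a table with one row per place and one column per variable.

    Assumption: the `datacommons_pandas` package is installed and network
    access is available. It is imported lazily so the rest of this sketch
    can be read and run without it.
    """
    import datacommons_pandas as dc_pd

    return dc_pd.build_multivariate_dataframe(places, stat_vars)


def load_dataset(
    places: List[str],
    stat_vars: List[str],
    fetch: Callable = fetch_with_data_commons,
):
    """Load a dataset for the given places and variables.

    `fetch` is injectable so the data source can be swapped, for example
    with a cached file or a stub while drafting an assignment offline.
    """
    if not places or not stat_vars:
        raise ValueError("Need at least one place and one variable.")
    return fetch(places, stat_vars)
```

Under these assumptions, switching an assignment to a different dataset amounts to editing `STAT_VARS` (or `PLACES`) and calling `load_dataset(PLACES, STAT_VARS)` again; the rest of the notebook can stay unchanged.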

Some ideas:

  • Add additional cells for any additional topics you want covered.
  • For students with stronger programming skills, ask them to implement the methods covered on their own.
  • Some of the questions posed to students in the notebooks are open-ended; these can be adapted for discussion sessions with students.

Python Notebooks

  • Data Commons for Data Science Tutorial
    A quick tutorial introducing the key concepts of working with the Data Commons Python API. Great for familiarizing yourself with how to adapt datasets to your particular needs.

  • Feature Engineering
    Explores the first steps of any data science pipeline: feature selection, data visualization, preprocessing and standardization. Pairs well with “Classification and Model Evaluation”.

  • Classification and Model Evaluation
    Explores the second half of a data science pipeline: training and test splits, cross validation, metrics for model evaluation. Focus is on classification models. Pairs well with “Feature Engineering”.

  • Regression: Basics and Prediction
    An introduction to linear regression as a tool for prediction, from a modern machine learning perspective.

  • Regression: Evaluation and Interpretation
    A more in-depth look at linear regression, with an emphasis on interpreting model parameters and evaluation metrics beyond simple accuracy. Provides a more statistical perspective.

  • Clustering
    An introduction to clustering analysis for unsupervised learning. Explores the mechanics of K-means clustering and cluster interpretation.
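To give a flavor of two of the steps named in the list above (standardization and a training/test split), here is a small standard-library-only sketch. It is not taken from the notebooks, which may rely on other libraries; the function names and the numbers in the usage comment are made up for illustration.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def train_test_split(
    rows: Sequence, test_fraction: float = 0.25, seed: int = 0
) -> Tuple[list, list]:
    """Randomly hold out a fraction of rows for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1.")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Usage with placeholder numbers (not real Data Commons values):
# incomes = [50000.0, 60000.0, 70000.0, 80000.0]
# scaled = standardize(incomes)
# train, test = train_test_split(scaled, test_fraction=0.25)
```

For simplicity the sketch standardizes before splitting. A stricter design would compute the mean and standard deviation on the training rows only and reuse them on the test rows.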

What’s Next?

We’re looking forward to expanding topic coverage to include the basics of time series forecasting, value interpolation, synthetic controls (causal inference), and much more.


If you use any of this material, we would love to hear from you! Please share feedback with this form.