Data for Projects

Author

David Kane

This article lists data sources which students might consider using for their projects, either in my course or elsewhere.

First, Some Advice

  • Do not use a univariate time series for your project. Yes, a time series of the daily stock price for Apple over the last 40 years is interesting. But it is not the kind of interesting that works well for a project, at least in introductory classes.

  • Be wary of using a simple multivariate time series. Yes, a time series of, say, the daily stock price for all the stocks in the S&P 500 is interesting. Yet to do anything useful with data like this, you really need to study the material in a book like Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos. Since you haven’t, choose different data.

  • The best way to handle cross-sectional data with a time series component is to use Andrew Gelman’s “secret weapon.” Fit the data for each time period separately and then show all the estimates/graphics together.

  • More data is better than less data. Don’t attempt a project unless you have thousands of observations. Tens of thousands or more is better.
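
The "secret weapon" advice above can be sketched in a few lines. This is a minimal illustration, not Gelman's own code: the data are synthetic, the model is a one-predictor least-squares fit, and the names ols_slope and secret_weapon are made up for this example. The point is only the shape of the workflow: split by period, fit the same model to each piece, then display the estimates side by side.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (one predictor plus an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def secret_weapon(rows):
    """Fit the same model separately for each period.

    rows: iterable of (period, x, y) tuples.
    Returns a dict {period: estimated slope}, ordered by period.
    """
    by_period = {}
    for period, x, y in rows:
        xs, ys = by_period.setdefault(period, ([], []))
        xs.append(x)
        ys.append(y)
    return {p: ols_slope(*by_period[p]) for p in sorted(by_period)}


# Synthetic data, invented for illustration: the true slope starts at 1.0
# in 2000 and rises by 0.5 each year.
rng = random.Random(0)
rows = []
for i, year in enumerate(range(2000, 2005)):
    true_slope = 1.0 + 0.5 * i
    for _ in range(2000):
        x = rng.gauss(0, 1)
        y = true_slope * x + rng.gauss(0, 1)
        rows.append((year, x, y))

estimates = secret_weapon(rows)
for year, slope in estimates.items():
    # Crude text "plot" so every period can be compared at a glance.
    print(year, f"{slope:5.2f}", "#" * round(slope * 10))
```

In a real project the per-period fit would be whatever model you are using, and the final loop would be a proper graphic with uncertainty intervals, but the structure stays the same.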

Best Source

Data Collections

These are collections of data sources created and maintained by others.

General Data Sources

  • The General Social Survey. This is the highest-quality, longest-running survey of attitudes among Americans. To get started, check out Kieran Healy’s gssr R package.

  • Opportunity Insights seeks to “identify barriers to economic opportunity and develop scalable solutions that will empower people throughout the United States to rise out of poverty and achieve better life outcomes.”

  • American National Election Studies. Sadly, there is no Healy-equivalent among the political scientists.

Causal Data Sources

Most data sources do not easily lend themselves to an (unconfounded) causal interpretation. In other words, treatment assignment is rarely independent of potential outcomes, even when conditioning on covariates. However, many students want to do a project which involves estimating causal effects.
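
A small simulation can make the confounding problem concrete. Everything here is an assumption made for illustration: the data are synthetic, the true treatment effect is fixed at 2.0, and "ability" is a made-up confounder that raises both the outcome and, in the observational case, the chance of being treated. The sketch compares the simple difference in means under random assignment with the same comparison when units select into treatment.

```python
import random

TRUE_EFFECT = 2.0  # assumed constant treatment effect for every unit


def diff_in_means(units):
    """Mean observed outcome of treated units minus that of control units."""
    treated = [y for t, y in units if t == 1]
    control = [y for t, y in units if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate(randomized, n=20000, seed=1):
    """Return the naive difference-in-means estimate for one synthetic study."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        ability = rng.gauss(0, 1)       # hypothetical confounder
        y0 = ability + rng.gauss(0, 1)  # potential outcome without treatment
        y1 = y0 + TRUE_EFFECT           # potential outcome with treatment
        if randomized:
            # Assignment by coin flip: independent of the potential outcomes.
            treat = 1 if rng.random() < 0.5 else 0
        else:
            # Self-selection: higher-ability units are more likely to be treated.
            treat = 1 if ability + rng.gauss(0, 1) > 0 else 0
        # Only one potential outcome is ever observed for each unit.
        units.append((treat, y1 if treat else y0))
    return diff_in_means(units)


randomized_estimate = simulate(randomized=True)
observational_estimate = simulate(randomized=False)
print("true effect:           ", TRUE_EFFECT)
print("randomized estimate:   ", round(randomized_estimate, 2))
print("observational estimate:", round(observational_estimate, 2))
```

Under random assignment the estimate should land close to the true effect. Under self-selection it overstates the effect, because treated units would have had higher outcomes even without treatment.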

Besides some of the specific projects mentioned below, a good source is the CRAN package causaldata, which includes a variety of datasets from three causal inference textbooks: Nick Huntington-Klein (2021), The Effect; Scott Cunningham (2021, ISBN-13: 978-0-300-25168-5), Causal Inference: The Mixtape; and Miguel Hernán and James Robins (2020), Causal Inference: What If. Students interested in doing a project involving causal effects should start here.

Specific Projects with Interesting Data

  • “Student evaluations of teaching: teaching quantitative courses can be hazardous to one’s career” – link and data.

  • “Childhood cross-ethnic exposure predicts political behavior seven decades later: Evidence from linked administrative data” – link and data.

  • “Results from a 2020 field experiment encouraging voting by mail” – link and data.

Live Data

Beginning students should do projects in which the data is unchanging. Dealing with data that is regularly updated is hard. But, for more advanced (or ambitious) students, working with “live” data can be fun and challenging!