Data for Projects
This article lists data sources which students might consider using for their projects, either in my course or elsewhere.
First, Some Advice
Do not use a univariate time series for your project. Yes, a time series of the daily stock price for Apple over the last 40 years is interesting. But it is not the kind of interesting that works well for a project, at least in introductory classes.
Be wary of using a simple multivariate time series. Yes, a time series of, say, the daily stock price for all the stocks in the S&P 500 is interesting. Yet to do anything useful with data like this, you really need to study the material in a book like Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos. Since you haven’t, choose different data.
The best way to handle cross-sectional data with a time series component is to use Andrew Gelman’s “secret weapon.” Fit the data for each time period separately and then show all the estimates/graphics together.
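As a rough illustration of the "secret weapon," here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the data is simulated, the variable names (`x`, `y`, `year`) are hypothetical, and a one-predictor least-squares fit stands in for whatever model you would actually use. The point is only the workflow: one fit per time period, then all the estimates shown side by side.

```python
import math
import random
from statistics import mean

random.seed(0)

def simulate_year(year, n=500):
    """Hypothetical cross-section for one year; the true slope drifts upward over time."""
    true_slope = 0.5 + 0.1 * (year - 2000)
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [true_slope * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_slope(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (slope, standard error)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

# Fit each year separately, never pooling across years.
estimates = {}
for year in range(2000, 2010):
    xs, ys = simulate_year(year)
    estimates[year] = fit_slope(xs, ys)

# Show all the estimates together (a crude text plot stands in for a real graphic).
for year, (slope, se) in estimates.items():
    bar = "#" * round(slope * 20)
    print(f"{year}  slope = {slope:5.2f} (se {se:.2f})  {bar}")
```

With real data you would replace `simulate_year` with a filter on your own dataset and the text bars with a proper plot of estimates and uncertainty intervals across periods.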
More data is better than less data. Don’t attempt a project unless you have thousands of observations. Tens of thousands or more is better.
Best Source
The US Census is the best single source for student projects, both because the data is so rich and because it allows for the creation of beautiful maps. If you don’t have a particular topic in mind, you should use Census data. Check out Analyzing US Census Data: Methods, Maps, and Models in R by Kyle Walker, along with its associated tutorials.
Data Collections
These are collections of data sources created and maintained by others.
CRAN Task Views “aim to provide guidance which packages on CRAN are relevant for tasks related to a certain topic.” If you are interested in a topic like sports, finance, or the environment, then the relevant task view is a good place to start exploring.
Google Dataset Search is a search engine for datasets published across the web.
Awesome Public Datasets is what it says it is.
Data Is Plural is a weekly newsletter about, and large compilation of, interesting data sets.
Analyze Survey Data for Free by Anthony Damico provides “Forty-Five Public Microdatasets To Analyze Before You Die From An Easy To Type Website.”
General Data Sources
The General Social Survey. This is the highest-quality, longest-running survey of attitudes among Americans. To get started, check out Kieran Healy’s gssr R package.
Opportunity Insights seeks to “identify barriers to economic opportunity and develop scalable solutions that will empower people throughout the United States to rise out of poverty and achieve better life outcomes.”
American National Election Studies. Sadly, there is no Healy-equivalent among the political scientists.
Causal Data Sources
Most data sources do not easily lend themselves to an (unconfounded) causal interpretation. In other words, treatment assignment is rarely independent of potential outcomes, even when conditioning on covariates. However, many students want to do a project which involves estimating causal effects.
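To see why this matters, consider a small simulation, written as a sketch in Python with only the standard library. All of the numbers here are made up for illustration: a hypothetical covariate `u` drives the outcome, the true treatment effect is fixed at 2.0, and in the "confounded" case units with high `u` are more likely to be treated. A naive difference in means recovers the true effect only when treatment is assigned independently of the potential outcomes.

```python
import random
from statistics import mean

random.seed(1)

TRUE_EFFECT = 2.0   # assumed constant treatment effect (illustrative)
N = 20_000

def naive_estimate(confounded):
    """Difference in mean observed outcomes, treated minus control."""
    treated, control = [], []
    for _ in range(N):
        u = random.gauss(0, 1)            # covariate that also drives the outcome
        y0 = u + random.gauss(0, 1)       # potential outcome without treatment
        y1 = y0 + TRUE_EFFECT             # potential outcome with treatment
        if confounded:
            p = 0.8 if u > 0 else 0.2     # assignment depends on u, hence on y0 and y1
        else:
            p = 0.5                       # coin flip, independent of y0 and y1
        if random.random() < p:
            treated.append(y1)            # we only ever observe one potential outcome
        else:
            control.append(y0)
    return mean(treated) - mean(control)

randomized = naive_estimate(confounded=False)
observational = naive_estimate(confounded=True)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"randomized assignment:  {randomized:.2f}")
print(f"confounded assignment:  {observational:.2f}")
```

In this setup the confounded estimate overstates the effect because treated units started out with higher `u`. If `u` were measured, you could adjust for it; the trouble with most observational data is that you cannot be sure you have measured everything like `u`.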
Besides some of the specific projects mentioned below, a good source is the CRAN package causaldata, which includes a variety of datasets from three open books about causal inference: Huntington-Klein, Nick (2021) The Effect; Cunningham, Scott (2021, ISBN-13: 978-0-300-25168-5) Causal Inference: The Mixtape; and Hernán, Miguel and James Robins (2020) Causal Inference: What If. Students interested in doing a project involving causal effects should start here.
Specific Projects with Interesting Data
“Student evaluations of teaching: teaching quantitative courses can be hazardous to one’s career” – link and data.
“Childhood cross-ethnic exposure predicts political behavior seven decades later: Evidence from linked administrative data” – link and data.
“Results from a 2020 field experiment encouraging voting by mail” – link and data.
Live Data
Beginning students should do projects in which the data is unchanging. Dealing with data that is regularly updated is hard. But, for more advanced (or ambitious) students, working with “live” data can be fun and challenging!