Data for Projects


David Kane

This article lists data sources that students might consider using for their projects, either in my course or elsewhere.

First, Some Advice

  • Do not use a univariate time series for your project. Yes, a time series of the daily stock price for Apple over the last 40 years is interesting. But it is not the same sort of interesting that works well with a project, at least in introductory classes.

  • Be wary of using a simple multivariate time series. Yes, a time series of, say, the daily stock price for all the stocks in the S&P 500 is interesting. Yet to do anything useful with data like this, you really need to study the material in a book like Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos. Since you haven’t, choose different data.

  • The best way to handle cross-sectional data with a time series component is to use Andrew Gelman’s “secret weapon.” Fit the data for each time period separately and then show all the estimates/graphics together.

  • More data is better than less data. Don’t attempt a project unless you have thousands of observations. Tens of thousands or more is better.
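
The "secret weapon" advice above can be made concrete with a short sketch. The Python below fits the same simple regression separately within each time period and then lists the estimates side by side. Everything here is an assumption for illustration: the data is synthetic, and the helper names ols_slope and fit_by_period are hypothetical, not taken from any package. In a real project you would swap in your own data and model, and likely plot the estimates rather than print them.

```python
import random
from collections import defaultdict


def ols_slope(xs, ys):
    """Return the least-squares slope from regressing ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def fit_by_period(rows):
    """Fit the same model separately within each period.

    rows: iterable of (period, x, y) tuples.
    Returns a dict mapping each period to its estimated slope,
    with periods in sorted order.
    """
    groups = defaultdict(lambda: ([], []))
    for period, x, y in rows:
        groups[period][0].append(x)
        groups[period][1].append(y)
    return {p: ols_slope(xs, ys) for p, (xs, ys) in sorted(groups.items())}


# Synthetic cross-sectional data with a time component (illustrative only):
# the true slope drifts upward by 0.25 each year.
random.seed(0)
rows = []
for year in range(2000, 2005):
    true_slope = 0.5 + 0.25 * (year - 2000)
    for _ in range(2000):
        x = random.gauss(0, 1)
        y = true_slope * x + random.gauss(0, 1)
        rows.append((year, x, y))

# One estimate per year, shown together so the trend is visible at a glance.
estimates = fit_by_period(rows)
for year, slope in estimates.items():
    print(f"{year}: slope = {slope:.2f}")
```

Fitting each year on its own avoids having to model how the relationship changes over time. Any change shows up directly as a pattern across the per-year estimates.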

Best Source

Data Collections

These are collections of data sources created and maintained by others.

General Data Sources

  • The General Social Survey. This is the highest quality, longest running survey of attitudes among Americans. To get started, check out Kieran Healy’s gssr R package.

  • Opportunity Insights seeks to “identify barriers to economic opportunity and develop scalable solutions that will empower people throughout the United States to rise out of poverty and achieve better life outcomes.”

  • American National Election Studies. Sadly, there is no Healy-equivalent among the political scientists.

Specific Projects with Interesting Data

  • “Student evaluations of teaching: teaching quantitative courses can be hazardous to one’s career” – link and data.

  • “Childhood cross-ethnic exposure predicts political behavior seven decades later: Evidence from linked administrative data” – link and data.

  • “The measurement of partisan sorting for 180 million voters” – link and data.

Live Data

Beginning students should do projects in which the data is unchanging. Dealing with data that is regularly updated is hard. But, for more advanced (or ambitious) students, working with “live” data can be fun and challenging!