Project Advice

This article provides some advice for completing data science projects in the era of AI. The more you use AI, the better off you will be. But it is also easy to make mistakes with AI, both small and large. The main point of this advice is to help you structure your work so that AI does not lead you astray.

Organization

AI is especially good at discrete tasks. The more that you can split your project into smaller parts, the better. You will rarely type out these parts yourself, but you are responsible for ensuring that they are correct. Use this structure unless you have a good reason not to.

Downloads

Have one file called download.R which, obviously, downloads your data. Note that it, like most of the files here, is a script, not a Quarto document. It is run by hand, perhaps only once.

Have one directory called downloads into which all your raw data is downloaded. Remember that GitHub will generally reject a file larger than 100 MB, so any such large files need to be included in your .gitignore. And, since you can always just download the data again, people often add the entire downloads directory to the .gitignore so that nothing in it is put on GitHub.
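
For example, a .gitignore that keeps the whole directory off GitHub needs only one line:

```
# Keep raw downloaded data off GitHub; it can always be re-downloaded.
downloads/
```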

It is often useful to first download the data for just one city, one month, or whatever small subset makes sense, especially if the full download takes a long time. You want to make sure that you are getting the data that you think you are getting. Once this limited download looks good, ask the AI to change the code to download all the data you want.

Although the AI is writing the code, you must understand, in broad outlines, what the code is doing. You should also “take charge” of how the code works. For example, you will probably need to tell the AI that all the downloaded data should go into the downloads directory.
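
Here is a minimal sketch of what download.R might look like. The URL, file names, and city list are all placeholders, and your AI-generated version will differ:

```r
# download.R
# Downloads the raw data into the downloads/ directory.
# Run by hand, perhaps only once.

library(readr)

if (!dir.exists("downloads")) dir.create("downloads")

# Start with a limited download (one city) to confirm that the
# data looks the way you expect. The URL is a placeholder.
url <- "https://example.com/data/boston.csv"
download.file(url, destfile = "downloads/boston.csv", mode = "wb")

# Sanity check: peek at the first few rows before downloading the rest.
read_csv("downloads/boston.csv", n_max = 5)

# Once the limited download looks good, fetch everything.
cities <- c("boston", "chicago", "denver")   # placeholder list
for (city in cities) {
  download.file(
    paste0("https://example.com/data/", city, ".csv"),
    destfile = file.path("downloads", paste0(city, ".csv")),
    mode = "wb"
  )
}
```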

Cleaning

Cleaning the data is a conceptually separate step from downloading the data, so it makes sense to have a cleaning.R script which performs that cleaning. You will probably run this script much more often than you run download.R, if only because getting the data clean and organized often takes more iterations. Once you clean the data, it is standard practice to save a cleaned version of the data, often in a data directory. Data cleaning, like data downloading, is a process which can take some time. You don’t want to do it every time you change a sentence in your final report.
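
A matching sketch of cleaning.R, again with placeholder file and column names:

```r
# cleaning.R
# Reads raw data from downloads/, cleans it, and saves the result
# in data/. You will run this more often than download.R.

library(tidyverse)

raw <- read_csv("downloads/boston.csv")

clean <- raw |>
  janitor::clean_names() |>          # requires the janitor package
  drop_na(price) |>                  # placeholder column
  mutate(date = as.Date(date))       # placeholder column

if (!dir.exists("data")) dir.create("data")

# RDS preserves column types; write_csv() is a fine alternative.
write_rds(clean, "data/boston_clean.rds")
```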

At this point, you should notice a pattern: each step starts with the data in one location, takes that data, does something with it, and then leaves a new version of the data in a new location. A process like this makes it easier to understand the entire project, and easier to change or modify it.

Graphics

If your graphics don’t take too much time to create, then you can just create them in your final report. But, if they take more than 30 seconds or so, you don’t want to re-create them every time you change a formatting decision in the report. Instead, you should create a graphics.R script which takes the cleaned data from the data directory, creates the graphics, and then saves those graphics in a directory named something like images.
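
A sketch of graphics.R, continuing the same placeholder names:

```r
# graphics.R
# Reads the cleaned data from data/, builds each plot, and saves
# it in images/. The plot below is a placeholder.

library(tidyverse)

clean <- read_rds("data/boston_clean.rds")

if (!dir.exists("images")) dir.create("images")

p <- ggplot(clean, aes(x = date, y = price)) +   # placeholder columns
  geom_line() +
  labs(title = "Prices over time")

ggsave("images/prices.png", plot = p, width = 8, height = 5)
```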

Final Report

The two most common formats for your final report are a) a single webpage or b) a website with several pages, the most important being the index.html page. In either case, the webpages are created from QMD files, and you are creating this report for a non-technical audience. They do not want to see your computer code.
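
One way to keep code out of the rendered report is to set echo: false for the whole document in the QMD file’s YAML header. A minimal sketch:

```yaml
---
title: "My Project"
execute:
  echo: false      # hide all code in the rendered report
  warning: false   # hide warnings as well
---
```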

Of course, AIs are getting so good that, pretty soon, you will just be able to copy/paste these instructions . . .

Data

This article lists data sources which students might consider using for their projects, either in my course or elsewhere.

First, Some Advice

  • Do not use a univariate time series for your project. Yes, a time series of the daily stock price for Apple over the last 40 years is interesting. But it is not the same sort of interesting that works well with a project, at least in introductory classes.

  • Be wary of using a simple multivariate time series. Yes, a time series of, say, the daily stock price for all the stocks in the S&P 500 is interesting. Yet to do anything useful with data like this, you really need to study the material in a book like Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos. Since you haven’t, choose different data.

  • The best way to handle cross-sectional data with a time series component is to use Andrew Gelman’s “secret weapon”: fit the same model to each time period separately and then show all the estimates/graphics together, as in the sketch after this list.

  • More data is better than less data. Don’t attempt a project unless you have thousands of observations. Tens of thousands or more is better.
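
Here is a sketch of the “secret weapon” mentioned above. It uses simulated data so that it runs on its own; my_data, outcome, and treatment are stand-ins for your own names:

```r
library(tidyverse)
library(broom)

# Simulated stand-in for cross-sectional data with a year column.
my_data <- tibble(
  year = rep(2015:2020, each = 100),
  treatment = rnorm(600),
  outcome = rnorm(600) + 0.5 * treatment
)

# Fit the same model separately for each year, keeping the
# coefficient of interest from each fit.
estimates <- my_data |>
  group_by(year) |>
  group_modify(~ tidy(lm(outcome ~ treatment, data = .x))) |>
  ungroup() |>
  filter(term == "treatment")

# Show all the yearly estimates together.
ggplot(estimates, aes(x = year, y = estimate)) +
  geom_pointrange(aes(ymin = estimate - 2 * std.error,
                      ymax = estimate + 2 * std.error)) +
  labs(title = "Treatment effect estimated separately by year",
       x = "Year", y = "Estimate (within 2 SE)")
```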

Data Collections

These are collections of data sources created and maintained by others.

General Data Sources

  • The General Social Survey. This is the highest quality, longest running survey of attitudes among Americans. To get started, check out Kieran Healy’s gssr R package; see the sketch after this list.

  • Opportunity Insights seeks to “identify barriers to economic opportunity and develop scalable solutions that will empower people throughout the United States to rise out of poverty and achieve better life outcomes.”

  • American National Election Studies. Sadly, there is no Healy-equivalent among the political scientists.
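
To make the GSS bullet concrete, here is a sketch using gssr. The package is not on CRAN, so the install route shown in the comments is one common approach; check the package site for current instructions:

```r
# Getting started with the GSS via Kieran Healy's gssr package.
# gssr is not on CRAN; installing from GitHub is one common route.
# install.packages("remotes")
# remotes::install_github("kjhealy/gssr")

library(gssr)
library(tidyverse)

data(gss_all)   # the cumulative GSS data file

gss_all |>
  select(year, age, degree) |>
  glimpse()
```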

Causal Data Sources

Most data sources do not easily lend themselves to an (unconfounded) causal interpretation. In other words, treatment assignment is rarely independent of potential outcomes, even when conditioning on covariates. However, many students want to do a project which involves estimating causal effects.

Besides some of the specific projects mentioned below, a good source is the CRAN package causaldata, which includes a variety of datasets from three open causal inference textbooks: The Effect by Nick Huntington-Klein (2021), Causal Inference: The Mixtape by Scott Cunningham (2021), and Causal Inference: What If by Miguel Hernán and James Robins (2020). Students interested in doing a project involving causal effects should start here.
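
A sketch of how you might explore the package; nhefs is one of the many datasets it ships:

```r
# Exploring the causaldata package, which is on CRAN.
# install.packages("causaldata")

library(causaldata)
library(tidyverse)

# List every dataset the package ships, with short descriptions.
data(package = "causaldata")

# nhefs, for example, is the NHEFS data used throughout
# Causal Inference: What If.
glimpse(nhefs)
```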

Specific Projects with Interesting Data

  • “Student evaluations of teaching: teaching quantitative courses can be hazardous to one’s career” – link and data.

  • “Childhood cross-ethnic exposure predicts political behavior seven decades later: Evidence from linked administrative data” – link and data.

  • “Results from a 2020 field experiment encouraging voting by mail” – link and data.

Live Data

Beginning students should do projects in which the data is unchanging. Dealing with data that is regularly updated is hard. But, for more advanced (or ambitious) students, working with “live” data can be fun and challenging!