This repo accompanies our paper: Chen and Zimmermann (2022), "Open source cross-sectional asset pricing".
If you use data or code based on our work, please cite the paper:
@article{ChenZimmermann2021,
title={Open Source Cross Sectional Asset Pricing},
author={Chen, Andrew Y. and Tom Zimmermann},
journal={Critical Finance Review},
year={2022},
pages={207-264},
volume={11},
number={2}
}
If you are mostly interested in working with the data, we provide both stock-level signals (characteristics) and a bunch of different portfolio implementations for direct download at the dedicated data page. Please see the data page for answers to FAQs.
However, this repo may still be useful for understanding the data. For example, if you want to know exactly how we construct BrandInvest (Belo, Lin, and Vitorino 2014), you can just open up `BrandInvest.do` in the repo's webpage for `Signals/Code/Predictors/`.
The code is separated into three folders:
- `Signals/Code/`: Downloads data from WRDS and elsewhere. Constructs stock-level signals (characteristics) and outputs to `Signals/Data/`. Mostly written in Stata.
- `Portfolios/Code/`: Takes in signals from `Signals/Data/` and outputs portfolios to `Portfolios/Data/`. Entirely in R.
- `Shipping/Code/`: You shouldn't need this. We use this to prepare data for sharing.

We separate the code so you can choose which parts you want to run. If you only want to create signals, you can run the files in `Signals/Code/` and then do your thing. If you just want to create portfolios, you can skip `Signals/Code/` by directly downloading its output via the data page. The whole thing is about 15,000 lines, so you might want to pick your battles.
More details are below.
In `Signals/Code/`, `master.do` runs everything. It calls every .do file in the following folders:
- `DataDownloads/` downloads data from WRDS and elsewhere
- `Predictors/` constructs stock-level predictors and outputs to `Signals/Data/Predictors/`
- `Placebos/` constructs "not predictors" and "indirect evidence" signals and outputs to `Signals/Data/Placebos/`

`master.do` employs exception handling, so if any of these .do files errors out (due to lack of a subscription, code being out of date, etc.), it'll keep running and output as much as it can.
The whole thing takes roughly 24 hours, but the predictors will be done much sooner, probably within 12 hours. You can keep track of how it's going by checking out the log files in `Signals/Logs/`.
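One way to check progress from a shell is to look at whichever log file was written most recently. The sketch below assumes it is run from the project root and that the logs are plain text files sitting directly in `Signals/Logs/`; the helper name `latest_log` is hypothetical and not part of the repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): print the path of the most
# recently modified entry in a log directory (default: Signals/Logs).
latest_log() {
    log_dir="${1:-Signals/Logs}"
    # Fail if the directory does not exist yet.
    [ -d "$log_dir" ] || return 1
    # ls -t sorts by modification time, newest first.
    newest=$(ls -t "$log_dir" | head -n 1)
    # Fail if the directory is empty.
    [ -n "$newest" ] || return 1
    printf '%s/%s\n' "$log_dir" "$newest"
}

# Show the last lines of the newest log, if there is one.
if log_file=$(latest_log); then
    echo "Newest log: $log_file"
    tail -n 20 "$log_file"
else
    echo "No log files found in Signals/Logs yet."
fi
```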
In `master.do`, set `pathProject` to the root directory of the project (where `SignalDoc.csv` is located) and `wrdsConnection` to the name you selected for your ODBC connection to WRDS (a.k.a. dsn).
If you don't have an ODBC connection to WRDS, you'll need to set it up. WRDS provides instructions for Windows users and for WRDS cloud users. Note that `wrdsConnection` (the name of the ODBC connection) in the WRDS cloud example is `"wrds-postgres"`. If neither of these solutions works, please see our troubleshooting wiki.
This minimal setup will allow you to produce the vast majority of signals. And due to the exception handling in `master.do`, the code will run even if you're not set up to produce the remainder.
But if you want the signals that use IBES, 13F, OptionMetrics, or FRED, plus a handful of other assorted signals, you'll want to do the following:
- For IBES, 13F, OptionMetrics, and bid-ask-spread signals: Run `Signals/Code/PrepScripts/master.sh` on the WRDS Cloud, and download the output to `Signals/Data/Prep/`. See `master.sh` for more details. The most important files from this optional setup are `iclink.csv` and `oclink.csv`, which allow for merging of IBES, OptionMetrics, and CRSP data. The code here relies heavily on code by Luis Palacios, Rabih Moussawi, Denys Glushkov, Stacey Jacobsen, Craig Holden, Mihail Velikov, Shane Corwin, and Paul Schultz.
- For signals that use the VIX, inflation, or broker-dealer leverage, you will need to request an API key from FRED. Before you run the download scripts, save your API key in Stata (either via the context menu or via `set fredkey`). See this Stata blog entry for more details.
- For signals that use patent citations, BEA input-output tables, or Compustat customer data, the code uses Stata to call R scripts, so this may need some setup. If you're on a Windows machine, you will need to point `master.do` to your R installation by setting `RSCRIPT_PATH` to the path of `Rscript.exe`. If you're on Linux, you just need to make sure that the `rscript` command is executable from the shell.

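On Linux, a quick way to confirm that R can be called from the shell is to check whether the executable is on your `PATH`. The sketch below assumes the executable is named `Rscript` (capital R), as in a standard R installation; the helper name `have_cmd` is hypothetical and not part of the repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): succeed if a command
# can be found on the current PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Assumption: the R command-line front end is installed as "Rscript".
if have_cmd Rscript; then
    echo "Rscript found at: $(command -v Rscript)"
    Rscript --version
else
    echo "Rscript is not on the PATH; install R or add its bin directory to PATH."
fi
```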
In `Portfolios/Code/`, `master.R` runs everything. It:
- Takes in signal data located in `Signals/Data/Predictors/` and `Signals/Data/Placebos/`
- Outputs portfolio data to `Portfolios/Data/Portfolios/`
- Outputs exhibits found in the paper to `Results/`

It also uses `SignalDoc.csv` as a guide for how to run the portfolios.
By default the code skips the daily portfolios (`skipdaily = T`) and takes about 8 hours, assuming you examine all 300 or so signals. However, the baseline portfolios (based on predictability results in the original papers) will be done in just 30 minutes. You can keep an eye on how it's going by checking the csvs outputted to `Portfolios/Data/Portfolios/`. Every 30 minutes or so the code should output another set of portfolios. Adding the daily portfolios (`skipdaily = F`) takes an additional 12ish hours.
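If you would rather not open the folder repeatedly, counting the csvs from a shell gives a rough progress indicator. This is only a sketch, assuming it is run from the project root; the helper name `count_csvs` is hypothetical and not part of the repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): count the .csv files
# directly inside a directory (default: Portfolios/Data/Portfolios).
count_csvs() {
    out_dir="${1:-Portfolios/Data/Portfolios}"
    # A directory that does not exist yet simply has zero csvs.
    [ -d "$out_dir" ] || { echo 0; return 0; }
    # -maxdepth 1 restricts the count to the top level of the directory.
    find "$out_dir" -maxdepth 1 -type f -name '*.csv' | wc -l | tr -d ' '
}

echo "Portfolio csvs written so far: $(count_csvs)"
```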
All you need to do is set `pathProject` in `master.R` to the project root directory (where `SignalDoc.csv` is). Then `master.R` will create portfolios for Price, Size, and STreversal in `Portfolios/Data/Portfolios/`.
You probably want more than Price, Size, and STreversal portfolios, so you probably want to set up more signal data before you run `master.R`.
There are a few ways to set up this signal data:
- Run the code in `Signals/Code/` (see above)
- Download `Firm Level Characteristics/Full Sets/PredictorsIndiv.zip` and `Firm Level Characteristics/Full Sets/PlacebosIndiv.zip` via the data page and unzip to `Signals/Data/Predictors/` and `Signals/Data/Placebos/`
- Download only some selected csvs via the data page and place them in `Signals/Data/Predictors/` (e.g. just download `BM.csv`, `AssetGrowth.csv`, and `EarningsSurprise.csv` and put them in `Signals/Data/Predictors/`).

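For the last option, copying the downloaded csvs into place can be scripted. The following is a minimal sketch, assuming the files have already been downloaded by hand from the data page into some local folder; the helper name `place_signals` and its arguments are hypothetical and not part of the repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): copy selected signal csvs
# from a download folder into <project root>/Signals/Data/Predictors/.
# Usage: place_signals DOWNLOAD_DIR PROJECT_ROOT FILE...
place_signals() {
    src_dir="$1"
    dest_dir="$2/Signals/Data/Predictors"
    shift 2
    # Create the destination folder if it is not there yet.
    mkdir -p "$dest_dir" || return 1
    for f in "$@"; do
        if [ -f "$src_dir/$f" ]; then
            cp "$src_dir/$f" "$dest_dir/"
        else
            # Stop at the first file that was not downloaded.
            echo "Missing: $src_dir/$f" >&2
            return 1
        fi
    done
}
```

For example, if the csvs were saved to `~/Downloads` (an assumed location), running `place_signals ~/Downloads . BM.csv AssetGrowth.csv EarningsSurprise.csv` from the project root would put all three files in `Signals/Data/Predictors/`.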
The code in `Shipping/Code/` zips up the data, makes some quality checks, and copies files for uploading to Gdrive. You shouldn't need to use this, but we keep it with the rest of the code for replicability.
Stata code was tested on both Windows and Linux. Linux was Ubuntu 18.04.5 running Stata 16.1.
R code was tested on:
- Windows 10, RStudio Version 1.4.1106, R 4.0.5, and Rtools 4.0.0
- Ubuntu 18.04.5, Emacs 26.1, ESS 17.11, and R 4.0.2
To install the Windows R setup:
- Download and install R from https://cran.r-project.org/bin/windows/base/old/
- Download and install Rtools from https://cran.r-project.org/bin/windows/Rtools/history.html
- Download and install RStudio from https://www.rstudio.com/products/rstudio/download/
- Add Rtools to the path by running the following in R (see https://cran.r-project.org/bin/windows/Rtools/):
writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron")
If you use RStudio, take a look at Hendrik Bruns' guide to set up version control.
As a stand-alone client for Windows, we recommend Sourcetree.
If you use Git, you should definitely add the following lines to .gitignore:
Signals/Data/**
Shipping/Data/**
Portfolios/Data/**
These folders contain a ton of data and will make Git slow to a crawl or crash.
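If you prefer to add these lines from a shell, the sketch below appends a pattern only when it is not already present, so it is safe to run more than once. The helper name `add_ignore` is hypothetical and not part of the repo, and the sketch assumes any existing `.gitignore` ends with a newline.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): append a pattern to a
# .gitignore file unless that exact line is already there.
# Usage: add_ignore PATTERN [GITIGNORE_FILE]
add_ignore() {
    pattern="$1"
    ignore_file="${2:-.gitignore}"
    # Make sure the file exists so grep does not fail on a missing file.
    touch "$ignore_file"
    # -x matches whole lines only; -F treats the pattern as a fixed string.
    grep -qxF -- "$pattern" "$ignore_file" || printf '%s\n' "$pattern" >> "$ignore_file"
}
```

Run from the repo root, `add_ignore 'Signals/Data/**'`, `add_ignore 'Shipping/Data/**'`, and `add_ignore 'Portfolios/Data/**'` would add the three lines listed above.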
Please let us know if you find typos in the code or think that we should add additional signals. You can let us know about any suggested changes via pull requests for this repo. We will keep the code up to date for other researchers to use.