Testing for Causal Evidence on Data Sharing and Citations
See https://osf.io/cdt8y/ for more info.
The folder structure of this repository grew organically over the development of the project. There is also a Box folder `URAPshared` that contains several files not contained in this repository; its purpose is to manage the raw data from scraping and data entry. In the following, a folder `x` contained in the Box folder is denoted as `URAPshared/x`.
Data collection started with the journals AJPS and APSR, using Octoparse. The folder `octoparse` contains the `otd` files to run this data collection; Octoparse uses `otd` files as instructions on how to collect data.
The output from the AJPS and APSR data collection is stored in `URAPshared/octoparse` in the following files:

- `apsr_article_content_2006_2014.csv`
- `apsr_issue_toc.csv`
- `ajps_article_content_2003_2007.csv`
- `ajps_article_content_2008_2012.csv`
- `ajps_article_content_2013_2016.csv`
- `ajps_issue_toc.csv`
We collected article information from each journal issue's table of contents page, and then separately collected article content from the individual article pages; this is what `issue_toc` and `article_content` refer to in the file names.
The collection with Octoparse had a few problems. For AJPS, there were two: first, memory constraints required us to split the data collection into several year ranges; second, the output files sometimes contain multiple rows for the same article, breaking the csv format. The script `data_preparation/combine_octoparse_outputs.py` fixes both issues by combining `ajps_article_content_2003_2007.csv`, `ajps_article_content_2008_2012.csv` and `ajps_article_content_2013_2016.csv` into `bld/ajps_articles_2003_2016.csv`. Analogously, `data_preparation/create_article_info_from_issue_toc.py` cleans `ajps_issue_toc.csv` into `bld/ajps_article_info_from_issue_toc.csv`.
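For illustration, here is a minimal sketch of what a repair step like the one in `combine_octoparse_outputs.py` might look like, assuming each article row carries a unique `url` column; the column names and the grouping logic are assumptions, not the script's actual interface:

```python
# Hypothetical sketch: column names ("url", "article_content") are assumed.
import pandas as pd

YEAR_RANGE_FILES = [
    "ajps_article_content_2003_2007.csv",
    "ajps_article_content_2008_2012.csv",
    "ajps_article_content_2013_2016.csv",
]

def combine(files=YEAR_RANGE_FILES, out="bld/ajps_articles_2003_2016.csv"):
    # Stack the year-range files that memory constraints forced apart.
    articles = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    # Collapse articles that Octoparse split across multiple rows by
    # concatenating the text fragments that share the same article URL.
    articles["article_content"] = articles["article_content"].fillna("")
    articles = (articles.groupby("url", as_index=False)
                        .agg({"article_content": " ".join}))
    articles.to_csv(out, index=False)

if __name__ == "__main__":
    combine()
```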
APSR had a centennial edition that we missed in the first run of data collection (we believe due to a different page layout) and hence collected separately later on; the corresponding file is `apsr_article_content_centennial.csv`. The script `data_preparation/clean_apsr_octoparse_output.py` cleans `apsr_issue_toc.csv`, `apsr_article_content_2006_2014.csv` and `apsr_article_content_centennial.csv`, and writes them to `bld/apsr_article_info_from_issue_toc.csv`, `bld/apsr_article_content_2006_2014.csv` and `bld/apsr_centennial_article_content.csv`.
There are five categories of data entry.
- Article: What is the research field of an article, and what type of data does it use?
- Author website: Does an article author's website contain any data or code for the article?
- Dataverse: Are there any files corresponding to the article on Dataverse? (A sketch of this kind of lookup follows the list.)
- Reference: Does the article content contain a reference to data or code that is useful for reproducing the article's results?
- Link: Do URLs in the article content that reference code or data actually lead to it?
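The Dataverse category can be illustrated with the public Dataverse search API. This is only a sketch of the kind of lookup involved; the server URL, query string and field names are assumptions, and the actual coding is done manually against templates built by `data_preparation/extract_from_dataverse.py`:

```python
# Hedged sketch: server URL and the example title are placeholders.
import requests

def search_dataverse(title, server="https://dataverse.harvard.edu"):
    # The search API returns matching records as JSON under data/items.
    resp = requests.get(
        f"{server}/api/search",
        params={"q": f'title:"{title}"', "type": "dataset"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["items"]

# List candidate datasets for one (hypothetical) article title.
for item in search_dataverse("Replication data for: Example Article"):
    print(item["name"], item.get("global_id", ""))
```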
The folder `data_entry` contains an `md` file describing the protocol for entering data in each of these categories. It also contains `reference_coding_protocol_questions.md`, a collection of clarifying questions on the protocol.
The process for data entry works as follows:
- Every combination of journal and data entry category has an `ods` template.
- Several people can work on filling out a template by copying it to their local machine and adding their initials to the file name.
- The data entered by several people on the same template is then compared; differences in entry are detected across all filled-out versions of the same template.
- Detected differences are marked in a difference resolution template. Pairs of people with differing data entry meet to resolve the differences and record them in a single copy of the difference resolution file.
- The resolved differences are then "harmonized" across difference resolution pairs into a single file.
Each of these steps has corresponding scripts and folders:
- Create templates
  - Article: `data_preparation/create_article_coding_template.py`
  - Author: `data_preparation/create_ajps_author_website_coding_template.py`, `data_preparation/create_apsr_author_website_coding_template.py`
  - Dataverse: `data_preparation/extract_from_dataverse.py`
  - Reference: `data_preparation/create_ajps_reference_coding_template.py`, `data_preparation/create_apsr_reference_coding_template.py`, `data_preparation/create_econ_reference_coding_template.py`
  - Link: `data_preparation/create_link_coding_template.py`
- Upload template: individual versions of filled-out templates are stored and synchronized in `URAPShared/Data`.
- The script `data_preparation/create_diff.py` can be configured to take multiple individually filled-out versions of the same template and create a single file `bld/*_diff.csv` listing inconsistencies for each entry across all versions (a sketch of this comparison follows the list).
- The script `data_preparation/create_diff_resolution_template.py` can be configured to use the `bld/*_diff.csv` file to create templates for resolution pairs to resolve their differences. These templates are also stored and synchronized in `URAPShared/Data`; their file names contain the initials of the resolution pair as well as `diff_resolution`.
- The script `data_preparation/harmonize.py` can be configured to take diff resolution files and match them to the original templates to create a final `bld/*_harmonized.csv` file.
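To make the comparison step concrete, here is a minimal sketch in the spirit of `create_diff.py`, assuming each filled-out template is a csv with an `article_id` key column; the key, the file paths and the output layout are guesses at the real script's configuration:

```python
# Sketch only: assumes all versions share the same rows and columns.
import pandas as pd

def create_diff(template_versions, key="article_id"):
    # Index every coder's version by the shared key so cells line up.
    versions = {path: pd.read_csv(path).set_index(key)
                for path in template_versions}
    first = next(iter(versions.values()))
    rows = []
    for idx in first.index:
        for col in first.columns:
            # Collect what every coder entered for this cell.
            values = {path: df.at[idx, col] for path, df in versions.items()}
            if len(set(map(str, values.values()))) > 1:
                rows.append({key: idx, "column": col, **values})
    return pd.DataFrame(rows)

# Example use with two coders' copies of the same template:
# create_diff(["article_db.csv", "article_rk.csv"]).to_csv(
#     "bld/article_diff.csv", index=False)
```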
The `data_preparation` folder contains a few more files:

- `import_old_entries.py`: This script can be configured to import data entry across different filled-out templates. This is useful when changes in the protocol or bug fixes led to additional entries: existing data entry could be preserved despite a change in the template file (sketched below).
- `select_relevant_years.py`: Restricts the selection of AJPS articles to those published in the years 2006 to 2014. This was necessary because we collected a wider range of articles with Octoparse.
- `update_template_rk.py`: A script to import old data in response to a very specific change in the template structure.
- `tools.py`: A collection of helper functions used across multiple scripts.
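As an illustration of the idea behind `import_old_entries.py`, existing entries can be carried over into a revised template by aligning on a key column. The key (`doi`) and the matching rule here are assumptions; the real script is configured per template:

```python
# Sketch: "doi" as the key column is an assumption.
import pandas as pd

def import_old_entries(old_path, new_template_path, key="doi"):
    old = pd.read_csv(old_path).set_index(key)
    new = pd.read_csv(new_template_path).set_index(key)
    # Copy previously entered values into every column the two files
    # share; rows new to the template stay blank for fresh entry.
    new.update(old[[c for c in new.columns if c in old.columns]])
    return new.reset_index()

# new_filled = import_old_entries("old_article_db.csv",
#                                 "article_coding_template.csv")
```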
The repository's top-level folders:

- `./analysis`: Analysis code for the paper!
- `./citation_scraping`: Scraping code for citation data. We no longer use the Selenium-collected data, because the API data is better.
- `./data_cleaning`: Intermediate R and `.do` files to go from raw data to analysis data. David Birke wrote the code for Poli Sci; Simon Zhu and Neil Tagare built on it for Econ.
- `./data_collection_econ`: Baiyue's scripts to scrape AER and QJE for the text of the articles.
- `./data_entry`: Protocols for the RAs to do the manual classification parts of the project.
- `./data_preparation`: David Birke's scripts to manage all the raw data and the manual input by the RAs.
- `./external`: A copy of all the files from Box that you need to run the code (Poli Sci).
- `./external_econ`: A copy of all the files from Box that you need to run the code (Econ). Not included in the repository, because Git does not handle large data files well.
- `./jel_scraping`: Scripts to get JEL codes from the ProQuest version of the EconLit database.
- `./logs`: Logs from the Stata analysis files in `./analysis`.
- `./octoparse`: Octoparse (octoparse.com) files for collecting the text of the Poli Sci articles.
- `./output`: Every output file produced by the Stata code in `./analysis`.
- `./outputforsharelatex`: The subset of the Stata output that you manually upload to ShareLaTeX to include in the paper.
- `./paper`: Sorry, not the paper! A collection of old slides used for presentations.