This dataset contains all data used and created for the "Code Contribution and Credit in Science" article [TODO: link/doi to paper].
There are six files in this dataset:
1. rs-graph-v1-prod.db
2. rs-graph-v1-redacted.db
3. annotated-dev-author-em-resolved.csv
4. train-set.parquet
5. test-set.parquet
6. dev-author-em-misclassifications.csv

rs-graph-v1-redacted.db
The rs-graph-v1-redacted.db file is a SQLite database file that contains article-repository pairs. For each article, the basic bibliometric and author information is included. For each repository, only the basic repository metadata is included.
For details on how to load and access the data in this database, see:
https://github.com/evamaxfield/rs-graph

rs-graph-v1-prod.db
The rs-graph-v1-prod.db file is a SQLite database file that contains the same basic data as rs-graph-v1-redacted.db and additionally includes the contributor information for each repository, each contributor's details, and our predicted linkages between article authors and repository developers. Access to this database file is restricted because it creates linked personally identifiable information.
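As a quick first look at either database file, the tables and columns can be listed with Python's standard sqlite3 module. This is a minimal sketch and not the loading method documented in the rs-graph repository. The helper name list_tables is hypothetical, and no table or column names are assumed because the schema is not described here.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Hypothetical helper for illustration; it discovers the schema rather
    than assuming any particular table names.
    """
    # Open read-only so the downloaded file is never modified.
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # Each PRAGMA table_info row is (cid, name, type, ...).
            cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [col[1] for col in cols]
        return schema
    finally:
        conn.close()
```

Calling list_tables("rs-graph-v1-redacted.db") from the directory that holds the downloaded file returns a dictionary mapping each table name to its column names, which can then guide further queries.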
For details on how to load and access the data in this database, see:
https://github.com/evamaxfield/rs-graph

annotated-dev-author-em-resolved.csv
The annotated-dev-author-em-resolved.csv file stores the annotations created by our team, which were used to train our author-developer-account entity matching model. As with rs-graph-v1-prod.db, access to this data is restricted because it creates linked personally identifiable information.
While the training data is kept private and available by request, we make the trained predictive model available at:
https://github.com/evamaxfield/sci-soft-models

The train-set.parquet and test-set.parquet files are the exact splits used for model training. The dev-author-em-misclassifications.csv file contains the model's misclassifications on test-set.parquet.
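
The CSV and Parquet files can be previewed along the following lines. This is a sketch under stated assumptions: the helper names preview_csv and load_split are hypothetical, no column names are assumed, and reading Parquet assumes that pandas and a Parquet engine such as pyarrow are installed.

```python
import csv


def preview_csv(csv_path, n_rows=5):
    """Return (column names, first n_rows rows as dicts) from a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # zip stops after n_rows rows without reading the rest of the file.
        rows = [row for _, row in zip(range(n_rows), reader)]
        return list(reader.fieldnames or []), rows


def load_split(parquet_path):
    """Load one Parquet split as a pandas DataFrame."""
    # pandas is third-party and needs a Parquet engine such as pyarrow;
    # it is imported here so preview_csv works without it installed.
    import pandas as pd

    return pd.read_parquet(parquet_path)
```

For example, preview_csv("dev-author-em-misclassifications.csv") returns the header and the first few rows, while load_split("train-set.parquet") and load_split("test-set.parquet") return the two splits as DataFrames.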