Description
After all teams pre-registered the models they planned to run in the many-analysts study
conducted by Breznau et al. 2022, the principal investigators coded each model for its
specifications: basic model form (general class of regression and treatment of missing
variables), detailed model description (variance components, estimator), additional specifications
(robust clustering, bootstrapping), measurement of the six dependent variables (recoding and/or
combining of them), data waves included (up to all five), countries included in the sample (listed
individually), special features of the sample (all countries, only Europe, only rich countries),
country-level controls, individual-level controls, and any interactions. They then combined these
components systematically into paragraph descriptions and randomly assigned 5-7 different
descriptions to each participant to read and rate in response to the question, “How confident are
you that the above research design is adequate for testing the hypothesis that ‘immigration
undermines social policy preferences’ using ISSP data?” Participants chose from seven ordered
response options ranging from “unconfident” and “rather unconfident” through “rather confident”
and “confident”.
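Purely as an illustration, one coded description and the response scale might be represented as
below; the field names and example values are assumptions for this sketch, not the original
coding scheme used by the principal investigators.

```python
# Illustrative (assumed) representation of one coded model description.
coded_description = {
    "model_form": "logistic regression, listwise deletion of missing values",
    "detailed_description": "random intercepts by country, ML estimator",
    "additional_specs": ["cluster-robust standard errors"],
    "dv_measurement": "six dependent variables combined into one index",
    "waves": [1, 4, 5],                # up to all five waves
    "sample": "only Europe",
    "country_controls": ["GDP per capita"],
    "individual_controls": ["age", "sex", "education"],
    "interactions": [],
}

# Seven ordered response options; only the four labels quoted in the text are
# named here, the remaining categories are left unlabeled.
confidence_scale = {
    1: "unconfident",
    2: "rather unconfident",
    6: "rather confident",
    7: "confident",
}
```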
As participants were free to change their models at any time, as is common in many-analysts
studies, many ended up running models that differed slightly from what they pre-registered.
Moreover, many teams ran far more models than expected, so that a single paragraph no longer
described all of a team's models. For this reason, the peer review scores could not be used
directly. However, because the descriptions had been constructed systematically, the Breznau et
al. principal investigators broke them down into individual components (estimator, sample, etc.)
and averaged scores across all descriptions containing a given component. These component
scores were then assigned to the models that had the given component. In addition, a random
split half of the sample took part in a debate in an online forum, in which participants voted on
arguments for using various model specifications (two-way fixed effects or not, for example).
The voting scores were rescaled and added for the relevant components. An overall average
was then calculated for each model. This provided the final score used in Breznau et al. 2022.
For this new version, we wanted to better measure the complexity of models and peer reviewers'
reactions to it. We therefore matched each paragraph's specifications with each of the 1,261
models' specifications. When at least 95% of the specifications matched, we assigned that
paragraph's peer review score to the model. For model descriptions not achieving a 95% match,
we took the top 5% of matched models, with a minimum cutoff of 80%. This leaves some models
with over 20 peer review scores, others with only four, and a few with none. In total, 1,223 of the
1,261 models have an average score that better reflects each single model.
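The matching rule can be sketched roughly as follows. This is one plausible reading of the
procedure, with made-up function and argument names and a simple field-by-field match share;
it is not the code actually used to build the dataset.

```python
from collections import defaultdict
from statistics import mean

def match_share(model_spec, desc_spec):
    """Share of specification fields on which a model and a description agree."""
    keys = set(model_spec) | set(desc_spec)
    return sum(model_spec.get(k) == desc_spec.get(k) for k in keys) / len(keys)

def assign_scores(models, descriptions, full=0.95, top_share=0.05, floor=0.80):
    """models: {model_id: spec_dict}; descriptions: {desc_id: (spec_dict, score)}.

    Each description's score goes to every model it matches at >= `full`; if no
    model reaches `full`, it goes to the top `top_share` of models by match
    share, provided they match at least `floor`. Returns each model's mean
    received score (None if it received no scores).
    """
    received = defaultdict(list)
    for desc_spec, score in descriptions.values():
        shares = sorted(
            ((match_share(m_spec, desc_spec), m_id) for m_id, m_spec in models.items()),
            reverse=True,
        )
        hits = [m_id for share, m_id in shares if share >= full]
        if not hits:
            k = max(1, int(len(shares) * top_share))
            hits = [m_id for share, m_id in shares[:k] if share >= floor]
        for m_id in hits:
            received[m_id].append(score)

    model_scores = {}
    for m_id in models:
        scores = received.get(m_id, [])
        model_scores[m_id] = mean(scores) if scores else None
    return model_scores
```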