Abstract
The Journal Impact Factor is often used as a proxy measure for journal quality, but the empirical evidence is scarce. In particular, it is unclear how the peer review characteristics of a journal relate to its impact factor. We analysed 10,000 peer review reports submitted to 1,644 biomedical journals with impact factors ranging from 0.21 to 74.7. Two researchers hand-coded sentences using categories of content related to the thoroughness of the review (Materials and Methods, Presentation and Reporting, Results and Discussion, Importance and Relevance) and helpfulness (Suggestion and Solution, Examples, Praise, Criticism). We fine-tuned and validated transformer machine learning language models to classify sentences. We then examined the association between the number and percentage of sentences addressing different content categories and 10 groups defined by the Journal Impact Factor. The median length of reviews increased with higher impact factor, from 185 words (group 1) to 387 words (group 10). The percentage of sentences addressing Materials and Methods was greater in the highest Journal Impact Factor journals than in the lowest Journal Impact Factor group. The results for Presentation and Reporting went in the opposite direction, with the highest Journal Impact Factor journals giving less emphasis to such content. For helpfulness, reviews for higher impact factor journals devoted relatively less attention to Suggestion and Solution than those for lower impact factor journals. In conclusion, peer review in journals with higher impact factors tends to be more thorough, particularly in addressing study methods, while giving relatively less emphasis to presentation or suggesting solutions. Differences were modest and variability high, indicating that the Journal Impact Factor is a poor predictor of the quality of peer review of an individual manuscript.
Citation: Severin A, Strinzel M, Egger M, Barros T, Sokolov A, Mouatt JV, et al. (2023) Relationship between journal impact factor and the thoroughness and helpfulness of peer reviews. PLoS Biol 21(8): e3002238.
https://doi.org/10.1371/journal.pbio.3002238
Academic Editor: Ulrich Dirnagl, Charite Universitatsmedizin Berlin, GERMANY
Received: August 17, 2022; Accepted: July 6, 2023; Published: August 29, 2023
Copyright: © 2023 Severin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files. The fine-tuned DistilBERT models, data, and code to verify the reproducibility of all tables and graphs are available at https://doi.org/10.5281/zenodo.8006829. Publons' data sharing policy prohibits us from publishing the raw text of the reviews and the annotated sentences.
Funding: This study was supported by Swiss National Science Foundation (SNSF) grant 32FP30-189498 to ME (see https://data.snf.ch/grants/grant/189498) and internal SNSF resources. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: MS and ME were employed by the SNSF and ANS was a PhD student supported by the SNSF at the time of the study. TB, ALS, and JVM were employed by Publons (now part of Web of Science) at the time of the study. SM has declared that no competing interests exist.
Abbreviations: CI, confidence interval; COST, European Cooperation in Science and Technology; DORA, San Francisco Declaration on Research Assessment; ESI, Essential Science Indicators
Introduction
Peer review is a method of scientific appraisal by which manuscripts submitted for publication in journals are evaluated by experts in the field for originality, rigour and validity of methods, and potential impact [1]. Peer review is an important scientific contribution and is increasingly visible on databases and researcher profiles [2,3]. In medicine, practitioners rely on sound evidence from clinical research to make a diagnosis or prognosis and choose a treatment. Recent developments, such as the retraction of peer-reviewed COVID-19 publications in prominent medical journals [4] or the emergence of predatory journals [5,6], have prompted concerns about the rigour and effectiveness of peer review. Despite these concerns, research into the quality of peer review is scarce. Little is known about the determinants and characteristics of high-quality peer review. The confidential nature of many peer review reports and the lack of databases and tools for assessing their quality have hampered larger-scale research on peer review.
The impact factor was originally developed to help libraries make indexing and purchasing decisions for their collections. It is a journal-based metric calculated each year by dividing the number of citations received in that year by papers published in the 2 preceding years by the number of "citable items" published during those 2 preceding years [7]. The reputation of a journal, its impact factor, and the perceived quality of peer review are among the most common criteria authors use to select journals in which to publish their work [8–10]. Assuming that citation frequency reflects a journal's importance in the field, the impact factor is often used as a proxy for journal quality [11]. It is also used in academic promotion, hiring decisions, and research funding allocation, leading scholars to seek publication in journals with high impact factors [12].
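To make the arithmetic behind this definition concrete, here is a minimal Python sketch; the function name and the citation and item counts are invented for illustration.

```python
def journal_impact_factor(citations_to_prev_two_years: int,
                          citable_items_prev_two_years: int) -> float:
    """Impact factor for year Y: citations received in Y to papers published in
    Y-1 and Y-2, divided by the citable items published in Y-1 and Y-2."""
    return citations_to_prev_two_years / citable_items_prev_two_years

# Hypothetical journal: 1,200 citations in 2019 to its 2017-2018 papers,
# and 400 citable items published in 2017-2018 -> impact factor of 3.0.
print(journal_impact_factor(1200, 400))
```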
Despite the use of the Journal Impact Factor as a proxy for a journal's quality, empirical research on the impact factor as a measure of journal quality is scarce [11]. In particular, it is unclear how the peer review characteristics of a journal relate to this metric. We combined human coding of peer review reports and quantitative text analysis to examine the association between peer review characteristics and the Journal Impact Factor in the medical and life sciences, based on a sample of 10,000 peer review reports. Specifically, we examined the impact factor's relationship with the absolute number and the percentages of sentences related to peer review thoroughness and helpfulness.
Results
Characteristics of the study sample
The sample included 5,067 reviews from the Essential Science Indicators (ESI) [13] research field Clinical Medicine, 943 from Environment and Ecology, 942 from Biology and Biochemistry, 733 from Psychiatry and Psychology, 633 from Pharmacology and Toxicology, 576 from Neuroscience and Behaviour, 566 from Molecular Biology and Genetics, 315 from Immunology, and 225 from Microbiology.
Across the 10 groups of journals defined by Journal Impact Factor deciles (1 = lowest, 10 = highest), the median Journal Impact Factor ranged from 1.23 to 8.03, the minimum ranged from 0.21 to 6.51, and the maximum from 1.45 to 74.70 (Table 1). The proportion of reviewers from Asia, Africa, South America, and Australia/Oceania declined when moving from Journal Impact Factor group 1 to group 10. In contrast, there was a trend in the opposite direction for Europe and North America. Information on the continent of affiliation was missing for 43.5% of reviews (4,355). The median length of peer review reports increased by about 202 words from group 1 (median number of words 185) to group 10 (387). S1 File details the 10 journals from each Journal Impact Factor group that provided the highest number of peer review reports, gives the complete list of journals, and shows the distribution of reviews across the 9 ESI disciplines.
Performance of coders and classifiers
The training of coders resulted in acceptable to good between-coder agreement, with a median Krippendorff's α across the 8 categories of 0.70. The final analyses included 10,000 review reports, comprising 188,106 sentences, which were submitted by 9,259 reviewers to 1,644 journals. In total, 9,590 unique manuscripts were reviewed.
In the annotated dataset, the most common categories based on human coding were Materials and Methods (coded in 823 sentences, or 41.2% of the 2,000 sentences), Suggestion and Solution (638 sentences; 34.2%), and Presentation and Reporting (626 sentences; 31.3%). In contrast, Praise (210; 10.5%) and Importance and Relevance (175; 8.8%) were the least frequent. On average, the training set had 444 sentences per category, as 1,160 sentences were allocated to more than 1 category. In out-of-sample predictions based on DistilBERT, a transformer model for text classification [14], precision, recall, and F1 scores (binary averages across both classes [absent/present]) were similar within categories (see S2 File). The classification was most accurate for Examples and Materials and Methods (F1 score 0.71) and least accurate for Criticism (0.57) and Results and Discussion (0.61). The prevalence predicted by the machine learning model was generally close to the human coding: point estimates did not differ by more than 3 percentage points. Overall, the machine learning classification closely mirrored human coding. Further details are given in S2 File.
Descriptive analysis: Thoroughness and helpfulness of peer review reports
The majority of sentences (107,413 sentences, 57.1%) contributed to more than 1 content category; a minority (23,997 sentences, 12.8%) were not assigned to any category. The average number of sentences addressing each of the 8 content categories in the set of 10,000 reviews ranged from 1.6 sentences on Importance and Relevance to 9.2 sentences on Materials and Methods (upper panel of Fig 1). The percentages of sentences addressing each category are shown in the lower panel of Fig 1. The content categories Materials and Methods (46.7% of sentences), Suggestion and Solution (34.5%), and Presentation and Reporting (30.0%) were most extensively covered. The category Results and Discussion was present in 16.3% of the sentences, and 13.1% were assigned to the category Examples. In contrast, only 8.4% of sentences addressed the Importance and Relevance of the study. Criticism (16.5%) was slightly more frequent than Praise (14.9%). Most distributions were wide and skewed to the right, with a peak at 0 sentences or 0% corresponding to reviews that did not address the content category (Fig 1).
Fig 1. Distribution of sentences in peer review reports allocated to 8 content categories.
The number (upper panel) and percentage of sentences (lower panel) in a review allocated to the 8 peer review content categories are shown. A sentence could be allocated to none, one, or several categories. Vertical dashed lines show the average number (upper panel) and average percentage of sentences (lower panel) after aggregating them to the level of reviews. Analysis based on 10,000 review reports. The data underlying this figure can be found in S1 Data.
Fig 2 shows the estimated number of sentences addressing the 8 content categories across the 10 Journal Impact Factor groups. For all categories, the number of sentences increased from Journal Impact Factor group 1 to group 10. However, the increases were modest on average, amounting to 2 or fewer additional sentences. The exception was Materials and Methods, where the difference between Journal Impact Factor groups 1 and 10 was 6.5 sentences on average.
Fig 2. Distribution of sentences in peer review reports allocated to 8 content categories by Journal Impact Factor group.
A sentence could be allocated to none, one, or several categories. Vertical dashed lines show the average number of sentences after aggregating numbers to the level of reviews. The numbers of sentences are displayed on a log scale. Analysis based on 10,000 review reports. The data underlying this figure can be found in S2 Data.
Fig 3 shows the estimated percentage of sentences across content categories and Journal Impact Factor groups. Among the thoroughness categories, the percentage of sentences addressing Materials and Methods increased from 40.4% to 51.8% from Journal Impact Factor group 1 to group 10. In contrast, attention to Presentation and Reporting declined from 32.9% in group 1 to 25.0% in group 10. No clear trends were evident for Results and Discussion or Importance and Relevance. For helpfulness, the percentage of sentences including Suggestion and Solution declined from 36.9% in group 1 to 30.3% in group 10. The prevalence of sentences providing Examples increased from 11.0% (group 1) to 13.3% (group 10). Praise decreased slightly, whereas Criticism increased slightly when moving from group 1 to group 10. The distributions were wide, even within groups of journals with similar impact factors.
Fig 3. Distribution of sentences in peer review reports allocated to 8 content categories by Journal Impact Factor group.
The percentage of sentences in a review allocated to the 8 peer review quality categories is shown. A sentence could be allocated to none, one, or several categories. Analysis based on 10,000 review reports. Vertical dashed lines show the average prevalence after aggregating prevalences to the level of reviews. The data underlying this figure can be found in S3 Data.
Regression analyses
The association between journal impact factor and the 8 content categories was analysed in 2 regression analyses. The first predicted the number of sentences in each content category across the 10 Journal Impact Factor groups; the second, the changes in the percentage of sentences addressing content categories. All coefficients and standard errors are available in S3 File.
The predicted numbers of sentences are shown in Fig 4 with their 95% confidence intervals (CI). The results confirm those observed in the descriptive analyses. There was a substantial increase in the number of sentences addressing Materials and Methods from Journal Impact Factor group 1 (6.1 sentences; 95% CI 5.3 to 6.8) to group 10 (12.5 sentences; 95% CI 11.6 to 13.5), a difference of 6.4 sentences. For the other categories, only small increases were predicted, in line with the descriptive analyses.
Fig 4. Predicted number of sentences addressing thoroughness and helpfulness categories across the 10 Journal Impact Factor groups.
Predicted values and 95% confidence intervals are shown. Analysis based on 10,000 review reports. All negative binomial mixed-effects models include random intercepts for journal name and reviewer ID. The data underlying this figure can be found in S4 Data.
The predicted differences in the percentage of sentences addressing content categories are shown in Fig 5. Again, the results confirm those observed in the descriptive analyses. The prevalence of sentences on Materials and Methods in the journals with the highest impact factor was higher (+11.0 percentage points; 95% CI +7.9 to +14.1) than in the group with the lowest impact factor journals. The trend for sentences addressing Presentation and Reporting went in the opposite direction, with reviews submitted to the journals with the highest impact factor giving less emphasis to such content (−7.7 percentage points; 95% CI −10.0 to −5.4). There was slightly less focus on Importance and Relevance in the group of journals with the highest impact factors relative to the group with the lowest impact factors (−1.9 percentage points; 95% CI −3.5 to −0.4) and little evidence of a difference for Results and Discussion (+1.1 percentage points; 95% CI −0.54 to +2.8). Reviews for higher impact factor journals devoted less attention to Suggestion and Solution. The group with the highest Journal Impact Factor had 6.2 percentage points fewer sentences addressing Suggestion and Solution (95% CI −8.5 to −3.8). No substantive differences were observed for Examples (0.3 percentage points; 95% CI −1.7 to +2.3), Praise (1.6 percentage points; 95% CI −0.5 to +3.7), and Criticism (0.5 percentage points; 95% CI −1.0 to +2.0).
Fig 5. Percentage point change in the proportion of sentences addressing thoroughness and helpfulness categories relative to the lowest Journal Impact Factor group.
Regression coefficients and 95% confidence intervals are shown. Analysis based on 10,000 review reports. All linear mixed-effects models include random intercepts for journal name and reviewer ID. The data underlying this figure can be found in S5 Data.
Sensitivity analyses
We carried out several sensitivity analyses to assess the robustness of the findings. In the first, we removed reviews with 0 sentences or 0% in the respective content category, which resulted in similar regression coefficients and predicted counts. In the second, the sample was restricted to reviews with at least 10 sentences (sentence models) or 200 words (percentage models). This analysis confirmed that short reviews do not drive the associations. In the third sensitivity analysis, the regression models were adjusted for additional variables (discipline, career stage of reviewers, and the log number of reviews submitted by reviewers). The addition of these variables reduced the sample size from 10,000 to 5,806 reviews because of missing reviewer-level data. Again, the relationships between content categories and journal impact factor persisted. The fourth sensitivity analysis showed that results were generally similar for female and male reviewers. The fifth showed that the results changed little when the Journal Impact Factor groups were replaced with the raw Journal Impact Factor (S3 File).
Typical words in content categories
A keyness analysis [15] extracts typical words for each content category across the full corpus of 188,106 sentences. The analysis is based on χ2 tests comparing the frequencies of each word in sentences assigned to a content category with its frequencies in all other sentences. Table 2 reports the 50 words appearing more frequently in sentences assigned to the respective content category than in other sentences (according to the DistilBERT classification). The table supports the validity of the classification. Common words in the thoroughness categories were "data", "analysis", "methodology" (Materials and Methods); "please", "text", "sentence", "line", "figure" (Presentation and Reporting); "results", "discussion", "findings" (Results and Discussion); and "interesting", "important", "topic" (Importance and Relevance). For helpfulness, common distinctive words included "please", "need", "include" (Suggestion and Solution); "line", "page", "figure" (Examples); "interesting", "good", "well" (Praise); and "however", "(un)clear", "errors" (Criticism).
Table 2. The 50 key words for each content category.
Results rely on keyness analyses using χ2 tests for each word, comparing the frequency of the word in sentences where a content characteristic was present (target group) with sentences where the characteristic was absent (reference group). The table reports the 50 words with the highest χ2 values per category.
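For readers who want to reproduce this kind of analysis, the sketch below shows the idea in Python with a χ2 test per word (the paper itself used quanteda.textstats in R); the toy sentences and category labels are invented for illustration.

```python
from collections import Counter
from scipy.stats import chi2_contingency

# Toy corpus of (sentence, category_present) pairs; in the study, the labels
# come from the DistilBERT classifiers.
sentences = [
    ("the statistical analysis of the data needs more detail", True),
    ("please fix the typo in figure two", False),
    ("the methods section should describe the sampling", True),
    ("the introduction is well written", False),
]

target_counts, reference_counts = Counter(), Counter()
for text, present in sentences:
    (target_counts if present else reference_counts).update(text.split())

target_total = sum(target_counts.values())
reference_total = sum(reference_counts.values())

def keyness_chi2(word: str) -> float:
    """Chi-squared statistic from the 2x2 table of word occurrence vs. group."""
    a = target_counts[word]       # occurrences in target sentences (category present)
    b = reference_counts[word]    # occurrences in reference sentences (category absent)
    table = [[a, target_total - a], [b, reference_total - b]]
    chi2, _, _, _ = chi2_contingency(table)
    return chi2

vocabulary = set(target_counts) | set(reference_counts)
print(sorted(vocabulary, key=keyness_chi2, reverse=True)[:5])  # most "key" words
```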
Discussion
This study used fine-tuned transformer language models to analyse the content of peer review reports and examine the association of content with the Journal Impact Factor. We found that the impact factor was associated with the characteristics and content of peer review reports and reviewers. The length of reports increased with increasing Journal Impact Factor, with the number of relevant sentences increasing for all content categories, but especially for Materials and Methods. Expressed as the percentage of sentences addressing a category (and thus standardising for the different lengths of peer review reports), the prevalence of sentences providing suggestions and solutions, giving examples, or addressing the reporting of the work declined with increasing Journal Impact Factor. Finally, the proportion of reviewers from Asia, Africa, and South America also declined, whereas the proportion of reviewers from Europe and North America increased.
The limitations of the Journal Impact Factor are well documented [16–18], and there is growing agreement that it should not be used to evaluate the quality of research published in a journal. The San Francisco Declaration on Research Assessment (DORA) calls for the elimination of any journal-based metrics in funding, appointment, and promotion [19]. DORA is supported by thousands of universities, research institutes, and individuals. Our study shows that the peer reviews submitted to journals with a higher Journal Impact Factor may be more thorough than those submitted to lower impact journals. Should, therefore, the Journal Impact Factor be rehabilitated and used as a proxy measure for peer review quality? Similar to the distribution of citations in a journal, the length of reports and the prevalence of content related to thoroughness and helpfulness varied widely, within journals and between journals with similar Journal Impact Factors. In other words, the Journal Impact Factor is a poor proxy measure for the thoroughness or helpfulness of the peer review authors can expect when submitting their manuscripts.
The increase in the length of peer review reports with increasing Journal Impact Factor might be explained by the fact that reviewers from Europe and North America and reviewers with English as their first language tend to write longer reports and to review for higher impact journals [20]. Further, high impact factor journals may be more prestigious to review for and can thus afford to recruit more senior scholars. Of note, there is evidence suggesting that the quality of reports decreases with age or years of reviewing [21,22]. Interestingly, several medical journals with high impact factors have recently committed to improving diversity among their reviewers [23–25]. Unfortunately, owing to incomplete data, we could not examine the importance of the seniority of reviewers. Independently of seniority, reviewers may be brief when reviewing for a journal with a low impact factor, believing that a more superficial review will suffice. Alternatively, brief reviews are not necessarily superficial: the review of a very poor paper may not warrant a long text.
Peer review reports were hidden from view for many years, hampering research on their characteristics. Previous studies were based on smaller, selected samples. An early randomised trial examining the effect of blinding reviewers to the authors' identity on the quality of peer review was based on 221 reports submitted to a single journal [26]. Since then, science has become more open, embracing open access to publications and data and open peer review. Some journals now publish peer reviews and authors' responses alongside the articles [27–29]. Bibliographic databases have also started to publish reviews [30]. The European Cooperation in Science and Technology (COST) Action on new frontiers of peer review (PEERE), established in 2017 to examine peer review in different areas, was based on data from several hundred Elsevier journals from a wide range of disciplines [31].
To our knowledge, the Publons database is the largest database of peer review reports, and the only one not restricted to individual publishers or journals, making it a unique resource for research on peer review. Based on 10,000 peer review reports submitted to medical and life science journals, this is likely the largest study of peer review content done to date. It built on a previous analysis of the characteristics of scholars who review for predatory and legitimate journals [32]. Other strengths of this study include the careful classification and validation step, based on the hand coding of 2,000 sentences by trained coders. The performance of the classifiers was high, which is reassuring given that the sentence-level classification tasks deal with imbalanced and sometimes ambiguous categories. Performance is in line with recent studies. For example, a study using an extension of BERT to classify concepts such as nationalism, authoritarianism, and trust reported results for precision and recall similar to those of the present study [33]. We trained the algorithm on journals from many disciplines, which should make it applicable to fields other than medicine and the life sciences. Journals and funders could use our approach to analyse the thoroughness and helpfulness of their peer review. Journals could submit their peer review reports to an independent organisation for analysis. The results could help journals improve peer review, give feedback to peer reviewers, inform the training of peer reviewers, and help readers gauge the quality of the journals in their field. Further, such analyses could inform a reviewer credit system that could be used by funders and research institutions.
Our study has several weaknesses. Reviewers may be more likely to submit a review to Publons if they feel it meets general quality criteria. This could have introduced bias if the selection process into Publons' database depended on the Journal Impact Factor. However, the large number of journals within each Journal Impact Factor group makes it likely that the patterns observed are real and generalisable. We acknowledge that our findings are more reliable for the more frequent content categories than for the less frequent ones. We only examined peer review reports and could not consider the often extensive contributions made by journal editors and editorial staff to improve articles. In other words, although our results provide valuable insights into the peer review process, they offer an incomplete picture of the general quality assurance processes of journals. Because of the lack of information in the database, we could not analyse differences between open (signed) and anonymous peer review reports. Similarly, we could not distinguish between reviews of original research articles and other article types, for example, narrative review articles. Some journals do not consider importance and relevance when assessing submissions, and these journals may have influenced results for this category. We lacked the resources to identify these journals among the over 1,600 outlets included in our study to examine their influence. Finally, we could not assess to what extent the content of peer review reports affected acceptance or rejection of the paper.
Conclusions
This study of peer review characteristics indicates that peer review in journals with higher impact factors tends to be more thorough, particularly in addressing the study's methods, while giving relatively less emphasis to presentation or suggesting solutions. Our findings may have been influenced by differences in reviewer characteristics, the quality of submissions, and the attitude of reviewers towards the journals. Differences were modest, and the Journal Impact Factor is therefore a poor predictor of the quality of peer review of an individual manuscript.
Methods
Our study was based on peer review reports submitted to Publons from January 24, 2014, to May 23, 2022. Publons (part of Web of Science) is a platform for scholars to track their peer review activities and receive recognition for reviewing [34]. A total of 2,000 sentences from peer review reports were hand-coded and assigned to none, one, or more than one of 8 content categories related to thoroughness and helpfulness. The transformer model DistilBERT [14,35] was then used to classify the sentences in peer review reports as contributing or not contributing to the categories. Further details are provided in the Section "Classification and validation" below and in S2 File. After validating the classification performance using out-of-sample predictions, the association between the 2019 Journal Impact Factors [36] and the prevalence of relevant sentences in peer review reports was examined. The sample was limited to review reports submitted to medical and life sciences journals with an impact factor. The analysis took the hierarchical nature of the data into account.
Data sources
As of May 2022, the Publons database contained information on 15 million reviews carried out and submitted by more than 1,150,000 scholars for about 55,000 journals and conference proceedings. Reviews can be submitted to Publons in different ways. When scholars review for journals partnering with Publons and want recognition, Publons receives the review and some metadata directly from the journal. For other journals, scholars can upload the review and verify it by forwarding the confirmation email from the journal to Publons or by sending a screenshot from the peer review submission system. Publons audits a random subsample of emails and screenshots by contacting editors or journal administrators.
Publons randomly selected English-language peer review reports for the training set from a broad spectrum of journals, covering all ESI fields [37] except Physics, Space Science, and Mathematics. Reviews from the latter fields contained many mathematical formulae, which were difficult to classify. In the next step, a stratified random sample of 10,000 verified prepublication reviews written in English was drawn. First, the Publons database was restricted to reviews from medical and life sciences journals based on ESI research fields, resulting in a data set of approximately 5.2 million reviews. The ESI field Multidisciplinary was excluded as these journals publish articles not within the medical and life sciences domain (e.g., PLOS ONE, Nature, Science). Second, these reviews were divided into 10 equal groups based on Journal Impact Factor deciles. Third, 1,000 reviews were selected randomly from each of the 10 groups. Second-round peer review reports were excluded whenever this information was available. The continent of the reviewer's institutional affiliation, the total number of publications of the reviewer, the start and end year of the reviewers' publications, and gender were available for a subset of reviews. The gender of reviewers was classified with the gender-guesser Python package (version 0.4.0). Because the data on reviewer characteristics are incomplete and automated gender classification suffers from misclassification, these variables are only included in the regression models reported in S3 File.
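A minimal sketch of the stratification and sampling step, and of the gender classification with gender-guesser, might look as follows in Python; the file and column names are hypothetical, and the actual sampling was performed within the Publons database.

```python
import pandas as pd
import gender_guesser.detector as gender

# Hypothetical review-level table of verified, English-language, first-round reviews
# from medical and life sciences journals; 'jif' is the journal's 2019 impact factor.
reviews = pd.read_csv("reviews.csv")  # columns: review_id, journal, jif, reviewer_first_name

# Divide reviews into 10 equal groups based on Journal Impact Factor deciles (1 = lowest).
reviews["jif_group"] = pd.qcut(reviews["jif"], q=10, labels=range(1, 11))

# Draw 1,000 reviews at random from each decile group (assumes >= 1,000 reviews per group).
sample = (reviews.groupby("jif_group", group_keys=False, observed=True)
                 .apply(lambda g: g.sample(n=1000, random_state=42)))

# Classify reviewer gender from first names with gender-guesser, as in the paper.
detector = gender.Detector()
sample["reviewer_gender"] = sample["reviewer_first_name"].map(detector.get_gender)
print(sample["jif_group"].value_counts().sort_index())
```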
Classification and validation
Two authors (ASE and MS) were trained in coding sentences. After piloting and refining the coding scheme and establishing intercoder reliability, the coders labelled 2,000 sentences (1,000 sentences each). They allocated sentences to none, one, or several of the 8 content categories. We selected the 8 categories based on prior work, including the Review Quality Instrument and other scales and checklists [38], and previous studies using text analysis or machine learning to assess student and peer review reports [39–43]. In the manual coding process, the categories were refined, taking into account the ease of operationalising categories and their intercoder reliability. Based on the pilot data, Krippendorff's α, a measure of reliability in content analysis, was calculated [44].
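As an illustration of how such intercoder reliability can be computed, here is a short Python sketch using the krippendorff package (the paper does not state which implementation was used); the codings are invented, with one row per coder and one column per sentence.

```python
import numpy as np
import krippendorff

# Hypothetical binary codings (1 = category present) of 8 sentences by the 2 coders;
# np.nan would mark sentences not coded by a given coder.
codings = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0],  # coder 1
    [1, 0, 1, 0, 0, 0, 1, 0],  # coder 2
])

alpha = krippendorff.alpha(reliability_data=codings, level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```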
The categories describe, first, the Thoroughness of a review, measuring the degree to which a reviewer comments on (1) Materials and Methods (Did the reviewer comment on the methods of the manuscript?); (2) Presentation and Reporting (Did the reviewer comment on the presentation and reporting of the paper?); (3) Results and Discussion (Did the reviewer comment on the results and their interpretation?); and (4) the paper's Importance and Relevance (Did the reviewer comment on the importance or relevance of the manuscript?). Second, the Helpfulness of a review was examined based on comments addressing (5) Suggestion and Solution (Did the reviewer provide suggestions for improvement or solutions?); (6) Examples (Did the reviewer give examples to substantiate his or her comments?); (7) Praise (Did the reviewer identify strengths?); and (8) Criticism (Did the reviewer identify problems?). Categories were rated on a binary scale (1 for yes, 0 for no). A sentence could be coded as 1 for several categories. S4 File gives further details.
We used the transformer model DistilBERT to predict the absence or presence of the 8 characteristics in each sentence of the peer review reports [45]. For validation, the data were split randomly into a training set of 1,600 sentences and a held-out test set of 400 sentences. Eight DistilBERT models (one for each content category) were fine-tuned on the set of 1,600 sentences and predicted the categories in the remaining 400 sentences. Performance measures, including precision (i.e., the positive predictive value), recall (i.e., sensitivity), and the F1 score, were calculated. The F1 score is the harmonic mean of precision and recall and an overall measure of accuracy. It can range between 0 and 1, with higher values indicating better classification performance [46].
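A condensed sketch of this fine-tuning and validation step for a single content category, using the Hugging Face transformers and datasets libraries, is shown below; the example sentences, hyperparameters, and output directory are assumptions for illustration rather than the authors' exact configuration.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical labelled sentences for one category: 1 = present, 0 = absent.
train = Dataset.from_dict({"text": ["The sample size is too small.", "Well written."],
                           "label": [1, 0]})
test = Dataset.from_dict({"text": ["Please justify the statistical model."],
                          "label": [1]})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                           num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train, test = train.map(tokenize, batched=True), test.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Precision, recall, and F1 for the "present" class on the held-out test set.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds,
                                                               average="binary")
    return {"precision": precision, "recall": recall, "f1": f1}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-one-category", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
    eval_dataset=test,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```

In the study itself, one such classifier was fine-tuned per content category on the 1,600 training sentences and evaluated on the 400 held-out sentences.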
Overall, the classification performance of the fine-tuned DistilBERT language models was high. The average F1 score for the presence of a characteristic was 0.75, ranging from 0.68 (Praise) to 0.88 (Suggestion and Solution). For most categories, precision and recall were similar, indicating the absence of systematic measurement error. Importance and Relevance and Results and Discussion were the exceptions, with lower recall for characteristics being present. Balanced accuracy (the arithmetic mean of sensitivity and specificity) was also high, ranging from 0.78 to 0.91 (with a mean of 0.83 across the 8 categories). S2 File gives further details.
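The balanced accuracy reported here is simply the average of sensitivity and specificity, as the following small sketch with invented labels and predictions illustrates.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical human labels and model predictions for one category (1 = present, 0 = absent).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

sensitivity = recall_score(y_true, y_pred)               # recall for the "present" class
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall for the "absent" class
print(balanced_accuracy_score(y_true, y_pred), (sensitivity + specificity) / 2)  # identical values
```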
We compared the percentages of sentences addressing each category between the human annotation dataset and the output of the machine learning model. For the test set of 400 sentences, the percentage of sentences falling into each of the 8 categories was calculated, separately for the human codings and the DistilBERT predictions. There was a close match between the two: DistilBERT overestimated Importance and Relevance by 3.0 percentage points and underestimated Materials and Methods by 2.3 percentage points. For all other content categories, smaller differences were observed. Having assessed the validity of the classification, the machine learning classifiers were fine-tuned using all 2,000 labelled sentences, and the 8 classifiers were used to predict the presence or absence of content in the full text corpus of 188,106 sentences.
Finally, we identified distinctive words in each quality category using a "keyness" analysis [47]. The words retrieved from the keyness analyses reflect typical words used in each content category.
Statistical analysis
The association between peer review characteristics and Journal Impact Factor groups was examined in 2 ways. The analysis of the number of sentences in each category used negative binomial regression models. The analysis of the percentages of sentences addressing content categories relied on linear mixed-effects models. To account for the clustered nature of the data, we included random intercepts for journals and reviewers [48]. The regression models take the form

Yi = αj[i],k[i] + Σ (m = 2 to 10) βm 𝟙(JIF group of review i = m) + ϵi,

with

αj[i],k[i] = μ + uj[i] + vk[i],

where Yi is the count of sentences addressing a content category (for the negative binomial regression models) or the percentage of such sentences (for the linear mixed-effects models) in review i, the βm are the coefficients for the m = 2,…,10 categories of the categorical Journal Impact Factor variable (with m = 1 as the reference category), and ϵi is the unobserved error term. The model includes varying intercepts αj[i],k[i] for the J journals and K reviewers, composed of an overall intercept μ, a journal effect uj[i], and a reviewer effect vk[i]; 𝟙(·) denotes the indicator function.
All regression analyses were done in R (version 4.2.1). The fine-tuning of the classifiers and the sentence-level predictions were done in Python (version 3.8.13). The libraries used for data preparation, text analysis, supervised classification, and regression models were transformers (version 4.20.1) [49], quanteda (version 3.2.3) and quanteda.textstats (version 0.95) [50], lme4 (version 1.1.30) [51], glmmTMB (version 1.1.7) [52], ggeffects (version 1.1.5) [53], and tidyverse (version 1.3.2) [54].
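For orientation, a simplified Python analogue of one of the percentage models is sketched below with statsmodels; it includes a random intercept for journal only (the models reported in the paper were fitted in R with lme4 and glmmTMB and also include a random intercept for reviewer), and the data file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical review-level data: percentage of sentences on Materials and Methods,
# the Journal Impact Factor decile group (1-10), and the journal of each review.
df = pd.read_csv("review_level_data.csv")  # columns: pct_methods, jif_group, journal

# Linear mixed-effects model: fixed effects for JIF groups 2-10 (group 1 = reference),
# random intercept for journal.
model = smf.mixedlm("pct_methods ~ C(jif_group)", data=df, groups=df["journal"])
result = model.fit()
print(result.summary())
```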
Supporting information
S1 File. Journals and disciplines included in the study.
The 10 journals from each Journal Impact Factor group that provided the largest number of peer review reports, and all 1,664 journals included in the analysis, listed in alphabetical order. The numbers in parentheses represent the JIF and the number of reviews included in the sample.
https://doi.org/10.1371/journal.pbio.3002238.s001
(PDF)
S2 File. Further details on classification and validation.
Further information on the hand-coded set of sentences, the classification approach, and performance metrics for the classification, showing that the aggregated classifications closely mirror human coding of the same set of sentences. All results are out-of-sample predictions, meaning that the data in the held-out test set were not used for training the classifier during the validation steps.
https://doi.org/10.1371/journal.pbio.3002238.s002
(PDF)
S3 File. Additional details on regression analyses and sensitivity analyses.
All regression tables for the analyses reported in the paper, and plots and regression tables for the 5 sensitivity analyses. All sensitivity analyses were conducted for the prevalence-based and sentence-based models.
https://doi.org/10.1371/journal.pbio.3002238.s003
(PDF)
Acknowledgments
We are grateful to Anne Jorstad and Gabriel Okasa from the Swiss National Science Foundation (SNSF) data team for valuable comments on an earlier draft of this paper. We would also like to thank Marc Domingo (Publons, part of Web of Science) for support with the sampling procedure.
References
- 1. Severin A, Chataway J. Purposes of peer review: A qualitative study of stakeholder expectations and perceptions. Learn Publ. 2021;34:144–155.
- 2. ORCID Support. Peer Review. In: ORCID [Internet]. [cited 2022 Jan 20]. Available from: https://support.orcid.org/hc/en-us/articles/360006971333-Peer-Review
- 3. Malchesky PS. Track and verify your peer review with Publons. Artif Organs. 2017;41:217. pmid:28281285
- 4. Ledford H, Van Noorden R. COVID-19 retractions raise concerns about data oversight. Nature. 2020;582:160.
- 5. Grudniewicz A, Moher D, Cobey KD, Bryson GL, Cukier S, Allen K, et al. Predatory journals: no definition, no defence. Nature. 2019;576:210–212. pmid:31827288
- 6. Strinzel M, Severin A, Milzow K, Egger M. Blacklists and whitelists to tackle predatory publishing: a cross-sectional comparison and thematic analysis. MBio. 2019;10:e00411–e00419. pmid:31164459
- 7. Garfield E. The history and meaning of the journal impact factor. JAMA. 2006;295:90–93. pmid:16391221
- 8. Frank E. Authors' criteria for selecting journals. JAMA. 1994;272:163–164. pmid:8015134
- 9. Regazzi JJ, Aytac S. Author perceptions of journal quality. Learn Publ. 2008;21:225+.
- 10. Rees EL, Burton O, Asif A, Eva KW. A method for the madness: An international survey of health professions education authors' journal choice. Perspect Med Educ. 2022;11:165–172. pmid:35192135
- 11. Saha S, Saint S, Christakis DA. Impact factor: a valid measure of journal quality? J Med Libr Assoc. 2003;91:42–46. pmid:12572533
- 12. McKiernan EC, Schimanski LA, Muñoz Nieves C, Matthias L, Niles MT, Alperin JP. Use of the journal impact factor in academic review, promotion, and tenure evaluations. eLife. 2019;8:e47338. pmid:31364991
- 13. Essential Science Indicators. In: Overview [Internet]. [cited 2023 Mar 9]. Available from: https://esi.help.clarivate.com/Content/overview.htm?Highlight=esi%20essential%20science%20indicators
- 14. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv. 2020. Available from: http://arxiv.org/abs/1910.01108
- 15. Bondi M, Scott M, editors. Keyness in Texts. Amsterdam: John Benjamins Publishing Company; 2010. https://doi.org/10.1075/scl.41
- 16. Seglen PO. Why the impact factor of journals should not be used for evaluating research. BMJ. 1997;314(7079):498–502. pmid:9056804
- 17. de Rijcke S, Wouters PF, Rushforth AD, Franssen TP, Hammarfelt B. Evaluation practices and effects of indicator use—a literature review. Res Eval. 2016;25:161–169.
- 18. Bornmann L, Marx W, Gasparyan AY, Kitas GD. Diversity, value and limitations of the journal impact factor and alternative metrics. Rheumatol Int. 2012;32:1861–1867. pmid:22193219
- 19. DORA–San Francisco Declaration on Research Assessment (DORA). [cited 2019 Oct 2]. Available from: https://sfdora.org/
- 20. Global State of Peer Review report. In: Clarivate [Internet]. [cited 2023 Mar 10]. Available from: https://clarivate.com/lp/global-state-of-peer-review-report/
- 21. Callaham M, McCulloch C. Longitudinal trends in the performance of scientific peer reviewers. Ann Emerg Med. 2011;57:141–148. pmid:21074894
- 22. Evans AT, McNutt RA, Fletcher SW, Fletcher RH. The characteristics of peer reviewers who produce good-quality reviews. J Gen Intern Med. 1993;8:422–428. pmid:8410407
- 23. The Editors of the Lancet Group. The Lancet Group's commitments to gender equity and diversity. Lancet. 2019;394:452–453. pmid:31402014
- 24. A commitment to equality, diversity, and inclusion for BMJ and our journals. In: The BMJ [Internet]. 2021 Jul 23 [cited 2022 Apr 12]. Available from: https://blogs.bmj.com/bmj/2021/07/23/a-commitment-to-equality-diversity-and-inclusion-for-bmj-and-our-journals/
- 25. Fontanarosa PB, Flanagin A, Ayanian JZ, Bonow RO, Bressler NM, Christakis D, et al. Equity and the JAMA Network. JAMA. 2021;326:618–620. pmid:34081100
- 26. Godlee F, Gale CR, Martyn C. Effect on the quality of peer review of blinding reviewers and asking them to sign their reports. A randomized controlled trial. JAMA. 1998;280:237–240.
- 27. Open Peer Review. In: PLOS [Internet]. [cited 2022 Mar 1]. Available from: https://plos.org/resource/open-peer-review/
- 28. Wolfram D, Wang P, Hembree A, Park H. Open peer review: promoting transparency in open science. Scientometrics. 2020;125:1033–1051.
- 29. A decade of transparent peer review–Features–EMBO. [cited 2023 Mar 10]. Available from: https://www.embo.org/features/a-decade-of-transparent-peer-review/
- 30. Clarivate AHSPM. Introducing open peer review content in the Web of Science. In: Clarivate [Internet]. 2021 Sep 23 [cited 2022 Mar 1]. Available from: https://clarivate.com/blog/introducing-open-peer-review-content-in-the-web-of-science/
- 31. Squazzoni F, Ahrweiler P, Barros T, Bianchi F, Birukou A, Blom HJJ, et al. Unlock ways to share data on peer review. Nature. 2020;578:512–514. pmid:32099126
- 32. Severin A, Strinzel M, Egger M, Domingo M, Barros T. Characteristics of scholars who review for predatory and legitimate journals: linkage study of Cabells Scholarly Analytics and Publons data. BMJ Open. 2021;11:e050270. pmid:34290071
- 33. Bonikowski B, Luo Y, Stuhler O. Politics as usual? Measuring populism, nationalism, and authoritarianism in U.S. presidential campaigns (1952–2020) with neural language models. Sociol Methods Res. 2022;51:1721–1787.
- 34. Publons. Track more of your research impact. In: Publons [Internet]. [cited 2022 Jan 18]. Available from: http://publons.com
- 35. Tunstall L, von Werra L, Wolf T. Natural Language Processing with Transformers: Building Language Applications with Hugging Face. 1st ed. Beijing Boston Farnham Sebastopol Tokyo: O'Reilly Media; 2022.
- 36. 2019 Journal Impact Factors. Journal Citation Reports. London, UK: Clarivate Analytics; 2020.
- 37. Scope Notes [cited 2022 Jun 20]. Available from: https://esi.help.clarivate.com/Content/scope-notes.htm
- 38. Superchi C, González JA, Solà I, Cobo E, Hren D, Boutron I. Tools used to assess the quality of peer review reports: a methodological systematic review. BMC Med Res Methodol. 2019;19:48. pmid:30841850
- 39. Ramachandran L, Gehringer EF. Automated assessment of review quality using latent semantic analysis. 2011 IEEE 11th International Conference on Advanced Learning Technologies. Athens, GA, USA: IEEE; 2011. p. 136–138. https://doi.org/10.1109/ICALT.2011.46
- 40. Ghosal T, Kumar S, Bharti PK, Ekbal A. Peer review analyze: A novel benchmark resource for computational analysis of peer reviews. PLoS ONE. 2022;17:e0259238. pmid:35085252
- 41. Thelwall M, Papas E-R, Nyakoojo Z, Allen L, Weigert V. Automatically detecting open academic review praise and criticism. Online Inf Rev. 2020;44:1057–1076.
- 42. Buljan I, Garcia-Costa D, Grimaldo F, Squazzoni F, Marušić A. Large-scale language analysis of peer review reports. Rodgers P, Hengel E, editors. eLife. 2020;9:e53249. pmid:32678065
- 43. Luo J, Feliciani T, Reinhart M, Hartstein J, Das V, Alabi O, et al. Analyzing sentiments in peer review reports: Evidence from two science funding agencies. Quant Sci Stud. 2022;2:1271–1295.
- 44. Krippendorff K. Reliability in content analysis—Some common misconceptions and recommendations. Hum Commun Res. 2004;30:411–433.
- 45. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. New York: Cambridge University Press; 2008.
- 46. Olczak J, Pavlopoulos J, Prijs J, Ijpma FFA, Doornberg JN, Lundström C, et al. Presenting artificial intelligence, deep learning, and machine learning studies to clinicians and healthcare stakeholders: an introductory reference with a recommendation and a Clinical AI Research (CAIR) checklist proposal. Acta Orthop. 2021;92:513–525. pmid:33988081
- 47. Gabrielatos C. Chapter 12: Keyness analysis: nature, metrics and techniques. In: Taylor C, Marchi A, editors. Corpus Approaches to Discourse: A Critical Review. Oxford: Routledge; 2018. p. 31.
- 48. Jayasinghe UW, Marsh HW, Bond N. A multilevel cross-classified modelling approach to peer review of grant proposals: the effects of assessor and researcher attributes on assessor ratings. J R Stat Soc Ser A Stat Soc. 2003;166:279–300.
- 49. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. HuggingFace's transformers: state-of-the-art natural language processing. arXiv. 2020. Available from: http://arxiv.org/abs/1910.03771
- 50. Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, et al. quanteda: An R package for the quantitative analysis of textual data. J Open Source Softw. 2018;3:774.
- 51. Bates D, Maechler M, Bolker BM, Walker SC. Fitting linear mixed-effects models using lme4. J Stat Softw. 2015;67:1–48.
- 52. Brooks ME, Kristensen K, van Benthem KJ, Magnusson A, Berg CW, Nielsen A, et al. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. R J. 2017;9:378.
- 53. Lüdecke D. ggeffects: Tidy data frames of marginal effects from regression models. J Open Source Softw. 2018;3:772.
- 54. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the Tidyverse. J Open Source Softw. 2019;4:1686.