Grading and quality of clinical scientific evidence
Thanks to Andy Gray for bringing this to my attention.
The WHO Handbook for Guideline Development explains the process:
"Quality is defined as the ʹextent to which one can be confident that an estimate of effect or association is correctʹ. It is a continuum, as any discrete categorization involves some degree of arbitrariness. It is based on the following criteria:
● study design and any limitations of the studies, in terms of their conduct and analysis
● the consistency of the results across the available studies
● the precision of the results (wide or narrow confidence intervals)
● the directness (or applicability or external validity) of the evidence with respect to the populations, interventions and settings where the proposed intervention may be used
● the likelihood of publication bias
And additionally for observational studies:
● the magnitude of the effect
● presence or absence of a dose‐response gradient
● direction of plausible biases.
ʹQualityʹ of evidence is categorized as ʹhighʹ, ʹmoderateʹ, ʹlowʹ or ʹvery lowʹ.
Grades of quality of evidence and their definitions
Grade – Definition
High – Further research is very unlikely to change our confidence in the estimate of effect.
Moderate – Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate.
Low – Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate.
Very low – Any estimate of effect is very uncertain."
"Studies are broadly classified as two types:
1. RCTs – randomized controlled trials or randomized cluster trials
2. observational studies – including interrupted time‐series (or quasi‐experimental) designs, cohort studies, case‐control studies, and other types of design such as case series and case reports
The ʹdesignʹ is the baseline for rating quality of evidence. If you have studies of more than one design reporting the outcome, you should have a separate row in your table for each type.
Evidence based on RCTs begins as ʹhighʹ quality evidence and evidence from observational studies begins as ʹlowʹ quality evidence.
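To make this starting rule concrete, here is a minimal sketch in Python; the function name and design labels are my own illustrations, not terms from the handbook:

```python
def baseline_quality(study_design: str) -> str:
    """Return the GRADE starting quality level for a study design:
    randomized designs begin as 'high', observational designs as 'low'."""
    rct_designs = {"rct", "randomized controlled trial", "randomized cluster trial"}
    if study_design.strip().lower() in rct_designs:
        return "high"
    # observational: cohort, case-control, interrupted time-series, case series...
    return "low"
```

This baseline is then adjusted up or down as the remaining criteria are assessed.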
Having defined the type of studies, you then need to consider how well they were performed and analysed. For randomized controlled trials (RCTs), the main criteria for assessing trial limitations are:
● whether concealment of allocation to treatment group is adequate,
● whether participants and investigators were blinded, especially if the outcomes are measured subjectively and subject to bias,
● whether an intention‐to‐treat analysis is reported,
● whether all withdrawals and patients lost to follow‐up are accounted for,
● whether the trial was stopped early for benefit.
There are many checklists for assessing quality of RCTs. This is a minimum set of criteria. If you want to find out more about assessing quality of RCTs see “How to evaluate the quality of RCTs” for additional references.
For observational studies, the main criteria depend on the design: case control or cohort studies. For both, the methods used to select the population in the study and the comparability of the two groups are important. For case control studies the method of determining exposure to the factor of interest also needs to be evaluated. For cohort studies the method of measuring outcomes needs to be evaluated. For a checklist for evaluating observational studies, see the Newcastle‐Ottawa checklist (Appendix IV).
For studies of diagnostic accuracy, the QUADAS tools can be used.
Rating the limitations in study design requires a degree of judgment.
You need to decide whether the studies have ʹno limitationsʹ or ʹserious limitationsʹ or ʹvery serious limitationsʹ.
ʹNo limitationsʹ generally means that the majority of studies meet all of the minimum quality criteria for the design. The implication of this is that the rating of quality of evidence remains the same as the starting assessment.
ʹMinor limitationsʹ Sometimes minor flaws are found when analysing how the available studies were designed and performed. For example, allocation concealment may not be reported for one study out of several in a systematic review or a study could be nonblinded, but report objective outcomes. If you decide there are minor limitations, these should be noted in a footnote, but do not usually downgrade the quality.
ʹSerious limitationsʹ Serious means that one of the minimum criteria for quality is not met by the majority of studies in the review. This results in a ʹ‐1ʹ score for the overall quality rating (e.g. high becomes moderate).
ʹVery serious limitationsʹ Very serious means that at least two of the criteria proposed as potential study limitations are present in the majority of studies in the review. This results in a ʹ‐2ʹ score for quality.
The criteria that are actually used for downgrading the quality of evidence and the reason for the assessment should be explained in a footnote for the table.
If most of the evidence for an outcome (based on the weight given to each study in the meta‐analysis) came from trials that did not have serious limitations, the overall assessment for that outcome will be that there are no important limitations and so the final judgment on the quality of evidence will coincide with the study design.
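The no/serious/very serious scale above amounts to a simple score change, which can be sketched as follows (a rough illustration in Python; the handbook leaves the severity judgement itself to the reviewer):

```python
def limitations_downgrade(n_criteria_unmet: int) -> int:
    """Map study-limitation severity to a GRADE score change.

    0 minimum criteria unmet in the majority of studies -> no change,
    1 unmet -> 'serious limitations' (-1),
    2 or more unmet -> 'very serious limitations' (-2).
    """
    if n_criteria_unmet == 0:
        return 0
    if n_criteria_unmet == 1:
        return -1
    return -2
```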
3. Assessing consistency
Consistency refers to the similarity of estimates of effect across studies. To evaluate the degree of consistency of the results of the available studies you should evaluate the direction and size of the effect for each outcome. If a formal meta‐analysis was conducted, the result of the test for heterogeneity can be used to help assess consistency. Variability or inconsistency in results can arise from differences in the populations in the studies, differences in the interventions, or outcomes.
Differences in the direction of effect, the size of the differences in effect, and the significance of the differences guide the (inevitably somewhat arbitrary) decision about whether important inconsistency exists. If all the results of the studies for one outcome are in the same direction, with overlapping confidence intervals, there is unlikely to be important inconsistency. If there is inconsistency in the results, such as the largest trial showing results that contradict smaller trials, a ʹ‐1ʹ score should be applied. If the results are very heterogeneous, choose ʹvery seriousʹ, which will downgrade the evidence for this outcome by 2 levels.
If only one study is available, consistency is not applicable as a criterion.
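The consistency judgement can likewise be sketched as a small helper. The I² cut-offs of 50% and 75% below are common conventions for "serious" and "very serious" heterogeneity, not thresholds fixed by the handbook; direction of effect and overlap of confidence intervals still require judgement:

```python
def inconsistency_downgrade(n_studies, i_squared=None):
    """Suggested score change for inconsistency, based on a heterogeneity
    statistic (I-squared, in percent). With a single study, consistency
    is not applicable, so there is no change."""
    if n_studies <= 1 or i_squared is None:
        return 0
    if i_squared > 75:
        return -2  # very serious: downgrade by 2 levels
    if i_squared > 50:
        return -1  # serious inconsistency
    return 0
```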
4. Assessing directness
Directness, generalisability, external validity of study results, and applicability are all synonymous. There are two types of indirectness.
1. Indirect comparison – occurs when a comparison of intervention A versus B is not available, but A was compared with C and B was compared with C. Such trials allow indirect comparisons of the magnitude of effect of A versus B. Such evidence is of lower quality than head‐to‐head comparisons of A and B would provide.
2. Indirect population, intervention, comparator, or outcome: the question being addressed by the guideline development panel or by the authors of a systematic review is different from the available evidence regarding the population, intervention, comparator, or an outcome.
To determine whether important uncertainty exists, you should consider whether there is a compelling reason to expect important differences in the size of the effect. Because many interventions have more or less the same relative effects across most patient groups, criteria and judgments on directness should not be excessively stringent.
For some therapies—for example, behavioural interventions in which cultural differences are likely to be important—directness is more likely to be a problem. Similarly, reviewers may identify uncertainty about the directness of evidence for drugs that differ from those in the studies but are within the same class. Studies using surrogate outcomes generally provide less direct evidence than those using outcomes that are important to people. It is therefore prudent to use much more stringent criteria when considering the directness of evidence for surrogate outcomes.
For WHO guidelines ʹdirectnessʹ is a very important dimension that has relevance for the implementation of study results in actual practice.
The judgment about whether there is ʹsome uncertaintyʹ or ʹmajor uncertaintyʹ about directness can be challenging. Although there are no firm guidelines, if there is only one study, for example, in a developed‐world setting, and the intervention is likely to be altered according to setting, this would be rated as ʹmajor uncertaintyʹ (and therefore scored as ‐2).
5. Assessing precision
Results are imprecise when studies include relatively few patients and few events and thus have wide confidence intervals around the estimate of the effect. In this case the quality of the evidence is lower than it otherwise would be because of uncertainty in the results.
For dichotomous outcomes, downgrade the quality of evidence for any of the following reasons:
1. total (cumulative) sample size is lower than the calculated optimal information size (OIS)
2. total number of events is less than 300
3. 95% confidence interval (or alternative estimate of precision) around the pooled or best estimate of effect includes no effect and
a. if recommending in favor of an intervention – the upper confidence limit includes an effect that, if it were real, would represent a benefit that, given the downsides, would still be worth it
b. if recommending against an intervention – the lower confidence limit includes an effect that, if it were real, would represent a harm that, given the benefits, would still be unacceptable
4. 95% confidence interval (or alternative estimate of precision) around the pooled or best estimate of effect excludes no effect but
a. if recommending in favor of an intervention – the lower confidence limit crosses a threshold below which, given the downsides of the intervention, one would not recommend the intervention
b. if recommending against an intervention – the upper confidence limit crosses a threshold above which, given the benefits of an intervention, one would recommend the intervention.
When event rates are very low, 95% confidence intervals around relative effects can be very wide, but 95% confidence intervals around absolute effects may be narrow. Under such circumstances one may not downgrade the quality of evidence for imprecision.
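The OIS is essentially a conventional sample size calculation. As a rough illustration of the first two dichotomous criteria, here is a sketch using the standard normal-approximation formula for comparing two proportions (function and parameter names are my own):

```python
import math
from statistics import NormalDist

def optimal_information_size(p_control, p_treatment, alpha=0.05, power=0.80):
    """Total sample size across both groups from the conventional
    two-proportion formula; one common way to compute GRADE's
    'optimal information size'."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n_per_group = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return 2 * math.ceil(n_per_group)

def imprecise_dichotomous(total_n, total_events, ois):
    """Flag imprecision on the first two criteria: cumulative sample size
    below the OIS, or fewer than 300 total events."""
    return total_n < ois or total_events < 300
```

For example, detecting a drop from a 20% to a 10% event rate at the usual 5% significance and 80% power yields an OIS of a few hundred patients in total.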
For continuous outcomes, downgrade the quality of evidence when:
1. 95% confidence interval includes no effect and the upper or lower confidence limit crosses the minimal important difference, either for benefit or harm
2. if the minimal important difference is not known, or use of different outcome measures required calculation of an effect size, downgrade if the upper or lower confidence limit crosses an effect size of 0.5 in either direction.
6. Other considerations
Publication bias is a systematic underestimate or overestimate of the underlying beneficial or harmful effect due to the selective publication of studies or selective reporting of outcomes.
Reporting bias arises when investigators fail to report studies they have undertaken (typically those that show no effect) or neglect to report outcomes that they have measured (typically those for which they observed no effect).
Methods to detect the possibility of publication bias in systematic reviews exist, although authors of the reviews and guideline panels must often guess about the likelihood of reporting bias. A prototypical situation that should elicit suspicion of reporting bias occurs when published evidence is limited to a small number of trials, all of which were funded by a for‐profit organization.
You should consider the extent to which you are uncertain about the magnitude of the effect due to selective publication of studies or reporting of outcomes; if such bias is likely, downgrade the quality rating by one or even two levels.
When methodologically strong observational studies yield large or very large and consistent estimates of the magnitude of a treatment or exposure effect, we may be confident about the results. In those situations, the weak study design is unlikely to explain all of the apparent benefit or harm, even though observational studies are likely to provide an overestimate of the true effect.
The larger the magnitude of effect, the stronger the evidence becomes.
Only studies with no threats to validity (not downgraded for any reason) can be upgraded.
Dose‐response gradient
The presence of a dose‐response gradient may increase confidence in the findings of observational studies and thereby increase the quality of evidence, but this only applies to studies not downgraded for any reason. To rate the presence of dose‐response gradient:
● If there is no evidence of a dose‐response gradient, there is no change
● If there is evidence of a dose‐response gradient, upgrade the evidence for this outcome by 1 level
Direction of confounding factors
On occasion, all plausible biases from observational studies may be working to underestimate the true treatment effect. For instance, if only sicker patients receive an experimental intervention or exposure, yet they still fare better, it is likely that the actual intervention or exposure effect is larger than the data suggest.
Only studies with no threats to validity (not downgraded for any reason) can be upgraded. To rate the effect of all plausible residual confounding:
● If there is no evidence that the influence of all plausible residual confounding would reduce the observed effect, there is no change
● If there is evidence that the influence of all plausible residual confounding would reduce the observed effect, upgrade the evidence for this outcome by 1 level"
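Putting the pieces together, the final grade is just the starting level plus the accumulated score changes, clamped to the four categories. A minimal sketch (names are mine; remember the handbook's rule that upgrades apply only to observational evidence not downgraded for any reason):

```python
GRADE_LEVELS = ["very low", "low", "moderate", "high"]

def final_grade(start, score_changes):
    """Apply a list of GRADE score changes (e.g. [-1, -1] or [+1]) to the
    starting quality level and clamp to the four categories."""
    index = GRADE_LEVELS.index(start) + sum(score_changes)
    return GRADE_LEVELS[max(0, min(index, len(GRADE_LEVELS) - 1))]
```

So RCT evidence with serious limitations lands at "moderate", while a non-downgraded observational study with a large effect can rise from "low" to "moderate".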
"The strength of a recommendation reflects the degree of confidence that the desirable effects of adherence to a recommendation outweigh the undesirable effects.
Desirable effects can include beneficial health outcomes, less burden and greater savings. Undesirable effects can include harms, greater burden, and increased costs. Burdens are the demands of adhering to a recommendation that patients or caregivers (e.g., family) may find onerous, such as having to undergo more frequent tests or opting for a treatment that may require a longer time for recovery.
Although the degree of confidence is a continuum, the GRADE system defines two categories: strong and weak.
A strong recommendation is one for which the panel is confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects. This can be both in favor of or against an intervention.
A weak recommendation is one for which the panel concludes that the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but the panel is not confident about these trade‐offs.
Reasons for not being confident can include:
● absence of high‐quality evidence
● presence of imprecise estimates of benefits or harms
● uncertainty or variation in how different individuals value the outcomes
● small benefits
● benefits that are not worth the costs (including the costs of implementing the recommendation)
Despite the lack of a precise threshold for moving from a strong to a weak recommendation, the presence of important concerns about one or more of the above factors makes a weak recommendation more likely (see Table below). Panels should consider all these factors and make the reasons for their judgments explicit.
Implications of a strong recommendation are:
● For patients: Most people in your situation would want the recommended course of action and only a small proportion would not.
● For clinicians: Most patients should receive the recommended course of action. Adherence to this recommendation is a reasonable measure of good quality care.
● For policy makers: The recommendation can be adapted as a policy in most situations. Quality initiatives could use this recommendation to measure variations in quality.
Implications of a weak recommendation are:
● For patients: The majority of people in your situation would want the recommended course of action, but many would not.
● For clinicians: Be prepared to help patients to make a decision that is consistent with their own values.
● For policy makers: There is a need for substantial debate and involvement of stakeholders.
Table. Factors that may influence the strength of recommendations
Factor – Examples of strong recommendations – Examples of weak recommendations
Quality of evidence – Many high-quality randomized trials have demonstrated the benefit of inhaled steroids in asthma – Only case series have examined the utility of pleurodesis in pneumothorax
Uncertainty about the balance between desirable and undesirable effects – Aspirin in myocardial infarction reduces mortality with minimal toxicity, inconvenience and cost – Warfarin in low-risk patients with atrial fibrillation results in small stroke reduction but increased bleeding risk and substantial inconvenience
Uncertainty or variability in values and preferences – Young patients with lymphoma will invariably place a higher value on the life-prolonging effects of chemotherapy over treatment toxicity – Older patients with lymphoma may not place a higher value on the life-prolonging effects of chemotherapy over treatment toxicity
Uncertainty about whether the intervention represents a wise use of resources – The low cost of aspirin as prophylaxis against stroke in patients with transient ischemic attacks – The high cost of clopidogrel and dipyridamole/aspirin as prophylaxis against stroke in patients with transient ischemic attacks
Many recommendations are labeled as either strong or weak. However, because the ʺweakʺ label may sometimes be misinterpreted, other options exist. These include the use of strong/conditional or strong/qualified."
Here is a recent set of papers on the GRADE system:
1. Guyatt GH, Oxman AD, Vist G, Kunz R, Falck-Ytter Y, Alonso-Coello P, Schünemann HJ; GRADE Working Group. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924-926
2. Guyatt GH, Oxman AD, Kunz R, Vist GE, Falck-Ytter Y, Schünemann HJ; GRADE Working Group. Rating quality of evidence and strength of recommendations: What is "quality of evidence" and why is it important to clinicians? BMJ. 2008 May 3;336(7651):995-8
3. Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Jaeschke R, Vist GE, Williams JW Jr, Kunz R, Craig J, Montori VM, Bossuyt P, Guyatt GH; GRADE Working Group. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ. 2008 May 17;336(7653):1106-10
4. Guyatt GH, Oxman AD, Kunz R, Jaeschke R, Helfand M, Liberati A, Vist GE, Schünemann HJ; GRADE Working Group. Rating quality of evidence and strength of recommendations: Incorporating considerations of resources use into grading recommendations. BMJ. 2008 May 24;336(7654):1170-3
5. Guyatt GH, Oxman AD, Kunz R, Falck-Ytter Y, Vist GE, Liberati A, Schünemann HJ; GRADE Working Group. Rating quality of evidence and strength of recommendations: Going from evidence to recommendations. BMJ. 2008 May 10;336(7652):1049-51
6. Jaeschke R, Guyatt GH, Dellinger P, Schünemann H, Levy MM, Kunz R, Norris S, Bion J; GRADE Working Group. Use of GRADE grid to reach decisions on clinical practice guidelines when consensus is elusive. BMJ. 2008 Jul 31;337:a744