Data extraction forms for systematic reviews of observational studies

Once the search and selection of studies for inclusion are complete, the next step is to read the full text of each article identified for inclusion in the Systematic Review and extract the relevant data using a standardised data extraction form. The purpose of the data extraction step is to:

  • Objectively and accurately summarize studies in a common format to facilitate synthesis,
  • Identify numerical data if a meta-analysis is to take place, and
  • Obtain information to objectively assess the risk of bias in, and applicability of, studies.

The standardised data extraction form can be as long or as short as necessary and can be coded if desired, especially if a quantitative analysis is required. If the Systematic Review is narrative [no meta-analysis] and/or reviews a relatively small number of studies, coding is probably unnecessary.

Ideally, at least two reviewers should work independently to extract quantitative and other critical data from each study, and a fair procedure for resolving discrepancies should be established.

Data Extraction Tools

Tools available for data extraction include an Excel spreadsheet and the data extraction templates in software such as Covidence. If the Systematic Review includes a meta-analysis and/or reviews a large number of studies, dedicated data extraction software will likely be helpful. An Excel spreadsheet allows more customisation of what data to collect.

What Data to Collect?

Reviewers should develop the standardised form to suit the specific Systematic Review1 and use the key question[s] and inclusion and exclusion criteria as a guide. Use the PICOT framework to choose data elements for the data extraction form. Anticipate what the data summary table will need to include.

The following is an example of elements to include in a standardised data extraction form.

Citation

Include details such as journal, title, author, volume, page numbers etc.

Objective

Describe the study objective as stated by the authors

Population

Demographic detail of the participants in the study

Intervention

Describe the intervention or treatment

Comparison

Describe the control group or comparison intervention

Outcome

Record the results of the intervention and how they were measured

Type

Study type/design, e.g. randomised controlled trial

Comments

Notes on study quality, for grading purposes
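
As a purely illustrative sketch, the elements above could be laid out as one row per study in a spreadsheet or CSV file. The column names and the example row below are hypothetical and would need to be adapted to the specific review question.

    import csv

    # Hypothetical column headings mirroring the elements listed above (citation plus PICOT).
    COLUMNS = ["citation", "objective", "population", "intervention",
               "comparison", "outcome", "study_type", "comments"]

    # One invented row for illustration; in practice each included study becomes a row.
    example_row = {
        "citation": "Smith et al., 2010, Journal of Example Medicine, 12(3), 45-52",
        "objective": "As stated by the authors",
        "population": "Adults aged 40-65 with condition X",
        "intervention": "Treatment A, 10 mg daily for 12 weeks",
        "comparison": "Placebo",
        "outcome": "Change in symptom score at 12 weeks",
        "study_type": "Randomised controlled trial",
        "comments": "Notes on study quality for grading",
    }

    with open("data_extraction.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerow(example_row)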

This section describes how to undertake a systematic search using a range of methods to identify studies, manage the references retrieved by the searches, obtain documents and write up the search process. Practical examples of constructing search strategies are given in Appendix 2, and Appendix 3 provides examples of how the search should be documented. Issues around the identification of research evidence that are specific to review type such as adverse effects or clinical tests are discussed in the relevant chapters.

Conducting a thorough search to identify relevant studies is a key factor in minimizing bias in the review process. The search process should be as transparent as possible and documented in a way that enables it to be evaluated and reproduced.

Studies can be located using a combination of the following approaches:

  • Searching electronic databases

  • Visually scanning reference lists from relevant studies

  • Handsearching key journals and conference proceedings

  • Contacting study authors, experts, manufacturers, and other organisations

  • Searching relevant Internet resources

  • Citation searching

  • Using a project Internet site to canvass for studies

1.3.1.1 Minimizing publication and language biases

Decisions about where and how to search could unintentionally introduce bias into the review, so the team needs to consider, and try to minimize, the possible impact of search limitations. For example, restricting the searching to the use of electronic databases, which consist mainly of references to published journal articles, could result in the review being subject to publication bias as this approach is unlikely to identify studies that have not been published in peer reviewed journals. Wider searching is needed to identify research results circulated as reports or discussion papers. The identification of grey literature, such as unpublished papers, is difficult, but some are included on databases such as NTIS [National Technical Information Service] and HMIC [Health Management Information Consortium]. Libraries of specialist research organisations and professional societies may also provide access to collections of grey literature.

Searching databases and registers that include unpublished studies, such as records of ongoing research, conference proceedings and theses, can reduce the impact of publication bias. Conference proceedings provide information on both research in progress and completed research. Conference abstracts are recorded in some major bibliographic databases such as BIOSIS Previews, as well as in dedicated databases such as Index to Scientific and Technical proceedings, ZETOC, and the Conference Papers Index.31, 32, 33, 34 It is also worth consulting catalogues from major libraries, for example the British Library and the US National Library of Medicine. The abstracts in conference proceedings may only give limited information, and there can be differences between data presented in an abstract and that included in a final report.35, 36 For these reasons, researchers should try to acquire the full report, if there is one, before considering whether to include the results in a systematic review.

As already discussed, limiting searches to English language papers can introduce language bias. Large bibliographic databases, such as MEDLINE and EMBASE, do include a small number of non-English language journals.37 Using additional databases such as LILACS [Latin American and Caribbean Health Sciences Literature] that contain collections of non-English language research can minimize potential language bias.

1.3.1.2 Searching electronic databases

The selection of electronic databases to search will depend upon the review topic. Lists of databases are available from libraries and from database providers, such as Dialog and Wolters Kluwer, while subject experts will be familiar with the bibliographic databases in their field.

For reviews of health care interventions, MEDLINE and EMBASE are the databases most commonly used to identify studies. The Cochrane Central Register of Controlled Trials [CENTRAL] includes details of published articles taken from bibliographic databases and other published and unpublished sources.38 There are other databases with a narrower focus that could be equally appropriate. These include PsycINFO [psychology and psychiatry], AMED [complementary medicine], MANTIS [osteopathy and chiropractic] and CINAHL [nursing and allied health professions]. If the topic includes social care there are a range of databases available including ASSIA [Applied Social Sciences Index and Abstracts], CSA Sociological Abstracts, and CSA Social Services Abstracts, that could be used. The databases referred to above are all subject-based but there are others, such as AgeInfo, Ageline and ChildData, that focus on a specific population group that could be relevant to the review topic.

Due to the diversity of questions addressed by systematic reviews, there can be no agreed standard for what constitutes an acceptable search in terms of the number of databases searched. For example, if the review is on a cross-cutting public health topic such as housing and health it is advisable to search a wider range of databases than if the review is of a pharmaceutical intervention for a known health condition [See Chapter 3, Section 3.3 Identifying research evidence].

1.3.1.3 Searching other sources

In addition to searching electronic databases, published and unpublished research may also be obtained by using one or more of the following methods.

Scanning reference lists of relevant studies

Browsing the reference lists of papers [both primary studies and reviews] that have been identified by the database searches may identify further studies of interest.

Handsearching key journals

Handsearching involves scanning the content of journals, conference proceedings and abstracts, page by page. It is an important way of identifying very recent publications that have not yet been included and indexed by electronic databases or of including articles from journals that are not indexed by electronic databases.39 Handsearching can also ensure complete coverage of journal issues, including letters or commentaries, which may not be indexed by databases. It can also compensate for poor or inaccurate database indexing that can result in even the most carefully constructed strategy failing to identify relevant studies. Selecting which journals to handsearch can be done by analysing the results of the database searches to identify the journals that contain the largest number of relevant studies.

Searching trials registers

Trials can be identified by searching one or more of the many trials registers that exist. It can be a particularly useful approach to identifying unpublished or ongoing trials. Many of the registers are available on the Internet and some of the larger ones, such as www.ClinicalTrials.gov and www.who.int/trialsearch/, include the facility to search by drug name or by condition. While some registers are disease specific, others collect together trials from a specific country or region. Pharmaceutical companies may also make information about trials they have conducted available from their websites.

Contacting experts and manufacturers

Research groups and other experts as well as manufacturers may be useful sources of research not identified by the electronic searches, and may also be able to supply information about unpublished or ongoing research. Contacting relevant research centres or specialist libraries is another way of identifying potential studies. While these methods can all be useful, they are also time consuming and offer no guarantee of obtaining relevant information.

After a thorough and systematic search has been conducted, and relevant studies have been identified, topic experts can be asked to check the list to identify any known missing studies.

Searching relevant Internet resources

Internet searching can be a useful means of retrieving grey literature, such as unpublished papers, reports and conference abstracts. Identifying and scanning specific relevant websites will usually be more practical than using a general search engine such as ‘Google’.

Reviews of transport and ‘welfare to work’ programmes have reported how Internet searching of potentially relevant websites was effective in identifying additional studies to those retrieved from databases.40, 41 It is worth considering using the Internet when investigating a topic area where it is likely that studies have been published informally rather than in a journal indexed in a bibliographic database.

Internet searching should be carried out in as structured a way as possible and the procedure documented [see Appendix 3].

Citation searching

Citation searching involves selecting a number of key papers already identified for inclusion in the review and then searching for articles that have cited these papers. This approach should identify a cluster of related, and therefore highly relevant, papers. As this is in effect a search forward through time, citation searching is not suitable for identifying very recent papers, as they will not yet have been cited by other papers.

Citation searching used to be limited to using the indexes Science Citation Index Expanded, Social Sciences Citation Index, and Arts & Humanities Citation Index, but other resources [including CINAHL, PsycINFO and Google Scholar] now include cited references in their records, so these are also available for citation searching. Using similar services offered by journals such as the BMJ can also be helpful.

Using a project Internet site to canvass for studies

Where it has been agreed that a dedicated website should be set up for the review, for example as part of the overall dissemination strategy, this can be used to canvass for unpublished data/grey literature. Inclusion of an email contact address allows interested parties to submit information about relevant research. Posting the inclusion and exclusion criteria on the website may help to ensure submissions are appropriate. Throughout the review process the website should be continually updated with information about the studies identified. Personal responses should be sent to all respondents and, where appropriate, submitted material should be included in the library of references. Further details about dedicated project websites can be found in Section 1.3.8 Disseminating the findings of systematic reviews.

This approach should probably only be considered for ‘high profile’ reviews and then it should be as an adjunct to active canvassing for unpublished/grey literature.

1.3.1.4 Constructing the search strategy for electronic databases

Search strategies are explicitly designed to be highly sensitive so that as many potentially relevant studies as possible are retrieved. Consequently, the searches tend to retrieve a large number of records that do not meet the inclusion criteria. While it is possible to increase the precision of a search strategy, and so reduce the number of irrelevant papers retrieved, this may lead to relevant studies being missed.42

Constructing an effective combination of search terms involves breaking down the review question into ‘concepts’. Using the Population, Intervention, Comparator, and Outcomes elements from PICOS can help to structure the search, but it is not essential that every element is used. For example, it may be better not to include terms for the outcomes, since doing so could mean that the database fails to return relevant studies simply because the outcome is not mentioned prominently enough in the record, even though the study measured it. For each of the elements used, it is important to consider all the possible alternative terms. For example, a drug intervention may be known by a generic name and one or more proprietary names. Advice should be sought from the topic experts on the review team and advisory group.
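
As a rough illustration of this principle, alternative terms within a concept are combined with OR, and the concept blocks are then combined with AND. The sketch below is a minimal, hypothetical example: the concept terms and the plain OR/AND syntax are invented, and any real strategy must be adapted to the syntax of the database interface being searched.

    # Minimal sketch: build a Boolean search string from PICOS-style concept groups.
    # The concept terms below are invented examples, not a recommended strategy.
    concepts = {
        "population": ["adolescent*", "teenager*", "young people"],
        "intervention": ["exercise", "physical activity"],
        # Outcome terms are deliberately omitted, as discussed above.
    }

    def build_query(concept_groups):
        # OR together synonyms within each concept, then AND the concepts together.
        blocks = ["(" + " OR ".join(terms) + ")" for terms in concept_groups.values()]
        return " AND ".join(blocks)

    print(build_query(concepts))
    # (adolescent* OR teenager* OR young people) AND (exercise OR physical activity)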

For a detailed discussion of how to structure a search from a review question, including the use of search filters for study design, see Appendix 2.

1.3.1.5 Text mining

Text mining is a rapidly developing approach to utilizing the large amount of published text now available. Its potential use in systematic reviews is currently being explored and it may in future be an additional useful way of identifying relevant studies.43, 44 The aim of text mining is to identify connections between seemingly unrelated facts to generate new ideas or hypotheses. A number of processes are involved in the technique: a] Information Retrieval identifies documents to match a user's query; b] Natural Language Processing provides the linguistic data needed to perform c] Information Extraction, the process of automatically obtaining structured data from an unstructured natural language document; and d] Data Mining, the process of identifying patterns in large sets of data.45, 46 In future this approach may be helpful in automatically screening and ranking large numbers of potentially eligible studies prior to assessment by the researchers.
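
In its very simplest form, the kind of screening support mentioned above could amount to ranking retrieved records by how often their titles or abstracts mention terms of interest, so that the most promising records are assessed first. The sketch below is a hypothetical toy example only; real text mining tools use far more sophisticated linguistic processing.

    # Toy sketch: rank database records by a crude keyword score so that the most
    # promising titles/abstracts are screened first. Purely illustrative.
    records = [
        {"id": 1, "text": "Randomised trial of exercise therapy for chronic low back pain"},
        {"id": 2, "text": "Editorial: reflections on hospital management"},
        {"id": 3, "text": "Exercise versus usual care for back pain: a controlled trial"},
    ]
    keywords = {"exercise", "trial", "back", "pain"}

    def score(record):
        words = {w.strip(".,:;").lower() for w in record["text"].split()}
        return len(words & keywords)

    for record in sorted(records, key=score, reverse=True):
        print(score(record), record["id"], record["text"])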

There are a variety of text mining tools available; for example, TerMine and Acromine47 are tools dealing with term extraction and variation. Also of interest are KLEIO,48 which provides advanced searching facilities across MEDLINE, and FACTA, which finds associated concepts using text analysis.49 Further information about text mining and the use of these tools can be found on the National Centre for Text Mining website [www.nactem.ac.uk/].

1.3.1.6 Updating literature searches

Depending on the scope and timescale of the review, an update of the literature searches towards the end of the project may be required. If the initial searches were carried out some time before the final analysis is undertaken [e.g. six months] it may be necessary to re-run the searches to ensure that no recent papers are missed. To do this successfully the date the original search was conducted and the years covered by the search must have been recorded.

When doing update searches the update date field should be used rather than the actual date. This ensures that anything added to the database since the original search was conducted will be identified. If the database has added a lot of older material [e.g. from 1967] this will be removed by using the original date limits [e.g. 1990-2008] in combination with the update date field. For databases that do not include an update date field it may be better to run the whole search again and then use reference management software to remove those records that have already been identified and assessed.

1.3.1.7 Current awareness

If a review is covering an area where there is rapid change or if a major study is expected to report its findings in the near future, setting up current awareness alerts can ensure that new papers are identified as soon as they become available. Options for current awareness include e-mail alerts from journals and RSS feeds from databases or websites.

1.3.1.8 Managing references

To ensure the retrieved records are managed efficiently the team should agree working practices. For example, who will screen the references and record decisions about which documents to obtain and how to code these decisions; whether decisions about rejecting or obtaining documents should be made blind to others’ decisions; and how to store documents received. In addition, one member of the team should be responsible for identifying and removing duplicate references, ordering inter-library loans, recording the receipt of documents, and following up non-arrivals.

Using bibliographic software such as EndNote, Reference Manager or ProCite to record and manage references will help in documenting the process, streamline document management and make the production of reference lists for reports and journal papers easier. EPPI-Reviewer, a web-based review management programme, also incorporates reference management functions.4, 50 Alternatively it is possible to construct a database of references using a database package such as Microsoft Access or a word processing package. By creating a 'library' [database] of references, information can be shared by the whole review team, duplicated references can be identified and deleted more easily, and customised fields can be created where ordering decisions can be recorded.42 Specialised bibliographic management software packages have the facility to import references from electronic databases into the library and interact with word processing packages so bibliographies can be created in a variety of styles.
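As a very rough sketch of the kind of duplicate detection that a library of references supports, records exported from several databases can be compared on a normalised title and year. The field names and records below are hypothetical; bibliographic packages perform this matching internally and typically use more robust criteria such as DOI, authors and page numbers.

    # Minimal sketch: flag likely duplicate references by normalised title and year.
    references = [
        {"title": "Exercise therapy for low back pain.", "year": 2008, "source": "MEDLINE"},
        {"title": "Exercise Therapy for Low Back Pain",  "year": 2008, "source": "EMBASE"},
        {"title": "Water fluoridation and dental health", "year": 2000, "source": "MEDLINE"},
    ]

    def key(ref):
        # Lower-case the title, strip punctuation and collapse whitespace.
        title = "".join(ch for ch in ref["title"].lower() if ch.isalnum() or ch.isspace())
        return (" ".join(title.split()), ref["year"])

    seen, unique, duplicates = set(), [], []
    for ref in references:
        (duplicates if key(ref) in seen else unique).append(ref)
        seen.add(key(ref))

    print(f"{len(unique)} unique, {len(duplicates)} likely duplicate(s)")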

When an electronic library of references is used, it is important to establish in advance clear rules about which team members can add or amend records in the library, and that consistent terminology is used to record decisions. It is usually preferable to have one person from the team responsible for the library of references.

1.3.1.9 Obtaining documents

1.3.1.10 Documenting the search

The search process should be reported in sufficient detail so that it could be re-run at a later date. The easiest way to document the search is to record the process and the results contemporaneously. The decisions reached during development and any changes or amendments made should be recorded and explained. It is important to record all searches, including Internet searches, handsearching and contact with experts.

Providing the full detail of searches helps future researchers to re-run or update the searches and enables readers to evaluate the thoroughness of searching. The write up of the search should include information about the databases and interfaces searched [including the dates covered], full detailed search strategies [including any justifications for date or language restrictions] and the number of records retrieved.

When systematic reviews are reported in journal articles, limits on the word count may make it impossible to provide full details of the searches. In these circumstances as much information as possible should be provided within the available space. For example, ‘We searched MEDLINE, EMBASE and CINAHL’ is more helpful to the reader than ‘We conducted computer searches’. Many journals now have an electronic version of the publication where the full search details can be provided. Alternatively, the published report can include the review team’s contact details so full details of the search strategies can be requested. If a detailed report is being written for the commissioners of the review, the full search details should be included.

The use of flow charts to demonstrate how relevant papers are identified is detailed in Section 1.3.2 Study selection. Guidance on documenting the different aspects of the searching process is given in Appendix 3.

Summary: Identifying research evidence for systematic reviews

  • The search for studies should be comprehensive.

  • The extent of searching is determined by the research question and the resources available to the research team.

  • Thorough searching is best achieved by using a variety of search methods [electronic and manual] and by searching multiple, possibly overlapping resources.

  • Most of the searching is likely to take place at the beginning of the review with an update search towards the end.

  • Using bibliographic software to record and manage references will help in documenting the process, streamline document management and make the production of reference lists for reports and journal papers easier.

  • The search process should be documented in full or details provided of where the strategy can be obtained.

1.3.2 Study selection

Literature searching may result in a large number of potentially eligible records that need to be assessed for inclusion against predetermined criteria, only a small proportion of which may eventually be included in the review. The process for selecting studies should be explicit and conducted in such a way as to minimize the risk of errors and bias. This section explains the steps involved and the issues to be considered when planning and conducting study selection.

1.3.2.1 Process for study selection

The process by which decisions on the selection of studies will be made should be specified in the protocol, including who will carry out each stage and how it will be performed. The aim of selection is to ensure that only relevant studies are included in the review.

It is important that the selection process should minimise biases, which can occur when the decision to include or exclude certain studies may be affected by pre-formed opinions.52, 53, 54, 55, 56 The process for study selection therefore needs to be explicit and objective, and to minimize the potential for errors of judgement. It should be documented clearly to ensure it is reproducible [see Figure 1.1]. The selection of studies from electronic databases is usually conducted in two stages:

Stage 1: a first decision is made based on titles and, where available, abstracts. These should be assessed against the predetermined inclusion criteria. If it can be determined that an article does not meet the inclusion criteria then it can be rejected straight away. It is important to err on the side of over-inclusion during this first stage. The review question and the subsequent specification of the inclusion and exclusion criteria are likely to determine ease of rejection at this first stage: where the question and criteria are tightly focused, it is usually easier to be confident that the rejected studies are not relevant. Rejected citations fall into two main categories: those that are clearly not relevant, and those that address the topic of interest but fail on one or more criteria such as population. For those in the first category it is usually adequate to record the study as irrelevant without giving a reason; for those in the second category it is useful to record why the study failed to meet the inclusion criteria, as this increases the transparency of the selection process. Where abstracts are available, the amount and usefulness of the information they provide for decision-making often varies according to database and journal. Structured abstracts, such as those produced by the BMJ, are particularly useful at this stage of the review process.

Stage 2: for studies that appear to meet the inclusion criteria, or in cases when a definite decision cannot be made based on the title and/or abstract alone, the full paper should be obtained for detailed assessment against the inclusion criteria.

Some searching methods provide access to full papers directly, for example handsearching journals and contact with research groups, in which case assessment for inclusion is a one stage process.

Even when explicit inclusion criteria are specified, decisions concerning the inclusion of individual studies can remain subjective. Familiarity with the topic area and an understanding of the definitions being used are usually important.

The reliability of the decision process is increased if all papers are independently assessed by more than one researcher, and the decisions shown to be reproducible. One study found that on average a single researcher is likely to miss 8% of eligible studies, whereas a pair of researchers working independently would capture all eligible studies.57 Assessment of agreement is particularly important during the pilot phase [described later in this section], when evidence of poor agreement should lead to a revision of the selection criteria or an improvement of their coding. Agreement between assessors [inter-assessor reliability] may be formally assessed mathematically using a Kappa statistic [a measure of chance-corrected agreement].58
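For illustration, a chance-corrected agreement calculation for two assessors making include/exclude decisions might look like the following. This is a minimal sketch of Cohen's kappa; the counts are invented.

    # Minimal sketch of Cohen's kappa for two assessors' include/exclude decisions.
    # Counts are hypothetical: agreement and disagreement cells of a 2x2 table.
    both_include, a1_only, a2_only, both_exclude = 40, 5, 10, 145
    n = both_include + a1_only + a2_only + both_exclude

    observed = (both_include + both_exclude) / n
    # Expected agreement by chance, from each assessor's marginal proportions.
    p1_include = (both_include + a1_only) / n
    p2_include = (both_include + a2_only) / n
    expected = p1_include * p2_include + (1 - p1_include) * (1 - p2_include)

    kappa = (observed - expected) / (1 - expected)
    print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")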

The process for resolving disagreements between assessors should be specified in the protocol. Many disagreements may be simple oversights, whilst others may be matters of interpretation. These disagreements should be discussed and, where possible, resolved by consensus after referring to the protocol; if necessary a third person may be consulted.

If resources and time allow, the lists of included and excluded studies may be discussed with the advisory group. In addition, these lists can be posted on a dedicated website with a request for feedback on any missing studies, an approach used in a review of water fluoridation.59 For further information see Section 1.3.8 Disseminating the findings of systematic reviews.

Piloting the study selection process

The selection process should be piloted by applying the inclusion criteria to a sample of papers in order to check that they can be reliably interpreted and that they classify the studies appropriately. The pilot phase can be used to refine and clarify the inclusion criteria and ensure that the criteria can be applied consistently by more than one person. Piloting may also give an indication of the likely time needed for the full selection process.

Masking/blinding

Judgements about inclusion may be affected by knowledge of the authorship, institutions, journal titles and year of publication, or the results and conclusions of articles.60 Blind assessment may be possible by removing such identifying information, but the gain should be offset against the time and effort required to disguise the source of each article. Several studies have found that masking author, institution, journal name and study results is of limited value in study selection.61, 62 Therefore, the general opinion is that unmasked assessment by two independent researchers is acceptable.

Dealing with lack of information

Sometimes the amount of information reported about a study is insufficient to make a decision about inclusion, and it can be helpful to contact study authors to ask for more details. However, this requires time and resources, and the authors may not reply, particularly if the study is old. If authors are to be contacted it may be advisable to decide in advance how much time will be given to allow them to reply. If contacting authors is not practical then the studies in question could be excluded and listed as ‘potentially relevant studies’. If a decision is made to include such studies, the influence on the results of the review can be checked in a sensitivity analysis.

Dealing with duplication

It is important to look for duplicate publications of research results to ensure they are not treated as separate studies in the review. Multiple papers may be published for a number of reasons, including translations, results at different follow-up periods, or reporting of different outcomes. However, it is not always easy to identify duplicates, as they are often covert [i.e. not cross-referenced to one another], and neither authorship nor sample size is a reliable criterion for identifying duplication.63 Estimates of the prevalence of duplicate publication range from 1.4% to 28%,64 and studies have been found to have up to five duplicate reports.63 Multiple reports from the same study may include identical samples with different outcomes reported, or increasing samples with the same outcomes reported.

Multiple reporting can lead to biased results, as studies with significant results are more likely to be published or presented more frequently, leading to an overestimation of treatment effects when findings are combined.65 When multiple reports of a study are identified these should be treated as a single study but reference made to all the publications. It may be worthwhile comparing multiple publications for any discrepancies, which could be highlighted and the study authors contacted for clarification.

Documenting decisions

It is important to have a record of decisions made for each article. This may be in paper form, attached to paper copies of the articles, or the selection process may be partially or wholly computerised. If the search results are provided in electronic format, they can be imported into a reference management program such as EndNote, Reference Manager or ProCite which stores, displays and enables organisation of the records, and allows basic inclusion decisions to be made and recorded [in custom fields]. For more complex selection procedures, where several decisions and comments need to be recorded, a database program such as Microsoft Access may be of use. There are also programs specifically designed for carrying out systematic reviews which include aids for the selection process, such as TrialStat SRS and EPPI-Reviewer.

Reporting study selection

A flow chart showing the number of studies/papers remaining at each stage is a simple and useful way of documenting the study selection process. Recommendations for the reporting and presentation of a flow chart when reporting systematic reviews, with or without a meta-analysis, have been developed by the PRISMA group, formerly the QUOROM group. Publication of these guidelines is forthcoming.66, 67 In the meantime, the existing QUOROM guidelines for the reporting of meta-analyses of RCTs9 provide guidance that is equally applicable to all systematic reviews. Figure 1.1 is an example of a flow chart from a systematic review of treatments for childhood retinoblastoma.14

A list of studies excluded from the review should also be reported where possible, giving the reasons for exclusion. This list may be included in the report of the review as an appendix. In general, this list is most informative if it is restricted to ‘near misses’ [i.e. those studies that only narrowly failed to meet inclusion criteria and that readers might have expected to see included] rather than all the research evidence identified. Decisions to exclude studies may be reached at the title and abstract stage or at the full paper stage.

Figure 1.1: Flow chart of study selection process14

Summary: Study selection

  • In order to minimize bias, studies should be assessed for inclusion using selection criteria that flow directly from the review question and that have been piloted to check that they can be reliably applied.

  • Study selection is a staged process involving sifting through the citations located by the search, retrieving full reports of potentially relevant citations and, from their assessment, identifying those studies that fulfil the inclusion criteria.

  • Parallel independent assessments should be conducted to minimize the risk of errors. If disagreements occur between assessors, they should be resolved according to a predefined strategy using consensus and arbitration as appropriate.

  • The study selection process should be documented, detailing reasons for exclusion of studies that are ‘near-misses’.

1.3.3 Data extraction

Data extraction is the process by which researchers obtain the necessary information about study characteristics and findings from the included studies. Data extraction requirements will vary from review to review, and the extraction forms should be tailored to the review question. The first stage of any data extraction is to plan the type of analyses and list the tables that will be included in the report. This will help to identify which data should be extracted. General guidance on the process is given here, but the specific details will clearly depend on the individual review topic.

A sample data extraction form and details of the data extraction process should be included in the review protocol. A common problem at the protocol stage is that there may be limited familiarity with the topic area. This can lead to uncertainties, for example, about comparators and outcome measures. As a result, time can be wasted extracting unnecessary data and difficulties can arise when attempting to utilise and synthesise the data. Sufficient time early in the project should therefore be allocated to developing, piloting and refining the data extraction form.

The extraction of data is linked to assessment of study quality in that both processes are often undertaken at the same time.

Standardised data extraction forms can provide consistency in a systematic review, whilst reducing bias and improving validity and reliability.68 Use of an electronic form has the added advantage of being able to combine data extraction and data entry into one step, and to facilitate data analysis and the production of results tables for the final report.

1.3.3.1 Design

Integral to the design of the form is the category of data to be extracted. It may be numerical, fixed text such as yes/no, a ‘pick list’, or free text. However, the number of free text fields should be limited as much as possible to simplify the analysis of data. The form should be unambiguous and easy to use in order to minimize discrepancies. Instructions for completion should be provided, and each field should have decision rules about coding data in order to avoid ambiguity and to aid consistent completion. Piloting the form is essential. To reduce the risk of errors in data transcription, paper forms should only be used where direct completion of electronic forms is impossible.
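
By way of a hypothetical sketch, the principle of restricting fields to coded values with explicit decision rules can be expressed as simple validation. The field names and permitted values below are invented for illustration only.

    # Minimal sketch: coded extraction fields with simple validation rules,
    # illustrating pick lists, fixed yes/no/unclear values and a numeric field.
    FIELD_RULES = {
        "allocation_concealed": {"yes", "no", "unclear"},            # fixed text
        "study_design": {"RCT", "cohort", "case-control", "other"},  # pick list
    }

    def validate(record):
        errors = []
        for field, allowed in FIELD_RULES.items():
            if record.get(field) not in allowed:
                errors.append(f"{field}: '{record.get(field)}' not in {sorted(allowed)}")
        if not isinstance(record.get("n_randomised"), int) or record["n_randomised"] < 0:
            errors.append("n_randomised: must be a non-negative integer")
        return errors

    record = {"allocation_concealed": "unsure", "study_design": "RCT", "n_randomised": 120}
    print(validate(record))   # flags the mis-coded 'unsure' value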

1.3.3.2 Content

The nature of the data extracted will depend on the type of question being addressed and the types of study available. Box 1.4 gives an example of some of the information that might be extracted for a comparative study.

Box 1.4 Example information requirements for data extraction

General information

Researcher performing data extraction

Date of data extraction

Identification features of the study:

Record number [to uniquely identify study]

Author

Article title

Citation

Type of publication [e.g. journal article, conference abstract]

Country of origin

Source of funding

Study characteristics

Aim/objectives of the study

Study design

Study inclusion and exclusion criteria

Recruitment procedures used [e.g. details of randomisation, blinding]

Unit of allocation [e.g. participant, GP practice etc.]

Participant characteristics

Characteristics of participants at the beginning of the study e.g.

Age

Gender

Ethnicity

Socio-economic status

Disease characteristics

Co-morbidities

Number of participants in each characteristic category for intervention and control group[s] or mean/median characteristic values [record whether it is the number eligible, enrolled, or randomised that is reported in the study]

Intervention and setting

Setting in which the intervention is delivered

Description of the intervention[s] and control[s] [e.g. dose, route of administration, number of cycles, duration of cycle, care provider, how the intervention was developed, theoretical basis [where relevant]]

Description of co-interventions

Outcome data/results

Unit of assessment/analysis

Statistical techniques used

For each pre-specified outcome:

Whether reported

Definition used in study

Measurement tool or method used

Unit of measurement [if appropriate]

Length of follow-up, number and/or times of follow-up measurements

For all intervention group[s] and control group[s]:

Number of participants enrolled

Number of participants included in analysis

Number of withdrawals, exclusions, lost to follow-up

Summary outcome data e.g.

Dichotomous: number of events, number of participants

Continuous: mean and standard deviation

Type of analysis used in study [e.g. intention to treat, per protocol]

Results of study analysis e.g.

Dichotomous: odds ratio, risk ratio and confidence intervals, p-value

Continuous: mean difference, confidence intervals

If subgroup analysis is planned the above information on outcome data or results will need to be extracted for each patient subgroup

Additional outcomes

Record details of any additional relevant outcomes reported

Costs

Resource use

Adverse events

NB: Notes fields can be useful for occasional pieces of additional information or important comments that do not easily fit into the format of other fields.

The results to be extracted from each individual study may be reported in a variety of ways, and it is often necessary for a researcher to manipulate the available data into a common format. Manipulations of the reported findings are discussed in further detail in Section 1.3.5 Data synthesis, but can include using confidence intervals to determine standard errors or estimating the hazard ratio from a survival curve. Data can be categorised at this stage; however, it is advisable to extract as much of the reported data as is likely to be needed, and categorise at a later stage, so that detailed information is not lost during data extraction.
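One common manipulation mentioned above, recovering a standard error from a reported 95% confidence interval, can be sketched as follows. For a ratio measure the calculation is done on the log scale; the numbers used here are invented for illustration.

    import math

    # Minimal sketch: recover the standard error of a log risk ratio from a
    # reported 95% confidence interval. Values are hypothetical.
    risk_ratio, ci_lower, ci_upper = 0.80, 0.65, 0.98

    # For a ratio measure, work on the log scale; 3.92 = 2 * 1.96.
    se_log_rr = (math.log(ci_upper) - math.log(ci_lower)) / 3.92
    print(f"log RR = {math.log(risk_ratio):.3f}, SE = {se_log_rr:.3f}")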

1.3.3.3 Software

EPPI-Reviewer is a web application that enables researchers to manage all stages of a review in a single location. RevMan and TrialStat SRS are other software packages that can be used in data extraction for systematic reviews. Other tools commonly used include general word processing packages, spreadsheets and databases.

Whichever software package is used, ideally it should have the ability to provide different types of question coding. Some software will also allow researchers to develop quality control mechanisms for minimising data entry errors, for example, by specifying ranges of valid values.

1.3.3.4 Piloting data extraction

Data extraction forms should be piloted on a sample of included studies to ensure that all the relevant information is captured and that resources are not wasted on extracting data not required. The consistency of the data extracted should be assessed to make sure that those extracting the data are interpreting the forms, and the draft instructions and decision rules about coding data, in the same way. This will help to reduce data extraction errors. The exporting, analysis and outputs of the data extraction forms should also be pilot tested where appropriate, on a small sample of included studies. This will ensure that the exporting of data works correctly and the outputs provide the information required for data analysis and synthesis.

When using databases, piloting is particularly important as it becomes increasingly difficult to make changes once the template has been created and information has been entered into the database. Early production of the expected output is also the best way to check that the correct data structure has been set up.

1.3.3.5 Process of data extraction

Data extraction needs to be as unbiased and reliable as possible; however, it is prone to human error, and subjective decisions are often required. The number of researchers that will perform data extraction is likely to be influenced by constraints on time and resources. Ideally, two researchers should independently perform the data extraction [the level of inter-rater agreement is often measured using a Kappa statistic58]. As an accepted minimum, one researcher can extract the data with a second researcher independently checking the data extraction forms for accuracy and completeness. This method may result in significantly more errors than two researchers independently performing data extraction, but may also take significantly less time.69 Any disagreements should be noted and resolved by consensus among researchers or by arbitration by an additional independent researcher. A record of corrections or amendments to data extraction forms should be kept for future reference, particularly where there is genuine ambiguity [internal inconsistency] which cannot be resolved after discussion with the study authors. If using an electronic data extraction form that does not keep a record of amendments, completed forms can be printed and amendments recorded manually, before correcting the electronic version.

As with screening studies for inclusion, blinding researchers to the journal and author details has been recommended.70, 71 However this is a time-consuming operation, may not alter the results of a review and is likely to be of limited value.61

Reviews that include only published studies may be at risk of overestimating the treatment effect. Including data from unpublished studies [or unpublished outcomes] is therefore important in minimising bias. However, this can be time-consuming and the original data may no longer be available. Although those performing individual patient data [IPD] meta-analyses72 have generally been successful in obtaining data from the authors of unpublished studies, the same may not be true of other types of review. The practical difficulties of locating and obtaining information from unpublished studies may, for example, make the ideal of including relevant unpublished studies unachievable in the timescales available for many commissioned reviews. When information from unpublished studies is obtained, the published and unpublished material should be subjected to the same methodological evaluation.

Summary: Data extraction

  • Standardised data extraction forms provide consistency in a systematic review, thereby potentially reducing bias and improving validity and reliability.

  • Data extraction forms should be designed and developed with both the review question and subsequent analysis in mind. Sufficient time should be allocated early in the project for developing and piloting the data extraction forms.

  • The data extraction forms should contain only information required for descriptive purposes or for analyses later in the systematic review. Information on study characteristics should be sufficiently detailed to allow readers to assess the applicability of the findings to their area of interest.

  • Data extraction needs to be unbiased and reliable; however, it is prone to human error, and subjective decisions are often required. Clear instructions and decision rules about coding data should be used.

  • As a minimum, one researcher should extract the data with a second researcher independently checking the data extraction forms for accuracy and detail. If disagreements occur between assessors, they should be resolved according to a predefined strategy using consensus and arbitration as appropriate.

1.3.4. Quality assessment

1.3.4.1 Introduction

Research can vary considerably in methodological rigour. Flaws in the design or conduct of a study can result in bias, and in some cases this can have as much influence on the observed effects as the treatment itself. Important intervention effects, or lack of effect, can therefore be obscured by bias.

Recording the strengths and weaknesses of included studies provides an indication of whether the results have been unduly influenced by aspects of study design or conduct [essentially the extent to which the study results can be ‘believed’]. Assessment of study quality gives an indication of the strength of evidence provided by the review and can also inform the standards required for future research. Ultimately, quality assessment helps answer the question of whether the studies are robust enough to guide treatment, prevention, diagnostic or policy decisions.

Many useful books discuss the sources of bias in different study designs in detail, or provide an in-depth guide to critical appraisal.73, 74, 75 No single approach to assessing methodological quality is appropriate to all systematic reviews. The best approach will be determined by contextual, pragmatic and methodological considerations. However, the following sections describe the underlying principles of quality assessment and the key issues to consider.

1.3.4.2 Defining quality

Quality is a complex concept and the term is used in different ways. For example, a project using the Delphi consensus method with experts in the field of quality assessment of RCTs was unable to generate a definition of quality acceptable to all participants.76

Taking a broad view, the aim of assessing study quality is to establish how near the ‘truth’ its findings are likely to be and whether the findings are of relevance in the particular setting or patient group of interest. Quality assessment of any study is likely to consider the following:

  • Appropriateness of study design to the research objective

  • Risk of bias

  • Other issues related to study quality

    • Choice of outcome measure

    • Statistical issues

    • Quality of reporting

    • Quality of the intervention

    • Generalisability

The importance of each of these aspects of quality will depend on the focus and nature of the review. For example, issues around statistical analysis are less important if the study data are to be re-analysed in a meta-analysis, and the quality of reporting is irrelevant where data [either individual patient or aggregate] and information are obtained directly from those responsible for the study.

Appropriateness of study design

As discussed previously, types of study used to assess the effects of interventions can be arranged into a hierarchy, based broadly on their susceptibility to bias [Box 1.3]. Although the RCT is considered the best study design to evaluate the effect of an intervention, in cases where it is unworkable or unethical to randomise participants [e.g. when evaluating the effects of smoking on health], researchers may instead have to use a quasi-experimental or an observational design. Simply grading studies using this hierarchy does not provide an adequate assessment of study quality, because it does not take into account variations in quality among studies of the same design. Even RCTs can be implemented in such a way that findings are likely to be seriously biased and therefore of little value in decision-making.

It should be noted that the terminology used to describe study designs [e.g. cohort, prospective, retrospective, historical controls, etc.] can be ambiguous and used in different ways by different researchers. Therefore it is important to consider the individual aspects of the study design that may introduce bias rather than focussing on the descriptive label used. This is particularly important for the description of non-randomised studies.

Risk of bias

Bias refers to systematic deviations from the true underlying effect brought about by poor study design or conduct in the collection, analysis, interpretation, publication or review of data. Bias can easily obscure intervention effects, and differences in the risk of bias between studies can help explain differences in findings.

Internal validity is the extent to which an observed effect can be truly attributed to the intervention being evaluated, rather than to flaws in the design or conduct of the study. Any such flaws can increase the risk of bias.

The types of bias, and the ways in which they can be minimised by each type of study design, are described below.

Randomised controlled trials

The RCT is generally considered to be the most appropriate study design for evaluating the effects of an intervention. This is because, when properly conducted, it limits the risk of bias. The simplest form of RCT is known as the parallel group trial which randomises eligible participants to two or more groups, treats according to assignment, and compares the groups with respect to outcomes of interest.

Participants are allocated to groups using both randomisation [so that allocation involves the play of chance] and concealment [so that the intervention to be allocated cannot be known in advance of assignment]. When appropriately implemented, these aspects of design should ensure that the groups being compared are similar in all respects other than the intervention. The groups should be balanced for both known and unknown factors that might influence outcome, such that any observed differences should be attributable to the effect of the intervention rather than to intrinsic differences between the groups.

Allocation in this way avoids the influence of confounding, where an additional factor is associated both with receiving the intervention and with the outcome of interest. For example, babies who are breast fed are less likely to have gastrointestinal illnesses than those who are bottle fed. Though this might suggest evidence for the protective effect of breastfeeding, mothers who breast feed also tend to be of higher socio-economic status, which in itself is associated with a range of health benefits to the baby. Therefore, when evaluating any possible protective effects of breastfeeding socio-economic status should be considered as a potential confounding factor. In some cases, the possible confounding factor[s] may not be known or measurable. In an RCT, so long as a sufficient number of participants are assigned then the groups should be balanced with respect to both known and unknown potential confounding factors.

Selection bias or allocation bias occurs where there are systematic differences between comparison groups in terms of prognosis or responsiveness to treatment. Concealed assignment prevents investigators being able to predict which intervention will be allocated next and using that information to select which participant receives which treatment. For example, clinicians may want to ’try out‘ the new intervention in patients with a poorer prognosis. If they succeed in doing this by knowing or correctly ‘guessing’ the order of allocation, the intervention group will eventually contain more seriously ill participants than the comparison group, such that the intervention will probably appear less effective than if the two groups had been properly balanced.

The most robust method for concealing the sequence of treatment allocation is a central telephone randomisation service, in which the care provider calls an independent trial service, registers the participant’s details and then discovers which intervention they are to be given. Similarly, an on-site computer-based randomisation system that is not readable until the time of allocation might be used. Envelope methods of randomisation, where allocation details are stored in pre-prepared envelopes, are less robust and more easily subverted than centralised methods. Where this method is adopted, sealed opaque sequentially numbered envelopes that are only opened in front of the participant being randomised should be used. Unfortunately, the methods which are used to ensure that the randomisation sequence remains concealed during implementation [frequently referred to as concealment of allocation] are often poorly reported making it difficult to discern whether the methods were susceptible to bias.

Some studies, which may describe themselves as randomised, may allocate participants to groups on an alternating basis, or based on whether their date of birth is an odd or even number. Allocation in these studies is neither random nor concealed.

Performance bias refers to systematic differences [apart from the intervention of interest] in the treatment or care given to comparison groups during the study and detection bias refers to systematic differences between groups in the way that outcomes are ascertained. The risk of these biases can be minimized by ensuring that people involved in the study are unaware of which groups participants have been assigned to [i.e. they are blinded or masked]. Ideally, the participants, those administering the intervention, those assessing outcomes and those analysing the data should all be blinded. If not, the knowledge of which comparison group is which may consciously or unconsciously influence the behaviour of any of these people. The feasibility and/or success of blinding will partly depend on the intervention in question. There are situations where blinding is not possible owing to the nature of the intervention, for example where a particular intervention has an obvious physiological effect whereas the comparator does not, and others where it may be unethical [e.g. sham surgery carries risks with no intended benefit]. Methods of blinding for studies of drugs involve the use of pills and containers of identical size, shape and number [placebos]. Sham devices can be used for many device interventions and for some procedural interventions sham procedures can be used [e.g. sham acupuncture]. Blinding of outcome assessors is particularly important for more subjective outcome measures such as pain, but less important for objective measures such as mortality. Implementation of a blinding process does not however guarantee successful blinding in practice. In study reports, terms such as double-blind, triple-blind or single-blind can be used inconsistently77 and explicit reporting of blinding is often missing.78 It is important to clarify the exact details of the blinding process.

A well-conducted RCT should have processes in place to achieve complete and good quality data,79 in order to avoid attrition bias. Attrition bias refers to systematic differences between the comparison groups in terms of participants withdrawing or being excluded from the study. Participants may withdraw or drop-out from a study because the treatment has intolerable adverse effects, or on the other hand, they may recover and leave for that reason. They may simply be lost to follow-up, or they may be withdrawn due to a lack of data on outcome measures. Other reasons that participants may be excluded include mistaken randomisation of participants who, on review, did not meet the study inclusion criteria, and participants receiving the wrong intervention due to protocol violation. The likely impact of such withdrawals and exclusions needs to be considered carefully; if the exclusion is related to the intervention and outcome then it can bias the results [for example, not accounting for high numbers of withdrawals due to adverse effects in one intervention arm will unduly favour that intervention]. Serious bias can arise as a result of participants being withdrawn for apparently ad hoc reasons that are related to the success or failure of an intervention. There is evidence from the field of cancer research that exclusion of patients from the analysis may bias results,80 though how this may apply to other fields is unclear. An intention to treat [ITT] analysis is generally recommended in order to reduce the risk of bias.

An ITT analysis includes outcome data on all trial participants and analyses them according to the intervention to which they were randomised, regardless of the intervention[s] they actually received. Complete outcome data are often unavailable for participants who drop-out before the end of the trial, so in order to include all participants, assumptions need to be made about their missing outcome data [for example by imputation of missing values]. ITT analysis generally provides a more conservative, and potentially less biased, estimate in trials of effectiveness [see Section 1.3.5.2 Quantitative synthesis of comparative studies]. However, ITT analyses are often poorly described and applied81 and if assessing methodological quality associated with statistical analysis, care needs to be taken in judging whether the use of ITT analysis has minimized the risk of attrition bias and whether it was appropriately applied. If an ITT analysis is not used, then the study should at least report the proportion of participants excluded from the analysis to allow a researcher to judge whether this is likely to have led to bias.
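
As a toy numerical sketch of why the choice of denominator matters, compare an intention to treat analysis, which keeps every randomised participant in their original group, with an analysis restricted to completers. All counts below are invented, and the worst-case imputation shown is only one of several possible assumptions about missing outcomes.

    # Toy sketch: ITT versus completers-only event rates. All numbers are invented.
    randomised = {"intervention": 100, "control": 100}
    dropouts   = {"intervention": 20,  "control": 5}    # e.g. withdrawals due to adverse effects
    events     = {"intervention": 30,  "control": 40}   # events observed among completers

    for arm in randomised:
        completers = randomised[arm] - dropouts[arm]
        completers_rate = events[arm] / completers
        # Simple worst-case imputation: assume every drop-out experienced the event.
        itt_rate = (events[arm] + dropouts[arm]) / randomised[arm]
        print(f"{arm}: completers-only {completers_rate:.2f}, ITT (worst case) {itt_rate:.2f}")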

The minimum criteria for assessment of risk of bias in RCTs are set out in Box 1.5. While all these criteria are relevant to assessing risk of bias, their relative importance can be context specific. For example, the importance of blinded outcome assessment will vary depending on whether the outcomes involve subjective judgement [this may vary between different outcomes measured within the same trial]. Therefore, when planning which criteria to use it is important to think carefully about what characteristics would realistically be considered ideal. The Cochrane handbook provides a detailed assessment tool for use when assessing risk of bias in an RCT.82

Box 1.5: Criteria for assessment of risk of bias in RCTs

  • Was the method used to generate random allocations adequate?

  • Was the allocation adequately concealed?

  • Were the groups similar at the outset of the study in terms of prognostic factors, e.g. severity of disease?

  • Were the care providers, participants and outcome assessors blind to treatment allocation?  If any of these people were not blinded, what might be the likely impact on the risk of bias [for each outcome]?

  • Were there any unexpected imbalances in drop-outs between groups? If so, were they explained or adjusted for?

  • Is there any evidence to suggest that the authors measured more outcomes than they reported?

  • Did the analysis include an intention to treat analysis?  If so, was this appropriate and were appropriate methods used to account for missing data?

Other randomised study designs

In addition to parallel group RCTs, there are other randomised designs where further quality criteria may need to be considered. These are described below.

Randomised cross-over trials

In randomised cross-over trials all participants receive all the interventions. For example, in a two-arm cross-over trial, one group receives intervention A before intervention B, and the other group receives intervention B before intervention A. It is the sequence of interventions that is randomised. The advantage of cross-over trials is that they are potentially more efficient than parallel trials of a similar size, in which each participant receives only one of the interventions. The criteria for assessing risk of bias in RCTs also apply to cross-over trials, but there are some additional factors that need to be taken into consideration.

The cross-over design is inappropriate for conditions where the intervention may provide a cure or remission, where there is a risk of spontaneous improvement or resolution of the condition, where there is a risk of deterioration over the period of the trial [e.g. degenerative conditions] or where there is a risk that patients may die.83 This is because these outcomes lead either to the participant being unable to enter the second period or, on entering the second period, their condition is systematically different from that in the first period.

The possibility of a ‘carryover’ of the effect of the intervention provided in the first period into the second intervention period is an important concern in this study design.83 This risk is dealt with by building in a treatment-free or placebo ‘washout period’ between the intervention periods.83 The adequacy of the washout period length will need to be considered as part of the assessment of risk of bias.

The statistical analyses appropriate to cross-over trials are discussed in the synthesis section and statistical advice is likely to be required [see Section 1.3.5 Data synthesis].

Cluster randomised trials

A cluster randomised trial is a trial where clusters of people rather than single individuals are randomised to different interventions.84 For example, whole clinics or geographical locations may be randomised to receive particular interventions, rather than individuals.

The distinctive feature of cluster trials is that the outcome for each participant within a cluster may not be independent, since individuals within the cluster are likely to respond in a similar way to the intervention. Underlying reasons for this intra-cluster correlation include individuals in a cluster being affected in a similar manner due to shared exposure to a common environment such as specific hospital policies on discharge times; or personal interactions between cluster members and sharing of attitudes, behaviours and norms that may lead to similar responses.84 This has implications for estimating the sample size required [i.e. the sample needs to be larger than for an individually randomised trial] and the statistical analysis.
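The inflation of the required sample size is commonly summarised by the design effect; a sketch of the standard relationship, assuming an average cluster size m and an intra-cluster correlation coefficient [ICC] of ρ, is:

$$ \text{Design effect} = 1 + (m - 1)\rho $$

so the sample size calculated for an individually randomised trial is multiplied by this factor. For example, with clusters of 20 participants and an ICC of 0.05 the required sample is roughly doubled.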

When assessing the risk of selection bias in cluster randomised trials there are two factors that need to be considered: the randomisation of the clusters and how participants within clusters are selected into the study.85 The first can be dealt with by using an appropriate randomisation method with concealed allocation [clusters are often allocated at the outset]. However, where the trial design then requires selection of participants from within a cluster, the risk of selection bias should also be assessed. There is a clear risk of selection bias when the person recruiting participants knows in advance the clinical characteristics of a participant and which intervention they will receive. Also, potential participants may know in advance which intervention their cluster will receive, leading to different participation rates in the comparison groups.85 Two key methods for reducing bias in the selection of individuals within clusters have been identified: recruitment of individuals prior to the random allocation of clusters and, where this is not possible, use of an impartial individual to recruit participants following randomisation of the clusters.86

The statistical analyses appropriate to cluster randomised trials are discussed in Section 1.3.5 Data synthesis and statistical advice is likely to be required.

Wider reading is recommended prior to conducting a quality assessment of cluster randomised trials. Several texts discuss the design, analysis and reporting of this trial design.75, 84, 87, 88

Quasi-experimental studies

The main distinction between randomised and quasi-experimental studies is the way in which participants are allocated to the intervention and control groups; quasi-experimental studies do not use random assignment to create the comparison groups.

In non-randomised controlled studies, individuals are allocated to concurrent comparison groups, using methods other than randomisation. The lack of concealed randomised allocation increases the risk of selection bias.

Before-and-after studies evaluate participants before and after the introduction of an intervention. The comparison is usually made in the same group of participants, thus avoiding selection bias, although a different group can be used. In this type of design however, it can be difficult to account for confounding factors, secular trends, regression to the mean, and differences in the care of the participants apart from the intervention of interest.

An alternative to this is a ‘time series’ design. Interrupted time series studies are multiple observations over time that are ‘interrupted’, usually by an intervention or treatment and thus permit separating real intervention effects from other long-term trends. It is a study design used where others, such as RCTs, are not feasible, for example in the evaluation of a screening service or a mass media campaign. It is also frequently used in policy evaluation, for example to measure the effect of a smoking ban.

The circumstances in which, and extent to which, studies without randomisation are at risk of bias are not fully understood.89 A key influencing factor may be the extent to which prognosis influences selection for a particular intervention as well as eventual outcome.89 Because of the risk of bias, careful consideration should be given to the inclusion of quasi-experimental studies in a review to assess the effectiveness of an intervention. If included, researchers should think carefully about the strength of this evidence and how it should be interpreted.

A review of quality assessment tools designed for or used to assess studies without randomisation identified key aspects of quality as being particularly pertinent:89

  • How the treatment groups were created [how allocation occurred; and whether the study was designed to generate groups that are comparable on key prognostic factors e.g. by ‘matching’ participants in each group].

  • The comparability of intervention and comparison groups at the analysis stage. For example, whether prognostic factors were identified; and whether case-mix adjustment was used to account for any between group differences.

Other quality issues identified were similar to those for assessing performance, detection and attrition bias in RCTs: blinding of participants and investigators; the level of confidence that the participants received the intervention to which they were assigned and experienced the reported outcome as a result of that intervention; the adequacy of the follow-up; and appropriateness of the analysis.

Observational studies

In observational studies the intervention[s] that individuals receive are determined by usual practice or ‘real-world’ choices, as opposed to being actively allocated as part of the study protocol.

Observational studies are usually more susceptible to bias than experimental studies, and the conclusions that can be drawn from them are necessarily more tentative and are often hypothesis generating, highlighting areas for further research.

Observational designs such as cohort studies, case-control studies and case series are often considered to form a hierarchy of increasing risk of bias. However, such a hierarchy is not always helpful because, as noted before, the same label can be used to describe studies with different design features and there is not always agreement on the definitions of such studies. Attention should focus on specific features of the studies [e.g. participant allocation, outcome assessment] and the extent to which they are susceptible to bias.

In a cohort study design, a defined group of participants is followed over time and comparison is made between those who did and did not receive an intervention [e.g. a study may follow a cohort of women who choose to use oral contraceptives and compare them over time with women who choose other forms of contraception]. Prospective cohort studies are planned in advance and define their participants before the intervention of interest and follow them into the future. These are less likely to be susceptible to bias than retrospective cohort studies, which identify participants from past records and follow them from the time of that record.

Case-control studies compare groups from the same population with [cases] and without [controls] a specific outcome of interest, to evaluate the association between exposure to an intervention and the outcome. The risk of selection bias in such studies will be dependent on how the control group was selected. Groups may be matched to make them comparable for potential confounding factors. However, since analysis cannot be performed on matched variables, the matching criteria must be selected carefully, as this can give rise to ‘over-matching’ when the factors are related to allocation to the intervention.

Case series are observations made on a number of individuals [with no control group] and are not comparative. They can, however, provide useful information, for example about the unintentional effects of an intervention [see Chapter 4] and in such situations it is important to assess their quality.

Other issues related to study quality

Choice of outcome measure

As well as using blinding to minimise bias when assessing outcomes, it is usually necessary to consider the reliability or validity of the actual outcome measure being used [e.g. several different scales can be used to measure quality of life or psychological outcomes]. It is important that the scales are fully understood to enable comparison [e.g. a high score implies a favourable outcome in some scales and an unfavourable one in others].

The outcome should also be relevant and meaningful to both the intervention and the evaluation [i.e. a treatment intended to reduce mortality should measure mortality, not merely a range of biochemical indicators].

Statistical issues

Issues around statistical analysis are less important if the study data are to be combined in a meta-analysis; however, when studies are not being quantitatively pooled, it is important to assess statistical issues around design and analysis, for example whether a study is adequately powered to detect an effect of the intervention.90 The assessment of statistical power may involve relying on the sample size calculation in the primary study, where reported. However, defining population parameters for sample size calculations is a subjective judgement which may vary between investigators;91 for some review topics it may be appropriate to define a priori an adequate sample size for the purposes of the review.
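As a rough indication of the quantities such a judgement involves, one conventional formula for the number of participants required per group when comparing two means [shown here only as an illustrative sketch; other formulae apply to other outcome types] is:

$$ n \approx \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\sigma^{2}}{\delta^{2}} $$

where $\delta$ is the smallest difference considered clinically important, $\sigma$ is the standard deviation of the outcome, and $z_{1-\alpha/2}$ and $z_{1-\beta}$ are the standard normal deviates for the chosen significance level and power. The subjectivity noted above lies largely in the choice of $\delta$ and $\sigma$.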

Quality of reporting

Inadequate reporting of important aspects of methodological quality such as allocation concealment, blinding and statistical analysis is common,92 as is failure to report detail about the intervention and its implementation. Quality of reporting does not necessarily reflect the quality of the underlying methods or data, but when planning quality assessment it is important to decide how to deal with poor reporting. One approach is to assume that if an item is not reported then the criterion has not been met. While this may often be justifiable,93, 94 there is evidence to suggest that failure to report a method does not necessarily mean it has not been used.95, 96, 97 Therefore it is important to be accurate and distinguish between failure to report a criterion and failure to meet a criterion. For example, a criterion can be described as being met, not met, or unclear due to inadequate reporting.

There have been a number of initiatives aimed at improving the quality of reporting of primary research. The CONSORT statement contains a set of recommendations for the reporting of RCTs,98 the TREND statement provides guidelines for the reporting of non-randomised evaluations of behavioural and public health interventions,99 and the STROBE statement is an initiative to improve reporting of observational studies.100 The EQUATOR network promotes the transparent and accurate reporting of health research in a number of ways, including the use of these consensus reporting guidelines.101 It is anticipated that implementation of these guidelines will help improve the standard of reporting, which should make quality assessment more straightforward.

Quality of the intervention

In addition to study design, it is often helpful to assess the quality of the intervention and its implementation. At its most simplistic, the quality of an intervention refers to whether it has been used appropriately. This is a fairly straightforward assessment where, for example, drug titration studies have been conducted. It is more problematic where there is no preliminary research suggesting that an intervention should be administered in a particular way,102 or where the intervention requires a technical skill such as surgery or physiotherapy.103 It is important to establish to what extent these are standardised, as this will affect how the results should be interpreted.

The quality of the intervention is particularly relevant to complex interventions made up from a number of components, which act independently and inter-dependently.104 These include clinical interventions such as physiotherapy as well as public health interventions such as community-based programmes. The quality of an intervention can be conceptualised as having two main aspects: [i] whether the intervention has been appropriately defined and [ii] whether it has been delivered as planned [the integrity or fidelity of the intervention].

If the quality of the intervention is relevant, the review should assess whether the intervention was implemented as planned in the individual studies [i.e. how many participants received the intervention as planned, whether consistency of implementation was measured, and whether it is likely that participants received an unintended intervention/contamination of the intervention that may influence the results]. In some topic areas, for example when a sham device or procedure is being used, it may also be relevant to assess the quality of the comparator. When an intervention relies on the skill of the care provider it may be useful to assess whether the performance of those providing the intervention was measured. For more detailed information on complex interventions see Chapter 3.

Generalisability

Generalisability, also known as applicability or external validity, is not considered in detail in this section. In addition to assessing the risk of bias [internal validity], researchers may also consider how closely a study reflects routine practice or the usual setting where the intervention would be implemented. However, this is not an inherent characteristic of a study as the extent to which a study is ‘generalisable’ depends also on the situation to which the findings are being applied.105 Therefore the issue of generalisability is also raised in Section 1.3.3 Data extraction, in the context of defining inclusion criteria for the review, Section 1.2 The review protocol, and Section 1.3.6 Report writing.

1.3.4.3 The impact of study quality on the estimate of effect

Several empirical studies have explored how quality can influence the results of clinical trials [and therefore the results of reviews of trials]. Trials with double-blinding and adequate concealment of allocation have been found to indicate less beneficial treatment effects than trials without these features.106 Similarly, exclusion of lower quality studies has led to less beneficial effects in meta-analyses.106 In meta-analyses of subjectively assessed outcomes [e.g. patient reported outcomes], inadequate allocation concealment and lack of blinding have been associated with substantially more beneficial treatment effects, whereas for objective outcomes [e.g. mortality] there was a modest effect of inadequate allocation concealment and no effect of lack of blinding.107 There is some evidence about the relationship between study quality and the estimate of effect that is contradictory to the above,108, 109 though this may be due to the data sets used and how specific quality criteria were defined.

1.3.4.4 The process of quality assessment in systematic reviews

There are two main approaches towards assessing quality. One involves the use of checklists of quality items and the other of scales which provide an overall numerical quality score for each study.110

Tools for assessing quality

Checklists can be a reliable means of ensuring that all the studies assessed are critically appraised in a standardised way. There are many different checklists and scales readily available, 75, 111, 112, 113, 114, 115, 116 which can be modified to meet the requirements of the review, or a new detailed checklist, specific to the review, may be developed.

Because some items included may require a degree of subjective judgement, it is important to pilot the use of the checklist and to ensure that the quality assessment is undertaken independently by two researchers.

The use of scales with summary scores to distinguish high and low quality studies is questionable and not recommended.117, 118 Very few scales have been developed using standard techniques to establish their validity and reliability.113 The weighting assigned to methodological items varies considerably between scales,117 and does not usually take into account the direction of bias.119 An investigation comparing low-molecular-weight heparin [LMWH] with standard heparin for thromboprophylaxis in general surgery found that trials identified as ‘high quality’ by some of the 25 scales investigated indicated that LMWH was not superior to standard heparin, whereas trials identified as ‘high quality’ by other scales led to the opposite conclusion, that LMWH was beneficial.117 It is therefore preferable that aspects of quality such as blinding and treatment allocation [and their potential impact on study results] should be considered individually.117

Checklists by type of study design

In general checklists tend to be specific to particular study designs, and where reviews include more than one type of study design, separate lists can be used or a combined list selected or developed. Checklists have also been developed for use with both randomised and non-randomised studies such as that by Downs and Black.111

There are multiple systems available for the evaluation of RCTs,112, 113 in addition to the Cochrane handbook assessment tool for assessing risk of bias.82 In a review of checklists for the assessment of non-randomised studies, nearly 200 tools were identified. From these, six were recommended as being suitable for use in systematic reviews including non-randomised studies.89 The Cochrane Effective Practice and Organisation of Care Group [EPOC] have developed guidelines to assist researchers in making decisions about when to include studies that use interrupted time series designs and how to assess their methodological quality.115, 116 A useful checklist for observational studies was published as part of the US Agency for Healthcare Research and Quality’s [AHRQ] ‘Systems to Rate the Strength of Scientific Evidence’.112 The most recent version of the Cochrane Handbook also contains guidance on dealing with non-randomised studies in systematic reviews of interventions, from the protocol to synthesis stages.75

How will the quality assessment information be used?

Simply reporting which quality criteria were met by studies included in a systematic review is not sufficient. The implications of the quality assessment for interpreting results need to be explicitly considered.

Study quality can be incorporated into the synthesis either quantitatively through subgroup or sensitivity analyses [see Section 1.3.5.2: Quantitative synthesis], or in a narrative synthesis. In the latter, the quality assessment can be used to help interpret and explain differences in results across studies [e.g. unblinded studies with subjective outcomes may have consistently larger effects than blinded studies] and inform a qualitative interpretation of the risk of bias [see Section 1.3.5.1 Narrative synthesis].

Summary: Quality assessment

  • An important part of the systematic review process is to assess the risk of bias in included studies caused by inadequacies in study design, conduct or analysis that may have led to the treatment effect being over or underestimated.

  • Various tools are available but there is no single tool that is suitable for use in all reviews. Choice should be guided by:

    • Study design

    • The level of detail required in the assessment

    • The ability to distinguish between internal validity [risk of bias] and external validity [generalisability]

  • Using quality scores is problematic; it is preferable to consider individual aspects of methodological quality in the quality assessment and synthesis.

  • Where appropriate, the potential impact that methodological quality had on the findings of the included studies should be considered.

  • Detailed quality assessment can be time consuming if a review includes a large number of studies and may require considerable expertise in critical appraisal. If resources are limited, priority should be given to assessment of the key sources of bias.

1.3.5 Data synthesis

Synthesis involves the collation, combination and summary of the findings of individual studies included in the systematic review. Synthesis can be done quantitatively using formal statistical techniques such as meta-analysis, or if formal pooling of results is inappropriate, through a narrative approach. As well as drawing results together, synthesis should consider the strength of evidence, explore whether any observed effects are consistent across studies, and investigate possible reasons for any inconsistencies. This enables reliable conclusions to be drawn from the assembled body of evidence.

Deciding what type of synthesis is appropriate

Many systematic reviews evaluating the effects of health interventions focus on evidence from RCTs, the results of which, generally, can be combined quantitatively. However, not all health care questions can be addressed by RCTs, and systematic reviews do not automatically involve statistical pooling. Meta-analysis is not always possible or sensible. For example, pooling results obtained from diverse non-randomised study types is not recommended.120 Similarly, meta-analysis of poor quality studies could be seriously misleading as errors or biases in individual studies would be compounded and the very act of synthesis may give credence to poor quality studies. However, when used appropriately, meta-analysis has the advantage of being explicit in the way that data from individual studies are combined, and is a powerful tool for combining study findings, helping avoid misinterpretation and allowing meaningful conclusions to be drawn across studies.

The planned approach should be decided at the outset of the review, depending on the type of question posed and the type of studies that are likely to be available. There may be topics where it can be decided a priori that a narrative approach is appropriate. For example, in a systematic review of interventions for people bereaved by suicide, it was anticipated there would be such diversity in the included studies, in terms of settings, interventions and outcome measures, that a narrative synthesis alone was proposed in the protocol.121

Narrative and quantitative approaches are not mutually exclusive. Components of narrative synthesis can be usefully incorporated into a review that is primarily quantitative in focus and those that take a primarily narrative approach can incorporate some statistical analyses such as calculating a common outcome statistic for each study.

Initial descriptive synthesis

Both quantitative and narrative synthesis should begin by constructing a clear descriptive summary of the included studies. This is usually done by tabulating details about study type, interventions, numbers of participants, a summary of participant characteristics, outcomes and outcome measures. An indication of study quality or risk of bias may also be given in this or a separate table [see Section 1.3.2 Study selection and Section 1.3.4 Quality assessment]. An example is given in Table 1.1. If the review will not involve re-calculating summary statistics, but will rather rely on the reported results of the author’s analyses, these may also be included in the table. The descriptive process should be both explicit and rigorous and decisions about how to group and tabulate data should be based on the review question and what has been planned in the protocol. This initial phase will also be helpful in confirming that studies are similar and reliable enough to synthesise, and that it is appropriate to pool results.

Table 1.1: Example table describing studies included in a systematic review of the effectiveness of drug treatments for attention deficit hyperactivity disorder in children and adolescents122

| Study | Design | Intervention – N | Age [years] | Duration [weeks] | Core outcomes |
|---|---|---|---|---|---|
| Administered once daily | | | | | |
| Rapport, 1989 | C [5x] | MPH [5 mg/day, o.d.] – 45; MPH [10 mg/day, o.d.] – 45; MPH [15 mg/day, o.d.] – 45 | 5–12 | 5 | Core: no hyp; Abbreviated CTRS: total score. QoL: not reported. AE: not reported |
| DuPaul, 1993 | C [5x] | MPH [5 mg/day, o.d.] – 31; MPH [10 mg/day, o.d.] – 31; MPH [15 mg/day, o.d.] – 31 | 6–11 | 6 | Core: no hyp; Abbreviated CTRS: total score. QoL: not reported. AE: not reported |
| Werry, 1980 | C [3x] | MPH [0.40 mg/kg, o.d.] – 30 | 5.5–12.5 | 4 | Core: Conners’ Teacher Questionnaire: hyperactivity; Conners’ Parent Questionnaire: hyperactivity. QoL: CGI [physician]. AE: weight |
| Administered two or more times daily | | | | | |
| Brown, 1988 | C [4x] | MPH [8.76 mg/day, b.d.] – 11 | 13–15 | 8 | Core: CPRS: Hyperactivity Index; Conners’ Teacher Hyperactivity Index; ACTeRS: hyperactivity. QoL: not reported. AE: SERS [parents]; weight |
| Fischer, 1991 | C [3x] | MPH [0.40 mg/kg/day, b.d.] – 161 | 2.4–17.2 | 3 | Core: CPRS-R: Hyperactivity Index; CTRS-R: hyperactivity index; CTRS-R: hyperactivity. QoL: not reported. AE: CPRS-R: psychosomatic; SERS [parents, teachers]: number of side-effects, mean severity rating |
| Fitzpatrick, 1992 | C [4x] | MPH [10–15 mg/day, b.d.] – 19 | 6.9–11.5 | 8 | Core: Conners’ Hyperactivity Index [parents and teacher]; TOTS: hyperactivity [parents and teachers]. QoL: no CGI; comments ratings [parent/teacher]. AE: STESS [parents]; weight |
| Fine, 1993 | C [3x] | MPH [0.30 mg/kg/day [unclear], b.d.] – 12 | 6–10 | 3 | Core: not reported. QoL: not reported. AE: side-effects questionnaire |
| Hoeppner, 1997 | C [3x] | MPH [0.30 mg/kg/day, b.d.] – 50 | 6.1–18.2 | 4 | Core: CPRS: Hyperactivity Index; CTRS: Hyperactivity Index. QoL: not reported. AE: not reported |
| Handen, 1999 | C [3x] | MPH [12–15 mg/day, max. 3x] – 11 | 4–5.1 | 3 | Core: CTRS: Hyperactivity Index; CTRS: hyperactivity. QoL: not reported. AE: Side Effects Checklist [teachers, parents]; mean severity rating 0–6 |
| Manos, 1999 | C [4x] | MPH [10 mg/day, b.d.] – 42 | 5–17 | 4 | Core: no hyp; ASQ [parents and teachers]; ARS [parent]. QoL: no CGI; composite ratings [clinician]. AE: Side Effects Behaviour Monitoring Scale [parents] |
| Barkley, 2000 | C [5x] | MPH [10 mg/day, b.d.] – 38 | 12–17 | 5 | Core: no hyp; ADHD Total Parent/Teacher rating. QoL: not reported. AE: number and severity of side-effects [teachers, parents, self] |
| Tervo, 2002 | C [3x] | MPH [0.10 mg/kg/day, b.d.] – 41 | M=9.9 [2.9] | 3 | Core: no hyp; CBCL [parent]. QoL: not reported. AE: not reported |

ACTeRS, ADD-H Comprehensive Teachers’ Rating Scale; AE, adverse effects; ARS, ADHD Rating Scale; ASQ, Abbreviated Symptoms Questionnaire; b.d., twice daily; C, cross-over trial [number of cross-overs]; CBCL, Child Behaviour Checklist; CGI, Clinical Global Impression; CPRS, Conners’ Parent Rating Scale; CTRS, Conners’ Teacher Rating Scale; hyp, hyperactivity; MPH, methylphenidate hydrochloride; N, number of participants; o.d., once daily; P, parallel trial; PACS, Parental Account of Childhood Symptoms; SERS, Side Effects Rating Scale.

1.3.5.1 Narrative synthesis

All systematic reviews should contain text and tables to provide an initial descriptive summary and explanation of the characteristics and findings of the included studies. However simply describing the studies is not sufficient for a synthesis. The defining characteristic of narrative synthesis is the adoption of a textual approach that provides an analysis of the relationships within and between studies and an overall assessment of the robustness of the evidence.

A narrative synthesis of studies may be undertaken where studies are too diverse [either clinically or methodologically] to combine in a meta-analysis, but even where a meta-analysis is possible, aspects of narrative synthesis will usually be required in order to fully interpret the collected evidence.

Narrative synthesis is inherently a more subjective process than meta-analysis; therefore, the approach used should be rigorous and transparent to reduce the potential for bias. The idea of narrative synthesis within a systematic review should not be confused with broader terms like 'narrative review', which are sometimes used to describe reviews that are not systematic.

A general framework for narrative synthesis

How narrative syntheses are carried out varies widely, and historically there has been a lack of consensus as to the constituent elements of the approach or the conditions for establishing credibility. A project for the Economic and Social Research Council [ESRC] Methods Programme has developed guidance on the conduct of narrative synthesis in systematic reviews.123, 124, 125, 126 The guidance offers both a general framework and specific tools and techniques that help to increase the transparency and trustworthiness of narrative synthesis.

The general framework consists of four elements:

  • Developing a theory of how the intervention works, why and for whom

  • Developing a preliminary synthesis of findings of included studies

  • Exploring relationships within and between studies

  • Assessing the robustness of the synthesis

Though the framework is divided into these four elements, the elements themselves do not have to be undertaken in a strictly sequential manner, nor are they totally independent of one another. A researcher is likely to move iteratively among the activities that make up these four elements.

For each element of the framework, this guidance presents a range of practical tools and techniques. It is not mandatory [or indeed appropriate] to employ each one of these for every narrative synthesis, but the appropriate tools/techniques should be selected depending upon the nature of the evidence being synthesised. The reason for the choice of tool or technique should be specified in the methods section of the review.

A fuller description of these tools and techniques and narrative synthesis in general can be found in the ESRC guidance report.125, 126 It should be noted that the list given here is not comprehensive and other tools and techniques may be appropriate in certain circumstances.

The four elements of the narrative synthesis framework [and some of their related tools and techniques] are described below [Figure 1.2].


Figure 1.2: Example of applying the narrative synthesis framework

Developing a theory of how the intervention works, why and for whom

The extent to which theory will play a role will partly depend upon the type of intervention[s] being evaluated. For example, theory may only play a minor role in a systematic review looking at the effects of a single therapeutic drug on patient outcomes because many aspects of the ‘mechanism of action’ will have been established in early studies investigating pharmacodynamics, dose-finding etc. Alternatively, in a systematic review evaluating the effects of a psychosocial or educational programme, theories about the causal chain linking the intervention to the outcomes of interest will be of crucial importance and might be presented descriptively or in diagrammatic form, as displayed in Figure 1.3.

Figure 1.3: Interventions to increase use and function of smoke alarms: implicit theory of change model

Developing a preliminary synthesis of findings of included studies

Once data have been extracted from the relevant studies, the first step is to bring together, organise and describe their findings. The direction and size of the reported effects may be the starting point; alternatively, a collection of studies evaluating one kind of intervention might be divided into subgroups of studies with distinct populations, such as children and adults. It is important to remember that this is only the first step of the synthesis. The remaining elements of the framework need to be taken into account before it can be considered adequate as a narrative synthesis.

Table 1.2 describes a range of tools and techniques that might be employed at this stage of the synthesis.

Table 1.2: Developing a preliminary synthesis of findings of included studies

Textual descriptions of studies

A descriptive paragraph on each included study. These descriptions should be produced in a systematic way, including the same type of information for all studies if possible and in the same order. It may be useful for recording purposes to do this for all excluded studies as well.

Groupings and clusters

The included studies might be grouped at an early stage of the review, though it may be necessary to refine these initial groups as the synthesis develops. This can also be a useful way of aiding the process of description and analysis and looking for patterns within and across groups. It is important to use the review question[s] to inform decisions about how to group the included studies.

Tabulation

A common approach, used to represent data visually. The way in which data are tabulated may affect readers’ impressions of the relationships between studies, emphasising the importance of a narrative interpretation to supplement the tabulated data.

Transforming data into a common measure

In both narrative and quantitative synthesis it is important to ensure that data are presented in a common measure to allow an accurate description of the range of effects.

Vote-counting as a descriptive tool

Simple vote-counting might involve the tabulation of findings according to direction of effect [a minimal sketch is given after this table]. More complex approaches can be developed, both in terms of the categories used and by assigning different weights or scores to different categories. However, vote-counting disregards sample size and can be misleading, so the interpretation of the results must be approached with caution and subjected to further scrutiny.

Translating data: thematic analysis

A technique drawn from the analysis of qualitative data in primary research, which can be used to systematically identify the main, recurrent and/or most important [based on the review question] themes and/or concepts across multiple studies.127

Translating data: content analysis

A technique for compressing many words of text into fewer content categories based on explicit rules of coding.128 Unlike thematic analysis, it is essentially a quantitative method, since all the data are eventually converted into frequencies.
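As a simple illustration of vote-counting by direction of effect, the following sketch tallies hypothetical risk ratios [all study names and values are invented for illustration]:

```python
# Minimal sketch of vote-counting by direction of effect.
# Effect estimates are risk ratios; values below 1.0 favour the intervention.
# All study names and numbers are hypothetical.
studies = {
    "Study A": 0.80,
    "Study B": 0.95,
    "Study C": 1.10,
    "Study D": 0.70,
}

favours_intervention = sum(1 for rr in studies.values() if rr < 1.0)
favours_control = sum(1 for rr in studies.values() if rr > 1.0)

print(f"Favour intervention: {favours_intervention}; favour control: {favours_control}")
# The tally ignores study size and precision, which is why vote-counting
# should only ever be treated as a descriptive starting point.
```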

Exploring relationships within and between studies

Patterns emerging from the data during the preliminary synthesis need to be rigorously scrutinised in order to identify factors that might explain variations in the size/direction of effects.  At this stage there is a clear attempt to explore relationships between: [a] characteristics of individual studies and their reported findings; and [b] the findings of different studies.

However, when exploring heterogeneity in this way, it is necessary to be wary of uncovering associations between characteristics and results that are based on comparisons of many subgroups – some of these may simply have occurred by chance. Subgroup comparisons which are specified in advance [i.e. as part of the review protocol] are more likely to be plausible than those which are not.129, 130

The extent to which these factors can be explored in the review depends on how clearly they are reported in the primary research studies. The amount of detail may depend on the type of publication and the nature of the intervention being reviewed [e.g. highly standardised interventions may not be described as fully as more unusual ones].

Tools and techniques that might be employed at this stage of the synthesis are described in Table 1.3.

Table 1.3: Exploring relationships within and between studies

Graphs, frequency distributions, funnel plots, forest plots and L’Abbe plots

There are several visual or graphical tools that can help reviewers explore relationships within and between studies. These include: presenting results in graphical form; plotting findings [e.g. effect size] against study quality; plotting confidence intervals; and/or plotting outcome measures.

Moderator variables and subgroup analyses

This refers to the analysis of variables which can be expected to moderate the main effects being examined in the review. This can be done at the study level, by examining characteristics that vary between studies [such as study quality, study design or study setting] or by analysing characteristics of the sample [such as subgroups of participants].

Idea webbing and conceptual mapping

Involves using visual methods to help to construct groupings and relationships. The basic idea underpinning these approaches is [i] to group findings that are empirically and/or conceptually similar and [ii] to identify [again on the basis of empirical evidence and/or conceptual/theoretical arguments] relationships between these groupings.

Qualitative case descriptions

Any process in which descriptive data from studies included in the systematic review are used to try to explain differences in statistical findings. For example why one intervention outperforms another apparently similar intervention or why some studies are statistical outliers.

Investigator/methodological/ conceptual triangulation

Triangulation makes use of a combination of different perspectives and/or assessment methods to study a particular phenomenon. This could apply to the methodological and theoretical approaches adopted by the researchers undertaking primary studies included in a systematic review, e.g. investigator triangulation explores the extent to which heterogeneity in study results may be attributable to the diverse approaches taken by different researchers. Triangulation involves analysing the data in relation to the context in which they were produced, notably the disciplinary perspectives and expertise of the researchers producing the data.

Assessing the robustness of the synthesis

Towards the end of the synthesis process, the analysis of relationships as described above should lead into an overall assessment of the strength of the evidence. This is essential when drawing conclusions based on the narrative synthesis.

Robustness can relate to the methodological quality of the included studies [such as risk of bias], and/or the credibility of the product of the synthesis process. Obviously, these are related. The credibility of a synthesis will depend on both the quality and the quantity of the evidence base it is built on, and the method of synthesis and the clarity/transparency of its description. If primary studies of poor methodological quality are included in the review in an uncritical manner then this will affect the integrity of the synthesis. Attempts to minimize the introduction of bias might include ‘weighting’ the findings of studies according to technical quality [i.e. giving greater credence to the findings of more methodologically sound studies] and providing a clear justification for this. Similarly, a clear description of the potential sources of bias within the synthesis process itself helps establish credibility with the reader.

Table 1.4 describes the tools and techniques that might be employed at this stage of the synthesis.

Table 1.4: Assessing the robustness of the synthesis

Use of validity assessment

Use of specific rules to define weak, moderate or good evidence. An example is the approach used by the US Centers for Disease Control and Prevention131  although there are many other evidence grading systems available. Decisions about the strength of evidence are explicit although the criteria used are often debated. 

Reflecting critically on the synthesis process

Use of a critical discussion to address methodology of the synthesis used132 [especially focusing on its limitations and their potential influence on the results]; evidence used [quality, validity, generalisability] – with emphasis on the possible sources of bias and their potential influence on results of the synthesis; assumptions made; discrepancies and uncertainties identified; expected changes in technology or evidence [e.g. identified ongoing studies]; aspects that may have an influence on implementation and effectiveness in real settings. Such a discussion would provide information on both the robustness and generalisability of the synthesis.

Checking the synthesis with authors of primary studies

It is possible to consult with the authors of included primary studies in order to test the validity of the interpretations developed during the synthesis and the extent to which they are supported by the primary data.133 The authors of the primary studies may have useful insights into the possible accuracy and generalisability of the synthesis; this is most likely to be useful when the number of primary studies is small. This is a technique that has been used with qualitative evidence.

1.3.5.2 Quantitative synthesis of comparative studies

As with narrative synthesis, quantitative synthesis should be embedded in a review framework that is based on a clear hypothesis, should consider the direction and size of any observed intervention effects in relation to the strength of evidence, and should explore relationships within and between studies. The requirements for a careful and thoughtful approach, the need to assess the robustness of syntheses, and to reflect critically on the synthesis process, apply equally but are not repeated here.

This section aims to outline the rationale for quantitative synthesis of comparative studies and to focus on describing commonly used methods of combining study results and exploring heterogeneity. A more detailed overview of quantitative synthesis for systematic review is given in The Cochrane Handbook.75 Comprehensive accounts are also given by Whitehead134 and Cooper and Hedges,135 and a discussion of recent developments and more experimental approaches is given in a paper by Sutton and Higgins.136

Decisions about which comparisons to make, and which outcomes and summary effect measures to use, should have been addressed as part of the protocol development. However, as synthesis depends partly on what results are actually reported, some planned analyses may not be possible, and others may have to be adapted or developed. Any departures from the analyses planned in the protocol should be clearly justified and reported.

Decisions about what studies should and should not be combined are inevitably subjective and require careful discussion and judgement. As far as possible a priori consideration at the time of writing the protocol is desirable. There will always be differences between studies that address a common question. Reserving meta-analyses for only those studies that evaluate exactly the same interventions in near identical participant populations would be severely limiting and seldom achievable in practice. For example, whilst it may not be sensible to average the results of studies using different classes of experimental drugs or comparators, it may be reasonable to combine results of studies that use analogues or drugs with similar mechanisms of action. Likewise, it will often be reasonable to combine results of studies that have used similar but not identical comparators [e.g. placebo and no treatment]. Where there are substantial differences between studies addressing a broadly similar question, although combining their results to give an estimate of an average effect may be meaningless, a test of whether an overall effect is present might be informative. It can be useful to calculate summary statistics for each individual study to show the variability in results across studies. It may also be helpful to use meta-analysis methods to quantify this heterogeneity, even when combined estimates of effect are not produced.

Reasons for meta-analysis  

Combining the results of individual studies in a meta-analysis increases power and precision in estimating intervention effects. In most areas of health care, ‘breakthroughs’ are rare and we may reasonably expect that new interventions will lead to only modest improvements in outcomes; such improvements can of course be extremely important to individuals and of significant benefit in terms of population health. Large numbers of events are required to detect modest effects, which are easily obscured by the play of chance, and studies are often too small to do so reliably. Thus, in any group of small trials addressing similar questions, although a few may have demonstrated statistically significant results by chance alone, most are likely to be inconclusive. However, combining the results of studies in a meta-analysis provides increased numbers of participants, reduces random error, narrows confidence intervals, and provides a greater chance of detecting a real effect as statistically significant [i.e. increases statistical power]. Meta-analysis also allows observation and statistical exploration of the pattern of results across studies and quantification and exploration of any differences.

Combining comparative study results in a meta-analysis 

Most meta-analyses take a two-step approach in that they first analyse the outcome of interest and calculate summary statistics for each individual study. In the second stage, these individual study statistics are combined to give an overall summary estimate. This is usually calculated as a weighted average of the individual study estimates. The greater the weight awarded to a study, the more it influences the overall estimate. Studies are usually, at least in part, weighted in inverse proportion to their variance [or standard error squared], a method which essentially gives more weight to larger studies and less weight to smaller studies. It is also possible to weight studies according to other factors such as trial quality, but such methods are very seldom implemented and not recommended.

Two main statistical models are used. Fixed-effect models weight the contribution of each study proportional to the amount of information observed in the study. This considers only variability in results within studies and no allowance is made for variation between studies. Random-effects models allow for between-study variability in results by weighting studies using a combination of their own variance and the between-study variance. Where there is little between-study variability, the within-study variance will dominate and the random-effects weighting will tend towards that of the fixed-effect weighting. If there is substantial between-study variability, this dominates the weighting factor and within-study variability contributes little to the analysis. In this way, all trials will tend to contribute equally to the overall estimate and it can be argued that small studies will unduly influence the estimate. Those in favour of random-effects models argue that they formally allow for between-study variability and that the fixed-effect approach unrealistically assumes a single effect across trials and gives over-precise estimates. In practice, with well-defined questions, the results of both approaches are often very similar and it is common to run both to test the robustness of the choice of statistical model.

Generic inverse variance method of combining study results  

The generic inverse variance method is a widely used and easy to implement method of combining study results that underlies many of the approaches that are described later. It is very flexible and can be used to combine any type of effect measure provided that an effect estimate and its standard error are available from each study. Effect estimates may include adjusted estimates, estimates corrected for clustering and repeat measurements, or other summaries derived from more complex statistical methods.

A fixed-effect meta-analysis using the generic inverse variance method calculates a weighted average of study effect estimates [EE_IV] by summing individual effect estimates [EE_i], for example, the log odds ratio or the mean difference, and weighting these by the reciprocal of their squared standard errors [SE_i] as follows:137
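Written out, with study weights $w_i = 1/SE_i^{2}$ [the standard form of the inverse variance calculation], the pooled estimate and its standard error are:

$$ \widehat{EE}_{IV} = \frac{\sum_{i} w_i \, EE_i}{\sum_{i} w_i}, \qquad SE\!\left(\widehat{EE}_{IV}\right) = \frac{1}{\sqrt{\sum_{i} w_i}} $$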

A random-effects approach involves adjusting the study specific standard errors to incorporate between-study variation, which can be estimated from the effects and standard errors associated with the included studies.138
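In one common formulation [the DerSimonian and Laird approach, given here as an outline rather than the only possibility], the random-effects weights take the form:

$$ w_i^{*} = \frac{1}{SE_i^{2} + \hat{\tau}^{2}} $$

where $\hat{\tau}^{2}$ is an estimate of the between-study variance; the pooled estimate and its standard error are then obtained exactly as above, with $w_i^{*}$ in place of $w_i$. A minimal code sketch of both calculations [illustrative only; a real analysis would normally use established meta-analysis software] is:

```python
import math

def inverse_variance_meta(estimates, std_errors):
    """Generic inverse variance pooling: fixed-effect and DerSimonian-Laird
    random-effects. `estimates` are study effect estimates on an additive
    scale [e.g. log odds ratios or mean differences]; `std_errors` are their
    standard errors. A sketch only, with none of the checks or diagnostics a
    full implementation would include."""
    w = [1 / se ** 2 for se in std_errors]
    fixed = sum(wi * ei for wi, ei in zip(w, estimates)) / sum(w)
    fixed_se = math.sqrt(1 / sum(w))

    # DerSimonian-Laird estimate of the between-study variance tau^2
    q = sum(wi * (ei - fixed) ** 2 for wi, ei in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c) if c > 0 else 0.0

    w_star = [1 / (se ** 2 + tau2) for se in std_errors]
    random_ = sum(wi * ei for wi, ei in zip(w_star, estimates)) / sum(w_star)
    random_se = math.sqrt(1 / sum(w_star))
    return (fixed, fixed_se), (random_, random_se), tau2

# Hypothetical example: three studies reporting log odds ratios
print(inverse_variance_meta([-0.5, -0.2, -0.8], [0.25, 0.30, 0.40]))
```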

Types of data

Other ways to combine studies of effectiveness are available, some of which are specific to the nature of the data that have been collected, analysed and presented in the included studies.

Box 1.6: Illustration of how to calculate risk ratio, relative and absolute risk reduction, and odds ratios and their standard errors

| Group | Individuals with event | Individuals without event | Total |
|---|---|---|---|
| Experimental group | a = 2 | b = 18 | ne = 20 |
| Control group | c = 4 | d = 16 | nc = 20 |
| Total | 6 | 34 | N = 40 |

From these data the risk ratio, relative risk reduction, odds ratio and Peto odds ratio [and their standard errors] can be calculated as set out below.
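Using the conventional formulae and the example counts above [a worked sketch, rounded to two decimal places]:

$$ RR = \frac{a/n_e}{c/n_c} = \frac{2/20}{4/20} = 0.50, \qquad SE(\ln RR) = \sqrt{\tfrac{1}{a} - \tfrac{1}{n_e} + \tfrac{1}{c} - \tfrac{1}{n_c}} \approx 0.81 $$

$$ RRR = [1 - RR] \times 100\% = 50\%, \qquad ARR = \frac{c}{n_c} - \frac{a}{n_e} = 0.20 - 0.10 = 0.10 $$

$$ OR = \frac{a/b}{c/d} = \frac{2/18}{4/16} \approx 0.44, \qquad SE(\ln OR) = \sqrt{\tfrac{1}{a} + \tfrac{1}{b} + \tfrac{1}{c} + \tfrac{1}{d}} \approx 0.93 $$

$$ OR_{Peto} = \exp\!\left(\frac{O - E}{V}\right) \approx 0.47, \quad \text{where } O = a = 2, \quad E = \frac{n_e(a+c)}{N} = 3, \quad V = \frac{n_e\,n_c\,(a+c)(b+d)}{N^{2}(N-1)} \approx 1.31 $$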

Dichotomous/binary outcomes

Dichotomous outcomes are those that either happen or do not happen and an individual can be in one of only two states, for example having an acute myocardial infarction or not having an infarction. Dichotomous outcomes are most commonly expressed in terms of risks or odds. Although, in everyday use, the terms risk and odds are often used to mean the same thing, in the context of statistical evaluation they have quite specific meanings.

Risk describes the probability with which a health outcome will occur and is often expressed as a decimal number between 0.0 and 1.0, where 0.0 indicates that there is no risk of the event occurring and 1.0 indicates certainty that the event will take place. A risk of 0.4 indicates that about four in ten people will experience the event. Odds describe the ratio of the probability that an event will happen to the probability that it will not happen and can take any value between zero and infinity. Odds are sometimes expressed as the ratio of two integers, such that 0.001 can be written as 1:1000, indicating that for every one individual who will experience the event, one thousand will not.
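The two measures are directly related; for a given probability of an event the correspondence can be written as:

$$ \text{odds} = \frac{\text{risk}}{1 - \text{risk}}, \qquad \text{risk} = \frac{\text{odds}}{1 + \text{odds}} $$

so, for example, a risk of 0.4 corresponds to odds of 0.4/0.6 ≈ 0.67 [about 2:3].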

Risk ratios [RR], also known as relative risks, indicate the change in risk brought about by an intervention and are calculated as the probability of an event in the intervention group divided by the probability of an event in the control group [where the probability of an event is estimated by the total number of events observed in the group divided by the total number of individuals in that group]. A risk ratio of 2.0 indicates that the intervention leads to the risk becoming twice that of the comparator. A risk ratio of 0.75 indicates that the risk has been reduced to three quarters of that of the comparator. This can also be expressed in terms of a reduction in risk, whereby the relative risk reduction [RRR] is given as [1 minus the risk ratio] multiplied by 100. For example, a risk ratio of 2.0 corresponds to a relative risk reduction of -100% [a 100% increase], while a risk ratio of 0.75 corresponds to a relative risk reduction of 25%. Box 1.6 illustrates the calculation of these measures and further details of the formulae can be found elsewhere.137

Risk ratios can be combined using the generic inverse variance method applied to the log risk ratio and its standard error [either in a fixed-effect or a random-effects model]. Odds ratios [OR] describe the ratio of the odds of events occurring on treatment to the odds of events occurring on control, and therefore describes the multiplication of the odds of the outcome that occur with use of the intervention. Box 1.6 illustrates how to calculate the odds ratio for a single study. Odds ratios can be combined using the generic inverse variance method applied to the log odds ratio and its standard error as described above.

The Mantel-Haenszel method for combining risk ratios or odds ratios, which uses a different weighting scheme, is more robust when data are sparse, but assumes a fixed-effect model.137

The Peto odds ratio139 [OR_Peto] is an alternative estimate of a combined odds ratio in a fixed-effect model, and is based on the difference between the observed number of events and the number of events that would be expected [O-E] if there was no difference between experimental and control interventions [see Box 1.6]. Combining studies using the Peto method is straightforward, and it may be particularly useful for meta-analysis of dichotomous data when event rates are very low, and where other methods fail.

This approach works well when the effect is small [that is when the odds ratio is close to 1.0], events are relatively uncommon, and there are similar numbers in the experimental and control groups. The approach is commonly used to combine data from cancer trials which generally conform to these expectations. Correction for zero cells is not necessary [see below] and the method appears to perform better than alternative approaches when events are very rare. It can also be used to combine time-to-event data by pooling log rank observed minus expected [O-E] events and associated variance. However, the Peto method does give biased answers in some circumstances, especially when treatment effects are very large, or where there is a lack of balance in treatment allocation within the individual studies.140 Such conditions will not usually apply to RCTs but may be particularly important when combining the results of observational studies which are often unbalanced.

Although both risk ratios and odds ratios are perfectly valid ways of describing a treatment effect, it is important to note that they are not the same measure, cannot be used interchangeably and should not be confused. When events are relatively rare, say less than 10%,141 differences between the two will be small, but where the event rate is high, differences will be large. For treatments that increase the chance of events, the odds ratio will be larger than the risk ratio and for interventions that reduce the chance of events, the odds ratio will be smaller than the risk ratio. Thus if an odds ratio is misinterpreted as a risk ratio it will lead to an overestimation of the effect of intervention. Unfortunately, this error in interpretation is quite common in published reports of individual studies and systematic reviews. Although some statisticians prefer odds ratios owing to their mathematical properties [they do not have inherent range limitations associated with high baseline rates and naturally arise as the antilog of coefficients in mathematical modelling, making them more suitable for statistical manipulation], they have been criticised for not being well understood by clinicians and patients.142, 143 It may therefore be preferable, even when calculations have been based on odds ratios, to transform the findings to describe results as changes in the more intuitively understandable concept of risk.

Neither the risk ratio nor the odds ratio can be calculated for a trial if there are no events in the control group [as calculation would involve division by zero], and so in this situation it is customary to add 0.5 to each cell of the 2x2 table.137 If there are no events [or all participants experience the event] in both groups, then the trial provides no information about relative probability and so it is omitted from the meta-analysis. These situations are likely to occur when the event of interest is rare, and in such situations the choice of effect measure requires careful thought. A simulation study has shown that when events are rare, most meta-analysis methods give biased estimates of effect,144 and that the Peto odds ratio [which does not require a 0.5 correction] may be the least biased.

Continuous outcomes

Continuous outcomes are those that take any value in a specified range and can theoretically be measured to many decimal places of accuracy, for example, blood pressure or weight. Many other quantitative outcomes are typically treated as continuous data in meta-analysis, including measurement scales. Continuous data are usually summarized as means and presented with an indication of the variation around the mean using the standard deviation [SD] or standard error [SE]. The effect of an intervention on a continuous outcome is measured by the absolute difference between the mean outcome observed for the experimental intervention and control, termed the mean difference [MD]. This estimates the amount by which the treatment changes the outcome on average and is expressed:
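The formula that belongs here follows directly from the definition just given [written out for clarity]:

MD = mean of the experimental group − mean of the control group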

Study mean differences and their associated standard errors can be combined using the generic inverse variance method.

Where studies assess the same outcome but measure it using different scales [for example, different quality of life scales], the individual study results must be standardised before they can be combined. This is done using the standardised mean difference [SMD], which considers the effect size in each study relative to the variability in that study and is calculated as the mean difference divided by the standard deviation of the outcome among participants. Where scales differ in direction of effect [i.e. some increase with increasing severity of outcome whilst others decrease with increasing severity], this needs to be accounted for by assigning negative values to the mean of one set of studies, thereby giving all scales the same direction of measurement. There are three commonly used ways of calculating the effect size for the standardised mean difference: Cohen's d,145 Hedges' adjusted g,145 and Glass's delta.146 The first two differ in whether the standard deviation is adjusted for small sample bias. The third differs from the other two by standardising by the control group standard deviation rather than an average standard deviation across both groups. The standardised mean difference assumes that differences in the standard deviation between studies reflect differences in the measurement scale and not differences between the study populations. The summary intervention effect can be difficult to interpret as it is presented in abstract units of standard deviation rather than on any particular scale.
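A sketch of the three effect size calculations for a single study, using invented summary statistics:

```python
import math

def standardised_mean_differences(mean_exp, sd_exp, n_exp, mean_ctrl, sd_ctrl, n_ctrl):
    """Return Cohen's d, Hedges' adjusted g and Glass's delta for one study."""
    pooled_sd = math.sqrt(((n_exp - 1) * sd_exp ** 2 + (n_ctrl - 1) * sd_ctrl ** 2)
                          / (n_exp + n_ctrl - 2))
    d = (mean_exp - mean_ctrl) / pooled_sd                 # Cohen's d
    j = 1 - 3 / (4 * (n_exp + n_ctrl - 2) - 1)             # small-sample bias correction factor
    g = j * d                                              # Hedges' adjusted g
    delta = (mean_exp - mean_ctrl) / sd_ctrl               # Glass's delta [control-group SD only]
    return d, g, delta

print(standardised_mean_differences(52.0, 10.0, 30, 47.0, 12.0, 30))
```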

Note that in social science meta-analyses, the term ‘effect size’ usually refers to versions of the standardised mean difference.

Time-to-event outcomes

Time-to-event analysis takes account not only of whether an event happens but when it happens. This is especially important in chronic diseases where, even though we may not ultimately be able to stop an event from happening, slowing its occurrence can be beneficial. For example, in cancer studies in adult patients we rarely anticipate cure, but hope that we can significantly prolong survival. Time-to-event data are often referred to as ‘survival’ data since death is often the event of interest, but they can be used for many different types of event such as time free of seizures, time to healing or time to conception. Each study participant has data capturing the event status and the time of that status. An individual may be recorded with a particular elapsed time-to-event, or they may be recorded as not having experienced the event by a particular elapsed time or period of follow-up. When the event has not [yet] been observed, the individual is described as censored, and their event-free time contributes information to the analysis up until the point of censoring.

The most appropriate way to analyse time-to-event data is usually to use survival analysis methods [such as Kaplan-Meier analysis] and to express results as a hazard ratio [HR]. The HR summarises the entire survival experience and describes the overall likelihood of a participant experiencing an event on the experimental intervention compared to control. Meta-analyses that collect individual participant data are able to carry out such an analysis for each included study and then pool these using a variant of the Peto method described above. Alternatively, a modelling approach can be used.

Meta-analyses of aggregate data often treat time-to-event data as dichotomous and carry out analyses using the numbers of individuals who did or did not experience an event by a particular point in time. However, using such dichotomous measures in a meta-analysis of time-to-event outcomes is discarding information and can pose additional problems. If the total number of events reported for each study is used to calculate an odds ratio or risk ratio, this can involve combining studies reported at different stages of maturity, with variable follow-up, resulting in an estimate that is both unreliable and difficult to interpret. This approach is not recommended. Alternatively, ORs or RRs can be calculated at specific points in time. Although this makes estimates comparable, interpretation can still be difficult, particularly if individual studies contribute data at different time points. In this case it is unclear whether any observed difference in effect between time points is attributable to the timing or to the analyses being based on different sets of contributing studies. Furthermore, bias could arise if the time points are subjectively chosen by the researcher or selectively reported by the study author at times of maximal or minimal difference between intervention groups.

A preferable approach is to estimate HRs by using and manipulating published or other summary statistical data or survival curves.147, 148 This approach has also been described in non-technical step-by-step terms.149 Currently, such methods are under-used in meta-analyses,149 which may reflect unfamiliarity with the methods and that study reports do not always include the necessary statistical information150, 151 to allow the methods to be used.
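One of the simplest such manipulations is recovering the log hazard ratio and its standard error from a reported HR and 95% confidence interval, so that the study can contribute to a generic inverse variance meta-analysis. The cited papers describe fuller methods for situations where only survival curves, p-values or event counts are available; the sketch below covers only the simple case:

```python
import math

def log_hr_with_se(hr, ci_lower, ci_upper):
    """ln(HR) and its standard error recovered from a reported hazard ratio and 95% CI."""
    log_hr = math.log(hr)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)
    return log_hr, se

print(log_hr_with_se(0.84, 0.78, 0.92))   # ready for inverse variance pooling
```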

Ordinal outcomes

Outcomes may be presented as ordinal scales, such as pain scales [where individuals rate their pain as none, mild, moderate or severe]. These are sometimes analysed as continuous data, with each category being assigned a numerical value [for example, 0 for none, 1 for mild, 2 for moderate and 3 for severe]. This is usual when there are many categories, as is the case for many psychometric scales such as the Hamilton depression scale or the Mini-Mental State Examination for measuring cognition. However, a mean value may not be meaningful. Thus, an alternative way to analyse ordinal data is to dichotomise them [e.g. none or mild versus moderate or severe] to produce a standard 2x2 table. Methods are available for analysing ordinal data directly, but these typically require expert input.

Counts and rates

When outcomes can be experienced repeatedly they are usually expressed as event counts, for example, the number of asthma attacks. When these represent common events, they are often treated and analysed as continuous data [for example, number of days in hospital] and where they represent uncommon events they are often dichotomised [for example, whether or not each individual had at least one stroke].

When events are rare, analyses usually focus on rates expressed at the group level, such as the number of asthma attacks per person, per month. Although these can be combined as rate ratios using the generic inverse variance method, this is not always appropriate as it assumes a constant risk over time and over individuals, and is not often done in practice. It is important not to treat rate data as dichotomous data because more than one event may have arisen from the same individual.

Presentation of quantitative results

Results should be expressed in formats that are easily understood, and in both relative and absolute terms.

Where possible, results should be shown graphically. The most commonly used graphic is the forest plot [see Box 1.7], which illustrates the effect estimates from individual studies, and the overall summary estimate. It also gives a good visual summary of the review findings, allowing researchers and readers to get a sense of the data. Forest plots provide a simple representation of the precision of individual and overall results and of the variation between study results. They give an ‘at a glance’ identification of any studies with outlying or unusual results and can also indicate whether particular studies are driving the overall results. Forest plots can be used to illustrate results for dichotomous, continuous and time-to-event outcomes.152

Individual study results are shown as boxes centred on their estimate of effect, with extending horizontal lines indicating their confidence intervals. The confidence interval expresses the uncertainty around the point estimate, describing a range of values within which it is reasonably certain that the true effect lies; wider confidence intervals reflect greater uncertainty. Although intervals can be reported for any level of confidence, in most systematic reviews of health interventions, the 95% confidence interval is used. Thus, on the forest plot, studies with wide horizontal lines represent studies with more uncertain results. Different sized boxes may be plotted for each of the individual studies, the area of the box representing the weight that the study takes in the analysis providing a visual representation of the relative contribution that each study makes to the overall effect.

The plot shows a vertical line of equivalence indicating the value where there is no difference between groups. For odds ratios, risk ratios or hazard ratios this line will be drawn at an odds ratio/risk ratio/hazard ratio value of 1.0, while for risk difference and mean difference it will be drawn through zero. Studies reach conventional levels of statistical significance where their confidence intervals do not cross the vertical line. Summary [meta-analytic] results are usually presented as diamonds whose extremities show the confidence interval for the summary estimate. A summary estimate reaches conventional levels of statistical significance if these extremities do not cross the line of no effect. If individual studies are too dissimilar to calculate an overall summary estimate of effect, a forest plot that omits the summary value and diamond can be produced.

Odds ratios, risk ratios and hazard ratios can be plotted on a log-scale to introduce symmetry to the plot. The plot should also incorporate the extracted numerical data for the groups for each study, e.g. the number of events and number of individuals for odds ratios, the mean and standard deviation for continuous outcomes. Other forms of graphical displays have also been proposed.153
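The essential elements of a forest plot can be sketched with a general plotting library. The study values below are invented, and purpose-built meta-analysis software would normally produce the display:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical risk ratios, 95% confidence intervals and percentage weights
labels  = ["Study A", "Study B", "Study C", "Pooled"]
rr      = np.array([0.70, 0.95, 0.80, 0.82])
lower   = np.array([0.50, 0.70, 0.55, 0.70])
upper   = np.array([0.98, 1.29, 1.16, 0.96])
weights = np.array([35.0, 40.0, 25.0])

y = np.arange(len(labels))[::-1]                       # plot studies from top to bottom
fig, ax = plt.subplots()
ax.errorbar(rr, y, xerr=[rr - lower, upper - rr],
            fmt="none", ecolor="black", capsize=3)     # horizontal confidence interval lines
ax.scatter(rr[:3], y[:3], s=weights * 6, marker="s", c="black")   # box area reflects study weight
ax.scatter(rr[3], y[3], s=250, marker="D", c="black")             # summary estimate as a diamond
ax.axvline(1.0, linestyle="--", color="grey")                     # vertical line of no effect
ax.set_xscale("log")                                              # log scale gives symmetry
ax.set_yticks(y)
ax.set_yticklabels(labels)
ax.set_xlabel("Risk ratio [log scale]")
plt.tight_layout()
plt.show()
```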

Box 1.7: Effects of four trials included in a systematic review

a] Presented without meta-analysis

b] Presented with meta-analysis [fixed effect model]

c] Presented with meta-analysis [random-effects model]

Example forest plots taken from a systematic review of endovascular stents for abdominal aortic aneurysm [EVAR].154

Relative and absolute effects

Risk ratios, odds ratios and hazard ratios describe relative effects of one intervention versus another, providing a measure of the overall chance of the event occurring on the experimental intervention compared to control. These relative effects do not provide information on what this comparison means in absolute terms. Although there may be a large relative effect of an intervention, if the absolute risk is small, it may not be clinically significant because the change in absolute terms is minimal [a big percentage of a small amount may still be a small amount]. For example, a risk ratio of 0.8 may represent a 20% relative reduction in events from 50% to 40% or it could represent a 20% relative reduction from 5% to 4% corresponding to absolute differences of 10% and 1% respectively. There may be situations where the former is judged to be clinically significant whilst the latter is not. Meta-analysis should use ratio measures; for example, dichotomous data should be combined as risk ratios or odds ratios and pooling risk differences should be avoided. However, when reporting results it is generally useful to convert relative effects to absolute effects. This can be expressed as either an absolute difference or as a number needed to treat [NNT]. Absolute change is usually expressed as an absolute risk reduction which can be calculated from the underlying risk of experiencing an event if no intervention were given and the observed relative effect as shown in Box 1.8.

Box 1.8: Calculation of absolute risk reduction and number needed to treat from relative risks, odds ratios and hazard ratios

Absolute risk reduction from relative risk

Absolute risk reduction from odds ratio155

Absolute risk reduction from hazard ratio156

ARR = Scontrol^HR – Scontrol, evaluated at the chosen time point

Number needed to treat

Where:

RR = relative risk                  

Scontrol = proportion event free on control treatment     

ARR = absolute risk reduction

HR = hazard ratio
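The conversions in Box 1.8 can be sketched as follows. These are the standard formulations for deriving an absolute risk reduction from a relative effect and an assumed control group risk [or control group survival proportion for the hazard ratio]; the names control_risk and control_survival are introduced here purely for illustration:

```python
def arr_from_rr(control_risk, rr):
    """Absolute risk reduction when the relative effect is a risk ratio."""
    return control_risk * (1 - rr)

def arr_from_or(control_risk, odds_ratio):
    """Absolute risk reduction when the relative effect is an odds ratio."""
    control_odds = control_risk / (1 - control_risk)
    treated_risk = odds_ratio * control_odds / (1 + odds_ratio * control_odds)
    return control_risk - treated_risk

def arr_from_hr(control_survival, hr):
    """Absolute risk reduction at a chosen time point when the effect is a hazard ratio."""
    return control_survival ** hr - control_survival

def nnt(arr):
    """Number needed to treat to prevent one event."""
    return 1 / arr

# Reproduces the antiplatelet/pre-eclampsia example discussed later in this section:
# RR 0.90 applied to underlying risks of 2%, 6% and 18%
for baseline in (0.02, 0.06, 0.18):
    print(round(nnt(arr_from_rr(baseline, 0.90))))   # approximately 500, 167 and 56
```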

Consideration of absolute effects is particularly important when considering how results apply to different types of individuals who may have different underlying prognoses and associated risks. Even if there is no evidence that the relative effects of an intervention vary across different types of individual [see Subgroup analyses, Meta-regression below], if the underlying risks for different categories of individual differ, then the effect of intervention in absolute terms will be different. It is therefore important when reporting results to consider how the absolute effect of an intervention varies for different types of individual and a table expressing results in this way, as shown in Table 1.5, can be useful. The underlying risk for different types of individual can be estimated from the studies included in the meta-analysis, or generally accepted standard estimates can be used. Confidence intervals should be calculated around absolute effects.

Table 1.5: Example table expressing relative effects as absolute effects for individuals with differing underlying prognoses [2-year survival rate; HR = 0.84, 95% CI 0.78–0.92].

                              Baseline     Absolute increase [95% CI]     Change
Age =60                       4%           2% [1% to 4%]                  From 4% to 6%
Histology: AA                 31%          6% [3% to 9%]                  From 31% to 37%
Histology: GBM                9%           4% [2% to 6%]                  From 9% to 13%
Histology: Other              52%          5% [3% to 8%]                  From 52% to 57%
Performance status: Good      22%          6% [3% to 9%]                  From 22% to 28%
Performance status: Poor      9%           4% [2% to 6%]                  From 9% to 13%

Baseline survival and equivalent absolute increases in survival calculated from a meta-analysis of chemotherapy in high-grade glioma.157
AA = anaplastic astrocytoma, GBM = glioblastoma multiforme.

The NNT, which is derived from the absolute risk reduction as shown in Box 1.8, also depends on both the relative effect and the underlying risk. The NNT represents the number of individuals who need to be treated to prevent one event that would be experienced on the control intervention. The lower the number needed to treat, the fewer the patients that need to be treated to prevent one event, and the greater the efficacy of the treatment. For example, a meta-analysis of antiplatelet agents for the prevention of pre-eclampsia found an RR of 0.90 [95% CI 0.84 to 0.97] for pre-eclampsia.158 Plausible underlying risks of 2%, 6% and 18% had associated NNTs of 500 [313 to 1667], 167 [104 to 556] and 56 [35 to 185] respectively.

Sensitivity analyses

Sensitivity analyses explore the robustness of the main meta-analysis results by repeating the analyses after making some changes to the data or methods.159 Analyses run with and without the inclusion of certain trials will assess the degree to which particular studies [perhaps those with poorer methodology] affect the results. For example, analyses might be carried out on all eligible trials, with a sensitivity analysis restricted to only those that used a placebo in the control group. If results differ substantially, the final results will require careful interpretation. However, care must be taken in attributing reasons for differences, especially when a single trial or a small number of trials is included or excluded in the sensitivity analysis, as the studies concerned may differ in ways other than the issue being explored. Some sensitivity analyses should be proposed in the protocol, but as many issues suitable for exploration in sensitivity analyses only come to light whilst the review is being done, and in response to decisions made or difficulties encountered, these may have to change and/or be supplemented.

Exploring heterogeneity

There will inevitably be variation in the observed estimates of effect from the studies included in a meta-analysis. Some of this variation arises by chance alone, reflecting the fact that no study is so large that random error can be removed entirely. Statistical heterogeneity refers to variation other than that which arises by chance. It reflects methodological or clinical differences between studies. Exploring statistical heterogeneity in a meta-analysis aims to tease out the factors contributing to differences, such that sources of heterogeneity can be accounted for and taken into consideration when interpreting results and drawing conclusions.

There is inevitably a degree of clinical diversity between the studies included in a review,160 for example because of differing patient characteristics and differences in interventions. If these factors influence the estimated intervention effect then there will be some statistical heterogeneity between studies. Methodological differences that influence the observed intervention effect will also lead to statistical heterogeneity. For example, combining results from blinded and unblinded studies may lead to statistical heterogeneity, indicating that they might best be analysed separately rather than in combination. Although it manifests itself in the same way, heterogeneity arising from clinical differences is likely to be because of differences in the true intervention effect, whereas heterogeneity arising from differences in methodology is more likely to be because of bias.

An idea of heterogeneity can be obtained straightforwardly by visually examining forest plots for variations in effects. If there is poor overlap between the study confidence intervals, then this generally indicates statistical heterogeneity.

More formally, a chi-squared test [see Box 1.9], often also referred to as the Q-statistic, can assess whether differences between results are compatible with chance alone. However, care must be taken in interpreting the chi-squared test as it has low power; consequently, a larger P value [for example P < 0.10 rather than the conventional P < 0.05] is often used as the threshold for identifying statistical heterogeneity.
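A sketch of the heterogeneity chi-squared [Q] calculation for a fixed-effect inverse variance meta-analysis, using invented study values [scipy is assumed for the chi-squared tail probability]:

```python
import math
from scipy.stats import chi2

# Hypothetical log risk ratios and standard errors from three studies
effects = [math.log(0.70), math.log(0.95), math.log(0.60)]
ses     = [0.20, 0.15, 0.25]

weights = [1 / se ** 2 for se in ses]
pooled  = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

q  = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))   # Cochran's Q statistic
df = len(effects) - 1
p_value = chi2.sf(q, df)                                            # upper tail of the chi-squared distribution
print(q, p_value)   # a small P value suggests variation beyond chance; note the test's low power
```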
