More Endpoints, More Problems: FDA Offers Advice for Multi-endpoint Trials

It makes sense that trialists want their studies to answer multiple questions, but for findings to hold water, proper design is key.

The US Food and Drug Administration has issued new guidance for industry on designing and interpreting clinical trials that use multiple endpoints, with the aim of avoiding the types of mistakes that can lead to incorrect conclusions about a drug’s effectiveness.

Making several comparisons within a clinical trial with multiple endpoints increases the risk of misinterpreting the findings, but there are statistical ways to overcome the problem, according to the agency.

“Clinical trials are complicated,” John Lawrence, PhD, a statistician at the FDA’s Center for Drug Evaluation and Research, said in a podcast accompanying the guidance document. “Sponsors are commonly using multiple endpoints to measure many different things to assess the effectiveness of a drug candidate and for the potential inclusion of important information in the label. When more than one endpoint is analyzed in a single trial, the likelihood of making a false conclusion about a drug’s effects can increase due to an effect called multiplicity.”

The new guidance reflects the agency’s thinking about the analysis, interpretation, and management of issues related to multiple endpoints and is intended to provide consistent advice for industry, as well as create a resource reviewers can provide to sponsors. The guidance applies only to studies testing drugs or biological products, as there are different recommendations on the use of statistics, including Bayesian analyses, in medical device clinical trials.

Sanjay Kaul, MD (Cedars-Sinai Medical Center, Los Angeles, CA), an expert in cardiovascular epidemiology, said that when the FDA makes regulatory decisions, it focuses first on ensuring that endpoints were prespecified, because it wants to avoid “fishing expeditions” after a trial is completed. Second, the agency requires a robust statistical analysis that is reproducible and, as a result, actionable for decision-making. Finally, the agency wants to avoid type 1 errors, or false-positive results.

“The FDA wants to make sure that if they’re going to approve a drug and it’s going to be widely adopted in the community, it better have some effect,” Kaul told TCTMD. “Otherwise, if it doesn’t have a true effect and it’s unsafe, then they’ve done a public disservice.”

Historically, CVD researchers have occasionally run into problems when using multiple endpoints, said Kaul. For example, in the 1992 SOLVD Prevention trial of enalapril versus placebo in patients with asymptomatic left ventricular dysfunction, the drug did not reduce the primary endpoint of all-cause mortality but did reduce several secondary endpoints, including heart failure hospitalizations. Despite the miss on the primary outcome, the FDA approved enalapril for the prevention of heart failure.

The approval of carvedilol in congestive heart failure is another example. In the 1990s, three clinical trials failed to show any benefit on the primary endpoint of change in exercise tolerance, even though other endpoints were positive. Later, data from the US clinical trial program, which included these initial studies plus another, showed carvedilol reduced mortality by 65%, although that endpoint was not prespecified, nor was it the primary endpoint in any of the included studies. At the drug’s first FDA advisory committee review, carvedilol was rejected “because statisticians dominated the advisory panel,” said Kaul. The drug came up for a second review and was approved because clinicians held greater sway.

“These sorts of inconsistencies are what the FDA wants to avoid,” said Kaul. “They want to have a uniform, standardized, and consistent interpretation of the data.”

James Brophy, MD, PhD (McGill University, Montreal, Canada), whose work also spans cardiovascular epidemiology and outcomes research, said that use of multiple endpoints in clinical trials is entirely valid, but while trialists should be looking to exploit them for maximum information, “they need to be exploited properly.” The FDA’s new recommendations, he told TCTMD, are “quite good” in how they explain when and where to be careful when using multiple endpoints and what needs to be done to avoid false-positive results.

“Often, in a study, there can be a number of outcomes that are of great interest,” Brophy told TCTMD. “It depends on the research question, obviously, but in general, the use of multiple endpoints is necessary and something of merit. You have a lot of cost to get a trial off the ground in terms of infrastructure so you want to maximize the amount of information you can get out of it. It makes sense in most cases to consider multiple outcomes—it’s just how you do the interpretation to avoid falling into any traps where you think an effect is real, but it’s really nothing more than a play of chance.”

Mistakes Add Up

According to the FDA, endpoints are grouped together either to establish effectiveness in support of a drug’s approval or to demonstrate additional meaningful effects. A study might have multiple primary endpoints, a situation in which a positive result on any one of them could establish the drug’s efficacy. This gives the drug several paths to success, but failing to account for multiplicity can lead to a type 1 error, in which the drug is falsely concluded to be effective.

“Multiplicity can occur because there is a chance of making a mistake during the assessment of each endpoint in the trial and these chances of making a mistake can add up when assessing multiple endpoints if the appropriate statistical adjustments are not made,” said Lawrence. In fact, assuming an error rate of 0.05 for two-sided testing, the type 1 error rate nearly doubles to roughly 10% when two independent endpoints are tested and climbs to roughly 14% when three independent endpoints are assessed.
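Under the simplifying assumption that the endpoints are independent and each is tested at α = 0.05, that inflation follows directly from 1 − (1 − α)^k. A quick sketch (the function name here is ours, for illustration):

```python
# Family-wise type 1 error rate for k independent endpoints, each tested
# at significance level alpha (independence is a simplifying assumption).
def familywise_error_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 2, 3):
    # k=1 -> 0.05, k=2 -> 0.0975 (nearly double), k=3 -> 0.1426
    print(k, round(familywise_error_rate(k), 4))
```

Real trial endpoints are usually correlated, so the true inflation sits somewhere below this independent-endpoint worst case, but it grows with every unadjusted comparison.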

One of the most common methods to adjust for multiplicity is the Bonferroni method, although there are others, according to the FDA.
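As a minimal sketch of how the Bonferroni method works, each of k endpoints is simply tested at α/k rather than α (the endpoint names and p values below are hypothetical, not from any trial discussed here):

```python
# Bonferroni adjustment: with k endpoints, test each at alpha / k so the
# family-wise error rate stays at or below alpha.
# Endpoint names and p values are hypothetical.
alpha = 0.05
p_values = {"endpoint_a": 0.012, "endpoint_b": 0.030, "endpoint_c": 0.049}

threshold = alpha / len(p_values)  # 0.05 / 3 ~= 0.0167
for name, p in p_values.items():
    verdict = "significant" if p <= threshold else "not significant"
    print(f"{name}: p={p} -> {verdict} at adjusted threshold {threshold:.4f}")
```

Note that endpoint_b and endpoint_c, which would clear an unadjusted 0.05 threshold, do not survive the adjustment; that conservatism is the price Bonferroni pays for controlling the family-wise error rate.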

Study investigators might choose to go with co-primary endpoints, a tactic that has a lower risk of false conclusions resulting from multiplicity because “there is only one path that leads to a successful outcome for the trial,” according to the FDA. However, co-primary endpoint testing does increase the risk of a type 2 error—the failure to show an effect when there is one—and this approach “should be carefully considered because of the loss of [statistical] power.” Studies can also use a primary composite endpoint, as is done in many CVD studies, and this too avoids issues related to multiplicity.
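The power cost the FDA warns about is easy to see under a simplifying assumption of independent endpoints: a trial with co-primary endpoints succeeds only if every endpoint succeeds, so the joint power is the product of the individual powers. A rough sketch:

```python
# Joint power when ALL co-primary endpoints must show an effect, assuming
# the endpoint tests are independent (correlated endpoints lose somewhat
# less power than this product suggests).
from math import prod

def joint_power(individual_powers):
    return prod(individual_powers)

# Two co-primary endpoints, each powered at 90%:
print(round(joint_power([0.90, 0.90]), 2))  # 0.81, i.e., 81% overall power
```

With three such endpoints the overall power falls to about 73%, which is why sample sizes often have to grow substantially to support a co-primary design.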

Secondary endpoints are typically selected to support the primary endpoint, or they may be used to show evidence of a benefit distinct from it. The FDA says that once an effect on the primary endpoint is shown, secondary endpoints can be formally tested, but in general it “may be best to limit the number.” With multiplicity adjustments, showing an effect on any given secondary endpoint becomes harder as the number of secondary endpoints grows. In the podcast, Lawrence noted that the FDA allows additional claims about a drug’s effectiveness to be based on secondary endpoints, whereas the European Medicines Agency does not.
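One common way to formalize this “primary first, then secondaries” rule is fixed-sequence (hierarchical) testing: endpoints are tested in a prespecified order, each at the full α, and formal testing stops at the first nonsignificant result. A sketch with hypothetical p values:

```python
# Fixed-sequence (hierarchical) testing: endpoints are tested in a
# prespecified order at the full alpha; the first failure stops all
# further formal testing. The p values below are hypothetical.
def fixed_sequence(ordered_endpoints, alpha=0.05):
    successes = []
    for name, p in ordered_endpoints:
        if p > alpha:
            break  # everything after this point goes formally untested
        successes.append(name)
    return successes

hierarchy = [
    ("primary", 0.010),
    ("secondary_1", 0.040),
    ("secondary_2", 0.200),  # fails here...
    ("secondary_3", 0.001),  # ...so this one is never formally tested
]
print(fixed_sequence(hierarchy))  # ['primary', 'secondary_1']
```

The order therefore matters enormously: a strong result low in the hierarchy, like secondary_3 above, cannot be claimed if an endpoint ahead of it misses.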

“Failure to show an effect on the secondary endpoints does not doom a candidate treatment,” he said. “However, if an effect is demonstrated, it would be valuable for a sponsor to include that information in labeling.”

Kaul said the agency sets a higher bar for minimizing type 1 error than for type 2 error, because a type 1 error could result in a drug coming to market when it doesn’t work. A type 2 error is an error of omission, where a drug works but the trial fails to show a benefit, and that causes less harm to public health, he said.

The New England Journal of Medicine issued its own guidelines for statistical analyses in 2019, said Kaul. Like the FDA, the journal takes strict steps to avoid type 1 error and won’t allow researchers to report P values when neither the study protocol nor the statistical analysis plan prespecified methods to adjust for multiplicity. Additionally, no P value for a secondary or tertiary endpoint can be reported if the primary endpoint is not met.

“Not every journal agrees with the New England Journal of Medicine,” said Kaul. “They’ve been accused of taking statistical orthodoxy to the extreme.”

Whatever Happened to Replication?

Recently, the device-based PROTECTED TAVR study came under scrutiny after researchers and sponsors put too much emphasis on a positive secondary endpoint (disabling stroke) despite the trial not showing a benefit for the primary endpoint of all stroke at 72 hours. Brophy said that when investigators, along with others in the medical community, have many reasons to think a drug or device should work, it’s easy to get caught up in the enthusiasm.

“A little bit of bad data isn’t enough to slow us down,” he said. “I’m not saying there’s anything deliberate, but these sort of cognitive biases can easily creep in.”

While it’s laudable for the FDA to emphasize statistical adjustments to avoid type 1 errors, Brophy said the agency has largely forgone one of the major safeguards against false-positive outcomes: a second, confirmatory study.

“Replication is the hallmark and cornerstone of science,” he said. “Historically, there was an unwritten rule that you needed to have two randomized, controlled trials to speak to efficacy before it was accepted. Now, it’s one, under the guise that trials are bigger, and that they’re expensive, so that’s why we can’t really ask for two studies. This is problematic.”

Large outcome trials may be stopped early for efficacy, which can lead to an overestimation of the treatment effect, said Brophy. Accelerated approval based on a single randomized trial stopped before completion “is as much of a concern, if not more of a concern, than the issue of multiple testing in terms of false-positive results,” he said, citing a controversial drug intended to reduce the risk of premature labor. The drug, Makena (AMAG Pharmaceuticals), received accelerated FDA approval in 2011, but a follow-up study that included three times as many women as the first showed it was ineffective. An FDA advisory panel voted last week to take the drug off the market.

“We’ve lowered the bar for approval,” said Brophy. “It’s a big recipe for false positive results.”

Kaul said that while the FDA once required two positive trials before approval could be granted, the agency has shifted to an approach in which it will approve a drug based on a single positive trial as long as it’s “statistically persuasive.” That means a highly significant P value, usually less than 0.001. The original criteria required two trials, each with a P value less than 0.05.
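Those thresholds line up with a back-of-envelope rationale sometimes cited for the two-trial rule (our arithmetic, not the FDA’s formal reasoning): two independent trials each significant at two-sided 0.05 correspond to one-sided 0.025 apiece, so the chance that both are false positives is 0.025², on the order of the p < 0.001 bar asked of a single trial:

```python
# Back-of-envelope arithmetic for the two-trials-vs-one-trial thresholds
# (a commonly cited rationale, not a formal regulatory derivation).
two_sided_alpha = 0.05
one_sided_alpha = two_sided_alpha / 2  # 0.025 per trial
combined = one_sided_alpha ** 2        # both independent trials false positive
print(combined)  # ~0.000625, roughly the single-trial p < 0.001 bar
```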

Michael O’Riordan is the Managing Editor for TCTMD.

Disclosures
  • Brophy and Kaul report no conflicts of interest.
