The FDA’s clunky launch of Elsa, an AI tool intended to increase efficiency, has sparked concern from agency employees and outside experts.
On June 2, the FDA launched Elsa, an artificial intelligence large language model agency leadership hopes will streamline operations and speed up the drug approval process. But the agency has not yet provided detailed information as to how the AI’s responses will be used and evaluated, prompting a slew of questions from industry professionals and AI experts.
In a public announcement, FDA Commissioner Marty Makary said that Elsa had launched almost a month ahead of schedule and under budget. According to the announcement, Elsa is “designed to assist with reading, writing and summarizing” and will “accelerate clinical protocol reviews, shorten the time needed for scientific evaluations, and identify high priority inspection targets.” The agency has also said that it will use Elsa to accomplish more complex tasks, including writing code and identifying adverse events.
Experts were surprised both by the speed at which the FDA deployed the tool and by the broad scope of tasks it is supposed to handle.
“The agency is using Elsa to expedite clinical protocol reviews and reduce the overall time to complete scientific reviews,” Makary said in a video announcement. “One scientific reviewer told me what took him two to three days now takes six minutes.”
In an op-ed published in JAMA earlier this month, Makary and Vinay Prasad, director of the FDA’s Center for Biologics Evaluation and Research, outlined five priorities for a “new FDA,” one of which was “unleash[…] AI.” The officials wrote that AI can make a first-pass review of documentation received by the FDA.
Elsa’s launch was sharply criticized by FDA employees, who told STAT News that the implementation was rushed. Some of its responses were incorrect or only partially accurate, according to NBC News. Other STAT sources said that the AI struggled to complete simple tasks and ultimately did not save employees a significant amount of time. One employee told STAT that the FDA has failed to establish guardrails for the tool’s use.
“I heard that it was nowhere near ready to launch,” an FDA insider who was not involved in Elsa’s development told BioSpace.
Elsa’s implementation comes after the Trump administration fired thousands of FDA employees and proposed slashing HHS’s budget by 25%.
Some outside experts are optimistic about AI’s ability to improve efficiency. But many have called for the FDA to provide more details about how the model works and what benchmarks, if any, the agency has for evaluating its output. “It’s a two-edged sword,” Jason Conaty, who specializes in biotechnology regulation as counsel at Hogan Lovells, told BioSpace. “It’s exciting and it’s concerning.”
A Lack of Transparency
Experts say large language models (LLMs) are indeed ideal for certain tasks, like summarizing documents and finding specific pieces of information. “There are a lot of places that LLMs could increase efficiency,” James Zou, an associate professor of biomedical data science at Stanford University, told BioSpace.
But for many tasks, using LLMs comes with risk. Even the most sophisticated of these models can present false or misleading information as fact, Zou said. Whether the FDA has an effective method of evaluating the accuracy of Elsa’s outputs is unclear. The Department of Health and Human Services, of which the FDA is a part, did not respond to BioSpace’s request for comment.
“The things that they say they’re using AI for are quite broad,” Zou said. The FDA’s indication that Elsa will be used to identify adverse events, for example, is potentially high-risk. “Even when it’s summarizing documents, the models could still hallucinate,” according to Zou.
The FDA has emphasized keeping a “human in the loop” as a crucial safeguard to ensure the accuracy and reliability of AI-generated content. Human reviewers are meant to verify citations and confirm that the information comes from trustworthy sources.
But according to Adam Rodman, assistant professor of medicine at Harvard Medical School, humans are far from reliable at catching erroneous information produced by LLMs. “People have a tendency to trust AI systems,” Rodman told BioSpace. “It’s one of these things where it sounds just intuitive that having a human review everything will work, and what the literature has generally suggested is it’s not that easy.”
The FDA’s adoption of Elsa is part of a widespread trend of adopting AI models to automate certain tasks. “We’re seeing this now across industries,” Rodman said. “What they’ve run into is the same problem that every other field is running into, which is: how do you know how well it works?”
According to STAT, Elsa is based on Anthropic’s Claude LLM and is being developed by consulting firm Deloitte. The tool reportedly uses retrieval-augmented generation (RAG), a framework that lets an AI system pull information from outside its training data and incorporate it into its answers. While this type of approach decreases errors, Rodman said, hallucinations can still happen, and with more complex tasks, “RAG might lower the overall rate, but make those remaining hallucinations harder to spot.”
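To see why retrieval alone does not eliminate hallucinations, it helps to look at the general shape of a RAG pipeline. The sketch below is purely illustrative: the document store, word-overlap scoring and prompt format are hypothetical stand-ins, not the FDA’s or Anthropic’s implementation. Retrieval grounds the prompt in source documents, but the model that writes the final answer can still misstate what those documents say.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) pipeline.
# Illustrative only: toy documents and word-overlap scoring stand in for a
# real document store and embedding model; this is not Elsa's implementation.
from collections import Counter
import math

# Hypothetical internal documents the assistant is allowed to consult.
DOCUMENTS = [
    "Protocol 12 enrollment criteria require participants aged 18 to 65.",
    "Adverse event reports for Drug X list headache and nausea as most common.",
    "The inspection schedule prioritizes sites with prior compliance findings.",
]

def score(query: str, doc: str) -> float:
    """Cosine similarity over raw word counts -- a stand-in for real embeddings."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return overlap / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Pull the k most relevant documents from outside the model's training data."""
    return sorted(DOCUMENTS, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Ground the answer in retrieved text; the LLM call itself is omitted here."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("What adverse events were reported for Drug X?"))
```

The retrieval step narrows what the model sees, which is why error rates tend to drop; the generation step is still a language model, which is why they do not drop to zero.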
“We don’t know how exactly it’s going to be used,” Zou said. “I think it would also be useful to have more information around how this human-AI oversight and validation will happen.”
Sensitive Information and Legal Gray Areas
The FDA has not clarified how Elsa is being employed in scientific review, and using AI to make any key regulatory decisions opens up a host of legal questions. It is also unclear how, exactly, the FDA will keep the tool from accessing proprietary information and trade secrets, Conaty said.
According to the FDA’s announcement, Elsa is not trained on data submitted by the regulated industry. Experts note that it is possible for an LLM to evaluate new data without that data being incorporated into the model itself. The announcement also said that Elsa was built within a high-security GovCloud environment and offers a “secure platform for FDA employees to access internal documents while ensuring all information remains within the agency.”
It’s also not clear how the AI will fit into the regulatory appeals process. In particular, questions remain about what will happen if an FDA decision is challenged in court. Normally if this happens, “you get the compilation of the administrative record,” Conaty said. But if artificial intelligence is used to make decisions at any point, it might be impossible to reconstruct how those decisions were made, given how little visibility there is into the way these models reach their outputs.
“The agency’s mission is to ensure the safety and efficacy of the nation’s supply of new drugs,” Conaty continued. “And perhaps those guardrails are in place, and perhaps there are humans in the loop at all critical inflection points. You would hope so.”
There are steps that the FDA, and any company or agency using AI, can take to evaluate and improve AI-based tools. That starts with training officials to work with these models and to spot potential errors, Zou said.
Rodman said that to test the efficacy and accuracy of AI tools, the agency would need to create benchmarks to assess how well models behave compared to human users. That would mean creating meaningful targets for the models to strive for and methodically comparing their decisions with those of humans in order to spot sources of bias and error.
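One concrete form that kind of comparison could take, offered here purely as an illustration rather than a description of any FDA process, is to run the model and human reviewers over the same items and tally where they diverge. The sketch below assumes a hypothetical binary labeling task; the labels are invented for demonstration.

```python
# Illustrative sketch of benchmarking a model against human reviewers on the
# same items. Hypothetical labels, not FDA data or any agency's actual process.
def agreement_report(model_labels: list[str], human_labels: list[str]) -> dict:
    """Summarize how often the model matches human reviewers and where it diverges."""
    assert len(model_labels) == len(human_labels)
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    disagreements = [
        (i, m, h) for i, (m, h) in enumerate(zip(model_labels, human_labels)) if m != h
    ]
    return {
        "n": len(model_labels),
        "agreement_rate": matches / len(model_labels),
        # Each disagreement is a candidate source of bias or error to review.
        "disagreements": disagreements,
    }

# Hypothetical example: did each submission get flagged for an adverse event?
model = ["flag", "no_flag", "flag", "no_flag", "flag"]
human = ["flag", "no_flag", "no_flag", "no_flag", "flag"]
print(agreement_report(model, human))
```

The point of such a harness is less the headline agreement rate than the list of disagreements, which shows reviewers where the model's judgment departs from theirs.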
“FDA is the leading authority in evaluating medical devices and AI systems,” Zou said. “It makes sense that when the FDA is using its own systems, there should be transparency about how they’re reviewing or evaluating internal tools.”