Microsoft, White House, and Allen Institute release coronavirus data set for medical and NLP researchers
The COVID-19 Open Research Dataset (CORD-19), a repository of more than 29,000 scholarly articles about coronavirus family viruses from around the world, is being released today for free. The result of work by Microsoft Research, the Allen Institute for AI, the National Library of Medicine at the National Institutes of Health (NIH), the White House Office of Science and Technology Policy (OSTP), and others, the data set includes machine-readable versions of more than 13,000 scholarly articles to empower the medical and machine learning research communities to mine text data for insights that can help fight the coronavirus.
“The White House worked with the National Academies of Sciences, Engineering, and Medicine, and the World Health Organization to identify dozens of high priority scientific questions related to COVID-19 to inform the call to action,” White House CTO Michael Kratsios said today in a teleconference. “Artificial intelligence can be incredibly powerful to help scientists summarize and analyze the information.”
The corpus of data comes along with a call to action for AI researchers to create data and text mining techniques to assist medical researchers. Increased data sharing and collaboration among scientific professionals may play a role in combating the COVID-19 coronavirus.
“Our goal in creating this open data set and [Kaggle] Q&A challenge for coronavirus is to stimulate the AI community to create tools that can help scientists stay on top of thousands of articles to enable them to develop approaches to addressing the COVID-19 pandemic,” Microsoft chief scientific officer Eric Horvitz said during the call. A Microsoft tool was used to perform worldwide indexing and mapping of scholarly articles. “With a million new publications being published each year across all of biomedicine, AI will grow in importance as a critical companion to scientists.”
Text mining can enable researchers to evaluate hypotheses, create research plans, understand seminal works, or do things like create question-answering bots. As part of the news today, the Allen Institute’s Semantic Scholar will deploy an adaptive feed of existing coronavirus-related research.
“By interacting with the feed, you train it to understand your interests and what relevance means to you. So while the feed might start with initially kind of the top papers on coronavirus depending on what papers you interact with and what you find useful and not useful, it will learn your preferences and so each scholar would get somewhat different ordering of papers because their interest in the problem is different,” Semantic Scholar general manager Doug Raymond told VentureBeat in a phone interview.
Semantic Scholar’s personalized adaptive feed is based on work the Allen Institute has done on NLP tools like the ELMo language model and the AllenNLP library to understand relationships between paper content. Machine learning experts speaking with VentureBeat said that Transformer-based advances in text generation and NLP are among the most significant developments of 2019, with more ahead in 2020.
“It’s because we’ve had significant advances in NLP in the last couple years, the utility of a data set like this [will] likely be greater than it was a few years ago, because there’s more readily available tools,” Raymond said.
Allen Institute for AI director Oren Etzioni said AI can help accelerate progress and unearth answers to questions but stressed that AI will augment humans and will not solve the problem on its own.
Multiple organizations are using NLP to fight coronavirus. Harvard Medical School developed a tool to review data like patient records, social media, and public health data. BlueDot, a company that uses tools like NLP to scour news articles, public health data, and other sources, reportedly spotted the beginning of the coronavirus outbreak before the World Health Organization sounded the alarm. In China, Alibaba’s Damo Academy is applying its state-of-the-art NLP to text analysis of medical records and to epidemiological investigation by China CDC officials. Last week, the academy’s StructBERT model was named the top performing NLP system in the world on the GLUE benchmark leaderboard.
Websites like PubMed and Microsoft’s Academic Graph now have COVID-19 resource pages for medical researchers to browse. Partnerships with repositories of published literature and preprints like arXiv and medRxiv will help keep the data set up to date. The Chan Zuckerberg Initiative and Georgetown University’s Center for Security and Emerging Technology also joined the effort to supply researchers with knowledge. The effort coalesced in the past week, and the questions most in need of answers will be listed on the Kaggle website, White House deputy CTO Lynne Parker said today.
As part of a five-year research collaboration initiative, Harvard Medical School and the Guangzhou Institute will share $115 million in research funding provided by China Evergrande Group. Work at the Guangzhou Institute will be led by Zhong Nanshan, who currently acts as head of the Chinese 2019-nCoV Expert Taskforce and director-general of China State Key Laboratory of Respiratory Diseases.
Other forms of AI being applied to combat coronavirus include disinfecting robots and deep learning for predicting mortality rates and detecting coronavirus from CT scan imagery. Governments around the world have also turned to tech like GPS tracking, self-screening apps, text alerts, and tracking movement with smartphones. Other initiatives underway include an antibody discovery initiative between AbCellera and DARPA’s Pandemic Prevention Platform program and Autonomous Diagnostics to Enable Prevention and Therapeutics (ADEPT) that’s designed to stop disease outbreaks within 60 days.
The news of the open data set comes a week after White House CTO Michael Kratsios first shared a demo of the research repository during a teleconference with tech giants like Apple, Amazon, Facebook, Google, Microsoft, and Twitter about ways to fight coronavirus using artificial intelligence and data collected by tech companies.
Few details were shared about the teleconference, but the White House said government and businesses discussed creating new tech tools and information sharing. Anonymous sources told the Washington Post that an Amazon employee offered the company’s cloud reporting services for tracking travelers. VentureBeat reached out to Amazon for more details but did not hear back. As the number of COVID-19 cases in the United States continues to rise, President Trump has repeatedly been criticized for spreading misinformation.
Shortly after declaring a national emergency to rush federal funding to stop the spread of coronavirus last Friday, President Trump, Vice President Pence, and other administration officials said Google is creating a website, seemingly promising broad coverage. However, Google said in a statement that Alphabet subsidiary Verily is developing the site as part of its Project Baseline, but at launch it will be available in only two locations in the San Francisco Bay Area. Use of the site requires a Google account.
On Sunday, Google CEO Sundar Pichai announced that the company is now working with the government to create a website to help people self-screen when they are wondering whether they should seek medical attention.