Graph Gurus 37: Combining Natural Language Processing (NLP) with a Graph Database for COVID-19 Dataset

Recorded June 10, 2020 

Over 80% of the world’s data is unstructured, meaning it is stored as documents rather than as rows and columns in a relational database. This presents a challenge: how do we process this rich but unstructured data to get meaningful content and insights out of it?

Because coronavirus literature is growing so rapidly, members of the medical community are having difficulty identifying which articles or papers would be most useful for their research. Thanks to a request made by The White House Office of Science and Technology Policy, we now have free access to the COVID-19 Open Research Dataset on Kaggle, the most extensive machine-readable coronavirus literature collection available for data and text mining to date.

You can help by connecting insights across the body of literature and finding solutions for the research community, and TigerGraph can help you get started.

In this Graph Gurus Episode, we: 

  • Learn how to process text and extract entities (words and phrases) as well as classes linking the entities using SciSpacy, a Natural Language Processing (NLP) tool. 
  • Import the output of NLP and semantically link it in TigerGraph.
  • Run advanced analytics queries with TigerGraph to analyze the relationships and deliver insights 
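
The three steps above can be sketched in plain Python to show the shape of the data at each stage. This is only an illustration under stated assumptions, not the code used in the episode: the tiny VOCAB dictionary is a hypothetical stand-in for the SciSpacy entity recognizer, the two sample abstracts are made up, and the final grouping step is done in memory, whereas the episode runs its analytics as queries inside TigerGraph. The vertex and edge layout (Entity vertices, paper-to-entity "mentions" edges) is likewise an assumed schema.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical mini-vocabulary standing in for an NLP model's entity
# recognizer; a real pipeline would get entities and classes from the model.
VOCAB = {
    "coronavirus": "DISEASE",
    "ace2": "GENE",
    "remdesivir": "CHEMICAL",
}


def extract_entities(text):
    """Step 1: return sorted, unique (entity, class) pairs found in text."""
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    return sorted({(tok, VOCAB[tok]) for tok in tokens if tok in VOCAB})


def build_graph_rows(papers):
    """Step 2: turn extraction output into vertex rows and edge rows.

    papers maps a paper id to its text. Returns (entity_vertices,
    mention_edges), where each edge links a paper id to an entity name.
    """
    entity_vertices = set()
    mention_edges = []
    for paper_id, text in papers.items():
        for name, cls in extract_entities(text):
            entity_vertices.add((name, cls))
            mention_edges.append((paper_id, name))
    return sorted(entity_vertices), mention_edges


def to_csv(rows, header):
    """Serialize rows as CSV text, a common input format for bulk loading."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def papers_by_entity(mention_edges):
    """Step 3 (in-memory stand-in for a graph query): group papers by entity."""
    index = defaultdict(list)
    for paper_id, name in mention_edges:
        index[name].append(paper_id)
    return dict(index)


# Made-up sample abstracts, for illustration only.
papers = {
    "p1": "Coronavirus enters cells through ACE2.",
    "p2": "Remdesivir was evaluated against coronavirus.",
}

entities, mentions = build_graph_rows(papers)
entity_csv = to_csv(entities, ["name", "class"])
mention_csv = to_csv(mentions, ["paper_id", "entity"])
shared = papers_by_entity(mentions)
```

Here `shared["coronavirus"]` lists both sample papers, which is the kind of "which papers are linked through the same entity" relationship that a graph query would surface at scale.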

In addition to COVID-19 related research papers and literature, this approach can be applied to any unstructured content, such as call center notes, marketing and technical literature, research papers, and more. Tools covered in this episode include Google Colab (a Python environment), SciSpacy (an NLP tool), and TigerGraph (a native parallel graph database).


  • Jonathan Herke, Developer Relations 

Supporting Blogs: