How to use scispaCy Entity Linkers for Biomedical Named Entities
This is a sequel to my previous tutorial. The scispaCy library has been updated and now supports five knowledge bases for entity linking.
Resources:
- Project GitHub Repository
- Notebook with uncleared output (to help you know if you're getting the right output)
- Notebook with cleared output
Introduction
Now that you have extracted biomedical named entities, how can you get more out of them? You can link them to one or more knowledge bases. This is exactly what the scispaCy entity linkers do.
Which knowledge base to link to depends on the nature of the extracted entities and the question you are trying to answer. It is worth reading about each knowledge base to get a good grasp of what it makes possible. The name of each knowledge base gives an idea of the kind of information it holds.
Available Knowledge Bases
Previous versions supported only one knowledge base, but according to the library documentation for v0.2.5, five knowledge bases are now supported:
- UMLS - Unified Medical Language System
- MeSH - Medical Subject Headings
- RxNorm - RxNorm
- GO - Gene Ontology
- HPO - Human Phenotype Ontology
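Each knowledge base is selected by a short name. Here is a minimal sketch of how that selection might look. It assumes a recent scispaCy release built on spaCy 3, where the linker is registered as the `scispacy_linker` pipe; older releases constructed `EntityLinker` directly instead. `KNOWLEDGE_BASES`, `linker_config` and `add_linker` are illustrative names, not part of the library.

```python
# Short names accepted by the scispaCy linker, mapped to the full names above.
KNOWLEDGE_BASES = {
    "umls": "Unified Medical Language System",
    "mesh": "Medical Subject Headings",
    "rxnorm": "RxNorm",
    "go": "Gene Ontology",
    "hpo": "Human Phenotype Ontology",
}


def linker_config(kb_name, resolve_abbreviations=True):
    """Build the config dict passed to the "scispacy_linker" pipe."""
    key = kb_name.lower()
    if key not in KNOWLEDGE_BASES:
        raise ValueError(f"Unknown knowledge base: {kb_name!r}")
    return {"resolve_abbreviations": resolve_abbreviations, "linker_name": key}


def add_linker(nlp, kb_name):
    """Attach a linker for one knowledge base to a loaded spaCy pipeline."""
    # Importing EntityLinker registers the "scispacy_linker" factory with spaCy.
    from scispacy.linking import EntityLinker  # noqa: F401

    nlp.add_pipe("scispacy_linker", config=linker_config(kb_name))
    return nlp.get_pipe("scispacy_linker")
```

With a model loaded through `spacy.load(...)`, calling `add_linker(nlp, "mesh")` would attach the MeSH linker. The first call for a given knowledge base downloads its index, which can take a while.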
Tutorial Overview
In this tutorial, before linking entities to the available knowledge bases, biomedical entities will be extracted from the sample text using the 4 available scispaCy NER models. To read about what each NER model specializes in, click this link.
All identified entities will then be passed through the different knowledge bases.
Extracting Named Entities
Sample text link: https://www.ncbi.nlm.nih.gov/books/NBK92477/
Any text sample can be used; the data can also be a series of text files.
The images below show the code and output for named entity extraction using different models. As expected, the types of entities recognized and extracted depend on the model used.
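For readers who prefer text to screenshots, the extraction step might be sketched like this. It assumes the four scispaCy NER models are installed under their published names. The helper names, the `load` parameter (there only so the function can be tried with a stub) and the deduplication of repeated entities are my assumptions, not necessarily what the notebook does.

```python
# The four scispaCy NER models; each is installed separately
# from the model URLs listed in the scispaCy README.
NER_MODELS = (
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
)


def extract_entities(text, model_name, load=None):
    """Return unique entity rows (text, label, model) found by one model."""
    if load is None:
        import spacy  # imported lazily so the helper can be tried with a stub

        load = spacy.load
    doc = load(model_name)(text)
    seen, rows = set(), []
    for ent in doc.ents:
        key = (ent.text, ent.label_)
        if key not in seen:  # keep the first occurrence of each entity/label pair
            seen.add(key)
            rows.append({"entity": ent.text, "label": ent.label_, "model": model_name})
    return rows


def extract_with_all_models(text, models=NER_MODELS, load=None):
    """Run every model over the same text and concatenate the rows."""
    rows = []
    for name in models:
        rows.extend(extract_entities(text, name, load=load))
    return rows
```

The list of dicts returned by `extract_with_all_models(sample_text)` can be passed straight to `pandas.DataFrame(...)` to get one row per entity, with the model that found it.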
Entity Linking Function
A total of 422 biomedical named entities were extracted from the sample corpus using the 4 scispaCy NER models. The function below is a general-purpose function for linking biomedical entities to the scispaCy knowledge bases.
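A function along these lines might look as follows. Treat it as a sketch rather than the notebook's exact code: it assumes a spaCy 3 based scispaCy release, the `en_core_sci_sm` model as a default, and a `top_n` cut-off of my own choosing. `top_candidates` is a hypothetical helper; `kb_ents`, `cui_to_entity`, `canonical_name` and `definition` are the attributes scispaCy exposes on linked entities.

```python
def top_candidates(ent, linker, top_n=2):
    """Turn one linked entity's candidates into plain dicts, best match first."""
    rows = []
    # ent._.kb_ents holds (concept_id, score) pairs sorted by descending score.
    for concept_id, score in ent._.kb_ents[:top_n]:
        concept = linker.kb.cui_to_entity[concept_id]
        rows.append(
            {
                "concept_id": concept_id,
                "name": concept.canonical_name,
                "score": round(float(score), 3),
                "definition": concept.definition,  # may be None for some concepts
            }
        )
    return rows


def entity_linker(text, kb_name, model_name="en_core_sci_sm", top_n=2):
    """Link every entity found in `text` to the knowledge base `kb_name`."""
    import spacy
    from scispacy.linking import EntityLinker  # noqa: F401  (registers the pipe)

    # Loading the model and the linker on every call is simple but slow.
    nlp = spacy.load(model_name)
    nlp.add_pipe(
        "scispacy_linker",
        config={"resolve_abbreviations": True, "linker_name": kb_name},
    )
    linker = nlp.get_pipe("scispacy_linker")
    return {ent.text: top_candidates(ent, linker, top_n) for ent in nlp(text).ents}
```

Calling `entity_linker(some_text, "umls")` would return a dict keyed by entity text, each value being a short list of candidate concepts.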
Knowledge Base Comparison
One of the goals of this tutorial is to show how different knowledge bases can return different links for the same entity, depending on the type of data each knowledge base is designed for. Here, the same entity is passed through four knowledge bases, and they return different concepts, matching scores, and definitions.
The entity_linker function was tested with 4 of the scispaCy knowledge bases: "umls", "mesh", "go", "hpo". For each knowledge base, the function returns 2 matching entries and their scores.
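One way to lay such a comparison out side by side is sketched below. `link_fn` is a hypothetical callable: anything that takes an entity string and a knowledge base name and returns `(concept_id, canonical_name, score)` tuples, best match first, will do. The function and column names are mine.

```python
KB_NAMES = ("umls", "mesh", "go", "hpo")


def compare_knowledge_bases(entity_text, link_fn, kb_names=KB_NAMES, top_n=2):
    """Collect the top candidates for one entity from each knowledge base."""
    rows = []
    for kb in kb_names:
        candidates = link_fn(entity_text, kb)[:top_n]
        if not candidates:
            # A knowledge base may hold no concept for this kind of entity.
            rows.append((kb, None, None, None))
        for concept_id, name, score in candidates:
            rows.append((kb, concept_id, name, score))
    return rows


def format_comparison(rows):
    """Render comparison rows as aligned plain-text lines."""
    lines = [f"{'kb':<6} {'concept_id':<12} {'score':<6} name"]
    for kb, concept_id, name, score in rows:
        score_text = "-" if score is None else f"{score:.2f}"
        lines.append(f"{kb:<6} {concept_id or '-':<12} {score_text:<6} {name or '-'}")
    return "\n".join(lines)
```

Printing `format_comparison(compare_knowledge_bases(entity, link_fn))` makes it easy to spot knowledge bases that return nothing for an entity, which is itself a useful signal about what each one covers.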
Applying to DataFrames
To apply the entity linker to every entity in a pandas dataframe, some lines are moved out of the entity_linker function, and the function is adjusted to link to a single knowledge base.
Readers can compare running the code with those lines outside the function against the general entity linker used above. The difference is interesting because it shows when a function should stay general and when it should be tweaked for speed.
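The idea of hoisting the expensive setup out of the per-row function might be sketched like this. It is an illustration under assumptions: a spaCy 3 based scispaCy release, an `entity` column holding the entity strings, and helper and output column names that are mine rather than the notebook's.

```python
def load_linker(kb_name, model_name="en_core_sci_sm"):
    """Load the pipeline and linker once; this is the expensive step."""
    import spacy
    from scispacy.linking import EntityLinker  # noqa: F401  (registers the pipe)

    nlp = spacy.load(model_name)
    nlp.add_pipe(
        "scispacy_linker",
        config={"resolve_abbreviations": True, "linker_name": kb_name},
    )
    return nlp, nlp.get_pipe("scispacy_linker")


def best_match(entity_text, nlp, linker):
    """Return (concept_id, name, score, definition) for the top candidate, or Nones."""
    for ent in nlp(entity_text).ents:
        if ent._.kb_ents:
            concept_id, score = ent._.kb_ents[0]
            concept = linker.kb.cui_to_entity[concept_id]
            return concept_id, concept.canonical_name, float(score), concept.definition
    return None, None, None, None


def link_dataframe(df, kb_name, column="entity"):
    """Return a copy of df with <kb>_id, <kb>_name, <kb>_score, <kb>_definition."""
    import pandas as pd

    nlp, linker = load_linker(kb_name)  # loaded once, reused for every row
    matches = df[column].apply(lambda text: best_match(text, nlp, linker))
    cols = [f"{kb_name}_{field}" for field in ("id", "name", "score", "definition")]
    linked = pd.DataFrame(matches.tolist(), columns=cols, index=df.index)
    return df.join(linked)
```

If swifter is installed, importing it and swapping `.apply` for `.swifter.apply` is the usual way to use it; whether that speeds up a spaCy pipeline will depend on your setup.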
This is a view of the resulting dataframe, showing what each entity links to in the available knowledge bases. Some entities had definitions in all 4 of the scispaCy knowledge bases used here.
Conclusion
This tutorial showed how to extract biomedical and clinical entities and link them to medical knowledge bases using the scispaCy Python library. I hope this answers some of your questions and that you have a great time exploring.
Did you also learn a couple of things about pandas and swifter?
You can find more about scispaCy here.