Wuraola Oyewusi

NLP August 11, 2019

How to Use scispaCy for Biomedical Named Entity Recognition, Abbreviation Resolution and UMLS Linking

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. https://allenai.github.io/scispacy/

I find scispaCy interesting and decided to share part of my exploration of the library. I hope this makes working with scispaCy easier for someone. The Google Colaboratory notebook for this article can be found here.

At the time of writing, scispaCy has two entity mention models (small and medium) and four NER models optimized for different kinds of entities. Check here to view the models and the entities they work for.

We will explore three models here: one entity mention model, en_core_sci_md, and two NER models, en_ner_bc5cdr_md (for disease and chemical entities) and en_ner_bionlp13cg_md (for cancer, organ, tissue, organism, cell, amino_acid, gene_or_gene_products, anatomical_entities, etc.).

Install library and models
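The install step can be sketched in Python as follows. This is a minimal sketch, not the original notebook cell: the helper names model_url, install_commands and install_all are hypothetical, and both the release URL pattern and the example version are assumptions that should be checked against the scispaCy release notes, since each model archive has to match the installed scispaCy version.

```python
import subprocess
import sys

# Assumed URL pattern for scispaCy model archives; verify it against the
# project's release notes before relying on it.
BASE_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases"

# The three models explored in this article.
MODELS = ["en_core_sci_md", "en_ner_bc5cdr_md", "en_ner_bionlp13cg_md"]


def model_url(name, version):
    """Build the download URL for one model archive (hypothetical helper)."""
    return f"{BASE_URL}/v{version}/{name}-{version}.tar.gz"


def install_commands(version, models=MODELS):
    """Return the pip commands for the library and each model, without running them."""
    pip = [sys.executable, "-m", "pip", "install"]
    commands = [pip + [f"scispacy=={version}"]]
    commands += [pip + [model_url(name, version)] for name in models]
    return commands


def install_all(version):
    """Run every install command in order; raises on the first failure."""
    for command in install_commands(version):
        subprocess.run(command, check=True)
```

Calling install_all("0.5.4") would install the library and the three models for that release, assuming that version exists and the URL pattern still holds. In a Colab cell the same commands can be run directly with !pip install.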

The test document used in this article can be found here.

Snippet of sample document

The Python function display_entities() accepts a model and a document and returns a displaCy image along with each entity and its label. The function will be used with three different scispaCy models on the test document. It can be adjusted as needed, e.g. to view the dependency parse instead of the entities, use displacy.render(doc, jupyter=True, style='dep').
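A function along these lines might look like the sketch below. It is a reconstruction under assumptions, not the original code: it assumes the model is passed by name and loaded with spacy.load, and the helper entity_pairs and the style parameter are hypothetical additions for illustration.

```python
def entity_pairs(doc):
    """Return (text, label) for every entity in a spaCy Doc, or in any object
    exposing `.ents` whose items have `.text` and `.label_`."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def display_entities(model_name, document, style="ent"):
    """Load a model by name, render the document with displaCy, and return
    the entity/label pairs. Pass style="dep" for the dependency view."""
    # Imported here so entity_pairs can be used without spaCy installed.
    import spacy
    from spacy import displacy

    nlp = spacy.load(model_name)   # assumes the model package is installed
    doc = nlp(document)
    displacy.render(doc, jupyter=True, style=style)
    return entity_pairs(doc)
```

With the models installed, display_entities("en_ner_bc5cdr_md", text) would render the highlighted text in the notebook and return the list of pairs, where text is the test document loaded as a string.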

A Python function that displays entities and labels
View of entity mentions image
Word Entity Mention and Label
View of Bionlp13cg Named Entities
Bionlp13cg Named Entities and Label
View of Bc5cdr Named Entities
Bc5cdr Named Entities and Label

The function show_medical_abbreviation() accepts a model and a document and returns the abbreviations found along with their resolutions (long forms). The function can be adjusted as needed. I applied set() to the list so that only unique values are returned.
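The abbreviation step can be sketched as below. This is an assumption-laden sketch rather than the original code: it uses the "abbreviation_detector" pipe name from more recent scispaCy releases (older releases constructed AbbreviationDetector(nlp) and added the object directly), and the helper unique_abbreviations is hypothetical. It de-duplicates with a dict instead of a set so that the pairs keep the order in which they first appear.

```python
def unique_abbreviations(doc):
    """Return unique (short form, long form) pairs in first-seen order.

    Expects `doc._.abbreviations`, the extension that scispaCy's
    AbbreviationDetector sets on a processed document."""
    seen = {}
    for abrv in doc._.abbreviations:
        seen.setdefault((abrv.text, str(abrv._.long_form)), None)
    return list(seen)


def show_medical_abbreviation(model_name, document):
    """Load a model, add the abbreviation detector, and return the pairs."""
    import spacy
    # Importing the class registers the "abbreviation_detector" factory.
    from scispacy.abbreviation import AbbreviationDetector  # noqa: F401

    nlp = spacy.load(model_name)   # assumes the model package is installed
    nlp.add_pipe("abbreviation_detector")
    return unique_abbreviations(nlp(document))
```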

A Python function that resolves medical abbreviations
Detected medical abbreviations and their resolutions

The function unified_medical_language_entity_linker() accepts a model and a document, extracts the named entities, and links each one to the Unified Medical Language System (UMLS), returning the Concept Unique Identifier (CUI), definition, aliases and match score for the entity. At the time of writing, this is an alpha feature in scispaCy: the entity linker takes a while to load and still emits user warnings, but it is well worth trying out.
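The linking step can be sketched as follows. This is a sketch under assumptions, not the original code: it uses the "scispacy_linker" pipe and the kb_ents extension from more recent scispaCy releases, whereas the alpha version available when this article was written exposed a differently named linker class and attributes. The helper linked_entities and its top_k parameter are hypothetical, and the code assumes the candidates for each entity arrive sorted by score, best first.

```python
def linked_entities(doc, cui_to_entity, top_k=1):
    """Return one dict per (entity, candidate concept) pair.

    Each entity's `_.kb_ents` is expected to be a list of (CUI, score) pairs,
    and `cui_to_entity` a mapping from CUI to a knowledge-base record with
    `canonical_name`, `definition` and `aliases` attributes."""
    rows = []
    for ent in doc.ents:
        for cui, score in ent._.kb_ents[:top_k]:
            concept = cui_to_entity[cui]
            rows.append({
                "entity": ent.text,
                "cui": cui,
                "name": concept.canonical_name,
                "definition": concept.definition,
                "aliases": concept.aliases,
                "score": score,
            })
    return rows


def unified_medical_language_entity_linker(model_name, document):
    """Load a model, add the UMLS linker, and return the linked entities."""
    import spacy
    # Importing the class registers the "scispacy_linker" factory.
    from scispacy.linking import EntityLinker  # noqa: F401

    nlp = spacy.load(model_name)   # assumes the model package is installed
    # The first call downloads and loads the UMLS index, which is slow.
    nlp.add_pipe("scispacy_linker", config={"linker_name": "umls"})
    linker = nlp.get_pipe("scispacy_linker")
    return linked_entities(nlp(document), linker.kb.cui_to_entity)
```

Keeping the row-building logic in a separate helper makes it easy to change which fields are reported, or to raise top_k to inspect more than one candidate concept per entity.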

A Python function that links named entities to the UMLS database
Bc5cdr Named Entities and UMLS links
Bionlp13cg Named Entities and UMLS links

If you worked through this, I hope you had a great time too and that I did a good job taking you through scispaCy.