TODO: group, clean it up and put most things in table format
This is a private list and accordingly chaotic
natural language parsing and information extraction
NLP Platforms, Libraries and Algorithms
TextAttack - utility for adversarial attacks on NLP models
from medacy.model.model import Model

model = Model.load_external('medacy_model_clinical_notes')
annotation = model.predict("The patient was prescribed 1 capsule of Advil for 5 days.")
print(annotation)
When an area like NLP is exploding with options, it is not expertise but faster feedback cycles and creativity that win the game
Prodigy from the makers of Spacy
Prodigy is a fully scriptable annotation tool so efficient that data scientists can do the annotation themselves
Frontends with Python
FastAPI is a modern, fast (high-performance) web framework for building APIs
Sentence-Transformers, also through spaCy
dependency forest: differentiable parser
A dependency forest encodes all valid dependency trees of a sentence into a 3D space so that syntax parsing becomes differentiable
We proposed an efficient and effective relation extraction model that leverages full dependency forests, each of which encodes all valid dependency trees into a dense and continuous 3D space.
we define a full forest as a 3-dimensional tensor, with each point representing the conditional probability p(w_j, l | w_i) of one word w_i modifying another word w_j with a relation l.
Compared with a 1-best tree, a full dependency forest efficiently represents all possible dependency trees (including the gold tree) within a compact and dense structure.
This method allows us to merge a parser into a relation extraction model so that the parser can be jointly updated based on end-task loss.
Extensive experiments show the superiority of forests for RE, which significantly outperform all carefully designed baselines based on 1-best trees or surface strings.
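The tensor definition above can be sketched in a few lines. This is a toy illustration with random scores, not the paper's parser; the point is that normalizing over all (head, label) choices yields a dense, differentiable forest instead of a hard 1-best tree.

```python
import numpy as np

# Toy "full dependency forest": point [i, j, l] holds p(w_j, l | w_i),
# the probability that word w_i modifies word w_j with dependency label l.
n_words, n_labels = 5, 3
rng = np.random.default_rng(0)
scores = rng.normal(size=(n_words, n_words, n_labels))

# Normalize over all (head, label) choices for each word i with a softmax,
# so the forest stays dense and differentiable (no argmax, no pruning).
flat = scores.reshape(n_words, -1)
flat = np.exp(flat - flat.max(axis=1, keepdims=True))
forest = (flat / flat.sum(axis=1, keepdims=True)).reshape(n_words, n_words, n_labels)

print(forest.shape)             # (5, 5, 3)
print(forest.sum(axis=(1, 2)))  # each word's head/label distribution sums to 1
```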
SciSpacy has different NER models depending on the biomedical subdomain (drug/disease, cancer, genomics)
TL;DR: More efficient text encoders emerge from an attempt (failed?) to do "GANs for text".
The authors “train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark.” (Clark et al., 2020)
Going binary improves sample efficiency: instead of predicting a [MASKED] token, the discriminator only has to predict whether a plausible swap was sampled from a generator (painting -> mask -> car). The generator is thrown away afterwards
A more sample-efficient pre-training task called replaced token detection: instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network.
Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.
This is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.
30x training efficiency over BERT
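The replaced-token-detection setup can be illustrated with a tiny deterministic sketch. The `generator_samples` dict below is a made-up stand-in for the small generator network; the real model samples swaps from a masked LM.

```python
# Toy sketch of ELECTRA-style replaced token detection (not the actual
# implementation): corrupt some positions with "plausible" swaps, then
# label every position 0 (kept) or 1 (replaced).
tokens = ["the", "chef", "cooked", "the", "meal"]

# Hypothetical stand-in for the small generator network: per-position
# alternatives a masked LM might propose.
generator_samples = {1: "artist", 2: "ate"}

corrupted, labels = [], []
for i, tok in enumerate(tokens):
    if i in generator_samples:
        corrupted.append(generator_samples[i])
        labels.append(1)   # replaced -> discriminator target 1
    else:
        corrupted.append(tok)
        labels.append(0)   # original -> discriminator target 0

# Unlike MLM, the discriminator loss is defined over *all* positions,
# not just the ~15% that were masked.
print(corrupted)  # ['the', 'artist', 'ate', 'the', 'meal']
print(labels)     # [0, 1, 1, 0, 0]
```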
Jeff Hawkins' Numenta uses a very different approach to cognition
SciFact: claim verification; evaluate veracity of scientific claims. Double digit improvements
Interesting errors and work to be done:
Science background includes knowledge of domain-specific lexical relationships – e.g. Drosophila is a synonym for fruit fly.
Directionality requires understanding increases or decreases in scientific quantities – e.g. decreased (but not increased) lipogenesis implies impairment of the lipogenesis process.
Numerical reasoning involves interpreting numerical or statistical statements, often reporting confidence intervals or p-values.
Causality and effect requires reasoning about counterfactuals – e.g. if knocking out MVP inhibits tumor progression, then the presence of MVP enables tumor growth and aggression
The UMLS integrates and distributes key terminology, classification and coding standards, and associated resources to promote the creation of more effective and interoperable biomedical information systems and services, including electronic health records.
Challenges, Methods, Tasks
Hedge cue detection: detecting uncertain or hypothetical statements vs definite statements
Claim detection, argument mining and finding supporting or contradicting statements
Datasets, corpora and benchmarks; most are annotated
Initially I thought the problem was a lack of great datasets, but now I think it's finding out which of the hundreds of corpora are easiest to parse and build on
Dependencies, genetic, cancer etc.
INDRA uses some of it for DepMap Explainer
Our model-based candidates are iteratively updated to contain more difficult negative samples as our model evolves. In this way, we avoid the explicit pre-selection of negative samples from more than 400K candidates. On four biomedical entity normalization datasets covering three different entity types (disease, chemical, adverse reaction), our model BioSyn consistently outperforms previous state-of-the-art models, almost reaching the upper bound on each dataset.
Seems popular for very short lifespan
MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital).
Causal BioNet CBN
Private BEL databases can be licensed from Qiagen Ingenuity, Clarivate etc.
The Causal Biological Networks (CBN) database is composed of multiple versions of over 120 modular, manually curated, BEL-scripted biological network models supported by over 80,000 unique pieces of evidence from the scientific literature.
They represent causal signaling pathways across a wide range of biological processes including cell fate, cell stress, cell proliferation, inflammation, tissue repair and angiogenesis in the pulmonary and vascular systems.
CancerMine, a text-mined and routinely updated database of drivers, oncogenes and tumor suppressors in different types of cancer. All data are available online
S2ORC is a large contextual citation graph of English-language academic papers from multiple scientific domains; the corpus consists of 81.1M papers, 380.5M citation edges, and associated paper metadata. We provide structured full text for 8.1M open access papers.
All inline citation mentions in the full text are detected and linked to their corresponding bibliography entries, which are linked to their referenced papers, forming contextual citation edges. To our knowledge, this is the largest publicly-available contextual citation graph. The full text alone is the largest structured academic text corpus to date.
Never understood what this was supposed to be
And as always when the EU tries to coordinate: total failure
Biomedical Language Understanding Evaluation benchmark
BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora.
DDI (Drug-Drug Interaction) corpus
NaCTeM corpora 💫
links to corpora of various sizes, with different levels of annotation, and belonging to different domains.
🌲 🏦 Treebanks, Treebanks, Treebanks!
The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion.
lots of good training data and compound library
We present BioRel, a large-scale dataset constructed using the Unified Medical Language System (UMLS) as the knowledge base and Medline as the corpus. Entities in Medline sentences are identified and linked to UMLS by MetaMap. The relation label for each sentence is assigned using distant supervision.
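The distant-supervision labeling step can be sketched like this. The KB triples and sentences are invented for illustration; this is the general technique, not the BioRel pipeline.

```python
# Distant supervision: if a knowledge base holds (head, relation, tail),
# every sentence mentioning both entities gets that relation label.
kb = {
    ("aspirin", "inflammation"): "treats",
    ("BRCA1", "breast cancer"): "associated_with",
}

sentences = [
    "Aspirin reduces inflammation in most patients.",
    "BRCA1 mutations raise breast cancer risk.",
    "Aspirin is cheap.",
]

def distant_label(sentence, kb):
    low = sentence.lower()
    for (head, tail), rel in kb.items():
        if head.lower() in low and tail.lower() in low:
            return (head, rel, tail)
    return None  # no KB pair matched; stays unlabeled (a known noise source)

labels = [distant_label(s, kb) for s in sentences]
print(labels)
```

The well-known weakness shows in the third sentence: distant supervision can only label what the KB covers, and sentences that mention a KB pair without asserting the relation become noisy positives.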
The AImed corpus consists of 225 Medline abstracts. 200 abstracts describe interactions between human proteins, 25 do not refer to any interaction. There are 4084 protein references and around 1000 tagged interactions in this data set. In this data set there is no distinction between genes and proteins and the relations are symmetric.
🎉 List of biomedical datasets and tools
DDI, PGR, and BC5CDR corpus
"We evaluate our tLSTM model on five publicly available PPI corpora: AIMed 37(https://scite.ai/reports/10.1016/j.artmed.2004.07.016), BioInfer 38(), IEPA 39(https://scite.ai/reports/10.1142/9789812799623_0031), HPRD50 40(https://scite.ai/reports/10.1093/bioinformatics/btl616) and LLL "
Very structured but unfortunately no set for biomedicine
Biomedical Relations: THEMES (possible relation types)
(A+) agonism, activation
(A-) antagonism, blocking
(B) binding, ligand (esp. receptors)
(E+) increases expression/production
(E-) decreases expression/production
(E) affects expression/production (neutral)
(O) transport, channels
(K) metabolism, pharmacokinetics
(Z) enzyme activity
(T) treatment/therapy (including investigatory)
(C) inhibits cell growth (esp. cancers)
(Sa) side effect/adverse event
(Pr) prevents, suppresses
(Pa) alleviates, reduces
(J) role in disease pathogenesis
(Mp) biomarkers (of disease progression)
(U) causal mutations
(Ud) mutations affecting disease course
(D) drug targets
(J) role in pathogenesis
(Te) possible therapeutic effect
(Y) polymorphisms alter risk
(G) promotes progression
(Md) biomarkers (diagnostic)
(X) overexpression in disease
(L) improper regulation linked to disease
(B) binding, ligand (esp. receptors)
(W) enhances response
(V+) activates, stimulates
(E+) increases expression/production
(E) affects expression/production (neutral)
(I) signaling pathway
(H) same protein or complex
(Q) production by cell population
“What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines”
To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central. It also provides lexicon-based concept tagging as well as the state-of-the-art pre-trained taggers such as TaggerOne, GNormPlus and tmVar. ezTag is freely available at http://eztag.bioqrator.org
A collection of public and free annotated datasets of relationships between entities/nominals
Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to automatically annotate textual data.
Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This makes populating a knowledge graph with multiple nodes and edge types practically infeasible.
Data programming for the fast creation of annotated training sets
We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.
We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings
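A minimal sketch of the idea (Snorkel popularized it, but this is not Snorkel's API): labeling functions vote, and a simple majority vote stands in for the generative denoising model described above. The heuristics themselves are made up.

```python
# Data programming in miniature: noisy, possibly conflicting label functions
# vote on each example; a combiner turns the votes into one training label.
ABSTAIN, NEG, POS = -1, 0, 1

def lf_keyword(x):   # heuristic: interaction verbs suggest a relation
    return POS if "inhibits" in x or "activates" in x else ABSTAIN

def lf_negation(x):  # heuristic: explicit negation suggests no relation
    return NEG if "does not" in x else ABSTAIN

def lf_length(x):    # heuristic: very short sentences rarely assert much
    return NEG if len(x.split()) < 4 else ABSTAIN

def majority_vote(x, lfs):
    votes = [v for v in (lf(x) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POS if votes.count(POS) >= votes.count(NEG) else NEG

lfs = [lf_keyword, lf_negation, lf_length]
print(majority_vote("MVP inhibits tumor progression in mice", lfs))    # 1 (POS)
print(majority_vote("Aspirin does not affect BRCA1 expression", lfs))  # 0 (NEG)
```

The generative model in the paper goes beyond majority vote by learning per-function accuracies and correlations, but the interface is the same: programs in, denoised labels out.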
Annotation effort can be reduced by switching to an approach where the human continuously improves the model through annotation while using the model to extract information; the especially good news is that the largest model improvements are achieved very early in the process, as long as the domain is confined.
CiteSpace Scientometric visualization
CiteSpace has three core concepts: burst detection, betweenness centrality, and heterogeneous networks. These concepts can solve three practical problems: identifying the nature of research frontiers, marking keywords, and identifying emerging trends and sudden changes in time.
The main procedural steps of CiteSpace software are time slicing, thresholding, modeling, pruning, merging, and mapping and the main source of input data for CiteSpace is the Web of Science database. CiteSpace can identify frontier areas of current research by extracting burst terms from identifiers of titles, abstracts, descriptors, and bibliographic records. CiteSpace also makes it easier for users to recognize key points by identifying nodes with high betweenness centrality
Formats, Utils and DSLs: Write less code
Example Code and Projects
PubRunner: run tools against PubMed releases
People, Groups, Conferences
The UID or DOI of each dataset should let you search for corresponding data crunching algorithms or notebooks with an EDA
No matter how fast ML advances, people stay unsatisfied. That's good because datasets, benchmarks and algorithms have to incorporate higher and higher level features of discourse
SPECTER: scientific document level embeddings based on citations and transformers
INDRA, Harvard team. Looks insanely advanced
INDRA (the Integrated Network and Dynamical Reasoning Assembler) assembles information about biochemical mechanisms into a common format that can be used to build several different kinds of explanatory models. Sources of mechanistic information include pathway databases, natural language descriptions of mechanisms by human curators, and findings extracted from the literature by text mining. Mechanistic information from multiple sources is de-duplicated, standardized and assembled into sets of mechanistic Statements with associated evidence. Sets of Statements can then be used to assemble both executable rule-based models (using PySB) and a variety of different types of network models.
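The de-duplication and assembly step can be sketched as follows. This is an illustrative data model only, not INDRA's actual API: findings are normalized to a statement key, then merged while pooling their evidence.

```python
from collections import defaultdict

# Mechanistic findings from several hypothetical sources (readers, databases),
# already normalized to subject/verb/object form.
raw_findings = [
    {"subj": "BRAF", "verb": "phosphorylates", "obj": "MAP2K1", "source": "reach"},
    {"subj": "BRAF", "verb": "phosphorylates", "obj": "MAP2K1", "source": "trips"},
    {"subj": "EGFR", "verb": "activates",      "obj": "KRAS",   "source": "pathway_db"},
]

# De-duplicate on a standardized statement identity, keeping all evidence.
assembled = defaultdict(list)
for f in raw_findings:
    key = (f["subj"], f["verb"], f["obj"])
    assembled[key].append(f["source"])

for stmt, evidence in assembled.items():
    print(stmt, "evidence:", evidence)
```

The real system additionally grounds entities to ontologies (e.g. FamPlex, as described below) and scores statements by evidence, but the merge-on-identity idea is the core of assembly.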
INDRA-IPM also recognizes protein families and complexes and grounds them in the FamPlex ontology. In some cases, there is ambiguity in the name of a specific gene and a family it is part of. An example of this is the grounding of “JUN” from text to the JUN family, which also includes the JUN gene. In this case the user can use a synonym such as “c-JUN” that refers to the singular entity in order to reference only the gene and not the family.
We have exposed two reading systems to users. The REACH reader developed by the CLU Lab at the University of Arizona is an information extraction system for the biomedical domain, which aims to read scientific literature and extract cancer signaling pathways. We recommend users try REACH first due to its speed. The TRIPS/DRUM system developed by IHMC may offer greater mechanistic detail in some use cases (for instance, it supports recognizing complex molecular conditions such as “BRAF-V600E not bound to Vemurafenib”), but it requires significantly longer to run.
There are two broad motivations for comparing the similarities and differences within a family of models. In the first case, a research team is building a family of models up from a base model over time. As members leave the project, new members join to replace them. The continuity of the project is thus greatly facilitated by the ability of the new members to browse the history of the model and identify when and where modifications were made. Identifying the common core among the family of models is essential, since the elements that are not present in the core represent modifications to the model.
In the second case, a researcher intends to model a particular signaling pathway or set of pathways. As part of this process, they would want to see what elements of that pathway have been previously modeled, and explore the relationships among existing models in the literature. The researcher downloads several models from one of the several existing online databases [14–17] in a commonly-used model exchange format such as the Systems Biology Markup Language (SBML). The researcher would like to see at a glance which model components are shared and which are unique.
Starting from these two motivating cases, and through close interaction with domain experts, we identified the following major tasks where visualizations can benefit model comparison in the area of cell signaling. Because of the similarities between model usage in this domain and in other domains, we assert that many of these tasks have global applications to model comparison beyond the cell signaling domain.
Identify similar structures within models. Identifying similar structures is beneficial because if two different models share a common core, it is likely that those models can be combined to form a single, more-complete model. Additionally, searching for a single structure common to a significant subset of a family of models can help to identify models missing this structure. This can help researchers make observations about the functionality of that subset of models.
Identify structures that differ between pairs of models. Performing a pairwise comparison similar to task 1 with the goal of identifying structures that differ between the models helps researchers identify model components present in one model that do not appear in the other. Researchers can use this information to explore the functional effects of the structural differences between models. When identifying both the similarities and differences between graphs, minimizing layout differences is essential to enable the user to see changes [19, 20].
Sort/cluster models by similarity. Sorting models by degree of similarity helps to minimize visual differences between graphs in proximity to each other, facilitating comparison. As such, a method for computing the similarity of a pair of models should be developed or found from literature. Following this, the models should be laid out based on these scores in a clear and visually pleasing way.
Support pairwise detailed comparison. Building upon the similarity and difference comparison of a pair of models, a researcher should also be able to examine the similar or differing structures of the models in more detail. In particular, the researcher may wish to examine the individual rules within the model to determine the level of similarity.
Explore the functional effects of differences between model structures. The researcher may also wish to explore the functional effects of model changes. In particular, the researcher should be able to perform a pairwise comparison of the simulation results or other species and reactions in the generated network of a model, in order to identify how the changes within a model affect the generated outputs.
Organize and browse model repositories. A researcher should be able to use this system to organize and browse a set of possibly unrelated models from a database or online repository. The researcher should still be able to look at the similar and different structures across the collection of models under examination.
Enable the ability to share model layouts with other researchers. Finally, if a researcher wishes to highlight important structural features that were custom-encoded into a model, that researcher must be able to also convey the structure of the model along with the model itself. To keep the model interactive and to share all of the properties of the model, simply sharing a screenshot of a model is not sufficient. Therefore, although the model language may not specify any kind of set structural information, that structural information needs to be maintained.
This task analysis breakdown shows that a number of problems related to the comparison of models can be solved or aided with visualization. Specifically, Tasks 1–6 can be performed with a clear visual representation of the model(s), and are specifically addressed in this work. Task 7, on the other hand, is not specifically a visualization challenge, but can be facilitated by specific aspects of our visualization system.
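The first three tasks can be sketched minimally by treating each model as an edge set; the pathway edges below are hypothetical, and real tools use richer graph-matching than plain Jaccard similarity.

```python
# Three toy signaling models as sets of directed edges.
models = {
    "m1": {("EGFR", "KRAS"), ("KRAS", "RAF"), ("RAF", "MEK")},
    "m2": {("EGFR", "KRAS"), ("KRAS", "RAF"), ("RAF", "ERK")},
    "m3": {("TP53", "MDM2")},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Task 1: shared core. Task 2: pairwise structural differences.
core = models["m1"] & models["m2"]
diff = models["m1"] ^ models["m2"]

# Task 3: similarity scores for sorting/clustering the family of models.
pairs = sorted(
    ((jaccard(models[x], models[y]), x, y)
     for x in models for y in models if x < y),
    reverse=True,
)
print(core, diff)
print(pairs)   # most similar pair first
```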
Kappa Language Harvard
The challenge of building models of complex biomolecular systems necessitates an intermediate stage to bridge between the biophysically and biochemically grounded descriptions in papers and databases on the one hand and the ungrounded abstract language of Kappa on the other. This intermediate staging area is a knowledge representation that enables the user-driven aggregation of nuggets encapsulating mechanistic information pertaining to protein-protein interactions, the visualization of their interrelations, identification of conflicts, etc. In essence, it aims at being the biologically grounded model the user reasons about. This model is then compiled into Kappa for execution. From this perspective, Kappa is seen as an "assembler" code rather than the primary language for building a model in the first place.
KAMI stands for Knowledge Aggregator and Model Instantiator. It is an ongoing development led by Russ Harmer at ENS Lyon. KAMI poses interesting challenges in knowledge representation and multi-level graph rewriting.
Nanopublications are implemented in the language RDF and come with an evolving ecosystem of tools and systems. They can be published to a decentralized server network, for example, and then queried, accessed, reused, and linked.
Furthermore, because nanopublications can be attributed and cited, they provide incentives for researchers to make their data available in standard formats that drive data accessibility and interoperability. Nanopublications have the following general structure:
A nanopublication has three basic elements: the assertion, its provenance, and publication information (metadata about the nanopublication itself).
spaCy’s parser component can be trained to predict any type of tree structure over your input text – including semantic relations that are not syntactic dependencies. This can be useful for conversational applications, which need to predict trees over whole documents or chat logs, with connections between the sentence roots used to annotate discourse structure.
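The head/label encoding such a parser trains on can be sketched without spaCy. The tokens, head indices, and labels below are made up; the point is that linking sentence roots (token 5 attaching to token 1) turns two sentence trees into one document-level tree.

```python
# A dependency-style tree as parallel arrays: each token stores the index of
# its head and a relation label; the root points at itself.
tokens = ["I", "ordered", "pizza", ".", "It", "arrived", "cold", "."]
heads  = [1, 1, 1, 1, 5, 1, 5, 5]   # token 5 ("arrived") attaches to token 1
labels = ["nsubj", "ROOT", "dobj", "punct",
          "nsubj", "elaboration", "acomp", "punct"]

def is_valid_tree(heads):
    """Every token must reach the single self-attached root without cycles."""
    roots = [i for i, h in enumerate(heads) if h == i]
    if len(roots) != 1:
        return False
    for i in range(len(heads)):
        seen, node = set(), i
        while heads[node] != node:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

print(is_valid_tree(heads))  # True: one tree spanning both sentences
```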
UNCURL-App is a unified framework for interactively analyzing single-cell RNA-seq data. It is based on UNCURL for data preprocessing and clustering. It can be used to perform a variety of tasks such as:
Unsupervised or semi-supervised preprocessing
Interactive data analysis and visualization
https://biothings.io/ Annotation as a Service (variants, genes)
QTL Table Miner. Mining relations from structured tables
We present QTLTableMiner++ (QTM), a table mining tool that extracts and semantically annotates QTL information buried in (heterogeneous) tables of plant science literature
A significant amount of experimental information about Quantitative Trait Locus (QTL) studies is described in (heterogeneous) tables of scientific articles. Briefly, a QTL is a genomic region that correlates with a trait of interest (phenotype). QTM is a command-line tool to retrieve and semantically annotate results obtained from QTL mapping experiments.
In this study, we aimed to identify pathway figures published in the past 25 years, to characterize the human gene content in figures by optical character recognition, and to describe their utility as a resource for pathway knowledge.
[...] challenge would be to link data across multiple biological entities. For example, fields like systems chemical biology, which studies the effect of drugs on the whole biological system, require the integration and cross-linking of data across multiple domains, including genes, pathways, drugs as well as diseases
SemRegex provides solutions for a subtask of the program synthesis problem: generating regular expressions from natural language. Different from the existing syntax-based approaches, SemRegex trains the model by maximizing the expected semantic correctness of the generated regular expressions.
The semantic correctness is measured using the DFA-equivalence oracle, random test cases, and distinguishing test cases. The experiments on three public datasets demonstrate the superiority of SemRegex over the existing state-of-the-art approaches.
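The random-test-case check can be sketched with Python's `re` module. This is a heuristic stand-in for a real DFA-equivalence oracle: two regexes that agree on many random strings are only *probably* equivalent, but a single disagreement is a definite distinguishing test case.

```python
import random
import re

def probably_equivalent(pat_a, pat_b, trials=2000, alphabet="ab", max_len=8):
    """Compare two regexes on random strings; any mismatch proves inequivalence."""
    ra, rb = re.compile(pat_a), re.compile(pat_b)
    rng = random.Random(0)   # fixed seed for reproducibility
    for _ in range(trials):
        s = "".join(rng.choice(alphabet)
                    for _ in range(rng.randrange(max_len + 1)))
        if bool(ra.fullmatch(s)) != bool(rb.fullmatch(s)):
            return False     # found a distinguishing test case
    return True

print(probably_equivalent(r"a+", r"aa*"))  # True: same language
print(probably_equivalent(r"a+", r"a*"))   # False: they differ on ""
```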
New approaches to approximate causality from correlation
The goal of probabilistic programming is to enable probabilistic modeling and machine learning to be accessible to the working programmer, who has sufficient domain expertise, but perhaps not enough expertise in probability theory or machine learning. We wish to hide the details of inference inside the compiler and run-time.
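A hand-rolled sketch of what such a runtime hides: the model is ordinary code with random choices, and the "compiler" supplies generic inference. Here likelihood weighting estimates a coin's bias; a real probabilistic programming language (Stan, Pyro, etc.) would do this automatically and far more efficiently.

```python
import random

random.seed(0)
observed = [1, 1, 0, 1, 1, 1, 0, 1]   # coin flips (1 = heads): 6 of 8 heads

samples = []
for _ in range(20000):
    bias = random.random()            # prior: bias ~ Uniform(0, 1)
    weight = 1.0                      # likelihood of the data under this bias
    for flip in observed:
        weight *= bias if flip else (1.0 - bias)
    samples.append((bias, weight))

# Posterior mean via importance weights; analytically Beta(7, 3), mean 0.7.
total = sum(w for _, w in samples)
posterior_mean = sum(b * w for b, w in samples) / total
print(round(posterior_mean, 2))       # ≈ 0.7
```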
diseases, drugs, genes, variants, proteins, pathways