Biomedical

ELIXIR Core Data Resources

The Molecular INTeraction (MINT) Database

The MINT Database “focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators.”

Licata, Luana, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco et al. “MINT, the molecular interaction database: 2012 update.” Nucleic acids research 40, no. D1 (2012): D857-D861.

Encyclopedia of DNA Elements (ENCODE)

Experimental Data

The ENCODE experimental dataset contains information for approximately 7,000 experiments along with 14,000 BED files collected by the Encyclopedia of DNA Elements (ENCODE) Consortium. Examples of experiment metadata captured include the target biosample, assay type, and genome assembly. Data Commons includes the metadata for all experimental datasets in ENCODE as of 2019.

Data made available under: ENCODE Data Use Policy for External Users. This data was formatted for Data Commons through a collaboration with Dr. Anthony Oro’s group at Stanford University.

European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI)

UniProt

Data Commons includes protein sequence and functional information, including protein interactions with chemical compounds, maintained by the UniProt Consortium. The data is made available under the Creative Commons Attribution (CC BY 4.0) License. Further information on the UniProt License and Disclaimer can be found here. The UniProt Consortium states how to cite UniProt data used in a journal article.

This data is made available under the EMBL-EBI Terms of Use.

Gene Ontology Consortium

The Sequence Ontology

“The Sequence Ontology is a set of terms and relationships used to describe the features and attributes of biological sequence. SO includes different kinds of features which can be located on the sequence. Biological features are those which are defined by their disposition to be involved in a biological process. Examples are ‘binding site’ and ‘exon’. Biomaterial features are those which are intended for use in an experiment such as aptamer and PCR_product. There are also experimental features which are the result of an experiment. SO also provides a rich set of attributes to describe these features such as ‘polycistronic’ and ‘maternally imprinted’.”

Gene Ontology Consortium data and data products are licensed under the Creative Commons Attribution 4.0 Unported License. When using or citing GO data, please mention the particular release. For example, include where applicable the date (e.g. ‘2024-01-17’), Zenodo DOI (e.g. ‘10.5281/zenodo.10536401’), and links. More information on licensing and attribution for the Gene Ontology Consortium can be found here.

International Committee on Taxonomy of Viruses (ICTV)

Master Species List

The official, current virus taxonomy approved by the ICTV. To accomplish the task of organizing and maintaining this virus taxonomy, the ICTV is composed of 7 subcommittees covering Animal DNA viruses and Retroviruses, Animal dsRNA and ssRNA (-) viruses, Animal ssRNA (+) viruses, Bacterial viruses, Archaeal Viruses, Fungal and Protist viruses, and Plant viruses. The ICTV has established over 100 international Study Groups (SGs) covering all major virus families and genera. The MSL version currently in the graph is MSL38 v3 released on 2023-09-11.

Virus Metadata Resource

The ICTV chooses an exemplar virus for each species and the VMR provides a list of these exemplars. An exemplar virus serves as an example of a well-characterized virus isolate of that species and includes the GenBank accession number for the genomic sequence of the isolate as well as the virus name, isolate designation, suggested abbreviation, genome composition, and host source. The VMR version currently in the graph is VMR MSL38 v2 released on 2023-09-13. This data is made available under Creative Commons Attribution ShareAlike 4.0 International (CC BY-SA 4.0).

Jensen Lab (University of Copenhagen)

DISEASES: Experiment

DISEASES is a weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies. This dataset further unifies the evidence by assigning confidence scores that facilitate comparison of the different types and sources of evidence. All files start with the following four columns: gene identifier, gene name, disease identifier, and disease name. The experiments files further contain the source database, the source score, and the confidence score. For further details please refer to the following Open Access articles about the database: DISEASES: Text mining and data integration of disease-gene associations and DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration. The data is made available under the CC-BY license.

DISEASES: Knowledge

DISEASES is a weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies. This dataset further unifies the evidence by assigning confidence scores that facilitate comparison of the different types and sources of evidence. All files start with the following four columns: gene identifier, gene name, disease identifier, and disease name. The knowledge files further contain the source database, the evidence type, and the confidence score. For further details please refer to the following Open Access articles about the database: DISEASES: Text mining and data integration of disease-gene associations and DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration. The data is made available under the CC-BY license.

DISEASES: Textmining

DISEASES is a weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies. This dataset further unifies the evidence by assigning confidence scores that facilitate comparison of the different types and sources of evidence. All files start with the following four columns: gene identifier, gene name, disease identifier, and disease name. The textmining files contain the z-score, the confidence score, and a URL to a viewer of the underlying abstracts. For further details please refer to the following Open Access articles about the database: DISEASES: Text mining and data integration of disease-gene associations and DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration. The data is made available under the CC-BY license.
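The column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the textmining files are tab-separated with no header row and exactly seven columns in the order listed; the `TextminingRecord` type, the `parse_textmining` helper, and the sample rows are hypothetical placeholders, not real DISEASES data.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class TextminingRecord(NamedTuple):
    """One row of a textmining file (field names are illustrative)."""
    gene_id: str
    gene_name: str
    disease_id: str
    disease_name: str
    z_score: float
    confidence: float
    viewer_url: str


def parse_textmining(handle: Iterable[str]) -> List[TextminingRecord]:
    """Parse tab-separated rows; rows without exactly 7 fields are skipped."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) != 7:
            continue  # assumption: malformed rows are ignored rather than fatal
        gene_id, gene_name, disease_id, disease_name, z, conf, url = row
        records.append(
            TextminingRecord(gene_id, gene_name, disease_id, disease_name,
                             float(z), float(conf), url)
        )
    return records


# Invented placeholder rows, used only to exercise the parser.
SAMPLE = (
    "GENE0001\tGENE_A\tDIS:0001\tExample disease\t6.5\t3.2\thttps://example.org/viewer/1\n"
    "GENE0002\tGENE_B\tDIS:0002\tAnother disease\t4.0\t2.0\thttps://example.org/viewer/2\n"
)

records = parse_textmining(io.StringIO(SAMPLE))
strongest = max(records, key=lambda r: r.confidence)
print(strongest.gene_name, strongest.disease_name, strongest.confidence)
```

The knowledge and experiments files share the first four columns, so the same approach applies with the trailing fields swapped for the ones each file type carries.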

Side Effect Resource (SIDER) 4.1

SIDER is a database of adverse drug reactions. Available information includes side effect frequency, drug and side effect classifications, and links to further information such as drug–target relations. However, SIDER uses the MedDRA ontology, which falls under the UMLS license and is limited to non-commercial use. Therefore, only the data released under the CC0 license, namely the mappings of PubChem Compound IDs (CIDs) and ATC codes, are hosted. Data Commons hosts version 4.1 of SIDER, released on October 21, 2015. Information about citing SIDER can be found here.

This data is made available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

New York Botanical Garden (NYBG)

C. V. Starr Virtual Herbarium (Collaboration)

C. V. Starr Virtual Herbarium is a public specimen database with photos and detailed records about millions of plants, fungi, and algae.

PharmGKB

PharmGKB Primary Data

The Pharmacogenomics Knowledge Base, PharmGKB, is an interactive tool for researchers investigating how genetic variation affects drug response. PharmGKB displays genotype, molecular, and clinical knowledge integrated into pathway representations and Very Important Pharmacogene (VIP) summaries with links to additional external resources. Users can search and browse the knowledge base by genes, variants, drugs, diseases, and pathways. The Primary Data contains summary information on chemicals, drugs, genes, genetic variants, and phenotypes.

PharmGKB Relationships Data

PharmGKB reports associations between chemicals, diseases, genes, and genetic variants, both with themselves and with each other.

Data made available under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence. Explicit licensing for PharmGKB can be viewed on the download page.

The Human Protein Atlas

The Tissue Atlas

The Human Protein Tissue Atlas contains information about the distribution of proteins in human tissues, derived from antibody-based protein profiling of 44 normal human tissue types and mRNA expression data from 37 different normal tissue types.

This dataset is available under CC BY-SA 3.0. Please also see their Disclaimer and Licence & Citation.

U.S. Adopted Names (USAN) Council

USAN Stems

USAN stems represent common stems for which chemical and/or pharmacologic parameters have been established. These council-approved stems and their definitions are recommended for use in coining new nonproprietary drug names belonging to an established series of related agents. USAN appropriately incorporates this established class stem system. By doing so, similar compounds maintain a common “family” name that provides immediate recognition.

This data is made available through openFDA terms of service.

U.S. National Institutes of Health: National Center for Biotechnology Information (NIH: NCBI)

NCBI Assembly

“The NCBI Assembly database provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project” (Kitts et al. 2016). In this import we include the metadata for all genome assemblies documented in assembly_summary_genbank.txt and assembly_summary_refseq.txt. Assemblies are stored in GenomeAssembly nodes whose information is integrated from both the GenBank and RefSeq datasets.
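The contig N50 mentioned in the quoted description is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal sketch of that calculation; the function name `n50` and the input lengths are illustrative, and it is not taken from NCBI's own tooling.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Illustrative lengths only: total is 290, and 80 + 70 = 150 covers half.
example = n50([80, 70, 50, 40, 30, 20])
print(example)
```

Comparing `running * 2` against `total` avoids floating-point division, which matters little here but keeps the comparison exact for very large assemblies.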

NCBI Gene

NCBI Gene “supplies gene-specific connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are used throughout NCBI’s databases and tracked through updates of annotation. Gene includes genomes represented by NCBI Reference Sequences (or RefSeqs) and is integrated for indexing and query and retrieval from NCBI’s Entrez and E-Utilities systems. Gene comprises sequences from thousands of distinct taxonomic identifiers, ranging from viruses to bacteria to eukaryotes. It represents chromosomes, organelles, plasmids, viruses, transcripts, and millions of proteins.”

NCBI Taxonomy

NCBI Taxonomy “consists of a curated set of names and classifications for all of the source organisms represented in the International Nucleotide Sequence Database Collaboration (INSDC). The NCBI Taxonomy database contains a list of names that are determined to be nomenclaturally correct or valid (as defined according to the different codes of nomenclature), classified in an approximately phylogenetic hierarchy (depending on the level of knowledge regarding phylogenetic relationships of a given group) as well as a number of names that exist outside the jurisdiction of the codes. That is, it focuses on nomenclature and systematics, rather than documenting the description of taxa.”

This data is from an NIH human genome unrestricted-access data repository and made accessible under the NIH Genomic Data Sharing (GDS) Policy.

U.S. National Institutes of Health: National Library of Medicine

Medical Subject Headings (MeSH)

The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information. Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Concept Record, and Term elements of MeSH as described here, as defined by all four XML files provided by MeSH (desc, pa, qual, and supp). Data Commons includes production year 2024 MeSH.

PubChem

PubChem is the world’s largest collection of freely accessible chemical information. Search chemicals by name, molecular formula, structure, and other identifiers. Find chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations and more.

This data is from the National Library of Medicine (NLM) and is not subject to copyright and is freely reproducible as stated in the NLM’s copyright policy.

University of Maryland School of Medicine, Institute of Genome Sciences

Disease Ontology

The Disease Ontology was developed as a project by the Institute of Genome Sciences at the University of Maryland School of Medicine. It “is a community driven, open source ontology that is designed to link disparate datasets through disease concepts”. It provides a “standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts”.

The data is made available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

World Health Organization (WHO)

ATC Codes

Anatomical Therapeutic Chemical (ATC) is a hierarchical classification system for pharmacological substances. ‘In the ATC classification system, the active substances are classified in a hierarchy with five different levels. The system has fourteen main anatomical/pharmacological groups or 1st levels. Each ATC main group is divided into 2nd levels which could be either pharmacological or therapeutic groups. The 3rd and 4th levels are chemical, pharmacological or therapeutic subgroups and the 5th level is the chemical substance. The 2nd, 3rd and 4th levels are often used to identify pharmacological subgroups when that is considered more appropriate than therapeutic or chemical subgroups.’
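The five levels can be read directly off a full ATC code. The sketch below assumes the standard WHO code layout, which this page does not spell out: one letter (1st level), two digits (2nd), one letter (3rd), one letter (4th), and two digits (5th). The helper name `atc_levels` is hypothetical, and A10BA02 (metformin) is used as the example code.

```python
import re
from typing import List

# Letter, two digits, letter, letter, two digits: a complete 5th-level code.
ATC_PATTERN = re.compile(r"^([A-Z])(\d{2})([A-Z])([A-Z])(\d{2})$")


def atc_levels(code: str) -> List[str]:
    """Split a full ATC code into its five cumulative level prefixes."""
    match = ATC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a complete 5th-level ATC code: {code!r}")
    parts = match.groups()
    # Each level is the concatenation of all parts up to that depth.
    return ["".join(parts[:depth]) for depth in range(1, 6)]


levels = atc_levels("A10BA02")
for depth, prefix in enumerate(levels, start=1):
    print(f"level {depth}: {prefix}")
```

Returning cumulative prefixes rather than the raw fragments makes it straightforward to link a substance to each of its ancestor groups in the hierarchy.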

Data made available under CC BY-NC-SA 3.0 IGO.

Page last updated: November 21, 2024