Broad Institute

GTEx Analysis V8 eQTL

The GTEx eGene and significant variant-gene association data were generated from samples “collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. Remaining samples are available from the GTEx Biobank.” The single-tissue cis-eQTL data from the v8 release was used.

GTEx is an NIH unrestricted-access repository of human genomic data, and the data is made available in compliance with the GTEx Data Release and Publication Policy. GTEx outlines how to cite use of GTEx data in journal publications.

ELIXIR Core Data Resources

The Molecular INTeraction (MINT) Database

The MINT Database “focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators.”

Licata, Luana, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco et al. “MINT, the molecular interaction database: 2012 update.” Nucleic acids research 40, no. D1 (2012): D857-D861.

Encyclopedia of DNA Elements (ENCODE)

Experimental Data

The ENCODE experimental dataset contains information for approximately 7,000 experiments along with 14,000 BED files collected by the Encyclopedia of DNA Elements (ENCODE) Consortium. Examples of experiment metadata captured include the target biosample, assay type, and genome assembly. Data Commons includes the metadata for all experimental datasets in ENCODE as of 2019.

Data made available under: ENCODE Data Use Policy for External Users. This data was formatted for Data Commons through a collaboration with Dr. Anthony Oro’s group at Stanford University.

European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI)

ChEMBL

“ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.” It includes information on drugs at all stages of drug discovery.

UniProt

Data Commons includes protein sequence and functional information, including protein interactions with chemical compounds, maintained by the UniProt Consortium. The data is made available under the Creative Commons Attribution (CC BY 4.0) License. Further information is available in the UniProt License and Disclaimer. The UniProt Consortium states how to cite UniProt data used in a journal article.

This data is made available under the EMBL-EBI Terms of Use.

International Committee on Taxonomy of Viruses (ICTV)

Master Species List

The Master Species List (MSL) is the official, current virus taxonomy approved by the ICTV. To organize and maintain this taxonomy, the ICTV is composed of 7 subcommittees covering Animal DNA viruses and Retroviruses, Animal dsRNA and ssRNA (-) viruses, Animal ssRNA (+) viruses, Bacterial viruses, Archaeal viruses, Fungal and Protist viruses, and Plant viruses. The ICTV has established over 100 international Study Groups (SGs) covering all major virus families and genera. The MSL version currently in the graph is MSL38 v3, released on 2023-09-11.

Virus Metadata Resource

The ICTV chooses an exemplar virus for each species, and the Virus Metadata Resource (VMR) provides a list of these exemplars. An exemplar virus serves as an example of a well-characterized virus isolate of that species and includes the GenBank accession number for the genomic sequence of the isolate as well as the virus name, isolate designation, suggested abbreviation, genome composition, and host source. The VMR version currently in the graph is VMR MSL38 v2, released on 2023-09-13. This data is made available under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Jensen Lab (University of Copenhagen)

DISEASES

DISEASES is a weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies. This dataset further unifies the evidence by assigning confidence scores that facilitate comparison of the different types and sources of evidence. For further details please refer to the following Open Access articles about the database: DISEASES: Text mining and data integration of disease-gene associations and DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration. The data is made available under the CC-BY license.

Side Effect Resource (SIDER) 4.1

SIDER is a database of adverse drug reactions. Available information includes side effect frequency, drug and side effect classifications, and links to further information, for example drug–target relations. However, this database uses the MedDRA ontology, which falls under the UMLS license and is limited to non-commercial use. Therefore, only the data under the CC0 license is hosted: the mappings of PubChem Compound IDs (CIDs) and ATC Codes. Data Commons hosts version 4.1 of SIDER, released on October 21, 2015. Information about citing SIDER can be found here.

This data is made available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

New York Botanical Garden (NYBG)

C. V. Starr Virtual Herbarium (Collaboration)

C. V. Starr Virtual Herbarium is a public specimen database with photos and detailed records about millions of plants, fungi, and algae.


PharmGKB Primary Data

The Pharmacogenomics Knowledge Base, PharmGKB, is an interactive tool for researchers investigating how genetic variation affects drug response. The PharmGKB website displays genotype, molecular, and clinical knowledge integrated into pathway representations and Very Important Pharmacogene (VIP) summaries with links to additional external resources. Users can search and browse the knowledgebase by genes, variants, drugs, diseases, and pathways. The Primary Data contains summary information on chemicals, drugs, genes, genetic variants, and phenotypes.

PharmGKB Relationships Data

PharmGKB reports associations between chemicals, diseases, genes, and genetic variants, both within each category and across categories.

Data made available under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. Explicit licensing for PharmGKB can be viewed on the download page.

Temporary Data Commons Data

Temporary Gene Mappings

This dataset maps Gene dcids generated the new way (bio/) to the old, preexisting Gene dcids (bio/_). These mappings are temporary and will remain until all data using the old method of Gene dcid generation has been updated.

Data is publicly available via Data Commons.

The Human Protein Atlas

The Tissue Atlas

The Human Protein Tissue Atlas contains information about the distribution of proteins in human tissues, derived from antibody-based protein profiling of 44 normal human tissue types and mRNA expression data from 37 different normal tissue types.

This dataset is available under CC BY-SA 3.0. Please also see their Disclaimer and Licence & Citation.

U.S. Adopted Names (USAN) Council

USAN Stems

USAN stems represent common stems for which chemical and/or pharmacologic parameters have been established. These council-approved stems and their definitions are recommended for use in coining new nonproprietary drug names belonging to an established series of related agents. USAN appropriately incorporates this established class stem system. By doing so, similar compounds maintain a common “family” name that provides immediate recognition.

This data is made available through openFDA terms of service.

U.S. Food and Drug Administration (FDA)

FDA-Approved Drugs

“Drugs@FDA includes information about drugs, including biological products, approved for human use in the United States.” Data Commons includes the information about the FDA application for the drug as well as the drug’s strength, active ingredients, dosage forms, administration routes, FDA therapeutic equivalence code, and marketing status.

This data is made available through openFDA terms of service.

U.S. National Institutes of Health: National Center for Biotechnology Information

ClinVar

“ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.” It contains reports of genetic “variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data.” Data Commons includes the January 6, 2020 release of the ClinVar archive supporting both hg19 and hg38 genome assemblies.

NCBI Gene

The NIH NCBI gene info datasets from NCBI Gene for a subset of species contain “gene-specific content based on NCBI’s RefSeq project, information from model organism databases, and links to other resources.” The NCBI RefSeq project is “a comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein”. The datasets included are from the February 19, 2020 update. The gene info files for the following species have been added:

  • Caenorhabditis elegans
  • Danio rerio
  • Drosophila melanogaster
  • Gallus gallus
  • Homo sapiens
  • Mus musculus
  • Saccharomyces cerevisiae
  • Xenopus laevis

This data is from an NIH human genome unrestricted-access data repository and made accessible under the NIH Genomic Data Sharing (GDS) Policy.

U.S. National Institutes of Health: National Library of Medicine

Medical Subject Headings (MeSH)

The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information. Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Concept Record, and Term elements of MeSH as described here, defined by all four XML files provided by MeSH (desc, pa, qual, and supp). Data Commons includes the production year 2024 release of MeSH.
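As a rough illustration of how Descriptor records in the desc file might be read, the sketch below parses a small inline XML fragment with Python's standard library. The fragment is a heavily trimmed sample written for this example, and the element names used (DescriptorRecordSet, DescriptorRecord, DescriptorUI, DescriptorName/String) are assumptions about the file layout rather than something stated on this page. It is not the pipeline Data Commons uses.

```python
import xml.etree.ElementTree as ET

# Heavily trimmed, illustrative sample; real MeSH records carry many more
# child elements (concepts, terms, tree numbers, dates, and so on).
SAMPLE_DESC_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000001</DescriptorUI>
    <DescriptorName><String>Calcimycin</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def read_descriptors(xml_text: str) -> dict[str, str]:
    """Return a mapping of descriptor unique ID to descriptor name."""
    root = ET.fromstring(xml_text)
    descriptors = {}
    for record in root.iter("DescriptorRecord"):
        ui = record.findtext("DescriptorUI")
        name = record.findtext("DescriptorName/String")
        if ui and name:
            descriptors[ui] = name
    return descriptors


descriptors = read_descriptors(SAMPLE_DESC_XML)
print(descriptors)
```

For the full desc file, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole document into memory.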

PubChem

PubChem is the world’s largest collection of freely accessible chemical information. Search chemicals by name, molecular formula, structure, and other identifiers. Find chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations and more.

This data is from the National Library of Medicine (NLM) and is not subject to copyright and is freely reproducible as stated in the NLM’s copyright policy.

UCSC Genomics Institute

Genome Browser

The UCSC Genome Browser originated from The Human Genome Project in 2000 to share and visualize genome data. It has grown to include an agglomeration of various genome assemblies and annotations. Data Commons includes data annotating chromosomes, genes, RNA transcripts, and genetic variants from the UCSC Genome Browser. The .chrom.sizes.txt files were downloaded from the UCSC Genome Browser Downloads page on August 13, 2019. The NCBI RefSeq files were downloaded from the UCSC Table Browser on August 2, 2019 for the following genome assemblies:

  • ce10
  • ce11
  • danRer10
  • danRer11
  • dm3
  • dm6
  • galGal5
  • galGal6
  • hg19
  • hg38
  • mm9
  • mm10
  • sacCer3
  • xenLae2
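A chrom.sizes file is a plain-text table with one sequence per line: the chromosome name followed by its length in bases. The sketch below shows one way such a file could be loaded into a dictionary. The function name and the sample values are hypothetical and chosen only for illustration; they are not taken from any of the assemblies listed above.

```python
def parse_chrom_sizes(text: str) -> dict[str, int]:
    """Parse chrom.sizes content into a {chromosome name: length} mapping."""
    sizes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"expected name and size, got: {line!r}")
        sizes[fields[0]] = int(fields[1])
    return sizes


# Hypothetical sample content, not real assembly values.
sample = "chrA\t1000\nchrB\t2500\n"
sizes = parse_chrom_sizes(sample)
print(sizes["chrB"])       # 2500
print(sum(sizes.values())) # 3500
```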

The All SNPs files were downloaded from the UCSC Table Browser on August 13, 2019 for the following genome assemblies and dbSNP builds:

  • galGal5 (dbSNP Build 147)
  • hg19 (dbSNP Build 151)
  • hg38 (dbSNP Build 151)
  • mm9 (dbSNP Build 128)
  • mm10 (dbSNP Build 142)

The annotation data is made freely available under the UCSC Genome Browser terms of use. The UCSC Genome Browser states how to cite use of their data in a journal article publication.

University of Maryland School of Medicine, Institute of Genome Sciences

Disease Ontology

The Disease Ontology was developed as a project by the Institute of Genome Sciences at the University of Maryland School of Medicine. It “is a community driven, open source ontology that is designed to link disparate datasets through disease concepts”. It provides a “standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts”.

The data is made available under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

World Health Organization (WHO)

ATC Codes

Anatomical Therapeutic Chemical (ATC) is a hierarchical classification system for pharmacological substances. “In the ATC classification system, the active substances are classified in a hierarchy with five different levels. The system has fourteen main anatomical/pharmacological groups or 1st levels. Each ATC main group is divided into 2nd levels which could be either pharmacological or therapeutic groups. The 3rd and 4th levels are chemical, pharmacological or therapeutic subgroups and the 5th level is the chemical substance. The 2nd, 3rd and 4th levels are often used to identify pharmacological subgroups when that is considered more appropriate than therapeutic or chemical subgroups.”
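The five levels can be read directly off a full ATC code. The sketch below assumes the usual seven-character layout, in which the levels correspond to prefixes of length 1, 3, 4, 5, and 7; that layout is not spelled out in the quoted description, and the function name is hypothetical. The example uses A10BA02, the code commonly given for metformin.

```python
def atc_levels(code: str) -> dict[int, str]:
    """Split a full (5th-level) ATC code into its five hierarchical prefixes.

    Assumes the seven-character layout: letter, two digits, letter,
    letter, two digits.
    """
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {
        1: code[:1],  # anatomical/pharmacological main group
        2: code[:3],  # pharmacological or therapeutic group
        3: code[:4],  # chemical, pharmacological or therapeutic subgroup
        4: code[:5],  # chemical, pharmacological or therapeutic subgroup
        5: code,      # chemical substance
    }


levels = atc_levels("A10BA02")
for level, prefix in levels.items():
    print(level, prefix)
```

Because every level is a prefix of the full code, grouping substances at a coarser level only requires truncating the code to the matching length.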

Data made available under CC BY-NC-SA 3.0 IGO.