Biomedical

Encyclopedia of DNA Elements (ENCODE)

BED (Browser Extensible Data) Files

The ENCODE dataset contains information for approximately 7,000 experiments along with 14,000 BED files collected by The Encyclopedia of DNA Elements (ENCODE) Consortium. Examples of experiment metadata captured include the target biosample, assay type, and genome assembly. BED files link to individual BED lines, which state the genomic position of individual peaks. Data Commons ingested all experimental data in BED format.
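
As an illustration of how a BED line states a genomic position, the following minimal Python sketch reads the three required BED columns (chromosome, start, end). BED coordinates are 0-based and half-open, so the length of a peak is end minus start. The `BedInterval` type, the `parse_bed_line` helper, and the sample line are hypothetical and are not taken from ENCODE or from the Data Commons import code.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def parse_bed_line(line: str) -> BedInterval:
    """Parse the three required BED columns; any extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated fields, got {len(fields)}")
    return BedInterval(fields[0], int(fields[1]), int(fields[2]))


# Hypothetical peak line, made up for this example.
peak = parse_bed_line("chr1\t1000\t1500\tpeak_1\t850\n")
peak_length = peak.end - peak.start
```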

Data made available under: ENCODE Data Use Policy for External Users. This data was formatted for Data Commons through a collaboration with Dr. Anthony Oro’s group at Stanford University.

European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI)

ChEMBL

“ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.” It includes information on drugs at all stages of drug discovery.

This data is made available under the EMBL-EBI Terms of Use. This data was formatted for Data Commons in part through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

Genotype-Tissue Expression (GTEx)

The GTEx eGene and significant variant-gene association data were generated from samples “collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. Remaining samples are available from the GTEx Biobank.” The single-tissue cis-eQTL data from the v8 release was used. Due to the size of the datasets, only Skin - Not Sun Exposed and Skin - Sun Exposed are made available on the main graph. The data for all tissues can be accessed on the Biomedical Data Commons knowledge graph.

GTEx is an NIH unrestricted-access repository of human genomic data, and the data was made available in compliance with the GTEx Data Release and Publication Policy. GTEx outlines how to cite use of GTEx data in journal publications.

HUPO-PSI Working Groups and Outputs

The Molecular Interactions Controlled Vocabulary from the HUPO Proteomics Standards Initiative working groups is “a structured controlled vocabulary for the annotation of experiments concerned with protein-protein interactions”. The ontologies dictionary is represented in a tree structure in the EMBL-EBI Ontology Lookup Service. Data Commons includes three subsets of the ontologies: “interaction detection method”, “interaction type” and “database citation”, which are commonly used in protein-protein interactions.

Data made available under Apache License 2.0. The license information of HUPO PSI can be found at the Community Practice. See also the EBI terms of use.

Institute of Genome Sciences

The Disease Ontology

The Disease Ontology was developed as a project by the Institute of Genome Sciences at the University of Maryland School of Medicine. It “is a community driven, open source ontology that is designed to link disparate datasets through disease concepts”. It provides a “standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts”.

The data is made available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Data Commons includes the 3/7/19 update of the Disease Ontology. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

New York Botanical Garden (NYBG)

C. V. Starr Virtual Herbarium (Collaboration)

C. V. Starr Virtual Herbarium is a public specimen database with photos and detailed records about millions of plants, fungi, and algae.

Side Effect Resource (SIDER)

SIDER is a database of adverse drug reactions curated by the EMBL collaboration. “SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts. The available information include side effect frequency, drug and side effect classifications as well as links to further information, for example drug–target relations.” Data Commons hosts version 4.1 of SIDER, released on October 21, 2015.

This data is made available under the Creative Commons Attribution-Noncommercial-Share Alike 4.0 License. Information about citing SIDER can be found here. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

The Molecular INTeraction (MINT) Database

The MINT Database “focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators.”

MINT is part of the ELIXIR Core Data Resources, all of which are committed to open access. Any use of this database should cite:

Licata, Luana, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco et al. “MINT, the molecular interaction database: 2012 update.” Nucleic acids research 40, no. D1 (2012): D857-D861.

The Tissue Atlas

The Human Protein Tissue Atlas contains information about the distribution of proteins across human tissues, derived from antibody-based protein profiling of 44 normal human tissue types and mRNA expression data from 37 different normal tissue types.

This dataset is available under CC BY-SA 3.0. Please also see their Disclaimer and Licence & Citation.

U.S. Food and Drug Administration (FDA)

FDA-Approved Drugs

“Drugs@FDA includes information about drugs, including biological products, approved for human use in the United States.” Data Commons includes the information about the FDA application for the drug as well as the drug’s strength, active ingredients, dosage forms, administration routes, FDA therapeutic equivalence code, and marketing status.

Pharmacologic Class

The FDA established pharmacologic classes “associated with an approved indication of an active moiety that the FDA has determined to be scientifically valid and clinically meaningful”. This includes (1) the description of the pharmacologic class, (2) the active moiety code and description, and (3) the compounds associated with each class.

This data is made available through openFDA terms of service.

U.S. National Institutes of Health: National Center for Biotechnology Information

ClinVar

“ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.” It contains reports of genetic “variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data.” Data Commons includes the January 6, 2020 release of the ClinVar archive supporting both hg19 and hg38 genome assemblies.

Gene

The NIH NCBI gene info datasets from NCBI Gene for a subset of species contain “gene-specific content based on NCBI’s RefSeq project, information from model organism databases, and links to other resources.” The NCBI RefSeq project is “a comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein”. The datasets included are from the February 19, 2020 update. The gene info files for the following species have been added:

  • Caenorhabditis elegans
  • Danio rerio
  • Drosophila melanogaster
  • Gallus gallus
  • Homo sapiens
  • Mus musculus
  • Saccharomyces cerevisiae
  • Xenopus laevis

This data is from an NIH human genome unrestricted-access data repository and is made accessible under the NIH Genomic Data Sharing (GDS) Policy.

U.S. National Institutes of Health: National Library of Medicine

Medical Subject Headings (MeSH)

“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. Data Commons includes the Descriptor, Concept, and Term elements of MeSH as described here.

This data is from the National Library of Medicine (NLM) and is not subject to copyright and is freely reproducible as stated in the NLM’s copyright policy. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

UCSC Genomics Institute

Genome Browser

The UCSC Genome Browser originated from The Human Genome Project in 2000 to share and visualize genome data. It has grown to include an agglomeration of various genome assemblies and annotations. Data Commons includes data annotating chromosomes, genes, RNA transcripts, and genetic variants from the UCSC Genome Browser. The .chrom.sizes.txt files were downloaded from the UCSC Genome Browser Downloads page on August 13, 2019. The NCBI RefSeq files were downloaded from the UCSC Table Browser on August 2, 2019 for the following genome assemblies:

  • ce10
  • ce11
  • danRer10
  • danRer11
  • dm3
  • dm6
  • galGal5
  • galGal6
  • hg19
  • hg38
  • mm9
  • mm10
  • sacCer3
  • xenLae2

The All SNPs files were downloaded from the UCSC Table Browser on August 13, 2019 for the following genome assemblies and dbSNP builds:

  • galGal5 (dbSNP Build 147)
  • hg19 (dbSNP Build 151)
  • hg38 (dbSNP Build 151)
  • mm9 (dbSNP Build 128)
  • mm10 (dbSNP Build 142)
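
The .chrom.sizes.txt files mentioned above are plain two-column, tab-separated tables that give a sequence name and its length in bases. A minimal Python sketch of reading one into a dictionary is shown below. The `parse_chrom_sizes` helper and the sample names and lengths are hypothetical and do not come from any real assembly or from the Data Commons import code.

```python
def parse_chrom_sizes(text: str) -> dict[str, int]:
    """Map each sequence name to its length in bases."""
    sizes: dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, size = line.split("\t")[:2]
        sizes[name] = int(size)
    return sizes


# Made-up content in the chrom.sizes layout (name<TAB>length).
sample_text = "chrA\t1000\nchrB\t250\n"
chrom_sizes = parse_chrom_sizes(sample_text)
total_bases = sum(chrom_sizes.values())
```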

The annotation data is made freely available under the UCSC Genome Browser terms of use. The UCSC Genome Browser states how to cite use of their data in a journal article publication.

UniProt

Data Commons includes protein sequence and functional information, including protein interactions with chemical compounds, maintained by the UniProt Consortium.

UniProt Controlled Vocabulary of Species

UniProt’s Controlled Vocabulary of Species contains organism species UniProt identification codes, NCBI Taxonomy database identifiers, scientific names, common names, synonyms, and organism kingdoms.

The data is made available under the Creative Commons Attribution (CC BY 4.0) License. Further information on UniProt License and Disclaimer can be found here. The UniProt Consortium states how to cite UniProt data used in a journal article. This data was formatted for Data Commons in part through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

University of California San Francisco

SPOKE Disease Symptom Associations

These are statistical associations computed by applying Fisher’s exact test to the co-occurrence of disease and symptom terms in PubMed entries, as described in Himmelstein et al. (2017).
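
To illustrate the kind of test involved, the sketch below computes a Fisher's exact test p-value for a 2x2 table of co-occurrence counts using only the Python standard library. It is a generic textbook formulation, not the SPOKE or Himmelstein et al. pipeline. The one-sided (over-representation) alternative is an assumption made for this example, and the `fisher_exact_greater` helper and the counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for over-representation.

    The 2x2 table holds counts of entries:
                      symptom   no symptom
        disease          a          b
        no disease       c          d
    """
    row1 = a + b            # entries mentioning the disease
    col1 = a + c            # entries mentioning the symptom
    n = a + b + c + d       # all entries
    max_a = min(row1, col1)
    # Sum the hypergeometric probabilities of every table with at least
    # as many co-occurrences as observed, keeping the margins fixed.
    favorable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, max_a + 1)
    )
    return favorable / comb(n, col1)


# Made-up counts, not real PubMed data.
p_value = fisher_exact_greater(3, 1, 1, 3)  # 17/70, about 0.243
```

A small p-value would suggest that the disease and symptom terms co-occur more often than expected if they were independent.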

The data was previously hosted by the UCSF Scalable Precision Medicine Knowledge Engine (SPOKE). It was made available by the data’s owner, Sergio Baranzini, for use on Data Commons. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.