Prepare and load your own data
This page shows you how to format and load your own custom data into your local instance. This is step 2 of the recommended workflow.
- Overview
- Before you start: Identify your statistical variables
- Prepare the CSV files
- Write the data config file
- Load local custom data
Overview
Custom Data Commons provides a simple mechanism to import your own data, but it requires that the data be provided in a specific format and file structure. At a high level, you need to provide the following:
- All data must be in CSV format, using the schema described below.
- You must also provide a JSON configuration file, named config.json, that specifies how to map and resolve the CSV contents to the Data Commons schema knowledge graph. The contents of the JSON file are described below.
- All CSV files and the JSON file must be in the same directory.
Examples are provided in the custom_dc/sample and custom_dc/examples directories.
We strongly recommend that, before proceeding, you familiarize yourself with the basics of the Data Commons data model by reading through Key concepts, in particular, entities, statistical variables, and observations.
The following sections walk you through the process of setting up your data.
Before you start: Identify your statistical variables
Your data undoubtedly contains metrics and observed values. In Data Commons, the metrics themselves are known as statistical variables, and the time series data, or values over time, are known as observations. While observations are always numeric, statistical variables must be defined as nodes in the Data Commons knowledge graph.
Statistical variables must follow a certain model; in particular, each variable must represent a single combination of breakdown properties, and those properties are typically encoded in the variable's name. To explain what this means, consider the following example. Let's say your dataset contains the number of schools in U.S. cities, broken down by level (elementary, middle, secondary) and type (private, public), reported for each year (the numbers are not real, but are made up for the sake of example):
CITY | YEAR | SCHOOL_TYPE | SCHOOL_LEVEL | COUNT |
---|---|---|---|---|
San Francisco | 2023 | public | elementary | 300 |
San Francisco | 2023 | public | middle | 300 |
San Francisco | 2023 | public | secondary | 200 |
San Francisco | 2023 | private | elementary | 100 |
San Francisco | 2023 | private | middle | 100 |
San Francisco | 2023 | private | secondary | 50 |
San Jose | 2023 | public | elementary | 400 |
San Jose | 2023 | public | middle | 400 |
San Jose | 2023 | public | secondary | 300 |
San Jose | 2023 | private | elementary | 200 |
San Jose | 2023 | private | middle | 200 |
San Jose | 2023 | private | secondary | 100 |
Although the properties of school type and school level may already be defined in the Data Commons knowledge graph (or you may need to define them), they cannot be present as columns in the CSV files that you store in Data Commons. Instead, you must create separate “count” variables to represent each case. In our example, you would actually need 6 different variables:
Count_Public_Elementary
Count_Public_Middle
Count_Public_Secondary
Count_Private_Elementary
Count_Private_Middle
Count_Private_Secondary
If you wanted totals or subtotals of combinations, you would need to create additional variables for these as well.
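For example, if you also wanted per-type subtotals and an overall total, you might add hypothetical variables such as the following (the names are illustrative, not a required convention):
Count_Public
Count_Private
Count_School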
Prepare the CSV files
Custom Data Commons provides a simplified data model, which allows your data to be mapped to the Data Commons knowledge graph schema. Data in the CSV files should conform to a variable per column scheme. This requires minimal manual configuration; the Data Commons importer can create observations and statistical variables if they don’t already exist, and it resolves all columns to DCIDs.
With the variable-per-column scheme, data is provided in this format, in this exact sequence:
ENTITY, OBSERVATION_DATE, STATISTICAL_VARIABLE1, STATISTICAL_VARIABLE2, …
There are two columns, the ENTITY and the OBSERVATION_DATE, that specify the place and time of the observation; all other columns must be expressed as variables, as described above. To continue with the above example, a CSV file would need to look like this:
city,year,CountPublicElementary,CountPublicMiddle,CountPublicSecondary,CountPrivateElementary,CountPrivateMiddle,CountPrivateSecondary
San Francisco,2023,300,300,200,100,100,50
San Jose,2023,400,400,300,200,200,100
The ENTITY is an existing property in the Data Commons knowledge graph that is used to describe an entity, most commonly a place. The best way to think of the entity type is as a key that could be used to join to other data sets. The column heading can be expressed as any existing place-related property; see Place types for a full list. It may also be any of the special DCID prefixes listed in Special place names.
Note: The type of the entities in a single file should be unique; do not mix multiple entity types in the same CSV file. For example, if you have observations for cities and counties, put all the city data in one CSV file and all the county data in another one.
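For example, you might keep city and county observations in hypothetical files such as city_schools.csv and county_schools.csv, and give each file its own entity type in the data config described later on this page (the file and provenance names here are purely illustrative):
"inputFiles": {
  "city_schools.csv": {"entityType": "City", "provenance": "SchoolCounts"},
  "county_schools.csv": {"entityType": "County", "provenance": "SchoolCounts"}
}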
The OBSERVATION_DATE is the date of the observation and should be in the format YYYY, YYYY-MM, or YYYY-MM-DD. The heading can be anything, although as a best practice, we recommend using a corresponding identifier, such as year, month, or date.
Each STATISTICAL_VARIABLE column should contain a metric observation at a particular time. The variable values must be numeric. Zeros and null values are accepted: zeros are recorded and null values are ignored.
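For example, in the following sketch (with made-up values), the empty cell in the San Jose row is skipped rather than stored as zero:
city,year,CountPublicElementary
San Francisco,2023,300
San Jose,2023,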
Special place names
In addition to the place types listed in Place types, you can also use the following special names:
- dcid — An already resolved DCID. Examples: country/USA, geoId/06
- country3AlphaCode — Three-character country codes. Examples: USA, CHN
- geoId — Place geo IDs. Examples: 06, 023
- lat#lng — Latitude and longitude of the place, using the format lat#lng. Example: 38.7#-119.4
- wikidataId — Wikidata place identifiers. Example: Q12345
You can also simply use the heading name or place, and the importer will resolve it automatically.
The following are all valid examples of headers:
geoId,observationYear,statVar1,statVar2
06,2021,555,666
08,2021,10,10
name,observationYear,statVar1,statVar2
California,2021,555,666
Colorado,2021,10,10
dcid,observationYear,statVar1,statVar2
geoId/06,2021,555,666
geoId/08,2021,10,10
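Similarly, a header using three-character country codes might look like this (the values are made up):
country3AlphaCode,observationYear,statVar1,statVar2
USA,2021,555,666
CAN,2021,10,10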
Write the data config file
The config.json file specifies how the CSV contents should be mapped and resolved to the Data Commons schema. See the example in the sample/config.json file provided, which describes the data in the sample/average_annual_wage.csv and sample/gender_wage_gap.csv files.
Here is the general spec for the JSON file:
{ "inputFiles": { "FILE_NAME1": { "entityType": "ENTITY_PROPERTY", "ignoreColumns": ["COLUMN1", "COLUMN2", ...], "provenance": "NAME", "observationProperties" { "unit": "MEASUREMENT_UNIT", "observationPeriod": "OBSERVATION_PERIOD", "scalingFactor": "DENOMINATOR_VALUE", "measurementMethod": "METHOD" } }, "FILE_NAME2": { ... }, ... "variables": { "VARIABLE1": {"group": "GROUP_NAME1"}, "VARIABLE2": {"group": "GROUP_NAME1"}, "VARIABLE3": { "name": "DISPLAY_NAME", "description": "DESCRIPTION", "searchDescriptions": ["SENTENCE1", "SENTENCE2", ...], "group": "GROUP_NAME2", "properties": { "PROPERTY_NAME1":"VALUE", "PROPERTY_NAME2":"VALUE", … } }, }, "sources": { "SOURCE_NAME1": { "url": "URL", "provenances": { "PROVENANCE_NAME1": "URL", "PROVENANCE_NAME2": "URL", ... } } } }
Each section contains some required and optional fields, which are described in detail below.
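To make this concrete, here is a minimal hypothetical config for the schools example from earlier; the file, group, provenance, and source names are made up for illustration:
{
  "inputFiles": {
    "schools.csv": {
      "entityType": "City",
      "provenance": "SchoolCounts2023"
    }
  },
  "variables": {
    "CountPublicElementary": {"group": "Schools"},
    "CountPublicMiddle": {"group": "Schools"}
  },
  "sources": {
    "Example Education Department": {
      "url": "https://example.org",
      "provenances": {
        "SchoolCounts2023": "https://example.org/schools/2023"
      }
    }
  }
}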
Input files
The top-level inputFiles field should encode a map from the input file name to parameters specific to that file. Keys can be individual file names or wildcard patterns if the same config applies to multiple files.
You can use the * wildcard; matches are applied in the order in which they are specified in the config. For example, in the following:
{
"inputFiles": {
"foo.csv": {...},
"bar*.csv": {...},
"*.csv": {...}
}
}
The first set of parameters only applies to foo.csv. The second set of parameters applies to bar.csv, bar1.csv, bar2.csv, etc. The third set of parameters applies to all CSVs except the previously specified ones, namely foo.csv and bar*.csv.
Input file parameters
entityType
- Required: All entities in a given file must be of a specific type. This type should be specified as the value of the entityType field. The importer tries to resolve entities to DCIDs of that type. In most cases, the entityType will be a supported place type; see Place types for a list.
ignoreColumns
- Optional: The list of column names to be ignored by the importer, if any.
provenance
- Required: The provenance (name) of this input file. Provenances typically map to a dataset from a source. For example, the WorldDevelopmentIndicators provenance (or dataset) is from the WorldBank source. You must specify the provenance details under sources.provenances; this field associates one of the provenances defined there to this file.
observationProperties
- Optional: Additional information about each observation contained in the CSV file. Currently, four properties are supported:
  - unit: The unit of measurement used in the observations. This is a string representing a currency, area, weight, volume, etc. For example, SquareFoot, USD, Barrel.
  - observationPeriod: The period of time in which the observations were recorded. This must be in ISO duration format, namely P[0-9][Y|M|D|h|m|s]. For example, P1Y is 1 year, P3M is 3 months, P3h is 3 hours.
  - measurementMethod: The method used to gather the observations. This can be a random string or an existing DCID of MeasurementMethodEnum type; for example, EDA_Estimate or WorldBankEstimate.
  - scalingFactor: An integer representing the denominator used in measurements involving ratios or percentages. For example, for percentages, the denominator would be 100.
Note that you cannot mix different property values in a single CSV file. If you have observations using different properties, you must put them in separate CSV files.
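For instance, a hypothetical inputFiles entry with observation properties might look like this (the file name, provenance, and property values are all illustrative):
"inputFiles": {
  "unemployment_rate.csv": {
    "entityType": "Country",
    "provenance": "ExampleLaborStats",
    "observationProperties": {
      "unit": "Percent",
      "observationPeriod": "P1Y",
      "scalingFactor": "100"
    }
  }
}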
Variables
The variables section is optional. You can use it to override names and associate additional properties with the statistical variables in the files, using the parameters described below. All parameters are optional.
Variable parameters
name
- The display name of the variable, which will show up in the site's exploration tools. If not specified, the column name is used as the display name. The name should be concise and precise; that is, the shortest possible name that allows humans to uniquely identify a given variable. The name is used to generate NL embeddings.
description
- A long-form description of the variable.
properties
- Additional Data Commons properties associated with this variable. These are Data Commons property entities. See Representing statistics in Data Commons for more details.
Each property is specified as a key:value pair. Here are some examples:
{
"populationType": "schema:Person",
"measuredProperty": "age",
"statType": "medianValue",
"gender": "Female"
}
group
- You can arrange variables in groups, so that they appear together in the Statistical Variables Explorer and other exploration tools. The group name is used as the heading of the group. For example, in the sample data, the group name OECD is used to group together the two variables from the two CSV files. You can have a multi-level group hierarchy by using / as a separator between each group.
searchDescriptions
- An array of descriptions to be used for creating more NL embeddings for the variable. This is only needed if the variable name is not sufficient for generating embeddings.
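Putting these parameters together, a variables section might look like the following sketch; the display names, group path, and search description are illustrative, not taken from the sample config:
"variables": {
  "average_annual_wage": {
    "name": "Average annual wage",
    "group": "OECD/Wages",
    "searchDescriptions": ["Average yearly earnings per worker"]
  },
  "gender_wage_gap": {
    "name": "Gender wage gap",
    "group": "OECD/Wages"
  }
}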
Sources
The sources section is optional. It encodes the sources and provenances associated with the input dataset. Each named source is a mapping of provenances to URLs.
Source parameters
url
- Required: The URL of the named source. For example, for the named source U.S. Social Security Administration, it would be https://www.ssa.gov.
provenances
- Required: A set of name:URL pairs. Here are some examples:
{
"USA Top Baby Names 2022": "https://www.ssa.gov/oact/babynames/",
"USA Top Baby Names 1923-2022": "https://www.ssa.gov/oact/babynames/decades/century.html"
}
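Combined with the url parameter, a complete sources section using the names above would look like this:
"sources": {
  "U.S. Social Security Administration": {
    "url": "https://www.ssa.gov",
    "provenances": {
      "USA Top Baby Names 2022": "https://www.ssa.gov/oact/babynames/",
      "USA Top Baby Names 1923-2022": "https://www.ssa.gov/oact/babynames/decades/century.html"
    }
  }
}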
Load local custom data
The following procedures show you how to load and serve your custom data locally.
To load data in Google Cloud instead, see Load data in Google Cloud for procedures.
Configure environment variables
Edit the env.list file you created previously as follows:
- Set the INPUT_DIR variable to the directory where your input files are stored.
- Set the OUTPUT_DIR variable to the directory where you would like the output files to be stored. This can be the same as or different from the input directory. When you rerun the Docker data management container, it will create a datacommons subdirectory under this directory.
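For example, the relevant lines in env.list might look like this (the paths are illustrative; leave any other variables you set earlier unchanged):
INPUT_DIR=/Users/me/custom_dc/sample
OUTPUT_DIR=/Users/me/custom_dc/output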
Start the Docker containers with local custom data
Once you have configured everything, use the following commands to run the data management container and restart the services container, mapping your input and output directories to the same paths in Docker.
Step 1: Start the data management container
In one terminal window, from the root directory, run the following command to start the data management container:
docker run \
  --env-file $PWD/custom_dc/env.list \
  -v INPUT_DIRECTORY:INPUT_DIRECTORY \
  -v OUTPUT_DIRECTORY:OUTPUT_DIRECTORY \
  gcr.io/datcom-ci/datacommons-data:stable
(Optional) Start the data management container in schema update mode
If you have tried to start a container and received a SQL check failed error, this indicates that a database schema update is needed. Restart the data management container with the additional, optional flag DATA_RUN_MODE=schemaupdate. This mode updates the database schema without re-importing data or rebuilding natural language embeddings, and is the quickest way to resolve a SQL check failed error during services container startup.
To do so, add the following line to the above command:
docker run \
...
-e DATA_RUN_MODE=schemaupdate \
...
gcr.io/datcom-ci/datacommons-data:stable
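In full, the command looks like this:
docker run \
  --env-file $PWD/custom_dc/env.list \
  -v INPUT_DIRECTORY:INPUT_DIRECTORY \
  -v OUTPUT_DIRECTORY:OUTPUT_DIRECTORY \
  -e DATA_RUN_MODE=schemaupdate \
  gcr.io/datcom-ci/datacommons-data:stable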
Once the job has run, go to step 2 below.
Step 2: Start the services container
In another terminal window, from the root directory, run the following command to start the services container:
docker run -it \
  -p 8080:8080 \
  -e DEBUG=true \
  --env-file $PWD/custom_dc/env.list \
  -v INPUT_DIRECTORY:INPUT_DIRECTORY \
  -v OUTPUT_DIRECTORY:OUTPUT_DIRECTORY \
  gcr.io/datcom-ci/datacommons-services:stable
Any time you make changes to the CSV or JSON files and want to reload the data, you will need to rerun the data management container, and then restart the services container.
Inspect the SQLite database
If you need to troubleshoot custom data, it is helpful to inspect the contents of the generated SQLite database.
To do so, from a terminal window, open the database:
sqlite3 OUTPUT_DIRECTORY/datacommons/datacommons.db
This starts the interactive SQLite shell. To view a list of tables, at the prompt type .tables. The relevant table is observations.
At the prompt, enter SQL queries. For example, for the sample OECD data, this query:
sqlite> select * from observations limit 10;
returns output like this:
country/BEL|average_annual_wage|2000|54577.62735|c/p/1
country/BEL|average_annual_wage|2001|54743.96009|c/p/1
country/BEL|average_annual_wage|2002|56157.24355|c/p/1
country/BEL|average_annual_wage|2003|56491.99591|c/p/1
country/BEL|average_annual_wage|2004|56195.68432|c/p/1
country/BEL|average_annual_wage|2005|55662.21541|c/p/1
...
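The columns shown correspond to the observation's entity, variable, date, value, and provenance; this column naming is an assumption based on the output above, so confirm it with .schema observations. For example, a query like the following would count observations per variable:
sqlite> select variable, count(*) from observations group by variable;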
To exit the sqlite shell, press Ctrl-D.