Getting started
This page shows you how to run a local custom Data Commons instance inside Docker containers and load sample custom data from a local SQLite database. A custom Data Commons instance uses code from the public open-source repo, available at https://github.com/datacommonsorg/.
This is step 1 of the recommended workflow.
- System overview
- Prerequisites
- One-time setup steps
- About the downloaded files
- Look at the sample data
- Load sample data
- Start the services
- View the local website
- Send an API request
System overview
The instructions on this page use the following setup:
The “data management” Docker container consists of scripts that do the following:
- Convert custom CSV file data into SQL tables and store them in a data store – for now, in a local SQLite database
- Generate NL embeddings for custom data and store them – for now, in the local file system
The “services” Docker container consists of the following Data Commons components:
- An Nginx reverse proxy server, which routes incoming requests to the web or API server
- A Python-Flask web server, which handles interactive requests from users
- A Python-Flask NL server, for serving natural language queries
- A Go Mixer, also known as the API server, which serves programmatic requests using Data Commons APIs. The SQL query engine is built into the Mixer, which sends queries to both the local and remote data stores to find the right data. If the Mixer determines that it cannot fully resolve a user query from the custom data, it makes a REST API call, as an anonymous “user”, to the base Data Commons Mixer and data.
Prerequisites
- Obtain a GCP billing account and project.
- If you are developing on Windows, install WSL 2 (any distribution will do, but we recommend the default, Ubuntu), and enable WSL 2 integration with Docker.
- Install Docker Desktop/Engine.
- Install Git.
- Optional: Get a GitHub account, if you would like to browse the Data Commons source repos using your browser.
One-time setup steps
Get a Data Commons API key
An API key is required to authorize requests from your site to the base Data Commons site. API keys are managed by a self-serve portal. To obtain an API key, go to https://apikeys.datacommons.org and request a key for the api.datacommons.org domain.
Enable Google Cloud APIs and get a Maps API key
- Go to https://console.cloud.google.com/apis/dashboard for your project.
- Click Enable APIs & Services.
- Under Maps, enable Places API and Maps JavaScript API.
- Go to https://console.cloud.google.com/google/maps-apis/credentials for your project.
- Click Create Credentials > API Key.
- Record the key and click Close.
- Click on the newly created key to open the Edit API Key window.
- Under API restrictions, select Restrict key.
- From the drop-down menu, enable Places API and Maps JavaScript API. (Optionally enable other APIs for which you want to use this key.)
- Click OK and Save.
Clone the Data Commons repository
Note: If you are using WSL on Windows, open the Linux distribution app as your command shell. You must use the Linux-style file structure for Data Commons to work correctly.
- Open a terminal window, and go to a directory to which you would like to download the Data Commons repository.
- Clone the Data Commons website repository:
git clone https://github.com/datacommonsorg/website.git [DIRECTORY]
If you don’t specify a directory name, this creates a local website subdirectory. If you specify a directory name, all files are created under that directory, without a website subdirectory.
When the downloads are complete, navigate to the root directory of the repo (e.g. website). References to various files and commands in these procedures are relative to this root.
cd website
Set environment variables
- Using your favorite editor, open custom_dc/env.list.
- Enter the relevant values for DC_API_KEY and MAPS_API_KEY.
- Set the INPUT_DIR to the full path to the website/custom_dc/sample/ directory. For example, if you have cloned the repo directly to your home directory, this might be /home/USERNAME/website/custom_dc/sample/. (If you’re not sure, type pwd to get the working directory.)
- For the OUTPUT_DIR, set it to the same path as the INPUT_DIR.
Warning: Do not use any quotes (single or double) or spaces when specifying the values.
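As a quick sanity check of the warning above, the following sketch scans an env file for values that contain quotes or spaces. The function name check_env_list is hypothetical and not part of the Data Commons repo, and the check assumes that every non-comment line has the form KEY=VALUE.

```shell
# check_env_list FILE
# Hypothetical helper (not part of the Data Commons repo).
# Flags KEY=VALUE lines whose value contains a quote or a space.
check_env_list() {
  file="$1"
  # Drop comment lines, then keep lines with a quote or space after the "=".
  bad=$(grep -v '^[[:space:]]*#' "$file" \
    | grep -E "^[A-Za-z_][A-Za-z0-9_]*=.*[\"' ]" || true)
  if [ -n "$bad" ]; then
    echo "Values with quotes or spaces:"
    echo "$bad"
    return 1
  fi
  echo "OK"
}
```

From the root directory, you could run it as check_env_list custom_dc/env.list before starting any container.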
About the downloaded files
Directory/file | Description |
---|---|
custom_dc/sample/ | Sample supplemental data that is added to the base data in Data Commons. This page describes the model and format of this data and how you can load and view it. |
custom_dc/examples/ | More examples of custom data in CSV format and config.json. To configure your own custom data, see Work with custom data. |
server/templates/custom_dc/custom/ | Contains customizable HTML files. To modify these, see Customize HTML templates. |
static/custom_dc/custom/ | Contains a customizable CSS file and default logo. To modify the styles or replace the logo, see Customize Javascript and styles. |
custom_dc/env.list | Contains environment variables for locally run Data Commons data management and services containers. For details of the variables, see the comments in the file. |
Look at the sample data
Before you start up a Data Commons site, it’s important to understand the basics of the data model that is expected in a custom Data Commons instance. Let’s look at the sample data in the CSV files in the custom_dc/sample/ folder. This data is from the Organisation for Economic Co-operation and Development (OECD): “per country data for annual average wages” and “gender wage gaps”:
countryAlpha3Code | date | average_annual_wage |
---|---|---|
BEL | 2000 | 54577.62735 |
BEL | 2001 | 54743.96009 |
BEL | 2002 | 56157.24355 |
BEL | 2003 | 56491.99591 |
… | … | … |
countryAlpha3Code | date | gender_wage_gap |
---|---|---|
DNK | 2005 | 10.16733044 |
DNK | 2006 | 10.17206126 |
DNK | 2007 | 9.850297951 |
DNK | 2008 | 10.18354903 |
… | … | … |
There are a few important things to note:
- There are only 3 columns: one representing a place (countryAlpha3Code, a special Data Commons place type); one representing a date (date); and one representing a statistical variable, which is a Data Commons concept for a metric: average_annual_wage and gender_wage_gap. (Actually, there can be any number of statistical variable columns – but no other types of additional columns – and these two CSV files could be combined into one.)
- Every row is a separate observation, or a value of the variable for a given place and time. In the case of multiple statistical variable columns in the same file, each row would, of course, consist of multiple observations.
This is the format to which your data must conform if you want to take advantage of Data Commons’ simple import facility. If your data doesn’t follow this model, you’ll need to do some more work to prepare or configure it for correct loading. (That topic is discussed in detail in Preparing and loading your data.)
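Before loading your own files, a rough shape check along these lines may catch obvious problems. This is only a sketch: the function name check_csv is hypothetical, it splits naively on commas (so quoted fields containing commas are not handled), and it checks only the column count, not whether the place, date, and variable values are valid.

```shell
# check_csv FILE
# Hypothetical helper: checks that the header has at least 3 columns
# (place, date, one or more statistical variables) and that every row
# has the same number of fields as the header.
# Assumes plain comma-separated values with no quoted commas.
check_csv() {
  awk -F',' '
    NR == 1 {
      n = NF
      if (n < 3) { print "header has fewer than 3 columns"; bad = 1 }
      next
    }
    NF != n { print "line " NR ": expected " n " fields, found " NF; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

The function prints one message per problem and returns a non-zero status if it finds any, so it can be combined with other commands in a script.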
Load sample data
To load the sample data:
- If you are running on Windows or Mac, start Docker Desktop and ensure that the Docker Engine is running.
- Open a terminal window, and from the root directory, run the following command to run the data management Docker container:
docker run \
--env-file $PWD/custom_dc/env.list \
-v $PWD/custom_dc/sample:$PWD/custom_dc/sample \
gcr.io/datcom-ci/datacommons-data:stable
This does the following:
- The first time you run it, downloads the latest stable Data Commons data image, gcr.io/datcom-ci/datacommons-data:stable, from the Google Cloud Artifact Registry, which may take a few minutes. Subsequent runs use the locally stored image.
- Maps the input sample data to a Docker path.
- Starts a Docker container.
- Imports the data from the CSV files, resolves entities, and writes the data to a SQLite database file, custom_dc/sample/datacommons/datacommons.db.
- Generates embeddings in custom_dc/sample/datacommons/nl. (To learn more about embeddings generation, see the FAQ.)
Once the container has executed all the functions in the scripts, it shuts down.
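To confirm that the load produced the files listed above, you could check for them once the container exits. The sketch below assumes that OUTPUT_DIR is the directory you set in custom_dc/env.list; the function name check_load_output is hypothetical.

```shell
# check_load_output DIR
# Hypothetical helper: DIR is the OUTPUT_DIR from custom_dc/env.list.
# Reports whether the SQLite database file and the embeddings directory exist.
check_load_output() {
  out="${1%/}"   # drop a trailing slash, if any
  rc=0
  if [ ! -f "$out/datacommons/datacommons.db" ]; then
    echo "missing: $out/datacommons/datacommons.db"
    rc=1
  fi
  if [ ! -d "$out/datacommons/nl" ]; then
    echo "missing: $out/datacommons/nl"
    rc=1
  fi
  return "$rc"
}
```

With the settings used on this page, the call from the root directory would be check_load_output "$PWD/custom_dc/sample".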
Start the services
- Open a new terminal window.
- From the root directory, run the following command to start the services Docker container:
docker run -it \
-p 8080:8080 \
-e DEBUG=true \
--env-file $PWD/custom_dc/env.list \
-v $PWD/custom_dc/sample:$PWD/custom_dc/sample \
gcr.io/datcom-ci/datacommons-services:stable
Note: If you are running on Linux, depending on whether you have created a “sudoless” Docker group, you may need to preface every docker invocation with sudo.
This command does the following:
- The first time you run it, downloads the latest stable Data Commons image, gcr.io/datcom-ci/datacommons-services:stable, from the Google Cloud Artifact Registry, which may take a few minutes. Subsequent runs use the locally stored image.
- Starts a services Docker container.
- Starts development/debug versions of the Web Server, NL Server, and Mixer, as well as the Nginx proxy, inside the container.
- Maps the output sample data to a Docker path.
Stop and restart the services
If you need to restart the services for any reason, do the following:
- In the terminal window where the container is running, press Ctrl-c to kill the Docker container.
- Rerun the docker run command as described in Start the services.
Tip: If you close the terminal window in which you started the Docker services container, you can kill it as follows:
- Open another terminal window, and from the root directory, get the Docker container ID.
docker ps
The CONTAINER ID is the first column in the output.
- Run:
docker kill CONTAINER_ID
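The two steps in the tip above can be combined, as a sketch, by asking docker ps to print only the IDs of containers started from the services image. The function name kill_services_container is hypothetical; the -q and --filter ancestor= options are standard docker ps options.

```shell
# kill_services_container
# Hypothetical helper: kills every running container that was started
# from the services image used on this page.
kill_services_container() {
  image="gcr.io/datcom-ci/datacommons-services:stable"
  # -q prints only container IDs; the ancestor filter matches on the image.
  ids=$(docker ps -q --filter "ancestor=$image")
  if [ -z "$ids" ]; then
    echo "no running services container"
    return 0
  fi
  # $ids is intentionally unquoted so that each ID becomes its own argument.
  docker kill $ids
}
```

If your setup requires sudo for docker, the docker calls inside the function need the same prefix.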
View the local website
Once the services are up and running, visit your local instance by pointing your browser to http://localhost:8080. You should see something like this:
Now click the Timeline link to visit the Timeline explorer. Click Start, enter a country, and click Continue. Now, in the Select variables tool, you’ll see the new variables:
Select one (or both) and click Display to show the timeline graph:
To issue natural language queries, click the Search link. Try NL queries against the sample data you just loaded, e.g. “Average annual wages in Canada”.
Send an API request
A custom instance can accept REST API requests at the endpoint /core/api/v2/, which can access both the custom and base data. To try it out, here’s an example request you can make to your local instance that returns the same data as the interactive queries above, using the observation API. Try entering this query in your browser address bar:
http://localhost:8080/core/api/v2/observation?entity.dcids=country%2FCAN&select=entity&select=variable&select=value&select=date&variable.dcids=average_annual_wage
Note: You do not need to specify an API key as a query parameter.
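The same request can also be sent from a terminal. The sketch below builds the URL shown above from a place DCID and a variable DCID. The function name build_observation_url is hypothetical, and it percent-encodes only the "/" character, which is enough for DCIDs such as country/CAN but is not a general URL encoder.

```shell
# build_observation_url PLACE_DCID VARIABLE_DCID
# Hypothetical helper: prints the observation URL for the local instance.
# Only "/" is percent-encoded; other special characters are not handled.
build_observation_url() {
  base="http://localhost:8080/core/api/v2/observation"
  entity=$(printf '%s' "$1" | sed 's|/|%2F|g')
  variable=$(printf '%s' "$2" | sed 's|/|%2F|g')
  printf '%s?entity.dcids=%s&select=entity&select=variable&select=value&select=date&variable.dcids=%s\n' \
    "$base" "$entity" "$variable"
}
```

With the services container running, and assuming curl is installed, you could fetch the data with curl -s "$(build_observation_url country/CAN average_annual_wage)".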
In the browser, if you select Prettyprint, you should see output like this: