Getting started

This page shows you how to run a local custom Data Commons instance inside Docker containers and load sample custom data from a local SQLite database. A custom Data Commons instance uses code from the public open-source repo, available at https://github.com/datacommonsorg/.

This is step 1 of the recommended workflow.

System overview

The instructions on this page use the following setup:

[Figure: local setup]

The “data management” Docker container consists of scripts that do the following:

  • Convert custom CSV file data into SQL tables and store them in a data store – for now, in a local SQLite database
  • Generate NL embeddings for custom data and store them – for now, in the local file system

The “services” Docker container consists of the following Data Commons components:

  • An Nginx reverse proxy server, which routes incoming requests to the web or API server
  • A Python-Flask web server, which handles interactive requests from users
  • A Python-Flask NL server, which serves natural language queries
  • A Go Mixer, also known as the API server, which serves programmatic requests using Data Commons APIs. The SQL query engine is built into the Mixer, which sends queries to both the local and remote data stores to find the right data. If the Mixer determines that it cannot fully resolve a user query from the custom data, it makes a REST API call, as an anonymous “user”, to the base Data Commons Mixer and data.

Prerequisites

  • Obtain a GCP billing account and project.
  • If you are developing on Windows, install WSL 2 (any distribution will do, but we recommend the default, Ubuntu), and enable WSL 2 integration with Docker.
  • Install Docker Desktop/Engine.
  • Install Git.
  • Get an API key to authorize requests from your site to the base Data Commons, by filling out this form. Typical turnaround times are 24-48 hours.
  • Optional: Get a GitHub account, if you would like to browse the Data Commons source repos using your browser.

One-time setup steps

Enable Google Cloud APIs and get a Maps API key

  1. Go to https://console.cloud.google.com/apis/dashboard for your project.
  2. Click Enable APIs & Services.
  3. Under Maps, enable Places API and Maps JavaScript API.
  4. Go to https://console.cloud.google.com/google/maps-apis/credentials for your project.
  5. Click Create Credentials > API Key.
  6. Record the key and click Close.
  7. Click on the newly created key to open the Edit API Key window.
  8. Under API restrictions, select Restrict key.
  9. From the drop-down menu, enable Places API and Maps JavaScript API. (Optionally enable other APIs for which you want to use this key.)
  10. Click OK and Save.

Clone the Data Commons repository

Note: If you are using WSL on Windows, open the Linux distribution app as your command shell. You must use the Linux-style file structure for Data Commons to work correctly.

  1. Open a terminal window, and go to a directory to which you would like to download the Data Commons repository.
  2. Clone the Data Commons website repository:
   git clone https://github.com/datacommonsorg/website.git [DIRECTORY]
  

If you don’t specify a directory name, this creates a local website subdirectory. If you specify a directory name, all files are created under that directory, without a website subdirectory.

When the downloads are complete, navigate to the root directory of the repo (e.g. website). References to various files and commands in these procedures are relative to this root.

cd website

Set environment variables

  1. Using your favorite editor, open custom_dc/env.list.
  2. Enter the relevant values for DC_API_KEY and MAPS_API_KEY.
  3. Set INPUT_DIR to the full path of the website/custom_dc/sample/ directory. For example, if you have cloned the repo directly to your home directory, this might be /home/USERNAME/website/custom_dc/sample/. (If you’re not sure, type pwd to get the working directory.)
  4. Set OUTPUT_DIR to the same path as INPUT_DIR.

Warning: Do not use any quotes (single or double) or spaces when specifying the values.
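
The warning above can be checked mechanically before you start any container. The following is a minimal sketch, not part of the Data Commons tooling: the function name find_bad_env_values is hypothetical, and it assumes that env.list holds one KEY=VALUE pair per line with comments starting with #, as in a standard Docker env file.

```python
def find_bad_env_values(text):
    """Return (line_number, key) pairs whose value contains a quote or a space.

    Assumes one KEY=VALUE pair per line and '#' comment lines
    (the usual Docker env-file layout).
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # Not a KEY=VALUE line; nothing to check.
        # Quotes and spaces (including trailing ones) are passed through
        # literally by Docker, so flag them.
        if any(character in value for character in "\"' "):
            problems.append((number, key.strip()))
    return problems
```

For example, passing the contents of custom_dc/env.list (read with open(...).read()) should return an empty list when every value is free of quotes and spaces.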

About the downloaded files

Directory/file | Description
custom_dc/sample/ | Sample supplemental data that is added to the base data in Data Commons. This page describes the model and format of this data and how you can load and view it.
custom_dc/examples/ | More examples of custom data in CSV format and config.json. To configure your own custom data, see Work with custom data.
server/templates/custom_dc/custom/ | Contains customizable HTML files. To modify these, see Customize HTML templates.
static/custom_dc/custom/ | Contains a customizable CSS file and the default logo. To modify the styles or replace the logo, see Customize Javascript and styles.
custom_dc/env.list | Contains environment variables for locally run Data Commons data management and services containers. For details of the variables, see the comments in the file.

Look at the sample data

Before you start up a Data Commons site, it’s important to understand the basics of the data model that is expected in a custom Data Commons instance. Let’s look at the sample data in the CSV files in the custom_dc/sample/ folder. This data is from the Organisation for Economic Co-operation and Development (OECD): “per country data for annual average wages” and “gender wage gaps”:

countryAlpha3Code  date  average_annual_wage
BEL                2000  54577.62735
BEL                2001  54743.96009
BEL                2002  56157.24355
BEL                2003  56491.99591

countryAlpha3Code  date  gender_wage_gap
DNK                2005  10.16733044
DNK                2006  10.17206126
DNK                2007  9.850297951
DNK                2008  10.18354903

There are a few important things to note:

  • There are only 3 columns: one representing a place (countryAlpha3Code, a special Data Commons place type); one representing a date (date); and one representing a statistical variable, which is a Data Commons concept for a metric: average_annual_wage and gender_wage_gap. (Actually, there can be any number of statistical variable columns – but no other types of additional columns – and these two CSV files could be combined into one.)
  • Every row is a separate observation, or a value of the variable for a given place and time. In the case of multiple statistical variable columns in the same file, each row would, of course, consist of multiple observations.

This is the format to which your data must conform if you want to take advantage of Data Commons’ simple import facility. If your data doesn’t follow this model, you’ll need to do some more work to prepare or configure it for correct loading. (That topic is discussed in detail in Preparing and loading your data.)
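
To make the row-to-observation relationship concrete, here is a small sketch that splits a CSV in this format into individual observations. It is only an illustration of the data model described above, not the importer that Data Commons actually runs; the function name rows_to_observations and the dictionary keys are hypothetical.

```python
import csv
import io

def rows_to_observations(csv_text, place_column="countryAlpha3Code", date_column="date"):
    """Turn each (row, statistical variable column) pair into one observation."""
    observations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, raw_value in row.items():
            # Place and date identify the observation; they are not variables.
            if column is None or column in (place_column, date_column):
                continue
            # An empty cell means there is no observation for that variable.
            if raw_value is None or raw_value == "":
                continue
            observations.append({
                "place": row[place_column],
                "date": row[date_column],
                "variable": column,
                "value": float(raw_value),
            })
    return observations

# Two rows taken from the sample data shown above.
SAMPLE = """countryAlpha3Code,date,average_annual_wage
BEL,2000,54577.62735
BEL,2001,54743.96009
"""

for observation in rows_to_observations(SAMPLE):
    print(observation)
```

With a single variable column, every row yields exactly one observation. With several variable columns, each non-empty cell yields its own observation.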

Load sample data

To load the sample data:

  1. If you are running on Windows or Mac, start Docker Desktop and ensure that the Docker Engine is running.
  2. Open a terminal window, and from the root directory, run the following command to run the data management Docker container:
  docker run \
  --env-file $PWD/custom_dc/env.list \
  -v $PWD/custom_dc/sample:$PWD/custom_dc/sample  \
  gcr.io/datcom-ci/datacommons-data:stable

This does the following:

  • The first time you run it, downloads the latest stable Data Commons data image, gcr.io/datcom-ci/datacommons-data:stable, from the Google Cloud Artifact Registry, which may take a few minutes. Subsequent runs use the locally stored image.
  • Maps the input sample data to a Docker path.
  • Starts a Docker container.
  • Imports the data from the CSV files, resolves entities, and writes the data to a SQLite database file, custom_dc/sample/datacommons/datacommons.db.
  • Generates embeddings in custom_dc/sample/datacommons/nl. (To learn more about embeddings generation, see the FAQ.)

Once the container has executed all the functions in the scripts, it shuts down.
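
If you want to confirm that the import produced data, one option is to inspect the generated SQLite file directly. The sketch below makes no assumption about the table names the import scripts create; it simply lists whatever tables exist and counts their rows. The helper name table_row_counts is hypothetical.

```python
import os
import sqlite3

def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # sqlite3.connect() would silently create an empty file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    connection = sqlite3.connect(db_path)
    try:
        # Table names are discovered at run time rather than assumed.
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```

From the root directory of the repo, calling table_row_counts("custom_dc/sample/datacommons/datacommons.db") after the container exits should report at least one table with a non-zero row count if the load succeeded.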

Start the services

  1. Open a new terminal window.
  2. From the root directory, run the following command to start the services Docker container:
  docker run -it \
  -p 8080:8080 \
  -e DEBUG=true \
  --env-file $PWD/custom_dc/env.list \
  -v $PWD/custom_dc/sample:$PWD/custom_dc/sample \
  gcr.io/datcom-ci/datacommons-services:stable

Note: If you are running on Linux, depending on whether you have created a “sudoless” Docker group, you may need to preface every docker invocation with sudo.

This command does the following:

  • The first time you run it, downloads the latest stable Data Commons image, gcr.io/datcom-ci/datacommons-services:stable, from the Google Cloud Artifact Registry, which may take a few minutes. Subsequent runs use the locally stored image.
  • Starts a services Docker container.
  • Starts development/debug versions of the Web Server, NL Server, and Mixer, as well as the Nginx proxy, inside the container.
  • Maps the output sample data to a Docker path.

Stop and restart the services

If you need to restart the services for any reason, do the following:

  1. In the terminal window where the container is running, press Ctrl-c to kill the Docker container.
  2. Rerun the docker run command as described in Start the services.

Tip: If you close the terminal window in which you started the Docker services container, you can kill it as follows:

  1. Open another terminal window, and from the root directory, get the Docker container ID.
  docker ps

The CONTAINER ID is the first column in the output.

  2. Run:
  docker kill CONTAINER_ID
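
The two steps in this tip can also be scripted. The following is a rough sketch that assumes the docker CLI is on your PATH and that the container was started from the services image named in Start the services. The function names are hypothetical; the --filter ancestor=... and --format options are standard docker ps options.

```python
import subprocess

SERVICES_IMAGE = "gcr.io/datcom-ci/datacommons-services:stable"

def parse_container_ids(ps_output):
    """Extract container IDs from `docker ps --format {{.ID}}` output (one ID per line)."""
    return [line.strip() for line in ps_output.splitlines() if line.strip()]

def running_container_ids(image=SERVICES_IMAGE):
    """List IDs of running containers that were created from the given image."""
    result = subprocess.run(
        ["docker", "ps", "--filter", f"ancestor={image}", "--format", "{{.ID}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_container_ids(result.stdout)

def kill_containers(container_ids):
    """Run `docker kill` for each container ID."""
    for container_id in container_ids:
        subprocess.run(["docker", "kill", container_id], check=True)
```

Calling kill_containers(running_container_ids()) stops every running container created from the services image, so use it only when that is what you intend. On Linux without a “sudoless” Docker group, the script itself would need to run with sudo.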
	

View the local website

Once the services are up and running, visit your local instance by pointing your browser to http://localhost:8080. You should see something like this:

[Screenshot: home page]

Now click the Timeline link to visit the Timeline explorer. Click Start, enter a country, and click Continue. In the Select variables tool, you’ll see the new variables:

[Screenshot: timeline]

Select one (or both) and click Display to show the timeline graph:

[Screenshot: display]

To issue natural language queries, click the Search link. Try NL queries against the sample data you just loaded, e.g. “Average annual wages in Canada”.

[Screenshot: search]

Send an API request

A custom instance can accept REST API requests at the endpoint /core/api/v2/, which can access both the custom and base data. To try it out, here’s an example request you can make to your local instance that returns the same data as the interactive queries above, using the observation API. Try entering this query in your browser address bar:

http://localhost:8080/core/api/v2/observation?entity.dcids=country%2FCAN&select=entity&select=variable&select=value&select=date&variable.dcids=average_annual_wage

Note: You do not need to specify an API key as a query parameter.

If you select Prettyprint, you should see output like this:

[Screenshot: API call]
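
The same request can be sent from a script instead of the browser. The sketch below builds the query string shown above with the Python standard library and, optionally, fetches the response. The function names are hypothetical, and the code only assumes that the endpoint returns JSON; it does not rely on any particular response structure.

```python
import json
import urllib.parse
import urllib.request

def build_observation_url(entity_dcid, variable_dcid, base_url="http://localhost:8080"):
    """Build an observation API URL for one entity and one variable."""
    # A list of pairs keeps the repeated `select` parameters and their order.
    parameters = [
        ("entity.dcids", entity_dcid),
        ("select", "entity"),
        ("select", "variable"),
        ("select", "value"),
        ("select", "date"),
        ("variable.dcids", variable_dcid),
    ]
    return f"{base_url}/core/api/v2/observation?{urllib.parse.urlencode(parameters)}"

def fetch_observations(entity_dcid, variable_dcid, base_url="http://localhost:8080"):
    """Send the request to a running instance and return the decoded JSON body."""
    url = build_observation_url(entity_dcid, variable_dcid, base_url)
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

With the services container running, json.dumps(fetch_observations("country/CAN", "average_annual_wage"), indent=2) produces roughly what the browser shows with pretty-printing enabled. No API key parameter is added, matching the note above.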