Load data in Google Cloud

This page shows you how to store your custom data in Google Cloud, and create the data management container as a Google Cloud Run job. This is step 4 of the recommended workflow.

Overview

Once you have tested locally, the next step is to get your data into the Google Cloud Platform. You upload your CSV and JSON files to Google Cloud Storage, and run the Data Commons data management Docker container as a Cloud Run job. The job will transform and store the data in a Google Cloud SQL database, and generate NL embeddings stored in Cloud Storage.

data management setup

Alternatively, if you have a very large data set, you may find it faster to store your input files and run the data management container locally, and output the data to Google Cloud Storage. If you would like to use this approach, follow steps 1 to 3 of the one-time setup steps below and then skip to Run the data management container locally.

Prerequisites

  • A GCP billing account and project.
  • Optional: Install the gcloud CLI.

One-time setup steps

Step 1: Choose a location

While you are testing, you can start with a single Google Cloud region; to be close to the base Data Commons data, you can use us-central1. However, once you launch, you may want to host your data and application closer to where your users will be. In any case, you should use the same region for your Google Cloud SQL instance, the Google Cloud Storage buckets, and the Google Cloud Run service where you will host the site. For a list of supported regions, see Cloud SQL Manage instance locations.

Step 2: Create a Google Cloud Storage bucket

This stores the CSV and JSON files that you will upload whenever your data changes. It also stores generated files files in a datacommons subdirectory when you run the data management job.

  1. Go to https://console.cloud.google.com/storage/browser for your project.
  2. Next to Buckets, click Create.
  3. Enter a name for this bucket.
  4. For the Location type, choose the same regional options as for Cloud SQL above.
  5. When you have finished setting all the configuration options, click Create.
  6. In the Bucket Details page, click Create Folder to create a new folder to hold your data and name it as desired.
  7. Optionally, create separate folders to hold input and output files, or just use the same one as for the input.

    Note: If you plan to run the data management container locally, you only need to create a single folder to hold the output files.

  8. Record the folder path(s) as gs://BUCKET_NAME/FOLDER_PATH for setting the INPUT_DIR and OUTPUT_DIR environment variables below.

Step 3: Create a Google Cloud SQL instance

This stores the data that will be served at run time. The Data Commons data management job will create the SQL tables and populate them when you start the job.

  1. Go to https://console.cloud.google.com/sql/instances for your project.
  2. Next to Instances, click Create Instance.
  3. Click Choose MySQL.
  4. If necessary, enable APIs as directed.
  5. Set an instance ID. Record the instance connection name in the form of INSTANCE_ID for setting environment variables below.
  6. Set a root password, and record it for setting environment variables below.
  7. For the Location type, choose the relevant regional option.
  8. When you have finished setting all the configuration options, click Create Instance. It may take several minutes for the instance to be created.
  9. When the instance is created and the left navigation bar appears, select Users.
  10. Add at least one user and password.
  11. Select Databases.
  12. Click Create Database.
  13. Choose a name for the database or use the default, datacommons.
  14. Click Create.
  15. In the Overview page for the new instance, record the Connection name to set in environment variables in the next step.

Step 4: Create a Google Cloud Run job

Since you won’t need to customize the data management container, you can simply run an instance of the released container provided by Data Commons team, at https://console.cloud.google.com/gcr/images/datcom-ci/global/datacommons-data.

See also the Cloud Run documentation for links to more information on all the options you may set on your jobs.

  1. Go to https://console.cloud.google.com/run/ for your project.
  2. Click Create job.
  3. In the Container image URL field, enter gcr.io/datcom-ci/datacommons-data:stable.
  4. Optionally, in the Job name field, enter an alternative name as desired.
  5. In the Region field, select the region you chose as your location.
  6. Leave the default Number of tasks as 1.
  7. Expand Container, Volumes, Connections, Security and expand Settings, and set the following options:
    • Resources > Memory: 8 GiB
    • Resources > CPU: 2

    Cloud Run job

Now set environment variables:

  1. Click the Variables and Secrets tab.
  2. Click Add variable.
  3. Add names and values for the following environment variables:
    • USE_CLOUDSQL: Set to true.
    • DC_API_KEY: Set to your API key.
    • INPUT_DIR: Set to the Cloud Storage bucket and input folder that you created in step 2 above.
    • OUTPUT_DIR: Set to the Cloud Storage bucket (and, optionally, output folder) that you created in step 2 above. If you didn’t create a separate folder for output, specify the same folder as the INPUT_DIR.
    • CLOUDSQL_INSTANCE: Set to the full connection name of the instance you created in step 3 above.
    • DB_USER: Set to a user you configured when you created the instance in step 3, or to root if you didn’t create a new user.
    • DB_PASS: Set to the user’s or root password you configured when you created the instance in step 3.
    • DB_NAME: Only set this if you configured the database name to something other than datacommons.
  4. When you finished, click Done.

    Cloud Run job

  5. If you have a large amount of data, adjust the Task capacity > Task timeout option accordingly. See Set task timeout (jobs) for more details.
  6. Click Create (but don’t run it immediately).

Manage your data

Step 1: Upload data files to Google Cloud Storage

  1. Go to https://console.cloud.google.com/storage/browse and select your custom Data Commons bucket.
  2. Navigate to the folder you created in the earlier step.
  3. Click Upload Files, and select all your CSV files and config.json.

Note: Do not upload the local datacommons subdirectory or its files.

As you are iterating on changes to the source CSV and JSON files, you can re-upload them at any time, either overwriting existing files or creating new folders. To load them into Cloud SQL, you run the Cloud Run job you created above.

Step 2: Start the data management Cloud Run job

Now that everything is configured, and you have uploaded your data in Google Cloud Storage, you simply have to start the Cloud Run data management job to convert the CSV data into tables in the Cloud SQL database and generate the embeddings (in a datacommons/nl subfolder).

Every time you upload new input CSV or JSON files to Google Cloud Storage, you will need to rerun the job.

To run the job:

  1. Go to https://console.cloud.google.com/run/jobs for your project.
  2. From the list of jobs, click the link of the “datacommons-data” job you created above.
  3. Click Execute. It will take several minutes for the job to run. You can click the Logs tab to view the progress.

When it completes, to verify that the data has been loaded correctly, see the next step.

Inspect the Cloud SQL database

To view information about the created tables:

  1. Go to https://console.cloud.google.com/sql/instances for your project and select the instance you created earlier.
  2. In the left panel, select Cloud SQL Studio.
  3. In the Sign in to SQL Studio page, from the Database field, select the database you created earlier, e.g. datacommons.
  4. Enter the user name and password and click Authenticate.
  5. In the left Explorer pane that appears, expand the Databases icon, your database name, and Tables. The table of interest is observations. You can see column names and other metadata.
  6. To view the actual data, in the main window, click New SQL Editor tab. This opens an environment in which you can enter and run SQL queries.
  7. Enter a query and click Run. For example, for the sample OECD data, if you do select * from observations limit 10;, you should see output like this:

    screenshot_sqlite

Advanced setup (optional): Run the data management container locally

This process is similar to running both data management and services containers locally, with a few exceptions:

  • Your input directory will be the local file system, while the output directory will be a Google Cloud Storage bucket and folder.
  • You must start the job with credentials to be passed to Google Cloud, to access the Cloud SQL instance.

Before you proceed, ensure you have completed steps 1 to 3 of the One-time setup steps above.

Step 1: Set environment variables

To run a local instance of the services container, you need to set all the environment variables in the custom_dc/env.list file. See above for the details, with the following differences:

  • For the INPUT_DIR, specify the full local path where your CSV and JSON files are stored, as described in the Getting started.
  • Set GOOGLE_CLOUD_PROJECT to your GCP project name.

Step 2: Generate credentials for Google Cloud authentication

For the services to connect to the Cloud SQL instance, you need to generate credentials that can be used in the local Docker container for authentication. You should refresh the credentials every time you rerun the Docker container.

Open a terminal window and run the following command:

gcloud auth application-default login

This opens a browser window that prompts you to enter credentials, sign in to Google Auth Library and allow Google Auth Library to access your account. Accept the prompts. When it has completed, a credential JSON file is created in
$HOME/.config/gcloud/application_default_credentials.json. Use this in the command below to authenticate from the docker container.

The first time you run it, may be prompted to specify a quota project for billing that will be used in the credentials file. If so, run this command:

  
gcloud auth application-default set-quota-project PROJECT_ID  

If you are prompted to install the Cloud Resource Manager API, press y to accept.

Step 3: Run the Docker container

From your project root directory, run:

docker run \
--env-file $PWD/custom_dc/env.list \
-v INPUT_DIRECTORY:INPUT_DIRECTORY \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json \
-v $HOME/.config/gcloud/application_default_credentials.json:/gcp/creds.json:ro \
gcr.io/datcom-ci/datacommons-data:VERSION

The version is latest or stable.

To verify that the data is correctly created in your Cloud SQL database, use the procedure in Inspect the Cloud SQL database above.

Advanced setup (optional): Access Cloud data from a local services container

For testing purposes, if you wish to run the services Docker container locally but access the data in Google Cloud, use the following procedures.

Step 1: Set environment variables

To run a local instance of the services container, you will need to set all the environment variables, as described above in the custom_dc/env.list. You must also set the MAPS_API_KEY to your Maps API key.

Step 2: Generate credentials for Google Cloud default application

See the section above for procedures.

Step 3: Run the Docker container

From the root directory of your repo, run the following command, assuming you are using a locally built image:

  
docker run -it \
--env-file $PWD/custom_dc/env.list \
-p 8080:8080 \
-e DEBUG=true \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json \
-v $HOME/.config/gcloud/application_default_credentials.json:/gcp/creds.json:ro \
-v OUTPUT_DIRECTORY:OUTPUT_DIRECTORY \
[-v $PWD/server/templates/custom_dc/custom:/workspace/server/templates/custom_dc/custom \]
[-v $PWD/static/custom_dc/custom:/workspace/static/custom_dc/custom \]
IMAGE_NAME:IMAGE_TAG