Test data in Google Cloud
This page shows you how to store your custom data in Google Cloud database and load it into a local instance. This is step 4 of the recommended workflow.
- Overview
- Prerequisites
- One-time setup steps
- Upload data files to Google Cloud Storage
- Point the local Data Commons site to the Cloud data
- Inspect the Cloud SQL database
Overview
Once you have tested locally, you need to get your data into Google Cloud so you can test it remotely. You can continue to run the custom Data Commons instance locally, but retrieve data from the Cloud. In this scenario, the system is set up like this:
You will upload your CSV and JSON files to Google Cloud Storage, and the custom Data Commons importer will transform, store, and query the data in a Google Cloud SQL database.
Prerequisites
- A GCP billing account and project.
- Install the gcloud CLI.
One-time setup steps
Choose a location
While you are testing, you can start with a single Google Cloud region; to be close to the base Data Commons data, you can use us-central1
. However, once you launch, you may want to host your data and application closer to where your users will be. In any case, you should use the same region for your Google Cloud SQL instance, the Google Cloud Storage buckets, and the Google Cloud Run service where you will host the site. For a list of supported regions, see Cloud SQL Manage instance locations.
Create a Google Cloud SQL instance
- Go to https://console.cloud.google.com/sql/instances for your project.
- Next to Instances, click Create Instance.
- Click Choose MySQL.
- If necessary, enable APIs as directed.
- Set an instance ID. Record the instance connection name in the form of
INSTANCE_ID
for setting environment variables below. - Set a root password, and record it for setting environment variables below.
- For the Location type, choose the relevant regional option.
- When you have finished setting all the configuration options, click Create Instance. It may take several minutes for the instance to be created.
- When the instance is created and the left navigation bar appears, select Users.
- Add at least one user and password.
- Select Databases.
- Click Create Database.
- Choose a name for the database or use the default,
datacommons
. - Click Create.
Create a Google Cloud Storage bucket
- Go to https://console.cloud.google.com/storage/browser for your project.
- Next to Buckets, click Create.
- Enter a name for this bucket.
- For the Location type, choose the same regional options as for Cloud SQL above.
- When you have finished setting all the configuration options, click Create.
- In the Bucket Details page, click Create Folder to create a new folder to hold your data.
- Name the folder as desired. Record the folder path as
gs://BUCKET_NAME/FOLDER_PATH
for setting theOUTPUT_DIR
environment variable below.
Set environment variables
- Using your favorite editor, open
custom_dc/env.list
. - Set
USE_SQLITE=false
andUSE_CLOUDSQL=true
-
Set values for all of the following:
CLOUDSQL_INSTANCE
GOOGLE_CLOUD_PROJECT
DB_NAME
DB_USER
DB_PASS
OUTPUT_DIR
See comments in the
env.list
file for the correct format for each option. - Optionally, set an
ADMIN_SECRET
to use when loading the data through the/admin
page later.
Warning: Do not use any quotes (single or double) or spaces when specifying the values.
Upload data files to Google Cloud Storage
- Go to https://console.cloud.google.com/storage/browse and select your custom Data Commons bucket.
- Navigate to the folder you created in the earlier step.
- Click Upload Files, and select all your CSV files and
config.json
.
Note: Do not upload the local datacommons
subdirectory or its files.
As you are iterating on changes to the source CSV and JSON files, you can re-upload them at any time, either overwriting existing files or creating new folders. To load them into Google SQL, follow the procedures below.
Point the local Data Commons site to the Cloud data
Generate credentials for Google Cloud default application
Before you can connect to the Cloud SQL instance, you need to generate credentials that can be used in the local Docker container. You should refresh the credentials every time you restart the Docker container.
Open a terminal window and run the following command:
gcloud auth application-default login
This opens a browser window that prompts you to enter credentials, sign in to Google Auth Library and allow Google Auth Library to access your account. Accept the prompts. When it has completed, a credential JSON file is created in
$HOME/.config/gcloud/application_default_credentials.json
. Use this in the command below to authenticate from the docker container.
If you are prompted to specify a quota project for billing that will be used in the credentials file, run this command:
gcloud auth application-default set-quota-project PROJECT_ID
If you are prompted to install the Cloud Resource Manager API, press y
to accept.
Start the Docker container with Cloud data
Run with a prebuilt image
If you have not made changes that require a local build, and just want to run the pre-downloaded image, from your repository root, run:
docker run -it \ --env-file $PWD/custom_dc/env.list \ -p 8080:8080 \ -e DEBUG=true \ -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json \ -v $HOME/.config/gcloud/application_default_credentials.json:/gcp/creds.json:ro \ gcr.io/datcom-ci/datacommons-website-compose:stable
Run with a locally built repo
If you have made local changes and have a locally built repo, from the root of the repository, run the following:
docker run -it \ --env-file $PWD/custom_dc/env.list \ -p 8080:8080 \ -e DEBUG=true \ -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json \ -v $HOME/.config/gcloud/application_default_credentials.json:/gcp/creds.json:ro \ datacommons-website-compose:DOCKER_TAG
DOCKER_TAG
is the tag you specified when you built the repo.
Load custom data in Cloud SQL
Each time you upload new versions of the source CSV and JSON files, you need to load the new/updated data into Google Cloud SQL. Custom Data Commons allows you to reload data on the fly, while the website is running, so even multiple users can reload data.
You can load the new/updated data from Cloud Storage using the /admin
page on the site:
- Optionally, in the
env.list
file, set theADMIN_SECRET
environment variable to a string that authorizes users to load data. - Start the Docker container as described above.
- With the services running, navigate to the
/admin
page. If a secret is required, enter it in the text field, and click Load. This runs a script inside the Docker container, that converts the CSV data in Cloud Storage into SQL tables, and stores them in the Cloud SQL database you created earlier. It also generates embeddings in the Google Cloud Storage folder into which you uploaded the CSV/JSON files, in adatacommons/nl/
subfolder.
Inspect the Cloud SQL database
To view information about the created tables:
- Go to https://console.cloud.google.com/sql/instances for your project and select the instance you created earlier.
- In the left panel, select Cloud SQL Studio.
- In the Sign in to SQL Studio page, from the Database field, select the database you created earlier, e.g.
datacommons
. - Enter the user name and password and click Authenticate.
- In the left Explorer pane that appears, expand the Databases icon, your database name, and Tables. The table of interest is
observations
. You can see column names and other metadata. - To view the actual data, in the main window, click New SQL Editor tab. This opens an environment in which you can enter and run SQL queries.
-
Enter a query and click Run. For example, for the sample OECD data, if you do
select * from observations limit 10;
, you should see output like this: