Accessing Metadata via TDR
The Terra Data Repository (TDR) offers a complete and authoritative source for the metadata made available by the HCA Data Portal Data Browser. This guide will walk you through the process of accessing metadata stored within TDR.
For detailed information regarding the structure of the data within TDR, see the Terra Data Repository schema.
Finding a Catalog's Sources
Each project in the HCA Data Portal Data Browser belongs to a catalog. Each catalog consists of one or more sources, and each source is composed of properties. These properties include the name of the Google Cloud project and the name of the TDR snapshot.
A catalog's sources can be viewed via the Data Browser API /index/catalogs
endpoint.
Using curl
to make the request, and jq
to parse the response, the first step is to identify the current default catalog.
$ curl -s https://service.azul.data.humancellatlas.org/index/catalogs | jq '.default_catalog'
"dcp11"
With the name of the catalog we can now parse out the catalog's source(s) from the endpoint response.
$ curl -s https://service.azul.data.humancellatlas.org/index/catalogs | jq '.catalogs.dcp11.plugins.repository.sources'
[
"tdr:tdr-fp-c315dee1:snapshot/hca_prod_20201120_dcp2___20211101_dcp11:/2"
]
In this example, there is one source for the dcp11
catalog. The name of the
project (tdr-fp-c315dee1
) and name of the snapshot (hca_prod_20201120_dcp2___20211101_dcp11
) can be extracted from the source following the syntax below.
"tdr:{PROJECT_NAME}:snapshot/{SNAPSHOT_NAME}:{prefix}"
Running BigQuery SQL Queries
Access to the metadata in TDR can be accomplished using Google BigQuery. Queries can be run via the Cloud Console or by using the command-line tool bq
.
Example queries
Replace the following placeholders in the query with the appropriate value prior to use:
{GC_PROJECT}
: The name of the Google Cloud project.{SNAPSHOT}
: The name of the snapshot.{PROJECT_ID}
: The ID of a project in the HCA Data Portal Data Browser.
To query for all subgraphs (links) in one project:
SELECT links_id, content
FROM `{PROJECT}.{SNAPSHOT}.links`
WHERE project_id = '{PROJECT_ID}'
To query for all specimen from organism(*) entities in one project:
(*) Note: This query can be adapted for other types of biomaterials such as cell
suspensions or donor organisms by replacing the values of the table name
specimen_from_organism
and ID field specimen_from_organism_id
.
WITH contents AS (
SELECT content
FROM `{GC_PROJECT}.{SNAPSHOT}.links` AS links,
UNNEST(JSON_EXTRACT_ARRAY(links.content, '$.links')) AS content
WHERE links.project_id = '{PROJECT_ID}'
)
SELECT entity.specimen_from_organism_id, entity.content
FROM `{GC_PROJECT}.{SNAPSHOT}.specimen_from_organism` AS entity
WHERE entity.specimen_from_organism_id in (
SELECT JSON_VALUE(input, '$.input_id') AS id
FROM contents, UNNEST(JSON_EXTRACT_ARRAY(contents.content, '$.inputs')) AS input
UNION ALL
SELECT JSON_VALUE(output, '$.output_id') AS id
FROM contents, UNNEST(JSON_EXTRACT_ARRAY(contents.content, '$.outputs')) AS output
)
To query for all analysis protocol(*) entities in one project:
(*) Note: This query can be adapted for other types of protocols such as library
preparation protocols or imaging protocols by replacing the values of the table
name analysis_protocol
and ID field analysis_protocol_id
.
WITH contents AS (
SELECT content
FROM `{GC_PROJECT}.{SNAPSHOT}.links` AS links,
UNNEST(JSON_EXTRACT_ARRAY(links.content, '$.links')) AS content
WHERE links.project_id = '{PROJET_ID}'
)
SELECT entity.analysis_protocol_id, entity.content
FROM `{GC_PROJECT}.{SNAPSHOT}.analysis_protocol` AS entity
WHERE entity.analysis_protocol_id in (
SELECT JSON_VALUE(protocol, '$.protocol_id') AS id
FROM contents, UNNEST(JSON_EXTRACT_ARRAY(contents.content, '$.protocols')) AS protocol
)