User data

All variants must be listed with both Hugo Symbol (aka HGNC symbol) and Entrez ID for each gene. Entrez ID refers to the 'Gene ID' from the NCBI Gene database.

You might want to try my automated symbol matching script on GitHub, which will attempt to determine up-to-date identifiers from symbol and chromosome: http://github.com/sggaffney/gene_matcher.

For a table containing identifiers for all human genes, refer to the NCBI Gene flat file Homo_sapiens.gene_info.gz. This file contains Hugo Symbol and Entrez ID in the 'Symbol' and 'GeneID' columns, respectively. Synonyms and external database identifiers are also listed.
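As a rough illustration, the lookup table can be built in a few lines of Python. This sketch assumes the file is tab-delimited with a header row whose first column name is prefixed by '#', and it matches the 'Symbol' and 'GeneID' column names case-insensitively. The function name is hypothetical and not part of PathScore.

```python
import csv
import gzip


def load_symbol_to_entrez(path):
    """Return a dict mapping gene symbol -> Entrez Gene ID (as int).

    Assumes a gzipped, tab-delimited file with a header row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Header names are normalised so '#tax_id', 'GeneID', 'GeneId' all work.
        header = [name.lstrip("#").lower() for name in next(reader)]
        symbol_col = header.index("symbol")
        id_col = header.index("geneid")
        return {row[symbol_col]: int(row[id_col]) for row in reader if row}
```

Calling load_symbol_to_entrez('Homo_sapiens.gene_info.gz') would then give a dictionary that can be used to check or fill in the Entrez ID for each symbol in your mutation file.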

If your variants are listed with Ensembl transcripts, the Ensembl Biomart can be used to obtain a map from Ensembl transcript ID to Hugo Symbol and Entrez ID.

Your mutation file should include any variants that could potentially alter the function of a pathway. This excludes silent/synonymous mutations, so remove these before uploading.

The short answer: use 'Gene length, BMR-scaled' for SNV data, and use 'Gene count' for copy number data. For anything else, read on.

The three algorithm variants are 'Gene length, BMR-scaled', 'Gene length, unscaled', and 'Gene count'. Each uses a calculation based on the hypergeometric distribution and considers mutation to be a sampling process from the genome, but they vary in the 'target size' they assign to genes and pathways.

Each uses the following target size formula for a gene set Γ:
$$\text{target size}(\Gamma) = \sum_{g \in \Gamma} \rho_g \, \lambda_g$$
Here, ρg is the gene-specific background mutation rate (BMR) and λg is the length in bases of the gene's coding sequence (CDS).

The 'Gene length, unscaled' algorithm sets ρg to 1, preventing BMR scaling. This treats each base as having equal chance of mutation, with gene mutation probability proportional to length. This could be suitable for chromatin mark data, for example.

The 'Gene count' algorithm sets both ρg and λg to 1. This treats each gene as having equal chance of mutation. This is suitable for situations where mutation probability doesn't scale reliably with length, such as for somatic copy number aberration data.
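The difference between the three variants can be sketched in Python. This is not PathScore's implementation: it assumes the target size of a gene set is the sum over its genes of ρg multiplied by λg, which is what the descriptions above imply, and it borrows the identifiers bmr_length, gene_length and gene_count from the API section as labels for the three variants. The function and argument names are hypothetical.

```python
def target_size(genes, algorithm, bmr=None, cds_length=None):
    """Sum rho_g * lambda_g over a gene set for the chosen algorithm.

    algorithm:  'bmr_length', 'gene_length' or 'gene_count'
    bmr:        dict gene -> background mutation rate (rho_g)
    cds_length: dict gene -> coding sequence length in bases (lambda_g)
    """
    if algorithm not in ("bmr_length", "gene_length", "gene_count"):
        raise ValueError("unknown algorithm: %r" % algorithm)
    total = 0.0
    for gene in genes:
        # Only the BMR-scaled variant uses a gene-specific rate.
        rho = bmr[gene] if algorithm == "bmr_length" else 1.0
        # 'Gene count' also ignores length, so every gene contributes 1.
        lam = 1.0 if algorithm == "gene_count" else cds_length[gene]
        total += rho * lam
    return total
```

With two genes of length 1000 and 3000 bases and rates 2.0 and 0.5, the three variants give 3500, 4000 and 2 respectively, which shows how strongly the choice of algorithm changes the relative weight of long or highly mutable genes.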

Any genes you specify in the 'ignore' field of the upload page will be excluded from calculations of pathway mutation burden. If you ignore TP53, for example, pathway size will be calculated after excluding TP53, and patient mutation counts will exclude mutations in TP53. If an ignored gene is mutated in an enriched pathway, the mutations in this gene will still be shown in target and matrix plots, but they will have had no influence on the pathway effect size.

The 'required' genes field in the upload form restricts the analysis to pathways that contain these genes. This restriction should drastically speed up the analysis.
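The combined effect of the 'ignore' and 'required' fields might be sketched as follows. The data structures are hypothetical, and the sketch assumes that a pathway must contain every required gene, which the text above does not state explicitly.

```python
def filter_inputs(pathways, mutations, ignore_genes=(), required_genes=()):
    """Apply 'ignore' and 'required' gene settings before any calculation.

    pathways:  dict pathway name -> set of gene symbols
    mutations: list of (patient, gene) tuples
    """
    ignore = set(ignore_genes)
    required = set(required_genes)
    # Keep only pathways containing all required genes (an assumption),
    # then drop ignored genes so they do not count towards pathway size.
    kept_pathways = {
        name: genes - ignore
        for name, genes in pathways.items()
        if required <= genes
    }
    # Mutations in ignored genes do not count towards patient totals.
    kept_mutations = [(p, g) for p, g in mutations if g not in ignore]
    return kept_pathways, kept_mutations
```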

The following flow chart illustrates how the mutation list and pathway database are filtered before performing the pathway likelihood calculation:

Mutation rates per chromosome

If you select the 'BMR-scaled gene length' algorithm, you have the option to use a custom background mutation rate file. As described in the FAQ item "What values do you use for background mutation rate?", the default BMR values are taken from an average across multiple tissue types, from 91 cancer cell lines. The high correlation of BMR across these tissue types makes this a sensible source of per-gene BMR values. This default may be adequate for most use cases, but further refinement is possible by creating a BMR dataset that is specific to the samples in your project or to the tissue type of your project. Increasing the accuracy of each gene's BMR estimate improves the estimate of each gene's relative mutational 'target size', and can therefore reduce false positives (where averaged BMR might be too low) and false negatives (where averaged BMR might be too high).

Whole genome sequencing is best suited to identifying local BMR values. Its coverage of introns and intergenic regions as well as the exome can give a more complete picture of mutation frequencies than exome sequencing alone. A custom BMR file will contain three columns: hugo_symbol, entrez_id, per_Mb. The first two columns identify the gene, the third is the gene-specific per-megabase mutation rate. Ideally you would provide rates for every gene in the pathway database, but you can supply an incomplete list and PathScore will use the average mutation rate for missing genes.
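A minimal Python sketch of producing such a file is shown below. It assumes a tab-delimited layout with a header row, which should be checked against the BMR upload page, and it assumes you already have a mutation count and a count of sequenced bases for each gene's region. The function name and input tuple layout are hypothetical.

```python
import csv


def write_bmr_file(path, records):
    """Write a three-column BMR file: hugo_symbol, entrez_id, per_Mb.

    records: iterable of (hugo_symbol, entrez_id, n_mutations, n_bases)
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["hugo_symbol", "entrez_id", "per_Mb"])
        for symbol, entrez_id, n_mutations, n_bases in records:
            # Convert a raw count to mutations per megabase.
            per_mb = n_mutations / (n_bases / 1e6)
            writer.writerow([symbol, entrez_id, "%.6g" % per_mb])
```

For example, 3 mutations observed across 1,500,000 sequenced bases corresponds to a rate of 2 mutations per megabase.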

Upload your custom BMR file at http://pathscore.publichealth.yale.edu/bmr , where you can maintain a repository of BMR files. Choose a title and description for your file. The title will be used in the BMR dropdown field in the project upload form, so make it short and recognizable. You will be able to download the processed BMR file, as well as lists of unrecognized/'rejected' genes and 'ignored' genes that are not in the pathways database.

When you upload a mutation dataset, you have the option of including a column called 'annot' (short for 'annotation'). The values you specify here do not affect any pathway calculation, but will be used to add an extra dimension of information to the resulting matrix plots. The annotation strings you provide will be overlaid on the corresponding patient–gene boxes, so try to keep these to a single character. In cases where there are multiple mutations in a single gene, the most common annotation string will be used, appended with a '+'. To convey activating or silencing mutations, for example, you might use 'a' and 's', respectively, as in the example matrix below. Copy number gain or loss could be conveyed with '+' and '-'.

annot_matrix_demo.svgz

The motivation behind this feature is to aid interpretation. The observation of the same annotation across all mutations of a gene might suggest a common mutational mechanism, e.g. consistent copy number loss. Informative patterns can emerge this way, assisting the search for plausibly cancer-driving pathways among false positives.
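The rule for choosing the label shown in a patient–gene box can be sketched in Python. This is an illustration of the behaviour described above, not the site's code. How ties between equally common annotations are resolved is not documented, so the sketch simply takes the first one encountered.

```python
from collections import Counter


def cell_annotation(annotations):
    """Return the label for one patient-gene box, or '' if unannotated.

    annotations: list of 'annot' strings, one per mutation in that gene.
    """
    annotations = [a for a in annotations if a]
    if not annotations:
        return ""
    # Ties are broken by first occurrence (an assumption).
    most_common = Counter(annotations).most_common(1)[0][0]
    # A trailing '+' flags that more than one mutation was collapsed.
    return most_common + "+" if len(annotations) > 1 else most_common
```

A gene with annotations 's', 'a' and 's' in one patient would be labelled 's+', while a single activating mutation would be labelled 'a'.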

The web interface

The figure below shows an example from the PID TCR pathway for a melanoma dataset.

tcr_pathway_demo.svgz
Loaded
The total number of variants used in the analysis, after filtering. These correspond to rows in the upload file that have valid hugo-entrez pairs for genes that are present in MSigDB.
Unused
'Unused' variants are those with valid hugo-entrez pairs, but the corresponding genes are not present in MSigDB.
Rejected
'Rejected' variants are those with invalid hugo-entrez pairs. Refer to the entry on obtaining valid identifiers to fix these entries.
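The three categories above amount to a two-step check, sketched here in Python. The sets of valid identifier pairs and of pathway genes are hypothetical inputs that you would build yourself, for example from the NCBI gene table and the MSigDB gene sets.

```python
def classify_variant(hugo_symbol, entrez_id, valid_pairs, pathway_genes):
    """Classify one upload row as 'loaded', 'unused' or 'rejected'.

    valid_pairs:   set of (hugo_symbol, entrez_id) tuples
    pathway_genes: set of Entrez IDs present in the pathway database
    """
    if (hugo_symbol, entrez_id) not in valid_pairs:
        return "rejected"  # identifiers do not match each other
    if entrez_id not in pathway_genes:
        return "unused"    # valid gene, but in no pathway
    return "loaded"
```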

Specifying genes to 'exclude' in the results pages hides any pathways that contain these genes. This differs from the 'ignored' genes parameter, which does not skip pathways but instead excludes the provided genes from pathway size calculation and patient mutation counts.

The 'include' field in the results pages hides all pathways that do not contain the specified genes. The pathways shown will be the same as if the 'include' genes had been specified for the 'required' genes parameter on the upload page.

The Comparison page, like the Flat and Scatter results pages, uses a cutoff of P<0.05 for classifying a pathway as 'enriched'.

The scatter plot on the Comparison page shows all pathways that:

  1. are enriched in at least one of the two selected projects, and
  2. have at least one alteration in both projects.

The requirement for each pathway to be altered in both projects is necessary to obtain the respective effect sizes that are used for the scatter plot coordinates.
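The two selection criteria can be written as a short Python filter. The per-pathway dictionary layout used here ('p', 'n_altered', 'effect') is hypothetical and does not correspond to a documented PathScore format.

```python
def comparison_pathways(proj_a, proj_b, alpha=0.05):
    """Return {pathway: (effect_a, effect_b)} for the comparison scatter plot.

    proj_a, proj_b: dict pathway id -> {'p': float, 'n_altered': int,
                                        'effect': float}
    """
    points = {}
    for pathway in proj_a.keys() & proj_b.keys():
        a, b = proj_a[pathway], proj_b[pathway]
        enriched = a["p"] < alpha or b["p"] < alpha           # criterion 1
        altered_in_both = a["n_altered"] > 0 and b["n_altered"] > 0  # criterion 2
        if enriched and altered_in_both:
            points[pathway] = (a["effect"], b["effect"])
    return points
```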

Project results can be downloaded as a zip archive from the status table, by clicking the download icon on the right. The archive provides data files, described below, and html files for offline graphical browsing of interactive results pages.

To browse the interactive results pages, run the included 'server.py' Python script with the command python server.py [port]. The port argument is optional. If serving on port 8000, for example, enter localhost:8000 in the address bar of your browser, then select one of the html files from the directory listing shown.

The server script is Python's built-in SimpleHTTPServer script, modified to serve svgz files with a 'Content-Encoding' header of 'gzip'. Other server applications will work as long as this encoding is used.
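If you prefer to write your own server, the idea can be sketched with Python 3's http.server module. This is not the bundled server.py, only an equivalent under the assumption that adding the header to successful responses for '.svgz' paths is sufficient. The names needs_gzip_header, SvgzHandler and serve are hypothetical.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def needs_gzip_header(path):
    """True when the requested path (ignoring any query string) is .svgz."""
    return path.split("?", 1)[0].lower().endswith(".svgz")


class SvgzHandler(SimpleHTTPRequestHandler):
    # Report .svgz files as SVG; the gzip layer is declared separately.
    extensions_map = dict(SimpleHTTPRequestHandler.extensions_map)
    extensions_map[".svgz"] = "image/svg+xml"

    def send_response(self, code, message=None):
        super().send_response(code, message)
        # Only successful responses carry the gzipped file body.
        if code == 200 and needs_gzip_header(self.path):
            self.send_header("Content-Encoding", "gzip")


def serve(port=8000):
    """Serve the current directory until interrupted with Ctrl+C."""
    ThreadingHTTPServer(("", port), SvgzHandler).serve_forever()
```

Calling serve(8000) from inside the unzipped archive directory should make the pages available at localhost:8000.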

The archive also contains the following data files:

  • The original uploaded data, as a plain text file with no extension.
  • A tab-separated results file ('_summary.txt'). This is the main results data file, which can be opened in Excel. Each row contains the following pathway attributes:
    • Full name
    • MSigDB pathway ID. These are used to name accompanying svg files
    • MSigDB info url
    • P value
    • Effective pathway size (n_effective)
    • Actual pathway size (n_actual)
    • Gene mutation frequencies (%), json-formatted
    • Shortest altered gene length (kbp)
    • Shortest altered gene name
    • Longest altered gene length (kbp)
    • Longest altered gene name
    • Average length of altered genes (kbp)
    • Variance in lengths of altered genes
  • A record of the input parameters ('_params.txt').
  • A list of 'hypermutated' patients (>500 variants).
  • A list of 'unused' variants ('_unused.txt').
  • A list of 'rejected' variants ('_rejected.txt').
  • A patient-pathway array containing all patient specific P values ('_pvalues.txt').
  • A 'matrix_txt' folder, containing one text file per tested pathway. Each file is a matrix showing mutation status for each gene in the pathway, and each patient that has a mutation in the pathway. Asterisks next to a gene symbol have the same meaning as 'red' boxes in the mutation plots: that at least one patient has this gene mutated, and no other mutation in the pathway.
  • Resource files for offline browsing, used by the accompanying html pages:
    • Tree data, in 2 files. A compressed svg file (.svgz) contains the dendrogram plot, and a text file ('.txt.reorder') contains the pathway names to label the branches, from top to bottom.
    • A javascript (.js) file, containing pathway data.

The interactive plots in the Scatter and Compare views are built using the Bokeh visualization library. This library provides the following functionality:

Hover
Each pathway is represented by a scatter plot circle. Hovering over a pathway will reveal a tooltip containing the pathway name.
Pan
The pan tool allows panning of the axes.
Zoom
Mouse wheel scrolling allows you to zoom in or out. Alternatively, use the 'box zoom' tool to draw a rectangular zoom area.
Point selection
Click on a pathway circle to pull up target and matrix plots beneath the scatter plot. To select multiple pathways at once, hold Shift and click on multiple circles, or use the 'box select' tool to select all pathways within a rectangular area.
Resize
The 'resize' tool, available on the Scatter page, lets you make the entire plot larger or smaller by clicking and dragging on the axes.
Reset
Return to the original view by clicking on the 'reset' button.

The REST API

A REST API lets you browse, create, modify and delete resources in a web app using simple GET, POST, PUT, DELETE http requests, respectively. The API returns an http response with helpful information in JSON form.

PathScore's REST API provides a way of creating, browsing and deleting projects without having to use the browser interface. It is straightforward to construct http requests in a script or from the command line, which means that PathScore can be called programmatically and plugged into analysis pipelines.

Resource URLs are referred to as 'endpoints'. The primary endpoint for PathScore is https://pathscore.publichealth.yale.edu/api/projects/. The trailing slash indicates a 'collection' of projects. A GET request to this endpoint means "tell me what projects I have". A POST request means "create a project with the attached form data". Individual projects are located at api/projects/[project_id]. You can 'GET' detailed project information, including results and archive urls, or 'DELETE' the project.

The examples in this FAQ use the command line tool httpie to construct http requests. Each programming language will have its own methods. Python, for example, offers the requests module:

import requests

token = 'eyJhbGciOiJIUzI1NiIsImV4cCI6MT'  # temporary token from auth/request-token
response = requests.get('https://pathscore.publichealth.yale.edu/api/projects/', auth=(token, ''))

Note the use of an authorization header in the example above. PathScore requires a temporary authentication token for most API requests, in order to match users to their projects.

Uploading a project as an anonymous user does not require authentication, but other functionality, such as browsing projects, requires an authentication token. You can request a token at the auth/request-token endpoint using the POST method. This initial request requires 'basic authentication' with your username and password. Anonymous users receive a username and password automatically after uploading a project. Make sure to send your request over SSL to protect your details.

Authentication tokens expire after 1 hour. Later requests will require a new token.

Command line example, using httpie

$ http --auth=demo@example.edu:password POST https://pathscore.publichealth.yale.edu/auth/request-token

  • To avoid exposing your password, exclude it from the above command to get a password prompt.
  • Note the use of the POST method. Other request types will fail.

Example response JSON

{
    "token": "eyJhbGciOiJIUzI1NiIsImV4cCI6MT"
}

Use this token in the authorization header of additional API requests as username, with blank password. For example:

$ TK=eyJhbGciOiJIUzI1NiIsImV4cCI6MT
$ http --auth=$TK: GET https://pathscore.publichealth.yale.edu/endpoint
  • Note the addition of a colon after the token. This tells httpie to use an empty string as password.
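For scripts that cannot rely on httpie, the same two requests can be prepared with Python's standard library. The sketch below only builds the request objects and does not send them. It assumes the endpoints accept standard HTTP Basic authentication, as described above. The helper names are hypothetical.

```python
import base64
import urllib.request

API_ROOT = "https://pathscore.publichealth.yale.edu"


def basic_auth_header(username, password=""):
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def token_request(username, password):
    """Prepare (but do not send) the POST request for a new token."""
    req = urllib.request.Request(
        API_ROOT + "/auth/request-token", data=b"", method="POST")
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


def api_request(token, path, method="GET"):
    """Prepare an API request with the token as username, blank password."""
    req = urllib.request.Request(API_ROOT + path, method=method)
    req.add_header("Authorization", basic_auth_header(token))
    return req
```

Passing either request to urllib.request.urlopen and decoding the body with the json module would return the JSON shown in the examples. Because tokens expire after an hour, a long-running script would need to repeat the token request when an API call is refused.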

You can browse your projects with a GET request to the /api/projects/ endpoint.

The default behavior is to return project URLs for up to ten projects, and URLs for additional paginated results pages. Expanded project information and results filtering are made possible through query strings, as described below.

Basic use

$ http --auth=$TK: GET https://pathscore.publichealth.yale.edu/api/projects/

Response:

{
    "meta": {
        "first_url": "https://pathscore.publichealth.yale.edu/api/projects/?per_page=10&page=1",
        "last_url": "https://pathscore.publichealth.yale.edu/api/projects/?per_page=10&page=3",
        "next_url": "https://pathscore.publichealth.yale.edu/api/projects/?per_page=10&page=2",
        "page": 1,
        "pages": 3,
        "per_page": 10,
        "prev_url": null,
        "total": 25
    },
    "projects": [
        "https://pathscore.publichealth.yale.edu/api/projects/50",
        "https://pathscore.publichealth.yale.edu/api/projects/51",
        "https://pathscore.publichealth.yale.edu/api/projects/52",
        "https://pathscore.publichealth.yale.edu/api/projects/54",
        "https://pathscore.publichealth.yale.edu/api/projects/57",
        "https://pathscore.publichealth.yale.edu/api/projects/69",
        "https://pathscore.publichealth.yale.edu/api/projects/70",
        "https://pathscore.publichealth.yale.edu/api/projects/71",
        "https://pathscore.publichealth.yale.edu/api/projects/72",
        "https://pathscore.publichealth.yale.edu/api/projects/73"
    ]
}
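To collect every project URL across pages, a script can keep following next_url. The Python sketch below assumes next_url is null on the last page, mirroring prev_url on the first page. The fetch_json argument is a hypothetical callable that performs the authenticated GET request and returns the decoded JSON.

```python
def iter_project_urls(fetch_json, first_url):
    """Yield every project URL, following meta.next_url until it is null.

    fetch_json: callable taking a URL and returning the decoded JSON dict.
    """
    url = first_url
    while url:
        page = fetch_json(url)
        for project_url in page["projects"]:
            yield project_url
        # Assumed to be None (JSON null) on the final page.
        url = page["meta"]["next_url"]
```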
Filtering

Projects can be filtered using the optional 'filter' query string parameter, which takes the form filter=[field_name],[operator],[value].

Filterable fields, based on upload form, include:

  • algorithm
  • upload_time
  • proj_suffix (the project name)
  • n_patients

Operators include: eq, ne, lt, le, gt, ge, like and in. For the in operator, separate the values with commas. Multiple filters can be concatenated using a semicolon.

Examples:
  • proj_suffix,like,%tcga% — projects with TCGA in the name.
  • upload_time,lt,2015-12-25 — projects uploaded before Dec 25, 2015.
  • n_patients,gt,200;n_patients,lt,300 — projects with between 200 and 300 patients.
Expanded info

Add expand=True to the query string to show full project details.

Sorting

The same fields used for filtering can also be used for sorting the results. Sorting syntax takes the form sort=[field name],[asc|desc] in the query string, e.g. ?sort=n_patients,desc lists projects in order of patient count, from largest to smallest.

Putting it all together

The following request shows the most recent projects with over 200 patients that used the 'bmr-scaled gene length' algorithm:

$ http --auth=$TK: GET 'https://pathscore.publichealth.yale.edu/api/projects/?sort=upload_time,desc&filter=n_patients,gt,200;algorithm,eq,bmr_length'
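When building such URLs in a script, it helps to assemble the query string programmatically so that special characters such as '%' are escaped. The Python sketch below uses only the standard library. The function name and argument layout are hypothetical, while the parameter names (sort, filter, expand) are those described above.

```python
from urllib.parse import urlencode

PROJECTS_URL = "https://pathscore.publichealth.yale.edu/api/projects/"


def build_projects_url(filters=(), sort=None, expand=False):
    """Build a project-browsing URL.

    filters: iterable of (field, operator, value, ...) tuples
    sort:    (field, 'asc' or 'desc') tuple, or None
    """
    params = []
    if sort:
        params.append(("sort", "%s,%s" % sort))
    if filters:
        # Commas separate the parts of a filter, semicolons separate filters.
        joined = ";".join(",".join(str(part) for part in f) for f in filters)
        params.append(("filter", joined))
    if expand:
        params.append(("expand", "True"))
    if not params:
        return PROJECTS_URL
    # Leave ',' and ';' readable; everything else unsafe is percent-encoded.
    return PROJECTS_URL + "?" + urlencode(params, safe=",;")
```

For instance, build_projects_url(filters=[('n_patients', 'gt', 200), ('algorithm', 'eq', 'bmr_length')], sort=('upload_time', 'desc')) reproduces the URL in the request above.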

All users, registered and anonymous alike, can create a project by sending a POST request to the api/projects/ endpoint that includes form data and a file attachment. The upload requirements are the same as for submission through the Upload web form.

Uploads that are not accompanied by an authentication token are treated as submissions by a new anonymous user. A username and password will be automatically generated, and returned as part of the response.

The required form fields are as follows:

  • algorithm — bmr_length, gene_length or gene_count
  • proj_suffix — project name, appended to project id.
  • ignore_genes — list of genes to ignore, comma-separated, no spaces.
  • required_genes — list of required genes that must be present in a pathway, comma-separated, no spaces.
  • mut_file — the mutation file, as described on the Upload page.

These fields should be form-encoded and the request should be sent with content-type multipart/form-data. In httpie this is accomplished with the '--form' flag, and files are specified using the '@' symbol, as shown in the following example:

$ http --auth=$TK: --form POST https://pathscore.publichealth.yale.edu/api/projects/ algorithm=bmr_length proj_suffix=LUAD_via_api mut_file@LUAD_test.txt ignore_genes=TP53,KRAS required_genes=CBL

  • This command creates a project named 'LUAD_via_api', using the bmr_length algorithm.
  • The mutation file is named 'LUAD_test.txt', satisfying the requirement for a 'txt' or 'csv' extension.
  • The genes TP53 and KRAS are ignored, while CBL is required.
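Before sending an upload from a script, the field rules listed above can be checked locally. The following Python sketch validates the values and returns them as a dictionary. It does not perform the upload, and the function name is hypothetical. It treats the two gene lists as optional, which is an assumption.

```python
import os
import re

ALGORITHMS = ("bmr_length", "gene_length", "gene_count")
# One or more gene names separated by commas, with no whitespace.
GENE_LIST = re.compile(r"^[^,\s]+(,[^,\s]+)*$")


def project_form(algorithm, proj_suffix, mut_file,
                 ignore_genes="", required_genes=""):
    """Validate upload fields and return them as a dict of form values."""
    if algorithm not in ALGORITHMS:
        raise ValueError("algorithm must be one of " + ", ".join(ALGORITHMS))
    for name, value in (("ignore_genes", ignore_genes),
                        ("required_genes", required_genes)):
        if value and not GENE_LIST.match(value):
            raise ValueError(name + " must be comma-separated, no spaces")
    if os.path.splitext(mut_file)[1] not in (".txt", ".csv"):
        raise ValueError("mut_file needs a 'txt' or 'csv' extension")
    return {"algorithm": algorithm, "proj_suffix": proj_suffix,
            "ignore_genes": ignore_genes, "required_genes": required_genes,
            "mut_file": mut_file}
```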

A 'Location' header in the response gives a URL for the newly created project:

HTTP/1.0 201 CREATED
Content-Length: 90
Location: http://localhost:5000/api/projects/50

{
    "msg": "File accepted and validated. Analysis in progress.",
    "status": "Success."
}

The resource URL for an individual project takes the form https://pathscore.publichealth.yale.edu/api/projects/[project_id].

As shown in the example below, the response includes several types of project metadata:

  • User-specified upload parameters,
    • algorithm, name, ignore_genes, required_genes
  • line counts for loaded, unused and rejected mutations,
    • n_loaded, n_rejected, n_unused
  • URLs for unused and rejected mutations,
    • filtered_unused, filtered_rejected
  • web interface results URLs,
    • flat_url, tree_url, scatter_url, compare_url
  • archive URL for downloading the project as a zip file
    • archive_url

$ http --auth=$TK: GET https://pathscore.publichealth.yale.edu/api/projects/50

{
    "algorithm": "gene_count",
    "archive_url": "https://pathscore.publichealth.yale.edu/api/archives/50",
    "filtered_rejected": "https://pathscore.publichealth.yale.edu/filtered?type=rejected&proj=50",
    "filtered_unused": "https://pathscore.publichealth.yale.edu/filtered?type=ignored&proj=50",
    "ignore_genes": "TP53,KRAS",
    "n_loaded": 8320,
    "n_patients": 253,
    "n_rejected": 0,
    "n_unused": 15,
    "name": "50_LUAD_via_api",
    "required_genes": "CBL",
    "flat_url": "https://pathscore.publichealth.yale.edu/results?proj=50",
    "scatter_url": "https://pathscore.publichealth.yale.edu/scatter?proj=50",
    "self_url": "https://pathscore.publichealth.yale.edu/api/projects/50",
    "status": "Project complete.",
    "tree_url": "https://pathscore.publichealth.yale.edu/tree?proj=50",
    "upload_time": "Wed, 04 Mar 2015 19:49:41 GMT"
}

You may wish to delete projects if you are up against the upload restrictions or to unclutter your results pages. You can delete an individual project by sending a DELETE http request to the project's URL at api/projects/[project_id].

$ http --auth=$TK: DELETE https://pathscore.publichealth.yale.edu/api/projects/50

A zip archive can be obtained for a project using the archive endpoint: api/archives/[project_id], as listed in the project metadata.

The '--download' flag in httpie allows downloading of the attached zip file:

$ http --download --auth=$TK: GET https://pathscore.publichealth.yale.edu/api/archives/50

HTTP/1.1 200 OK
Content-Disposition: attachment; filename=50_SKCM_all_ignoreFishy.zip
Content-Length: 16632178
Content-Type: application/zip

Downloading 15.86 MB to "50_LUAD_via_api.zip"
 |  42.80 %    6.79 MB  326.43 kB/s  0:00:28 ETA

Uploading and browsing custom BMR files follows the same process as uploading projects—only the arguments and endpoint of the POST request differ. Send a POST request to api/bmr/, with the following fields:

  • title — a short, recognizable string. Max 32 characters. You will use the title to refer to this BMR file when creating a project that uses it.
  • tissue — (optional) a tissue of origin, max 100 characters.
  • description — (optional) A description of the file, for your reference. Max 255 characters.
  • bmr_file — the BMR file, as described on the BMR page.

Refer to project upload instructions for help with form-encoding and setting content type.

The following httpie command uploads a BMR file, with title 'lung_ccle':

$ http --auth=$TK: --form POST https://pathscore.publichealth.yale.edu/api/bmr/ title=lung_ccle tissue=lung description='16 lung samples from CCLE' bmr_file@lung_ccle.txt

As with project upload, the 'Location' header in the response gives a URL for the newly created BMR file.

You can browse your BMR archive in the same manner as you would browse projects, sending GET requests to the api/bmr/ endpoint. As with project browsing, you can filter results using url query strings.

Technical

Background mutation rate values are provided in supplementary data from Lawrence et al (2013).

Lawrence, M.S. et al., 2013. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499(7457), pp.214–218. Available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3919509/

The gene-specific non-coding rates in Table S5 and 100kb region non-coding rates in Table S4 are averaged from 91 cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). As illustrated in Figure S3 in their supplementary pdf, these mutation rates are highly concordant across tissue types.

We convert these values to 'mutations per megabase' for each gene in MSigDB.

Mutation rates per chromosome

Statistics:

count    8287.000000
mean        3.207774
std         1.652648
min         0.279508
25%         2.298138
50%         2.754947
75%         3.532977
max        25.720930

We use the Bonferroni adjustment, multiplying the P value by the number of pathways in our database, currently 1321. We denote adjusted P as P*.
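As a small worked example in Python, the adjustment is a single multiplication. Capping the result at 1 is a common convention and is an assumption here, since the text above does not say whether PathScore caps P*.

```python
N_PATHWAYS = 1321  # number of pathways quoted in the text


def bonferroni(p_value, n_tests=N_PATHWAYS):
    """Return the Bonferroni-adjusted P value, capped at 1 (assumed)."""
    return min(1.0, p_value * n_tests)
```

A raw P value of 1e-5 therefore becomes roughly 0.013 after adjustment, while a raw P value of 0.5 is reported as 1.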

The tree view contains the top 50 enriched pathways (p<0.05), ranked by effect size. The tree is an 'average linkage' dendrogram, currently built using Matlab's linkage and dendrogram functions.

Yes! The code for the web app is available on GitHub, with a GPLv3 license.