Title: | Genomic Random Interval (GRIN) |
---|---|
Description: | Improved version of 'GRIN' software that streamlines its use in practice to analyze genomic lesion data, accelerate its computing, and expand its analysis capabilities to answer additional scientific questions including a rigorous evaluation of the association of genomic lesions with RNA expression. Pounds, Stan, et al. (2013) <DOI:10.1093/bioinformatics/btt372>. |
Authors: | Abdelrahman Elsayed [aut, cre, cph] , Xueyuan Cao [aut], Lakshmi Anuhya patibandla [aut], Stanley Pounds [aut, cph] |
Maintainer: | Abdelrahman Elsayed <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0 |
Built: | 2024-11-20 04:57:58 UTC |
Source: | https://github.com/abdel-elsayed87/grin2 |
Function return box plots for expression data by lesion groups for selected number of genes based on a specified q-value of the kruskal-wallis test results.
alex.boxplots(out.dir, alex.data, alex.kw.results, q, gene.annotation)
alex.boxplots(out.dir, alex.data, alex.kw.results, q, gene.annotation)
out.dir |
Path to the folder where the boxplots of selected genes based on the specified q value of the KW results table will be added. |
alex.data |
Output of the alex.prep.lsn.expr function. It's a list of three data tables that include "row.mtch", "alex.expr" with expression data, "alex.lsn" with lesion data. Rows of alex.expr, and "alex.lsn" matrices are ordered by gene ensembl IDs and the columns are ordered by patient ID. |
alex.kw.results |
ALEX Kruskal-Wallis test results (output of the KW.hit.express function). |
q |
minimum q value for a gene to be included in output PDF file of box plots. |
gene.annotation |
Gene annotation data either provided by the user or retrieved from ensembl BioMart database using get.ensembl.annotation function included in the GRIN2.0 library. Data.frame should has four columns: "gene" which is the ensembl ID of annotated genes, "chrom" which is the chromosome on which the gene is located, "loc.start" which is the gene start position, and "loc.end" the gene end position. |
Function return a PDF file with box plots for expression data by lesion groups for selected number of genes based on a specified q-value of the kruskal-wallis test results (one gene per page).
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023) Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
alex.prep.lsn.expr()
, KW.hit.express()
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=5, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # return boxplots for a list of top significant genes to a pre-specified folder using 'out.dir': dir.create(resultsFolder <- file.path(tempdir(), "temp.out")) boxplots=alex.boxplots(out.dir=resultsFolder, alex.data, alex.kw.results, 1e-15, hg19.gene.annotation) unlink(resultsFolder, recursive = TRUE)
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=5, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # return boxplots for a list of top significant genes to a pre-specified folder using 'out.dir': dir.create(resultsFolder <- file.path(tempdir(), "temp.out")) boxplots=alex.boxplots(out.dir=resultsFolder, alex.data, alex.kw.results, 1e-15, hg19.gene.annotation) unlink(resultsFolder, recursive = TRUE)
Function compute the distance between subjects in the dataset based on the lesions that affect different genes assigned to the pathway of interest and return two panels of lesion and expression data of ordered subjects based on the computed distances.
alex.pathway(alex.data, lsn.data, pathways, selected.pathway)
alex.pathway(alex.data, lsn.data, pathways, selected.pathway)
alex.data |
output of the alex.prep.lsn.expr function. It's a list of three data tables that include "row.mtch", "alex.expr" with expression data, "alex.lsn" with lesion data. Rows of alex.expr, and "alex.lsn" matrices are ordered by gene ensembl IDs and columns are ordered by patient ID. |
lsn.data |
Lesion data in a GRIN compatible format. data.frame should has five columns that include "ID" with patient ID, "chrom" which is the chromosome on which the lesion is located, "loc.start" which is the lesion start position, "loc.end" the lesion end position and "lsn.type" which is the lesion type for example gain, loss, mutation, fusion, etc... |
pathways |
data.frame with three columns "gene.name" that has gene symbols, "ensembl.id" with gene ensembl ID and "pathway" that has the pathway name. |
selected.pathway |
The pathway of interest. |
Function compute the distance between subjects in th dataset based on lesions affecting different genes assigned to the pathway of interest and return two panels of lesion and expression data of ordered subjects based on the computed distances. Function also return a data.frame with lesion and expression data of the pathway genes ordered based on the hierarchical clustering analysis (same order of the subjects in the lesion and expression panels of the figure).
Function will return two panels figure of lesion and expression data of ordered subjects based on the computed distances of lesions in all genes assigned to the pathway of interest. The function will also return:
ordered.path.data |
data.frame with lesion and expression data of the pathway genes ordered based on the hiearchial clustering analysis (same order of the subjects in the lesion and expression panels of the figure). |
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
alex.prep.lsn.expr()
, stats::hclust()
data(expr.data) data(lesion.data) data(hg19.gene.annotation) data(pathways) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=5, min.pts.lsn=5) # use lesions in all genes assigned to the jak_pathway as an example pathway: alex.path=alex.pathway(alex.data, lesion.data, pathways, "Jak_Pathway") # extract expression and lesion data (same subjects order in the figure) alex.path
data(expr.data) data(lesion.data) data(hg19.gene.annotation) data(pathways) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=5, min.pts.lsn=5) # use lesions in all genes assigned to the jak_pathway as an example pathway: alex.path=alex.pathway(alex.data, lesion.data, pathways, "Jak_Pathway") # extract expression and lesion data (same subjects order in the figure) alex.path
The function prepares lesion and expression data matrices for the KW.hit.express function that runs the kruskal-Wallis test for the association between lesion groups and expression level of each gene with available lesion and expression data.
alex.prep.lsn.expr( expr.mtx, lsn.data, gene.annotation, min.expr = NULL, min.pts.lsn = NULL )
alex.prep.lsn.expr( expr.mtx, lsn.data, gene.annotation, min.expr = NULL, min.pts.lsn = NULL )
expr.mtx |
Normalized log2 transformed expression data provided by the user with genes in rows and subjects in columns (first column "ensembl.ID" should be gene ensembl IDs). |
lsn.data |
Lesion data in GRIN compatible format. Data frame should has five columns that include "ID" with patient ID, "chrom" which is the chromosome on which the lesion is located, "loc.start" which is the lesion start position, "loc.end" the lesion end position and "lsn.type" which is the lesion type for example gain, loss, mutation, fusion, etc... |
gene.annotation |
Gene annotation data either provided by the user or retrieved from ensembl BioMart database using get.ensembl.annotation function included in the GRIN2.0 library. Data.frame should has four columns: "gene" which is the ensembl ID of annotated genes, "chrom" which is the chromosome on which the gene is located, "loc.start" which is the gene start position, and "loc.end" the gene end position. |
min.expr |
Minimum allowed expression level of the gene (the sum of expression level of the gene in all patients; useful to exclude genes with very low expression) |
min.pts.lsn |
Minimum number of patients with any type of lesions in a certain gene otherwise the gene will be excluded from the lesion matrix. |
The function use prep.lsn.type.matrix function to prepare the lesion matrix that has each gene represented in one row with all lesion types included. Next, the function will prepare lesion and expression data matrices for the KW.hit.express function that runs the kruskal-Wallis test. It only keep genes with both lesion and expression data with rows ordered by ensembl ID and columns ordered by patient's ID.
A list with the following components:
alex.expr |
Expression data with gene ensembl IDs as row names and patient IDs as column names. Rows are ordered by ensembl ID and columns ordered by patient IDs. |
alex.lsn |
Lesion data for genes in the expression data matrix with gene ensembl IDs as row names and patient IDs as column names. Rows are ordered by ensembl ID and columns ordered by patient IDs. |
alex.row.mtch |
Data.frame of two columns with ensembl ID of genes in the expression and lesion data matrices (ID should be the same in the two columns). |
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5)
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5)
Function return a waterfall plot for expression data by lesion groups of a selected gene.
alex.waterfall.plot(waterfall.prep, lsn.data, lsn.clrs = NULL, delta = 0.5)
alex.waterfall.plot(waterfall.prep, lsn.data, lsn.clrs = NULL, delta = 0.5)
waterfall.prep |
Output of the alex.waterfall.prep function. It's a list of three data tables that include "gene.lsn.exp" that has patient ID, lesion type that affect this gene if any and expression level of the selected gene, "lsns" which is a data table with all lesions affecting the gene of interest in a GRIN compatible format and "stats" which is one row with the Kruskal-Wallis test result (output of the KW.hit.express function). |
lsn.data |
Lesion data in a GRIN compatible format. |
lsn.clrs |
Assigned colors per lesion types. If not specified, colors will be automatically assigned using default.grin.colors function. |
delta |
Spacing argument for the waterfall plot. |
Function return a waterfall plot for expression data by lesion groups of a selected gene. This plot offers a side by side graphical representation of lesion and expression data for each patient where lesion groups are ordered alphabetically. For each lesion category, expression data is ordered from the lowest to the highest with patient with the median expression of the gene in the middle of the panel.
Function return a waterfall plot for expression data by lesion groups of a selected gene.
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
alex.prep.lsn.expr()
, KW.hit.express()
, alex.waterfall.prep()
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # To prepare lesion and expression data for a waterfall plot (WT1 gene): WT1.waterfall.prep=alex.waterfall.prep(alex.data, alex.kw.results, "WT1", lesion.data) # waterfall plot of WT1 gene: WT1.waterfall.plot=alex.waterfall.plot(WT1.waterfall.prep, lesion.data)
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # To prepare lesion and expression data for a waterfall plot (WT1 gene): WT1.waterfall.prep=alex.waterfall.prep(alex.data, alex.kw.results, "WT1", lesion.data) # waterfall plot of WT1 gene: WT1.waterfall.plot=alex.waterfall.plot(WT1.waterfall.prep, lesion.data)
Function prepares lesion and expression data of a selected gene for the alex.waterfall.plot function.
alex.waterfall.prep(alex.data, alex.kw.results, gene, lsn.data)
alex.waterfall.prep(alex.data, alex.kw.results, gene, lsn.data)
alex.data |
output of the alex.prep.lsn.expr function. It's a list of three data tables that include "row.mtch", "alex.expr" with expression data, "alex.lsn" with lesion data. Rows of alex.expr, and "alex.lsn" matrices are ordered by the gene ensembl IDs and columns are ordered by patient IDs. |
alex.kw.results |
ALEX Kruskal-Wallis test results (output of the KW.hit.express function). |
gene |
Gene name or ensembl ID of the gene of interest. |
lsn.data |
Lesion data in a GRIN compatible format. Object should has five columns that include "ID" with patient ID, "chrom" which is the chromosome on which the lesion is located, "loc.start" which is the lesion start position, "loc.end" the lesion end position and "lsn.type" which is the lesion category for example gain, loss, mutation, fusion, etc... |
Function prepares lesion and expression data of a selected gene for the alex.waterfall.plot function. It return a data table with patient ID, lesion types that affect each patient if any and expression level of the gene of interest. It also extract the kruskal-wallis test result and all lesions that affect the gene of interest.
A list of four components:
gene.lsn.exp |
Data table with three columns("ID" with patient ID, "gene.name_lsn" has the type of lesion affecting the patient which can be none, gain, mutation, multiple, etc.. and "gene.name_expr" which has the expression level of the gene of interst in this particular patient. |
lsns |
Data table with all lesions affecting the gene of interst in a GRIN compatible format extracted from the lesion data file. |
stats |
One row with the Kruskal-Wallis test result for the gene of interst (output of the KW.hit.express function). |
gene.ID |
Gene name of the gene of interst. |
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
alex.prep.lsn.expr()
, KW.hit.express()
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # To prepare lesion and expression data for a waterfall plot (WT1 gene): WT1.waterfall.prep=alex.waterfall.prep(alex.data, alex.kw.results, "WT1", lesion.data)
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # To prepare lesion and expression data for a waterfall plot (WT1 gene): WT1.waterfall.prep=alex.waterfall.prep(alex.data, alex.kw.results, "WT1", lesion.data)
Clinical data file showing demographic and clinical outcomes of 265 newly diagnosed T-cell Acute Lymphoblastic Leukemia (T-ALL) patients that was reported by Liu, Yu, et al. (2017).
clin.data
clin.data
clin.data
A data frame with 265 rows and 11 columns:
Patient identifier
Patient gender
Patient race
Patient age in days
White Blood Cell (WBC) count
Minimal Residual Disease (MRD) percentage
MRD as a categorical variable (0 if MRD<=0.1 or 1 if MRD>0.1)
Overall survival time in years (time between diagnosis and either the last follow-up or death)
Survival status (0=alive at the last follow-up or, 1=dead)
Event-free survival time in years
Event indicator (0=censored without event or, 1=event)
Data was extracted from the supplementary material tables of the published Liu, Yu, et al. (2017) manuscript https://www.nature.com/articles/ng.3909#Sec27 and the publicly available clinical data on TARGET database. The two files were merged and selected list of variables were kept in the final clinical data file.
The function assign plotting coordinates necessary for the genome-wide lesion plot.
compute.gw.coordinates(grin.res, scl = 1e+06)
compute.gw.coordinates(grin.res, scl = 1e+06)
grin.res |
GRIN results (output of the grin.stats function). |
scl |
length of chromosome units in base pairs. Default is 1,000,000 which means that each chromosome will be divided into multiple pieces each is 1 million base pair in length. |
The function divides each chromosome into multiple units based on the specified scl value. In addition, it orders and adds two columns x.start and x.end to the chromosme size file (x.start for chr2 is equal to x.end of chr1). Function also adds x.start and x.end columns to lesion and gene annotation data files (x.start is the start position of the lesion or the gene divided by scl and x.end is the end position of the lesion or the gene divided by scl taking into consideration that the start position of the chromosomes is added consecutively based on the chromosomes length).
Function return a list of GRIN results with the following changes to allow adding genome-wide plotting coordinates:
gene.hits |
No changes, a data table of GRIN results that includes gene annotation, number of subjects and number of hits affecting each locus, p and FDR adjusted q-values showing the probability of each locus to be affected by one or a constellation of multiple types of lesions. |
gene.lsn.data |
No changes, each row represent a gene overlapped by a certain lesion. Column "gene" shows the overlapped gene ensembl ID, and ID column has the patient ID |
lsn.data |
input lesion data with two additional columns (x.start and x.end). x.start is the start position of the lesion divided by scl and x.end is the end position of the lesion divided by scl taking into consideration that the start position of the chromosomes is added consecutively based on the chromosomes length. |
gene.data |
input gene annotation data with two additional columns (x.start and x.end). x.start is the start position of the gene divided by scl and x.end is the end position of the gene divided by scl taking into consideration that the start position of the chromosomes is added consecutively based on the chromosomes length. |
chr.size |
data table showing the size of the 22 autosomes, in addition to X and Y chromosomes in base pairs with two additional columns (x.start and x.end). x.start is the start position of the chromosome divided by scl and x.end is the end position of the chromosome divided by scl taking into consideration that the start position of the chromosomes is added consecutively based on the chromosomes length. |
gene.index |
data.frame with overlapped gene-lesion data rows that belong to each chromosome in the gene.lsn.data table. |
lsn.index |
data.frame that shows the overlapped gene-lesion data rows taht belong to each lesion in the gene.lsn.data table. |
Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # Run GRIN model using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # assign genomewide coordinates and prepare the results for the genomewide.lsn.plot function genome.coord=compute.gw.coordinates(grin.results)
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # Run GRIN model using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # assign genomewide coordinates and prepare the results for the genomewide.lsn.plot function genome.coord=compute.gw.coordinates(grin.results)
The function computes the number of hits affecting each gene by lesion category. It also compute the number of subjects with a hit in each annotated gene by lesion category as well.
count.hits(ov.data)
count.hits(ov.data)
ov.data |
a list of six data.frames that represent the output results of the find.gene.lsn.overlaps function. |
The function use the output of the find.gene.lsn.overlaps function and return the number of unique subjects affected by each lesion category in the provided list of annotated genes and regulatory features (nsubj stats). It also count the number of hits affecting each loci per lesion category (nhits stats). For example, if NOTCH1 gene was found affected by three different mutations in the same subject, this patient will be considered as one subject in the nsubj stats but in the nhits stats for this event will be counted as 3 mutations that affect NOTCH1 gene.
A list with the following components:
lsn.data |
Input lesion data |
lsn.index |
data.frame that shows the overlapped gene-lesion data rows taht belong to each lesion in the gene.lsn.data table. |
gene.data |
Input gene annotation data |
gene.index |
data.frame with overlapped gene-lesion data rows that belong to each chromosome in the gene.lsn.data table. |
nhit.mtx |
A data.frame with number of hits in each gene by lesion type (number of columns will be equal to the number of lesion types in the lsn.type column). |
nsubj.mtx |
A data.frame with number of affected subjects by lesion type in each annotated gene. |
gene.lsn.data |
Each row represent a gene overlapped by a certain lesion. Column "gene" shows the overlapped gene and "ID" column has the patient ID. |
glp.data |
data.frame ordered by gene and lesions start position. Gene start position is coded as 1 in the cty column and gene end position is coded as 4. Lesion start position is coded as 2 in the cty column and lesion end position is coded as 3. |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
prep.gene.lsn.data()
, find.gene.lsn.overlaps()
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # count number of subjects affected by different types of lesions and number of hits that affect # each locus: count.nsubj.nhits=count.hits(gene.lsn.overlap)
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # count number of subjects affected by different types of lesions and number of hits that affect # each locus: count.nsubj.nhits=count.hits(gene.lsn.overlap)
Function assigns default colors for each lesion group in the whole set of GRIN plots.
default.grin.colors(lsn.types)
default.grin.colors(lsn.types)
lsn.types |
Unique lesion types as specified in the lesion data file. |
The function specifies 10 colors for different lesion types. If the number of lesion types is more than 10, the user will be asked to specify the colors manually.
Function return a vector of colors assigned to each unique lesion type.
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) lsn.types=unique(lesion.data$lsn.type) # assign colors for different lesion categories using default.grin.colors function: default.grin.colors(lsn.types)
data(lesion.data) lsn.types=unique(lesion.data$lsn.type) # assign colors for different lesion categories using default.grin.colors function: default.grin.colors(lsn.types)
Gene expression data file showing log2 normalized expression level of 420 genes (rows) in 265 newly diagnosed T-cell Acute Lymphoblastic Leukemia (T-ALL) patients in columns that was reported by Liu, Yu, et al. (2017).
expr.data
expr.data
expr.data
A data frame with 420 rows and 265 columns:
Ensembl IDs of the list of 420 genes included in the dataset
...
Data was extracted from the supplementary material tables of the published Liu, Yu, et al. (2017) manuscript https://www.nature.com/articles/ng.3909#Sec27
The function use the output of the prep.gene.lsn.data function to find lesion-gene overlaps.
find.gene.lsn.overlaps(gl.data)
find.gene.lsn.overlaps(gl.data)
gl.data |
a list of five data.frames that represent the output results of the prep.gene.lsn.data function. |
A list with the following components:
lsn.data |
Input lesion data |
gene.data |
Input gene annotation data |
gene.lsn.data |
data.frame ordered by gene and lesions start position. Gene start position is coded as 1 in the cty column and gene end position is coded as 4. Lesion start position is coded as 2 in the cty column and lesion end position is coded as 3. |
gene.lsn.hits |
data.frame on which each row represent a gene overlapped by a certain lesion. The data.frame has 11 columns that include "gene" with ensembl ID of the overlapped gene, "gene.chrom", "gene.loc.start" and "gene.loc.end" with data for the chromosome on which the gene is located, start and end positions of the gene. In addition, column "ID" has the ID of the patient with a lesion that overlapped this gene, "lsn.chrom", "lsn.loc.start", "lsn.loc.end" and "lsn.type" have data for the chromosome, lesion start, lesion end positions and the lesion type respectively. |
gene.index |
data.frame that shows row start and row end for each chromosome in the gene.lsn.data table |
lsn.index |
data.frame that shows row start and row end for each lesion in the gene.lsn.data table |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn)
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn)
The function return a genome-wide plot based on -log(10) q-value of each of the evaluated annotated genes or lesion boundaries on each chromosome. The plot is lesion type specific (gain, loss, mutation, etc...).
genomewide.log10q.plot( grin.res, lsn.grps, lsn.colors = NULL, max.log10q = NULL )
genomewide.log10q.plot( grin.res, lsn.grps, lsn.colors = NULL, max.log10q = NULL )
grin.res |
GRIN results evaluating annotated genes or lesion boundaries (output of the grin.stats function using either lesion boundaries or annotated genes as a marker input file). |
lsn.grps |
Selected lesion groups to be added to the plot. |
lsn.colors |
Colors assigned to each lesion group (if NULL, lesion colors will be assigned automatically using default.grin.colors function). |
max.log10q |
Maximum log10 q value to be added to the plot. All boundaries or genes with -log10 q smaller than this value will be set automatically to max.log10q. |
The function return a genome-wide plot based on -log(10) q-value of each of the evaluated annotated genes or lesion boundaries to be affected by a certain type of lesions (gain, loss, mutation, etc...).
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # This analysis is lesion type specific. So, user should first data extract data for a specific # lesion group of interest for example gains from the lesion data file: gain=lesion.data[lesion.data$lsn.type=="gain",] # Return lesion boundaries for gains: lsn.bound.gain=grin.lsn.boundaries(gain, hg19.chrom.size) # Run GRIN analysis Using Lesion Boundaries as Markers Instead of the Gene Annotation File: GRIN.results.gain.bound=grin.stats(gain, lsn.bound.gain, hg19.chrom.size) # Return genomewide -log10q plot for association between lesion boundaries and gain: genomewide.log10q.plot(GRIN.results.gain.bound, lsn.grps=c("gain"), lsn.colors=c("gain" = "red"), max.log10q = 10) # instead of lesion boundaries, users can also plot -log10q values for annotated genes using # genes annotation data as a marker data file: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) genomewide.log10q.plot(grin.results, lsn.grps=c("gain"), lsn.colors=c("gain" = "red"), max.log10q = 10) # User can run this same analysis for other lesion types such as mutations and deletions.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # This analysis is lesion type specific. So, user should first data extract data for a specific # lesion group of interest for example gains from the lesion data file: gain=lesion.data[lesion.data$lsn.type=="gain",] # Return lesion boundaries for gains: lsn.bound.gain=grin.lsn.boundaries(gain, hg19.chrom.size) # Run GRIN analysis Using Lesion Boundaries as Markers Instead of the Gene Annotation File: GRIN.results.gain.bound=grin.stats(gain, lsn.bound.gain, hg19.chrom.size) # Return genomewide -log10q plot for association between lesion boundaries and gain: genomewide.log10q.plot(GRIN.results.gain.bound, lsn.grps=c("gain"), lsn.colors=c("gain" = "red"), max.log10q = 10) # instead of lesion boundaries, users can also plot -log10q values for annotated genes using # genes annotation data as a marker data file: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) genomewide.log10q.plot(grin.results, lsn.grps=c("gain"), lsn.colors=c("gain" = "red"), max.log10q = 10) # User can run this same analysis for other lesion types such as mutations and deletions.
Function return a genomewide lesion plot for all lesion types affecting different chromosomes.
genomewide.lsn.plot( grin.res, ordered = FALSE, pt.order = NULL, lsn.colors = NULL, max.log10q = NULL )
genomewide.lsn.plot( grin.res, ordered = FALSE, pt.order = NULL, lsn.colors = NULL, max.log10q = NULL )
grin.res |
GRIN results (output of the grin.stats function). |
ordered |
By default the function will order the patient IDs alphabetically. However, users can specify a certain patient's order in the genomewide lesion plot by specifying ordered=TRUE and pass a data frame with new patient's order to the pt.order argument. |
pt.order |
data.frame of two columns "ID" that has patient IDs matching the unique IDs in the lesion data file and "pts.order" that has the new patient's order listed as numbers that range from 1:n.patients (Should be only specified if ordered=TRUE). |
lsn.colors |
a vector of lesion colors (If not provided by the user, colors will be automatically assigned using default.grin.colors function). |
max.log10q |
Maximum log10 q value for genes in the GRIN results table to be added to the plot. If max.log10q=100 for example, all -log10q values>100, will be adjusted to 100 in the plot. |
The function use the genome-wide plotting coordinates obtained from the compute.gw.coordinates function and plot the whole set of lesions affecting subjects included in the dataset in the middle panel of the figure. Two additional side panels show the number of affected subjects and -log10 q value of each locus to be affected by all different types of lesions.
The function return a genome-wide lesion plot (all chromosomes) in the middle panel. For each locus, Panel on the left shows -log10 q value and the Panel on the right show the number of subjects affected by all different types of lesions color coded by lesion category.
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # Run GRIN model using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # prepare the genomewide lesion plot using genomewide.lsn.plot function with patient IDs ordered # alphabetically: genomewide.plot=genomewide.lsn.plot(grin.results, max.log10q=50) # To pass certain patients order to the genomewide.lsn.plot function, the user should specify # a certain patients order using the pt.order argument.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # Run GRIN model using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # prepare the genomewide lesion plot using genomewide.lsn.plot function with patient IDs ordered # alphabetically: genomewide.plot=genomewide.lsn.plot(grin.results, max.log10q=50) # To pass certain patients order to the genomewide.lsn.plot function, the user should specify # a certain patients order using the pt.order argument.
Retrieve chromosome size data from chr.info txt files available on the UCSC genome browser based on the user specified genome assembly.
get.chrom.length(genome.assembly)
get.chrom.length(genome.assembly)
genome.assembly |
User can specify one of four supported genome assemblies that include "Human_GRCh38", "Human_GRCh37", "Mouse_HGCm39" and "Mouse_HGCm38". |
Based on the genome assembly specified by the user, the function will directly retrieve chromosome size data from chr.info txt file available on the UCSC genome browser.
A data table with the following two columns:
chrom |
column has the chromosome number denoted as 1, 2, X, Y, etc.. |
size |
column has the chromosome size in base pairs. |
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
# To retreive chromosome size data for hg19 genome assembly: hg19.chrom.size=get.chrom.length("Human_GRCh37") # "Human_GRCh38" can be used to retreive chromosome size data for hg38 genome assembly.
# To retreive chromosome size data for hg19 genome assembly: hg19.chrom.size=get.chrom.length("Human_GRCh37") # "Human_GRCh38" can be used to retreive chromosome size data for hg38 genome assembly.
Function directly retrieve gene and regulatory features annotation data from Ensembl BioMart database based on the specified genome assembly.
get.ensembl.annotation(genome.assembly)
get.ensembl.annotation(genome.assembly)
genome.assembly |
User can specify one of four genome assemblies that include "Human_GRCh38", "Human_GRCh37", "Mouse_HGCm39" and "Mouse_HGCm38". |
Based on the genome assembly specified by the user, the function will directly retrieve gene and regulatory features annotation data from ensembl BioMart database. Annotation data include enesembl ID, the chromosome on which the gene is located, gene start and gene end position, gene name, gene description, biotype, chromosome strand and chromosome band. Gene classes (biotypes) include protein coding genes, long noncoding RNAs (lncRNAs), microRNAs (miRNAs), small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), immunoglobulins (IGs), T-cell receptors (TCRs) and pseudogens. Regulatory features data retrieved from Ensembl regulatory build are categorized in 6 classes that include promoters, promoter flanking regions, predicted enhancers, CTCF binding sites, transcription factor (TF) binding sites and the open chromatin regions.Ensembl first imports publicly available data from different large epigenomic consortia such as ENCODE, Roadmap Epigenomics and Blueprint. All high-throughput sequencing data sets are then uniformly processed using the Ensembl Regulation Sequence Analysis (ERSA) pipeline to generate signal tracks for enriched regions also referred to as annotated features or peaks. Segmentation data provide information about promoter, promoter flanking regions, enhancers and CTCF binding sites. If TF binding probability is >0 in areas outside previously mentioned regions, it takes a label of TF binding site. If any open chromatin region did not overlap with the above features, it takes a label of unannotated open chromatin. users will also have the chance to use a list of experimentally verified enhancers/transcription start sites (TSS) using the CAGE (Cap Analysis of Gene Expression) experiment on a multitude of different primary cells and tissues from the Functional Annotation of the Mouse/Mammalian Genome (FANTOM5) project.
A list of three components:
gene.annotation |
A 9 columns data.frame with gene annotation data that include enesembl ID, chromosome, gene start and gene end position, gene name, gene description, biotype, chromosome strand, and chromosome band. |
reg.annotation.predicted |
A 5 columns data.frame with regulatory features annotation data directly retreived from Ensembl regulatory build that include enesembl ID, chromosome, description(promoter, enhancer, etc..), feature start and end positions. |
reg.annotation.validated |
A 5 columns data.frame with regulatory features annotation data for experimentally verified features retreived from FANTOM5 project that include feature ID, chromosome, description(enhancer, transcription start site (TSS)), feature start and end positions. |
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
Zerbino, Daniel R., et al. (2015). The ensembl regulatory build.
Kinsella, Rhoda J., et al. (2011). Ensembl BioMarts: a hub for data retrieval across taxonomic space.
biomaRt::useEnsembl()
, biomaRt::getBM()
# Retrieve annotation data for human hg19 genome assembly: hg19.ann=get.ensembl.annotation("Human_GRCh37") # gene annotation data: hg19.gene.annotation=hg19.ann$gene.annotation # regulatory features annotation data retrieved from Ensembl regulatory build: hg19.reg.annotation=hg19.ann$reg.annotation.predicted # regulatory features annotation data retrieved from FANTOM5 project: hg19.fantom.annotation=hg19.ann$reg.annotation.validated # "Human_GRCh38" can be used instead of "Human_GRCh37" to retrieve gene and regulatory features # annotation data for human hg38 genome assembly.
# Retrieve annotation data for human hg19 genome assembly: hg19.ann=get.ensembl.annotation("Human_GRCh37") # gene annotation data: hg19.gene.annotation=hg19.ann$gene.annotation # regulatory features annotation data retrieved from Ensembl regulatory build: hg19.reg.annotation=hg19.ann$reg.annotation.predicted # regulatory features annotation data retrieved from FANTOM5 project: hg19.fantom.annotation=hg19.ann$reg.annotation.validated # "Human_GRCh38" can be used instead of "Human_GRCh37" to retrieve gene and regulatory features # annotation data for human hg38 genome assembly.
The function run association analysis between the binary lesion matrix (output of prep.binary.lsn.mtx function) and clinical outcomes of interest such as Minimal Residual Disease (MRD), Event-free Survival (EFS) and Overall Survival (OS), etc...
grin.assoc.lsn.outcome( lsn.mtx, clin.data, annotation.data, clinvars, covariate = NULL )
grin.assoc.lsn.outcome( lsn.mtx, clin.data, annotation.data, clinvars, covariate = NULL )
lsn.mtx |
Binary lesion matrix in which each type of lesions affecting certain gene is represented in a separate row for example ENSG00000148400_gain. If the gene is affected by this specific type of lesion, patient entry will be coded as 1 or 0 otherwisw. This matrix is the output of the prep.binary.lsn.mtx function. |
clin.data |
Clinical data table in which the first column "ID" should has the patient ID. |
annotation.data |
Gene annotation data either provided by the user or retrieved from ensembl BioMart database using get.ensembl.annotation function included in the GRIN2.0 library. Data.frame should has four columns: "gene" which is the ensembl ID of annotated genes, "chrom" which is the chromosome on which the gene is located, "loc.start" which is the gene start position, and "loc.end" the gene end position. |
clinvars |
Clinical outcome variables of interest (survival variables such as EFS and OS should be first coded as survival objects using Surv function and added as new columns to the clinical data file, binary variables such as MRD should be coded as 0, 1). |
covariate |
Covariates that the model will adjust for if any. |
The function run association analysis between the binary lesion matrix in which each type of lesions affecting certain gene is represented in a separate row (output of prep.binary.lsn.mtx function) and clinical outcomes.Function will run logistic regression models for association between each gene-lesion pair with numeric variables such as MRD that should be coded as 0 if the patient is MRD-negative and 1 if the patient is MRD positive. Function will also run COX-Proportional hazard models for association between lesions and survival objects such as Event-free survival (EFS) and oveall survival (OS). EFS and OS should be first coded as survival objects using Surv function and added as new columns to the clinical data file. If specified, the models can be also adjusted for one or a group of covariates such as risk group assignment, gender, age, etc...
Function returns a results table that has gene annotation data, and multiple columns showing results of the logistic regression model for association with binary variables such as MRD that include odds.ratio, lower and upper 95 confidence interval (CI), model p and FDR adjusted q values, in addition to the number of patients with/without lesion who experienced or did not experience the event. Results table will also include results of COXPH models for association between lesions with survival variables such as EFS, OS that include COXPH hazard ratio, lower and upper 95 CI, model p and FDR adjusted q values, in addition to the number of patients with/without the lesion who experienced or did not experience the event.
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Andersen, P. and Gill, R. (1982). Cox's regression model for counting processes, a large sample study.
Therneau, T., Grambsch, P. (2000) Modeling Survival Data: Extending the Cox Model.
Dobson, A. J. (1990) An Introduction to Generalized Linear Models.
stats::glm()
, survival::coxph()
data(lesion.data) data(hg19.gene.annotation) data(clin.data) # prepare lesion data and find gene lesion overlaps: gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) gene.lsn.overlap= find.gene.lsn.overlaps(gene.lsn) # Prepare a binary lesion matrix for genes affected by a certain type of lesion in at least # 5 subjects using prep.binary.lsn.mtx function: lsn.binary.mtx=prep.binary.lsn.mtx(gene.lsn.overlap, min.ngrp=5) # Prepare EFS and OS survival objects and add two new columns to the clinical data file: library(survival) clin.data$EFS <- Surv(clin.data$efs.time, clin.data$efs.censor) clin.data$OS <- Surv(clin.data$os.time, clin.data$os.censor) # define clinical outcome variables to be included in the analysis: clinvars=c("MRD.binary", "EFS", "OS") # Run association analysis between lesions in the binary lesion matrix and clinical variables # in the clinvars object: assc.outcomes=grin.assoc.lsn.outcome(lsn.binary.mtx, clin.data, hg19.gene.annotation, clinvars) # to adjust the models for one or a group of covariates, user can specify one or a group # of covariates using the 'covariate' argument.
data(lesion.data) data(hg19.gene.annotation) data(clin.data) # prepare lesion data and find gene lesion overlaps: gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) gene.lsn.overlap= find.gene.lsn.overlaps(gene.lsn) # Prepare a binary lesion matrix for genes affected by a certain type of lesion in at least # 5 subjects using prep.binary.lsn.mtx function: lsn.binary.mtx=prep.binary.lsn.mtx(gene.lsn.overlap, min.ngrp=5) # Prepare EFS and OS survival objects and add two new columns to the clinical data file: library(survival) clin.data$EFS <- Surv(clin.data$efs.time, clin.data$efs.censor) clin.data$OS <- Surv(clin.data$os.time, clin.data$os.censor) # define clinical outcome variables to be included in the analysis: clinvars=c("MRD.binary", "EFS", "OS") # Run association analysis between lesions in the binary lesion matrix and clinical variables # in the clinvars object: assc.outcomes=grin.assoc.lsn.outcome(lsn.binary.mtx, clin.data, hg19.gene.annotation, clinvars) # to adjust the models for one or a group of covariates, user can specify one or a group # of covariates using the 'covariate' argument.
Function return a stacked bar plot with number of patients affected by all different types of lesions in a pre-specified list of genes of interest.
grin.barplt(grin.res, count.genes, lsn.colors = NULL)
grin.barplt(grin.res, count.genes, lsn.colors = NULL)
grin.res |
GRIN results (output of the grin.stats function). |
count.genes |
vector with gene names of a list of genes to be added to the bar plot. |
lsn.colors |
Lesion colors (If not provided by the user, colors will be automatically assigned using default.grin.colors function). |
Function will use the input list of gene names and extract the number of patients affected by all different types of lesions in those genes from the GRIN results table (output of the grin.stats function).
Function return a stacked bar plot with number of patients affected by all different types of lesions in the pre-specified list of genes of interest.
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # run GRIN analysis using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # specify a list of genes to be included in the bar plot (driver genes) count.genes=as.vector(c("TAL1", "FBXW7", "PTEN", "IRF8","NRAS", "BCL11B", "MYB", "LEF1","RB1", "MLLT3", "EZH2", "ETV6", "CTCF", "JAK1", "KRAS", "RUNX1", "IKZF1", "KMT2A", "RPL11", "TCF7", "WT1", "JAK2", "JAK3", "FLT3")) # return the stacked barplot grin.barplt(grin.results, count.genes)
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # run GRIN analysis using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # specify a list of genes to be included in the bar plot (driver genes) count.genes=as.vector(c("TAL1", "FBXW7", "PTEN", "IRF8","NRAS", "BCL11B", "MYB", "LEF1","RB1", "MLLT3", "EZH2", "ETV6", "CTCF", "JAK1", "KRAS", "RUNX1", "IKZF1", "KMT2A", "RPL11", "TCF7", "WT1", "JAK2", "JAK3", "FLT3")) # return the stacked barplot grin.barplt(grin.results, count.genes)
The function evaluates Copy number variations that include gain and deletions as boundaries based on unique lesion start and end positions. This analysis is lesion type specific and covers the entire genome.
grin.lsn.boundaries(lsn.data, chrom.size)
grin.lsn.boundaries(lsn.data, chrom.size)
lsn.data |
Lesion data file that should be limited to include either gain or deletions. If gains are splitted to gain and amplifications based on the log2Ratio value of the CNV segmentation file, the two categories can be included in the same data table, same for homozygous and heterozygous deletions. |
chrom.size |
Chromosize size table that should include two columns "chrom" with the chromosome number and "size" with the chromosome size in base pairs. |
The function evaluates Copy number variations that include gain and deletions as boundaries and return a table of ordered boundaries based on the unique start and end positions of different lesions on each chromosome. If gains are splitted to gain and amplifications based on the log2Ratio value of the CNV segmentation file, the two categories can be included in the same analysis, same for homozygous and heterozygous deletions. Boundary will be the region between each unique start and end positions where large size lesions will be splitted into multiple boundaries based on other smaller size lesions that affect the same region in other patients if any. This analysis is meant to cover the entire genome, so regions without any annotated genes or regulatory features will be incuded will be assesed in the analysis. The first boundary for each chromosome will start from the first nucleotide base on the chromosome till the start position of the first lesion that affect the chromosome. Similarly, the last boundary will start from the end position of the last lesion that affect the chromosome till the last base on the chromosome.
Function return a data.frame with five columns:
gene |
Ordered boundaries by unique start and end positions of different lesions on each chromosome. |
chrom |
Chromosome on which the bounday is located. |
loc.start |
Boundary start position. |
loc.end |
Boundary end position. |
diff |
Boundary size in base pairs. |
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
data(lesion.data) data(hg19.chrom.size) # This analysis is lesion type specific. So, user should first data extract data for a specific # lesion group of interest for example gains from the lesion data file: gain=lesion.data[lesion.data$lsn.type=="gain",] # Return lesion boundaries for gains: lsn.bound.gain=grin.lsn.boundaries(gain, hg19.chrom.size) # Run GRIN analysis Using Lesion Boundaries markers Instead of the gene annotation file: GRIN.results.gain.bound=grin.stats(gain, lsn.bound.gain, hg19.chrom.size) # same analysis can be done for mutations, deletions and structural rearrangments.
data(lesion.data) data(hg19.chrom.size) # This analysis is lesion type specific. So, user should first data extract data for a specific # lesion group of interest for example gains from the lesion data file: gain=lesion.data[lesion.data$lsn.type=="gain",] # Return lesion boundaries for gains: lsn.bound.gain=grin.lsn.boundaries(gain, hg19.chrom.size) # Run GRIN analysis Using Lesion Boundaries markers Instead of the gene annotation file: GRIN.results.gain.bound=grin.stats(gain, lsn.bound.gain, hg19.chrom.size) # same analysis can be done for mutations, deletions and structural rearrangments.
Function use GRIN results table and prepare the lesion matrix that the user can pass to the oncoprint function from ComplexHeatmap package to geneate an OncoPrint for a selcted list of genes.
grin.oncoprint.mtx(grin.res, oncoprint.genes)
grin.oncoprint.mtx(grin.res, oncoprint.genes)
grin.res |
GRIN results (output of the grin.stats function). |
oncoprint.genes |
Vector of ensembl IDs for the selected list of genes to be added to the OncoPrint. |
Function will use the input list of ensembl IDs to prepare a data table of lesions that affect these genes (each row is a gene and each column is a patient ID). This lesion matrix is compatible and can be passed to oncoprint function in ComplexHeatmap library to prepare an OncoPrint for lesions in the selected list of genes.
Function uses the output results of grin.stats function and return data table of lesions that affect a group of selected genes (each row is a gene and each column is a patient ID).
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # Run GRIN analysis using grin.stats function: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # specify a list of genes to be included in the oncoprint (driver genes): oncoprint.genes=as.vector(c("ENSG00000148400", "ENSG00000171862", "ENSG00000171843", "ENSG00000156531", "ENSG00000162367", "ENSG00000096968", "ENSG00000105639", "ENSG00000118513","ENSG00000102974", "ENSG00000133703")) # prepare the oncoprint lesion matrix: oncoprint.mtx=grin.oncoprint.mtx(grin.results, oncoprint.genes) # user can also specify a list of top significant genes in the GRIN constellation test: # for example: select genes affected by two types of lesion with q2.nsubj<0.01: genes.const = grin.results$gene.hits[grin.results$gene.hits$q2.nsubj < 0.01, ] # get ensembl.ids for this list of genes selected.genes=as.vector(genes.const$gene) oncoprint.mtx.const=grin.oncoprint.mtx(grin.results, selected.genes)
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # Run GRIN analysis using grin.stats function: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # specify a list of genes to be included in the oncoprint (driver genes): oncoprint.genes=as.vector(c("ENSG00000148400", "ENSG00000171862", "ENSG00000171843", "ENSG00000156531", "ENSG00000162367", "ENSG00000096968", "ENSG00000105639", "ENSG00000118513","ENSG00000102974", "ENSG00000133703")) # prepare the oncoprint lesion matrix: oncoprint.mtx=grin.oncoprint.mtx(grin.results, oncoprint.genes) # user can also specify a list of top significant genes in the GRIN constellation test: # for example: select genes affected by two types of lesion with q2.nsubj<0.01: genes.const = grin.results$gene.hits[grin.results$gene.hits$q2.nsubj < 0.01, ] # get ensembl.ids for this list of genes selected.genes=as.vector(genes.const$gene) oncoprint.mtx.const=grin.oncoprint.mtx(grin.results, selected.genes)
The function run the Genomic Random Interval (GRIN) analysis to determine whether a certain locus has an abundance of lesions or a constellation of multiple types of lesions that is statistically significant.
grin.stats(lsn.data, gene.data = NULL, chr.size = NULL, genome.version = NULL)
grin.stats(lsn.data, gene.data = NULL, chr.size = NULL, genome.version = NULL)
lsn.data |
data.frame with lesion data prepared by the user in a GRIN compatible format. Object should has five columns that include "ID" with patient ID, "chrom" which is the chromosome on which the lesion is located, "loc.start" which is the lesion start position, "loc.end" the lesion end position and "lsn.type" which is the lesion category for example gain, loss, mutation, fusion, etc... For Single Nucleotide Variants (SNVs), loc.start will be the same as loc.end. For Copy Number Alterations (CNAs) such as gain and deletions, loc.start and loc.end should be the gain or deletion start and end positions respectively. For structural rearrangements such as inversions and translocations, each rearrangement should be coded in two different lines, one line for chromosome A involved in the translocation break-point and the second line for chromosome B break-point. For inversions on the same chromosome, the two lines will include the two breakpoints of the inversion. An example lesion data in a GRIN compatible format can be found at the GRIN2.0 package data folder (lesion.data.rda). |
gene.data |
data.frame with the gene annotation data either provided by the user or directly retreived from ensembl BioMart database using get.ensembl.annotation function included in the GRIN2.0 library if the genome.version is specified. Object should has four columns "gene" which is the ensembl ID of annotated genes to which the lesion data will be overlapped, "chrom" which is the chromosome on which the gene is located, "loc.start" which is the gene start position, and "loc.end" the gene end position. |
chr.size |
data.frame with the size of the 22 autosomes, in addition to X and Y chromosomes in base pairs. It should has two columns that include "chrom" with the chromosome number and "size" for the size of the chromosome in base pairs. Chromosome size data can be either provided by the user or directly retreived from UCSC genome browser using get.chrom.length function included in the GRIN2.0 library if genome.version is specified. |
genome.version |
Genome assembly should be only specified if the user selected not to provide gene annotation, chromosome size files, and directly retrieve those files from ensembl BioMart database, and UCSC genome browsers using get.ensembl.annotation and get.chrom.length functions respectively. Currently, the function support four genome assemblies that include "Human_GRCh38", "Human_GRCh37", "Mouse_HGCm39", and "Mouse_HGCm38". |
The function run the Genomic Random Interval (GRIN) analysis and evaluates the probability of each gene locus to be affected by different types of lesions based on a convolution of independent but non-identical Bernoulli distributions to determine whether this locus has an abundance of lesions that is statistically significant.In addition, FDR-adjusted q value is computed for each locus based on Pounds & Cheng (2006) estimator of the proportion of tests with a true null (pi.hat). The function also evaluates if a certain locus is affected by a constellation of multiple types of lesions and return the GRIN results table.
A list with the following components:
gene.hits |
data table of GRIN results that include gene annotation, number of subjects affected by each lesion type for example gain, loss, mutation, etc.., and number of hits affecting each locus. The GRIN results table will also include P and FDR adjusted q-values showing the probability of each locus of being affected by one or a constellation of multiple types of lesions. |
lsn.data |
input lesion data |
gene.data |
input gene annotation data |
gene.lsn.data |
each row represent a gene overlapped by a certain lesion. Column "gene" shows the overlapped gene ensembl ID and "ID"" column has the patient ID. |
chr.size |
data table showing the size of the 22 autosomes, in addition to X and Y chromosomes in base pairs. |
gene.index |
data.frame with overlapped gene-lesion data rows that belong to each chromosome in the gene.lsn.data table. |
lsn.index |
data.frame that shows the overlapped gene-lesion data rows taht belong to each lesion in the gene.lsn.data table. |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
prep.gene.lsn.data()
, find.gene.lsn.overlaps()
, count.hits()
, prob.hits()
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # if gene annotation and chromosome size files will be provided by the user: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # to directly retrieve gene annotation and chromosome size files from Ensembl BioMart database, # and UCSC genome browsers using get.ensembl.annotation and get.chrom.length functions respectively, # users can select to specify certain genome assembly using the 'genome.version' argument: # "Human_GRCh37" can be used for the GRCH37 (hg19) genome assembly, and "Human_GRCh38" can be used # for the GRCH38 (hg38) genome assembly
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # if gene annotation and chromosome size files will be provided by the user: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # to directly retrieve gene annotation and chromosome size files from Ensembl BioMart database, # and UCSC genome browsers using get.ensembl.annotation and get.chrom.length functions respectively, # users can select to specify certain genome assembly using the 'genome.version' argument: # "Human_GRCh37" can be used for the GRCH37 (hg19) genome assembly, and "Human_GRCh38" can be used # for the GRCH38 (hg38) genome assembly
Function return a plot with all types of lesions that spans either a gene or regulatory feature of interest with GRIN statistics added.
grin.stats.lsn.plot(grin.res, feature = NULL, lsn.clrs = NULL, expand = 5e-04)
grin.stats.lsn.plot(grin.res, feature = NULL, lsn.clrs = NULL, expand = 5e-04)
grin.res |
GRIN results for genes or regulatory features (output of the grin.stats function) |
feature |
Feature ensembl ID from Ensembl regulatory build or FANTOM5 project. An ensembl ID of a gene can be provided as well. |
lsn.clrs |
Assigned colors per lesion types. If not specified, colors will be automatically assigned using default.grin.colors function. |
expand |
Controls ratio of the feature locus (start and end position) to the whole plot with default value = 0.0005 (setting expand=0 will only plot the locus from the start to the end position without any of the upstream or downstream regions of the feature). |
Function return a plot with all lesions that affect either a gene regulatory feature of interest. Top panel of the plot will has all different types of lesions affecting the loci color coded according to the figure legend. Lower panel of the plot has all the GRIN statistics of the feature that include number of subjects affected by each type of lesions, -log10 p, and –log10q values showing if the feature is significantly affected by the corresponding lesion category. This plot has no panel for transcripts table as regulatory features typically do not have this kind of information.
Function return a plot with all types of lesions that spans either a gene or regulatory feature of interest in addition to the locus GRIN statistics without adding the transcripts panel.
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # run GRIN analysis grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # Plots showing different types of lesions and GRIN stats for a gene of interest (WT1): grin.stats.lsn.plot(grin.results, feature="ENSG00000184937") # same function can be used to plot lesion data and GRIN statistics of regulatory features # that typically do not have transcripts track to add to the plot.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # run GRIN analysis grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # Plots showing different types of lesions and GRIN stats for a gene of interest (WT1): grin.stats.lsn.plot(grin.results, feature="ENSG00000184937") # same function can be used to plot lesion data and GRIN statistics of regulatory features # that typically do not have transcripts track to add to the plot.
The dataset has the start and end positions in base pairs of all 22 autosomes in addition to X and Y chromosome cytobands for Human-GRCh37 (hg19) genome assembly.
hg19_cytoband
hg19_cytoband
hg19_cytoband
A data frame with 862 rows and 5 columns:
The chromosome number.
The cytoband start position on the chromosome in base pairs.
The cytoband end position on the chromosome in base pairs.
The cytoband name.
The coloring scheme of the cytobands.
The Chromosome cytobands data file was downloaded from the UCSC genome browser for GRCh37 genome assembly https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/.
The file has the size of 22 autosomes in addition to X and Y chromosomes in base pairs directly retrieved from chr.info txt files available on the UCSC genome browser using get.chrom.length function and "Human-GRCh37" as a genome assembly option (hg19).
hg19.chrom.size
hg19.chrom.size
hg19.chrom.size
A data frame with 24 rows and 2 columns:
The chromosome number.
The chromosome length in base pairs.
Chromosome size data directly retrieved from chr.info txt files available on the UCSC genome browser using get.chrom.length function and "Human-GRCh37" as a genome assembly option (hg19).
The file has an example annotation data of 420 genes (same set of genes in the gene expression data file) directly retrieved from Ensembl BioMart database using get.ensembl.annotation function and "Human-GRCh37" as a genome assembly option (hg19).
hg19.gene.annotation
hg19.gene.annotation
hg19.gene.annotation
A data frame with 420 rows and 9 columns:
Column has the gene ensembl ID
The chromosome on which the gene is located
Gene start position in base pairs
Gene end position in base pairs
Description of the gene name
Gene symbol
Gene classes that include protein coding genes, long noncoding RNAs (lncRNAs), microRNAs (miRNAs), small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), immunoglobulins (IGs), T-cell receptors (TCRs) and pseudogens.
The chromosome strand on which the gene is located forward (1) or reverse (-1).
The chromosome band on which the gene is located.
Data was directly retrieved from Ensembl BioMart database using get.ensembl.annotation function and "Human-GRCh37" as a genome assembly option (hg19).
The dataset has the start and end positions in base pairs of all 22 autosomes in addition to X and Y chromosome cytobands for Human-GRCh38 (hg38) genome assembly.
hg38_cytoband
hg38_cytoband
hg38_cytoband
A data frame with 1,549 rows and 5 columns:
The chromosome number.
The cytoband start position on the chromosome in base pairs.
The cytoband end position on the chromosome in base pairs.
The cytoband name.
The coloring scheme of the cytobands.
The Chromosome cytobands data file was downloaded from the UCSC genome browser for GRCh38 genome assembly https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/.
Function uses Kruskal-Wallis test to evaluate the association between lesion groups and expression level of the same corresponding gene.
KW.hit.express(alex.data, gene.annotation, min.grp.size = NULL)
KW.hit.express(alex.data, gene.annotation, min.grp.size = NULL)
alex.data |
output of the alex.prep.lsn.expr function. It's a list of three data tables that include "row.mtch", "alex.expr" with expression data, "alex.lsn" with lesion data. Rows of alex.expr, and "alex.lsn" matrices are ordered by gene ensembl IDs and columns are ordered by patient ID. |
gene.annotation |
Gene annotation data either provided by the user or retrieved from ensembl BioMart database using get.ensembl.annotation function included in the GRIN2.0 library. Data.frame should has four columns: "gene" which is the ensembl ID of annotated genes, "chrom" which is the chromosome on which the gene is located, "loc.start" which is the gene start position, and "loc.end" the gene end position. |
min.grp.size |
Minimum number of subjects in a lesion group to be included in the KW test (there should be at least two groups with number of patients > min.grp.size) to run the KW test for a certain gene. |
The function uses the ensembl IDs in each row of the row.mtch file and run the Kruskal-Wallis test for association between lesion groups of the gene in the "hit.row" column with expression level of the gene in the "expr.row" column. IDs in the two columns should be the same if the KW test will be used to evaluate association between lesion groups and expression level of the same corresponding gene. If the same patient is affected with multiple types of lesions in the same gene for example gain AND mutations, the entry will be denoted as "multiple" and patients without any type of lesions will be coded as "none".
A data table with multiple columns that include:
gene |
ensembl ID of the gene of interest. |
gene.name |
Gene name of the gene of interest. |
p.KW |
Kruskal-Wallis test p-value. |
q.KW |
Kruskal-Wallis test FDR adjusted q-value. |
_n.subjects |
Multiple columns with number of subjects with each type of lesion affecting the gene, number of subjects without any lesion and number of subjects with multiple types of lesions. |
_mean |
Multiple columns with mean expression level of the gene in subjects with each type of lesion, mean expression in subjects without any lesion and mean expression in subjects with multiple types of lesions. |
_median |
Multiple columns with median expression of the gene in subjects with each type of lesion, median expression in subjects without any lesion and median expression in subjects with multiple types of lesions. |
_sd |
Multiple columns with standard deviation of the expression level of the gene in subjects with each type of lesion, standard deviation in subjects without any lesion and standard deviation in subjects with multiple types of lesions. |
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Myles Hollander and Douglas A. Wolfe (1973), Nonparametric Statistical Methods. New York: John Wiley & Sons. Pages 115–120.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5) # run Kruskal-Wallis test for association between lesion groups and expression level of the # same corresponding gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5)
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=1, min.pts.lsn=5) # run Kruskal-Wallis test for association between lesion groups and expression level of the # same corresponding gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5)
Lesion data file showing copy number variations, single nucleotide variations and structrual rearrangments affecting 265 newly diagnosed T-cell Acute Lymphoblastic Leukemia (T-ALL) patients that was reported by Liu, Yu, et al. (2017).
lesion.data
lesion.data
lesion.data
A data frame with 6,887 rows and 5 columns:
patient identifier for the patient affected by the genomic lesion
the chromosome on which the lesion is located
the lesion start position in base pairs
the lesion end position in base pairs
the lesion type for example gain, loss, mutation, fusion, etc...
extracted from the supplementary material tables of the published Liu, Yu, et al. (2017) manuscript https://www.nature.com/articles/ng.3909#Sec27
Function prepare a plot with all types of lesions that spans either a gene or a region of interest.
lsn.transcripts.plot( grin.res, genome, gene = NULL, transTrack = TRUE, lsn.clrs = NULL, chrom = NULL, plot.start = NULL, plot.end = NULL, lesion.grp = NULL, spec.lsn.clr = NULL, extend.left = NULL, extend.right = NULL, expand = 5e-04, hg38.transcripts = NULL, hg19.cytoband = NULL, hg38.cytoband = NULL )
lsn.transcripts.plot( grin.res, genome, gene = NULL, transTrack = TRUE, lsn.clrs = NULL, chrom = NULL, plot.start = NULL, plot.end = NULL, lesion.grp = NULL, spec.lsn.clr = NULL, extend.left = NULL, extend.right = NULL, expand = 5e-04, hg38.transcripts = NULL, hg19.cytoband = NULL, hg38.cytoband = NULL )
grin.res |
GRIN results (output of the grin.stats function). |
genome |
either "hg19" or "hg38" genome assemblies can be specified based on the genome assembly that has been used to prepare the lesion data. |
gene |
Gene name of interest. |
transTrack |
In case of plots that span large genomic region such as a chromosome band or the whole chromosome, this argument should be specified as 'FALSE' to exclude the transcripts track from the plot. |
lsn.clrs |
Lesion colors for the regional gene plot (If not provided by the user, colors will be automatically assigned using default.grin.colors function). |
chrom |
chromosome number (should be only specified in the locus plots where plot.start and plot.end for the locus of interest are specified). |
plot.start |
start position of the locus of interest. |
plot.end |
end position of the locus of interest. |
lesion.grp |
lesion group of interest (should be only specified in locus plots when chrom, plot.start, plot.end are specified). |
spec.lsn.clr |
color assigned to the lesion of interest (should be specified when chrom, plot.start, plot.end and lesion.grp are specified). |
extend.left |
specified number will be used to manually align the left side of the gene transcripts track directly retrieved from ensembl database with the gene lesions track if needed. |
extend.right |
specified number will be used to manually align the right side of the gene transcripts track directly retrieved from ensembl database with the gene lesions track if needed. |
expand |
Controls ratio of the gene locus (start and end position) to the whole plot with default value = 0.0005 (setting expand=0 will only plot the gene locus from the start to the end position without any of the upstream or downstream regions of the gene). |
hg38.transcripts |
transcripts data retrieved from annotation hub for hg38 version 110 (should be only specified if genome="hg38"). |
hg19.cytoband |
hg19 chromosome bands start and end data in base pair (should be only specified if genome="hg19"). |
hg38.cytoband |
hg38 chromosome bands start and end data in base pair (should be only specified if genome="hg38"). |
Function return a plot with all lesions that affect either a gene or a region of interest. Top panel of the regional gene plot has the transcripts track with all transcripts annotated to the gene of interest directly retrieved from ensembl database. The middle panel will has all different types of lesions affecting the gene color coded according to the figure legend. Lower panel of the plot has all the GRIN statistics of the gene that include number of subjects affected by each type of lesions, -log10 p, and –log10q values showing if the gene is significantly affected by the corresponding lesion category. If a certain locus is specified, only transcripts track and the lesion panel will be returned (GRIN results panel will not be added to the plot). In case of plots that span large genomic region such as a chromosome band or the whole chromosome and by specifying transTrack=FALSE, transcripts track will not be added to the plot as well.
Function will return either a gene plot with the transcripts track, lesions panel and GRIN statistic for the gene of interest, a plot with all lesions and transcripts aligned to a certain locus of interest if chrom, plot.start and plot.end were specified or a plot with all lesions affecting a region of interest without the transcripts track.
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) data(hg19_cytoband) data(hg38_cytoband) # run GRIN analysis using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # Plots showing different types of lesions affecting a gene of interest with a transcripts # track that show all the gene transcripts retrieved from Ensembl (hg19 genome assembly): WT1.gene.plot=lsn.transcripts.plot(grin.results, genome="hg19", gene="WT1", hg19.cytoband=hg19_cytoband) # Plots showing different types of lesions affecting a region of interest with a transcripts # track added to the plot: locus.plot=lsn.transcripts.plot(grin.results, genome="hg19", hg19.cytoband=hg19_cytoband, chrom=9, plot.start=21800000, plot.end=22200000, lesion.grp = "loss", spec.lsn.clr = "blue") # Plots Showing Different Types of Lesions Affecting a region of Interest without plotting the # transcripts track (this will allow plotting a larger locus of the chromosome such as a # chromosome band (should specify transTrack = FALSE): noTranscripts=lsn.transcripts.plot(grin.results, genome="hg19", transTrack = FALSE, hg19.cytoband=hg19_cytoband, chrom=9, plot.start=19900000, plot.end=25600000, lesion.grp = "loss", spec.lsn.clr = "blue") # Plots Showing Different Types of Lesions Affecting the whole chromosome: chrom.plot=lsn.transcripts.plot(grin.results, genome="hg19", transTrack = FALSE, hg19.cytoband=hg19_cytoband, chrom=9, plot.start=1, plot.end=141000000) # for GRCh38 (hg38) genome assembly, users should first call the AnnotationHub() web resource then # specify ah[["AH113665"]] to retrieve the human hg38 gene transcripts. This formal class # EnsDb object should be called afterwards in the 'hg38.transcripts' argument to return gene and # regional plots.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) data(hg19_cytoband) data(hg38_cytoband) # run GRIN analysis using grin.stats function grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # Plots showing different types of lesions affecting a gene of interest with a transcripts # track that show all the gene transcripts retrieved from Ensembl (hg19 genome assembly): WT1.gene.plot=lsn.transcripts.plot(grin.results, genome="hg19", gene="WT1", hg19.cytoband=hg19_cytoband) # Plots showing different types of lesions affecting a region of interest with a transcripts # track added to the plot: locus.plot=lsn.transcripts.plot(grin.results, genome="hg19", hg19.cytoband=hg19_cytoband, chrom=9, plot.start=21800000, plot.end=22200000, lesion.grp = "loss", spec.lsn.clr = "blue") # Plots Showing Different Types of Lesions Affecting a region of Interest without plotting the # transcripts track (this will allow plotting a larger locus of the chromosome such as a # chromosome band (should specify transTrack = FALSE): noTranscripts=lsn.transcripts.plot(grin.results, genome="hg19", transTrack = FALSE, hg19.cytoband=hg19_cytoband, chrom=9, plot.start=19900000, plot.end=25600000, lesion.grp = "loss", spec.lsn.clr = "blue") # Plots Showing Different Types of Lesions Affecting the whole chromosome: chrom.plot=lsn.transcripts.plot(grin.results, genome="hg19", transTrack = FALSE, hg19.cytoband=hg19_cytoband, chrom=9, plot.start=1, plot.end=141000000) # for GRCh38 (hg38) genome assembly, users should first call the AnnotationHub() web resource then # specify ah[["AH113665"]] to retrieve the human hg38 gene transcripts. This formal class # EnsDb object should be called afterwards in the 'hg38.transcripts' argument to return gene and # regional plots.
The function order lesion types based on their average size and assign the proportion of the oncoprint rectangle that should be color filled based on the average size of each lesion type.
onco.print.props(lsn.data, clr = NULL, hgt = NULL)
onco.print.props(lsn.data, clr = NULL, hgt = NULL)
lsn.data |
data.frame with 5 columns including "ID" which is the subject identifier, "chrom" which is the chromosome on which the lesion is located, "loc.start" with lesion start position, "loc.end" which is the lesion end position), and "lsn.type" which is the lesion category for example gain, mutation, etc..). |
clr |
Lesion colors (If not provided by the user, colors will be automatically assigned using default.grin.colors function). |
hgt |
Manually assign the proportion of the oncoprint rectangle that should be color filled for each lesion group. |
Some patients might be affected by two or more lesion types in the same gene for example gain AND mutations. To get all lesion types represented in the same rectangle in the oncoprint, this function order lesion types based on the average size of each type and assign the proportion of the oncoprint rectangle that should be color filled based on the average size of each lesion type. Color filled proportion of the oncoprint rectangles can be also specified by the user for each lesion type based on the hgt parameter.
Function return a list of three lists specifying the color assigned to each lesion type, the proportion of the rectangle that should be color filled in the oncoprint based on the average size of each lesion type, and the legend parameters.
Lakshmi Patibandla [email protected], Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) onco.props=onco.print.props(lesion.data, hgt = c("gain"=4, "loss"=3, "mutation"=2, "fusion"=1)) # if hgt argument is not specified, the lesion category "mutation" for single point mutations will # be assigned size=1 because it has the smallest average lesion size and will have the smallest # proportion of the filled oncoprint rectangles 1/4=0.25
data(lesion.data) onco.props=onco.print.props(lesion.data, hgt = c("gain"=4, "loss"=3, "mutation"=2, "fusion"=1)) # if hgt argument is not specified, the lesion category "mutation" for single point mutations will # be assigned size=1 because it has the smallest average lesion size and will have the smallest # proportion of the filled oncoprint rectangles 1/4=0.25
This function order and index gene annotation data by chromosome on which the gene is located, gene start, and end positions.
order.index.gene.data(gene.data)
order.index.gene.data(gene.data)
gene.data |
data.frame with gene annotation data either provided by the user or retrieved from ensembl BioMart database using get.ensembl.annotation function included in the GRIN2.0 library. data.frame should has four columns that include "gene" which is the ensembl ID of the annotated genes to which the lesion data will be overlapped, "chrom" which is the chromosome on which the gene is located, "loc.start" which is the gene start position, and "loc.end" the gene end position. |
A list with the following components:
gene.data |
Input gene annotation data |
gene.index |
data.frame with two columns of ordered row start and row end based on the number of genes annotated to each chromosome. |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(hg19.gene.annotation) ordered.genes=order.index.gene.data(hg19.gene.annotation)
data(hg19.gene.annotation) ordered.genes=order.index.gene.data(hg19.gene.annotation)
This function order and index lesion data by lesion type, the chromosome on which the lesion is located , and subject.
order.index.lsn.data(lsn.data)
order.index.lsn.data(lsn.data)
lsn.data |
data.frame with lesion data prepared by the user in a GRIN compatible format. The data.frame should has five columns that include "ID" which is a column with id of the patient affected by the lesion, "chrom" which is the chromosome on which the lesion is located, "loc.start" which is the lesion start position, "loc.end" the lesion end position and "lsn.type" which is the lesion type for example gain, loss, mutation, fusion, etc... |
A list with the following components:
lsn.data |
Input lesion data |
lsn.index |
data.frame with row start and row end for each type of lesions affecting each subject on a certain chromosome. For example, if a certain patient is affected by 1 deletion on chromosome 5, row start wil be equal to row end for loss on chromosome 5. However, if the patient is affected by 4 deletions, difference between row.start and row.end will be 3. |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) ordered.lsn=order.index.lsn.data(lesion.data)
data(lesion.data) ordered.lsn=order.index.lsn.data(lesion.data)
The dataset has a list of genes annotated to a group of different pathways.
pathways
pathways
pathways
A data frame with 121 rows and 3 columns:
Gene symbol.
Gene ensembl ID.
The pathway to which the gene is annotated.
Data was extracted from the supplementary material tables of the published Liu, Yu, et al. (2017) manuscript https://www.nature.com/articles/ng.3909#Sec27
Prepares a lesion matrix with each gene affected by a certain type of lesion as a row and each patient as a column.
prep.binary.lsn.mtx(ov.data, min.ngrp = 0)
prep.binary.lsn.mtx(ov.data, min.ngrp = 0)
ov.data |
list of six data.frames that represent the output results of the find.gene.lsn.overlaps function. |
min.ngrp |
if specified, rows with number of patients affected by a specific type of lesion that's less than the specified number will be discarded (default is 0; function will return all genes affected by a lesion in at least one patient), for example if only one patient is affected by gain in MYB gene. |
The function uses the output results of the find.gene.lsn.overlaps function and create a binary lesion matrix with each gene affected by certain lesion type as a row and each patient as a column. Rownames are labelled as gene.ID_lesion.type (for example: ENSG00000118513_gain for gains affecting MYB gene). The entry for each patient in the table will be denoted as 1 if the patient is affected by this specific type of lesion in the gene, for example gain in MYB gene (ENSG00000118513) or 0 otherwise.
The function returns a binary lesion matrix with each row labelled as gene.ID_lesion.type and each column is a patient. Entry for each patient in the table will be denoted as 1 if the gene is affected by this specific type of lesion or 0 otherwise.
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
prep.gene.lsn.data()
, find.gene.lsn.overlaps()
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # prepare the lesion binary matrix with a minimum of 5 patients affected by the lesion to be # included in the final matrix: lsn.binary.mtx=prep.binary.lsn.mtx(gene.lsn.overlap, min.ngrp=5)
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # prepare the lesion binary matrix with a minimum of 5 patients affected by the lesion to be # included in the final matrix: lsn.binary.mtx=prep.binary.lsn.mtx(gene.lsn.overlap, min.ngrp=5)
This function prepare gene and lesion data for later GRIN computations.
prep.gene.lsn.data(lsn.data, gene.data, mess.freq = 10)
prep.gene.lsn.data(lsn.data, gene.data, mess.freq = 10)
lsn.data |
data.frame with lesion data prepared by the user in a GRIN compatible format. The data.frame should has five columns that include "ID" which is the patient ID, "chrom" which is the chromosome on which the lesion is located, "loc.start" which is the lesion start position, "loc.end" the lesion end position and "lsn.type" which is the lesion type for example gain, loss, mutation, fusion, etc... |
gene.data |
gene annotation data with four required columns: "gene" has ensembl ID, "chrom" which is chromosome on which the gene is located, "loc.start" gene start position, "loc.end" which is the gene end position |
mess.freq |
message frequency: display message every mess.freq^th lesion block (default is 10). |
This function order and index gene and lesion data for later computations. Output of this function is used to ovelap gene and lesion data using find.gene.lsn.overlaps function.
A list with the following components:
lsn.data |
Input lesion data. |
gene.data |
Input gene annotation data. |
gene.lsn.data |
data.frame ordered by gene and lesions start positions. Gene start position is coded as 1 in the cty column and gene end position is coded as 4. Lesion start position is coded as 2 in the cty column and lesion end position is coded as 3. |
gene.index |
data.frame that shows row start and row end for each chromosome in the gene.lsn.data table. |
lsn.index |
data.frame that shows row start and row end for each lesion in the gene.lsn.data table. |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
order.index.gene.data()
, order.index.lsn.data()
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation)
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation)
The function prepare a lesion matrix with all types of lesions affecting certain gene as a row and each patient as a column.
prep.lsn.type.matrix(ov.data, min.ngrp = 0)
prep.lsn.type.matrix(ov.data, min.ngrp = 0)
ov.data |
list of six data.frames that represent the output results of the find.gene.lsn.overlaps function. |
min.ngrp |
if specified, rows with number of patients affected by all different types of lesions that's less than the specified number will be discarded (default is 0; will return all genes affected by any type of lesions in at least one patient). |
The function returns a lesion matrix with each row as a gene and each column is a patient. If a gene is affected by one type of lesions in a certain patient, the entry will be labelled by lesion type (for example: gain OR mutation). However, if the same gene is affected by more than one type of lesions in a certain patient (for example: gain AND mutation), the entry will be labelled as "multiple". If the gene is not affected by any lesion, the entry for this patient will be labelled as "none".
The function returns a lesion matrix with all types of lesions affecting certain gene as a row and each patient as a column.
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
prep.gene.lsn.data()
, find.gene.lsn.overlaps()
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # prepare the lesion matrix with a minimum of 5 patients affected by any type of lesion in the # gene to be included in the final matrix lsn.type.mtx=prep.lsn.type.matrix(gene.lsn.overlap, min.ngrp=5)
data(lesion.data) data(hg19.gene.annotation) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # prepare the lesion matrix with a minimum of 5 patients affected by any type of lesion in the # gene to be included in the final matrix lsn.type.mtx=prep.lsn.type.matrix(gene.lsn.overlap, min.ngrp=5)
The function evaluates the probability of a locus to be affected by one or a constellation of multiple types of lesions.
prob.hits(hit.cnt, chr.size = NULL)
prob.hits(hit.cnt, chr.size = NULL)
hit.cnt |
output results of the count.hits function with number of subjects and number of hits affecting each locus. |
chr.size |
data.frame with the size of the 22 autosomes, in addition to X and Y chromosomes in base pairs. The data.frame should has two columns "chrom" with the chromosome number and "size" for the size of the chromosome in base pairs. |
The function computes p-value for the probability of each locus (gene or regulatory feature) to be affected by different types of lesions based on a convolution of independent but non-identical Bernoulli distributions to determine whether a certain locus has an abundance of lesions that is statistically significant.In addition, FDR-adjusted q value is computed for each locus based on Pounds & Cheng (2006) estimator of the proportion of tests with a true null (pi.hat). The function also evaluates if a certain locus is affected by a constellation of multiple types of lesions and computes a p and adjusted q values for the locus to be affected by one type of lesions (p1), two types of lesions (p2), etc...
A list with the following components:
gene.hits |
data table of GRIN results that include gene annotation, number of subjects affected by each lesion type for example gain, loss, mutation, etc.., and number of hits affecting each locus. The GRIN results table will also include P and FDR adjusted q-values showing the probability of each locus of being affected by one or a constellation of multiple types of lesions. |
lsn.data |
input lesion data |
gene.data |
input gene annotation data |
gene.lsn.data |
each row represent a gene overlapped by a certain lesion. Column "gene" shows the overlapped gene ensembl ID and "ID"" column has the patient ID. |
chr.size |
data table showing the size of the 22 autosomes, in addition to X and Y chromosomes in base pairs. |
gene.index |
data.frame with overlapped gene-lesion data rows that belong to each chromosome in the gene.lsn.data table. |
lsn.index |
data.frame that shows the overlapped gene-lesion data rows taht belong to each lesion in the gene.lsn.data table. |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
prep.gene.lsn.data()
, find.gene.lsn.overlaps()
, count.hits()
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # count number of subjects affected by different types of lesions and number of hits that affect # each locus: count.subj.hits=count.hits(gene.lsn.overlap) # compute the probability of each locus to be affected by one or a constellation of multiple # types of lesion hits.prob=prob.hits(count.subj.hits, hg19.chrom.size)
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # prepare gene and lesion data for later computations: prep.gene.lsn=prep.gene.lsn.data(lesion.data, hg19.gene.annotation) # determine lesions that overlap each gene (locus): gene.lsn.overlap=find.gene.lsn.overlaps(prep.gene.lsn) # count number of subjects affected by different types of lesions and number of hits that affect # each locus: count.subj.hits=count.hits(gene.lsn.overlap) # compute the probability of each locus to be affected by one or a constellation of multiple # types of lesion hits.prob=prob.hits(count.subj.hits, hg19.chrom.size)
Function return waterfall plots for top significant genes in the KW results table based on the specified q value.
top.alex.waterfall.plots(out.dir, alex.data, alex.kw.results, q, lsn.data)
top.alex.waterfall.plots(out.dir, alex.data, alex.kw.results, q, lsn.data)
out.dir |
Path to the folder where the waterfall plots of selected genes based on the specified q value of the KW results table will be added. |
alex.data |
output of the alex.prep.lsn.expr function. It's a list of three data tables that include "row.mtch", "alex.expr" with expression data, "alex.lsn" with lesion data. Rows of alex.expr, and "alex.lsn" matrices are ordered by gene ensembl IDs and columns are ordered by patient ID. |
alex.kw.results |
ALEX Kruskal-Wallis test results (output of the KW.hit.express function). |
q |
Maximum allowed KW q-value threshold for a gene to be plotted based on the output of the KW.hit.express function. |
lsn.data |
Lesion data in a GRIN compatible format. |
Function will return waterfall plots for top significant genes in the KW results table based on the user specified q-value threshold of the KW test.The plots will be added to the user specified outdir folder.
Function will return waterfall plots for top significant genes in the KW results table.
Abdelrahman Elsayed [email protected] and Stanley Pounds [email protected]
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
alex.prep.lsn.expr()
, KW.hit.express()
, alex.waterfall.prep()
, alex.waterfall.plot()
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=5, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # return waterfall plots for a list of top significant genes to a pre-specified folder: dir.create(resultsFolder <- file.path(tempdir(), "temp.out")) waterfall.plts=top.alex.waterfall.plots(out.dir=resultsFolder, alex.data, alex.kw.results, 1e-15, lesion.data) unlink(resultsFolder, recursive = TRUE)
data(expr.data) data(lesion.data) data(hg19.gene.annotation) # prepare expression, lesion data and return the set of genes with both types of data available # ordered by gene IDs in rows and patient IDs in columns: alex.data=alex.prep.lsn.expr(expr.data, lesion.data, hg19.gene.annotation, min.expr=5, min.pts.lsn=5) # run KW test for association between lesion groups and expression level of the same gene: alex.kw.results=KW.hit.express(alex.data, hg19.gene.annotation, min.grp.size=5) # return waterfall plots for a list of top significant genes to a pre-specified folder: dir.create(resultsFolder <- file.path(tempdir(), "temp.out")) waterfall.plts=top.alex.waterfall.plots(out.dir=resultsFolder, alex.data, alex.kw.results, 1e-15, lesion.data) unlink(resultsFolder, recursive = TRUE)
The function Write GRIN results to an excel file with multiple sheets that include GRIN results, lesion data, gene annotation data, chromosome size, gene-lesion overlap and methods paragraph.
write.grin.xlsx(grin.result, output.file)
write.grin.xlsx(grin.result, output.file)
grin.result |
output results of the grin.stats function. |
output.file |
output file name ".xlsx". |
This function return an excel file with seven sheets that include:
gene.hits |
data table of GRIN results that include gene annotation, number of subjects affected by each lesion type for example gain, loss, mutation, etc.., and number of hits affecting each locus. The GRIN results table will also include P and FDR adjusted q-values showing the probability of each locus of being affected by one or a constellation of multiple types of lesions. |
gene.lsn.data |
each row represent a gene overlapped by a certain lesion. Column "gene" shows the overlapped gene ensembl ID and "ID"" column has the patient ID. |
lsn.data |
input lesion data |
gene.data |
input gene annotation data |
chr.size |
data table showing the size of the 22 autosomes, in addition to X and Y chromosomes in base pairs. |
interpretation |
provides some details about the content of each sheet in the output excel file and interpretation of each column in the "gene.hits" GRIN results table. |
method.paragraph |
include a paragraph that explains the GRIN model and cite some references. |
Stanley Pounds [email protected]
Pounds, Stan, et al. (2013) A genomic random interval model for statistical analysis of genomic lesion data.
Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # to directly retreive gene annotation and chromosome size files from Ensembl BioMart database, # UCSC genome browsers and run the GRIN analysis: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # Write GRIN results in to an excel sheet ".xlsx" using write.grin.xlsx function.
data(lesion.data) data(hg19.gene.annotation) data(hg19.chrom.size) # to directly retreive gene annotation and chromosome size files from Ensembl BioMart database, # UCSC genome browsers and run the GRIN analysis: grin.results=grin.stats(lesion.data, hg19.gene.annotation, hg19.chrom.size) # Write GRIN results in to an excel sheet ".xlsx" using write.grin.xlsx function.