Summary
The development of cancer is an evolutionary process involving the sequential acquisition of genetic alterations that disrupt normal biological processes, enabling tumor cells to rapidly proliferate and eventually invade and metastasize to other tissues. We investigated the genomic evolution of prostate cancer through the application of three separate classification methods, each designed to investigate a different aspect of tumor evolution. Integrating the results revealed the existence of two distinct types of prostate cancer that arise from divergent evolutionary trajectories, designated as the Canonical and Alternative evolutionary disease types. We therefore propose the evotype model for prostate cancer evolution, wherein Alternative-evotype tumors diverge from those of the Canonical-evotype through the stochastic accumulation of genetic alterations associated with disruptions to androgen receptor DNA binding. Our model unifies many previous molecular observations, providing a powerful new framework to investigate prostate cancer disease progression.
Keywords: prostate cancer, cancer evolution, evotype model, evotypes, AR binding, ordering
Graphical abstract

Highlights
• Analysis of genomic data from localized prostate cancer reveals divergent evolution
• The shift away from the canonical trajectory is characterized by AR dysregulation
• Tumors can be classified into evotypes according to evolutionary trajectory
• The evotype model unifies many previous molecular observations
Woodcock et al. show that individual evolutionary trajectories in prostate cancer can be classified into two broad categories, termed evotypes. This motivates a model of prostate cancer evolution in which Alternative-evotype tumors diverge from those of the Canonical-evotype due to acquired genetic alterations that alter the androgen receptor cistrome.
Introduction
Tumor evolution is a dynamic process1 involving the accumulation of genetic alterations that disrupt normal cellular processes, leading to pathological phenotypes.2 While some cancers can be categorized into subtypes, often utilizing pronounced genomic or transcriptomic differences, the evolutionary processes that give rise to this variation are complex and not well understood.3 However, it has been shown that the order of events in some hematological malignancies can be related to prognosis and treatment susceptibility.4,5,6,7
In prostate cancer, subtyping schemes have been proposed based on the presence of specific molecular alterations,8 combinations of alterations,9 or gene expression profiles.10 However, detailed investigations by ourselves11 and others12,13 have shown substantial heterogeneity between tumors that presents challenges for simple or consistent subtype assignments.14 Studies investigating evolutionary differences between prostate cancer disease types by categorizing molecular events as “early” or “late” have been shown to be informative in early-onset15 and aggressive disease,16 and the temporal order of genetic alterations has also been shown to be related to the ETS subtype.11 However, the evolutionary factors that drive the emergence of prostate cancer subtypes remain largely unexplored.
To investigate how evolutionary behavior manifests in the variation observed in prostate cancer genomes, we performed three separate analyses, each of which probes different aspects of tumor evolution. In each analysis, we classified the tumors in an unsupervised fashion and subsequently identified sets of tumors that shared the same classes across the analyses. Through this approach, we can identify tumors that display consistent evolutionary properties and use this information to identify likely mechanisms driving prostate cancer evolution.
Results
Data collection and pre-processing
We compiled a dataset from 159 patients with intermediate or low-risk prostate adenocarcinoma sampled after radical prostatectomy, who were otherwise treatment naive (87 published previously11). These were whole-genome sequenced (target depth: 50×) along with matched blood controls (target depth: 40×), and 123 summary measurements were generated (STAR Methods; Figure S1).
We adapted an unsupervised neural network with a single hidden layer to perform feature learning on this dataset, identifying associations between inputs to obtain a reduced-dimension set of 30 features (STAR Methods). Using the trained neural network, we can recast the data for each sample in terms of these features into a form known as the feature representation. Reconstructing the original inputs from the feature representation gave a reconstruction error of 12%, indicating that these features, and the inputs to which they correspond, contain a substantial proportion of the information in the original data. Our approach is a white-box method, meaning we can identify which inputs contribute to each feature, and so we labeled each feature with a brief descriptor of the associated genomic aberrations (Figure S2). We can perform analysis on the feature representation itself while allowing comparison with the results of other analyses using a selection of the original inputs that correspond to the features (STAR Methods).
Classifying tumors by patterns of co-occurring genomic features
Despite the reduced dimensionality of the feature representation, application of standard clustering methods remains problematic due to the high dimension of features (30) relative to the sample size (159). To mitigate this, we adopted a two-stage clustering method utilizing a discrimination score we calculated for each feature that quantified the value of each feature in predicting disease relapse (STAR Methods). In the first stage, we applied k-medoid clustering to the feature representation of those features with a high discrimination score (STAR Methods). In the second stage, we performed hierarchical clustering on the cluster centers (medoids) returned in the first stage. The results are shown in Figure 1.
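The two-stage procedure can be illustrated with a short sketch. This is not the code used in the study: the feature matrix is a random placeholder, the k-medoids routine is a minimal alternating implementation, and the numbers of clusters (11) and metaclusters (3) are taken from the text.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder feature representation (the real analysis used 159 samples x 30 features)
rng = np.random.default_rng(1)
X_all = rng.random((159, 30))
X_high = X_all[:, :14]        # the 14 high-discrimination features (illustrative slice)

def k_medoids(X, k, n_iter=100, seed=0):
    """Minimal alternating k-medoids (illustrative, not the study's implementation)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = cdist(X, X[medoids]).argmin(axis=1)       # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # pick the member minimizing total within-cluster distance
            within = cdist(X[members], X[members]).sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

# Stage 1: k-medoid clustering on the high-discrimination features (k = 11 in the paper)
medoids, labels = k_medoids(X_high, k=11)

# Stage 2: hierarchical clustering of the medoids using all features
Z = linkage(X_all[medoids], method="ward")
metacluster_of_medoid = fcluster(Z, t=3, criterion="maxclust")   # three metaclusters
metacluster_of_sample = metacluster_of_medoid[labels]
```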
Figure 1.
Co-occurrence of genetic alterations distinguishes three metaclusters
After performing feature extraction, we calculated a discrimination score quantifying the relevance of each feature in predicting relapse (green heatmap). Fourteen features (red) were used as inputs for k-medoid clustering with 11 clusters. The medoids of each cluster were used as inputs to hierarchical clustering using all features, which revealed three main metaclusters, MC-A, MC-B1, and MC-B2, with different profiles as indicated by the dendrogram. The main heatmap shows the medoid feature values for the patients in each cluster, ordered by the hierarchical clustering (scale to right). The number of samples in each cluster is given below the corresponding cluster medoid. Metacluster colors are denoted by the text above the dendrogram.
We identified two distinct metaclusters that were characterized by different sets of aberrations. Metacluster A (MC-A) showed a high probability of features corresponding to intra-chromosomal structural variants (SVs), SPOP mutations, chromothripsis, and loss of heterozygosity (LOH) in regions 5q15–5q23.1 (spanning CHD1) and 6q14.1–6q22.32 (MAP3K7, ZNF292). MC-B showed more frequent ETS fusions, as well as LOH affecting 17p (TP53) and regions 19p13.3–13.2 and 22q11.21–22q11.22. The dendrogram indicated additional differences within MC-B, and so we further divided it into subclasses MC-B1 and MC-B2, with MC-B2 displaying near-ubiquitous TP53 LOH and exhibiting higher probability of ETS fusions, inter-chromosomal chained SVs (cSVs), and LOH at 10q23.1–10q25.1 (PTEN) and 5q11.1–5q14.1 (IL6ST, PDE4D).
Classifying tumors by mechanism of DNA double-strand breaks
We investigated the influence of androgen receptor (AR) on the DNA breakpoints in these samples. AR is known to precipitate DNA double-strand breaks (DSBs) in conjunction with topoisomerase II-beta,17 and AR-associated breakpoints are frequent in early-onset prostate cancer.15,18 Furthermore, it has also been shown that AR-binding behavior can be altered by CHD1 deletion.19 We used a permutation test (STAR Methods) to classify tumors based on whether breakpoints occurred significantly more (labeled as Enriched) or less (Depleted) often proximal to AR-binding sites (ARBSs) than expected if they were independent of AR, or as Indeterminate if the tumor displayed no statistically significant association (Figure 2A).
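A simplified sketch of this kind of permutation test is shown below. It is illustrative only (it treats a single chromosome, and the window size, permutation count, and function names are assumptions): observed breakpoints are compared against randomly placed breakpoints to decide whether ARBS-proximal breaks are Enriched, Depleted, or Indeterminate.

```python
import numpy as np

def classify_arbs(breakpoints, arbs_positions, chrom_len, window=20_000,
                  n_perm=10_000, alpha=0.05, seed=0):
    """Illustrative permutation test returning (class, normalized proximal proportion)."""
    rng = np.random.default_rng(seed)
    arbs = np.sort(np.asarray(arbs_positions))

    def prop_proximal(pos):
        # proportion of breakpoints within `window` of the nearest ARBS
        idx = np.searchsorted(arbs, pos)
        left = np.abs(pos - arbs[np.clip(idx - 1, 0, len(arbs) - 1)])
        right = np.abs(arbs[np.clip(idx, 0, len(arbs) - 1)] - pos)
        return np.mean(np.minimum(left, right) <= window)

    observed = prop_proximal(np.asarray(breakpoints))
    # Null distribution: breakpoints placed uniformly at random along the chromosome
    null = np.array([prop_proximal(rng.integers(0, chrom_len, size=len(breakpoints)))
                     for _ in range(n_perm)])
    p_enriched = (np.sum(null >= observed) + 1) / (n_perm + 1)
    p_depleted = (np.sum(null <= observed) + 1) / (n_perm + 1)
    normalized = observed / null.mean()
    if p_enriched < alpha:
        return "Enriched", normalized
    if p_depleted < alpha:
        return "Depleted", normalized
    return "Indeterminate", normalized
```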
Figure 2.
Classification by proximity of DNA breakpoints to ARBSs reveals common genetic alterations
(A) The proportion of DNA breakpoints within 20 kilobases (kb) of an ARBS for each patient, normalized by the number of proximal breakpoints expected by chance (vertical axis). Tumor samples are ordered according to this normalized proportion (horizontal axis). Classes were determined based on whether the tumor displayed more (Enriched) or fewer (Depleted) proximal breakpoints than expected, or whether there was no statistical significance (Indeterminate).
(B) Heatmaps of genomic features for each patient, ordered as above. Statistically significant relationships for the three classes are shown in the relationship column, where E, D, and I indicate the Enriched, Depleted, and Indeterminate classes, respectively. Braces indicate no relationship between the enclosed classes but that they both display significant differences to the remaining class. Relationships are ordered so the leftmost class(es) are those showing a significantly greater proportion of the corresponding genetic alteration. For Bernoulli variables, significance was determined with a chi-squared test followed by a Fisher’s exact test for each pairwise relationship; for continuous variables, a Kruskal-Wallis test with Tukey’s honestly significant difference (HSD) was used (false discovery rate [FDR]-adjusted p < 0.05 for all tests).
Investigating the ARBS groups in conjunction with the genetic alterations associated with the features (Figure 2B), we found that Depleted tumors had the highest percentage genome altered (PGA) and the highest frequency of multiple copy-number alterations (CNAs), chromothripsis, kataegis, and SPOP mutations (relationship column, Figure 2B). Enriched and Indeterminate tumors displayed no significant differences for any CNAs, but both showed higher frequencies of CNAs covering PTEN and TP53 than the Depleted group (relationship column, Figure 2B). In the case of ETS fusions and the inter-/intra-chromosomal cSV ratio, the Enriched group showed greater amounts than the Indeterminate group, which in turn showed greater enrichment than the Depleted group. Both Enriched and Depleted tumors displayed higher numbers of breakpoints than Indeterminate tumors. We identified these ARBS groups in two additional datasets: a set of low-intermediate risk tumors from the Canadian Prostate Cancer Genome Network13 and high-risk tumors from the Melbourne Prostate Cancer Research Group in Australia (unpublished). Clustering these groups by CNA proportions showed that groups classified as Depleted clustered together (Figure S3), confirming the association between these CNAs and ARBS-distal breakpoint prevalence.
Classifying tumors through the evolutionary order of key events
The order in which genetic alterations generally occur in tumor evolution, subsequently referred to as the “ordering profile,” can be inferred using the estimated proportion of tumor cells that display each genetic alteration in each sample.11 We adapted a Plackett-Luce mixture model20 to create a probabilistic model for the relative order of genomic aberrations given the relative subclonal fractions of SPOP mutations and the key CNAs that were identified in our feature extraction (STAR Methods). As a mixture model, it can be used to extract distinct ordering profiles within the population. Inference with this model was performed with differing numbers of clusters, and Bayesian model selection applied to the results determined that two ordering profiles were optimal (STAR Methods). We therefore defined two classes, Ordering-I and Ordering-II, and each tumor was assigned to one of these classes by its mixture weights (Figure 3).
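The Plackett-Luce likelihood underlying the mixture model can be written compactly. The sketch below is illustrative only (the study used the PLMIX R package, and the worth parameters and weights here are made up): it computes the log-probability of one tumor's event ordering under each mixture component and the posterior responsibility used to assign the tumor to Ordering-I or Ordering-II.

```python
import numpy as np

def plackett_luce_loglik(ordering, log_worth):
    """Log-probability of an ordering (first = earliest) under a Plackett-Luce model.
    ordering: sequence of event indices; log_worth: log worth parameter per event."""
    ll = 0.0
    remaining = list(ordering)
    for event in ordering:
        w = np.exp(log_worth[remaining])          # worths of events not yet placed
        ll += log_worth[event] - np.log(w.sum())  # chance this event is chosen next
        remaining.remove(event)
    return ll

def responsibilities(ordering, log_worths, weights):
    """Posterior probability that the ordering was generated by each mixture component."""
    logs = np.array([np.log(w) + plackett_luce_loglik(ordering, lw)
                     for w, lw in zip(weights, log_worths)])
    logs -= logs.max()                            # numerical stability
    probs = np.exp(logs)
    return probs / probs.sum()

# Hypothetical example: two components over four events, one tumor with events 0, 3, 1
log_worths = [np.array([0.0, -1.0, -2.0, -0.5]), np.array([-2.0, 0.0, -0.2, -1.5])]
r = responsibilities([0, 3, 1], log_worths, weights=[0.6, 0.4])
```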
Figure 3.
Samples can be differentiated by order of genetic alterations
Phylogenetic trees from individual tumors were used to estimate two ordering profiles using a Plackett-Luce (P-L) mixture model. Tumors were assigned to Ordering-I (top) or Ordering-II (bottom). Horizontal box and whisker plots (5th/25th/75th/95th percentiles) represent the spread of bootstrap estimates of the negative P-L coefficient for the ith genetic alteration (x axis). Here, the lower the value, the earlier the genetic alteration is likely to occur. The y axis shows the proportion of samples in the mixture component in which the genetic alteration was observed. Colors of the box and whiskers denote the chromosome on which the aberration occurred. Genetic alterations were annotated if they were identified as an ETS fusion, occurred with a proportion above 0.25, or were identified in the earliest 5 events; these have chromosomal regions given, with notable driver genes in the region given in brackets where applicable. Other genetic alterations were not annotated and are displayed with reduced transparency.
The two profiles displayed notable differences. Tumors corresponding to Ordering-I frequently experienced an early 8p LOH (spanning NKX3.1) and ETS fusions. Less frequent LOH of regions covering the RB1, BRCA2, CDH1, TP53, or PTEN gene could also occur. This profile occasionally displayed a very early LOH of 1q42.12–42.3. Tumors of Ordering-II consistently displayed early LOH events covering MAP3K7 and 13q (EDNRB, RB1, BRCA2). However, the earliest events, a mutation of the SPOP gene and LOH covering CHD1, were less frequent. Ordering-II also displayed more frequent copy-number gains. Both orderings showed late gains of chromosome 19. When comparing the occurrence of aberrations between individuals within each ordering, we found that the relative order of alterations was highly variable, indicating that they arise stochastically (Figure S4; STAR Methods).
Integrating analyses reveals disease types distinguished by their evolutionary trajectories
Establishing the concordance of these three classification methods (Figure 4A) revealed a remarkable relationship: MC-A is largely a subset of the Depleted group (22/27), and both are almost entirely subsets of Ordering-II (26/27 and 30/32, respectively). Quantifying the strength of the pairwise associations using Cramer’s V statistic gives metaclusters and ARBS groups (V = 0.69), metaclusters and orderings (V = 0.58), and ARBS and orderings (V = 0.62). These values indicate a strong association between cluster assignments in these three groups (STAR Methods). We can therefore infer that there exists a subset of tumors that exhibit all the corresponding properties: an evolutionary trajectory (Ordering-II), a breakpoint mechanism (ARBS:Depleted), and characteristic patterns of aberrations (metacluster:MC-A). We therefore propose the evotype model for prostate cancer evolution (Figure 4B), in which canonical AR DNA binding is disrupted, through the effect of genetic alterations or other causes, coercing tumor evolution along an alternative trajectory that results in a distinct form of the disease. We can therefore classify tumors by which path a tumor is most likely to adhere to, which we refer to as its evotype. To perform this classification, we adopted a majority-vote approach and defined tumors that were assigned to at least two of MC-A, Depleted, or Ordering-II as belonging to the Alternative-evotype (n = 34) to distinguish them from Canonical-evotype (n = 125) tumors that evolve via the standard route. Each evotype is characterized by a different propensity for certain aberrations (Figure 4C), but we found that no single aberration was necessary or sufficient for assignment to either evotype. However, there were several pairwise combinations of genetic alterations that did result in fixation to one of the evotypes (Figure S5). There were no statistically significant associations (at p = 0.05) between the evotypes and tumor stage, Gleason grade, or prostate-specific antigen (PSA) levels (Figure S6).
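Both steps are straightforward to express in code. The sketch below (illustrative; label strings and helper names are assumptions) computes Cramer's V from a contingency table of two class assignments and applies the majority-vote rule used to assign each tumor to an evotype.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(labels_a, labels_b):
    """Cramer's V association between two categorical label vectors."""
    table = pd.crosstab(pd.Series(labels_a), pd.Series(labels_b)).to_numpy()
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

def evotype(metacluster, arbs_class, ordering):
    """Majority vote: at least two 'Alternative-like' calls -> Alternative-evotype."""
    votes = sum([metacluster == "MC-A",
                 arbs_class == "Depleted",
                 ordering == "Ordering-II"])
    return "Alternative" if votes >= 2 else "Canonical"

# Example usage with one tumor's three class labels
print(evotype("MC-A", "Indeterminate", "Ordering-II"))   # -> "Alternative"
```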
Figure 4.
Integrating results reveals multiple evolutionary trajectories converging to two disease types
(A) A comparison of how tumors were classified in each of the three previous methods. Each side of the triangle corresponds to a classification method, wherein each bar in the triangle denotes a group identified by that method. Values at the intersections of each bar show the number of tumors that were consistent with both classes. Values outside the main triangle denote the total number of tumors in that class. Colors are those used in previous figures.
(B) A schematic of the evotype model for prostate cancer evolution.
(C) The prevalence of each genetic aberration in each evotype, as determined using the majority consensus of the three classifiers. Aberrations with significant differences between evotypes are colored by the evotype displaying the highest proportion (FDR-adjusted p < 0.05, Fisher’s exact test).
(D) A surface plot showing the probability density of a tumor being assigned to the Canonical-evotype relative to the number of aberrations. Common modes of evolutionary progression follow regions of high density as the number of aberrations increases. Exemplars of such routes are indicated by black dashed lines. These are labeled according to their likely evotype, a behavioral descriptor, and notable driver genes affected by aberrations that are prevalent in the areas along the path to convergence (Figures S7 and S8).
The lack of consistent genetic alterations indicates that there may be multiple individual routes of progression for each evotype. We investigated these trajectories in more detail by developing a stochastic model of the acquisition of genetic alterations and tracking the probability of assignment to each evotype as the aberrations accumulate (Figure 4D; STAR Methods). Initially, the probability density is concentrated at 0.78, the proportion of Canonical-evotype tumors in our sample set. As the number of aberrations increases, the density diverges to accumulate at 1 (corresponding to unambiguous assignment to the Canonical-evotype) and 0 (Alternative-evotype). In this model, an individual tumor will follow a trajectory through this probability landscape, dependent on the type and order of aberrations. Due to randomness in the occurrence of genetic alterations, there are an enormous number of possible routes, but investigating patterns of aberrations in areas of high probability density reveals common modes of behavior (Figures S7 and S8). Exemplars for these modes are given by the dashed lines in Figure 4D. Notably, when an SPOP mutation occurs first, it confers high probability (0.91) of progression to the Alternative-evotype (Alternative:Rapid). Other routes to the Alternative-evotype involve the accumulation of multiple individual LOH events involving genes such as MAP3K7, CHD1, or EDNRB (Alternative:Incremental) in any order. LOH of IL6ST or gain of region 8p23.3–8p22 strongly influences convergence after a number of aberrations have already accumulated (Alternative:Abrupt). Conversely, fixation to the Canonical-evotype is dependent on a few key aberrations. Early TP53 loss or ERG gene fusion promotes almost certain fixation to the Canonical-evotype (Canonical:Rapid). Alternatively, loss of regions covering PTEN or CDH1 can coerce a relatively quick progression toward this evotype, but these are rarely the final convergent event in the trajectory (Canonical:Moderate). Indeed, there are aberrations that are often the last step in convergence to the Canonical-evotype, particularly LOH of 19p13.3–19p13.2 or 22q11.21–22q11.22 or gains of chromosome 19 or region 22q11.1–22q11.23 (Canonical:Punctuated).
The lack of a single genetic alteration unique to the Alternative-evotype indicates that there may be multiple mechanisms for the acquired AR dysregulation that we observe in prostate cancer. We therefore investigated potential mechanisms of AR dysregulation. It has previously been shown that CHD1 protein is involved in AR binding, which causes DNA loops that can precipitate DSBs (Figure 5A, adapted from Metzger et al.21). As LOH of the CHD1 locus is significantly associated with the Alternative-evotype (Figure 4C) and is an early event in tumor evolution (Figure 3), we hypothesized that loss of CHD1 in these tumors would be associated with fewer DSBs precipitated through the DNA loop mechanism. We therefore tested whether pairs of adjacent ARBSs required for DNA loops to form by this mechanism are significantly more or less frequently close to DSBs, dependent on CHD1 status (STAR Methods). We found that CHD1 wild-type (WT) tumors more frequently displayed DSBs close to pairs of ARBSs than tumors that displayed a CHD1-associated LOH (Figure 5B; p = 3.91 × 10−5). Extrapolating our hypothesis to the evotypes, we found a significant difference between Canonical- and Alternative-evotype tumors (Figure 5C; p = 3.41 × 10−9). This relationship also holds in CHD1 WT tumors of both evotypes (Figure 5D; p = 3.40 × 10−4). These results indicate that CHD1 LOH can drive AR dysregulation in prostate cancer but that other mechanisms also exist in Alternative-evotype tumors.
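The group comparison itself is a standard one-sided Mann-Whitney U test, which can be expressed as follows. The per-tumor frequencies below are hypothetical placeholders, not values from the study.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-tumor frequencies of DSB-associated ARBS pairs (one value per tumor)
freq_chd1_wt = np.array([0.31, 0.28, 0.35, 0.40, 0.29])
freq_chd1_loh = np.array([0.12, 0.18, 0.09, 0.21])

# One-sided test: are frequencies lower in CHD1-LOH tumors than in CHD1 wild-type tumors?
stat, p = mannwhitneyu(freq_chd1_loh, freq_chd1_wt, alternative="less")
```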
Figure 5.
Frequency of AR-induced DNA loops associated with DSBs is associated with CHD1 loss and evotype status
(A) A simplified schematic of AR binding to ARBSs, where CHD1 protein is part of a complex that induces DNA loop formation and subsequent DSBs, denoted by the red X.
(B) A notched box and whisker plot shows that adjacent proximal ARBS pairs that are required for DNA loops to form were observed less frequently in the vicinity of breakpoints in CHD1-deficient tumors than in CHD1 wild-type tumors.
(C) DSB-associated ARBS pairs occurred less frequently in tumors of the Alternative-evotype than the Canonical-evotype.
(D) DSB-associated ARBS pairs occurred less frequently in CHD1 wild-type tumors of the Alternative-evotype than the Canonical-evotype. All p values were determined through a one-sided Mann-Whitney U test.
Discussion
Taken together, our findings reveal prostate cancer disease types that arise as a result of divergent trajectories of a stochastic evolutionary process in which specific genetic alterations can tip the balance toward convergence to either route. Unlike the evolution of species, which involves ongoing adaptation to a perpetually changing environment, tumor evolution has a definable endpoint—a disease state that leads to the death of the host. It follows that the more “evolved” tumors are closer to this endpoint, which has obvious implications for risk stratification. We therefore proposed that our evolutionary model implied two factors associated with risk: the evotype itself and the degree of progression relative to that evotype. We investigated this principle using follow-up information based on time to biochemical recurrence (serum PSA > 0.2 ng/mL for two consecutive measurements) after prostatectomy.
Initially, we found that classifying by evotype alone provides a significant association with time to biochemical recurrence (Figure 6A; p = 0.0256), displaying a higher hazard ratio (HR = 2.30) than stratification by well-known genetic alterations such as PTEN loss (HR = 1.42, p = 0.336; Figure S9A), TP53 loss (HR = 2.03, p = 0.0497; Figure S9B), or ETS status (HR = 1.64, p = 0.179; Figure S9C). However, it performed worse than other metrics known to be associated with outcome, such as tumor mutational burden (TMB), which led to HR = 4.50 and p = 0.000110 (Figure 6B), or histopathological grading via the ISUP Gleason grade score, which gave HR = 4.69 and p = 0.0000629 (Figure 6C).
Figure 6.
Utility of evotype model in survival analysis
Kaplan-Meier plots for (A) the evotypes; (B) 20 tumors with greatest tumor mutational burden (high-TMB) against the remainder (low-TMB); (C) ISUP Gleason grade; (D) 10 tumors with highest TMB for each evotype (high-TMB Alternative and high-TMB Canonical) against the remainder (low-TMB combined); (E) the ISUP Gleason grade ≥3 tumors in the high-TMB evotype classes (Evo-TMB-Gleason high) and the remainder (Evo-TMB-Gleason low); (F) Alternative-evotype tumors in MC-A (MC-A/Alternative) and Canonical-evotype tumors in MC-B2 (MC-B2/Canonical) and the remainder (MC-B1/combined); and (G) ISUP Gleason grade ≥3 tumors of either MC-A/Alternative or MC-B2/Canonical (MC-A/B2-Gleason high combined) against the remainder (MC-A/B1/B2-Gleason low combined). For each comparison, we provide the hazard ratio (HR) and p value calculated with the Cox proportional hazards test, p values adjusted for Gleason grade, TMB, and age at diagnosis if these were not used to create the sets used in the comparison (padj), and Harrell’s C-index. In (D) and (F), these values are given for the denoted class in comparison to the remainder only. Endpoint is time to biochemical recurrence.
To illustrate how information on the evolutionary path might improve risk stratification, we adopted two approaches to determine which tumors were the most advanced relative to their evotype. In the first, we classified the 10 tumors of each evotype with the highest TMB as advanced (denoted high-TMB Alternative and high-TMB Canonical) and compared these to all other tumors. We found that high-TMB tumors of both evotypes displayed higher HRs (Figure 6D; HR > 6) than all previous metrics, notably outperforming the 20 high-TMB tumors when evotype was not used (Figure 6B; HR = 4.50). To investigate how this risk determinant might be used in conjunction with current clinical prognostic methods, we compared the high-TMB tumors of both evotypes that were also ISUP Gleason grade ≥3 with all other tumors, which further improved performance (HR = 7.28, p = 5.16 × 10−7; Figure 6E). In the second approach, we hypothesized that metaclusters MC-A and MC-B2 were representative of advanced tumors of the Alternative-evotype and Canonical-evotype, respectively, as these tumors displayed many of their characteristic genetic alterations (Figure 1). Stratifying by tumors belonging to both MC-A and the Alternative-evotype yielded HR = 3.64 and p = 0.00363, with those in MC-B2 and the Canonical-evotype giving HR = 6.14 and p = 4.60 × 10−5, in comparison to the tumors that were in neither group (Figure 6F). These relationships were still significant when adjusted for TMB, Gleason grade, and age at diagnosis (adjusted p values [padj] = 0.00913 and 0.000492), showing that TMB itself is not driving this result. As before, we compared these advanced tumors that were also ISUP Gleason grade ≥3 to all other tumors, which provided even better performance (HR = 7.66, p = 2.84 × 10−8; Figure 6G). The findings in Figures 6E and 6G indicate that Gleason grade and evolutionary progression provide complementary information on prognosis. Note that these findings are illustrative, as a robust optimization of thresholds or sets of genetic alterations for risk evaluation requires full validation with an independent dataset and therefore remains outside the scope of this study.
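This style of survival analysis can be reproduced with standard libraries. The sketch below is a minimal example, not the study's code: it builds a hypothetical per-patient table (all column names and values are assumptions) and uses the lifelines package for Kaplan-Meier curves, a log-rank test, and a covariate-adjusted Cox model.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import logrank_test

# Hypothetical per-patient table; column names and values are assumptions, not study data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time_to_bcr": rng.exponential(60, 159),     # months to biochemical recurrence
    "recurred": rng.integers(0, 2, 159),         # 1 = recurrence observed, 0 = censored
    "is_alternative": rng.integers(0, 2, 159),   # 1 = Alternative-evotype
    "gleason": rng.integers(1, 6, 159),
    "tmb": rng.poisson(3000, 159),
    "age": rng.normal(65, 7, 159),
})

# Kaplan-Meier curves per evotype and a log-rank test between the two groups
km = KaplanMeierFitter()
for name, sub in df.groupby("is_alternative"):
    km.fit(sub["time_to_bcr"], event_observed=sub["recurred"], label=f"evotype={name}")

mask = df["is_alternative"] == 1
lr = logrank_test(df.loc[mask, "time_to_bcr"], df.loc[~mask, "time_to_bcr"],
                  event_observed_A=df.loc[mask, "recurred"],
                  event_observed_B=df.loc[~mask, "recurred"])

# Cox model: hazard ratio for the evotype, adjusted for Gleason grade, TMB, and age
cph = CoxPHFitter()
cph.fit(df, duration_col="time_to_bcr", event_col="recurred")
hr_evotype = cph.hazard_ratios_["is_alternative"]
```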
Furthermore, the evotype model provides additional context to relationships between individual aberrations reported in previous studies. Co-occurring genomic alterations that have been identified previously can be related to particular evotypes. For the Canonical-evotype, this includes LOH events affecting PTEN and CDH1,22 or PTEN and TP53.23 Conversely, CHD1 losses have previously been observed in conjunction with SPOP mutations,24,25 as has LOH affecting MAP3K726 and 2q22;27 all these aberrations are associated with the Alternative-evotype. The most widely used basis for genomic prostate cancer subtyping is the ETS status, where tumors are classified by the presence or absence of an ETS gene fusion into ETS+ and ETS−, respectively.1,8,9,11 We found that 94% of Alternative-evotype tumors were ETS−, and indeed, alterations such as SPOP mutations and CHD1 LOH that are characteristic of this evotype have previously been associated with ETS− tumors.11,28 Conversely, the Canonical-evotype exhibits both ETS+ (66%) and ETS− (34%) tumors. When removing Alternative-evotype tumors from the ETS classification, we found that there were no significant differences in risk (Figure S9D) or prevalence of any of the genomic features between ETS+ and ETS− tumors of the Canonical-evotype (Figure S9E). This is consistent with its definition as a distinct disease type independent of ETS status.
Classification by evotype could have epidemiological implications. For instance, non-White racial groups display an increased incidence of many Alternative-evotype aberrations29,30,31 and may therefore have a higher predisposition for this disease type. Conversely, cancers arising in younger patients have enrichment for ARBS-proximal breakpoints18 and are reported to develop via a similar evolutionary progression to the Canonical-evotype.15,18 It may also be possible to tailor treatment strategies to each evotype. In particular, cancers with aberrations found more commonly in the Alternative-evotype have been shown to be susceptible to ionizing radiation24 and have a better response to treatment with PARP inhibitors32 and androgen ablation.25
Our evolutionary model for prostate cancer disease types provides a conceptual framework that unifies the results of many previous studies and has significant implications for our understanding of progression, prognosis, and treatment of this disease. As evolution through the sequential acquisition of synergistic genetic alterations is a process common to many tumors, the principles, analytical approach, and conceptual framework outlined here are widely applicable, and we anticipate them leading to insights into disease behavior in other cancer types.
Limitations of study
In this study, we present evidence supporting the existence of at least two distinct evolutionary paths in prostate cancer, which underpins the concept of classifying these cancers into evotypes. However, the precise criteria that differentiate Canonical-evotype tumors from those of the Alternative-evotype remain to be rigorously defined. Our statistical classification may therefore have incorrectly assigned some tumors to an evolutionary path that does not reflect their true nature. Additionally, there is the possibility that a single prostate may contain tumor cell subpopulations following both trajectories. Although there was no evidence for this in the datasets we analyzed, the most appropriate way to classify such cases remains undetermined. It is also likely that there are other evolutionary paths yet to be discovered, and so assigning such tumors to either of the two evotypes we describe here would be incorrect. Another caveat is that our patient cohort predominantly consists of men of White-European ancestry treated in the UK, Australia, and Canada and therefore does not represent the global population. Therefore, while our findings are robust within the context of our study population, caution is warranted when extrapolating these results to other ethnic groups.
Consortia
The members of the CRUK ICGC Prostate Group are Adam Lambert, Anne Babbage, Clare L. Verrill, Claudia Buhigas, Dan Berney, Ian G. Mills, Nening Dennis, Sarah Thomas, Sue Merson, Thomas J. Mitchell, Wing-Kit Leung, and Alastair D. Lamb.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Critical commercial assays | ||
| Quant-iT PicoGreen dsDNA Assay Kit | ThermoFisher | Cat#P7589 |
| Deposited data | ||
| DNA BAM files | This paper | EGA accession: EGAS00001000262 |
| Software and algorithms | ||
| Burrows-Wheeler Aligner | Li and Durbin33 | https://bio-bwa.sourceforge.net/ |
| CGP core WGS analysis pipelines | Cancer Genome Project | https://github.com/cancerit/dockstore-cgpwgs |
| Battenberg | Nik-Zainal et al.34 | https://github.com/Wedge-lab/battenberg |
| SeqKat | https://doi.org/10.1101/287839 | https://github.com/cran/SeqKat |
| ChainFinder | Baca et al.35 | N/A |
| Telomerecat | Farmery et al.36 | https://github.com/cancerit/telomerecat |
| PLMIX | Mollica and Tardella20 | https://cran.r-project.org/web/packages/PLMIX/ |
| PlackettLuce | Turner et al.37 | https://cran.rstudio.com/web/packages/PlackettLuce/ |
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, David C. Wedge, Ph.D. ([email protected]).
Materials availability
There are no tangible materials produced by this study that are available for distribution.
Data and code availability
Sequencing data generated for this study have been deposited in the European Genome-phenome Archive with accession code EGAS00001000262. Processed data and code used in this manuscript are available at https://github.com/woodcockgrp/evotypes_p1/ and via https://doi.org/10.5281/zenodo.10214795.
Experimental model and study participant details
Cancer samples from radical prostatectomy, and matched blood controls, were collected from 205 patients treated at the Royal Marsden NHS Foundation Trust, London, at Addenbrooke’s Hospital, Cambridge, at Oxford University Hospitals NHS Trust, and at Changhai Hospital, Shanghai, China, as described previously.38,39 Ethical approval was obtained from the respective local ethics committees and from The Trent Multicentre Research Ethics Committee. All patients were consented to ICGC standards. 159 of the samples passed stringent quality control for copy number profiles and structural variants, and were used in this study.
Method details
DNA preparation and DNA sequencing
DNA from frozen tumor tissue and whole blood samples (matched controls) was extracted and quantified using a dsDNA assay (UK-Quant-iT PicoGreen dsDNA Assay Kit for DNA) following the manufacturer’s instructions with a fluorescence microplate reader (Biotek SynergyHT, Biotek). Acceptable DNA had a concentration of at least 50 ng/μl in TE (10 mM Tris/1 mM EDTA) and displayed an optical density (260/280) ratio between 1.8 and 2.0. Whole-genome sequencing (WGS) was performed at Illumina, Inc. (Illumina Sequencing Facility, San Diego, CA, USA) or the BGI (Beijing Genome Institute, Hong Kong), as described previously,38,39 to a target depth of 50× for the cancer samples and 30× for matched controls.38 The Burrows-Wheeler Aligner33 (BWA) was used to align the sequencing data to the GRCh37 reference human genome.
Generation of summary measurements
We generated 123 summary measurements from the WGS data using a number of previously published algorithms, which we briefly outline below. These are grouped into measurements that were generated with similar or related algorithms; default parameters were used unless otherwise stated. The processed data is given alongside the code at https://github.com/woodcockgrp/evotypes_p1/.
Numbers of SNVs, indels and structural variants - 10 fields
SNVs, insertions and deletions were detected using the Cancer Genome Project Wellcome Trust Sanger Institute pipeline as described previously.38 In brief, SNVs were detected using CaVEMan with a cut-off ‘somatic’ probability of 0.95. Insertions and deletions were called using a modified version of Pindel.40 Variant allele frequencies of all indels were corrected by local realignment of unmapped reads against the mutant sequence. Structural variants were detected using Brass.38 Total numbers of SNVs, indels and rearrangements per sample were calculated (1 field each), as were types of indel (3 fields: insertion, deletion and complex) and structural variants (4 fields: large insertions or deletions, tandem duplications and translocations).
Percentage genome altered - 3 fields
This was calculated as the percentage of the total genome that is affected by CNAs.41 We also recorded the percentage affected by clonal and subclonal CNAs (i.e., CNAs with CCF = 1 and CCF < 1, respectively).
Ploidy - 1 field
We adopted the same approach as detailed previously,11 where whole-genome duplicated samples were those which had an average ploidy, as identified with the Battenberg algorithm, greater than 3. These samples were designated as tetraploid and assigned a value of 1 in our dataset; otherwise the sample was diploid (assigned 0).
Kataegis - 1 field
Kataegis was identified using SeqKat https://github.com/cran/SeqKat. The datum was set to 1 if kataegis was identified and 0 if not.
ETS status - 1 field
A positive ETS status was assigned if a DNA breakpoint involving ERG, ETV1, ETV3, ETV4, ETV5, ETV6, ELK4, or FLI1 and partner DNA sequences was detected and the fusion was in-frame. The datum was set to 1 if there was an ETS fusion detected or 0 if not.
Gene fusions - 2 fields
We reported the number of in-frame gene fusions in the sample (counts) and if there was a gene fusion affecting the TMPRSS2/ERG genes (1 or 0).
Breakpoints - 14 fields
Breakpoints were identified with ChainFinder35 version 1.01. The total number of breakpoints, total number of chained breakpoints (i.e., where the breakpoints are interdependent), number of chains, the number of breakpoints in the longest chain, the number of breakpoints involved in the chained events, and the maximum number of chromosomes involved in a chain were recorded as integer counts (6 fields). We also calculated the proportion of all breakpoints that were in chained events (1 field) and the average, median and maximum number of chromosomes involved in a chain (3 fields). Information about the type of breakpoint was also recorded, including the number of deletion bridges, intra-chromosomal and inter-chromosomal events (3 fields - counts) and the inter-chromosomal to intra-chromosomal ratio (1 field - set to zero if there were no intra-chromosomal breakpoints).
Mutated driver genes - 26 fields
A set of driver genes was identified from our previous publication.11 Using the CaVEMan output, we determined any non-synonymous mutations in the exonic regions of these genes as a mutated driver gene; the corresponding field was assigned a value of 0 if no such mutations were identified and 1 if there were.
Copy number alterations - 60 fields
We followed our previous approach11 to identify consistently aberrant regions. A permutation test was developed in which CNAs detected from each sample were placed randomly across the genome, and then the total number of times a region was hit by each type of CNA in this random assignment was compared to the number of times the region was hit in the actual data. This process was repeated 100,000 times and recurrent (or enriched) regions were defined as having a false discovery rate (FDR) of less than 0.05. This was performed separately for gains, loss of heterozygosity (LOH) and homozygous deletions (HD). We identified small regions initially and these were amalgamated into larger regions, defined as the regions between chromosomal positions where the difference between the number of CNAs identified in the data and the expected frequency (if this process were uniformly random) dropped to zero. For each sample, if a breakpoint corresponding to a gain, LOH or HD occurred in a given region, then the respective datum was set to 1, and 0 otherwise.
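A simplified sketch of this permutation scheme is given below. It is illustrative only (the genome is modeled as a single linear axis divided into fixed-size bins, and bin size, permutation count and function names are assumptions): each sample's segments are relocated at random, per-bin hit counts are compared with the observed counts, and Benjamini-Hochberg FDR control flags recurrent bins.

```python
import numpy as np

def recurrent_regions(cna_segments, genome_len, bin_size=1_000_000,
                      n_perm=1_000, fdr=0.05, seed=0):
    """Illustrative CNA recurrence permutation test for one CNA type.
    cna_segments: per-sample list of (start, length) tuples."""
    rng = np.random.default_rng(seed)
    n_bins = genome_len // bin_size + 1

    def bin_counts(segments_by_sample):
        counts = np.zeros(n_bins)
        for segments in segments_by_sample:
            hit = np.zeros(n_bins, bool)            # each sample counted once per bin
            for start, length in segments:
                hit[start // bin_size:(start + length) // bin_size + 1] = True
            counts += hit
        return counts

    observed = bin_counts(cna_segments)
    exceed = np.zeros(n_bins)
    for _ in range(n_perm):
        shuffled = [[(rng.integers(0, genome_len - length), length)
                     for _, length in segments] for segments in cna_segments]
        exceed += bin_counts(shuffled) >= observed  # how often chance matches the data
    p = (exceed + 1) / (n_perm + 1)

    # Benjamini-Hochberg FDR control over bins
    order = np.argsort(p)
    passed = p[order] <= fdr * np.arange(1, n_bins + 1) / n_bins
    significant = np.zeros(n_bins, bool)
    if passed.any():
        significant[order[:np.max(np.where(passed)[0]) + 1]] = True
    return significant
```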
Telomere lengths - 1 field
Telomere lengths were estimated as described in our previous publication.36 A mean correction was applied to batches to compensate for the effects of a change in chemistry during the project; therefore the value is continuous.
Chromothripsis - 4 fields
The identified copy number breakpoints were segmented by inter-breakpoint distance along the genome using piecewise constant fitting (pcf from the R package copynumber v1.22.0). Regions with a density higher than 1 breakpoint per 3 Mb were flagged as high-density regions. A chromothripsis region was then defined as a high-density region with a minimum number of copy number breakpoints; a non-random segment size distribution (Kolmogorov-Smirnov test against the exponential distribution); at most three allele-specific copy number states covering more than a given fraction of the region; and a proportion of each type of structural variant consistent with random occurrence with equal probability (multinomial test), where TD = tandem duplication, Del = deletion, H2Hi = head-to-head inversion and T2Ti = tail-to-tail inversion. We recorded the presence or absence of chromothripsis (1 or 0, respectively), the proportion of all breakpoints in chromothripsis events, the number of chromothripsis events in each sample (counts) and the size of the largest chromothripsis region (counts).
Quantification and statistical analysis
In this section, we aim to provide a largely non-technical overview of each of the methods used to perform the analysis in this study, followed by a more technical description for those who wish to fully understand and reproduce our methodology.
Statistics
Prior to the study, we predetermined that we would use Fisher’s exact test for 2x2 contingency tables and the chi-squared test for contingency tables of greater dimensionality, and this is applied throughout. Associations between genetic alterations and ARBS clusters were identified using a one-tailed Fisher’s exact test with α = 0.05, corrected for multiple testing using the false discovery rate. Relationships were determined dependent on the variable type: for Bernoulli variables, significance was determined with a chi-squared test followed by a one-tailed Fisher’s exact test for each pairwise relationship; one-tailed tests were used as a two-tailed test would not have revealed the direction of the relationship. For continuous variables, a Kruskal-Wallis test with Tukey’s HSD was used (adjusted p < 0.05 for all tests). Significance of Depleted groups across countries clustering together was determined using the approximately unbiased multiscale bootstrap procedure. Associations between evotypes and individual genetic alterations were assessed with a two-tailed Fisher’s exact test, corrected for multiple testing using the false discovery rate. The associations with ARBS pairs were established with a one-sided Mann-Whitney U test with α = 0.05. Statistics associated with the Kaplan-Meier plots were calculated using log-rank methods, and the significance level was set at 0.05. Cramer’s V statistic was used to determine the strength of the associations between the cluster assignments. As we only claim an association between patients assigned to MC-A (metaclusters), the Depleted group (ARBS) and Ordering-II, we combined metaclusters MC-B1 and MC-B2 into one class and the Enriched and Depleted ARBS groups into one class for this comparison.
Unsupervised feature extraction
The summary measurements detailed above form the dataset for further analysis. However, it contains a number of different data types (binary, proportions, continuous, integer counts), it is high dimensional relative to the number of patients, and it undoubtedly contains highly correlated, co-occurring or equivalent events that may confound the analysis. To address this, we performed a feature extraction preprocessing step prior to the analysis. As our downstream analysis will be investigating genomic patterns that are indicative of evolutionary behavior, it is critical that the results of these analyses can be easily interpreted. This necessitates methodology where the links between input variables that correspond to the features are identifiable. We therefore opted for a latent feature approach as the basis of our feature extraction, as these can provide an interpretable representation of the relationships between the inputs.42 Latent feature (or latent variable) analysis provides a way of reformulating the data into a reduced set of features that encapsulate the underlying relationships between the original inputs. The data can be recast in terms of these latent features, which is known as the latent feature representation, and the downstream analysis performed directly on this.
There have been many latent feature models proposed, each with associated inference methods for the features (a process called feature learning). These include methods such as non-negative matrix factorization,43 Bayesian non-parametric methods44 and neural networks.45 However, none of these were able to fulfill all of our requirements above. We therefore created a bespoke method for feature extraction on this dataset.
Neural networks for feature extraction
We utilized a restricted Boltzmann machine46 (RBM) neural network as the basis of our feature learning method. We chose to use an RBM as it is extensible to multiple data types47,48 and can provide interpretable hidden units, with appropriate modifications.49 An RBM is functionally similar to another type of neural network architecture called an autoencoder.45 Autoencoders are a class of network types that compress (encode) the data into a transformed representation (the code), and then decompress (decode) in an attempt to reconstruct the original data. A measure of the error between the reconstruction and the original data is used to update the parameters through backpropagation. Typically, the code layer contains fewer units than the input/output layers, and this bottleneck means that the learning process attempts to compress the information in the dataset into a more compact representation in the code.
In contrast, the basic RBM unit consists of only two layers, known as the visible and the hidden layers. The RBM is formulated as a probabilistic network, meaning each unit represents a random variable rather than a fixed value. As such, the hidden layer performs a similar function to the code layer in the autoencoder, albeit with a probabilistic representation. It has been shown that the RBM is equivalent to the graphical model of factor analysis,50 and so each hidden unit can be interpreted as a latent feature. Another distinction from the autoencoder formulation is that there is only one weight matrix, which is used to update both the visible and hidden layers. This means that the information on the transformation from visible units (input representation) to the hidden units (feature representation) is encapsulated in this matrix. Hence we also refer to it as the input-feature map.
The restricted Boltzmann machine
The standard RBM formulation46 consists of Bernoulli random variables for all visible units $v_i \in \{0,1\}$ and hidden units $h_j \in \{0,1\}$, with respective biases $b_i$ and $c_j$ and a matrix of weights $W$. Training of an RBM is based on minimizing the free-energy of the visible units, as a low free-energy corresponds to a state where the data is explained well through the model parameterization. Energy-based probability distributions take the form

$$p(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z} \qquad \text{(Equation 1)}$$

where $E(\mathbf{v},\mathbf{h})$ is the energy function and $Z$ is a normalizing factor. This is the probability of observing the joint pair $(\mathbf{v},\mathbf{h})$. The energy function in an RBM is given as

$$E(\mathbf{v},\mathbf{h}) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j \qquad \text{(Equation 2)}$$

In this formulation,

$$Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} \qquad \text{(Equation 3)}$$

which is difficult to calculate due to the number of possible combinations of $\mathbf{v}$ and $\mathbf{h}$.
As we want training to be conducted with respect to the energy at the visible units, we need to marginalize over $\mathbf{h}$ in Equation 1 to calculate the likelihood of observing the visible unit vector $\mathbf{v}^{(n)}$ corresponding to a single data sample from dataset $\mathcal{D}$:

$$p\!\left(\mathbf{v}^{(n)} \mid \theta\right) = \frac{1}{Z}\sum_{\mathbf{h}} e^{-E(\mathbf{v}^{(n)},\mathbf{h})} \qquad \text{(Equation 4)}$$

where $\theta = \{W, \mathbf{b}, \mathbf{c}\}$ is the full parameter set. To simplify notation, we write $\mathbf{v}^{(n)}$ as $\mathbf{v}$ with no loss of generality. To perform training through gradient descent, we need to calculate the gradient of the negative log likelihood for each parameter we wish to update. The partial derivative of the logarithm of Equation 4 takes the form

$$\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial \theta} = -\sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v})\,\frac{\partial E(\mathbf{v},\mathbf{h})}{\partial \theta} + \sum_{\mathbf{v}',\mathbf{h}} p(\mathbf{v}',\mathbf{h})\,\frac{\partial E(\mathbf{v}',\mathbf{h})}{\partial \theta} \qquad \text{(Equation 5)}$$

$$= -\mathbb{E}_{p(\mathbf{h}\mid\mathbf{v})}\!\left[\frac{\partial E(\mathbf{v},\mathbf{h})}{\partial \theta}\right] + \mathbb{E}_{p(\mathbf{v}',\mathbf{h})}\!\left[\frac{\partial E(\mathbf{v}',\mathbf{h})}{\partial \theta}\right] \qquad \text{(Equation 6)}$$

We then calculate the expected values using the entire training set

$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial \log p\!\left(\mathbf{v}^{(n)} \mid \theta\right)}{\partial \theta} \qquad \text{(Equation 7)}$$

which can be used to update the model parameters via gradient descent. The first term corresponds to the expected energy state invoked from observing the data samples, and the second is the expected energy state of the model configurations, both contingent on the current model parameters. As such, they are often called $\langle\cdot\rangle_{\text{data}}$ and $\langle\cdot\rangle_{\text{model}}$, respectively. Calculating the partial derivatives with respect to the parameters gives

$$\frac{\partial E(\mathbf{v},\mathbf{h})}{\partial W_{ij}} = -v_i h_j \qquad \text{(Equation 8)}$$

$$\frac{\partial E(\mathbf{v},\mathbf{h})}{\partial b_i} = -v_i \qquad \text{(Equation 9)}$$

$$\frac{\partial E(\mathbf{v},\mathbf{h})}{\partial c_j} = -h_j \qquad \text{(Equation 10)}$$

which are used to construct the update equations

$$W_{ij} \leftarrow W_{ij} + \nu\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}}\right) \qquad \text{(Equation 11)}$$

$$b_i \leftarrow b_i + \eta\left(\langle v_i\rangle_{\text{data}} - \langle v_i\rangle_{\text{model}}\right) \qquad \text{(Equation 12)}$$

$$c_j \leftarrow c_j + \eta\left(\langle h_j\rangle_{\text{data}} - \langle h_j\rangle_{\text{model}}\right) \qquad \text{(Equation 13)}$$

for learning rates ν and η. The $\langle\cdot\rangle_{\text{data}}$ values can be estimated easily by taking the arithmetic mean over the training samples.
The $\langle\cdot\rangle_{\text{model}}$ terms are generally difficult to calculate as they involve summation over all possible configurations of $\mathbf{v}$ and $\mathbf{h}$. An alternative is to perform Gibbs sampling using the conditional probabilities, as these are far easier to calculate due to the conditional independence between units in the same layer. We can estimate the conditional probability of values of the hidden layer from the visible layer, and vice versa, thus

$$p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}) \qquad \text{(Equation 14)}$$

$$p(\mathbf{v} \mid \mathbf{h}) = \prod_i p(v_i \mid \mathbf{h}) \qquad \text{(Equation 15)}$$

The form of $p(h_j \mid \mathbf{v})$ and $p(v_i \mid \mathbf{h})$ depends on the activation function. This function takes as input the products of the units in one layer and their corresponding weights, and outputs the probability that a unit is active. In this study, we use a logistic sigmoid (or simply “sigmoid”) function, which is given by

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \text{(Equation 16)}$$

where x is dependent on the layer we are sampling, and so the individual hidden and visible probabilities can be written as

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\!\left(c_j + \sum_i v_i W_{ij}\right) \qquad \text{(Equation 17)}$$

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\!\left(b_i + \sum_j W_{ij} h_j\right) \qquad \text{(Equation 18)}$$

A sample is drawn by setting the corresponding unit to 1 with probability given by the value for $p(h_j = 1 \mid \mathbf{v})$ or $p(v_i = 1 \mid \mathbf{h})$ as appropriate. These can then be used to calculate estimates for $\langle\cdot\rangle_{\text{model}}$ by marginalization over the conditional variable. In practice, a full Gibbs sample every update iteration would be prohibitively slow, and so we used an approximation called contrastive divergence,46 in which the Gibbs sampler is initialized using the input data and a limited number of Gibbs steps are performed. In our implementation we use one contrastive divergence step (i.e., k = 1), and so the data (or mini-batches of the data) is presented as a matrix and used to sample the hidden unit values, which are then used to update the values of the visible units. These values are used to update the network parameters using stochastic gradient descent (SGD).51
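A compact numpy sketch of one CD-1 (contrastive divergence, k = 1) update for a Bernoulli-Bernoulli RBM is shown below, using the notation defined above. This is a minimal illustration of the base update, not the study's full implementation, which adds the modifications described in the following sections.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (k=1) step on a mini-batch V (samples x visible units)."""
    # Positive phase: sample hidden units from the data (Equation 17)
    ph_data = sigmoid(c + V @ W)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one Gibbs step back to the visible layer and up again (Equation 18)
    pv_model = sigmoid(b + h_sample @ W.T)
    ph_model = sigmoid(c + pv_model @ W)

    n = V.shape[0]
    # Gradient estimates: <.>_data - <.>_model expectations (Equations 11-13)
    dW = (V.T @ ph_data - pv_model.T @ ph_model) / n
    db = (V - pv_model).mean(axis=0)
    dc = (ph_data - ph_model).mean(axis=0)

    return W + lr * dW, b + lr * db, c + lr * dc
```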
During training, the results of these updates are stored in three matrices that correspond to the weights as well as the network representation of the tumor data at the visible and hidden layers. These matrices correspond to the network reconstruction of the data (visible layer), the latent feature representation of the data (hidden layer), and the input-feature mapping (weights). When the network is trained, these can be extracted and utilized in the analysis.
Modifications to the base RBM
We made a number of simple modifications to the base RBM described above to ensure the feature representation was interpretable, generalizable, stable and reproducible. These modifications are described below.
Data integration
Our data consisted of multiple different modalities; unlike conventional multiomics approaches, which have a large number of data points from a small number of sources, we have a small number of data points from a large number of sources. As such, data integration needed to be carefully considered. The RBM can be modified to incorporate inputs of multiple modalities, sometimes through modification of the energy function.52,53 However, we decided to avoid this complication and standardize all our inputs by ranking all integer and continuous variables prior to rescaling to [0, 1]. Specifically, our transformations were
(1) Binary – used as is (0 or 1),
(2) Integer – rank and scale to [0, 1],
(3) Continuous – rank and scale to [0, 1].
For the integer and continuous cases, we used ranking as this decouples the value from the distribution of the inputs, and after scaling to [0, 1] the new value can be interpreted as the probability that the corresponding visible unit is active. As such, all inputs are treated equally in the machinations of the RBM. These transformations do not affect the hidden units, which remain Bernoulli random variables.
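A minimal sketch of this rank-and-scale transformation is given below; the function name and tie-handling choice (average ranks) are assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def to_unit_interval(column, kind):
    """Rank-based rescaling of an input column to [0, 1]; binary columns pass through."""
    x = np.asarray(column, dtype=float)
    if kind == "binary":
        return x                                # already 0/1
    ranks = rankdata(x, method="average")       # ties share the average rank
    return (ranks - 1) / (len(x) - 1)           # smallest value -> 0, largest -> 1
```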
Non-negative weights
Neural networks are considered black-box approaches, as the transformations they perform are highly complex. To improve interpretability of the network machinations, we imposed a non-negativity constraint on the weight updates, specifically by penalizing negative values. We use an approach in which a quadratic barrier function is subtracted from the likelihood for each negative weight.49 Mathematically, this is written as

$$\log p(\mathbf{v} \mid \theta) - \alpha \sum_{i,j} f(W_{ij}) \qquad \text{(Equation 19)}$$

where α denotes the strength of the penalty, and

$$f(W_{ij}) = \begin{cases} W_{ij}^2, & W_{ij} < 0 \\ 0, & W_{ij} \geq 0 \end{cases} \qquad \text{(Equation 20)}$$

This leads to the update rule

$$W_{ij} \leftarrow W_{ij} + \nu\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}} - 2\alpha W^{-}_{ij}\right) \qquad \text{(Equation 21)}$$

where $W^{-}$ is a matrix containing the negative entries of W, with zeros elsewhere. This formulation is equivalent to an L2-norm penalty on the negative weights, and so penalizes larger negative weights to a greater degree. When used in the training scheme, this coerces network weights to non-negative solutions, simplifying the interpretation of the input-feature map. This can be considered a non-linear extension of non-negative matrix factorization,43 and similarly can be used to represent the underlying structure of the data by its parts, which is synonymous with latent features here.
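In code, the barrier simply adds a term to the usual weight gradient; a minimal sketch (assuming the gradient estimate `dW` from a CD step as above) is:

```python
import numpy as np

def nonneg_weight_update(W, dW, lr=0.01, alpha=0.1):
    """Gradient step with a quadratic barrier on negative weights (as in Equation 21).
    dW is the usual <.>_data - <.>_model gradient estimate."""
    W_neg = np.where(W < 0, W, 0.0)            # negative entries of W, zeros elsewhere
    return W + lr * (dW - 2.0 * alpha * W_neg) # barrier pushes negative weights toward 0
```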
As weights can no longer trade off against each other with counteracting weights of opposing signs, the lowest free-energy state corresponds to a state with minimal redundancy, and so during training the hidden units compete to convey information about a single input.54 This means that each input will only be represented in a small number of latent variables, so when the initial number of hidden units is of similar order to the number of data inputs, this results in some of the biases or weights converging to a negligible value, and the corresponding hidden layer activations converging to an arbitrary fixed value. The latter are then called dead units. This is of fundamental importance to our method as it can be used as an estimate of the intrinsic dimensionality of the data.
Hidden unit pruning
During training, we prune the dead units to improve the speed of the algorithm. However, determining dead units is not straightforward in a probabilistic network such as the RBM, as the values in the network at each state will vary stochastically. To circumvent this, we apply an L1-norm penalty on the hidden unit activations, which penalizes non-zero activation values.55 This coerces the values for all patient samples to be zero, rather than some arbitrary value, and these can then be easily identified and removed with a thresholding approach. This penalty function is calculated over all training data samples, so for consistency with Equation 4 we can formulate the likelihood for each sample as

$$\log p\!\left(\mathbf{v}^{(n)} \mid \theta\right) - \beta \sum_j q_j^{(n)} \qquad \text{(Equation 22)}$$

where $q_j^{(n)} = p(h_j = 1 \mid \mathbf{v}^{(n)})$ and β is a parameter describing the strength of this penalty. We calculate the gradient of the additional likelihood term with respect to each of the hidden unit biases, which is given as

$$\frac{\partial}{\partial c_j}\left(\beta \sum_{j'} q_{j'}^{(n)}\right) = \beta\,\frac{\partial q_j^{(n)}}{\partial c_j} \qquad \text{(Equation 23)}$$

$$= \beta\, q_j^{(n)}\left(1 - q_j^{(n)}\right) \qquad \text{(Equation 24)}$$

We can then write the vector of gradients for all hidden unit biases as $\nabla_{\mathbf{c}} = \beta\,\mathbf{q}^{(n)} \odot \left(1 - \mathbf{q}^{(n)}\right)$. The corresponding update rule can therefore be written as

$$\mathbf{c} \leftarrow \mathbf{c} + \eta\left(\langle \mathbf{h}\rangle_{\text{data}} - \langle \mathbf{h}\rangle_{\text{model}}\right) - \eta\,\nabla_{\mathbf{c}} \qquad \text{(Equation 25)}$$
In our training algorithm, we prune dead units every 50 iterations after the first 1000 iterations.
Sparsity
Sparsity is a desirable property for latent space representations, as it means that the information is conveyed in a concise form. The penalty measure defined in Equation 22 introduces sparsity as it penalizes hidden units which are highly active, thus coercing the network toward a sparse configuration.55 Further sparsity measures were not used in training as the weight matrix, which defines the input-to-feature mapping, will be filtered at a later stage.
Overfitting
A concern with any neural network formulation is the tendency to overfit the data, which in this application would lead to a feature set that was not representative of the true underlying structure, and therefore not generalizable. To mitigate this, we employed a number of countermeasures, namely
(1) DropConnect,
(2) Max-norm regularization,
(3) Bootstrap aggregating,
(4) Early stopping.
With DropConnect,56 a predetermined proportion of weights in the network are randomly set to zero with uniform probability at each training iteration. This helps prevent overfitting by temporarily disrupting correlations between features, so they are more likely to learn features that are independent of the state of other features.
When using max-norm regularization,57 we set an absolute limit on the norm of each weight vector that forms the input to a single hidden unit. If a vector becomes too large, then we rescale the vector so that it obeys the constraint. It is possible for non-negative weights to continue increasing throughout training, as the binary nature of some inputs means that, when present, they already elicit close to the maximal output of the sigmoid activation function, so the precise weight value is irrelevant. Max-norm regularization prevents this occurrence and facilitates comparison between weight matrices of different runs.
For bootstrap aggregating58 (bagging), multiple networks with the same initial architecture were trained on subsets of the data and the outputs amalgamated. In our feature learning representation, we extracted the weight matrix from each of the networks and merged them according to the cosine distance between features.
Finally, when implementing early stopping,59 we need to compare the performance of the network on the training set to the performance on an unseen validation set. If the network performs similarly on the training and validation sets, then it is a good indicator that it will return generalizable outputs. Beginning with the subsets extracted for ensemble learning, we use data omitted when the subset was sampled as the validation set, which is propagated through the network. As the RBM is formulated as an energy-based model, early stopping is predicated on comparing the free energy in the training set to that in the validation set.60 In general overfitting-mitigation strategies, the free energy (or reconstruction error in error-based networks) is monitored, and if the free energy in the training set decreases while the free energy in the validation set increases, that indicates overfitting is occurring and training is stopped. We adopted a more stringent approach in which the samples in the training set are randomly assigned to subsets of equal size to the validation set, so that the free energy values are directly comparable. During training, if the free energy of the validation set increases above the largest free energy of the training subsets for an extended period (10 iterations), then training is stopped, the entire run is discarded, and training is repeated. This means that the network is able to model unseen data (the validation set) as well as it does the training set, when accounting for variation in energy values resulting from sampling the validation set. As our main objective is to derive the input-feature mapping via the weight matrix, discarding any run in which overfitting is suspected avoids the situation in which we retain a weight matrix that has not had time to converge to a solution consistent with those runs that completed without interruption.
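The free energy of a Bernoulli-Bernoulli RBM has a closed form, so the stopping check reduces to a comparison of means. A minimal sketch, assuming the W, b, c notation above and hypothetical function names, is:

```python
import numpy as np

def free_energy(V, W, b, c):
    """Free energy of visible vectors under a Bernoulli-Bernoulli RBM:
    F(v) = -v.b - sum_j log(1 + exp(c_j + (v W)_j))."""
    visible_term = V @ b
    hidden_term = np.logaddexp(0.0, c + V @ W).sum(axis=1)   # numerically stable softplus
    return -(visible_term + hidden_term)

def overfitting_suspected(free_energy_train_subsets, free_energy_validation):
    """Flag when the validation free energy exceeds the largest training-subset free energy."""
    return free_energy_validation.mean() > max(F.mean() for F in free_energy_train_subsets)
```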
Convergence to global solution
As we are training multiple networks and amalgamating the results, it is important that each network converges to the global solution or the results will be incongruous. Furthermore, as the RBM is trained by stochastic gradient descent, it is possible that the algorithm may get stuck in a local optimum. To minimize the chance of this occurrence, we used the cyclical learning rate scheme,61 in which the learning rate for each of the variables oscillates between zero and a maximal value throughout training. The maximal value is subject to decay, so that the maximal learning rate diminishes to zero over the course of training. This approach has been shown to help convergence to the global solution and has the advantage that the learning rate parameters do not need to be tuned.61
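The exact waveform and decay schedule are not reproduced here; the sketch below illustrates the general idea under the assumption of a cosine-shaped cycle and a linearly decaying ceiling, with arbitrary parameter values.

```python
import numpy as np

def cyclical_lr(iteration, cycle_len, lr_max0, n_iterations):
    """Learning rate oscillating between zero and a maximum that itself decays
    to zero over training; the cosine cycle and linear decay are assumptions."""
    lr_ceiling = lr_max0 * (1.0 - iteration / n_iterations)        # decaying maximum
    phase = (iteration % cycle_len) / cycle_len                    # position within cycle
    return lr_ceiling * 0.5 * (1.0 - np.cos(2.0 * np.pi * phase))  # 0 -> max -> 0

lrs = [cyclical_lr(t, cycle_len=50, lr_max0=0.05, n_iterations=1000) for t in range(1000)]
```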
Amalgamation of feature matrices
Each individual network run provides a similar, but not identical, weight matrix. As such, weight matrices from each network run were amalgamated and filtered to form the final input-feature map. The number of features, the inputs they represent, and their magnitude and order would not necessarily be the same in each network, so we constructed an algorithm based on the cosine similarity, which is depicted in Figure S10 and outlined in Algorithm 1. Note that low-magnitude weights were defined as those less than 50% of the maximum weight value for each hidden unit.
Algorithm 1. Pseudocode for amalgamating weight matrices
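The published pseudocode is provided in Algorithm 1 and Figure S10. Purely to illustrate the type of cosine-similarity merging described above, a minimal greedy sketch is given below; the similarity threshold of 0.8, the grouping order, and the function name are assumptions and do not reproduce the published algorithm.

```python
import numpy as np

def amalgamate(weight_matrices, sim_threshold=0.8, low_weight_frac=0.5):
    """Merge per-run weight matrices (inputs x hidden units) by greedily grouping
    hidden-unit weight vectors whose cosine similarity exceeds `sim_threshold`,
    averaging each group, and zeroing weights below `low_weight_frac` of the
    group's maximum (an illustrative sketch only)."""
    columns = np.concatenate(weight_matrices, axis=1).T     # one row per hidden unit
    groups = []
    for col in columns:
        for grp in groups:
            rep = np.mean(grp, axis=0)
            cos = np.dot(col, rep) / (np.linalg.norm(col) * np.linalg.norm(rep) + 1e-12)
            if cos > sim_threshold:
                grp.append(col)
                break
        else:
            groups.append([col])
    merged = np.stack([np.mean(g, axis=0) for g in groups], axis=1)
    merged[merged < low_weight_frac * merged.max(axis=0, keepdims=True)] = 0.0
    return merged

rng = np.random.default_rng(8)
runs = [np.abs(rng.normal(size=(20, 6))) for _ in range(5)]  # 5 runs, 20 inputs, 6 hidden units
W_final = amalgamate(runs)
```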
Synthetic data
To investigate whether our RBM network can identify true associations in data of multiple typeswe trained the network on a synthetic dataset with known associations and data generation methods. The values in the synthetic dataset were generated from function that encapsulated simple relationships when applied to binary latent variables; we also utilized various statistical distributions on top of these relationships to model types like proportions and counts that we might find in the real dataset. The synthetic data is constructed in seven ‘blocks’ to aid interpretation. In the first six blocks5 latent variables are mapped to 10 observed variables in exactly the same way. The difference between the blocks is the statistical distributions used to generate the values in the data. The final block consists of latent variables mapped to distributions from all the previous types. In total there were 72 ‘observed variables’ generated from 34 ‘latent variables’. To generate the data for a single synthetic samplethe 34 binary latent variables were sampled uniformly with probability of the latent variable being 1 set at 0.5. This was done for 200 synthetic samples to create the synthetic dataset.
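As a hedged illustration of this style of construction (not the exact generating functions used in the study), the sketch below samples binary latent variables and maps a few of them to observed variables of different types; the specific thresholds and Poisson rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_latent = 200, 34

# Sample the binary latent variables: each is 1 with probability 0.5.
latent = rng.random((n_samples, n_latent)) < 0.5

# Illustrative mappings for a few observed variables (the full 72-variable
# construction spans seven blocks; these relations are examples only):
obs_identity = latent[:, 0].astype(float)                                     # one-to-one binary
obs_noisy    = (latent[:, 1] & (rng.random(n_samples) > 0.3)).astype(float)   # noisy pass-through
obs_uniform  = latent[:, 2] * rng.random(n_samples)                           # binary -> U[0,1]
obs_counts   = latent[:, 3] * rng.poisson(5.0, n_samples)                     # binary -> counts
obs_counts   = obs_counts / max(obs_counts.max(), 1)                          # rescale by maximum
```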
Latent variable to observed variable mappings
We denote the ith simulated observed variable and the jth latent variable as in the equations below. The first block consists of a simple binary mapping designed to determine if the network can extract the correct relationships with no sources of noise. These relationships are written as
| (Equation 26) |
| (Equation 27) |
| (Equation 28) |
| (Equation 29) |
| (Equation 30) |
| (Equation 31) |
From this we model several different types of unambiguous relationships, as described on the right, including the logical relationships AND, OR, and NOT.
We build the next five blocks on exactly the same relationships. Block 2 introduces noise into the mapping, requiring a uniformly sampled random value to be greater than a threshold, in addition to the latent variable being equal to 1, for the relationship to be passed through to the observed variable(s). The indicator function returns 1 if the condition is fulfilled and 0 otherwise. Block 2 mappings can be written thus:
| (Equation 32) |
| (Equation 33) |
| (Equation 34) |
| (Equation 35) |
| (Equation 36) |
| (Equation 37) |
| (Equation 38) |
Note that where the latent variable maps to many observed variables, a random value is sampled separately for each observed value, so they are correlated rather than identical.
Block 3 reflects binary to continuous value [0,1] relationships in which the observed value(s) are zero if the latent value is zero but take a uniformly distributed random value [0,1] if the latent variable is 1. The mappings therein can be written as:
| (Equation 39) |
| (Equation 40) |
| (Equation 41) |
| (Equation 42) |
| (Equation 43) |
| (Equation 44) |
| (Equation 45) |
Block 4 introduces sampling from a parametric probability distribution, namely the beta distribution, to reflect proportions and other continuous values bounded by 0 and 1. Here we aim to model situations where there are generally lower or higher values depending on the status of the latent variable. Therefore, the main difference between this block and previous ones is that the observed value is sampled from one of two differently parameterized beta distributions depending on whether the latent variable is 1 or 0. The mappings are:
| (Equation 46) |
| (Equation 47) |
| (Equation 48) |
| (Equation 49) |
| (Equation 50) |
| (Equation 51) |
| (Equation 52) |
Block 5 is similar to block 3, where the observed value(s) are zero if the latent feature is zero. However, when the latent feature is 1, the observed value is sampled from a Poisson distribution to simulate count data. All observed variables are rescaled by the maximum value, but we leave this step out of the mapping equations below for simplicity and to aid comparison with the other blocks. We therefore write the mappings as:
| (Equation 53) |
| (Equation 54) |
| (Equation 55) |
| (Equation 56) |
| (Equation 57) |
| (Equation 58) |
| (Equation 59) |
In Block 6 we aim to model the situation where a perturbation to a cellular process elicits different levels of counts (such as gene expression in mRNA data for instance). The format is similar to block 4 but with Poisson distributions used instead of beta distributions. We adopt the same rescaling scheme as used in block 5.
| (Equation 60) |
| (Equation 61) |
| (Equation 62) |
| (Equation 63) |
| (Equation 64) |
| (Equation 65) |
| (Equation 66) |
Finally, we include latent variables that are mapped to generating functions of more than one of the types in the blocks above. Simulated count data is again rescaled as before. The maps are given as
| (Equation 67) |
| (Equation 68) |
| (Equation 69) |
| (Equation 70) |
RBM results on synthetic data
We trained 2000 networks using 80% of the synthetic data as the training set (chosen uniformly at random). The remainder of the data was used as a validation set for early stopping using the procedure described above. To investigate whether overfitting occurs and whether our early stopping procedure could identify potential overfitting, we allowed training to continue when overfitting was suspected and monitored the behavior of the free energy. We found that overfitting was suspected in 123/2000 (6.15%) of the training runs, and early stopping would have been invoked in these cases. We investigated what occurred by plotting the free energy of the training sets along with that of the validation set. We provide an example of a well-behaved profile (Figure S11A) and some examples of training runs that would have been stopped in Figures S11B–S11D. We observe that the free energy values oscillate with the periodicity of the cyclical learning rate described above, and in the second half of training there are iterations where the free energy values decrease significantly across all sets, which corresponds to the removal of dead hidden units in network pruning. In the samples where overfitting was not suspected, the free energy of the validation set decreased at the same rate as the free energy values of the training sets, remaining below the maximum value set by the training sets. This is exactly how we would expect the free energies to behave if there were no overfitting.
In training runs where overfitting was suspected, we found that this was generally because the free energy of the validation set had increased to a slightly higher value than the maximum value of the training sets; this always occurred during the latter half of training when network pruning was taking place (as in Figures S11B and S11C). This could indicate that the network is starting to overfit. Occasionally there were runs where the free energy of the validation set was notably higher than the maximum of the training sets (e.g., Figure S11D), which is more concerning and could indicate a higher degree of overfitting. However, in every case we noted that the general trend of the free energy of the validation set was to decrease until training stopped at the maximum number of iterations; this actually indicates that the data are not being overfit, as we would expect the free energy to increase if the model were losing generalizability by incorporating aspects only present in the training sets. Therefore, it is inconclusive whether these runs are actually overfit. Nonetheless, our early stopping procedure is very conservative and would have caught and removed these suspect runs, leaving only indisputably non-overfit runs, and so we are confident that overfit networks will not contribute to the final results.
We next performed training with early stopping enforced to investigate how well the network captures known relationships in the data. When training was complete, we amalgamated the weight matrices using the procedure described in Figure S10, and the final input-feature encoding is given in Figure S12.
We can use the direct binary mappings of Block 1 to investigate how the RBM attempts to encode the relationships provided in the hidden data. The algorithm unambiguously identifies the one-to-one mapping of feature 1 to input 1 of this block, as well as the one-to-many mapping from feature 2 to inputs 2, 3, 4, and 5. In features 3 and 4, the algorithm can identify that inputs 6 and 8 arise from the same feature, as do inputs 7 and 8, but inputs 6 and 7 are not directly associated. The algorithm cannot identify the inverse relationship between inputs 9 and 10 and encodes them in two separate features (5 and 6). This is consistent with the way our approach is constructed, as a positive weight matrix can only encode relationships in which an active feature leads to active inputs, rather than an inactive feature giving rise to active inputs.
A similar pattern is seen in blocks 2–6, where the algorithm is generally able to extract the correct (or logical) associations encapsulated in the features. There are two exceptions to this, which we have highlighted with annotations A and B. Annotation A shows a different encoding of the situation in which two original features both map to the same input (as in features 3 and 4 in block 1). Here, these are encoded as three inferred features: the first two encode a strong association with the input exclusive to each feature and a weak association with the shared input, and the third encodes a strong association with the shared input with weak associations to both of the exclusive inputs. It is important to note that, although this encoding does not precisely replicate the original input mapping, the network has still learned a logical way of encoding this relationship. Therefore, we do not consider this encoding incorrect. Annotation B shows inputs that are not assigned to any feature. Both of these are one-to-one mappings in block 6 (Poisson low/high), indicating that these relationships in this data type might be too subtle for the algorithm to distinguish. Note that the one-to-one mapping of the 9th input in the block and the one-to-many mapping are identified correctly. In block 7, the block consisting of one-to-many mappings of data of multiple types, the algorithm identifies the correct number of features (4), and the correct inputs are assigned to each feature. These results provide empirical evidence that our RBM algorithm can extract relationships across a number of data types.
Consistency of results
There are several sources of variation between runs on the same dataset: the RBM is intrinsically a probabilistic network, training it is a stochastic process, and we are using different subsets of data in each run. However, although variation between the features extracted in each run is inevitable, it is important that they are consistent as an ensemble so we can be confident that the result is stable and the final amalgamated weights reflect the true relationships in the data.
To quantify the consistency between feature sets, we use an approach based on the Hamming distance. If we describe a latent feature as a binary string that is equal to 1 where a non-zero weight is present and zero elsewhere, we can then calculate the Hamming distance between individual features, which returns the number of input mappings, m, that are not shared between those features. We can use this to identify which feature in one set is equivalent to a given feature in another set (as that pair has the minimum Hamming distance) and then identify the feature that is most incongruous with its equivalent feature (as this has the maximum Hamming distance). The Hamming distance of this most incongruous feature can be expressed as
| (Equation 71) |
We can use this metric to assess how the input-feature mapping differs between runs. We randomly sampled weight matrices (without replacement) from the 2000 matrices generated by networks trained on the synthetic data and amalgamated these as described above. We repeated this 100 times to give 100 feature sets, and then calculated the metric for all pairwise combinations of i and j. Of the 10,000 resulting comparisons, the greatest number of differences between equivalent features was 2, which occurred 72 times (0.72% of the time). Figure S13 shows the histogram of the values, which reveals that the most common difference was 1, occurring 6001 times (60.01%). There was no difference in the feature maps in 39.27% of the runs. This is remarkably consistent given the aforementioned sources of variation.
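A minimal sketch of this comparison is given below; the binary feature strings are arbitrary examples and the function name is illustrative.

```python
import numpy as np

def max_min_hamming(features_i, features_j):
    """For two feature sets (hidden units x inputs, entries 0/1), find for each
    feature in set i its closest match in set j (minimum Hamming distance) and
    return the largest of those distances, i.e. the most incongruous pairing."""
    d = np.sum(features_i[:, None, :] != features_j[None, :, :], axis=2)
    return int(d.min(axis=1).max())

# Example with two small binary feature sets (illustrative values).
A = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
B = np.array([[1, 1, 0, 0], [0, 1, 1, 1]])
print(max_min_hamming(A, B))  # -> 1
```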
Network training on real data
We used the real data to train several versions of the network across a number of runs to obtain the distinct outputs used in the analysis. The goal of the first run was to extract the amalgamated weight matrix that describes the input-to-feature mapping (Figure S2). We used this to determine the latent feature representation used in the clustering (MP Figure 1) by training a new network run with the weight matrix initialized to the amalgamated weight matrix and the weight learning rate set to zero. Learning of the biases was enabled, as these may differ from the biases in the previous networks due to the removal of low-magnitude weights. Once the remaining network parameters have converged during training, taking further iterations is equivalent to sampling the hidden unit/feature representation for each patient. We therefore averaged the hidden unit values taken every 10 iterations during the final 1000 iterations to obtain the final feature representation.
The input-feature map was used to extract an informed subset of genetic alterations from the original inputs; these were used to determine associations in the ARBS analysis (MP Figure 2), as well as the inputs for the Ordering analysis (MP Figure 3).
Two-stage clustering
The dimensionality of the feature representation is still quite large for conventional clustering techniques. Therefore, we adopted a two-stage approach in which we first clustered by those features that were most informative of clinical outcome, calculated the centroids of these first-stage clusters for all features, and then clustered these in the second stage of clustering to produce MP Figure 1. Here we provide more details on the identification of informative features using a discrimination score and the clustering methods used.
Discrimination score
There have been several methods proposed for quantifying the relative importance of the units of a neural network.62 However, most of these are generally formulated to discover the inputs that are important in discerning the output.63,64 In our application, we wish to quantify the discriminative capacity of each of the features (hidden layer) with respect to the clinical outcome. As we utilize non-negative weights to determine the relevance of the inputs to the hidden units in the feature extraction, for consistency we adopt a similar strategy to determine the relevance of the hidden units to adverse clinical outcome as determined by biochemical relapse.
To obtain the discrimination scores for each feature, we modified the architecture of the base RBM so that it was similar to ClassRBM.65 This adds an extra classification layer, which is fully connected to the hidden layer and whose units contain the values of the classes. In ClassRBM, there is another set of weights that denote the strength of the connection between the hidden and classification layers, and these are trained in the same bi-directional fashion as the input weights.
However, in our application we wish to uncover underlying relationships in the data (encapsulated by the features) in an unbiased way and then determine how relevant these features are to the clinical outcome. We therefore performed the learning of the latent feature representation and of the discrimination scores separately, so that the latent representation remains unbiased by knowledge of the clinical outcome and the algorithm for feature learning described above can still be considered unsupervised. This was done by fixing the input weights to the amalgamated weight matrix described above and then training the class weights using contrastive divergence as described above.
We also enforced a non-negativity constraint on these class weights, similar to the input weights. To obtain our discrimination score, we take the absolute value of the weights corresponding to relapse minus the weights corresponding to no relapse. This can be expressed mathematically as
$s_j = \left| u_j^{\mathrm{relapse}} - u_j^{\mathrm{no\ relapse}} \right|$ (Equation 72)
where $u_j^{\mathrm{relapse}}$ are the class weights associated with relapse, and $u_j^{\mathrm{no\ relapse}}$ are those associated with no relapse.
These values can be considered as a heuristic quantity relating the importance of the corresponding feature to the clinical output, similar to how component loadings quantify the explained variance of the corresponding principal component in principal component analysis (PCA). As there is no set rule for determining the number of features, we followed a similar approach to that conventionally used in PCA and selected the number of features using the cumulative distribution. We chose a cutoff of 0.9 of the total cumulative discrimination score, which resulted in 14 of 30 features being selected for the initial clustering phase.
Clustering
Clustering of tumors was performed on the latent feature representation in a two-stage process to facilitate the identification of clusters that were relevant to clinical outcome. As the feature representation for each patient can be considered as a vector containing the probabilities that the corresponding features are active, it is appropriate to use a distance measure that quantifies the distance between probabilities. As such, we calculated the mean Jensen-Shannon (J-S) divergence66 between tumors in a pairwise fashion.
For a pair of patients A and B, represented by their latent feature representations in the hidden layer, the mean J-S divergence can be written as
| (Equation 73) |
where the midpoint vector is the element-wise mean of the two feature representations. The additive terms in the square brackets in Equation 73 represent the Kullback-Leibler divergence between each element of the latent feature representation for either patient and the corresponding element of the midpoint vector.
As we are not using a Euclidean distance metric, clustering through k-means is not appropriate, and so we used k-medoid clustering for the first stage; this is similar to k-means but selects a representative data point (medoid) as the centroid for each cluster instead of the mean. Using the silhouette method,67 we determined that 11 clusters was optimal. For the second stage of clustering, we used hierarchical clustering to cluster the medoids themselves (again using the J-S divergence), and this was used to generate and order the clusters by the dendrogram in MP Figure 1.
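A minimal sketch of the pairwise divergence calculation is given below, under the assumption that each element of the feature vector is treated as an independent Bernoulli probability; the resulting precomputed distance matrix could then be passed to a k-medoid implementation (e.g., scikit-learn-extra's KMedoids with metric='precomputed'), although the text does not specify which software was used for this step.

```python
import numpy as np

def mean_js_divergence(hA, hB, eps=1e-12):
    """Mean Jensen-Shannon divergence between two latent feature vectors,
    treating each element as a Bernoulli probability (an assumption of this sketch)."""
    hA, hB = np.clip(hA, eps, 1 - eps), np.clip(hB, eps, 1 - eps)
    m = 0.5 * (hA + hB)                                    # element-wise midpoint
    kl = lambda p, q: p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    return float(np.mean(0.5 * kl(hA, m) + 0.5 * kl(hB, m)))

# Pairwise distance matrix over patients, ready for medoid-based clustering.
H = np.random.default_rng(3).random((5, 14))               # 5 patients x 14 features
D = np.array([[mean_js_divergence(a, b) for b in H] for a in H])
```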
DNA breakpoint proximity to androgen receptor binding site
To examine the proximity of DNA breakpoints to androgen receptor binding sites (ARBS), we designed a permutation approach that quantifies the departure from a random distribution of the breakpoints across the genome. We downloaded processed ChIP-seq data targeting AR for 13 primary prostate cancer tumors from the Gene Expression Omnibus (GEO: GSE70079)68 and amalgamated them for use as the ARBS locations. For each of our 159 samples, we simulated the scenario whereby breakpoints were randomly distributed across the genome. The simulation of breakpoints was performed chromosome-wise. For each chromosome, we simulated 1000 sets of N breakpoint positions, where N is the number of breakpoints we observed on that chromosome. These positions are randomly distributed (with a uniform distribution) across the full chromosome; therefore, the simulations intrinsically take into account the size of the chromosomes. We used the R package regioneR69 with genome assembly GRCh37, masked for assembly gaps (AGAPS mask) and intra-contig ambiguities (AMB mask), to keep the possible chromosomal locations consistent with what could be observed in the real data.
To detect significant departure from a uniform random distribution, we calculated the proportion of breakpoints within 20,000 base pairs (bp) of an ARBS for the observed and the permuted data. If the observed proportion was significantly greater than the permuted proportions, the tumor was classified as Enriched; if it was significantly lower, the tumor was classified as Depleted. Otherwise, the difference was not significant and the tumor was classified as Indeterminate. The level of enrichment or depletion of breakpoints in the proximity of ARBS used in MP Figure 2A was estimated according to the following formula:
| (Equation 74) |
To check that the method was not inherently biased, we performed the analysis on the breakpoints derived from the UK dataset (as in MP Figure 2) and compared these to the classes derived if the positions of the AR-binding sites were distributed uniformly across the genome (Figure S14). As expected, we find that almost all tumors in the randomized set were classed as Indeterminate, in stark contrast to the real data.
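A simplified sketch of the permutation comparison for a single chromosome is given below; it ignores the assembly masks applied by regioneR, and the chromosome length, ARBS count, and breakpoint count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

def proximity_fraction(breakpoints, arbs, window=20_000):
    """Fraction of breakpoint positions lying within `window` bp of any ARBS."""
    arbs = np.sort(arbs)
    idx = np.searchsorted(arbs, breakpoints)
    left = np.abs(breakpoints - arbs[np.clip(idx - 1, 0, len(arbs) - 1)])
    right = np.abs(arbs[np.clip(idx, 0, len(arbs) - 1)] - breakpoints)
    return float(np.mean(np.minimum(left, right) <= window))

# One chromosome: compare the observed fraction with 1000 permutations in which
# the same number of breakpoints is placed uniformly along the chromosome.
chrom_len = 50_000_000
arbs = np.sort(rng.integers(0, chrom_len, 300))
observed_bp = rng.integers(0, chrom_len, 40)
p_obs = proximity_fraction(observed_bp, arbs)
p_perm = np.array([proximity_fraction(rng.integers(0, chrom_len, len(observed_bp)), arbs)
                   for _ in range(1000)])
# Comparing p_obs with the permutation distribution then yields the
# Enriched / Depleted / Indeterminate call described above.
```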
The agglomerative hierarchical clustering of the ARBS groups across the Australian, Canadian, and UK datasets was generated using the R package pvclust70 v2.0.0 with the ward.D2 clustering method and squared Euclidean distance (100,000 iterations). This package also enabled the estimation of the Approximately Unbiased (AU) multiscale bootstrap p values for the Depleted group. These clustering results were confirmed by a partitional clustering approach using the R packages cluster v2.1.0 and factoextra v1.0.5.
ARBS pairs required for DNA loop formation
In MP Figure 5 we investigated the role of AR in the formation of DNA loops that can precipitate DNA double-strand breaks (DSBs). For each ARBS that was previously identified as proximal to a DNA breakpoint in each patient, we determined the proportion that were also proximal to another ARBS, as required by the mechanism for DNA loop formation described previously21 and depicted in MP Figure 5A. As it has been shown that the two ARBS involved in the DNA loop underlying the TMPRSS2/ERG fusion are separated by 19,972 base pairs,21 we set the threshold for the second proximal ARBS at 30,000 base pairs in the direction away from the DNA breakpoint. p values were determined as before.
In all boxplots, the red line denotes the median, the blue box encapsulates the interquartile range, and the black dashed lines denote the range of data not considered outliers; outliers (red dots) are as defined in the MATLAB boxplot() function default settings. The size of the angular 'notch' corresponds to a confidence interval around the median, such that if two notches do not overlap there is approximately 95% confidence that the medians of the two groups differ. The folded notch in MP Figure 5D indicates that the notch extends past the interquartile range by the folded amount.
Ordering
We previously estimated a consensus ordering of events by estimating phylogenetic trees from the cancer cell fraction (CCF) of the cells that contained each aberration, and applying the Bradley-Terry model to determine the most consistent order of events.11 We recently released a study71 that improves this approach using a Plackett-Luce model,72,73 which we also utilize in this study. We provide a complete description of the method for reproducibility.
There are a number of sources of uncertainty when attempting to determine the order of events from bulk DNA sequencing. In particular, we often cannot infer the true phylogenetic tree for each patient, and furthermore it is impossible to determine the relative timing of events on parallel branches. However, we can estimate the set of possible trees using the relative cancer cell fractions (CCFs) of the genomic aberrations involved, and from these we can estimate a set of possible orderings. Therefore, we created an algorithm in which we sampled a single possible tree from the data and, using this, sampled a viable order of events for each patient. This is repeated multiple times so that the uncertainty in these estimates is encapsulated in the output distributions. Algorithms of this type are called Monte Carlo simulations, to emphasize the use of randomness in the procedure.
We adopted the Plackett-Luce model72,73 to construct a probability distribution over the relative rankings of a finite set of items, the parameters of which can then be estimated from a number of individual rankings. This can be used to quantify the expected rank of each item relative to the others across the population. In our application, an item corresponds to an event, namely the emergence and fixation of a novel copy number alteration (CNA) identified in the extracted features. Ranking these events therefore relates to the order in which they would be expected to occur. We also utilized a Plackett-Luce mixture model,20 which allows us to determine whether there are subpopulations in the data with different orderings.
The Plackett-Luce model
Given a set of CNA occurrences for each patient with associated subclonality, we would like to infer the order in which these events generally occur. To do this, we used a Plackett-Luce model, which is formulated as a ranking method and returns a value quantifying the ranking preference. We use a different interpretation, namely the ordering, which is defined as the inverse of the ranking preference.74 Like the Bradley-Terry model, the Plackett-Luce model does not return any temporal information beyond the expected order of events.
We have a set of N copy number events we are interested in,
$C = \{c_1, c_2, \ldots, c_N\}$, (Equation 75)
then we can apply Luce’s choice axiom,72 which states that the probability of selecting one event over another from a set of events is independent of the presence or absence of the other events in the set. We can therefore write the probability of observing event i as
$P(c_i) = \dfrac{\alpha_i}{\sum_{j=1}^{N} \alpha_j}$ (Equation 76)
where $\alpha_i$ is the coefficient that quantifies the relative probability of observing the event. To reflect the ordering aspect of our application, we refer to this value as the proclivity. Plackett73 used this formalism to construct a generative model in which all N events are randomly sampled from C without replacement (i.e., a permutation). If we let $\sigma$ denote such a permutation of the set C, we write the probability density of a single ordering as
$P(\sigma) = \prod_{k=1}^{N} \dfrac{\alpha_{\sigma(k)}}{\sum_{c_i \in C_k} \alpha_i}$ (Equation 77)
where $\alpha_{\sigma(k)}$ is the proclivity associated with event $\sigma(k)$ and $C_k$ is the set of possible events remaining after the first $k-1$ events have occurred.
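As an illustration of how the probability of one particular ordering follows from the proclivity values, a minimal sketch is given below; the event names and proclivity values are hypothetical.

```python
import numpy as np

def plackett_luce_logprob(order, proclivity):
    """Log-probability of observing `order` (a sequence of event keys) under a
    Plackett-Luce model with positive `proclivity` values, using the sequential
    sample-without-replacement construction described above."""
    remaining = list(order)
    logp = 0.0
    for event in order:
        logp += np.log(proclivity[event]) - np.log(sum(proclivity[e] for e in remaining))
        remaining.remove(event)
    return logp

# Example: three events where event "a" tends to occur first (illustrative values).
proclivity = {"a": 5.0, "b": 2.0, "c": 1.0}
print(plackett_luce_logprob(["a", "b", "c"], proclivity))   # more probable ordering
print(plackett_luce_logprob(["c", "b", "a"], proclivity))   # less probable ordering
```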
Plackett-Luce mixtures
We hypothesized that there may be more than one set of copy number orderings present in our population, and so analyzing all events in one ordering scheme may not be appropriate. Furthermore, the inhibition of AR-associated breakpoints implies that some CNAs may be found more frequently with a select set of others, which is in violation of Luce's choice axiom. We therefore implemented a mixture modeling approach,20,74 which reinstates Luce's choice axiom, as the selection of each CNA can be considered independent conditional on the mixture component. Such a finite mixture model assumes that the population consists of a number, G, of subpopulations. In this setting, the probability of observing the ordering $\sigma_s$ for sample s is
$P(\sigma_s) = \sum_{g=1}^{G} w_g\, P(\sigma_s \mid g)$ (Equation 78)
where $w_g$ are the weight parameters (not to be confused with the weight matrix in the RBM) that quantify the probability that sample s belongs to subgroup g. The appropriate parameter values can be determined using maximum likelihood estimation via an EM algorithm.20 The number of mixture components can be chosen using the Bayesian Information Criterion (BIC), which is given by
$\mathrm{BIC} = N \ln M - 2 \ln \mathcal{L}(\hat{\theta})$ (Equation 79)
where $\hat{\theta}$ is the parameter set that maximizes the log likelihood $\mathcal{L}$, N is the number of parameters, and M is the number of samples.
Implementation
The general formulation of the Plackett-Luce model takes a matrix containing the sequence of events for each patient as its input. However, we do not know the order in which these events occurred, only the presence and cancer cell fraction (CCF) of each CNA for each patient. As such, we first estimate the phylogenetic trees for each patient and then determine the order of events from these. As we only have one tissue sample for each patient, there is often uncertainty in the tree topology and the possible sequence of events, and so we use a Monte-Carlo sampling scheme in which we sample the trees and sequences of events and use these to estimate the distribution of possible orderings through the Plackett-Luce model. Samples with 0 or 1 CNA were not used in this analysis.
Another issue arises due to censoring, which occurs when the sample is taken before all aberrations that would occur have occurred, resulting in missing data. These are called partial orderings in the Plackett-Luce framework, and the general approach to addressing this is to reformulate the model so that all missing events are implicitly ranked lower than the observed data.20,37 This may not be appropriate for our analysis, as we may have multiple subgroups, and we anticipate that distinct aberrations may have similar or equivalent effects in each subtype and thus will rarely co-occur despite being indicative of the same type. For instance, the absence of a very early aberration may be due to the occurrence of another, less frequent aberration, so including it at the bottom of the order would bias the rankings toward more frequent aberrations. As such, our algorithm works in two phases:
(1) Determine the number of mixture components and assign patients to each component,
(2) Estimate the ordering profiles of each component.
These are distinct because we treat the creation of the phylogenetic trees in a slightly different way in each of these processes to account for censoring. When estimating the number of components, we calculate trees using only the observed CNAs. However, when estimating the full ordering profiles, we introduce another sampling step into our Monte-Carlo scheme, in which we explicitly sample a number of additional CNAs with probability proportional to the subclonality of the aberration in tumors of each mixture component. Sampling in this way reduces the bias toward more frequent aberrations.
Assign samples to mixture components
In the first phase, we:
(1) Sample phylogenetic trees for each patient,
(2) Sample sequences of events for each patient that are consistent with the trees,
(3) Calculate the Bayesian Information Criterion (BIC) for 1–10 mixture components,
(4) Repeat steps 1–3 1000 times,
(5) Determine the number of mixture components that consistently had the lowest BIC score,
(6) Assign patients to mixture components.
The phylogenetic trees are created by initially sorting the CNAs of each patient in descending order of CCF obtained from the output of the Battenberg algorithm, iterating through them, and sampling the possible parents with uniform probability. The summed CCF of the children cannot be greater than the CCF of their parent, so viable parents are defined as those whose CCF is greater than the summed CCF of their current children plus the CCF of the CNA under consideration. The position in the sequence at which the CNA occurred is sampled, with uniform probability, as any position after the parent. The ordering estimates and assignment to the mixture components were performed using the R package PLMIX, as this incorporates mixture models and partial rankings (so the absence of a CNA from a sequence would not penalize its position in the ordering). A vector of assignments was retained for each sample run, and the final assignment was determined by the most frequent assignment over the course of 1000 runs.
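A minimal sketch of this tree- and order-sampling step is shown below, written in Python rather than the R used for the ordering estimates; the event names, CCF values, and the handling of the tree root are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_event_order(ccf):
    """Sample one sequence of CNA events consistent with a randomly sampled tree:
    sort by descending CCF, pick a viable parent uniformly at random, then place
    the event at a random position after that parent. `ccf` maps event -> CCF."""
    events = sorted(ccf, key=ccf.get, reverse=True)
    order, children_ccf = [], {"root": 0.0}
    for ev in events:
        viable = [p for p in children_ccf
                  if p == "root" or ccf[p] >= children_ccf[p] + ccf[ev]]
        parent = rng.choice(viable)
        children_ccf[parent] = children_ccf.get(parent, 0.0) + ccf[ev]
        children_ccf[ev] = 0.0
        lo = order.index(parent) + 1 if parent != "root" else 0
        order.insert(rng.integers(lo, len(order) + 1), ev)
    return order

print(sample_event_order({"loss_8p": 0.9, "gain_8q": 0.7, "loss_13q": 0.3}))
```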
Estimate ordering profiles of each component
In the second phase, we:
(1) Sample phylogenetic trees for each patient,
(2) Sample sequences of events for each patient that are consistent with the trees,
(3) Augment each sequence with additional CNAs to alleviate censorship bias,
(4) Calculate ordering profiles for each mixture component,
(5) Repeat steps 1–4 1000 times,
(6) Amalgamate the results to determine the final ordering profiles of each mixture component.
The phylogenetic trees and sequences of events were initially determined as before. However, instead of utilizing partial rankings in the PL model, we explicitly augmented the data with additional CNAs to account for those unobserved due to censorship. The probability of a CNA being added to the sequence of events is equal to the proportion of its subclonal occurrences relative to its total number of occurrences in the subpopulation defined by the mixture component. This can be written as
$p_{c,g} = \dfrac{n^{\mathrm{sub}}_{c,g}}{n^{\mathrm{tot}}_{c,g}}$ (Equation 80)
where $n^{\mathrm{sub}}_{c,g}$ and $n^{\mathrm{tot}}_{c,g}$ denote the number of subclonal and total occurrences, respectively, of CNA c in mixture component g. As events that are predominantly subclonal have a higher chance of being unobserved due to censorship, this sampling scheme will mitigate this to a degree. Conversely, events that are predominantly clonal (i.e., early) may be unobserved due to factors other than censoring, and these have a reduced chance of being imputed. Calculating these values using the patient samples for each mixture component, rather than the entire population, means that only the CNA subclonalities relevant to each subpopulation are considered. Imputation is performed by drawing a uniform random number, r, for each patient and including the CNA in the set of additional CNAs for that patient if $r < p_{c,g}$. The set of additional CNAs for each patient is shuffled uniformly and added to the sequence. We then calculate the ordering for each mixture component individually using the Plackett-Luce model without partial ranking. This process is repeated 1000 times, and the proclivity for each CNA is calculated and used to create an empirical distribution of proclivity for each CNA, which is used to create the box plots in MP Figure 3.
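A minimal sketch of this imputation step is given below; the CNA names and occurrence counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

def imputation_probs(occurrences):
    """p_c = (# subclonal occurrences of CNA c) / (total occurrences of c) within
    one mixture component; `occurrences` maps CNA -> (n_subclonal, n_total)."""
    return {c: n_sub / n_tot for c, (n_sub, n_tot) in occurrences.items()}

def augment_sequence(observed, probs):
    """Append unobserved CNAs, each with probability p_c, in a random order,
    to mitigate censoring bias (a sketch of the augmentation step above)."""
    extra = [c for c, p in probs.items() if c not in observed and rng.random() < p]
    rng.shuffle(extra)
    return list(observed) + extra

probs = imputation_probs({"loss_13q": (30, 40), "gain_8q": (5, 50), "loss_8p": (10, 45)})
print(augment_sequence(["loss_8p"], probs))
```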
Synthetic data
We generated a synthetic dataset to evaluate our method. We simulated two subpopulations, A and B, that each had exclusive sets of 5 'early' and 5 'late' CNAs, as well as common sets of 5 early and 5 late CNAs. These can be written as
| (Equation 81) |
| (Equation 82) |
| (Equation 83) |
| (Equation 84) |
| (Equation 85) |
| (Equation 86) |
We used these sets to simulate the order of the events in each tumor and then created a set of CCF values consistent with this order. The main principle in the simulation is that early events will generally occur before late events, but there is no intrinsic order within the sets of early and late events themselves.
To obtain the set of events for each simulation, Y, we first sampled the number of events, e, that had occurred by the time of sampling from a Poisson distribution, followed by the subpopulation S to draw from. To reflect the early/late ordering and exclusive/common nature of the CNAs, we first sampled 5 events (or e events if fewer than 5 had occurred) from the pooled set of common and early events for that subpopulation, without replacement. If more than 5 events had occurred, we then pooled the early events that had not already been sampled with the set of late events and sampled the remaining events from this set. We then randomly sampled the order in which the early events occurred, followed by that of the remaining events. We then sampled, uniformly at random, how many of these events were clonal and assigned these a CCF of 1. To obtain the CCF values of the subclonal population, we ranked the subclonal events and assigned a CCF value as a linear function of their rank.
We created synthetic CCF values for 200 tumors and used these as inputs into our algorithm described above. The BIC scores are shown in Figure S15A, where two mixture components have the lowest BIC score, and so the algorithm has identified the correct number of subpopulations. These subpopulations were then used to establish ordering profiles, which are shown in Figures S15B and S15C. We find that the algorithm has correctly identified all of the unique early and late events for each subpopulation, as well as the common early and late events.
BIC scores for real data
Bayesian Information Criterion (BIC) scores determined for each number of mixture components in each of the 1000 runs are shown in Figure S16. The BIC score was lowest for two mixture components for every sampled ordering, and so this was taken as the value used in subsequent analysis.
Statistical model of evotype convergence
We created a statistical model describing how the probability of convergence to the Canonical- or Alternative-evotypes changes as genetic alterations accumulate (MP Figure 4D). We assume that the accumulation of such aberrations in each individual tumor followed a stochastic process in which the order and relative timing of the aberrations occurred with some degree of randomness. Similar to the Ordering analysis, we utilized a statistical algorithm in which we simulated a number of possible aberrations consistent with the possible phylogenetic trees and then estimated the probability that tumors with these aberrations converged to the Canonical-evotype (the probability of convergence to the Alternative-evotype is 1 minus the probability of convergence to the Canonical-evotype). As many of the genetic alterations occur clonally (i.e., with a CCF of 1), there is considerable uncertainty in the order in which they would have occurred. We incorporate this uncertainty into our approach using a method akin to propagation of errors through Monte Carlo simulation,75 in which we sample (with uniform probability) from the space of possible trees for each tumor and calculate the probability of tumors with those genetic alterations converging to the Canonical-evotype. Repeating this random sampling many times provides a distribution of outputs that incorporates the uncertainty arising from our inputs in a principled fashion, enabling downstream analysis.
Our algorithm is outlined in Figure S17. The algorithm iterates through an increasing number of aberrations (Loop i), performing several Monte-Carlo repeats of ordering samples (Loop j).
As individual evolutionary trajectories involve the stochastic accumulation of multiple genomic aberrations, it is impossible to specify each evolutionary route. However, we can determine common modes of evolution by tracking the genetic alterations prevalent in tumors at the point of convergence to either evotype in our model. Through this, we can identify paths in the probability density surface plot that correspond to the accumulation of these genetic alterations (black dashed lines, MP Figure 4D), and the aberrations that distinguish them from each other.
Modeling the stochastic accumulation of genomic aberrations
We model the accumulation of aberrations in a tumor as a Poisson process.76 For each iteration of Loop i we update the mean number of aberrations; this is then used as the input parameter to a Poisson random number generator to draw the number of aberrations to be sampled, n, in each iteration of Loop j. We then identified those tumors with sufficient aberrations, selected one with uniform probability, and used its data to sample a phylogenetic tree using the relative CCFs of the aberrations. We then used the phylogenetic tree to sample an order of occurrence for the aberrations and retained the first n. The aberrations used were the SPOP mutations and the CNAs identified in the feature extraction; inter-/intra-chromosomal breakpoints, ETS status, and chromothripsis were not included, as these do not have associated CCFs and therefore cannot be used to determine the order of events.
We again use the data to calculate the probability that tumors with the set of aberrations will be assigned to the Canonical-evotype. For a set of sampled aberrations, we identified the patients in which the full set of aberrations present in that patient contained all of the sampled aberrations. We can then identify which of these were assigned to the Canonical-evotype. We can now calculate the probabilities
| (Equation 87) |
| (Equation 88) |
where the counts denote the number of tumors that obey the condition in brackets. We can now calculate the conditional probability
| (Equation 89) |
We performed 100,000 samples and thus obtained 100,000 values for each n. We input these values into a nonparametric density estimation scheme using Gaussian kernels with bandwidth 0.025. As we are estimating the probability density function of a set of probabilities, which are bounded by 0 and 1, we ensured support only over this interval using the reflection method.77 We performed this sampling step for each value of n considered.
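A minimal sketch of a Gaussian kernel density estimate on [0, 1] using the reflection method is given below; the bandwidth of 0.025 follows the text, but the beta-distributed example values and sample size are arbitrary.

```python
import numpy as np

def reflected_kde(samples, grid, bandwidth=0.025):
    """Gaussian kernel density estimate on [0, 1] using the reflection method:
    samples are mirrored about both boundaries so that no probability mass
    leaks outside the interval."""
    augmented = np.concatenate([samples, -samples, 2.0 - samples])
    diffs = (grid[:, None] - augmented[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernel.sum(axis=1) / len(samples)   # normalize by the original count

probs = np.clip(np.random.default_rng(7).beta(2, 5, 10_000), 0, 1)
grid = np.linspace(0, 1, 201)
density = reflected_kde(probs, grid)
```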
Identifying genetic alterations in the convergent evolutionary trajectories
We used our model simulations to investigate the common evolutionary trajectories involved in convergence to each evotype (black dashed lines, MP Figure 4D), as well as the aberrations that characterize them. In the modeling process, we recorded the order of genetic alterations for each of the trajectories used to calculate the pdf. We extracted each trajectory that had converged to the Canonical- or Alternative-evotypes (i.e., had a convergence probability of 0 or 1) and assigned these into sets according to the number of genetic alterations in the trajectories. We then ran a filtering step for each set, in which we removed any trajectories that had occurred in sets corresponding to fewer genetic alterations, meaning that we were left with trajectories that only converged to either evotype with the final genetic alteration for each set. We can then identify the position and frequency of occurrence of each genetic alteration in each set. The results are plotted in the bottom pane of Figures S7 and S8 for Canonical- and Alternative-evotype tumors, respectively. Using this information, we can calculate the pdf values for frequent combinations of genetic alterations in order and use these to create the representative paths through the probability density (black dashed lines; MP Figure 4D).
Acknowledgments
The authors would like to thank those men with prostate cancer and the subjects who have donated their time and samples for this study. We acknowledge support from Cancer Research UK grants C5047/A29626, C5047/A22530, C309/A11566, C368/A6743, A368/A7990, and C14303/A17197; the Dallaglio Foundation; Prostate Cancer UK; and Prostate Cancer Research. We also acknowledge NIHR support to The Biomedical Research Centre at the Institute of Cancer Research and the Royal Marsden NHS Foundation Trust; Cancer Research UK funding to The Institute of Cancer Research and the Royal Marsden NHS Foundation Trust CRUK Centre; and the National Cancer Research Institute (National Institute of Health Research [NIHR] Collaborative Study: “Prostate Cancer: Mechanisms of Progression and Treatment (PROMPT)”) (grant G0500966/75466). We thank the National Institute for Health Research; Hutchison Whampoa Limited; University of Cambridge; the Human Research Tissue Bank (Addenbrooke’s Hospital), which is supported by the NIHR Cambridge Biomedical Research Centre; and The Core Facilities at the Cancer Research UK Cambridge Institute. The Cambridge Human Research Tissue Bank and A.Y.W. are supported by the NIHR Cambridge Biomedical Research Centre. C.E.M. was supported by a CRUK Major Centre Award through the CRUK Cambridge Centre Early Detection Programme and Urological Malignancies Programme. A.J.G. was funded by an SNF postdoctoral mobility fellowship (P2BSP3_178591). J.H.R.F. was supported by a Cancer Research UK Programme Grant to Simon Tavaré (C14303/A17197), as, partially, was A.G.L. A.G.L. acknowledges the support of the University of St Andrews. A.G.L. and J.H.R.F. also acknowledge the support of the Cambridge Cancer Research Fund. This work was supported by The Francis Crick Institute, which receives its core funding from Cancer Research UK (CC2008), the UK Medical Research Council (CC2008), and the Wellcome Trust (CC2008) (M.T. and P.V.L.). For the purpose of open access, the authors have applied a CC BY public copyright license to any author-accepted manuscript version arising from this submission. M.T. was a postdoctoral fellow supported by the European Union’s Horizon 2020 research and innovation program (Marie Skłodowska-Curie grant agreement no. 747852-SIOMICS). P.V.L. is a CPRIT Scholar in Cancer Research, acknowledges CPRIT grant support (RR210006), and was a Winton Group Leader in recognition of the Winton Charitable Foundation’s support toward the establishment of The Francis Crick Institute. G.S.B. was supported by the Academy of Finland, the Cancer Society of Finland, and the Sigrid Jusélius Foundation. The Melbourne Prostate Cancer Research Group was supported by NHMRC project grants 1024081 (N.M.C. and C.M.H.) and 1047581 (C.M.H. and N.M.C.), as well as a federal grant from the Australian Department of Health and Aging to the Epworth Cancer Centre, Epworth Hospital (A.J.C., N.M.C., and C.M.H.). B.J.P. was supported by a Victorian Health and Medical Research Fellowship. The Canadian Prostate Cancer Genome Network was supported by Prostate Cancer Canada with funds from Movember and by the Ontario Institute for Cancer Research. P.C.B. was supported by the NIH/NCI under award number P30CA016042. V.B. was funded by a fellowship from the Canadian Institutes for Health Research and was supported by a Michael Smith Foreign Study Supplement from the Canadian Institutes for Health Research. We also acknowledge support from the Bob Champion Cancer Research Trust, the Masonic Charitable Foundation, the King Family, and the Stephen Hargrave Foundation.
Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health.
Author contributions
D.J.W., C.S.C., and D.C.W. developed the concepts in the study. D.C.W. initiated and supervised the study. Analysis was performed by D.J.W. and A.S. The manuscript was written by D.J.W., C.S.C., and D.C.W. with contributions and revisions from all authors. The STAR Methods was written by D.J.W., A.S., A.Z., and D.S.B. with contributions from all authors. UK sample collection and clinical data generation were overseen by V.G., A.Y.W., F.C.H., G.S.B., C.S.F., D.E.N., Y.-J.L., D.S.B., and R.A.E. UK whole-genome sequencing data processing was performed by G.G., Y.X., A.B., and D.S.B. Data from Australia were generated and provided by B.J.P., C.-H.J., A.J.C., N.M.C., and C.M.H. Data from Canada were provided by V.B., M.F., R.G.B., and P.C.B. Bioinformatics measurements were created by D.J.W., A.S., R.T., C.E.M., A.J.G., V.B., E.A., M.T., S.C.D., and J.H.R.F., with oversight from P.V.L., Z.K.-J., P.C.B., A.G.L., and D.C.W. Data generation, standardization, and storage were overseen by D.S.B. R.A.E. and C.S.C. co-directed the project. The full CRUK ICGC Prostate Group created and maintains overall project direction; the principal investigators are G.S.B., C.E.M., A.G.L., D.S.B., R.A.E., C.S.C., and D.C.W.
Declaration of interests
An international patent related to this work has been published under international publication no. WO 2023/047140 A1. R.A.E. has the following conflicts of interest to declare: she has received honoraria from GU-ASCO, Janssen, The University of Chicago, and Dana-Farber Cancer Institute USA as a speaker, an educational honorarium from Bayer and Ipsen, and is a member of the external expert committee to Astra Zeneca UK. She undertakes private practice as a sole trader at The Royal Marsden NHS Foundation Trust, 90 Sloane Street SW1X 9PQ, and 280 Kings Road SW3 4NX, London, UK.
Published: February 29, 2024
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2024.100511.
Contributor Information
Daniel S. Brewer, Email: [email protected].
Rosalind A. Eeles, Email: [email protected].
Colin S. Cooper, Email: [email protected].
David C. Wedge, Email: [email protected].
Supplemental information
References
- 1.Nowell P.C. The clonal evolution of tumor cell populations. Science. 1976;194:23–28. doi: 10.1126/science.959840. [DOI] [PubMed] [Google Scholar]
- 2.Fearon E.R.Vogelstein B. A genetic model for colorectal tumorigenesis. Cell. 1990;61:759–767. doi: 10.1016/0092-8674(90)90186-i. [DOI] [PubMed] [Google Scholar]
- 3.Gerstung M.Jolly C.Leshchiner I.Dentro S.C.Gonzalez S.Rosebrock D.Mitchell T.J.Rubanova Y.Anur P.Yu K.et al. The evolutionary history of 2,658 cancers. Nature. 2020;578:122–128. doi: 10.1038/s41586-019-1907-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Nangalia J.Nice F.L.Wedge D.C.Godfrey A.L.Grinfeld J.Thakker C.Massie C.E.Baxter J.Sewell D.Silber Y.et al. DNMT3A mutations occur early or late in patients with myeloproliferative neoplasms and mutation order influences phenotype. Haematologica. 2015;100:e438–e442. doi: 10.3324/haematol.2015.129510. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Ortmann C.A.Kent D.G.Nangalia J.Silber Y.Wedge D.C.Grinfeld J.Baxter E.J.Massie C.E.Papaemmanuil E.Menon S.et al. Effect of mutation order on myeloproliferative neoplasms. N. Engl. J. Med. 2015;372:601–612. doi: 10.1056/NEJMoa1412098. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Papaemmanuil E.Gerstung M.Bullinger L.Gaidzik V.I.Paschka P.Roberts N.D.Potter N.E.Heuser M.Thol F.Bolli N.et al. Genomic classification and prognosis in acute myeloid leukemia. N. Engl. J. Med. 2016;374:2209–2221. doi: 10.1056/NEJMoa1516192. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Papaemmanuil E.Gerstung M.Malcovati L.Tauro S.Gundem G.Van Loo P.Yoon C.J.Ellis P.Wedge D.C.Pellagatti A.et al. Clinical and biological implications of driver mutations in myelodysplastic syndromes. Blood. 2013;122:3616–3699. doi: 10.1182/blood-2013-08-518886. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Tomlins S.A.Rhodes D.R.Perner S.Dhanasekaran S.M.Mehra R.Sun X.W.Varambally S.Cao X.Tchinda J.Kuefer R.et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science. 2005;310:644–648. doi: 10.1126/science.1117679. [DOI] [PubMed] [Google Scholar]
- 9.Kaffenberger S.D.Barbieri C.E. Molecular subtyping of prostate cancer. Curr. Opin. Urol. 2016;26:213–218. doi: 10.1097/MOU.0000000000000285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Luca B.A.Brewer D.S.Edwards D.R.Edwards S.Whitaker H.C.Merson S.Dennis N.Cooper R.A.Hazell S.Warren A.Y.et al. Desnt: A poor prognosis category of human prostate cancer. Eur. Urol. Focus. 2018;4:842–850. doi: 10.1016/j.euf.2017.01.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Wedge D.C.Gundem G.Mitchell T.Woodcock D.J.Martincorena I.Ghori M.Zamora J.Butler A.Whitaker H.Kote-Jarai Z.et al. Sequencing of prostate cancers identifies new cancer genesroutes of progression and drug targets. Nat. Genet. 2018;50:682–692. doi: 10.1038/s41588-018-0086-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Cancer Genome Atlas Research Network The molecular taxonomy of primary prostate cancer. Cell. 2015;163:1011–1025. doi: 10.1016/j.cell.2015.10.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Fraser M.Sabelnykova V.Y.Yamaguchi T.N.Heisler L.E.Livingstone J.Huang V.Shiah Y.J.Yousif F.Lin X.Masella A.P.et al. Genomic hallmarks of localizednon-indolent prostate cancer. Nature. 2017;541:359–364. doi: 10.1038/nature20788. [DOI] [PubMed] [Google Scholar]
- 14.Boyd L.K.Mao X.Lu Y.J. The complexity of prostate cancer: genomic alterations and heterogeneity. Nat. Rev. Urol. 2012;9:652–664. doi: 10.1038/nrurol.2012.185. [DOI] [PubMed] [Google Scholar]
- 15.Gerhauser C.Favero F.Risch T.Simon R.Feuerbach L.Assenov Y.Heckmann D.Sidiropoulos N.Waszak S.M.Hübschmann D.et al. Molecular evolution of early-onset prostate cancer identifies molecular risk markers and clinical trajectories. Cancer Cell. 2018;34:996–1011.e8. doi: 10.1016/j.ccell.2018.10.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Espiritu S.M.G.Liu L.Y.Rubanova Y.Bhandari V.Holgersen E.M.Szyca L.M.Fox N.S.Chua M.L.K.Yamaguchi T.N.Heisler L.E.et al. The evolutionary landscape of localized prostate cancers drives clinical aggression. Cell. 2018;173:1003–1013.e15. doi: 10.1016/j.cell.2018.03.029. [DOI] [PubMed] [Google Scholar]
- 17.Haffner M.C.Aryee M.J.Toubaji A.Esopi D.M.Albadine R.Gurel B.Isaacs W.B.Bova G.S.Liu W.Xu J.et al. Androgen-induced TOP2B-mediated double-strand breaks and prostate cancer gene rearrangements. Nat. Genet. 2010;42:668–675. doi: 10.1038/ng.613. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Weischenfeldt J.Simon R.Feuerbach L.Schlangen K.Weichenhan D.Minner S.Wuttig D.Warnatz H.J.Stehr H.Rausch T.et al. Integrative genomic analyses reveal an androgen-driven somatic alteration landscape in early-onset prostate cancer. Cancer Cell. 2013;23:159–170. doi: 10.1016/j.ccr.2013.01.002. [DOI] [PubMed] [Google Scholar]
- 19.Augello M.A.Liu D.Deonarine L.D.Robinson B.D.Huang D.Stelloo S.Blattner M.Doane A.S.Wong E.W.P.Chen Y.et al. CHD1 loss alters AR binding at lineage-specific enhancers and modulates distinct transcriptional programs to drive prostate tumorigenesis. Cancer Cell. 2019;35:817–819. doi: 10.1016/j.ccell.2019.04.012. [DOI] [PubMed] [Google Scholar]
- 20.Mollica C.Tardella L. Bayesian Plackett-Luce mixture models for partially ranked data. Psychometrika. 2017;82:442–458. doi: 10.1007/s11336-016-9530-0. [DOI] [PubMed] [Google Scholar]
- 21.Metzger E.Willmann D.McMillan J.Forne I.Metzger P.Gerhardt S.Petroll K.von Maessenhausen A.Urban S.Schott A.K.et al. Assembly of methylated KDM1A and CHD1 drives androgen receptor-dependent transcription and translocation. Nat. Struct. Mol. Biol. 2016;23:132–139. doi: 10.1038/nsmb.3153. [DOI] [PubMed] [Google Scholar]
- 22.Kluth M.Runte F.Barow P.Omari J.Abdelaziz Z.M.Paustian L.Steurer S.Christina Tsourlakis M.Fisch M.Graefen M.et al. Concurrent deletion of 16q23 and PTEN is an independent prognostic feature in prostate cancer. Int. J. Cancer. 2015;137:2354–2363. doi: 10.1002/ijc.29613. [DOI] [PubMed] [Google Scholar]
- 23.Kluth M.Harasimowicz S.Burkhardt L.Grupp K.Krohn A.Prien K.Gjoni J.Haß T.Galal R.Graefen M.et al. Clinical significance of different types of p53 gene alteration in surgically treated prostate cancer. Int. J. Cancer. 2014;135:1369–1380. doi: 10.1002/ijc.28784. [DOI] [PubMed] [Google Scholar]
- 24.Shenoy T.R.Boysen G.Wang M.Y.Xu Q.Z.Guo W.Koh F.M.Wang C.Zhang L.Z.Wang Y.Gil V.et al. CHD1 loss sensitizes prostate cancer to DNA damaging therapy by promoting error-prone double-strand break repair. Ann. Oncol. 2017;28:1495–1507. doi: 10.1093/annonc/mdx165. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Boysen G.Rodrigues D.N.Rescigno P.Seed G.Dolling D.Riisnaes R.Crespo M.Zafeiriou Z.Sumanasuriya S.Bianchini D.et al. SPOP-mutated/CHD1-deleted lethal prostate cancer and abiraterone sensitivity. Clin. Cancer Res. 2018;24:5585–5593. doi: 10.1158/1078-0432.CCR-18-0937. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Rodrigues L.U.Rider L.Nieto C.Romero L.Karimpour-Fard A.Loda M.Lucia M.S.Wu M.Shi L.Cimic A.et al. Coordinate loss of MAP3K7 and CHD1 promotes aggressive prostate cancer. Cancer Res. 2015;75:1021–1034. doi: 10.1158/0008-5472.CAN-14-1596. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Liu W.Lindberg J.Sui G.Luo J.Egevad L.Li T.Xie C.Wan M.Kim S.T.Wang Z.et al. Identification of novel CHD1-associated collaborative alterations of genomic structure and functional assessment of CHD1 in prostate cancer. Oncogene. 2012;31:3939–3948. doi: 10.1038/onc.2011.554. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Barbieri C.E., Baca S.C., Lawrence M.S., Demichelis F., Blattner M., Theurillat J.P., White T.A., Stojanov P., Van Allen E., Stransky N., et al. Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nat. Genet. 2012;44:685–689. doi: 10.1038/ng.2279.
- 29. Faisal F.A., Sundi D., Tosoian J.J., Choeurng V., Alshalalfa M., Ross A.E., Klein E., Den R., Dicker A., Erho N., et al. Racial variations in prostate cancer molecular subtypes and androgen receptor signaling reflect anatomic tumor location. Eur. Urol. 2016;70:14–17. doi: 10.1016/j.eururo.2015.09.031.
- 30. Mao X., Yu Y., Boyd L.K., Ren G., Lin D., Chaplin T., Kudahetti S.C., Stankiewicz E., Xue L., Beltran L., et al. Distinct genomic alterations in prostate cancers in Chinese and Western populations suggest alternative pathways of prostate carcinogenesis. Cancer Res. 2010;70:5207–5212. doi: 10.1158/0008-5472.CAN-09-4074.
- 31. Ren S., Wei G.H., Liu D., Wang L., Hou Y., Zhu S., Peng L., Zhang Q., Cheng Y., Su H., et al. Whole-genome and transcriptome sequencing of prostate cancer identify new genetic alterations driving disease progression. Eur. Urol. 2018;73:322–339. doi: 10.1016/j.eururo.2017.08.027.
- 32. Boysen G., Barbieri C.E., Prandi D., Blattner M., Chae S.S., Dahija A., Nataraj S., Huang D., Marotz C., Xu L., et al. SPOP mutation leads to genomic instability in prostate cancer. Elife. 2015;4:e09207. doi: 10.7554/eLife.09207.
- 33. Li H., Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595. doi: 10.1093/bioinformatics/btp698.
- 34. Nik-Zainal S., Van Loo P., Wedge D.C., Alexandrov L.B., Greenman C.D., Lau K.W., Raine K., Jones D., Marshall J., Ramakrishna M., et al. The life history of 21 breast cancers. Cell. 2012;149:994–1007. doi: 10.1016/j.cell.2012.04.023.
- 35. Baca S.C., Prandi D., Lawrence M.S., Mosquera J.M., Romanel A., Drier Y., Park K., Kitabayashi N., MacDonald T.Y., Ghandi M., et al. Punctuated evolution of prostate cancer genomes. Cell. 2013;153:666–677. doi: 10.1016/j.cell.2013.03.021.
- 36. Farmery J.H.R., Smith M.L., NIHR BioResource - Rare Diseases, Lynch A.G. Telomerecat: A ploidy-agnostic method for estimating telomere length from whole genome sequencing data. Sci. Rep. 2018;8:1300. doi: 10.1038/s41598-017-14403-y.
- 37. Turner H.L., van Etten J., Firth D., Kosmidis I. Modelling rankings in R: The PlackettLuce package. Comput. Stat. 2020;35:1027–1057. doi: 10.1007/s00180-020-00959-3.
- 38. Cooper C.S., Eeles R., Wedge D.C., Van Loo P., Gundem G., Alexandrov L.B., Kremeyer B., Butler A., Lynch A.G., Camacho N., et al. Analysis of the genetic phylogeny of multifocal prostate cancer identifies multiple independent clonal expansions in neoplastic and morphologically normal prostate tissue. Nat. Genet. 2015;47:367–372. doi: 10.1038/ng.3221.
- 39. Mao X., Yu Y., Boyd L.K., Ren G., Lin D., Chaplin T., Kudahetti S.C., Stankiewicz E., Xue L., Beltran L., et al. Distinct genomic alterations in prostate cancers in Chinese and Western populations suggest alternative pathways of prostate carcinogenesis. Cancer Res. 2010;70:5207–5212. doi: 10.1158/0008-5472.CAN-09-4074.
- 40. Ye K., Schulz M.H., Long Q., Apweiler R., Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25:2865–2871. doi: 10.1093/bioinformatics/btp394.
- 41. Hieronymus H., Schultz N., Gopalan A., Carver B.S., Chang M.T., Xiao Y., Heguy A., Huberman K., Bernstein M., Assel M., et al. Copy number alteration burden predicts prostate cancer relapse. Proc. Natl. Acad. Sci. USA. 2014;111:11139–11144. doi: 10.1073/pnas.1411446111.
- 42. Muthén B. In: The SAGE Handbook of Quantitative Methodology for the Social Sciences. Kaplan D., editor. SAGE Publications; 2004. Chapter 19: Latent Variable Analysis: Growth Mixture Modeling and Related Techniques for Longitudinal Data; pp. 106–109.
- 43. Lee D.D., Seung H.S. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–791. doi: 10.1038/44565.
- 44. Ghahramani Z., Griffiths T.L. In: Advances in Neural Information Processing Systems 18 (NIPS 2005). Weiss Y., Schölkopf B., Platt J., editors. NIPS; 2006. Infinite latent feature models and the Indian buffet process; pp. 475–482.
- 45. Hinton G.E., Salakhutdinov R.R. Reducing the dimensionality of data with neural networks. Science. 2006;313:504–507. doi: 10.1126/science.1127647.
- 46. Hinton G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002;14:1771–1800. doi: 10.1162/089976602760128018.
- 47. Tran T., Phung D., Venkatesh S. In: Proceedings of the Asian Conference on Machine Learning. Hsu C.-N., Lee W.S., editors. PMLR; 2011. Mixed-variate restricted Boltzmann machines; pp. 213–229.
- 48. Srivastava N., Salakhutdinov R.R. In: Advances in Neural Information Processing Systems 25 (NIPS 2012). Pereira F., Burges C.J., Bottou L., Weinberger K.Q., editors. NIPS; 2012. Multimodal learning with Deep Boltzmann Machines; pp. 2222–2230.
- 49. Nguyen T.D., Tran T., Phung D., Venkatesh S. In: Proceedings of the 5th Asian Conference on Machine Learning. Ong C.S., Ho T.B., editors. PMLR; 2013. Learning parts-based representations with nonnegative restricted Boltzmann machine; pp. 133–148.
- 50. Cueto M.A., Morton J., Sturmfels B. Geometry of the restricted Boltzmann machine. Contemp. Math. 2010;516:135–153.
- 51. LeCun Y.A., Bottou L., Orr G.B., Müller K.R. In: Neural Networks: Tricks of the Trade. Montavon G., Müller K.-R., Orr G.B., editors. Springer; 2012. Efficient backprop; pp. 9–48.
- 52. Welling M., Rosen-Zvi M., Hinton G.E. In: Advances in Neural Information Processing Systems 17 (NIPS 2004). Saul L., Weiss Y., Bottou L., editors. NIPS; 2005. Exponential family harmoniums with an application to information retrieval; pp. 1481–1488.
- 53. Bengio Y., Lamblin P., Popovici D., Larochelle H. In: Advances in Neural Information Processing Systems 19 (NIPS 2006). Schölkopf B., Platt J., Hoffman T., editors. NIPS; 2007. Greedy layer-wise training of deep networks; pp. 153–160.
- 54. Tran T., Nguyen T.D., Phung D., Venkatesh S. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). J. Biomed. Inform. 2015;54:96–105. doi: 10.1016/j.jbi.2015.01.012.
- 55. Cui Z., Ge S.S., Cao Z., Yang J., Ren H. Analysis of different sparsity methods in constrained RBM for sparse representation in cognitive robotic perception. J. Intell. Robot. Syst. 2015;80:121–132.
- 56. Wan L., Zeiler M., Zhang S., Le Cun Y., Fergus R. In: Proceedings of the 30th International Conference on Machine Learning. Dasgupta S., McAllester D., editors. PMLR; 2013. Regularization of neural networks using DropConnect; pp. 1058–1066.
- 57. Srebro N., Shraibman A. In: International Conference on Computational Learning Theory. Auer P., Meir R., editors. Springer; 2005. Rank, trace-norm and max-norm; pp. 545–560.
- 58. Breiman L. Bagging predictors. Mach. Learn. 1996;24:123–140.
- 59. Prechelt L. In: Neural Networks: Tricks of the Trade. Montavon G., Müller K.-R., Orr G.B., editors. Springer; 1998. Early Stopping — But When? pp. 55–69.
- 60. Hinton G.E. In: Neural Networks: Tricks of the Trade. Montavon G., Müller K.-R., Orr G.B., editors. Springer; 2012. A practical guide to training Restricted Boltzmann Machines; pp. 599–619.
- 61. Smith L.N. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2017. Cyclical learning rates for training neural networks; pp. 464–472.
- 62. Olden J.D., Joy M.K., Death R.G. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol. Modell. 2004;178:389–397.
- 63. Olden J.D., Jackson D.A. Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecol. Modell. 2002;154:135–150.
- 64. Garson G.D. Interpreting neural-network connection weights. AI Expert. 1991;6:46–51.
- 65. Larochelle H., Mandel M., Pascanu R., Bengio Y. Learning algorithms for the classification Restricted Boltzmann Machine. J. Mach. Learn. Res. 2012;13:643–669.
- 66. Lin J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory. 1991;37:145–151.
- 67. Rousseeuw P.J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987;20:53–65.
- 68. Pomerantz M.M., Li F., Takeda D.Y., Lenci R., Chonkar A., Chabot M., Cejas P., Vazquez F., Cook J., Shivdasani R.A., et al. The androgen receptor cistrome is extensively reprogrammed in human prostate tumorigenesis. Nat. Genet. 2015;47:1346–1351. doi: 10.1038/ng.3419.
- 69. Gel B., Díez-Villanueva A., Serra E., Buschbeck M., Peinado M.A., Malinverni R. regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics. 2016;32:289–291. doi: 10.1093/bioinformatics/btv562.
- 70. Suzuki R., Shimodaira H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics. 2006;22:1540–1542. doi: 10.1093/bioinformatics/btl117.
- 71. Ansari-Pour N., Zheng Y., Yoshimatsu T.F., Sanni A., Ajani M., Reynier J.B., Tapinos A., Pitt J.J., Dentro S., Woodard A., et al. Whole-genome analysis of Nigerian patients with breast cancer reveals ethnic-driven somatic evolution and distinct genomic subtypes. Nat. Commun. 2021;12:6946. doi: 10.1038/s41467-021-27079-w.
- 72. Luce R.D. Wiley; 1959. Individual Choice Behavior: A Theoretical Analysis.
- 73. Plackett R.L. The analysis of permutations. Appl. Stat. 1975;24:193–202.
- 74. Mollica C., Tardella L. Epitope profiling via mixture modeling of ranked data. Stat. Med. 2014;33:3738–3758. doi: 10.1002/sim.6224.
- 75. Anderson G.M. Error propagation by the Monte Carlo method in geochemical calculations. Geochim. Cosmochim. Acta. 1976;40:1533–1538. doi: 10.1016/0016-7037(76)90092-2.
- 76. Kingman J.F.C. In: Encyclopedia of Biostatistics. Armitage P., Colton T., editors. John Wiley & Sons; 2005. Poisson processes.
- 77. Jones M.C. Simple boundary correction for kernel density estimation. Stat. Comput. 1993;3:135–146.
Associated Data
Data Availability Statement
Sequencing data generated for this study have been deposited in the European Genome-phenome Archive under accession code EGAS00001000262. Processed data and code used in this manuscript are available at https://github.com/woodcockgrp/evotypes_p1/ and via https://doi.org/10.5281/zenodo.10214795.
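For readers who want to obtain the code and processed data programmatically, the following is a minimal sketch (not part of the original statement) that clones the stated GitHub repository and resolves the Zenodo DOI to its record page. It assumes git and the Python requests package are installed; access to the EGA sequencing data requires a separate, controlled-access application and is not covered here.

```python
import subprocess

import requests

# Clone the analysis code repository listed in the Data Availability Statement.
# Assumes git is installed and the repository is still publicly accessible;
# the command fails if a directory named "evotypes_p1" already exists.
subprocess.run(
    ["git", "clone", "https://github.com/woodcockgrp/evotypes_p1"],
    check=True,
)

# Resolve the Zenodo DOI to its landing page; the archived files can then be
# downloaded from that record. Assumption: the doi.org resolver redirects to
# the current Zenodo record for this deposit.
resp = requests.get("https://doi.org/10.5281/zenodo.10214795", timeout=30)
resp.raise_for_status()
print("Zenodo record resolved to:", resp.url)
```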