Chapter 5 Comparison of reproducibility between NMF and ICA

NMF and ICA are algorithms often applied to solve the blind source deconvolution problem. NMF has gained popularity as a tool for transcriptomic analysis, as reflected in many publications (Moffitt et al. 2015; Shen-Orr and Gaujoux 2013; Brunet et al. 2004; Repsilber et al. 2010). However, none of these works compares components obtained from different datasets with each other.

The non-negativity constraint, an attractive concept in the case of non-negative transcriptome counts, may be a reason why the results of NMF decomposition are not the best candidate for our deconvolution task. I performed an analysis that demonstrates that NMF-based metagenes are less reproducible between different transcriptomic datasets than ICA-based metagenes.

5.1 Comparing metagenes obtained with NMF versus ICA

I compared the reproducibility of NMF (classical Brunet version, see Section 2.3.6.2) and ICA (fastICA) by decomposing four breast cancer datasets: BRCATCGA, METABRIC, BEK, and WAN (Weinstein et al. 2013; Curtis et al. 2012; Bekhouche et al. 2011; Wang et al. 2005). These datasets were selected because of their size (more than 50 samples) and because they were available in the non-centered format required by NMF.

For NMF, the procedure was as follows:

  1. data were transformed with log2(x + 1)
  2. zero-rows were removed
  3. the algorithm assessing the cophenetic coefficient was applied to select the optimal number of components
  4. datasets were decomposed with the Matlab NMF implementation from Brunet et al. (2004) into (i) the number of components suggested by the cophenetic coefficient, (ii) the MSTD dimension, and (iii) 50 components (approaching overdecomposition)
  5. the obtained metagenes were decorrelated from the mean using a linear regression model
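The preprocessing and the final decorrelation step can be sketched in a few lines of NumPy. This is a minimal illustration, not the Matlab code used in the study: the function names and the toy matrix are hypothetical, and it assumes that "decorrelated from the mean" means regressing each metagene (a vector of gene weights) on the per-gene mean expression and keeping the residuals.

```python
import numpy as np

def preprocess_for_nmf(counts):
    """Log-transform a genes x samples matrix and drop all-zero genes (steps 1-2)."""
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    keep = logged.any(axis=1)  # True for genes that are non-zero in at least one sample
    return logged[keep], keep

def decorrelate_from_mean(metagene, mean_expression):
    """Residuals of the least-squares fit metagene ~ a + b * mean_expression (step 5)."""
    X = np.column_stack([np.ones_like(mean_expression), mean_expression])
    coef, *_ = np.linalg.lstsq(X, metagene, rcond=None)
    return metagene - X @ coef

# Toy data: 4 genes x 3 samples; the first gene is zero everywhere.
counts = np.array([[0, 0, 0], [1, 3, 7], [15, 31, 63], [3, 3, 3]])
logged, keep = preprocess_for_nmf(counts)
gene_mean = logged.mean(axis=1)
metagene = np.array([1.0, 4.0, 0.5])  # hypothetical NMF gene weights
residual = decorrelate_from_mean(metagene, gene_mean)
```

By construction, the residual metagene has zero covariance with the mean expression profile, which is the property the decorrelation step aims for.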

For ICA, the procedure was as follows:

  1. data were transformed with log2(x + 1)
  2. transformed data were mean-centered by gene
  3. our implementation of MSTD (most stable transcriptomic dimension; Kairov et al. 2017) was used to estimate the most stable dimension
  4. datasets were decomposed into (i) the MSTD dimension and (ii) 50 components (approaching overdecomposition) with the Matlab implementation of fastICA with icasso stabilization
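The first two ICA steps can be sketched as follows. This is a minimal NumPy illustration with a hypothetical function name and toy matrix. It also shows why the centered matrix is unsuitable for NMF: centering by gene necessarily introduces negative values.

```python
import numpy as np

def preprocess_for_ica(counts):
    """Log-transform a genes x samples matrix, then subtract each gene's mean."""
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    return logged - logged.mean(axis=1, keepdims=True)

# Toy data: 3 genes x 3 samples.
counts = np.array([[1, 3, 7], [15, 31, 63], [0, 1, 3]])
centered = preprocess_for_ica(counts)
```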

I did not decompose the data with ICA into a low number of components, as we consider this a strong underdecomposition and suspect that the resulting signals would not be the most reproducible.

To define the optimal number of factors for NMF (\(k\)), I followed the strategy employed by Brunet et al. (2004), using the cophenetic coefficient, a metric related to the stability of the clusters obtained over iterative runs of NMF.

[The cophenetic coefficient] is defined as the Pearson correlation between the samples’ distances induced by the consensus matrix (seen as a similarity matrix) and their cophenetic distances from a hierarchical clustering based on these very distances (by default an average linkage is used) (Brunet et al. 2004)

The cophenetic distance between two observations that have been clustered is defined to be the intergroup dissimilarity at which the two observations are first combined into a single cluster. The minimum of the cophenetic coefficient values over \(k\) indicates the optimal number of factors.
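The definition above can be made concrete with a small sketch. It assumes that each NMF run yields one cluster label per sample, that the consensus matrix is the fraction of runs in which two samples share a label, and that the dissimilarity is 1 minus the consensus. All function names and the toy labelings are hypothetical, and the naive average-linkage loop is only meant for small examples.

```python
import numpy as np

def consensus_matrix(labelings):
    """Fraction of runs in which each pair of samples falls in the same cluster."""
    labelings = np.asarray(labelings)            # shape: (n_runs, n_samples)
    same = labelings[:, :, None] == labelings[:, None, :]
    return same.mean(axis=0)

def cophenetic_distances(dist):
    """Cophenetic distances from naive average-linkage clustering of `dist`."""
    n = dist.shape[0]
    clusters = {i: [i] for i in range(n)}        # cluster id -> member samples
    coph = np.zeros((n, n))
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # average pairwise distance between the two clusters
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # every pair first joined by this merge gets the merge height
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i, j] = coph[j, i] = d
        clusters[a] = clusters[a] + clusters.pop(b)
    return coph

def cophenetic_coefficient(consensus):
    """Pearson correlation between consensus-induced and cophenetic distances."""
    dist = 1.0 - consensus
    coph = cophenetic_distances(dist)
    upper = np.triu_indices_from(dist, k=1)
    return np.corrcoef(dist[upper], coph[upper])[0, 1]

# Three runs on four samples: identical labels vs. one disagreeing run.
stable = cophenetic_coefficient(consensus_matrix([[0, 0, 1, 1]] * 3))
unstable = cophenetic_coefficient(
    consensus_matrix([[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]])
)
```

In this toy case, identical labelings across runs give a coefficient of 1, while a single disagreeing run lowers it. In practice the coefficient would be computed for each candidate \(k\) and the values compared.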

Finding the best number of factors \(k\) for NMF of the largest dataset (METABRIC), with \(k\) ranging from 2 to 50, took 30245 minutes (3 weeks). Therefore, I limited \(k_{max}\), the maximal number of factors, to 50 components rather than the 100 initially planned.

Once the four datasets were decomposed into the MSTD, Cophenetic\(_{min}\) and 50 components, I proceeded to compare the components between datasets. I correlated all obtained metagenes with each other and with known reference metagenes (Biton et al. 2014). The results are represented as a correlation graph in which nodes are metagenes from different datasets and decomposition levels, and edge width corresponds to the Pearson correlation coefficient (Fig 5.1).
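A minimal sketch of how such a correlation graph could be built is given below. It assumes all metagenes are expressed over the same ordered gene list. The metagene names and toy vectors are hypothetical. The sketch filters on the absolute correlation, which is an assumption motivated by the sign indeterminacy of ICA components; the text only specifies a threshold of >0.4.

```python
import numpy as np

def correlation_edges(metagenes, threshold=0.4):
    """Return (name_a, name_b, r) for every metagene pair with |r| > threshold.

    metagenes: dict mapping a metagene name to a 1-D array of gene weights,
    all defined over the same ordered gene list.
    """
    names = list(metagenes)
    R = np.corrcoef(np.vstack([metagenes[name] for name in names]))
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) > threshold:
                edges.append((names[i], names[j], float(R[i, j])))
    return edges

# Toy metagenes built from two uncorrelated base signals over 200 "genes".
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
s1, s2 = np.sin(t), np.cos(t)
toy = {
    "BRCATCGA_IC1": s1,             # hypothetical component names
    "METABRIC_IC3": s1 + 0.2 * s2,  # same signal plus a small admixture
    "WAN_IC2": s2,                  # a different signal
}
edges = correlation_edges(toy)
```

Here only the first two metagenes are linked: the third one correlates at about 0.2 with the second and is removed by the threshold. The resulting edge list, with r as the edge weight, can be loaded into a graph visualization tool.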

Figure 5.1: Correlation graph of ICA and NMF multiple decompositions. The upper part of the figure (A, B) shows the correlation graph of all metagenes (ICA- or NMF-based) displayed using the edge-weighted bio layout. In the lower part of the figure (C, D), a threshold of >0.4 was applied to filter the edges. In the case of ICA (C), the remaining nodes form pseudo-cliques; the immune-related pseudo-clique is highlighted. In the case of NMF (D), components cluster by dataset. Edge width corresponds to the Pearson correlation coefficient. Node colors correspond to the dataset from which a metagene was obtained (see legend).

I expected to observe subsets of components from different datasets (regardless of the decomposition level) that correlate strongly with each other and much less with other components, which would confirm that the signal is reproducible (can be found in several datasets) and specific (can be matched to one corresponding signal in another dataset). The reference components were used here only to help label the signals and are indicative in nature. In the ICA-based correlation graph, some clusters are already noticeable without any threshold (Fig 5.1A), and after applying a threshold of >0.4 on the Pearson correlation coefficient (Fig 5.1C), numerous pseudo-cliques emerge. The metagenes from the NMF decomposition, in contrast, are more tightly connected globally, and when the threshold is applied, components group by dataset. In the NMF decomposition, it is hard to distinguish different signals, as the components all seem related to each other. Fig 5.1D shows that the IMMUNE signal is correlated >0.4 with a large number of NMF components that are also linked to other components. In ICA (Fig 5.1C), components related to the IMMUNE metagene form a pseudo-clique that is connected by only one link to the INTERFERON metagene. This makes them much more specific, and their interpretation is therefore more straightforward.
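One possible way to formalize "specific" matching between two datasets is a reciprocal best hit: component i of dataset A and component j of dataset B are matched only if each is the other's most correlated partner. The sketch below illustrates this criterion under that assumption; it is not presented as the procedure used to produce Fig 5.1, and the function name and toy matrices are hypothetical.

```python
import numpy as np

def reciprocal_best_hits(A, B):
    """Mutual best matches between columns of A and B (genes x components).

    Returns a list of (i, j, r): column i of A and column j of B are each
    other's partner with the highest absolute Pearson correlation r.
    """
    k_a = A.shape[1]
    # Cross-correlation block between components of A (rows) and B (columns).
    R = np.corrcoef(A.T, B.T)[:k_a, k_a:]
    abs_r = np.abs(R)
    best_for_a = abs_r.argmax(axis=1)
    best_for_b = abs_r.argmax(axis=0)
    return [
        (i, int(j), float(R[i, j]))
        for i, j in enumerate(best_for_a)
        if best_for_b[j] == i
    ]

# Toy example over 200 "genes": B holds A's two signals (one sign-flipped,
# one slightly perturbed) plus an unrelated third component.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
A = np.column_stack([np.sin(t), np.cos(t)])
B = np.column_stack([np.cos(t) + 0.1 * np.sin(2 * t), -np.sin(t), np.sin(3 * t)])
hits = reciprocal_best_hits(A, B)
```

In this toy case, each component of A finds exactly one partner in B, and the unrelated third component of B stays unmatched. Components that correlate moderately with many others, as described for NMF above, would rarely satisfy such a criterion unambiguously.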

5.2 Summary

This simple analysis illustrates that NMF applied to cancer transcriptomes decomposes them into metagenes that do not match selectively and specifically between datasets. In part, this is because NMF components are correlated with the average gene expression. Therefore, NMF can be sensitive to normalization. However, even after the “average removal” through linear regression, this phenomenon persists. It is not clear why, from a mathematical perspective, we observe such a discrepancy in the interpretation of NMF and ICA components.

It is also possible that a different method should be used to find the correct decomposition dimension (\(k\)). Ideally, different NMF implementations should be tested to verify whether using different error updates has an impact on the results.

In practice, it will not always be possible to work with the data processed in the same way. Using ICA for decomposition seems to be more straightforward, and the obtained components are easier to interpret as biological functions (thanks to the reciprocal matching) without a need to renormalize datasets.

A more in-depth extension of this study was performed by Cantini, Kairov, et al. (2018) and is available online.

References

Moffitt, Richard A, Raoud Marayati, Elizabeth L Flate, Keith E Volmar, S Gabriela Herrera Loeza, Katherine A Hoadley, Naim U Rashid, et al. 2015. “Virtual microdissection identifies distinct tumor- and stroma-specific subtypes of pancreatic ductal adenocarcinoma.” Nat. Genet. 47 (10). Nature Publishing Group: 1168–78. doi:10.1038/ng.3398.

Shen-Orr, Shai S, and Renaud Gaujoux. 2013. “Computational deconvolution: extracting cell type-specific information from heterogeneous samples.” Curr. Opin. Immunol. 25 (5): 571–78. doi:10.1016/j.coi.2013.09.015.

Brunet, Jean-Philippe, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. 2004. “Metagenes and molecular pattern discovery using matrix factorization.” Proc. Natl. Acad. Sci. U. S. A. 101 (12). National Academy of Sciences: 4164–9. doi:10.1073/pnas.0308531101.

Repsilber, Dirk, Sabine Kern, Anna Telaar, Gerhard Walzl, Gillian F Black, Joachim Selbig, Shreemanta K Parida, Stefan HE Kaufmann, and Marc Jacobsen. 2010. “Biomarker discovery in heterogeneous tissue samples -taking the in-silico deconfounding approach.” BMC Bioinformatics 11 (1). BioMed Central: 27. doi:10.1186/1471-2105-11-27.

Weinstein, John N, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, et al. 2013. “The Cancer Genome Atlas Pan-Cancer Analysis Project.” Nat. Genet. 45 (10). Nature Publishing Group: 1113.

Curtis, Christina, Sohrab P. Shah, Suet Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dunning, Doug Speed, et al. 2012. “The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups.” Nature 486 (7403): 346–52. doi:10.1038/nature10983.

Bekhouche, Ismahane, Pascal Finetti, José Adelaïde, Anthony Ferrari, Carole Tarpin, Emmanuelle Charafe-Jauffret, Colette Charpin, et al. 2011. “High-resolution comparative genomic hybridization of Inflammatory breast cancer and identification of candidate genes.” PLoS ONE 6 (2). doi:10.1371/journal.pone.0016950.

Wang, Yixin, Jan G.M. Klijn, Yi Zhang, Anieta M. Sieuwerts, Maxime P. Look, Fei Yang, Dmitri Talantov, et al. 2005. “Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.” Lancet 365 (9460): 671–79. doi:10.1016/S0140-6736(05)17947-1.

Kairov, Ulykbek, Laura Cantini, Alessandro Greco, Askhat Molkenov, Urszula Czerwinska, Emmanuel Barillot, and Andrei Zinovyev. 2017. “Determining the optimal number of independent components for reproducible transcriptomic data analysis.” BMC Genomics 18 (1). BioMed Central: 712. doi:10.1186/s12864-017-4112-9.

Biton, Anne, Isabelle Bernard-Pierrot, Yinjun Lou, Clémentine Krucker, Elodie Chapeaublanc, Carlota Rubio-Pérez, Nuria López-Bigas, et al. 2014. “Independent Component Analysis Uncovers the Landscape of the Bladder Tumor Transcriptome and Reveals Insights into Luminal and Basal Subtypes.” Cell Rep. 9 (4): 1235–45. doi:10.1016/j.celrep.2014.10.035.

Cantini, Laura, Ulykbek Kairov, Aurelien de Reynies, Emmanuel Barillot, Francois Radvanyi, and Andrei Zinovyev. 2018. “Stabilized Independent Component Analysis outperforms other methods in finding reproducible signals in tumoral transcriptomes.” bioRxiv. Cold Spring Harbor Laboratory, 318154. doi:10.1101/318154.