Skip to main content
  • Research article
  • Open access
  • Published:

Untargeted lipidomic features associated with colorectal cancer in a prospective cohort



Epidemiologists are beginning to employ metabolomics and lipidomics with archived blood from incident cases and controls to discover causes of cancer. Although several such studies have focused on colorectal cancer (CRC), they all followed targeted or semi-targeted designs that limited their ability to find discriminating molecules and pathways related to the causes of CRC.


Using an untargeted design, we measured lipophilic metabolites in prediagnostic serum from 66 CRC patients and 66 matched controls from the European Prospective Investigation into Cancer and Nutrition (Turin, Italy). Samples were analyzed by liquid chromatography-high-resolution mass spectrometry (LC-MS), resulting in 8690 features for statistical analysis.


Rather than the usual multiple-hypothesis-testing approach, we based variable selection on an ensemble of regression methods, which found nine features to be associated with case-control status. We then regressed each selected feature on time-to-diagnosis to determine whether the feature was likely to be either a potentially causal biomarker or a reactive product of disease progression (reverse causality).


Of the nine selected LC-MS features, four appear to be involved in CRC etiology and merit further investigation in prospective studies of CRC. Four other features appear to be related to progression of the disease (reverse causality), and may represent biomarkers of value for early detection of CRC.

Peer Review reports


Colorectal cancer (CRC) accounts for over 25% of all cancer-related deaths with global incidence rates steadily rising [1,2,3]. Since less than 15% of CRC risk has been attributed to heritable genetics [4, 5], non-shared exposures and their contributions to gut inflammation are believed to be important etiologic factors [6]. Increased CRC risks have been associated with cigarette smoking, alcohol use, lack of physical activity, obesity, abnormal glucose metabolism, and consumption of red meat and n-6 polyunsaturated fatty acids (PUFAs) [7,8,9,10]. Conversely, consumption of n-3 PUFAs, fruits, fish, vitamins D and E, and regular use of aspirin appear to reduce CRC risks [7, 11, 12]. There are also persistent suggestions that the interplay between dietary factors - particularly red meat, lipids, and fiber - and the gut microbiota are effect modifiers for CRC [6, 13,14,15,16].

Many of the associations between exposures and CRC have been gleaned from epidemiological studies that employed self-reported dietary and lifestyle factors [8, 9, 16, 17]. Given the inherent limitations of such data for discovering causal exposures, investigators have recently employed metabolomics to compare small-molecule features between CRC cases and controls. This strategy is based on the idea that small molecules in human blood reflect chemical exposures from both internal and external sources, including the diet, microbiota, psychosocial stress, and pollutants [18]. However, since molecules that discriminate cases from controls in cross-sectional studies can reflect both potential causes of CRC and dysregulation of metabolic processes that result from progression of the disease (reverse causality) [5, 19], it is important that biospecimens be collected before diagnosis to gain insights into causes and effects. Indeed, a class of ultra-long-chain fatty acids (ULCFAs) that discriminated for CRC in several cross-sectional studies [20, 21] was essentially ruled out as a causal factor in a prospective cohort [22].

Metabolomic analyses of blood from prospective cohorts have found some associations between CRC incidence and small molecules, as summarized in Table 1, with periods of follow-up ranging from 3.7 to 14.7 years [19, 22,23,24,25,26,27]. Interestingly, all of these nested case-control studies followed targeted or semi-targeted designs where relatively few molecular features were tested between cases and controls. Two of the studies focused on metabolism of dietary choline and found that the mammalian metabolite, betaine, was moderately protective against CRC whereas trimethylamine-N-oxide (TMAO), a metabolite mediated via intestinal microbiota, was associated with increased risk [26, 27]. A genetic link between TMAO and CRC risk has also been reported [28]. Intriguingly, red meat and other phosphatidylcholine-rich foods appear to contribute to dysbiotic microbiota that generate trimethylamine (the precursor of TMAO) [15, 29], whereas fiber-rich foods appear to encourage symbiotic bacteria that are associated with decreased CRC risk [15, 30].

Table 1 Studies that investigated associations of colorectal cancer with small molecules in plasma or serum from prospective cohorts

Since untargeted metabolomics via liquid chromatography-mass spectrometry (LC-MS) can detect thousands of small-molecule features, traditional hypothesis-testing approaches that adjust for multiple comparisons by controlling false positive error rates, such as the false discovery rate (FDR) [31], can make it difficult to find features whose levels differ significantly between cases and controls. This may have motivated the semi-targeting strategy of Cross et al. (Table 1) [25], who limited hypothesis tests of the thousands of detected features to only 278 molecules that had been fully annotated. Such a strategy is likely to be biased towards well curated metabolites that participate in recognized human pathways [18], and thus can miss novel exposures of potential importance to initiation of cancer, including those experienced predominately by either cases or controls. Indeed, of the 278 small-molecules tested by Cross et al. [25], only glycochenodeoxycholate (a secondary bile salt) was associated with increased CRC risk in women (but not men) after using the conservative Bonferonni correction of the p-value.

Here, we report results of an untargeted metabolomics analysis of serum from 66 incident CRC cases and matched controls from the European Prospective Investigation of Cancer and Nutrition (EPIC). Given the involvement of lipids in inflammatory processes and CRC [32,33,34], the serum-extraction procedure favored lipophilic molecules. As an alternative to the traditional multiple-hypothesis-testing paradigm for selecting features of potential importance to CRC, we developed a variable-selection strategy that employs an ensemble of diverse prediction methods, including regularized linear regression and regression trees [35,36,37]. Such methods have recently been applied independently for analyzing metabolomic and other -omic data [35, 36, 38]. Our analyses point to a small set of features that were predictive of CRC-case status. However, as with all discovery studies, these potentially important features and the molecules they represent must be further validated with independent data sets.


Study population

EPIC is a prospective cohort study with approximately 520,000 adult participants from across Europe that were enrolled from 1992 through 2000 [39]. We had received EPIC serum to test the hypothesis that ULCFAs were protective of colorectal cancer [22], and simultaneously performed untargeted metabolomics to discover other potentially causal features. For the current study, data were analyzed from serum collected between 1993 and 1997 from 66 case-control pairs in Turin, Italy. Controls were matched to incident cases by age, year and season of enrollment, and gender. Dietary data were collected with food frequency questionnaires [40, 41]. Summary statistics for these subjects are listed in Table 2, including time to diagnosis (ttd), gender, body mass index (bmi), waist circumference, smoking status, diabetes status, physical activity, and alcohol and meat consumption. These variables were selected based on previous evidence of associations with CRC risk [7, 42, 43]. Across our subjects, the only significant differences between CRC cases and controls were observed for bmi and waist circumference, both of which were higher in cases (Table 2).

Table 2 Descriptive statistics of human subjects matched by age, study enrollment, gender, and selected covariates


Isopropanol (LC-MS grade, Fluka), methanol, water and 13C- cholic acid (internal standard) were from Sigma-Aldrich (Milwaukee, WI, USA). Acetic acid (LC-MS grade, Optima) and chloroform were from Fisher Scientific (Santa Clara, CA, USA). All chemicals were of analytical grade and were used without purification.

Sample processing

Serum was stored after collection in 0.5-ml aliquots that were placed in cryostraws, sealed, and stored in liquid nitrogen (-196 °C) at the International Agency for Research on Cancer in Lyon, France. Approximately 1 year prior to analysis, cryostraws were transported (with dry ice) to our laboratory in Berkeley, CA (USA), where they were maintained at -80 °C. As previously reported [22], 20 μl of serum was mixed with 100 μl of a solvent mixture (isopropanol/methanol/water = 60:35:5) containing 13C-cholic acid as an internal standard (final concentration of 3.0 μg/ml). After mixing samples for 1 minute with a vortex mixer, samples were left at room temperature for 10 min. to precipitate proteins, and were then centrifuged for 10 min at 10,000 g. The supernatant was retained and stored at 4 °C prior to LC-MS. Case-control pairs were analyzed sequentially but in random order. A local quality-control sample, prepared by pooling aliquots from all serum specimens of each batch, was analyzed after every ten samples to monitor system stability and estimate the precision of the analyses.

Mass spectrometry

Analysis was performed with an Agilent LC (1100 series) coupled to an Agilent high resolution MS (Model 6550 QTOF, Santa Clara, CA, USA) as previously reported [22]. Briefly, 10 μl of extracts were slowly loaded on to a Luna C5 column (Phenomenex, Los Angeles, CA) with a 22-min gradient elution of mobile phase A (methanol/0.5% acetic acid = 5:95) and mobile phase B (isopropanol/methanol/0.5% acetic acid = 60:35:5). The electrospray was operated in negative electrospray-ionization (ESI) mode. Tandem MS/MS spectra were obtained on the same platform in data-dependent mode (immediately after data collection) or targeted mode (analysis of the selected features). Full LC-MS acquisition parameters were previously published [22].

Approximately one third of the serum samples had a gelled consistency that resulted from an additive to the cryostraws [22, 44, 45]. Pairs with at least one gelled sample were analyzed in one batch (batch 1, n = 96), and the remaining (non-gelled) pairs were analyzed in a second batch (batch 2, n = 36).

Data processing

Raw data were converted to mzXML format for peak picking using ProteoWizard software (Spielberg Family Center for Applied Proteomics, Los Angeles, CA). Peak detection and retention-time alignment were performed as described previously [22], using the XCMS package within the R statistical programming environment [46,47,48]. The CAMERA package was used to identify isotopes, ESI adducts, and in-source fragments with the custom rule set used from Stanstrup et al. [49, 50]. Annotation of features was conducted using the compMS2Miner package [51], by comparing accurate masses and MS2 fragmentation patterns with the Human Metabolome Database (HMDB) and Metlin [52, 53].

Over 24,300 features were initially detected in negative ESI mode. Features were filtered by removing those with a mean fold-change in abundance less than 1.5 compared to the same peaks in reagent blanks (background noise) and those with coefficients of variation (CV) from QC samples greater than 30% [54, 55]. This resulted in a final dataset of 8690 features for statistical analysis. Feature intensities were (natural) log-transformed and adjusted for batch and gel-status effects using the following linear regression model, previously described in [22]:

$$ \mathit{\log}{Y}_i={\beta}_0+{\beta}_1{X}_{i, gel}+{\beta}_2{X}_{i, batch}+{\epsilon}_i, $$

where Yi denotes the intensity of a given feature for the ith subject and Xi, gel and Xi, batch are the corresponding categorical covariates for gel-status and batch. After fitting the linear model, normalized (logged) intensities were obtained by subtracting the estimated batch and gel effects from the original (logged) intensities.

Upper-quantile scaling was used to render the distributions of feature abundances more comparable across all subjects [56, 57]. A correlation-network program (Cytoscape, [58]) and an R package clustering algorithm (RAMclust, [59]) were used to identify clustered ions and assist with annotations.

Statistical methods: Variable selection

In order to identify discriminating features between CRC cases and control, we shifted the paradigm from multiple hypothesis testing to variable selection based on a combination of three regression methods. First, we considered the following standard linear regression model for the raw intensity of a given feature Y in the ith subject:

$$ \mathit{\log}{Y}_i={\beta}_0+{\beta}_1{X}_{i, caco}+{\beta}_2{X}_{i, gel}+{\beta}_3{X}_{i, batch}+{\beta}_4{X}_{i, age}+{\beta}_5{X}_{i, gender}+{\epsilon}_i, $$

where caco, gel, and gender denote binary variables for case-control status, presence or absence of gelled serum, and the matched variables of gender and age (in years). Features were then ranked based on the nominal unadjusted p-value for the case-control coefficient (β1). Second, a regularized logistic regression (LASSO: least absolute shrinkage and selection operator) was performed [35, 60] with case-control status as the binary outcome variable regressed on the following covariates: normalized log intensities from Eq. (1) for all 8690 metabolites, age and gender. Although age and gender were in the LASSO model, neither variable was selected by the regularized regression.

In order to stabilize feature selection with LASSO, 500 bootstrap samples were taken for each of a variety of penalty parameters [61]. Features chosen by LASSO in at least 10% of the bootstrap samples, across a wide range of penalty parameters, were retained. A data-driven cutoff of 10% was chosen based on plots of the percentage of time that each metabolite was selected during the bootstrap iterations across various penalty parameters, sorted in decreasing order. There was an obvious gap between metabolites selected more than and less than 10% of the time, which lead to choosing this as a natural cutoff. Third, the random forest algorithm [36, 62] was used to build a predictor of case-control status, using the same covariates as for the LASSO regression. No obvious jump in variable importance could be seen for the sorted random forest variable importance or the sorted linear regression p-values. Therefore, a cutoff of 1% was selected for both of these criteria because this cutoff is relatively stringent, yet still included a reasonable number of variables for consideration. In summary, to select a final set of variables, we included only features that were selected by the bootstrap LASSO and were also among the top 1% of features ranked by linear regression p-values and random forest variable importance.

When a set of features was selected that satisfied all criteria, the extracted ion chromatographs were visually inspected and those with poor peak morphology (ill-defined Gaussian shape) or integration were removed. Then, the three variable selection methods were repeated as needed to arrive at a final set of selected features with good peak morphology and integration.

Initially, only covariates on which the samples were matched (i.e. age and gender) were included in the models used for variable selection. We did not include other dietary or health related covariates because we did not want to obscure possible associations between the metabolites and case-control status. However, we did subsequently test for associations between the nine selected metabolites and the following covariates weight, bmi, smoking status, and consumption of beef, pork, and alcohol. As shown in Additional file 1, only one feature (ID 839) was marginally associated with any of the covariates (i.e. bmi and consumption of beef). Also, when each of these covariates was added to the LASSO model (Eq. (2)) with case-control status as the binary outcome variable none of them was selected by the regularized regression.


Features that discriminate for CRC

After applying the three variable-selection methods described above, two of which prioritize predictive ability (LASSO and random forest), nine features were selected. The volcano plot in Fig. 1 relates case-control fold changes and –log10 p-values (for the model in Eq. 2) for all features and highlights the nine selected metabolites (shown in Table 3). Case-control fold-changes ranged from approximately 0.2 to 3.0 overall and between 0.40 and 1.40 for the nine selected features. Due to the nature of our variable selection method, the p-values of the selected features were not necessarily the smallest, nor were their fold-changes necessarily the largest. Nevertheless, the nine selected metabolites resulted in a 79% correct classification rate when they were used to fit a logistic regression model on the learning set to predict case-control status. Although this correct classification is likely optimistic because the same data were used to perform the variable selection and to build and test the predictor, the selected features are worthy of validation in independent samples of CRC cases and controls from prospective cohorts.

Fig. 1
figure 1

Volcano plot of analyzed features; the nine selected features are highlighted in red with the ID labels from Table 3. An arbitrary p-value = 0.05 threshold line is present for reference. p-values and fold-changes are calculated based on the regression model in Eq. (2)

Table 3 Untargeted features selected as predictors of case-control status

Potentially causal and reactive biomarkers

The nine selected features were evaluated to determine their associations with time to diagnosis (ttd) as a means of discerning whether they represent potentially causal exposures or reactive effects of disease progression [22]. If the log fold-change for a given feature was constant across the whole range of ttd in a linear model (p-value > 0.05), the feature was classified as potentially causal (C) and if the case-control difference decreased with increasing ttd, the feature was classified as potentially reactive (R). These (C) and (R) classifications are listed in Table 3 for the nine selected features and the plots for the ttd linear models are shown in Fig. 2. (See Additional file 1 for regression coefficients and p-values). This process resulted in four potentially causal features (no apparent effect of ttd for IDs: 5080, 3207, 6054 and 839), four potentially reactive features (case-control differences diminish with ttd for IDs: 235, 4250, 4294 and 14,963), and one feature that could not be classified as either (C) or (R) (case control differences increased with ttd, ID 5749). The four potentially causal metabolites resulted in a 72% correct classification rate to predict case-control status (with the same caveats mentioned above). While the four potentially causal features may be linked to exposures that contribute to CRC, the four reactive features may be useful pre-diagnostic biomarkers.

Fig. 2
figure 2

Scatterplots of case-control log fold-change vs. time to diagnosis (ttd) for the 9 selected features. The blue line is the linear regression fit and the gray band represents a 95% confidence intervals, calculated with the ‘lm’ method of the R function ‘geom_smooth’ in the package ‘ggplot2’


Using untargeted metabolomics in serum samples from 66 pairs of CRC cases and controls from the EPIC cohort, we sought evidence linking lipophilic molecules with the etiology of CRC. The LC-MS data collected from these samples included over 24,000 features. After filtering for noise (mean fold-change above blank samples), reproducibility (CVs), and likely artifacts (CAMERA), 8690 features were available for evaluating potential associations with CRC case status.

It has been standard practice in metabolomics to identify features that discriminate for case-control status using a multiple-testing approach, e.g., based on a cutoff for p-values that have been adjusted to control for a false positive error rate such as the FDR [63]. Since untargeted metabolomics can detect thousands of features, FDR correction is severe [63] and can drastically reduce the number of selected metabolites, thereby resulting in false negatives. Thus, we shifted our paradigm to a variable-selection approach, based on an ensemble of diverse regression methods, in order to uncover a reliable set of features for further investigation. This led to selection of 9 features that discriminated for CRC (Table 3) with a correct classification of 79%. Based on regressions of case/control fold changes on ttd (Fig. 2), four of these features appear to be related to causal factors for CRC, four appear to be related to cancer progression, and one is indeterminate.

As mentioned earlier, the top 1 % of features in terms of smallest p-values were considered for variable selection. However, many of these roughly 90 features were not also high ranking in terms of predictive importance, as assessed by random forest and LASSO. Indeed, inspection of data from the features with very small p-values found some to have small standard errors rather than meaningful case-control differences, while other features with large fold changes had great uncertainties in fold-change estimates. This reinforces the value of using an ensemble of regression methods to evaluate biological variability rather than a single measure such as a small p-value or a large case-control fold change.

Possible annotations

Potential annotations of the nine selected features were based on comparisons of MS2 spectra with human metabolome database (HMDB) entries as summarized in Table 4. Focusing first on the four potentially causal features shown in Table 3, MS2 were only obtained for IDs 3207 and 6054, which were positively correlated (Pearson correlation coefficient of 0.64) and were both present at lower levels in cases than in controls, indicating possibly protective effects. The two MS2 fragment ions detected for ID 3207 could not be identified. However, ID 6054 had fragments characteristic of [C16H29O2] and losses of two H2O molecules, consistent with the loss of two hydroxyl groups and a hexadecenoic acetate fragment. These fragments are suggestive of ceramide lipids [64], a class of molecules that has been implicated in the regulation of cancer cells [65]. Although IDs 3207 and 6054 and the two other potentially causal features (IDs 5080 and 839) could not be fully annotated, the identified characteristics of accurate mass, retention time, and MS2 fragments can be used for validation in future studies.

Table 4 Results of tandem MS/MS analyses of features associated with case-control status

Turning now to the likely reactive features, ID 4294 was putatively identified as ULCFA 468, which had been evaluated separately in our targeted study of 8 ULFCAs [22] and had been first reported by Ritchie et al. [24]. This feature had neutral losses of H2O and CO2, characteristic of a hydroxylated fatty acid and a likely molecular formula for [M-H] of C28H52O5, within 1.56 ppm of the exact mass. Another reactive feature (ID 4250) also displayed these characteristic neutral losses of H2O and CO2 and was highly correlated with ID 4294 with a correlation coefficient of 0.85. This suggests that ID 4250 is a previously uncharacterized ULCFA with molecular formula for [M-H] of C27H50O5, within − 2.76 ppm. While odd-numbered fatty acids are less common in humans, microbial single carbon metabolism in very long chain fatty acids has been reported [66]. As a class, ULCFAs tend to be present at higher levels among controls compared to paired cases, but this difference diminishes with ttd, suggesting that they result from disease progression [22]. Nonetheless, the fact that these two ULCFAs were selected from approximately 9000 features that survived filtering of the untargeted metabolomics data offers partial validation to our variable-selection strategy. Based on correlation maps (data not shown) both of these features clustered with five other ULCFAs that have been described by Ritchie, et al. [24] (ULCFAs 465, 466, 492, 518, and 538; exact masses within 10 ppm of calculated m/z), and were also analyzed in our targeted study [22]. Another selected feature (ID 235) exhibited similar reactive (R) behavior to the ULCFAs (Fig. 2), and the presence of a neutral loss of CO2, indicating that ID 235 may be a fatty or bile acid. Deoxycholic acid (3α, 12α-dihydroxy-5β-cholanic acid) [M-H], chenodeoxycholic acid (3α, 7α-dihydroxy-5β-cholanic acid) [M-H], and adrenic acid [M + HAc-H] were eliminated as possible annotations of feature 235 by comparison of retention times between the experimental data and analytical standards. However, these two tested molecules are just two isomers of a large class of bile acids, some of which are positively correlated with CRC [25, 67]; in our study, feature 235 was negatively correlated with CRC.


Limitations of this study include the small sample size, which reduced the power to detect differences between case-control pairs, and lack of information regarding aspirin consumption and a family history of CRC, two covariates that have been associated with CRC incidence [1, 68]. Any bias (and potential confounding) introduced by the gelling of some samples from the cryostraws should have been removed by adjustment via Eq. (1). Nonetheless, gelling complicated sample processing and was, therefore, a source of random variation that probably reduced our ability to detect differences between cases and controls. Gelling of EPIC serum resulted from a ‘gel powder’ that had been added to seal one end of the cryostraw ( This illustrates how proper storage of biological specimens for decades is challenging because preservation of cells, DNA, RNA, proteins and small molecules must be considered. However, decades later, shortcomings of then-contemporary technology (such as gelling of serum) can be revealed and their reporting can improve the design of future investigations.


In summary, of the nearly 9000 filtered features subjected to statistical analysis, four appear to be potentially causal features that are worthy of following up in an independent set of prospective CRC cases and controls. When these four features alone were used to build a logistic regression predictor of case/control status on the learning set, they resulted in a correct classification rate of 72%. Again, this is likely an optimistic correct classification rate, but given that only four features were used for prediction, it is quite promising. Four other selected features, notably some ULCFAs and related fatty acids, appear to be products of disease progression and, therefore, could be useful diagnostic biomarkers for early detection of CRC. Since ULCFAs had previously been shown to discriminate CRC cases from controls in several cross-sectional investigations, it is reassuring that two putative ULCFAs (IDs 4294 and 4250) were selected as predictive features in this untargeted analysis. While the relatively modest number of samples limited the power to detect other associations, the nine features selected in our study correctly predicted case-control status in 79% of the samples. The stability of these features across three disparate feature-selection methods is promising. Furthermore, based on m/z and annotation information, these nine features appear to be different than those reported in the prospective CRC study by Cross et al. [25], warranting further identification and validation.



Body mass index


Colorectal cancer


Coefficient of variation


European prospective investigation into cancer and nutrition


False discovery rate


Human Metabolome Database


Least absolute shrinkage and selection operator


Liquid chromatography mass spectrometry


Poly-unsaturated fatty acid



ttd :

Time to (case) diagnosis


Ultra-long chain fatty acid


  1. Arnold M, Sierra MS, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global patterns and trends in colorectal cancer incidence and mortality. Gut. 2016;0:1–9.

    Google Scholar 

  2. Haggar FA, Boushey RP. Colorectal Cancer Epidemiology : incidence, mortality, survival, and risk factors. Clin Colon Rectal Surg. 2009;22:191–7.

    Article  PubMed  Google Scholar 

  3. American Cancer Society. Cancer facts and figures 2015; 2015. p. 1–9.

    Google Scholar 

  4. Hemminki K, Czene K. Attributable risks of familial Cancer from the family-Cancer database attributable risks of familial Cancer from the family-Cancer Database 1; 2002. p. 1638–44.

    Google Scholar 

  5. Rappaport SM. Genetic factors are not the major causes of chronic diseases. PLoS One. 2016;11:e0154387.

    Article  PubMed  Google Scholar 

  6. Gassler N, Klaus C, Kaemmerer E, Reinartz A. Modifier-concept of colorectal carcinogenesis: Lipidomics as a technical tool in pathway analysis. World J Gastroenterol. 2010;16:1820–7.

    Article  CAS  PubMed  Google Scholar 

  7. Stone WL, Krishnan K, Campbell SE, Palau VE. The role of antioxidants and pro-oxidants in colon cancer. World J Gastrointest Oncol. 2014;6:55–66.

    Article  PubMed  Google Scholar 

  8. Leufkens AM, Van Duijnhoven FJB, Siersema PD, Boshuizen HC, Vrieling A, Agudo A, et al. Cigarette smoking and colorectal cancer risk in the European prospective investigation into Cancer and nutrition study. Clin. Gastroenterol Hepatol. 2011;9:137–44.

    Article  Google Scholar 

  9. Platz EA, Willett WC, Colditz GA, Rimm EB, Spiegelman D, Giovannucci E. Proportion of colon cancer risk that might be preventable in a cohort of middle-aged US men. Cancer Causes Control. 2000;11:579–88.

    Article  CAS  Google Scholar 

  10. Oostindjer M, Alexander J, Amdam GV, Andersen G, Bryan NS, Chen D, et al. The role of red and processed meat in colorectal cancer development : a perspective. MESC. 2014;97:583–96 Elsevier BV.

    Google Scholar 

  11. Rothwell PM, Fowkes FGR, Belch JFF, Ogawa H, Warlow CP, Meade TW. Effect of daily aspirin on long-term risk of death due to cancer: analysis of individual patient data from randomised trials. Lancet. 2010;377:31–41 Elsevier Ltd.

    Article  Google Scholar 

  12. Ma Y, Zhang P, Wang F, Yang J, Liu Z, Qin H. Association between vitamin D and risk of colorectal Cancer : a systematic review of prospective studies. J Clin Oncol. 2011;3775–82.

    Article  CAS  Google Scholar 

  13. Terzić J, Grivennikov S, Karin E, Karin M. Inflammation and Colon Cancer. Gastroenterology. 2010;138:2101–14.

    Article  PubMed  Google Scholar 

  14. O'Keefe S. Diet, microorganisms and their metabolites, and colon cancer. Nat Rev Gastroenterol Hepatol. 2016;13:691–706.

    Article  CAS  Google Scholar 

  15. Vipperla K, O’Keefe S. Diet, microbiota, and dysbiosis: a “recipe” for colorectal cancer. Food Funct. 2016;7:1731–40 Royal Society of Chemistry.

    Article  CAS  Google Scholar 

  16. Norat T, Bingham S, Ferrari P, Slimani N, Jenab M, Mazuir M, et al. Meat, fish, and colorectal cancer risk : the European prospective investigation into cancer and nutrition. J Natl Cancer Inst. 2005;97:906–16.

    Article  PubMed  Google Scholar 

  17. Chao A, Connell CJ, Mccullough ML, Jacobs EJ, Flanders WD, Rodriguez C, et al. Meat consumption and risk of colorectal Cancer. J Am Med Assoc. 2005:293, 172–82.

    Article  CAS  Google Scholar 

  18. Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A. The blood exposome and its role in discovering causes of disease. Environ Health Perspect. 2014;122:769–74.

    Article  PubMed  Google Scholar 

  19. van Duijnhoven FJ, Bueno-De-Mesquita HB, Calligaro M, Jenab M, Pischon T, Jansen EH, Frohlich J, Ayyobi A, Overvad K, Toft-Petersen AP, Tjønneland A. Blood lipid and lipoprotein concentrations and colorectal cancer risk in the European Prospective Investigation into Cancer and Nutrition. Gut. 2011;60(8):1094–102.

  20. Ritchie SA, Heath D, Yamazaki Y, Grimmalt B, Kavianpour A, Krenitsky K, et al. Reduction of novel circulating long-chain fatty acids in colorectal cancer patients is independent of tumor burden and correlates with age. BMC Gastroenterol. 2010;10:140 BioMed Central Ltd.

    Article  CAS  PubMed  Google Scholar 

  21. Ritchie SA, Tonita J, Alvi R, Lehotay D, Elshoni H, Su-Myat, et al. Low-serum GTA-446 anti-inflammatory fatty acid levels as a new risk factor for colon cancer. Int J Cancer. 2013;132:355–62.

    Article  CAS  Google Scholar 

  22. Perttula K, Edmands WMB, Grigoryan H, Cai X, Iavarone AT, Gunter MJ, et al. Evaluating ultra-long-chain fatty acids as biomarkers of colorectal Cancer risk. Cancer Epidemiol Biomark Prev. 2016;25:1216–24.

    Article  CAS  Google Scholar 

  23. Nishiumi S, Kobayashi T, Ikeda A, Yoshie T, Kibi M, Izumi Y, et al. A novel serum metabolomics-based diagnostic approach for colorectal cancer. PLoS One. 2012;7:1–10.

    Article  Google Scholar 

  24. Ritchie, SA, Ahiahonu PWK, Jayasinghe D, Heath D, Liu J, Lu Y, et al. Reduced levels of hydroxylated, polyunsaturated ultra long-chain fatty acids in the serum of colorectal cancer patients: implications for early screening and detection. BMC Med. 2010;8:13.

    Article  PubMed  Google Scholar 

  25. Cross AJ, Moore SC, Boca S, Huang WY, Xiong X, Stolzenberg‐Solomon R, Sinha R, Sampson JN. A prospective study of serum metabolites and colorectal cancer risk. Cancer. 2014;19(120):3049–57.

    Article  CAS  PubMed  Google Scholar 

  26. Bae S, Ulrich CM, Neuhouser ML, Malysheva O, Bailey LB, Xiao L, et al. Plasma choline metabolites and colorectal cancer risk in the women’s health initiative observational study. Cancer Res. 2014;74:7442–52.

    Article  CAS  PubMed  Google Scholar 

  27. Nitter M, Norgård B, De VS, Eussen SJPM, Meyer K, Ulvik A, et al. Plasma methionine, choline, betaine, and dimethylglycine in relation to colorectal cancer risk in the European prospective investigation into Cancer and nutrition ( EPIC ). Ann Oncol. 2014;25:1609–15.

    Article  CAS  Google Scholar 

  28. Xu R, Wang Q, Li L. A genome-wide systems analysis reveals strong link between colorectal cancer and trimethylamine N-oxide ( TMAO ), a gut microbial metabolite of dietary meat and fat. BMC Genomics. 2015;16:1–9.

    Article  Google Scholar 

  29. Wang Z, Klipfell E, Bennett BJ, Koeth R, Levison BS, Dugar B, et al. Gut flora metabolism of phosphatidylcholine promotes cardiovascular disease. Nature. Nat Publ Group. 2011;472:57–63.

    CAS  Google Scholar 

  30. Zackular JP, Rogers MAM, Ruf MT, Schloss PD. The human gut microbiome as a screening tool for colorectal Cancer. Cancer Prev Res. 2014;7:1112–22.

    Article  CAS  Google Scholar 

  31. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. 1995;57:289–300.

    Google Scholar 

  32. Santos CR, Schulze A. Lipid metabolism in cancer. FEBS J. 2012;279:2610–23.

    Article  CAS  Google Scholar 

  33. Gassler N, Klaus C, Kaemmerer E, Reinartz A, Gassler N, Klaus C, et al. Modifier-concept of colorectal carcinogenesis : Lipidomics as a technical tool in pathway analysis. 2010;16:1820–7.

  34. Lydic TA, Townsend S, Adda CG, Collins C, Mathivanan S, Reid GE. Rapid and comprehensive “shotgun” lipidome profiling of colorectal cancer cell derived exosomes. Methods. 2015;87:83–95 Elsevier Inc.

    Article  CAS  PubMed  Google Scholar 

  35. Friedman J, Hastie T, Tibshirani RJ. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.

    Article  PubMed  Google Scholar 

  36. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.

    Google Scholar 

  37. Huang BFF, Boutros PC. The parameter sensitivity of random forests. BMC Bioinformatics. 2016;17:1–13.

    Article  Google Scholar 

  38. Saeys Y, Inza I, Larranaga P. Gene expression a review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507–17.

    Article  CAS  Google Scholar 

  39. Riboli E, Hunt KJ, Slimani N, Ferrari P, Norat T, Fahey M, et al. European prospective investigation into Cancer and nutrition (EPIC): study populations and data collection. Public Health Nutr. 2002;5:1113–24.

    Article  CAS  PubMed  Google Scholar 

  40. Williams MD, Reeves R, Resar LS, Hill HH. Metabolomics of colorectal cancer: past and current analytical platforms. Anal Bioanal Chem. 2013;405:5013–30.

    Article  CAS  Google Scholar 

  41. Nolen BM, Brand RE, Prosser D, Velikokhatnaya L, Allen PJ, Zeh HJ, et al. Prediagnostic serum biomarkers as early detection tools for pancreatic cancer in a large prospective cohort study. PLoS One. 2014;9:e94928.

    Article  PubMed  Google Scholar 

  42. Moore LL, Bradlee ML, Singer MR, Splansky GL, Proctor MH, Ellison RC, et al. BMI and waist circumference as predictors of lifetime colon cancer risk in Framingham study adults. Int J Obes Relat Metab Disord. 2004;28:559–67.

    Article  CAS  Google Scholar 

  43. Aleksandrova K, Boeing H, Jenab M, Bueno-de-Mesquita HB, Jansen E, Van Duijnhoven FJB, et al. Metabolic syndrome and risks of colon and rectal cancer: the european prospective investigation into cancer and nutrition study. Cancer Prev Res. 2011;4:1873–83.

    Article  CAS  Google Scholar 

  44. Talwar P. Manual of assisted reproductive technologies and clinical embryology: Jaypee brothers medical Publisher Pvt. Limited; New Delhi.2014.

  45. Saint-Ramon J-G, Beau C, Ehrsam A. Tubes for conservation biological particle; for use as tool in biological sampling: Google Patents; US 2002O188222A1. 2001.

  46. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G. LC / MS preprocessing and analysis with xcms. Anal Chem. 2006;78:779–87.

    Article  CAS  Google Scholar 

  47. Benton HP, Wong DM, Trauger S a, Siuzdak G. XCMS 2 : processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal Chem. 2008;80:6382–9.

    Article  CAS  PubMed  Google Scholar 

  48. R Development Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2013. URL

    Google Scholar 

  49. Kuhl C, Tautenhahn R, Christoph B, Larson TR, Neumann S. CAMERA: an integrated strategy for compound spectra extraction and annotation of LC/MS data sets. Anal Chem. 2011;84:283–9.

    Article  PubMed  Google Scholar 

  50. Stanstrup J, Gerlich M, Dragsted LO, Neumann S. Metabolite profiling and beyond: approaches for the rapid processing and annotation of human blood serum mass spectrometry data. Anal Bioanal Chem. 2013;405:5037–48.

    Article  CAS  Google Scholar 

  51. Edmands WMB, Petrick LM, Barupal DK, Scalbert A, Wilson M, Wickliffe J, et al. compMS2Miner : an automatable metabolite identification, visualization and data-sharing R package for high-resolution LC-MS datasets. Anal Chem. 2017;89:3919–28.

    Article  CAS  Google Scholar 

  52. Wishart DS, Tzur D, Knox C, Eisner R, Guo AC, Young N, Cheng D, Jewell K, Arndt D, Sawhney S, Fung C. HMDB: the human metabolome database. Nucleic Acids Res. 2007;35(suppl_1):D521–6.

    Article  CAS  PubMed  Google Scholar 

  53. Smith CA, O’Maille G, Want EJ, Qin C, Trauger SA, Brandon TR, et al. METLIN a metabolite mass spectral database. Proc 9Th Int Congr Ther Drug Monit Clin Toxicol. 2005;27:747–51.

    Article  CAS  Google Scholar 

  54. Veselkov KA, Vingara LK, Masson P, Robinette SL, Want E, Li JV, et al. Optimized preprocessing of ultra-performance liquid chromatography/mass spectrometry urinary metabolic profiles for improved information recovery. Anal Chem. 2011;83:5864–72.

    Article  CAS  Google Scholar 

  55. Dunn WB, Broadhurst D, Begley P, Zelena E, Francis-McIntyre S, Anderson N, et al. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat Protoc. 2011;6:1060–83.

    Article  CAS  Google Scholar 

  56. Bolstad BM, Irizarry RA. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–93.

    Article  CAS  Google Scholar 

  57. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.

    Article  PubMed  Google Scholar 

  58. Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–504.

    Article  CAS  PubMed  Google Scholar 

  59. Broeckling CD, Afsar FA, Neumann S, Prenni JE. RAMClust: A Novel Feature Clustering Method Enables Spectral- Matching-Based Annotation for Metabolomics Data. 2014;

    Google Scholar 

  60. Tibshirani R. Regression Selection and Shrinkage via the Lasso. J R Stat Soc B. 1996;58:267–88.

    Google Scholar 

  61. Bach FR, Project-team IW. Bolasso : Model Consistent Lasso Estimation through the Bootstrap. 2008;

    Book  Google Scholar 

  62. Breiman L. Random forests. Mach Learn. 2001;45:5–32.

    Article  Google Scholar 

  63. Broadhurst DI, Kell DB. Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics. 2006;2:171–96.

    Article  CAS  Google Scholar 

  64. Byrdwell WC. Dual parallel liquid chromatography with dual mass spectrometry (LC2/MS2) for a total lipid analysis. Front Biosci. 2008;13:100–20.

    Article  CAS  Google Scholar 

  65. Ogretmen B. Sphingolipid metabolism in cancer signalling and therapy. Nat Rev Cancer. 2017;18:33–50.

    Article  PubMed  Google Scholar 

  66. Rezanka T, Sigler K. Odd-numbered very-long-chain fatty acids from the microbial, animal and plant kingdoms. Prog Lipid Res. 2009;48:206–38.

    Article  CAS  Google Scholar 

  67. Weir TL, Manter DK, Sheflin AM, Barnett BA, Heuberger AL, Ryan EP. Stool microbiome and metabolome differences between colorectal Cancer patients and healthy adults. PLoS One. 2013;8:e70803.

    Article  CAS  PubMed  Google Scholar 

  68. Agle SC, Philips P, Martin RCG. Environmental exposures, tumor heterogeneity, and colorectal Cancer outcomes. Curr Colorectal Cancer Rep. 2014;10:189–94.

    Article  Google Scholar 

Download references


We gratefully acknowledge the assistance of Agilent Technologies (Santa Clara, CA, USA) for the loan of the high-resolution mass spectrometer used in these analyses. We also thank the subject cases and controls for their participation in EPIC.


Support for this work was provided by grant P42ES04705 from the National Institute for Environmental Health Sciences (National Institutes of Health) (to S.M.R.) and grant agreement 308610-FP7 from the European Commission (Project Exposomics) (to P.V.).

Availability of data and materials

The datasets generated and/or analysed during the current study are not publicly available since the covariates originated from a third party (EPIC, IARC), but are available from the corresponding author on reasonable request and approval of the third party.

Author information

Authors and Affiliations



KP, WBE, PV and SR conceived and designed the study. XC, KP, WBE, and HG developed LCMS methodology. KP and WBE performed experiments and acquired the LCMS data. KP, WBE, LP, and XC analyzed LCMS features. CS, KP, SD, and SR performed the normalization and statistical analysis. PV, MG, AN, SP and MG provided and interpreted the metadata from the EPIC study that were required for statistical analyses. KP, CS, and SR wrote the manuscript. All authors contributed to the final manuscript and approved its content.

Corresponding author

Correspondence to Stephen M Rappaport.

Ethics declarations

Ethics approval and consent to participate

The study was approved by the ethics review board of the International Agency for Research on Cancer (IARC). Informed written consent was provided by all participants at the participating centers; for this study, the center in Turin, Italy.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Additional statistical analyses of the nine selected metabolomic features. Table S1. Covariate associations with the nine selected features; Table S2. Time-to-diagnosis model coefficients and p-values for the nine selected features. (PDF 60 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Perttula, K., Schiffman, C., Edmands, W.M.B. et al. Untargeted lipidomic features associated with colorectal cancer in a prospective cohort. BMC Cancer 18, 996 (2018).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: