University of Tasmania
File(s) under permanent embargo

Random Forests machine learning applied to gas chromatography – mass spectrometry derived average mass spectrum data sets for classification and characterisation of essential oils

journal contribution
posted on 2023-05-20, 10:13 authored by Leo Lebanov, Tedone, L, Alireza Ghiasvand, Brett Paull

Differences in the chemical profiles of essential oils (EOs) arise because each plant species and chemotype has a distinctive secondary metabolism. These differences can therefore serve as chemical markers for EO classification and quality determination. Herein, the Random Forests (RF) machine learning algorithm was applied to the classification of 20 different EOs. From three-way raw gas chromatography – mass spectrometry data, total chromatogram average mass spectra (TCAMS) and segment average mass spectra (SAMS) were created. TCAMS was generated by averaging the response of each m/z over the whole chromatogram, and SAMS by averaging the response of each fragment across a certain time segment within the chromatogram. The RF model was applied to the two data sets and optimised through the evaluation of pre-processed data, number of trees, and number of variables used in each node split. The performance of the model was evaluated through a cross-validation process, repeated 50 times by dividing the whole sample set into training and validation subsets. The average out-of-bag error (OOBE) calculated over 50 different training TCAMS data sets was 3.22 ± 1.29%, while for SAMS it was 2.28 ± 1.33%. The minimal number of variables necessary for EO classification was determined by a nested cross-validation process in which 10% of the variables were removed at each step. The TCAMS data set with 6 variables had predictive power similar to that of the SAMS data set with 30 variables. The OOBE for classification of the 20 EOs was 2.89 ± 1.44% and 3.70 ± 1.73% for TCAMS and SAMS, respectively. Proximity between samples was used to evaluate their quality. Greater intra-class proximity indicated closely similar samples, while lower proximity indicated greater variation in the chemical profiles. The SAMS data set showed superior potential for quality assurance, compared with TCAMS.
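The two averaging steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one sample is stored as a list of scans ordered by retention time, each scan holding one intensity per m/z channel, and it assumes SAMS uses equal-width time segments whose averaged spectra are concatenated into one feature vector. The function names `tcams` and `sams` and the toy data are hypothetical.

```python
from typing import List

Scan = List[float]         # intensity of each m/z channel at one retention time
Chromatogram = List[Scan]  # scans ordered by retention time (assumed layout)


def tcams(chrom: Chromatogram) -> List[float]:
    """Average the response of each m/z channel over all scans."""
    if not chrom:
        raise ValueError("chromatogram has no scans")
    n_scans = len(chrom)
    n_mz = len(chrom[0])
    return [sum(scan[j] for scan in chrom) / n_scans for j in range(n_mz)]


def sams(chrom: Chromatogram, n_segments: int) -> List[float]:
    """Average each m/z channel within consecutive time segments.

    Assumption: segments are equal-width and the per-segment spectra are
    concatenated; the abstract does not specify how segments were chosen.
    """
    n_scans = len(chrom)
    if not 1 <= n_segments <= n_scans:
        raise ValueError("n_segments must be between 1 and the number of scans")
    features: List[float] = []
    for s in range(n_segments):
        start = s * n_scans // n_segments
        stop = (s + 1) * n_scans // n_segments
        features.extend(tcams(chrom[start:stop]))
    return features


# Toy data (made up): 4 scans x 3 m/z channels
toy = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [5.0, 4.0, 2.0],
    [7.0, 8.0, 2.0],
]
tcams_vec = tcams(toy)    # one value per m/z channel
sams_vec = sams(toy, 2)   # two segments x three m/z channels
print(tcams_vec)
print(sams_vec)
```

Under these assumptions, each sample contributes one row to the TCAMS feature table and one longer row to the SAMS table, and either table could then be passed to a Random Forests classifier.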

History

Publication title

Talanta

Volume

208

Article number

120471

Pagination

1-12

ISSN

0039-9140

Department/School

School of Natural Sciences

Publisher

Elsevier Science BV

Place of publication

Amsterdam, Netherlands

Rights statement

© 2019 Elsevier B.V. All rights reserved.

Repository Status

  • Restricted

Socio-economic Objectives

Expanding knowledge in the chemical sciences
