Extended Abstract
Background: Sorghum, a C4 plant, is relatively tolerant to various abiotic stresses. However, its performance is significantly affected by temperatures above 32 °C and below 15 °C. Identifying key genes from gene expression data with feature selection methods is a valuable approach to understanding stress tolerance. Feature selection uses statistical and computational algorithms to retain the genes most relevant to the class of interest. Filter-based methods, which score features independently of any machine-learning classifier, offer a fast and efficient way to identify relevant features. Combining multiple filter methods allows for a more robust selection of key genes involved in sorghum’s response to temperature stress. Therefore, this study aimed to identify key genes involved in the cold and heat stress response in sorghum using transcriptomic data and three filter-based methods: Information Gain, Gain Ratio, and Relief.
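As an illustration of how two of these filter scores are defined, the following is a minimal sketch in Python, not the pipeline used in the study. It assumes expression values have already been discretized into levels such as "low" and "high"; that discretization, the toy labels, and the function names are assumptions made for this example only. Relief is not shown.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """H(class) - H(class | feature) for one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the class entropy within each feature level.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return information_gain(feature, labels) / split_info


# Hypothetical toy data: four samples, two classes.
labels = ["control", "control", "stress", "stress"]
gene_a = ["low", "low", "high", "high"]  # separates the classes perfectly
gene_b = ["low", "high", "low", "high"]  # unrelated to the classes

for name, gene in [("gene_a", gene_a), ("gene_b", gene_b)]:
    print(name, information_gain(gene, labels), gain_ratio(gene, labels))
```

Neither score depends on a classifier, which is what makes these methods filters. Gain Ratio divides by the entropy of the feature so that features with many distinct levels are not favored merely for splitting the samples finely.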
Methods: Gene expression count data were obtained from the GEO database (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE225632. The analysis focused on sorghum shoot data under control conditions and temperature stress at four different times of day. The data were divided into two classes, control and stress (cold and heat), and differentially expressed genes were identified using the DESeq2 package in R. Subsequently, top genes were selected using three feature selection algorithms (Information Gain, Gain Ratio, and Relief), and a Venn diagram was used to examine the overlap among the genes identified by each algorithm. Two machine-learning algorithms, Bayes Net and Random Forest, were employed for validation. Both were run in WEKA 3.7, and their performance in classifying samples based on the identified features was compared using True Positive Rate (TP Rate), False Positive Rate (FP Rate), Precision, Recall, F1 score, Matthews Correlation Coefficient (MCC), Area Under the ROC Curve (ROC AUC), and Area Under the Precision-Recall Curve (PRC AUC). A confusion matrix was used to display classification errors.
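To make the per-class metrics concrete, the sketch below derives TP Rate, FP Rate, Precision, F1 score, and MCC from a multi-class confusion matrix using a one-vs-rest view of each class. It is an illustrative Python example rather than the WEKA implementation; the function names and the counts in the toy matrix are hypothetical and are not the study's results.

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def per_class_metrics(confusion, k):
    """One-vs-rest metrics for class k.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                  # class k predicted as another class
    fp = sum(row[k] for row in confusion) - tp   # another class predicted as k
    tn = total - tp - fn - fp
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)               # recall is the same as TP Rate
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "tp_rate": recall,
        "fp_rate": safe_div(fp, fp + tn),
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts; rows are true classes, columns are predictions,
# both in the order control, cold, heat.
toy_confusion = [
    [10, 0, 0],
    [1, 9, 0],
    [0, 0, 10],
]

for index, name in enumerate(["control", "cold", "heat"]):
    print(name, per_class_metrics(toy_confusion, index))
```

In this toy matrix one cold-stress sample is predicted as control, so the control class keeps a TP Rate of 1 while its Precision falls below 1, and the cold-stress class shows the reverse pattern.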
Results: Gene expression changes were first analyzed by comparing control conditions with temperature stress (cold and heat). Among the 34,129 genes examined, 2,136 genes with significant expression changes at the 0.05 level and a log2FoldChange greater than 1 were selected and used in subsequent feature selection and machine-learning analyses. Key genes responsive to temperature stress were identified using three feature selection algorithms, with the top 50 genes extracted from each algorithm’s ranking. Nine genes were identified by all three feature selection approaches. The two classification models were then evaluated on three classes (control, cold stress, and heat stress). The Bayes Net algorithm showed high discriminative accuracy: a TP Rate of 1, an FP Rate of 0.21, and a Precision of 0.980 were obtained for the control class; a Precision of 1 and a TP Rate of 0.958 were achieved for the cold-stress class; and both accuracy and TP Rate were 1 for the heat-stress class. The Random Forest algorithm also demonstrated strong discriminative power. A correct classification rate of 1 and a Precision of 0.96 were observed for the control class, and a correct classification rate of 0.958 and a Precision of 1 were obtained for the cold-stress and heat-stress classes, indicating robust performance in accurately identifying stressed samples.
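The selection logic described above (threshold filtering followed by a top-k consensus across rankers) can be sketched as follows. This Python example is a hedged illustration, not the authors' code. It assumes that the 0.05 level refers to an adjusted p-value and that the fold-change cutoff applies to the absolute log2 fold change; the abstract states neither explicitly. Gene names, scores, and function names are hypothetical.

```python
def filter_de_genes(results, alpha=0.05, min_abs_lfc=1.0):
    """Keep genes passing both thresholds.

    results maps gene -> (log2_fold_change, adjusted_p_value).
    Using the adjusted p-value and the absolute fold change is an
    assumption of this sketch.
    """
    return [
        gene
        for gene, (lfc, padj) in results.items()
        if padj < alpha and abs(lfc) > min_abs_lfc
    ]


def top_k(scores, k):
    """Return the set of the k highest-scoring genes."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def consensus_genes(rankings, k):
    """Genes that appear in the top k of every ranking method."""
    return set.intersection(*(top_k(scores, k) for scores in rankings.values()))


# Hypothetical differential expression results.
toy_results = {
    "g1": (2.3, 0.001),
    "g2": (-1.8, 0.010),
    "g3": (0.4, 0.0001),  # significant, but the fold change is too small
    "g4": (3.0, 0.200),   # large fold change, but not significant
    "g5": (1.5, 0.030),
}
selected = filter_de_genes(toy_results)

# Hypothetical feature scores for the selected genes.
toy_rankings = {
    "information_gain": {"g1": 0.9, "g2": 0.7, "g5": 0.2},
    "gain_ratio": {"g1": 0.8, "g2": 0.3, "g5": 0.6},
    "relief": {"g1": 0.5, "g2": 0.4, "g5": 0.1},
}

print(selected)
print(consensus_genes(toy_rankings, k=2))
```

With k set to 50 and the three real rankings, the intersection corresponds to the central region of the Venn diagram, which is where the nine shared genes reported above would fall.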
Conclusion: This study demonstrates that identifying and analyzing key genes involved in sorghum’s temperature stress response can provide insights into the biological pathways and regulatory networks activated under such conditions. Nine out of the 2,136 differentially expressed genes were consistently identified by three different selection algorithms. These genes can serve as potential molecular markers, but further biological validation is necessary across different sorghum varieties. The high accuracy of Bayes Net and Random Forest confirms the strength of these models in distinguishing gene expression patterns between stressed and control conditions. Homology analysis of genes, such as Sobic.004G283300, Sobic.010G100600, and Sobic.006G093500, in Arabidopsis and maize supports their role in heat stress response. However, six genes (Sobic.010G128900, Sobic.001G093100, Sobic.007G168100, Sobic.002G269100, Sobic.006G183701, and Sobic.002G047800) remain uncharacterized, with no documented molecular function. Further research is required to explore the roles of these genes in physiological and stress-related processes. Understanding their functions could contribute to breeding sorghum varieties that are more resilient to environmental stresses, ultimately supporting sustainable agriculture. Field-based and experimental validation of these molecular markers is also recommended to confirm their applicability under real-world farming conditions.
Rights and permissions: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.