ANN in Dimensionality Reduction of Gene Expression Data
By R R Ojha
Artificial neural networks (ANNs) are a class of machine learning models that are widely used in many domains, including bioinformatics. Gene expression data is central to many biological studies, but it is high-dimensional, complex, and noisy, which makes it difficult to extract meaningful information from it. ANNs offer a powerful tool for reducing the dimensionality of such data.
Dimensionality reduction is the process of reducing the number of features or variables in a dataset while retaining as much relevant information as possible. Its main goal is to simplify the dataset so that downstream analysis becomes more computationally efficient and easier to interpret. Gene expression data can be represented as a high-dimensional matrix in which rows represent samples and columns represent genes. Dimensionality reduction techniques aim to find a lower-dimensional representation of this matrix while preserving the essential features that differentiate one sample from another.
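As a linear point of reference for the idea described above, the following is a minimal sketch of principal component analysis (PCA) applied to a samples-by-genes matrix. The matrix sizes, the random data, and the function name pca_reduce are illustrative assumptions, not taken from any study.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X (samples x genes) onto the top principal components."""
    X_centered = X - X.mean(axis=0)           # center each gene (column)
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1000))               # hypothetical: 20 samples, 1000 genes
Z = pca_reduce(X, n_components=5)
print(X.shape, "->", Z.shape)                 # (20, 1000) -> (20, 5)
```

Each sample is now described by 5 numbers instead of 1000. PCA can only capture linear structure, which is the limitation that nonlinear methods such as ANNs try to address.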
Artificial neural networks are models inspired by the structure and function of biological neural networks. An ANN consists of interconnected nodes that perform simple computations and are organized into layers. The input layer receives the data, and the output layer produces the result. Between them there can be one or more hidden layers, which apply nonlinear transformations to their inputs. ANNs are trained by adjusting the weights of the connections between nodes to minimize a loss function. In supervised learning the network learns to map inputs to labeled outputs; the autoencoders discussed below need no labels, because the input itself serves as the target.
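The layered computation described above can be sketched in a few lines of NumPy. The layer sizes, the random weights, and the choice of ReLU for the hidden layers are assumptions made only for illustration.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keeps positive values, zeroes out the rest."""
    return np.maximum(0.0, x)

def forward(x, layers):
    """Pass x through a list of (W, b) layers: ReLU on hidden layers, linear output."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b                          # weighted sum computed by each node
        if i < len(layers) - 1:
            h = relu(h)                        # nonlinearity on hidden layers only
    return h

rng = np.random.default_rng(0)
sizes = [8, 4, 3]                              # hypothetical: 8 inputs, 4 hidden, 3 outputs
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=(5, 8))                    # a batch of 5 input vectors
y = forward(x, layers)
print(y.shape)                                 # (5, 3)
```

Training consists of adjusting every W and b so that the output of forward moves closer to the desired target.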
One of the most popular ANN-based dimensionality reduction techniques is the autoencoder. An autoencoder is an ANN that is trained to reconstruct its input at the output layer. It consists of two main parts: an encoder that maps the input data to a lower-dimensional representation, and a decoder that maps the lower-dimensional representation back to the original input space. During training, the autoencoder is optimized to minimize the difference between its input and its output. Because the representation between encoder and decoder has fewer dimensions than the input, the model can only reconstruct the data well by learning a compressed representation of it.
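The encoder, decoder, and training objective described above can be written compactly as follows. The symbols are notation introduced here for illustration: f is the encoder with parameters theta, g is the decoder with parameters phi, d is the number of genes, k is the size of the reduced representation, and n is the number of samples. Squared error is one common choice for the reconstruction loss, not the only one.

```latex
z = f_\theta(x), \qquad \hat{x} = g_\phi(z), \qquad
x \in \mathbb{R}^{d},\; z \in \mathbb{R}^{k},\; k < d

\mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n}
  \left\lVert x_i - g_\phi\bigl(f_\theta(x_i)\bigr) \right\rVert_2^2
```

After training, only the encoder f is needed: the vector z is the reduced representation of a sample.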
The autoencoder has been used for dimensionality reduction of gene expression data, and the results have been promising. In a study by Xiong et al. (2016), an autoencoder was used to reduce the dimensionality of gene expression data for cancer classification. The autoencoder was able to achieve high classification accuracy while reducing the dimensionality of the data by more than 95%. In another study by Kang et al. (2018), an autoencoder was used to reduce the dimensionality of gene expression data for clustering analysis. The autoencoder was able to identify distinct clusters of samples that corresponded to different cancer subtypes.
Fig: Autoencoder ANN architecture
Case study
In a study by Li et al. (2019), an autoencoder was used to reduce the dimensionality of gene expression data for cancer subtype classification. The study used RNA sequencing data from The Cancer Genome Atlas (TCGA) for six cancer types: breast, lung, ovarian, prostate, kidney, and liver.
The researchers first preprocessed the data by removing genes with low expression levels and normalizing the remaining genes. The resulting dataset consisted of approximately 20,000 genes for each cancer type. To reduce the dimensionality of the data, an autoencoder was trained on each cancer type separately. The autoencoder had three hidden layers, each with 500 nodes, and used the rectified linear unit (ReLU) activation function.
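A preprocessing step of this kind might look like the sketch below. The text does not state the filtering cutoff or the normalization method, so the mean-expression threshold (min_mean), the log2 transform, and the per-gene z-scoring are all assumptions chosen for illustration.

```python
import numpy as np

def preprocess(counts, min_mean=1.0):
    """Drop low-expression genes, then log-transform and z-score the rest.

    counts:   samples x genes array of non-negative expression values.
    min_mean: hypothetical cutoff on the mean expression of a gene.
    """
    keep = counts.mean(axis=0) >= min_mean     # boolean mask over genes
    logged = np.log2(counts[:, keep] + 1.0)    # compress the dynamic range
    mu = logged.mean(axis=0)
    sd = logged.std(axis=0)
    sd[sd == 0] = 1.0                          # avoid dividing by zero for constant genes
    return (logged - mu) / sd, keep

# Tiny made-up example: 3 samples, 3 genes; the first gene is barely expressed.
counts = np.array([[0.0, 10.0, 100.0],
                   [1.0, 20.0, 300.0],
                   [0.0, 30.0, 200.0]])
X, keep = preprocess(counts)
print(keep)                                    # [False  True  True]
print(X.shape)                                 # (3, 2)
```

After this step every retained gene has zero mean and unit variance across samples, which keeps highly expressed genes from dominating the reconstruction loss.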
The autoencoder was trained using the mean squared error (MSE) loss function and the Adam optimizer. The training was stopped when the validation loss stopped decreasing. Once the autoencoder was trained, the encoder part of the network was used to reduce the dimensionality of the gene expression data to 50 dimensions.
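The training procedure described above (MSE loss, Adam, early stopping on validation loss, then encoding) can be sketched in plain NumPy. This is a scaled-down illustration rather than the network from the study: the toy data, the single 16-node ReLU hidden layer, the 3-dimensional code, the learning rate, and the patience of 20 epochs are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 30 "genes" driven by 3 latent factors.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))
X_train, X_val = X[:160], X[160:]

d, hidden, k = X.shape[1], 16, 3               # k = size of the reduced representation
params = {
    "W1": rng.normal(scale=0.1, size=(d, hidden)), "b1": np.zeros(hidden),
    "W2": rng.normal(scale=0.1, size=(hidden, k)), "b2": np.zeros(k),
    "W3": rng.normal(scale=0.1, size=(k, d)), "b3": np.zeros(d),
}

def encode(p, X):
    h_pre = X @ p["W1"] + p["b1"]
    h = np.maximum(0.0, h_pre)                 # ReLU hidden layer
    return h_pre, h, h @ p["W2"] + p["b2"]     # last item is the bottleneck code

def mse(p, X):
    z = encode(p, X)[2]
    return np.mean((z @ p["W3"] + p["b3"] - X) ** 2)

def gradients(p, X):
    h_pre, h, z = encode(p, X)
    x_hat = z @ p["W3"] + p["b3"]              # linear decoder
    d_out = 2.0 * (x_hat - X) / X.size         # derivative of MSE w.r.t. x_hat
    d_z = d_out @ p["W3"].T
    d_h = (d_z @ p["W2"].T) * (h_pre > 0)      # backpropagate through ReLU
    return {"W3": z.T @ d_out, "b3": d_out.sum(axis=0),
            "W2": h.T @ d_z, "b2": d_z.sum(axis=0),
            "W1": X.T @ d_h, "b1": d_h.sum(axis=0)}

# Adam state and hyperparameters (illustrative values).
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
m = {name: np.zeros_like(arr) for name, arr in params.items()}
v = {name: np.zeros_like(arr) for name, arr in params.items()}

initial_val = mse(params, X_val)
best_val, best_params, patience, wait = np.inf, None, 20, 0

for t in range(1, 1001):
    grads = gradients(params, X_train)
    for name in params:                        # one Adam update per parameter
        m[name] = beta1 * m[name] + (1 - beta1) * grads[name]
        v[name] = beta2 * v[name] + (1 - beta2) * grads[name] ** 2
        m_hat = m[name] / (1 - beta1 ** t)     # bias-corrected moment estimates
        v_hat = v[name] / (1 - beta2 ** t)
        params[name] = params[name] - lr * m_hat / (np.sqrt(v_hat) + eps)
    val = mse(params, X_val)
    if val < best_val:                         # validation loss improved
        best_val, wait = val, 0
        best_params = {name: arr.copy() for name, arr in params.items()}
    else:
        wait += 1
        if wait >= patience:                   # early stopping
            break

Z = encode(best_params, X)[2]                  # reduced representation from the encoder
print(X.shape, "->", Z.shape)
print(f"validation MSE: {initial_val:.3f} -> {best_val:.3f}")
```

In practice a deep learning framework would handle the gradient computation and the optimizer; the manual version is shown only to make each step of the procedure visible.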
The reduced-dimensional data was then used for cancer subtype classification with a support vector machine (SVM) classifier. The SVM was trained on 70% of the data and tested on the remaining 30%. Classification performance was measured using the area under the receiver operating characteristic (ROC) curve (AUC).
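The evaluation protocol above (70/30 split, SVM, AUC) can be illustrated with a small NumPy sketch. The text does not specify the SVM kernel or solver, so a linear SVM fitted by subgradient descent on the hinge loss is assumed here, on a two-class problem with synthetic 50-dimensional codes. The function names and all hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM via subgradient descent on the L2-regularized hinge loss.

    y must contain labels in {-1, +1}.
    """
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                   # samples that violate the margin
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
        grad_b = -y[active].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos, neg = scores[y_true == 1], scores[y_true == -1]
    diff = pos[:, None] - neg[None, :]         # all positive/negative score pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n = 200
y = np.where(rng.random(n) < 0.5, 1, -1)       # hypothetical binary subtype labels
Z = rng.normal(size=(n, 50)) + 0.5 * y[:, None]  # synthetic stand-in for encoder output

idx = rng.permutation(n)
n_train = (7 * n) // 10                        # 70% for training, 30% for testing
train, test = idx[:n_train], idx[n_train:]

w, b = train_linear_svm(Z[train], y[train])
auc = roc_auc(y[test], Z[test] @ w + b)        # AUC computed on held-out samples only
print(f"test AUC: {auc:.3f}")
```

AUC is computed from the ranking of the decision scores rather than from thresholded predictions, so it is a different quantity from the fraction of correctly classified samples.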
The results showed that the autoencoder-based approach achieved higher classification accuracy than other dimensionality reduction methods, including principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). The classification accuracy ranged from 80% to 90% across the six cancer types.
This study suggests that an autoencoder-based artificial neural network can be effective for dimensionality reduction of gene expression data. The reduced-dimensional representation allowed for more efficient and accurate cancer subtype classification.