C. Feature Extraction
Feature extraction is performed in parallel, using a Variational Auto-Encoder (VAE) to reconstruct features and a ResNet to extract features. The VAE learns the intrinsic structure and distribution of the data [4], while the ResNet extracts high-level features from the image [5]. Combining the VAE encoder in parallel with a pre-trained ResNet34 whose final FC layer is removed exploits the advantages of both methods and improves the effectiveness of feature extraction.
Fig. 3. Framework of ResVAE.
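To make the parallel design concrete, the following PyTorch sketch implements such a two-branch extractor. It is our own illustration rather than the paper's code: the class name, the VAE encoder layout, and the latent size of 128 are assumptions, while the ResNet34 branch follows the description above (pre-trained weights, final FC layer removed, 512-dimensional output).

import torch
import torch.nn as nn
from torchvision import models

class ParallelFeatureExtractor(nn.Module):
    """VAE encoder and pre-trained ResNet34 (final FC removed) in parallel.

    Illustrative sketch only; layer sizes are assumptions, not the paper's
    reported configuration.
    """
    def __init__(self, latent_dim=128):
        super().__init__()
        # VAE encoder branch: small conv stack with mean/log-variance heads.
        self.vae_conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64, latent_dim)
        self.fc_logvar = nn.Linear(64, latent_dim)
        # ResNet34 branch: pre-trained, final FC layer replaced by identity,
        # so it outputs the 512-dimensional pooled feature vector.
        resnet = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
        resnet.fc = nn.Identity()
        self.resnet = resnet

    def forward(self, x):
        h = self.vae_conv(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick keeps the VAE branch trainable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        r = self.resnet(x)
        return torch.cat([z, r], dim=1)  # fused (latent_dim + 512)-d feature

features = ParallelFeatureExtractor()(torch.randn(4, 3, 224, 224))  # (4, 640)

Concatenation keeps the two branches independent, so the pre-trained ResNet features are preserved while the VAE branch adapts to the data distribution; a classification head (e.g., a small MLP) can then be attached to the fused vector.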
III. EXPERIMENT

To verify the effectiveness of the method proposed in this paper, we conducted experiments on a simulated DNA sequence dataset. Compared with traditional statistical model-based methods, the proposed method shows superior performance in terms of both AUC and ACC, while also greatly broadening the application scenarios to non-equal-length sequences.
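The coding step is what removes the equal-length requirement. This excerpt does not spell out the transform that Table II calls a circular histogram, so the Python sketch below is only our hypothetical reading of such a coding: k-mer frequencies drawn as a polar bar chart. The function name, the choice of k = 3, and the 224x224 canvas are all assumptions; what the sketch does illustrate faithfully is that a histogram image has a fixed size regardless of sequence length.

import numpy as np
import matplotlib.pyplot as plt
from itertools import product

def sequence_to_circular_histogram(seq, k=3, out_path="seq.png"):
    """Hypothetical circular coding: a polar k-mer histogram as an image.

    Illustrative only; the paper's exact transform may differ. The output
    image size is fixed regardless of len(seq), so sequences of unequal
    length all map to same-sized CNN inputs.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:                 # skips k-mers with N or other symbols
            counts[index[kmer]] += 1.0
    counts /= max(counts.sum(), 1.0)      # normalize to frequencies
    angles = np.linspace(0.0, 2.0 * np.pi, len(kmers), endpoint=False)

    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)   # 224x224 pixel canvas
    ax = fig.add_subplot(projection="polar")
    ax.bar(angles, counts, width=2.0 * np.pi / len(kmers))
    ax.set_axis_off()
    fig.savefig(out_path)
    plt.close(fig)

sequence_to_circular_histogram("ACGTACGTGGCATTACG" * 5)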
A. Dataset
The dataset consists of three files: a training set, a validation set, and a test set. The training and validation files are formatted as two tab-separated columns holding the category label and the DNA sequence, respectively. The test file has a single column containing only sequences.
The dataset comprises 10,000 training sequences, 1,000 validation sequences, and 1,000 test sequences, each 200 base pairs in length. The sequences contain different combinatorial patterns that give each class distinctive features, allowing the classes to be distinguished.
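For reference, the two-column tab-separated format can be read with a few lines of Python; the file names below are placeholders, since the paper does not give them.

import csv

def load_labeled_split(path):
    """Read one split file: column 1 is the category, column 2 the sequence."""
    labels, seqs = [], []
    with open(path, newline="") as f:
        for label, seq in csv.reader(f, delimiter="\t"):
            labels.append(label)
            seqs.append(seq)
    return labels, seqs

# File names are placeholders; the paper does not specify them.
train_labels, train_seqs = load_labeled_split("train.tsv")
val_labels, val_seqs = load_labeled_split("valid.tsv")

# The test file is a single column of sequences with no labels.
with open("test.tsv") as f:
    test_seqs = [line.strip() for line in f if line.strip()]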
B. Results
We conducted extensive comparative experiments on this dataset. Based on square graph coding, the proposed ResVAE network achieves an AUC of 98.6% and an ACC of 95.8%, a clear advantage in overall classification performance (Table I). Compared with square graph coding, the AUC of our proposed circular coding remains competitive when a CNN is used (89.0% vs. 90.1%, Table II), while greatly expanding the model's application space and application scenarios, since the coding does not require equal-length input sequences.

TABLE I. COMPARISON OF METHODS

Methods         AUC     ACC
CNN             90.1%   87.4%
AlexNet         92.2%   79.0%
VGG16           97.4%   91.0%
VAE             93.5%   96.5%
ResNet34        97.7%   89.6%
ResVAE (ours)   98.6%   95.8%

TABLE II. COMPARISON OF SEQUENCE TRANSFORMS ON CNN

Sequence Transform    AUC
Square Histogram      90.1%
Circular Histogram    89.0%

Overall, the ResVAE model effectively addresses DNA sequence classification through an innovative sequence-to-image approach and a hybrid model, providing a new research method and perspective for bioinformatics; this is our main contribution and innovation.

CONCLUSION

In this paper, we propose a novel method that converts DNA sequences into images and uses deep learning networks for feature extraction and classification. Experimental results show that, relative to the original coding, the method offers significant advantages and effectively broadens the application scenarios to non-equal-length sequences. In addition, the method provides new ideas and solutions for sequence classification problems in the RNA and protein domains. Future research can further explore the effectiveness of the method in other bioinformatics fields, and how to better optimize the structure and parameter settings of the deep learning network to improve classification performance.

REFERENCES

[1] Abd-Alhalem S M, El-Rabaie E S M, Soliman N, et al. DNA sequences classification with deep learning: a survey[J]. Menoufia Journal of Electronic Engineering Research, 2021, 30(1): 41-51.
[2] Lo Bosco G, Di Gangi M A. Deep learning architectures for DNA sequence classification[C]//Fuzzy Logic and Soft Computing Applications: 11th International Workshop, WILF 2016, Naples, Italy, December 19-21, 2016, Revised Selected Papers 11. Springer International Publishing, 2017: 162-171.
[3] Sidhom J W, Larman H B, Pardoll D M, et al. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires[J]. Nature Communications, 2021, 12(1): 1605.
[4] Kipf T N, Welling M. Variational graph auto-encoders[J]. arXiv preprint arXiv:1611.07308, 2016.
[5] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.