A hybrid model for multiclassification based on DNA  
sequence to image conversion  
Ruihao Zhang, Xiao Liu*
Tsinghua Shenzhen International Graduate School, Tsinghua University
Shenzhen, China
zhangrh23@mails.tsinghua.edu.cn, liuxiao@sz.tsinghua.edu.cn
Abstract: DNA sequence classification is biologically important for disease diagnosis and prediction, and traditional DNA sequence classification methods usually operate directly on sequence data. In this study, we address the challenges of DNA sequence classification with a novel approach, the ResVAE model, which transforms DNA sequences into images and then uses a deep learning network for feature extraction and classification. Experimental results show that, compared with traditional methods based on sequence data, the ResVAE model has significant advantages in handling non-equal-length sequences, effectively broadening the application scenarios. In addition, we explore the prospects of applying the ResVAE model to non-equal-length RNA and protein sequences. This approach provides new perspectives and possibilities for DNA sequence classification and is expected to play an important role in future bioinformatics research.
The specific steps of ResVAE are as follows.
A. Character Map  
The characters A, T, G, and C in a DNA sequence are mapped to specific numeric values, converting the sequence data into numeric data. This mapping preserves the base information in the sequence while facilitating subsequent image construction.
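The mapping step can be sketched in a few lines of Python; the paper does not specify the exact numeric codes, so the values assigned to A, T, G, and C below are illustrative assumptions:

```python
# Illustrative character map: the exact numeric codes are an assumption,
# since the paper does not state which values it uses.
BASE_MAP = {"A": 1, "T": 2, "G": 3, "C": 4}

def encode_sequence(seq: str) -> list[int]:
    """Convert a DNA string into its numeric representation, base by base."""
    return [BASE_MAP[base] for base in seq.upper()]

print(encode_sequence("ATGC"))  # [1, 2, 3, 4]
```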
Keywords: DNA sequence classification; ResVAE; feature extraction; image transformation; non-equal-length sequences
I. INTRODUCTION
With the rapid development of bioinformatics, the problem  
of classifying DNA sequence data has become a hot research  
topic. Traditional sequence classification methods are usually  
based on statistical features or structural features of  
sequences[1]. However, these methods face many challenges  
when dealing with large-scale and high-dimensional data. In  
recent years, deep learning has achieved great success in areas  
such as image recognition and speech recognition, providing  
new ideas for DNA sequence classification[2]. The goal of this  
paper is to explore a method for transforming DNA sequences  
into images and then using deep learning for feature extraction  
and classification. Our main contribution is a new sequence-to-image approach, transforming sequences into circular histograms, which expands the application space of deep learning models. In addition, our proposed ResVAE model achieves excellent classification results. With this approach, we not only overcome the difficulties that traditional methods face with non-equal-length sequences, but also provide new perspectives and possibilities for other bioinformatics sequence classification tasks[3].
Fig. 1. Sequence character map.
B. Image Construction  
In addition to directly converting the sequence into a square histogram, we also divide a circle into fan-shaped sectors, assigning each character a sector whose angle is proportional to its share of the sequence. We draw a 200×200 square around the outside of the image, tangent to the circle, to represent the DNA sequence information in a spatial structure. In this way, the spatial structure of DNA sequences is more intuitive and can be used on non-equal-length sequence datasets for better classification and identification.
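A minimal rasterization sketch of the circular histogram, under stated assumptions: the numeric base codes reuse the illustrative A/T/G/C mapping, each base's sector angle is proportional to its frequency, and the circle is inscribed in (tangent to) a 200×200 square; the paper's exact drawing routine may differ.

```python
import math
from collections import Counter
import numpy as np

def circular_histogram(seq: str, size: int = 200) -> np.ndarray:
    """Rasterize a DNA sequence as fan-shaped sectors of a circle
    inscribed in a size x size square. Each base's sector angle is
    proportional to its frequency; pixel values are illustrative
    numeric base codes, with 0 marking background outside the circle."""
    codes = {"A": 1, "T": 2, "G": 3, "C": 4}    # assumed encoding
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ATGC")
    # Cumulative sector boundaries in radians; zero-count bases get
    # zero-width sectors and never claim pixels.
    bounds, acc = [], 0.0
    for b in "ATGC":
        acc += 2 * math.pi * counts[b] / total
        bounds.append((b, acc))
    img = np.zeros((size, size), dtype=np.uint8)
    c = (size - 1) / 2.0                        # circle center
    for y in range(size):
        for x in range(size):
            if math.hypot(x - c, y - c) > c:    # outside inscribed circle
                continue
            theta = math.atan2(y - c, x - c) % (2 * math.pi)
            for b, upper in bounds:
                if theta <= upper:
                    img[y, x] = codes[b]
                    break
            else:                               # guard against round-off
                img[y, x] = codes[bounds[-1][0]]
    return img
```

Coloring the sectors or stacking per-base channels instead of writing integer codes would be an equally plausible realization of the same idea.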
II. METHODS  
In this paper, we propose a method called ResVAE, which converts DNA sequences into images and uses deep learning networks for feature extraction and classification.
Fig. 2. Sequences transformed into histograms: (a) square histogram; (b) circular histogram.
C. Feature Extraction  
Our proposed ResVAE network achieves an AUC of 98.6% and an ACC of 95.8% (Table I), a clear advantage in classification effectiveness. Compared with square-histogram coding, the AUC of our circular-histogram coding under a plain CNN remains competitive (Table II), while the circular coding greatly expands the model's application space and application scenarios.
Feature extraction is performed in parallel, using a Variational Auto-Encoder (VAE) to reconstruct features and a ResNet to extract features. The VAE learns the intrinsic structure and distribution of the data[4], while the ResNet extracts high-level features from the image[5]. Combining the VAE encoder in parallel with a pre-trained ResNet34 (with its fully connected layer removed) exploits the advantages of both methods and improves the effectiveness of feature extraction.
TABLE I. COMPARISONS OF METHODS

Methods         AUC     ACC
CNN             90.1%   87.4%
AlexNet         92.2%   79.0%
VGG16           97.4%   91.0%
VAE             93.5%   96.5%
ResNet34        97.7%   89.6%
ResVAE (ours)   98.6%   95.8%
TABLE II. COMPARISONS OF SEQUENCE TRANSFORMS ON CNN

Sequences Transform   AUC
Square Histogram      90.1%
Circular Histogram    89.0%

Fig. 3. Framework of ResVAE.
Overall, the ResVAE model effectively solves the problem  
of DNA sequence classification through an innovative  
sequence-to-image approach and a hybrid model, providing  
new research methods and perspectives in the field of  
bioinformatics. This is our main contribution and our  
innovation to the field.  
IV. CONCLUSION
In this paper, we propose a novel method to convert DNA  
sequences into images and use deep learning networks for  
feature extraction and classification. Experimental results show that the method has clear advantages over direct sequence coding and effectively broadens the application scenarios to non-equal-length sequences. In addition, the method provides new ideas and solutions for sequence classification problems in the RNA and protein domains.
Future research can further explore the effectiveness of the  
method in other bioinformatics fields and how to better  
optimize the structure and parameter settings of the deep  
learning network to improve the classification performance.  
III. EXPERIMENT  
In order to verify the effectiveness of the method proposed  
in this paper, we conducted experiments on simulated DNA  
sequence datasets. Compared with the traditional statistical  
model-based methods, the method proposed in this paper  
shows superior performance in terms of both AUC and ACC.  
Meanwhile, the method also greatly broadens the application  
scenarios under non-equal-length sequences.  
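For reference, a toy computation of the two evaluation metrics with scikit-learn; whether the paper computes multi-class AUC one-vs-rest is an assumption, and the labels and probabilities below are made up, not the paper's data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 2, 2])          # toy true classes
y_prob = np.array([[0.8, 0.1, 0.1],            # toy predicted probabilities
                   [0.6, 0.3, 0.1],
                   [0.1, 0.7, 0.2],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.2, 0.7],
                   [0.2, 0.1, 0.7]])
y_pred = y_prob.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # one-vs-rest AUC
print(f"ACC={acc:.3f}  AUC={auc:.3f}")  # ACC=1.000  AUC=1.000
```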
A. Dataset  
The dataset contains three files: a training set, a validation set, and a test set. Each file holds DNA sequences and their corresponding category labels. The training-set and validation-set files have two tab-separated columns, containing the category and the sequence, respectively; the test-set file has a single column containing only sequences.
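Reading one of the labeled split files can be sketched as follows; the function name and the choice to keep labels as strings are assumptions:

```python
import csv

def load_labeled_split(path: str) -> tuple[list[str], list[str]]:
    """Read a tab-separated split file with one 'label<TAB>sequence'
    record per line, as in the training and validation files above."""
    labels, seqs = [], []
    with open(path, newline="") as f:
        for label, seq in csv.reader(f, delimiter="\t"):
            labels.append(label)
            seqs.append(seq)
    return labels, seqs
```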
The dataset comprises 10,000 training sequences, 1,000 validation sequences, and 1,000 test sequences, each 200 base pairs in length. These sequences contain different combinatorial patterns that give each class distinctive features, allowing the classes to be distinguished.

REFERENCES
[1] AbdAlhalem S M, El-Rabaie E S M, Soliman N, et al. DNA sequences classification with deep learning: a survey[J]. Menoufia Journal of Electronic Engineering Research, 2021, 30(1): 41-51.
[2] Lo Bosco G, Di Gangi M A. Deep learning architectures for DNA sequence classification[C]//Fuzzy Logic and Soft Computing Applications: 11th International Workshop, WILF 2016, Naples, Italy, December 19-21, 2016, Revised Selected Papers 11. Springer International Publishing, 2017: 162-171.
[3] Sidhom J W, Larman H B, Pardoll D M, et al. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires[J]. Nature Communications, 2021, 12(1): 1605.
[4] Kipf T N, Welling M. Variational graph auto-encoders[J]. arXiv preprint arXiv:1611.07308, 2016.
[5] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
B. Results  
We conducted thorough comparative experiments on this dataset; the results are summarized in Tables I and II.