A hybrid model for multiclassification based on DNA  
sequence to image conversion  
Ruihao Zhang, Xiao Liu*
Tsinghua Shenzhen International Graduate School, Tsinghua University
Shenzhen, China
zhangrh23@mails.tsinghua.edu.cn, liuxiao@sz.tsinghua.edu.cn
Abstract: DNA sequence classification is biologically important for disease diagnosis and prediction, and traditional DNA sequence classification methods usually operate directly on sequence data. In this study, we address the challenges of DNA sequence classification with a novel approach, the ResVAE model, which transforms DNA sequences into images and then uses a deep learning network for feature extraction and classification. Experimental results show that, compared with traditional methods based on sequence data, the ResVAE model has significant advantages in handling non-equal-length sequences, effectively broadening the application scenarios. In addition, we explore the prospects of applying the ResVAE model to non-equal-length RNA and protein sequences. This approach provides new perspectives and possibilities for DNA sequence classification and is expected to play an important role in future bioinformatics research.
The specific steps of ResVAE are as follows.
A. Character Map  
The characters A, T, G, and C in a DNA sequence are mapped to specific numeric values, converting the sequence data into numeric data. This mapping preserves the base information in the sequence while facilitating subsequent image construction.
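The mapping step can be sketched in a few lines of Python; the paper does not specify the exact numeric codes, so the values assigned to A, T, G, and C below are illustrative assumptions:

```python
# Illustrative character map: the exact numeric codes are an assumption,
# since the paper does not state which values it uses.
BASE_MAP = {"A": 1, "T": 2, "G": 3, "C": 4}

def encode_sequence(seq: str) -> list[int]:
    """Convert a DNA string into its numeric representation, base by base."""
    return [BASE_MAP[base] for base in seq.upper()]

print(encode_sequence("ATGC"))  # [1, 2, 3, 4]
```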
Keywords: DNA sequence classification; ResVAE; feature extraction; image transformation; non-equal-length sequences
I. INTRODUCTION
With the rapid development of bioinformatics, the problem  
of classifying DNA sequence data has become a hot research  
topic. Traditional sequence classification methods are usually  
based on statistical features or structural features of  
sequences[1]. However, these methods face many challenges  
when dealing with large-scale and high-dimensional data. In  
recent years, deep learning has achieved great success in areas  
such as image recognition and speech recognition, providing  
new ideas for DNA sequence classification[2]. The goal of this  
paper is to explore a method for transforming DNA sequences  
into images and then using deep learning for feature extraction  
and classification. Our main contribution is a new sequence-to-image approach, transforming sequences into circular histograms, which expands the application space of deep learning models. In addition, our proposed ResVAE model achieves excellent classification results. With this approach, we not only overcome the difficulties that traditional methods face with non-equal-length sequences, but also provide new perspectives and possibilities for other bioinformatics sequence classification tasks[3].
Fig. 1. Sequence character map.
B. Image Construction  
In addition to directly converting the sequence into a square histogram, we also divide a circle into fan-shaped sectors, assigning each character a sector whose angle is proportional to its share of the sequence. We draw a 200×200 square around the outside of the image, tangent to the circle, to represent the DNA sequence information in a spatial structure. In this way, the spatial structure of DNA sequences is more intuitive and can be used on non-equal-length sequence datasets for better classification and identification.
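A minimal rasterization sketch of the circular histogram, under stated assumptions: the numeric base codes reuse the illustrative A/T/G/C mapping, each base's sector angle is proportional to its frequency, and the circle is inscribed in (tangent to) a 200×200 square; the paper's exact drawing routine may differ.

```python
import math
from collections import Counter
import numpy as np

def circular_histogram(seq: str, size: int = 200) -> np.ndarray:
    """Rasterize a DNA sequence as fan-shaped sectors of a circle
    inscribed in a size x size square. Each base's sector angle is
    proportional to its frequency; pixel values are illustrative
    numeric base codes, with 0 marking background outside the circle."""
    codes = {"A": 1, "T": 2, "G": 3, "C": 4}    # assumed encoding
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ATGC")
    # Cumulative sector boundaries in radians; zero-count bases get
    # zero-width sectors and never claim pixels.
    bounds, acc = [], 0.0
    for b in "ATGC":
        acc += 2 * math.pi * counts[b] / total
        bounds.append((b, acc))
    img = np.zeros((size, size), dtype=np.uint8)
    c = (size - 1) / 2.0                        # circle center
    for y in range(size):
        for x in range(size):
            if math.hypot(x - c, y - c) > c:    # outside inscribed circle
                continue
            theta = math.atan2(y - c, x - c) % (2 * math.pi)
            for b, upper in bounds:
                if theta <= upper:
                    img[y, x] = codes[b]
                    break
            else:                               # guard against round-off
                img[y, x] = codes[bounds[-1][0]]
    return img
```

Coloring the sectors or stacking per-base channels instead of writing integer codes would be an equally plausible realization of the same idea.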
II. METHODS  
In this paper, we propose a method called ResVAE, which converts DNA sequences into images and uses deep learning networks for feature extraction and classification.
Fig. 2. Sequences transformed into histograms: (a) square histogram; (b) circular histogram.
C. Feature Extraction  
Our proposed ResVAE network achieves an AUC of 98.6% and an ACC of 95.8% (Table I), a clear advantage in classification effectiveness. Compared with square-histogram coding, the AUC of our circular-histogram coding under a plain CNN remains competitive (Table II), while the circular coding greatly expands the model's application space and application scenarios.
Feature extraction is performed in parallel, using a Variational Auto-Encoder (VAE) to reconstruct features and a ResNet to extract features. The VAE learns the intrinsic structure and distribution of the data[4], while the ResNet extracts high-level features from the image[5]. Combining the VAE encoder in parallel with a pre-trained ResNet34 (with its fully connected layer removed) exploits the advantages of both methods and improves the effectiveness of feature extraction.
TABLE I. COMPARISONS OF METHODS

Methods         AUC     ACC
CNN             90.1%   87.4%
AlexNet         92.2%   79.0%
VGG16           97.4%   91.0%
VAE             93.5%   96.5%
ResNet34        97.7%   89.6%
ResVAE (ours)   98.6%   95.8%
TABLE II. COMPARISONS OF SEQUENCE TRANSFORMS ON CNN

Sequences Transform   AUC
Square Histogram      90.1%
Circular Histogram    89.0%

Fig. 3. Framework of ResVAE.
Overall, the ResVAE model effectively solves the problem  
of DNA sequence classification through an innovative  
sequence-to-image approach and a hybrid model, providing  
new research methods and perspectives in the field of  
bioinformatics. This is our main contribution and our  
innovation to the field.  
IV. CONCLUSION
In this paper, we propose a novel method to convert DNA  
sequences into images and use deep learning networks for  
feature extraction and classification. Experimental results show that the method has clear advantages over direct sequence coding and effectively broadens the application scenarios to non-equal-length sequences. In addition, the method provides new ideas and solutions for sequence classification problems in the RNA and protein domains.
Future research can further explore the effectiveness of the  
method in other bioinformatics fields and how to better  
optimize the structure and parameter settings of the deep  
learning network to improve the classification performance.  
III. EXPERIMENT  
In order to verify the effectiveness of the method proposed  
in this paper, we conducted experiments on simulated DNA  
sequence datasets. Compared with the traditional statistical  
model-based methods, the method proposed in this paper  
shows superior performance in terms of both AUC and ACC.  
Meanwhile, the method also greatly broadens the application  
scenarios under non-equal-length sequences.  
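For reference, a toy computation of the two evaluation metrics with scikit-learn; whether the paper computes multi-class AUC one-vs-rest is an assumption, and the labels and probabilities below are made up, not the paper's data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 2, 2])          # toy true classes
y_prob = np.array([[0.8, 0.1, 0.1],            # toy predicted probabilities
                   [0.6, 0.3, 0.1],
                   [0.1, 0.7, 0.2],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.2, 0.7],
                   [0.2, 0.1, 0.7]])
y_pred = y_prob.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # one-vs-rest AUC
print(f"ACC={acc:.3f}  AUC={auc:.3f}")  # ACC=1.000  AUC=1.000
```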
A. Dataset  
The dataset contains three files: a training set, a validation set, and a test set. Each file holds DNA sequences and their corresponding category labels. The training-set and validation-set files have two tab-separated columns, containing the category and the sequence, respectively; the test-set file has a single column containing only sequences.
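Reading one of the labeled split files can be sketched as follows; the function name and the choice to keep labels as strings are assumptions:

```python
import csv

def load_labeled_split(path: str) -> tuple[list[str], list[str]]:
    """Read a tab-separated split file with one 'label<TAB>sequence'
    record per line, as in the training and validation files above."""
    labels, seqs = [], []
    with open(path, newline="") as f:
        for label, seq in csv.reader(f, delimiter="\t"):
            labels.append(label)
            seqs.append(seq)
    return labels, seqs
```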
The dataset comprises 10,000 training sequences, 1,000 validation sequences, and 1,000 test sequences, each 200 base pairs in length. These sequences contain different combinatorial patterns that give each class distinctive features, allowing the classes to be distinguished.

REFERENCES
[1] AbdAlhalem S M, El-Rabaie E S M, Soliman N, et al. DNA sequences classification with deep learning: a survey[J]. Menoufia Journal of Electronic Engineering Research, 2021, 30(1): 41-51.
[2] Lo Bosco G, Di Gangi M A. Deep learning architectures for DNA sequence classification[C]//Fuzzy Logic and Soft Computing Applications: 11th International Workshop, WILF 2016, Naples, Italy, December 19-21, 2016, Revised Selected Papers 11. Springer International Publishing, 2017: 162-171.
[3] Sidhom J W, Larman H B, Pardoll D M, et al. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires[J]. Nature Communications, 2021, 12(1): 1605.
[4] Kipf T N, Welling M. Variational graph auto-encoders[J]. arXiv preprint arXiv:1611.07308, 2016.
[5] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
B. Results  
We conducted thorough comparative experiments on this dataset; the results are summarized in Tables I and II.