Comprehensive evaluation of protein-coding sORFs prediction based on a random sequence strategy
Accepted: 07 July 2021 Published: 30 August 2021
Background: Small open reading frames (sORFs) with protein-coding ability present an unprecedented challenge for genome annotation because of their short sequences and low expression levels. In the past decade, only a few prediction methods have been proposed for the discovery of protein-coding sORFs, and the lack of objective and uniform negative datasets has become an important obstacle to sORF prediction. The prediction efficiency of current sORF prediction methods needs to be further evaluated to provide better research strategies for protein-coding sORF discovery. Methods: In this work, nine mainstream existing methods for predicting the protein-coding potential of ORFs are comprehensively evaluated based on a random sequence strategy. Results: The results show that the current methods perform poorly on different sORF datasets. For comparison, a sequence-based prediction algorithm trained on prokaryotic sORFs is proposed, and its better prediction performance indicates that the random sequence strategy can provide feasible ideas for protein-coding sORF prediction. Conclusions: As an important class of functional genomic elements, protein-coding sORFs and their discovery have shed light on the dark proteome. This evaluation indicates that there is an urgent need to develop specialized prediction tools for protein-coding sORFs in both eukaryotes and prokaryotes. It is expected that the present work may provide novel ideas for future sORF research.
Small open reading frames; Small protein; Gene prediction; Genome annotation; Protein-coding gene
Small proteins (shorter than 100 amino acids) encoded by small open reading frames (sORFs) have been ignored in genome annotations during the past decades. Reports of two functional small peptides, myoregulin and DWORF, encoded by transcripts that had been annotated as long noncoding RNAs [1, 2] drew unprecedented attention to sORFs and their encoded proteins [3, 4, 5, 6, 7]. The discovery of protein-coding sORFs also led to a debate on the definition of noncoding RNA and a rethinking of our understanding of the genome [8, 9, 10, 11, 12]. Thus, sORFs have gradually become a research hotspot in the past few years. In fact, sORFs have been seriously underestimated because they were believed to be too short to encode proteins, so that the earlier literature called them evil little fellows (ELFs) [13]. In most cases, traditional methods are not suitable for short sequences [14, 15]; therefore, identifying protein-coding sORFs is a huge challenge for genome research. Recently, the rapid development of versatile omics technologies such as mass spectrometry and ribosome profiling has revealed a large number of protein-coding sORFs with important functions in different genomic regions [12, 16, 17, 18]. Even so, proteogenomic methods are not sensitive enough, and ribosome profiling strategies require additional measures to ensure comprehensive and accurate sORF annotation; hence, there is still a lack of efficient technologies for sORF identification [7, 16, 19, 20]. Furthermore, the resolution of ribosome profiling for bacterial cells is lower than that for eukaryotic cells due to technical challenges. Therefore, most sORF studies focus on a few model eukaryotes such as human, mouse and Arabidopsis thaliana. As a result, only a limited number of protein-coding sORF prediction programs, trained on eukaryotic sORFs, have been developed recently [21, 22, 23, 24, 25]. Among them, sORF finder [21], MiPepid [25] and CPPred-sORF [24] were specially designed for sORFs.
Some coding potential prediction programs for normal ORFs, such as CPPred [22] and DeepCPP [23], were also tested on sORF datasets. These programs provide alternative tools for sORF detection, but their real efficiency needs to be further evaluated. On the other hand, the lack of reliable datasets, particularly negative samples, has become one of the key issues in sORF prediction [5, 14, 16, 19, 24, 25]. The construction of reliable sORF datasets and annotation platforms has been the foremost challenge in the field. Therefore, in this work, we perform a comprehensive evaluation of nine up-to-date ORF coding potential prediction programs that have been discussed in recent sORF-related studies [16, 27], based on a random sequence strategy. It is expected that the present work may provide novel ideas for future sORF research.
3. Materials and methods
3.1 Data sources
Four non-redundant positive datasets (the Hum-7111, Mou-7385, Ara-2125 and Pro-6318 datasets) are constructed in this work. To construct the Hum-7111 and Mou-7385 datasets, 10000 human sORFs and 10000 mouse sORFs were downloaded from the sORFs.org database, respectively, and 2888 Arabidopsis thaliana sORFs were downloaded from the TAIR database to construct the Ara-2125 dataset. To construct the Pro-6318 dataset, sORFs with definite functions were derived from 56 prokaryotic genomes (Supplementary Table 1), whose genomic GC contents range widely from 20% to 70%. In this way, a total of 6578 prokaryotic sORFs were obtained. These candidate sORFs were further filtered as follows:
(i) Excluding the redundant sequences in each dataset using the CD-HIT program [31] with a similarity threshold of 80% at the DNA level;
(ii) Excluding the sORFs longer than 100 aa;
(iii) Excluding the sORFs whose sequence length is not divisible by 3;
(iv) Excluding the sORFs that do not end with a stop codon;
(v) Excluding the sORFs with a stop codon within the sequence;
(vi) Excluding the sORFs that start with a stop codon.
In this way, 7111 human sORFs, 7385 mouse sORFs, 2125 Arabidopsis thaliana sORFs and 6318 prokaryotic sORFs are obtained. These datasets serve as the positive test sets. The four datasets are provided in FASTA format in Supplementary file 1.
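Filtering steps (ii) to (vi) can be sketched in Python as follows. This is only an illustration, not the code used in this work: the function name `passes_filters` is hypothetical, the stop codons are assumed to be those of the standard genetic code (TAA, TAG, TGA), and the 100 aa limit is assumed to be counted without the terminal stop codon. Step (i), the redundancy removal with CD-HIT, is not reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
MAX_AA = 100                         # length cutoff from step (ii)


def passes_filters(seq: str) -> bool:
    """Return True if a candidate sORF survives steps (ii)-(vi)."""
    seq = seq.upper()
    # (iii) the length must be a non-zero multiple of 3
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # a lone stop codon is not an ORF (it would also violate step (vi))
    if len(codons) < 2:
        return False
    # (iv) the last codon must be a stop codon
    if codons[-1] not in STOP_CODONS:
        return False
    # (ii) the encoded peptide (terminal stop excluded) must not exceed 100 aa
    if len(codons) - 1 > MAX_AA:
        return False
    # (v) and (vi) no stop codon at the start or inside the sequence
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return False
    return True
```

Applying such a predicate to every downloaded sequence and keeping only those for which it returns True would correspond to steps (ii) to (vi); the exact order of the steps and the handling of ambiguous bases are not specified in the text and are left open here.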
It is difficult to construct negative sORF datasets. sORFs from intergenic and noncoding regions are usually extracted as negative samples, but protein-coding sORFs may well exist in these regions. Prokaryotes have few noncoding and intergenic regions, so constructing negative samples is particularly challenging for prokaryotic sORF prediction. Negative ORFs generated by random sequence strategies have been used in gene prediction works [32, 33]. In this work, a strict strategy for generating negative sORFs is proposed, with the following steps:
(i) Randomly shuffling each positive sORF sequence to obtain a corresponding negative sequence without any stop codon before the stop codon at the end of the sequence;
(ii) Ensuring that the negative sequence shares the same start codon and stop codon with its original positive sORF and that there is no premature stop codon in the sequence;
(iii) Excluding the redundant sequences according to the abovementioned standard.
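A minimal sketch of steps (i) and (ii) is given below. It rests on several assumptions that the text does not state: shuffling is done at the nucleotide level on the interior of the sORF, shuffles containing an in-frame stop codon are simply rejected and redrawn, the stop codons are TAA, TAG and TGA, and a shuffle identical to the original is discarded. The names `shuffled_negative`, `has_internal_stop` and `max_tries` are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def has_internal_stop(seq: str) -> bool:
    """True if any in-frame codon before the last one is a stop codon."""
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 3, 3))


def shuffled_negative(orf: str, rng: random.Random, max_tries: int = 1000):
    """Shuffle the interior of `orf`, keeping its start and stop codons.

    Returns a shuffled sequence without premature stop codons, or None if
    no acceptable shuffle is found within `max_tries` attempts.
    """
    start, body, stop = orf[:3], list(orf[3:-3]), orf[-3:]
    for _ in range(max_tries):
        rng.shuffle(body)  # in-place nucleotide shuffle of the interior
        candidate = start + "".join(body) + stop
        # assumption: a negative identical to its positive is not useful
        if candidate != orf and not has_internal_stop(candidate):
            return candidate
    return None
```

Because only the interior is permuted, each negative sequence keeps the length, nucleotide composition, start codon and stop codon of its positive counterpart, so the two classes cannot be separated by length or base composition alone. Redundancy removal in step (iii) would be applied afterwards and is not shown.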
Furthermore, an experimentally verified dataset (Eexp-150-53) released by Hemm et al. [18] is also employed as a test set. This dataset includes 150 positive sORFs and 53 negative sORFs detected in the E. coli genome.
3.3 The protein-coding sORF prediction programs
At present, there are only a few sORF prediction programs. Some prediction programs reviewed in recent works were proposed for long ORFs, but several of them have also been applied to sORFs [16, 27]. In total, nine ORF coding potential prediction programs [22, 23, 24, 25, 34, 35, 36, 37, 38] with available source code are evaluated on the test sets constructed above. The operating system and parameters used to run these programs are listed in Table 1. Note that MiPepid and CPPred-sORF were specially proposed for sORFs. Although the abovementioned programs were trained on ORFs (sORFs) derived from different eukaryotic species, some of them were claimed to have cross-species prediction ability. Even so, no uniform standard has been proposed to measure their real efficiency. Therefore, the different sORF prediction programs are evaluated in this work.
3.4 Construction of the prokaryotic sORFs prediction method
Currently, most sORF studies focus on a few model eukaryotes such as human and mouse. To verify the efficiency of the random sequence strategy, we propose an alternative protein-coding sORF prediction algorithm (PsORFs) based on it. This algorithm uses the frequencies of the 64 codons as numerical features, and a random forest is adopted as the classifier. A detailed description of this prediction algorithm is provided in Supplementary file 2. Our earlier studies indicated that some prokaryotic genomes exhibit properties of universal protein-coding genes regardless of their genome sizes and genomic GC contents. The protein-coding genes in these genomes can be used as a training set to accurately predict the protein-coding genes in other prokaryotic genomes. Accordingly, the protein-coding sORFs with known functions derived from five prokaryotic genomes (NC_009089, NC_003103, NC_012962, NC_000913, NC_008380) are adopted as the positive training set, and the negative sORFs in the training set are generated according to the random sequence generation procedure described above. Furthermore, the training set is processed according to the abovementioned filtering steps, and finally 1228 positive sORFs and 1327 negative sORFs are obtained. The training sets are provided in FASTA format in Supplementary file 3.
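To make the feature representation concrete, the following sketch computes a 64-dimensional codon frequency vector for a single sORF. It is an illustration under assumptions: the exact feature definition is given in Supplementary file 2 and may differ, for example in whether the terminal stop codon is counted or how the counts are normalized. The names `CODONS` and `codon_frequencies` are hypothetical.

```python
from itertools import product

# The 64 codons in a fixed (alphabetical) order, so that every sORF is
# mapped to a feature vector with the same layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(orf: str) -> list:
    """Return the 64 in-frame codon frequencies of `orf` as fractions."""
    seq = orf.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[codon] / total for codon in CODONS]
```

Vectors of this kind, computed for the positive and the shuffled negative sORFs, would form the input matrix of the random forest classifier.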
3.5 The evaluation indices
For evaluation purposes, the sensitivity ($S_n$), specificity ($S_p$) and accuracy (ACC) are adopted, i.e.,
$$S_n = \frac{TP}{TP + FN}, \quad S_p = \frac{TN}{TN + FP}, \quad ACC = \frac{TP + TN}{TP + TN + FP + FN}.$$
In addition, the Matthews correlation coefficient (MCC) is also used to describe the agreement between prediction and annotation with a single value in the range of [–1, 1], i.e.,
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP and TN denote the numbers of coding sORFs and non-coding sORFs that have been correctly predicted, respectively, and FN and FP denote the numbers of coding sORFs and non-coding sORFs that have been falsely predicted, respectively. Thus, $S_n$ and $S_p$ correspond to the proportions of coding and non-coding ORFs that have been predicted correctly, respectively.
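The four indices can be computed from TP, TN, FP and FN as in the following minimal sketch, which assumes the standard confusion-matrix definitions of sensitivity, specificity, accuracy and MCC. The function name `evaluate` and the convention of returning 0 when a denominator vanishes are choices made for this illustration only.

```python
import math


def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute sensitivity, specificity, accuracy and MCC from the counts."""
    total = tp + tn + fp + fn
    sn = tp / (tp + fn) if tp + fn else 0.0   # fraction of coding sORFs recovered
    sp = tn / (tn + fp) if tn + fp else 0.0   # fraction of non-coding sORFs recovered
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}
```

For example, with the invented counts TP = 90, TN = 80, FP = 20 and FN = 10, the sensitivity is 0.90, the specificity is 0.80, the accuracy is 0.85 and the MCC is approximately 0.70. A program that labels almost everything as coding reaches a high sensitivity but a low specificity, which the MCC penalizes.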
4. Results and discussions
4.1 Evaluation results of protein-coding sORF prediction
Computational methods can provide quick and convenient tools for sORF prediction. Several of the nine computational algorithms evaluated in this work were tested on sORFs in their original publications [22, 23, 24, 25], and we summarize their reported performances in Supplementary Table 2. The CPPred program achieved its best performance, with ACC 0.8788 and MCC 0.7650, on its integrated test set. Its improved version, CPPred-sORF, achieved its best performance, with ACC 0.8849 and MCC 0.7680, on an integrated eukaryotic sORF dataset. MiPepid achieved a prediction accuracy of 0.9576 on an integrated dataset and 0.96 on a human sORF dataset, but it only predicts sORFs that start with ATG. DeepCPP is a deep neural network-based method for RNA coding potential prediction, and its reported prediction accuracy and MCC on a human sORF test set are 0.8858 and 0.7740, respectively. The sORF finder program seems to achieve the lowest performance. The lack of reliable negative samples is one of the key challenges for sORF prediction. As two programs specially designed for protein-coding sORF prediction, MiPepid and CPPred-sORF defined ORFs derived from miRNA and lncRNA as negative samples in their training and test datasets, respectively. In Fig. 1, we analyze the length distribution of the datasets used by MiPepid and CPPred-sORF. There is an apparent length bias between the positive and negative datasets in both programs: more than 90% of the negative samples from MiPepid are shorter than 20 aa, while more than 80% of the negative samples from CPPred-sORF are longer than 100 aa. This means that negative samples can be discriminated from positive samples based on length alone. However, sORFs occur throughout the genome, so length should not differ systematically between positive and negative samples.
Therefore, to better develop sORF prediction methods, it is necessary to evaluate these programs objectively on third-party datasets.
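The effect of such a length bias can be illustrated with a classifier that looks at nothing but sequence length. The sketch below is an illustration only and is not part of any evaluated program: the function name `best_length_threshold` is hypothetical, and it simply searches for the single length cutoff that best separates two lists of lengths (in aa).

```python
def best_length_threshold(pos_lengths, neg_lengths):
    """Find the length cutoff that best separates two samples.

    Tries every observed length t and both rule directions
    ("negative if length < t" and "negative if length >= t") and
    returns (accuracy, t, direction) for the best rule.
    """
    total = len(pos_lengths) + len(neg_lengths)
    best = (0.0, None, None)
    for t in sorted(set(pos_lengths) | set(neg_lengths)):
        neg_below = sum(1 for n in neg_lengths if n < t)
        pos_at_or_above = sum(1 for p in pos_lengths if p >= t)
        # accuracy of the rule "negative if length < t"
        acc_short_neg = (neg_below + pos_at_or_above) / total
        # the opposite rule gets exactly the complementary accuracy
        for acc, direction in ((acc_short_neg, "negative if shorter"),
                               (1.0 - acc_short_neg, "negative if longer")):
            if acc > best[0]:
                best = (acc, t, direction)
    return best
```

If the accuracy returned for a pair of positive and negative datasets is close to 1, a model trained on them can score well without learning any coding signal. Negative samples obtained by shuffling the positive sequences have the same lengths as the positives, so the best length-only rule stays near 0.5.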
The original prediction results of each program are provided in Supplementary Tables 3.1–3.5, where coding and noncoding denote the positive or negative sORFs that are predicted as coding or noncoding sORFs, and unknown denotes the sORFs that cannot be predicted (among these programs, MiPepid can only predict sORFs that start with ATG, so sORFs with other start codons cannot be predicted). Table 2 gives the prediction efficiency of the different programs on the four random sequence-based test datasets (Hum-7111, Ara-2125, Mou-7385 and Pro-6318) and the experimentally verified dataset Eexp-150-53. The results show that the prediction performances of the programs are consistent across the two data types. For comparison, the lower sensitivity and specificity values are marked in italics, and the highest ACC and MCC values are marked in bold. According to sensitivity and specificity, these programs can be divided into two groups. The first group includes CPC2, CPPred, DeepCPP, CPAT, CNCI, PLEK and LGC; these programs are inefficient for positive sORFs, and most positive sORFs are falsely classified as negative samples. The second group includes CPPred-sORF and MiPepid; both are specially designed for sORFs, but the results show that they fail to identify the negative samples. Note that the input of most programs evaluated above is a DNA sequence, while some of them were developed for RNA transcripts and their input should be an RNA sequence. Therefore, in Supplementary Tables 4.1–4.5, we also provide the prediction results obtained with RNA sequences as input. The results show that the prediction efficiencies are much worse than those obtained with DNA sequences. In fact, in traditional gene prediction programs, ORFs shorter than 303 bp are usually excluded to reduce false positives. The results in Table 2 further confirm that such short ORFs are difficult to predict reliably.
The poor prediction efficiencies indicate that there is still huge room for improvement in prediction of protein-coding sORFs.
* The indices of MiPepid are evaluated on the sORFs that start with ATG. The lower sensitivity and specificity values are marked in italics, and the highest ACC and MCC values are marked in bold.
4.2 Prediction results of the prokaryotic sORF prediction method
Protein-coding sORFs occur widely in diverse species and can be of high functional importance. However, no single identification method developed to date is sufficient to identify all sORFs; hence, sORF detection requires a multidisciplinary strategy [16]. The evaluation results of CPPred-sORF and MiPepid indicate that protein-coding sORF prediction is still in its infancy. There are few noncoding regions in prokaryotic genomes, so it is very difficult to construct prokaryotic negative sORF datasets. We therefore propose the PsORFs model based on the random sequence strategy. A random forest is employed as the core algorithm to train the PsORFs model. The number of bags was set to 200 according to the evaluation results during K-fold cross validation. Five-fold cross validation was used to evaluate the model performance; the resulting accuracy and MCC (with the threshold set to 0.5) are 0.8925 and 0.7852, respectively. For comparison with the other programs, PsORFs is evaluated on the five independent test datasets, and its prediction results are also provided in Table 2. The prediction efficiency of PsORFs is better than that of the other methods on every test dataset. Although PsORFs is trained on prokaryotic sORFs, its prediction efficiency on eukaryotic sORFs is superior to that of the other programs, which indicates that the random sequence strategy can provide robust data sources for sORF prediction. The source code of the PsORFs algorithm in Matlab format can be downloaded from http://126.96.36.199:8888/.
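The five-fold cross validation and the 0.5 decision threshold can be sketched as follows. This is a generic illustration in Python rather than the released Matlab code: the function names `k_fold_indices` and `call_coding` are hypothetical, the folds are assumed to be drawn by a simple unstratified random partition, and a score equal to the threshold is assumed to count as coding.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def call_coding(scores, threshold: float = 0.5):
    """Convert classifier scores into coding (1) / noncoding (0) calls."""
    return [1 if score >= threshold else 0 for score in scores]
```

In each round, a classifier would be trained on the samples in `train_idx`, the held-out samples in `test_idx` would be scored and thresholded, and ACC and MCC would be computed over the pooled held-out calls. Every sample is held out exactly once.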
5. Conclusions
The important roles of protein-coding sORFs in biological activities have been confirmed by a large number of studies in recent years. As an important class of functional genomic elements, protein-coding sORFs and their discovery have shed light on the dark proteome [39]. In this work, we evaluated different types of prediction programs, and the results show that this evaluation can provide an important theoretical basis and novel ideas for sORF discovery.
6. Author contributions
JY and CX designed the experiments; JY, LG and XD performed the experiments; XD, BQ, WJ and CW wrote the code; JY, CX, WJ and BQ developed the prediction method; LG, JW and CW analyzed the data; JY, LG, JW and CX wrote the manuscript.
7. Ethics approval and consent to participate
8. Acknowledgment
Thanks to all the peer reviewers for their opinions and suggestions.
9. Funding
This work was supported by the National Natural Science Foundation of China (61771093, 62011530044 and 61671107) and the Youth Science and Technology Innovation Plan of Universities in Shandong (2019KJE007).
10. Conflict of interest
The authors declare no conflict of interest.
Supplementary material associated with this article can be found, in the online version, at https://www.fbscience.com/Landmark/articles/10.52586/4943.
[1] Anderson D, Anderson K, Chang C, Makarewich C, Nelson B, McAnally J, et al. A Micropeptide Encoded by a Putative Long Noncoding RNA Regulates Muscle Performance. Cell. 2015; 160: 595–606.
[2] Nelson BR, Makarewich CA, Anderson DM, Winders BR, Troupes CD, Wu F, et al. A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science. 2016; 351: 271–275.
[3] Jackson R, Kroehling L, Khitun A, Bailis W, Jarret A, York AG, et al. The translation of non-canonical open reading frames controls mucosal immunity. Nature. 2018; 564: 434–438.
[4] Sberro H, Fremin BJ, Zlitni S, Edfors F, Greenfield N, Snyder MP, et al. Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes. Cell. 2019; 178: 1245–1259.e14.
[5] Martinez TF, Chu Q, Donaldson C, Tan D, Shokhirev MN, Saghatelian A. Accurate annotation of human protein-coding small open reading frames. Nature Chemical Biology. 2020; 16: 458–468.
[6] Petruschke H, Schori C, Canzler S, Riesbeck S, Poehlein A, Daniel R, et al. Discovery of novel community-relevant small proteins in a simplified human intestinal microbiome. Microbiome. 2021; 9: 55.
[7] Delcourt V, Staskevicius A, Salzet M, Fournier I, Roucou X. Small Proteins Encoded by Unannotated ORFs are Rising Stars of the Proteome, Confirming Shortcomings in Genome Annotations and Current Vision of an mRNA. Proteomics. 2018; 18: e170058.
[8] Guttman M, Russell P, Ingolia NT, Weissman JS, Lander ES. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell. 2013; 154: 240–251.
[9] Schmitz JF, Bornberg-Bauer E. Fact or fiction: updates on how protein-coding genes might emerge de novo from previously non-coding DNA. F1000Research. 2019; 6: 57.
[10] Devkota S. Big data and tiny proteins: shining a light on the dark corners of the gut microbiome. Nature Reviews Gastroenterology & Hepatology. 2020; 17: 68–69.
[11] Brunet MA, Leblanc S, Roucou X. Reconsidering proteomic diversity with functional investigation of small ORFs and alternative ORFs. Experimental Cell Research. 2020; 393: 112057.
[12] Ruiz-Orera J, Albà MM. Conserved regions in long non-coding RNAs contain abundant translation and protein–RNA interaction signatures. NAR Genomics and Bioinformatics. 2019; 1: e2.
[13] Lawrence J. When ELFs are ORFs, but don’t act like them. Trends in Genetics. 2003; 19: 131–132.
[14] Cheng H, Chan WS, Li Z, Wang D, Liu S, Zhou Y. Small open reading frames: current prediction techniques and future prospect. Current Protein & Peptide Science. 2011; 12: 503–507.
[15] Wang B, Hao J, Pan N, Wang Z, Chen Y, Wan C. Identification and analysis of small proteins and short open reading frame encoded peptides in Hep3B cell. Journal of Proteomics. 2021; 230: 103965.
[16] Peeters MKR, Menschaert G. The hunt for sORFs: a multidisciplinary strategy. Experimental Cell Research. 2020; 391: 111923.
[17] VanOrsdel CE, Kelly JP, Burke BN, Lein CD, Oufiero CE, Sanchez JF, et al. Identifying New Small Proteins in Escherichia coli. Proteomics. 2018; 18: e1700064.
[18] Hemm MR, Weaver J, Storz G. Escherichia coli small proteome. EcoSal Plus. 2020; 9: 10.1128/ecosalplus.ESP-0031-2019.
[19] Yin X, Jing Y, Xu H. Mining for missed sORF-encoded peptides. Expert Review of Proteomics. 2019; 16: 257–266.
[20] Xu P, Zhang Y, He C. Advances in small protein identification. SCIENTIA SINICA Vitae. 2018; 48: 278–286.
[21] Hanada K, Akiyama K, Sakurai T, Toyoda T, Shinozaki K, Shiu S. sORF finder: a program package to identify small open reading frames with high coding potential. Bioinformatics. 2010; 26: 399–400.
[22] Tong X, Liu S. CPPred: coding potential prediction based on the global description of RNA sequence. Nucleic Acids Research. 2019; 47: e43.
[23] Zhang Y, Jia C, Fullwood MJ, Kwoh CK. DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction. Briefings in Bioinformatics. 2020; 22: 2073–2084.
[24] Tong X, Hong X, Xie J, Liu S. CPPred-sORF: Coding Potential Prediction of sORF based on non-AUG. bioRxiv. 2020. (in press)
[25] Zhu M, Gribskov M. MiPepid: MicroPeptide identification tool using machine learning. BMC Bioinformatics. 2019; 20: 559.
[26] Couso J, Patraquim P. Classification and function of small open reading frames. Nature Reviews Molecular Cell Biology. 2017; 18: 575–589.
[27] Schlesinger D, Elsässer SJ. Revisiting sORFs: overcoming challenges to identify and characterize functional microproteins. FEBS J. 2021. (in press)
[28] Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O’Neill K, et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Research. 2017; 46: D851–D860.
[29] Olexiouk V, Menschaert G. Using the sORFs.Org Database. Current Protocols in Bioinformatics. 2019; 65: e68.
[30] Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, et al. The arabidopsis information resource: Making and mining the “gold standard” annotated reference plant genome. Genesis. 2015; 53: 474–485.
[31] Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010; 26: 680–682.
[32] Yu J, Xiao K, Jiang D, Guo J, Wang J, Sun X. An integrative method for identifying the over-annotated protein-coding genes in microbial genomes. DNA Research. 2011; 18: 435–449.
[33] Guo F, Ou H, Zhang C. ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Research. 2003; 31: 1780–1789.
[34] Kang Y, Yang D, Kong L, Hou M, Meng Y, Wei L, et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Research. 2017; 45: W12–W16.
[35] Wang L, Park HJ, Dasari S, Wang S, Kocher J, Li W. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Research. 2013; 41: e74.
[36] Sun L, Luo H, Bu D, Zhao G, Yu K, Zhang C, et al. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research. 2013; 41: e166.
[37] Li A, Zhang J, Zhou Z. PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinformatics. 2014; 15: 311.
[38] Wang G, Yin H, Li B, Yu C, Wang F, Xu X, et al. Characterization and identification of long non-coding RNAs based on feature relationship. Bioinformatics. 2019; 35: 2949–2956.
[39] Orr MW, Mao Y, Storz G, Qian S. Alternative ORFs and small ORFs: shedding light on the dark proteome. Nucleic Acids Research. 2019; 48: 1029–1042.