Genome-wide comparative analysis of transposable elements in Palmae genomes

National Centre for Bioinformatics, King Abdulaziz City for Science and Technology, 11442 Riyadh, Saudi Arabia, Centre for Biodiversity Genomics, University of Guelph, Guelph, ON N1G 2W1, Canada, National Life Science and Environment Research Institute, King Abdulaziz City for Science and Technology, 11442 Riyadh, Saudi Arabia, National Center for Agricultural Technology, King Abdulaziz City for Science and Technology, 11442 Riyadh, Saudi Arabia


Abstract
Background: Transposable elements (TEs) are the largest component of the genetic material of most eukaryotes and can play roles in shaping genome architecture and regulating phenotypic variation; thus, understanding genome evolution is only possible if we comprehend the contributions of TEs. However, the quantitative and qualitative contributions of TEs can vary, even between closely related lineages. For palm species, in particular, the dynamics of the process through which TEs have differently shaped their genomes remain poorly understood because of a lack of comparative studies. Materials and methods: We conducted a genome-wide comparative analysis of palm TEs, focusing on identifying and classifying TEs using the draft assemblies of four palm species: Phoenix dactylifera, Cocos nucifera, Calamus simplicifolius, and Elaeis oleifera. Our TE library was generated using both de novo structure-based and homology-based methodologies. Results: The generated libraries revealed the TE component of each assembly, which varied from 41-81%. Class I retrotransposons covered 36-75% of these species' draft genome sequences and primarily consisted of LTR retroelements, while non-LTR elements covered about 0.56-2.31% of each assembly, mainly as LINEs. The least represented were Class II DNA transposons, comprising 1.87-3.37%. Conclusion: The current study contributes to a detailed identification and characterization of transposable elements in Palmae draft genome assemblies.

Introduction
Eukaryotic genomes are known to be densely populated with different types of repetitive elements, including tandem repeats [1] and transposable elements (TEs) [2]. TEs were first characterized in plant genomes over 65 years ago by B. McClintock, who discovered genes that move from one chromosome to another and, in so doing, affect the phenotype of the host organism [3,4]. Thousands or even tens of thousands of TE families exist in plants [5]. They have conquered thousands of different families in the plant kingdom [6], making up anywhere from 14% of a plant's genome (as in Arabidopsis thaliana [7]) to over 80% (as in maize [8,9]). Plants are thus the front line for investigating the impact of TEs on genome structure and gene expression. Notably, TEs can generate genetic diversity upon which selection can act, and this can be leveraged for various purposes in plant breeding programs. Recent insertions of TE families have proven to be particularly helpful in better understanding the evolutionary mechanisms involved in species differentiation [10]. TEs are classified into two major categories based on the mechanism of transposition [11]. Both classes consist of assorted subdivisions, orders, and superfamilies, as described in [12]. Class I LTR retrotransposons (LTR-RTs) represent by far the majority of TEs harbored in plant genomes [13] and are primarily composed of two superfamilies, Ty1/Copia and Ty3/Gypsy [12], which are differentiated based on the order of their coding domains and evolutionary divergence [14]. Class I TEs replicate via a copy-and-paste mechanism involving an RNA intermediate, whereby TE mRNA is translated into its associated proteins, including a reverse transcriptase that converts the intermediate into DNA, which is then re-inserted into the genome to generate a new copy. Other retrotransposon lineages include long and short interspersed elements (LINEs, SINEs) and the less common Penelope elements [15].
Class II TEs, or "cut-and-paste" elements, mobilize themselves using an element-encoded transposase that mediates excision and transposition of the parent element from one position to another. Terminal inverted repeat (TIR) elements are the most common subclass of so-called cut-and-paste DNA transposons [16]. Other Class II elements common in plant genomes are Helitrons, which are generally less abundant than cut-and-paste TIR transposons and use a rolling circle form of replication [17].
TEs increase their copy number within a host genome through transposition, while the host often represses their activity through epigenetic mechanisms such as RNA and chromatin-mediated silencing [18]. Once integrated into a host genome, each element is subject to mutation and to a wide array of rearrangements including internal deletions, truncations, and nested insertions. Environmental stresses (cold, heat, UV light, pathogen attack, etc.), including tissue culture stress, can cause reactivation of a variable fraction of the TE population; such reactivation is thought to contribute to the host's short-term response to changing environmental conditions [19,20]. Tissue culture processes specifically are well-known triggers of LTR-retrotransposon remobilization [21], as has been demonstrated in plants such as rice, tobacco, and barley [22-24]. Such investigations have confirmed that TEs can contribute to somaclonal variation and promote the emergence of altered phenotypes [25].
The major members of the palm family (Arecaceae or Palmae) are considered among the tallest domesticated trees and the longest-lived monocotyledonous species [26]. Palm trees are often used as landscape plants; they are also of considerable economic importance, widely cultivated in arid and semi-arid regions from North Africa through the Middle East and the Indus Valley. Among cultivated palms, the largest plantation area (17 million hectares) is given over to oil palms in the genus Elaeis, producing 50 million tons of palm oil annually. This genus comprises two species, Elaeis guineensis and Elaeis oleifera, which are responsible for about 33% of vegetable oil and 35% of edible oil produced worldwide [27]. The earliest recorded cultivation of the date palm Phoenix dactylifera occurred in 3700 BCE in the area between the Euphrates and the Nile River [28]. About 5000 date palm cultivars exist around the world [29], and they are an essential species in drought- and salinity-affected regions, particularly Saudi Arabia, which grows >10% of the world's date palm trees (14% of date production) with a representation of nearly 340 varieties [30]. This study also analyzes other palm species, including the coconut (Cocos nucifera) and the rattan (Calamus simplicifolius), that are critical ecological and socioeconomic resources for many countries, having vital roles in food security, lumber, the ornamental market, and industrial materials [31]. Characterization of the genomic variation among Phoenix dactylifera (date palm), Cocos nucifera (coconut), Calamus simplicifolius (rattan), and Elaeis oleifera (oil palm) will provide insights into the evolutionary pattern of divergence within the palm family, at least structurally and at the level of the genome sequence. Early investigations [32] reported high similarity between coconut, oil palm, and date palm in terms of segmental duplications.
In the present work, genome-wide annotation of TEs was conducted in the aforementioned species using their publicly available genome assemblies. This process involved combining several approaches for the identification and annotation of TEs based on structural features, inherent repetitiveness (de novo), and similarity to elements within existing reference libraries (homology-based) [33]. We additionally built a TE reference database to characterize the compositions of palm genome assemblies and compare their respective TE populations. The comprehensive detection and annotation of TEs is still an open topic in the area of bioinformatics [34], and this analysis provides insights into TE annotation, especially in complex genomes like those of plants.

Identification of transposable elements
A combination of multiple approaches was employed to identify TEs in the four palm draft genome sequences: (i) signature-based identification of TEs, (ii) de novo identification of TEs, and (iii) similarity-based identification [33]. A flowchart describing our overall approach is given in Fig. 1.
In signature-based identification, candidate LTR-RTs were identified by LTRharvest [37] from the GenomeTools v1.6.1 software [38], which searched the input sequences for direct repeats (LTRs) separated by at least 1000 bp and flanked by apparent target site duplications (TSDs). Default settings were employed with the following exceptions: -motif tgca -motifmis 1 -minlenltr 100 -maxlenltr 3500 -mintsd 2. The program LTRdigest [39] was applied to recognize coding regions and primer binding sites within the predicted LTR-RTs; this tool annotated protein-coding domains in the sequence bracketed by each putative element's LTRs, specifically using HMMER3 [40] to identify homologs to a set of TE-related pHMMs from the Pfam [41] and GyDB databases [42]. Finally, the EMBOSS (v6.6.0) [43] utility getorf was used to annotate additional ORFs that did not overlap with LTRdigest predictions, considering only those longer than 100 amino acids.
For detecting non-LTR-RTs, we first masked the genome sequence to avoid hits with reverse transcriptase domains already identified. Next, the getorf tool from EMBOSS v6.4.0.0 [43] was employed to extract ORFs from the masked genome sequence. A minimum ORF size of 500 bp was used to accommodate the APE domain (97% of inspected non-LTR elements have sizes between 600 and 800 bp). Finally, we applied MGEScan-nonLTR (v4.0) [44] with default parameters. This program is a generalized hidden Markov model (GHMM) [44] that uses three states to represent two protein domains and the inter-domain linker regions encoded in non-LTRs, the scores for which are evaluated by pHMMs (for protein domains) and Gaussian Bayes classifiers (for linker regions).
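The size filter applied at the ORF extraction step can be illustrated with a minimal Python sketch. It is not the getorf implementation: it simply reports stretches that are free of stop codons in all six reading frames and keeps those of at least 500 bp, the threshold stated above. The function names and the treatment of an ORF as a stop-free region are assumptions made for illustration.

```python
# Illustrative sketch only; the study itself used EMBOSS getorf.
STOPS = {"TAA", "TAG", "TGA"}
COMP = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMP)[::-1]


def stop_free_regions(seq, min_nt=500):
    """Yield (strand, frame, start, end, region) for stop-free stretches.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    being scanned (for "-" they refer to the reverse-complemented string).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            region_start = frame
            last = frame
            for i in range(frame, len(s) - 2, 3):
                last = i + 3
                if s[i:i + 3] in STOPS:
                    if i - region_start >= min_nt:
                        yield strand, frame, region_start, i, s[region_start:i]
                    region_start = i + 3
            # Region running to the end of the sequence without a stop codon
            if last - region_start >= min_nt:
                yield strand, frame, region_start, last, s[region_start:last]
```

Calling `list(stop_free_regions(masked_scaffold))` on a masked scaffold string would return only candidate regions long enough to be passed on to a downstream classifier.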
Putative Class II transposons can be divided into two subclasses: (1) terminal inverted repeat (TIR) elements, which are flanked by TIRs of various lengths and produce TSDs of various lengths upon successful integration into the genome sequence, and (2) non-TIR transposons such as Helitrons, which replicate via a rolling-circle mechanism and do not produce TSDs upon integration. Candidates in these subclasses were respectively identified using MiteFinderII [45] and HelitronScanner [46], both executed with default parameters.

Classification and superfamily assignment
The generated candidate LTR-RTs were interrogated for their inclusion in one of the three recognized superfamilies: Ty1/Copia, Ty3/Gypsy, and Bel/Pao. The evidence consisted of matches to hidden Markov models (HMMs) and BLAST results against the Viridiplantae LTR-RT database (retrieved from Repbase and Dfam), respectively obtained via nhmmer (-incE 1 × 10^-5, -E 10) and tblastx (-evalue 1 × 10^-5). Only the best hits were kept. Each superfamily was then clustered using the "80-80-80" sequence similarity rule suggested by [12]: two elements belonged to the same family if they were at least 80 bp long and shared at least 80% sequence identity in at least 80% of their coding or internal domain, within their terminal repeat regions, or both. All LTR-RT families that met this definition according to [12] were considered.
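As a worked illustration of the 80-80-80 rule, the sketch below tests a single pairwise comparison against the three thresholds. It is a simplified reading of the rule, not the clustering procedure used in this study; the `PairwiseHit` container and the choice of the shorter compared region as the coverage denominator are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PairwiseHit:
    """Summary of one alignment between two elements (hypothetical container)."""
    aligned_len: int   # aligned length in bp
    identity: float    # percent identity over the aligned region
    shorter_len: int   # length in bp of the shorter compared region (assumed denominator)


def same_family(hit, min_len=80, min_identity=80.0, min_coverage=80.0):
    """Return True if the comparison satisfies all three 80-80-80 thresholds."""
    coverage = 100.0 * hit.aligned_len / hit.shorter_len
    return (
        hit.aligned_len >= min_len
        and hit.identity >= min_identity
        and coverage >= min_coverage
    )
```

For example, a 900 bp alignment at 85% identity covering a 1000 bp internal domain passes, whereas a 700 bp alignment at 95% identity over the same domain fails on coverage.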
To exclude false-positive hits from our putative Class II elements, hits were queried with BlastN (-evalue 1 × 10^-5) against a merged database retrieved from Repbase (Class II: Viridiplantae) and P-MITE (Arecaceae MITEs) [47]. Hits were classified to the superfamily level based on the highest score match, and elements without homologs were discarded as false positives.
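The best-hit assignment described above can be sketched as follows, assuming the BLAST results are available in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12). The `superfamily_of` lookup from database entry to superfamily is a hypothetical helper, and ranking by bit score is an assumption about what "highest score" means here.

```python
def best_hits(blast_lines):
    """Map each query ID to its (subject ID, bit score) with the highest bit score."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def classify_candidates(candidates, blast_lines, superfamily_of):
    """Assign a superfamily to each candidate; candidates without hits are dropped."""
    best = best_hits(blast_lines)
    return {c: superfamily_of[best[c][0]] for c in candidates if c in best}
```

Candidates absent from the returned dictionary correspond to the elements without homologs that would be discarded as false positives.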
Libraries generated as described above were first filtered for duplicates using SeqKit rmdup on the basis of sequence (-s) [48]. To further classify LTR retrotransposons into clades below the superfamily level, and to classify Class II elements, the TEsorter hidden Markov model (HMM) profile-based classifier was used with default settings [49], taking as reference the protein domains found in the REXdb Viridiplantae version of the database [50]. Complete LTR elements were identified based on the presence and order of conserved domains, including capsid (GAG), aspartic protease (AP), integrase (INT), reverse transcriptase (RT), and RNase H (RH) as described in [12]. TEsorter was also used to filter the library of consensus sequences prior to genome sequence annotation, primarily by detecting chimeras or nested elements composed of drastically different types of protein-coding sequences (e.g., transposase and non-LTR reverse transcriptase).
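The "presence and order" criterion for complete LTR elements can be made concrete with a short sketch. It does not reproduce TEsorter; it only compares the order of annotated domains along an element with the canonical orders for the two main superfamilies (Ty1/Copia: GAG-AP-INT-RT-RH; Ty3/Gypsy: GAG-AP-RT-RH-INT). The input format, a list of (domain name, start coordinate) pairs on the coding strand, is an assumption.

```python
# Canonical domain orders for the two main LTR-RT superfamilies.
DOMAIN_ORDERS = {
    "Ty1/Copia": ["GAG", "AP", "INT", "RT", "RH"],
    "Ty3/Gypsy": ["GAG", "AP", "RT", "RH", "INT"],
}


def classify_by_domain_order(domains):
    """Return (superfamily, is_complete) for a list of (name, start) tuples.

    An element is called complete only if all five domains are present
    exactly once and in one of the canonical orders.
    """
    ordered = [name for name, _ in sorted(domains, key=lambda d: d[1])]
    for superfamily, expected in DOMAIN_ORDERS.items():
        if ordered == expected:
            return superfamily, True
    return None, False
```

An element missing any domain, or carrying them in another order, is returned as `(None, False)` and would need to be handled as incomplete or unclassified.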
LINE elements were further scrutinized: RT coding regions identified by TEsorter that were longer than 200 aa were extracted, and these fragments were aligned to a reference alignment from Kapitonov et al. [51]. Multiple members of LINE superfamilies could not be verified and thus were classified as unknown LINEs.

Annotation and estimation of genome sequence coverage
Reference TE sequences from palm species (Palmae) were extracted from Dfam (20170127) [52] and RepBase (20181026) [53] using the script 'queryRepeatDatabase.pl' supplied with RepeatMasker. After generation, the libraries were merged and used as input to mask and annotate the assembled genomes. This masking employed (iii) similarity-based identification of TEs via RepeatMasker v4.1.0 [54], with RMBlast as the search algorithm, Smith-Waterman for alignment, and -cutoff 225. We applied high-sensitivity/low-speed search conditions to avoid spurious results: -s, -no_is, -lib, -norna, and exclusion of low-complexity regions (-nolow); other parameters were default. Additionally, we counted the copy number of classified elements and determined genome sequence coverage from the RepeatMasker output files (.out), using the "One code to find them all" script [55] to estimate the fraction of the genome occupied by each TE family.
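To illustrate how coverage figures can be derived from a RepeatMasker .out file, the sketch below sums the annotated bases per repeat class/family and expresses them as a percentage of the assembly size. It is a deliberately simplified stand-in for the "One code to find them all" workflow: overlapping or nested annotations are counted twice and fragments are not reassembled. The column positions follow the standard .out layout (query begin and end in columns 6 and 7, class/family in column 11); the function name is hypothetical.

```python
from collections import defaultdict


def coverage_by_class(out_lines, assembly_bp):
    """Percent of the assembly annotated per repeat class/family.

    Simplification: overlapping annotations are summed independently,
    so totals can overestimate true coverage.
    """
    masked_bp = defaultdict(int)
    for line in out_lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue  # header, blank, or malformed line
        begin, end = int(fields[5]), int(fields[6])
        masked_bp[fields[10]] += end - begin + 1
    return {cls: 100.0 * bp / assembly_bp for cls, bp in masked_bp.items()}
```

The result is a dictionary keyed by labels such as `LTR/Copia` or `DNA/hAT`, which can then be aggregated to class level by splitting the key on `/`.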
Finally, the unmasked portion of each genome sequence was scanned using (ii) de novo methodology for TE detection. Namely, RepeatModeler2 [56] was used with default parameters to identify any unclassified TEs missed by structure-based identification approaches. The results were merged into the reference library for filtering and final re-annotation.

Phylogenetic analysis
The consensus sequences classified as belonging to Ty1/Copia superfamily elements and containing all five protein-coding domains characteristic of LTR elements (GAG, AP, INT, RT, RH) were selected for phylogenetic analysis. To select for elements more likely to have been recently active, amino acid sequences translated from RT coding regions were screened for length (>200 amino acids) and the absence of stop codons and ambiguous positions. Sequences were aligned using MUSCLE [57] to a reference alignment of representative RT sequences from Ty1/Copia elements (Sto-4 for Ikeros, Tork4 for Tork, Oryco1-1 for Ivana, SIRE1-4 for SIRE, and Fourf for TAR) obtained from the Gypsy Database [42]. To minimize the effect of information loss on tree construction, sequences were only included if they extended to within ten amino acids of either end of the reference element alignment. A maximum-likelihood tree was built with the IQ-TREE server, using mutation model estimation and default settings [58,59]. The resultant tree was visualized using the iTOL web server [60].
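The screening of RT sequences described above amounts to three simple checks, sketched here in Python. The encoding of stop codons as `*` and of ambiguous residues as `X` is an assumption about how the translated sequences are represented, and the function names are hypothetical.

```python
def passes_rt_filter(aa_seq, min_len=200):
    """True if the sequence is longer than min_len and has no stops or ambiguities."""
    seq = aa_seq.upper()
    return len(seq) > min_len and "*" not in seq and "X" not in seq


def filter_rt_sequences(records, min_len=200):
    """Keep only the {name: sequence} entries that pass the RT screen."""
    return {name: seq for name, seq in records.items()
            if passes_rt_filter(seq, min_len)}
```

Sequences that pass this screen would still need the separate alignment-end check (within ten amino acids of the reference alignment ends) before tree building.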

Assessing completeness of the genome assemblies
To assess the completeness of each of the four genome assemblies, we adopted the Benchmarking Universal Single-Copy Orthologs (BUSCO) plant lineage dataset, which consists of 1440 single-copy orthologs for the Embryophyta lineage. Among surveyed genome drafts, C. nucifera had the highest BUSCO score (Fig. 2A), with 1335 complete BUSCOs (92.71%); another 2.40% of sequences were fragmented, and 4.93% were considered missing (71 BUSCOs). The BUSCO scores of the C. nucifera, P. dactylifera, and C. simplicifolius assemblies were comparable and higher than those of the E. oleifera assembly.

Construction of a palm repeat library
A reference TE library was created by applying a combination of structure-based and homology-based approaches to 335 P. dactylifera, 1473 C. nucifera, 1481 C. simplicifolius, and 777 E. oleifera scaffold sequences. After identifying and filtering elements, we recorded 3526, 3563, 4542, and 2874 consensus sequences for the respective species that were classified as TEs according to the Repbase and Dfam reference libraries. Taken together, this library of TE candidates encompasses both Class I (LTR-RTs, non-LTR retrotransposons) and Class II elements (TIR elements, Helitrons, and MITEs), which are provided in Supplementary files 1-4.
The contributions of TEs to the P. dactylifera, C. nucifera, C. simplicifolius, and E. oleifera assemblies were assessed by utilizing a similarity-based approach (RepeatMasker) to mask the assembled genomes with the generated TE libraries (Fig. 2B). One caveat, in this case and for other genome sequences, is that assembly size can often differ considerably from the genome size as measured using cytological methods [61]. These palm assemblies range from 70.5% (in P. dactylifera) to 95.65% (C. simplicifolius) of the estimated genome size, indicating that our values may be underestimates of the repetitive content of each genome [35,62].
The repetitive elements detected in P. dactylifera, C. nucifera, C. simplicifolius and E. oleifera were classified into five main categories: (1) LTR-RTs identified

Class I
The LTR-RT detection process, which identified elements consisting of two relatively intact LTRs and flanking TSDs, returned 11120, 94186, 116725, and 35452 raw hits for P. dactylifera, C. nucifera, C. simplicifolius, and E. oleifera, respectively. These candidates accounted for [...]. The candidate LTR-RTs were classified into seven superfamilies according to the Wicker classification system [12] as represented in the RepBase and Dfam databases (Table 2). In all four studied assemblies, the most abundant LTR-RT superfamilies were Ty1/Copia and Ty3/Gypsy, respectively accounting for 574-1298 and 31-736 consensus sequences. We then evaluated the distribution of predicted protein-coding sequences within LTR-RTs in order to gain insight into their possible associations. The total proteins identified in each assembly and their breakdowns by domain are summarized in Table 3. Most putative LTR-RTs featured the gag-integrase-reverse transcriptase protein domain order characteristic of Ty1/Copia elements. Ty1/Copia and Ty3/Gypsy consensus sequences were further classified into lineages using TEsorter. Within those groups, consensus sequences identified as complete (containing hits to each of the characteristic LTR proteins mentioned previously) were respectively classified into nine and six lineages. The most represented lineages in Ty1/Copia were Angela (2.24%-23%) and SIRE (0.74%-10.5%) (Table 4), while amongst the Ty3/Gypsy consensus sequences, Retand elements (0.85%-4.68%) had the highest coverage in each assembly (Table 5).
To build a phylogenetic tree of complete Ty1/Copia consensus sequences, we filtered their RT sequences for length, stop codons, and ambiguous regions. We selected the longest contiguous RT regions from consensus sequences categorized as complete by TEsorter, thereby choosing representative lineages of LTR elements that are likely to be more recently active. This yielded 179 sequences for tree construction: 24 from P. dactylifera, 19 from C. nucifera, 136 from C. simplicifolius, and none from E. oleifera. Collectively, these represented 93 SIRE, 52 Ivana, 14 Angela, 8 Tork, 5 TAR, 5 Ikeros, and 2 Ale elements; those that classified into particular groups clustered in well-supported clades with their respective reference elements (SIRE1-4 for SIRE, Oryco1-1 for Ivana, Tork4 for Tork, Fourf for TAR, and Sto-4 for Ikeros). The one exception was the Angela and Ikeros complex, which is known to be paraphyletic [50]. Reference elements are denoted with dotted lines on the tree (Fig. 3). C. simplicifolius sequences in general dominated the tree, most notably those of SIRE and Ivana; these groups featured several low-divergence clades, suggestive of recent activity (Fig. 3). In contrast, recently active SIRE clades are interspersed with more divergent lineages, some composed of elements from the other genomes, suggesting that SIRE, in general, has maintained more activity over evolutionary timescales. Of complete Ty1/Copia consensus sequences in C. nucifera, the bulk were Angela elements (see Table 4), but only a small number of these consensus sequences were represented on the tree, with several having low divergence.
Non-LTRs were identified by applying MGEScan to the LTR-masked genome sequences. This tool discovers all known full-length elements and simultaneously classifies them into the following clades: CR1, I, Jockey, L1, R1, R2, and RTE. Previous studies have classified non-LTR retrotransposons into 11 clades based on the reverse transcriptase phylogeny [63]. The non-LTR retrotransposons we distinguished in palm species are summarized in Table 6. Seven superfamilies were represented in P. dactylifera, four in C. nucifera, seven in C. simplicifolius, and one in E. oleifera. R2 was the only superfamily present in all studied species, while the I superfamily was by far the most abundant when it was present, with 14 occurrences in E. oleifera. These full-length elements covered 205,245 bp, 13,932 bp, 241,692 bp, and 8,925 bp of the respective genome sequences; the smallest counts of ORF-conserving elements were identified in E. oleifera.

Class II
In investigating Class II TEs, we first identified ubiquitous miniature inverted-repeat elements (MITEs), characterized by essential structural features such as TIRs and TSDs, AT-rich sequences, and a lack of transposase coding capacity. Canonical MITE sequences with TIRs, TSDs, and perfect or near-perfect structure (inverted repeats with some mismatches) feature a TIR pair (≥10 bp in length) and a TSD pair (2-10 bp) and have a length between 50 and 800 bp; these elements were detected using MiteFinderII, which reported a total of five superfamilies [...]
Table 6. Counts of ORF-conserving non-LTR retrotransposons identified in the four palm assemblies.
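The structural criteria for a canonical MITE stated in this section (TIR pair of at least 10 bp, TSD pair of 2-10 bp, element length of 50-800 bp) can be expressed as a small check. The sketch below is not the MiteFinderII algorithm: it accepts only perfect TIRs, whereas the text allows some mismatches, and it assumes upper-case A/C/G/T input with the flanking sequences supplied separately. All function names are hypothetical.

```python
COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(COMP)[::-1]


def tir_length(element):
    """Length of the perfect terminal inverted repeat (0 if none)."""
    n = 0
    for left, right in zip(element, revcomp(element)):
        if left != right:
            break
        n += 1
    return min(n, len(element) // 2)


def tsd_length(left_flank, right_flank, min_tsd=2, max_tsd=10):
    """Longest direct repeat (2-10 bp) immediately flanking the element, else 0."""
    for k in range(min(max_tsd, len(left_flank), len(right_flank)), min_tsd - 1, -1):
        if left_flank[-k:] == right_flank[:k]:
            return k
    return 0


def is_canonical_mite(left_flank, element, right_flank):
    """Apply the length, TIR, and TSD thresholds given in the text."""
    return (
        50 <= len(element) <= 800
        and tir_length(element) >= 10
        and tsd_length(left_flank, right_flank) > 0
    )
```

Relaxing `tir_length` to tolerate a bounded number of mismatches would bring the check closer to the near-perfect structures mentioned above.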
In the P. dactylifera draft assembly, we identified a total of 303 MITE elements, which accounted for 99,906 bp all told; only 116 elements showed significant homology to database entries (RepBase, P-MITE), and these belonged to six different superfamilies. In C. nucifera, we identified 189 MITE elements, which accounted for 37,550 bp of the assembly; of these, 187 elements were collectively associated with seven different superfamily definitions. In [...] Finally, we identified Helitron-like sequences using the exhaustive structure-based approach of HelitronScanner, which predicts putative Helitrons by scanning for conserved structural traits: a 5' end with TC, a 3' end with CTAG, and a GC-rich hairpin loop 2-10 nt in front of the CTAG end. This method predicted 51, 133, 131, and 69 elements in P. dactylifera, C. nucifera, [...]

RepeatModeler
After masking the four assemblies, we employed RepeatModeler2 to discover TEs not detected by previous methods, such as TIR elements, then merged those results with the otherwise-predicted libraries into a master library.
An overview of elements detected by RepeatModeler only, namely the number of families representing each superfamily in each assembly, is given in Table 8. The results reveal that retrotransposons, especially LTR-RT, dominate the masked genome sequences of these four palm species.
Overall, the studied palm draft genome assemblies contain different proportions and numbers of DNA-TIR and LTR elements relative to their respective genome sequence sizes. In absolute terms, for each of P. dactylifera, C. nucifera, C. simplicifolius, and E. oleifera,

Discussion
We generated repeat libraries for each of the four palm species with available genome sequences to investigate the abundance and characteristics of repeat-derived DNA within this family. This study also facilitates the repeat-masking of DNA and provides a first step towards constructing a comprehensive palm TE catalogue. Our analysis techniques were very conservative, which may have led to an underestimation of ancient and divergent elements; such elements may have been detected as unclassified. To ensure the reliability of our results, we employed a method incorporating both known TEs and signature-based repeat identification tools.
After merging all predicted repeats and performing validation and redundancy removal, we obtained libraries containing 3526, 3563, 4542, and 2874 consensus sequences, respectively, for P. dactylifera, C. nucifera, C. simplicifolius, and E. oleifera. We then merged them into a composite master reference library and re-annotated the four genome assemblies. Doing so revealed repetitive elements as comprising a total of 229.91 Mb (41.42%) in P. dactylifera, 1714.54 Mb (81.55%) in C. nucifera, 1314.23 [...] Table 1. All told, the examined draft genomes were similar in terms of overall repetitive content (Fig. 2B), in that retroelements dominated the assemblies. A strong predominance of retroelements over DNA transposons is a common feature of plant genomes [64]. Class I elements constituted 36-75% of the annotated assemblies in the present study, with LTRs comprising 34-75%. This result was expected; the larger the plant genome, the greater the chance it contains many retroelements. For example, retroelements comprise 80-85% of the barley and maize genomes, which are >3 Gb in size [65,66], but only 17% of the rice genome, which is less than 1 Gb [67]. Among the LTR retrotransposons in this study, we found all four palm genome sequences to feature comparable diversity of the Ty3/Gypsy and Ty1/Copia families, with Ty1/Copia elements being more abundant than Ty3/Gypsy. This result is consistent with a previous study [27], which revealed Ty1/Copia elements to be more abundant than Ty3/Gypsy in the oil palm. Ty1/Copia elements were also the first elements detected in palm genomes via hybridization [68,69]. We also shed some light on the composition of LTR elements below the superfamily level for the C. simplicifolius and C. nucifera assemblies for the first time.
In our phylogenetic investigation of Ty1/Copia elements likely to be recently active, C. simplicifolius dominated the tree with 136 sequences. Notably, although Ivana consensus sequences with full domains made up a small percentage of the assembly, they comprised a high percentage of elements on the tree, with at least two low-divergence clades suggesting recent transposition events. Similarly, C. simplicifolius also contained several low-divergence clades of SIRE elements, although these were interspersed with more divergent lineages. In C. nucifera, unlike the other assemblies, the bulk of complete consensus sequences were from the Angela group, comprising about 23% of the assembly. On the phylogenetic tree, Angela elements presented a low-divergence clade for C. nucifera specifically but otherwise do not seem to have many potentially active families based on consensus sequences, as defined by our filtering metric. Within C. nucifera, recent LTR activity has been reported as dominated by Ty1/Copia in the last 2 million years, with fewer and fewer elements showing evidence of activity when approaching the present [32]. In the P. dactylifera draft genome, we detected and classified TAR/Fourf, Oryco/Ivana, and SIRE elements, supporting the work of Nouroz and Mukaramin [70]; we also identified four other Ty1/Copia groups. Despite these elements contributing less to the tree overall, the tree contains representatives of nearly all the Ty1/Copia groups detected in the P. dactylifera assembly, including the only two Ale consensus sequences. C. simplicifolius and C. nucifera are the most complete and largest assemblies of the four palm species analyzed, which probably contributes to the bias towards their complete consensus sequences on the tree.
Very few non-LTR retrotransposons have been reported in plants; such elements appear more abundant in animal genomes [12,71]. For example, SINEs may comprise more than 15% of primate genomes but only account for 1% or less of plant genomes in general. In the present study, we found LINEs to make up 0.54-2.30% of total repetitive elements and SINEs to be only negligibly observed, representing 0.1-0.7% of each assembly, which is in line with previous reports [30,72]. Mao et al. [73] suggested that the forces underlying rapid changes of plant genomes may be responsible, at least in part, for the removal of old SINEs from the host genome.
We also found other classes of repetitive elements, such as Class II TEs, to be poorly represented in palm genome sequences, collectively making up 1.87-3.37% of the four annotated assemblies. The most prevalent DNA transposon superfamilies were the hAT, Mutator, and Helitron elements, which were likewise the most abundant in previous studies [74]. In particular, members of the hAT superfamily are found in many monocots, such as the Ac-Ds family in maize [75]. Unlike other DNA transposons, Helitrons are challenging to identify because they require structure-based detection methods rather than homology. In the publication detailing HelitronScanner [46], Xiong et al. reanalyzed the genome sequences of 26 plant species and reported Helitron abundance to cover at most 2-6%, the highest percentage being in maize. In the present study, we observed Helitrons to comprise about 0.73-2.47% of each assembly, with the highest coverage being found in C. nucifera (2.47%), followed by C. simplicifolius (2.12%), and the lowest in P. dactylifera (0.73%).

Conclusions
The findings of this study will provide a valuable resource for further research into palm biology and genomics. While the investigated genome sequences were similar in terms of the content and distribution of the identified repetitive elements, differences were also observed that might be associated with factors such as different evolutionary origins or discrepancies in the assembly stages of these draft genomes. Additional research into repetitive elements in palm genome sequences, perhaps with more complete genome assemblies, would provide more information on and awareness of the genomic features of these economically important plants. Furthermore, the causes and consequences of the high degree of inter-genome variability in the distribution, amount, and relative proportion of TEs are still not wholly understood; it is essential to continue characterizing this critical fraction of eukaryotic genomes. Such characterizations can bring to light evolutionary phenomena, including genomic rearrangements and other dynamic events, that have occurred in the past and may also be underway in contemporary times.

Author contributions
MMM and FHA conceived and designed the experiments; MAI, TAE, BMA, SNA, MSA and MMM carried out the experiments; MAI, SNA, TAE, BMA, FHA and MMM analyzed the data; MAI, SNA, TAE, and MMM wrote the manuscript. All authors reviewed the manuscript.

Ethics approval and consent to participate
Not applicable.

Conflict of interest
The authors declare no conflict of interest.

Availability of data and materials
All data generated or analysed during this study are included in this published article (and its supplementary information files).