De novo transcriptome assembly and annotation for gene discovery in avocado, macadamia and mango

Despite their commercial and nutritional importance, the genomic information for avocado, macadamia and mango is largely lacking. In our Data Descriptor, we report the generation and validation of transcriptome assemblies from pooled leaf, stem, bud, root, floral and fruit/nut tissue.

By Tinashe G. Chabikwa and François F. Barbier

Professor Christine Beveridge's group at the University of Queensland is part of The Small Tree High Productivity (STHP) Initiative, a Queensland Department of Agriculture and Fisheries (DAF) and Queensland Alliance for Agriculture and Food Innovation (QAAFI) project, co-funded by Horticulture Innovation Australia, to develop high-density, high-productivity orchard systems. This is a diverse research team of plant physiologists, breeders, geneticists, molecular biologists and modelers studying commercially important tropical fruit trees: avocado (Persea americana Mill.), macadamia (Macadamia integrifolia L.) and mango (Mangifera indica L.). Despite their commercial and nutritional importance, only a handful of studies have attempted to decipher the regulation of their growth cycle at the molecular level. Moreover, the genomic information for these tree species is largely lacking; the macadamia and avocado genome sequences were only released in the past 5 years. In order to characterize seasonal, tissue-specific gene expression profiles of avocado, macadamia and mango, we required a reliable reference genome or comprehensive transcriptome assemblies.

We also encountered challenges during sample preparation and processing. Extracting RNA from certain tissues of these so-called "recalcitrant species" for molecular studies is particularly challenging. We therefore had to be creative in developing a procedure for extracting DNA and RNA from diverse tissues of the different species. It took a number of years to establish such a procedure, which we recently published. Another challenge was to get roots from adult trees grown in orchards. Not only was it hard to dig the roots out, but these trees are also infested with snakes and spiders!


We therefore generated comprehensive avocado, macadamia and mango RNA-Seq datasets using normalized cDNA libraries derived from pooled leaf, stem, bud, root, floral and fruit/nut tissue. cDNA library normalization is done to maximize the number of transcripts represented in each library. Using a combination of de novo transcriptome assembly and redundancy reduction, we assembled 63420, 78871 and 82198 'unigenes' of avocado, macadamia and mango, respectively. The essential part of the library preparation process was the conversion of the pooled RNA into normalized cDNA using a duplex-specific nuclease (DSN) normalization protocol to accentuate rare, low-abundance transcripts. This was done to avoid the dilution of transcripts from lowly expressed genes by those from highly expressed genes, and therefore to improve gene discovery. Using the Benchmarking Universal Single-Copy Orthologs (BUSCO) algorithm, we found 70-95 % of complete BUSCOs present in our three de novo transcriptomes, indicating high-quality, near-complete transcriptome assemblies.
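For readers less familiar with BUSCO, the completeness percentage quoted above is simply the share of expected single-copy orthologs that are recovered as complete (single-copy plus duplicated) in an assembly. The short Python sketch below illustrates that arithmetic only. The class and function names are hypothetical, and the counts are made-up illustration values, not results from our assemblies.

```python
from dataclasses import dataclass


@dataclass
class BuscoCounts:
    """Counts of BUSCO groups per category (illustrative container, not part of BUSCO)."""
    single: int      # complete and single-copy
    duplicated: int  # complete and duplicated
    fragmented: int  # only partially recovered
    missing: int     # not found in the assembly

    @property
    def total(self) -> int:
        # Total number of BUSCO groups searched.
        return self.single + self.duplicated + self.fragmented + self.missing

    def percent_complete(self) -> float:
        # "Complete" covers both single-copy and duplicated matches.
        return 100.0 * (self.single + self.duplicated) / self.total


def busco_summary(c: BuscoCounts) -> str:
    """Format the counts in the compact one-line style used in BUSCO summaries."""
    n = c.total

    def pct(k: int) -> float:
        return 100.0 * k / n

    return (
        f"C:{c.percent_complete():.1f}%"
        f"[S:{pct(c.single):.1f}%,D:{pct(c.duplicated):.1f}%],"
        f"F:{pct(c.fragmented):.1f}%,M:{pct(c.missing):.1f}%,n:{n}"
    )


# Made-up counts, for illustration only.
example = BuscoCounts(single=1000, duplicated=200, fragmented=140, missing=100)
print(busco_summary(example))
```

In a transcriptome, some duplicated BUSCOs are expected because several isoforms of one gene can be assembled, which is one reason a redundancy reduction step is useful before assessing completeness.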


The assemblies generated in this study can be used for gene discovery and expression profiling experiments, as well as for ongoing and future mRNA-based genome annotation and marker development applications. For example, considering that avocado and mango are both prone to alternate (biennial) bearing, the identification and subsequent manipulation of genes regulating floral induction may greatly contribute towards solving this problem. We also hope that these datasets will be useful to researchers interested in fruit tree genomics and will foster collaboration in this discipline. In recent years, open-access journals such as Scientific Data have provided an opportunity to publish scientifically valuable datasets, promoting the sharing and reuse of scientific data to speed up the pace of scientific discovery. Our team has generated other large-scale plant genomic and transcriptomic datasets, which are currently being analyzed to provide biological insights into important plant developmental processes and to share with the scientific community. We hope that our data release will also encourage other researchers to do the same.

