We are happy to share the new Giardia intestinalis WB genome, assembled into near-complete chromosomes and published as a Data Descriptor in Scientific Data (https://rdcu.be/b1br5).
The G. intestinalis WB genome published in 2007 has been widely used in the Giardia community as well as in comparative genomics. However, the old genome is fragmented, with 137 internal gaps. When PacBio long-read sequencing technology became mature, we decided to re-sequence the WB genome. We were happy with the initial HGAP3 assembly, which generated long contigs that mapped well to the optical maps. When we stitched contigs together based on the optical maps, overlapping contigs had stretches of sequence that aligned well, confirming that they were the right neighbouring contigs. It was straightforward to obtain the five near-complete chromosomes by combining long contigs from the PacBio assembly with the optical maps. However, it took time to annotate the new genome.
We transferred the old annotation, combined it with de novo structural predictions, and manually examined inconsistent cases. The old genome contains many wrongly assigned ORFs, which were carefully cleaned out, and start codons were adjusted based on RNA-Seq support as well as the start codon motif. We were careful not only about the structural annotation but also about the functional annotation. We combined BLAST and domain searches with pathway predictions and experimental evidence available from GiardiaDB to update the functional annotation. All genes that were annotated differently, either structurally or functionally, were double-checked to make sure the new version is an improvement over the old one.
During the manual annotation, we also realised that some functional genes were fragmented in the new genome due to frameshifts caused by PacBio sequencing errors. We therefore obtained Illumina reads to correct base errors, and updated 50 SNPs affecting 20 genes and 85 indels (mostly at homopolymers) affecting 38 genes. I would recommend getting complementary Illumina DNA reads from the start to improve the base-calling quality, which also makes SNP estimation more reliable.
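To illustrate why single-base indels at homopolymers matter, here is a minimal Python sketch. It is not the pipeline we used for the correction. The function names, the minimum run length of 4 and the toy sequence are all assumptions made for the example. It measures the homopolymer run around a position and checks whether an indel of a given length would shift the reading frame.

```python
def homopolymer_run_length(seq: str, pos: int) -> int:
    """Return the length of the run of identical bases containing 0-based index pos."""
    base = seq[pos]
    start = pos
    # Extend the run to the left while the base is unchanged.
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    # Extend the run to the right while the base is unchanged.
    while end + 1 < len(seq) and seq[end + 1] == base:
        end += 1
    return end - start + 1


def is_in_homopolymer(seq: str, pos: int, min_run: int = 4) -> bool:
    """True if pos lies in a run of at least min_run identical bases.

    min_run=4 is an arbitrary threshold chosen for this illustration.
    """
    return homopolymer_run_length(seq, pos) >= min_run


def causes_frameshift(indel_length: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


# Toy sequence (made up): a run of five A's starts at index 4.
toy = "ATGCAAAAAGTT"
print(homopolymer_run_length(toy, 6))  # run length around index 6
print(is_in_homopolymer(toy, 6))       # inside the A-run
print(causes_frameshift(1))            # a 1-bp indel breaks the frame
```

A 1-bp miscount in such a run is enough to split one ORF into two apparent fragments, which is how the errors showed up during manual annotation.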
We believe the new G. intestinalis WB genome is not only more complete sequence-wise, but also better annotated both structurally and functionally. We hope the Giardia community will appreciate such a high-quality genome and start using it as the new Giardia reference genome.
As sequencing technology advances, more and more re-sequencing projects aim to improve fragmented reference genomes. So I would like to share some thoughts based on my experience with the Giardia WB re-sequencing project.
First, combine long and short reads to achieve both long contigs and high base quality, and do that early on during the project to save time.
Second, decide early on how the genome is going to be submitted to NCBI: as a new submission or as an update of the old genome? It took us months to decide after we were done with the annotation. We wanted to keep the same gene IDs (locus_tag) as the old genome, since that's what researchers have been using for over a decade now. So we opted for an update in the end to be able to use the same locus_tag prefix GL50803_, and made sure the synteny holds between the two genomes. However, we had to add 00 after the locus_tag prefix anyway to distinguish the gene IDs of the new genome from those of the old one. I still don’t know if this is the best way to update a genome. It might have been better if we had chosen to submit it as a new genome, with a suffix added to the locus_tag prefix instead, like GL50803x_.
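For readers who need to move between the two sets of IDs, the renaming can be sketched in a few lines of Python. This is only an illustration under assumptions: it assumes the numeric part of each locus_tag is otherwise unchanged, that old numbers have no leading zeros, and the example ID GL50803_12345 is made up. The function names are hypothetical, and an official mapping should be preferred where one is available.

```python
import re

# Assumed pattern for a locus_tag: the GL50803 prefix, an underscore, then digits.
TAG = re.compile(r"^(GL50803)_(\d+)$")


def to_new_locus_tag(old_tag: str, infix: str = "00") -> str:
    """Insert the infix right after the locus_tag prefix (assumed scheme)."""
    match = TAG.match(old_tag)
    if match is None:
        raise ValueError(f"unexpected locus_tag format: {old_tag!r}")
    prefix, number = match.groups()
    return f"{prefix}_{infix}{number}"


def to_old_locus_tag(new_tag: str, infix: str = "00") -> str:
    """Reverse the mapping by removing the infix after the prefix."""
    match = TAG.match(new_tag)
    if match is None or not match.group(2).startswith(infix):
        raise ValueError(f"not a new-style locus_tag: {new_tag!r}")
    prefix, number = match.groups()
    return f"{prefix}_{number[len(infix):]}"


# Hypothetical ID, used only to show the round trip.
new_id = to_new_locus_tag("GL50803_12345")
print(new_id)                    # GL50803_0012345
print(to_old_locus_tag(new_id))  # GL50803_12345
```

The reverse mapping is only unambiguous because of the assumption that old numbers never start with a zero. Genes that were newly added, merged or removed in the new annotation will not have a counterpart under this simple rule.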
Third, think about publication. Traditionally, a research article requires novel and interesting biological findings. A re-sequencing project improves the quality of the reference genome, but often does not reveal enough new findings for a traditional research article. Luckily, there are nowadays quite a few journals where you can submit a genome report about the new genome. I found it very straightforward to write about the new Giardia WB genome in the Data Descriptor format, and would highly recommend a Data Descriptor in Scientific Data for future genome updates.