Protein-coding genes can evolve from scratch in previously noncoding regions. These so-called ‘de novo’ genes have been found in many genomes, and their study is one of the most exciting areas in molecular evolution. In fact, there have been several important contributions to the field recently (PubMed “de novo genes”).
There are many possible strategies to identify de novo genes, for example using comparative genomics or analyzing long noncoding RNA datasets. A popular approach is based on phylostratigraphy, which estimates the age of a gene from the distribution of its homologs across a given phylogeny.
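The core idea of phylostratigraphy can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used by any of the studies discussed here: the clade names, the species sets, and the function `assign_phylostratum` are all hypothetical, and a real analysis would obtain the list of species with hits from a similarity search against many genomes.

```python
# Hypothetical nested clades for a mouse gene, ordered from youngest to oldest.
# Clade names and member species are illustrative assumptions only.
LINEAGE = [
    ("Mus musculus", {"mouse"}),
    ("Murinae", {"mouse", "rat"}),
    ("Mammalia", {"mouse", "rat", "human"}),
    ("Vertebrata", {"mouse", "rat", "human", "zebrafish"}),
]


def assign_phylostratum(species_with_hits, focal="mouse", lineage=LINEAGE):
    """Return the youngest clade that contains every species with a detectable homolog.

    That clade corresponds to the last common ancestor of all species in which
    the gene is detected, which phylostratigraphy takes as the age of the gene.
    """
    hits = set(species_with_hits) | {focal}
    for clade, members in lineage:
        # Clades are nested, so the first one covering all hits is the youngest.
        if hits <= members:
            return clade
    raise ValueError("a species with a hit is outside the lineage")


# A gene with no hits outside the focal species is assigned to the youngest stratum.
print(assign_phylostratum([]))
# A gene also detected in zebrafish is assigned to the oldest stratum listed here.
print(assign_phylostratum(["rat", "zebrafish"]))
```

In this sketch, the assigned age depends entirely on which species return a detectable hit, so a gene with no hit outside the focal species would be scored as species-specific.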
Phylostratigraphy is a powerful approach, but it has flaws that can lead to underestimating the age of a gene. This is because many genes evolve so rapidly that they retain no detectable sequence similarity to their orthologs in other species. To find out how often phylostratigraphy gets the age of de novo genes wrong, I reanalyzed data from three studies of de novo genes in mouse, or in mouse and rat. After excluding putative de novo genes that are no longer annotated, I found that the two studies that relied on phylostratigraphy as the only method to detect de novo genes had error rates above 60%. I call these false positives ‘de nono’ genes.
This and other observations are now in a manuscript on bioRxiv.