Systematic Entomology (2021), DOI: 10.1111/syen.12508
Phylogenomic analyses clarify the pattern of evolution of Adephaga (Coleoptera) and highlight phylogenetic artefacts due to model misspecification and excessive data trimming
ALEXANDROS VASILIKOPOULOS 1, MICHAEL BALKE 2, SANDRA KUKOWKA 1, JAMES M. PFLUG 3, SEBASTIAN MARTIN 1, KAREN MEUSEMANN 4, LARS HENDRICH 2, CHRISTOPH MAYER 1, DAVID R. MADDISON 3, OLIVER NIEHUIS 4, ROLF G. BEUTEL 5 and BERNHARD MISOF 6
1 Centre for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, Bonn, Germany, 2 Department of Entomology, SNSB-Bavarian State Collections of Zoology, Munich, Germany, 3 Department of Integrative Biology, Oregon State University, Corvallis, OR, U.S.A., 4 Department of Evolutionary Biology and Animal Ecology, Institute of Biology I (Zoology), Albert-Ludwig University of Freiburg, Freiburg, Germany, 5 Institut für Zoologie und Evolutionsforschung, Friedrich-Schiller-Universität Jena, Jena, Germany and 6 Zoological Research Museum Alexander Koenig, Bonn, Germany
Abstract. Adephaga is the second largest suborder of Coleoptera and contains aquatic and terrestrial groups that are sometimes classified as Hydradephaga and Geadephaga, respectively. The phylogenetic relationships of Adephaga have been studied intensively, but the relationships of the major subgroups of Geadephaga and the placement of Hygrobiidae within Dytiscoidea remain obscure. Here, we infer new DNA-hybridization baits for exon-capture phylogenomics and we combine new hybrid-capture sequence data with transcriptomes to generate the largest phylogenomic taxon sampling within Adephaga presented to date. Our analyses show that the new baits are suitable to capture the target loci across different lineages of Adephaga. Phylogenetic analyses of moderately trimmed supermatrices confirm the hypothesis of paraphyletic ‘Hydradephaga’, with Gyrinidae placed as sister to all other families as in morphology-based phylogenies, even though quartet-concordance analyses did not support this result. All analyses conducted with site-heterogeneous models suggest Trachypachidae as sister to a clade Carabidae + Cicindelidae in congruence with results from morphological studies. Haliplidae is inferred as sister to Dytiscoidea, while a clade of Noteridae (+ most likely Meruidae) is inferred as sister to all remaining Dytiscoidea. A strongly supported clade Hygrobiidae + (Amphizoidae + monophyletic Aspidytidae) is inferred in most analyses of moderately trimmed supermatrices when a site-heterogeneous model is used. In general, we find that stringent trimming of supermatrices results in reduced deviation from model assumptions but also in reduction of phylogenetic information. We also find that site-heterogeneous C60 models provide greater stability of phylogenetic relationships of Adephaga across analyses of different amino-acid supermatrices than site-homogeneous models. Thus, site-heterogeneous C60
models can potentially reduce incongruence in phylogenomics. Lastly, we show that gene-tree errors are prominent in the data, even after sub-sampling genes to reduce these errors, but we also show that subsampling genes based on the likelihood mapping criterion in summary coalescent analyses results in higher topological congruence with the concatenation-based tree. Overall, our analyses demonstrate that moderate alignment trimming strategies, application of site-heterogeneous models and mitigation of gene-tree errors should be routinely included in the phylogenomic pipeline in order to more accurately infer the phylogeny of species.
Correspondence: Alexandros Vasilikopoulos, Centre for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, 53113 Bonn, Germany. E-mail: [email protected] Bernhard Misof, Zoological Research Museum Alexander Koenig, 53113 Bonn, Germany. E-mail: [email protected] [Corrections added on 30 July 2020, after the first online publication: Copyright line has been updated] © 2021 The Authors. Systematic Entomology published by John Wiley & Sons Ltd on behalf of Royal Entomological Society. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Introduction
Beetles (Coleoptera) are the most speciose insect order and their phylogeny has been the focus of attention for many decades (Crowson, 1960; Lawrence & Newton, 1982; Hunt et al., 2007; Lawrence et al., 2011; Beutel et al., 2019a, 2020; McKenna et al., 2019). Polyphaga is the largest beetle suborder and includes numerous phytophagous species (McKenna et al., 2019) as well as species with many other feeding habits. Adephaga, which mostly includes predacious species, is the second largest beetle suborder with more than 45 000 species assigned to 11 families (Beutel et al., 2020; Duran & Gough, 2020). The family-level phylogenetic relationships of Adephaga have been extensively debated but scientists are now reaching a consensus on the most likely scenario of their evolution (McKenna et al., 2019; Beutel et al., 2020; Gustafson et al., 2020). Despite this, open questions remain, such as the phylogenetic relationships of the major terrestrial groups, the phylogenetic position of Hygrobiidae within Dytiscoidea and the intra-familial relationships within Carabidae, Cicindelidae and Dytiscidae (Michat et al., 2017; Vasilikopoulos et al., 2019; Beutel et al., 2020; Gustafson et al., 2020). In addition, analyses of family-level relationships in Adephaga have suggested that some results of previous studies might be artefacts due to systematic errors (Cai et al., 2020). In this study, we address these unresolved issues by combining newly generated exon-capture sequence data with transcriptomic sequence data to infer the phylogeny of Adephaga based on extensive sampling of species.
The majority of species diversity in Adephaga belongs to the terrestrial family Carabidae (ground beetles, >35 000 extant species), whereas the closely related family Cicindelidae is a medium-sized terrestrial group (tiger beetles, >2400 extant species). Trachypachidae is another terrestrial family with only six extant species (Beutel et al., 2020; Duran & Gough, 2020; Lorenz, 2020).
These families have been collectively referred to as ‘Geadephaga’ (Crowson, 1960). The monophyly of this unit has been disputed in the past based on analyses of morphological characters (Burmeister, 1976; Beutel & Roughley, 1988), but most recent morphological and molecular analyses suggest a single origin of the terrestrial adephagan groups (Beutel et al., 2006, 2020; Maddison et al., 2009; McKenna et al., 2019; Gustafson et al., 2020). In contrast to this, the phylogenetic relationships among Carabidae, Cicindelidae and Trachypachidae remain controversial as different phylogenomic analyses
have produced different topologies. Phylotranscriptomic analyses have placed Trachypachidae as sister to Carabidae + Cicindelidae (McKenna et al., 2019). In contrast, analyses of mitochondrial genomes suggested a weakly supported clade of Cicindelidae + Trachypachidae as sister to Carabidae (López-López & Vogler, 2017), whereas analyses of ultraconserved elements (UCEs) suggested Cicindelidae + (Carabidae + Trachypachidae) (Gustafson et al., 2020). It should be noted, however, that the taxon sampling of previous phylogenomic studies was not sufficient to test the monophyly of Carabidae and Cicindelidae and to robustly infer the phylogenetic position of the small family Trachypachidae (Zhang et al., 2018b; McKenna et al., 2019; Gough et al., 2020; Gustafson et al., 2020). In addition, the results of some molecular analyses do not agree with results of morphological studies that suggest Trachypachidae as sister to Carabidae + Cicindelidae (Beutel et al., 2020). Therefore, a re-evaluation of the relationships of Geadephaga with careful examination of potential sources of systematic error and increased species sampling is needed.
The species of the remaining eight families of Adephaga (Amphizoidae, Aspidytidae, Dytiscidae, Haliplidae, Hygrobiidae, Meruidae, Noteridae and Gyrinidae) occur primarily in aquatic or semi-aquatic habitats (Jäch & Balke, 2008; Short, 2018). Most species of Dytiscidae, Gyrinidae, Hygrobiidae and Noteridae are strictly aquatic. Species of Amphizoidae are also aquatic, whereas Aspidytidae and Meruidae occur in hygropetric habitats (Kavanaugh, 1986; Balke et al., 2003; Spangler & Steiner, 2005; Vasilikopoulos et al., 2019). Crowson (1960) suggested that all the aquatic and semi-aquatic groups constitute a monophylum, which he referred to as ‘Hydradephaga’.
Only a few molecular phylogenetic studies have supported this concept (Shull et al., 2001; Ribera et al., 2002; McKenna et al., 2015; López-López & Vogler, 2017), whereas the monophyly of this group has been refuted in more comprehensive studies based on analyses of morphological characters and genomic data (Beutel & Roughley, 1988; Beutel et al., 2006, 2020; Baca et al., 2017a; Gustafson et al., 2020; McKenna et al., 2019). More specifically, the placement of Gyrinidae as sister to all other Adephaga is currently a well-accepted scenario (Baca et al., 2017a; Beutel et al., 2020; Beutel & Roughley, 1988; Gustafson et al., 2020; but see Freitas et al., 2021). In addition, most analyses suggest a sister group relationship of Haliplidae to the superfamily Dytiscoidea (which includes Amphizoidae, Aspidytidae,
Dytiscidae, Hygrobiidae, Meruidae and Noteridae) and a clade Meruidae + Noteridae as sister to all remaining families of Dytiscoidea (Beutel et al., 2006; Baca et al., 2017a; Vasilikopoulos et al., 2019; Gustafson et al., 2020). Despite this, the phylogenetic position of the family Hygrobiidae (squeak beetles) within Dytiscoidea remains contentious (Toussaint et al., 2016; Baca et al., 2017a; Vasilikopoulos et al., 2019, 2021; Cai et al., 2020; Gustafson et al., 2020).
The issue of model and data selection has received considerable attention in the context of the phylogeny of insects and other groups (Misof et al., 2013; Lanfear et al., 2014; Song et al., 2016; Feuda et al., 2017; Ballesteros & Sharma, 2019; Cai et al., 2020; Kapli & Telford, 2020; Evangelista et al., 2021). Specifically, several studies have demonstrated that using unrealistic models of molecular evolution might result in spurious phylogenetic estimates (Lartillot et al., 2007; Song et al., 2010, 2016; Wang et al., 2019; Crotty et al., 2020; Kapli & Telford, 2020). It has also been suggested that selecting sites or genes with reduced deviation from model assumptions might be beneficial (Philippe et al., 2017; Simion et al., 2020). In contrast, other authors have shown that it is difficult to remove systematic bias from the data without removing phylogenetic signal at the same time (Mongiardino Koch & Thompson, 2021). Such issues relating to model misspecification and data selection were also recently discussed in the context of the phylogeny of Adephaga (Vasilikopoulos et al., 2019, 2021; Cai et al., 2020).
Heterogeneous composition of amino acids and nucleotides across taxa or across alignment sites, systematic bias resulting from hypervariable alignment sites, and deficient taxon sampling are among the potential factors affecting the internal phylogeny of the superfamily Dytiscoidea, including the monophyly of Aspidytidae (Baca et al., 2017a; Vasilikopoulos et al., 2019; Cai et al., 2020; Gustafson et al., 2020). Furthermore, it has been observed that summary coalescent and concatenation-based phylogenetic analyses often deliver incongruent topologies within Adephaga (Baca et al., 2017a; Gustafson et al., 2020; Freitas et al., 2021). However, the factors that contribute to this incongruence remain poorly understood (Baca et al., 2017a; Freitas et al., 2021). In particular, the extent of gene-tree errors in previous summary coalescent analyses of Adephaga and their effect on species-tree estimation remain uncertain (Baca et al., 2017a; Vasilikopoulos et al., 2019; Gustafson et al., 2020). Thus, two issues are imperative for a thorough assessment of the phylogenetic relationships of Adephaga in the light of increased taxon sampling: (i) evaluating the extent of gene-tree errors in summary coalescent analyses and (ii) using biologically realistic models in concatenation-based analyses. The issue of data-collection strategies in phylogenetics has also been extensively discussed (McCormack et al., 2013; Young & Gillung, 2020) and several hybrid-enrichment (or sequence-capture) approaches for phylogenomics have been developed (Faircloth et al., 2012; Lemmon et al., 2012; Bragg et al., 2016; Mayer et al., 2016). The use of UCEs (Faircloth et al., 2012) is the only sequence-capture approach that has been applied to the phylogeny of Adephaga so far (Baca et al., 2017a; Gustafson et al., 2020). However, some authors have suggested the use of other sequence-capture or transcriptomic approaches
in addition to or independent of the UCE approach (Bank et al., 2017; Karin et al., 2020) in an attempt to validate and compare results among studies (see also Vasilikopoulos et al., 2021). In this sense, hybrid-enrichment of protein-coding exons (Bank et al., 2017; Sann et al., 2018; Mayer et al., 2021) is another sequence-capture method that can provide complementary or independent evidence for testing the validity of previously suggested phylogenetic hypotheses of Adephaga. Concerning the utility of the exon-capture approach across different scales of molecular divergence, previous research suggests it is only effective for investigating taxonomic clades characterized by small to moderate levels of molecular divergence (Bi et al., 2012; Bragg et al., 2016; Mayer et al., 2016). Nevertheless, if transcriptomic resources are available for a broad set of species within the group of interest, they can be used for testing the applicability of exon-specific DNA-hybridization baits at deeper evolutionary timescales with higher levels of molecular divergence. Additionally, recently developed bioinformatic approaches are able to automatically detect suitable regions for bait design in aligned DNA sequence data, including protein-coding data, by minimizing overall bait-to-target distances (Mayer et al., 2016). Therefore, these bioinformatic approaches offer a promising solution to the problem of designing probes with broad phylogenetic applicability (Lemmon & Lemmon, 2013). Transcriptomic and genomic resources for adephagan beetles have increased considerably in the last few years (Gustafson et al., 2019; McKenna et al., 2019; Vasilikopoulos et al., 2019). Combined with the above-mentioned bioinformatic approaches, these new data make it now possible to test the applicability and efficiency of exon capture for deep-level phylogenetics in Adephaga. 
In this study, we develop a new set of DNA-hybridization baits specifically tailored to capture hundreds of single-copy protein-coding genes across adephagan lineages and generate new hybrid-capture data to infer the phylogeny of Adephaga. We test the efficiency of this set of baits for locus recovery in a large number of specimens. We also combine the newly generated hybrid-capture data with transcriptomes to generate the most species-rich phylogenomic dataset for adephagan beetles presented to date. In order to avoid biased estimates of phylogeny of Adephaga, we take measures to minimize phylogenetic artefacts by: (i) employing biologically realistic models of sequence evolution and (ii) by reducing potentially biasing factors in the data, using data-filtering strategies that select conserved alignment sites. We evaluate the effects of model misspecification and excessive data trimming both on the results of phylogenetic tree reconstructions and on quartet-based analyses of phylogenetic incongruence in an attempt to acquire a more detailed view of phylogenetic signal, conflict and bias in the backbone phylogeny of Adephaga. We also explore whether or not gene-tree discordance (GTD) can be explained by gene-tree estimation errors and suggest possible strategies for selecting informative genes that may increase congruence with concatenation-based analyses. Lastly, we discuss our results in the context of the morphological evolution of Adephaga.
Materials and methods
Taxon sampling
We combined 38 transcriptomes from 23 species of Adephaga and 15 outgroup species (File S1: Table S1) with newly generated exon-capture sequence data from 95 species of Adephaga (File S1: Table S2, note that two specimens of Hydrocanthus oblongus were initially processed but only one was included in the present study). Our initial taxon sampling comprised data from 118 species of Adephaga representing all families, except the monotypic Meruidae, and 21 outgroup species (two terminals of Hymenoptera, three of Mecopterida, two of Strepsiptera, four of Neuropterida, two of Myxophaga, two of Archostemata and six of Polyphaga). The initial taxon sampling includes the six reference species of the ortholog set (see below).
Inference of bait nucleotide sequences for hybrid enrichment of protein-coding exons
We used 24 transcriptomes of Adephaga as a basis to build codon-based nucleotide multiple sequence alignments (MSAs) of orthologous genes and search for MSA regions that are suitable for bait design within Adephaga (see File S1: Table S1 and File S2). First, we used a custom ortholog gene set consisting of 3085 ortholog clusters of single-copy genes (COGs) at the hierarchical level of Holometabola (Vasilikopoulos et al., 2019) to assign orthologous transcripts from each transcriptome to each COG. Orthology assignment of transcripts to each COG was performed with Orthograph v. 0.6.1 (Petersen et al., 2017). Subsequently, we followed procedures for amino-acid MSA, alignment refinement, outlier-sequence removal and removal of reference taxa before generating codon-based nucleotide MSAs (see supplementary information of Misof et al., 2014a for details on these procedures). We then used Baitfisher v. 1.2.7 (Mayer et al., 2016) to screen the codon-based MSAs for regions that are appropriate for bait design within the Adephaga clade (File S2). We conducted seven different tiling design experiments, corresponding to different lengths of bait regions, bait offsets and total number of baits, in order to capture as many promising coding exons as possible while accounting for variable exon length and for a possibly large amount of missing data or hypervariable regions in some parts of the MSAs (see Mayer et al., 2016 for details of the procedure used by Baitfisher, File S1: Table S3). In order to exclude baits targeting multiple genomic regions in adephagan genomes, we filtered the resulting baits (separately for each tiling design experiment) by conducting a blast search against a draft genome assembly of the beetle Bembidion corgenoma (Gustafson et al., 2019, as Bembidion haplogonum, see File S2).
We then selected only one bait region per coding exon in each tiling design experiment: the one that required the minimum number of baits (Mayer et al., 2016). Subsequently, for those exons that were captured in multiple tiling design experiments, only the longest bait region among experiments was considered (see File S2). In total, we inferred 49 786 bait sequences, each 120 bp long, for targeting 923 protein-coding exons from 651 genes. For the sake of simplicity, we refer to our approach as ‘exon capture’ in this study instead of ‘coding-exon capture’, even though in our procedure we intended to include and analyse only the protein-coding regions of the targeted exons (i.e., excluding 3′ and 5′ untranslated regions).
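The two-step selection of bait regions described above can be illustrated with a minimal Python sketch. It is not the authors' code or part of Baitfisher: the record layout (keys `exon`, `experiment`, `region_length`, `n_baits`) and the function name are hypothetical, and ties are resolved by keeping the first candidate encountered, which is an assumption.

```python
def select_bait_regions(candidates):
    """Pick one bait region per exon from candidate regions.

    `candidates` is an iterable of dicts with the hypothetical keys
    'exon', 'experiment', 'region_length' and 'n_baits'.
    """
    # Step 1: within each tiling design experiment, keep for every exon
    # the candidate region that requires the fewest baits.
    best_per_experiment = {}
    for cand in candidates:
        key = (cand["exon"], cand["experiment"])
        current = best_per_experiment.get(key)
        if current is None or cand["n_baits"] < current["n_baits"]:
            best_per_experiment[key] = cand

    # Step 2: for exons recovered in several experiments, keep only the
    # longest of the regions retained in step 1.
    best_per_exon = {}
    for (exon, _experiment), cand in best_per_experiment.items():
        current = best_per_exon.get(exon)
        if current is None or cand["region_length"] > current["region_length"]:
            best_per_exon[exon] = cand
    return best_per_exon


example = [
    {"exon": "e1", "experiment": 1, "region_length": 240, "n_baits": 10},
    {"exon": "e1", "experiment": 1, "region_length": 240, "n_baits": 7},
    {"exon": "e1", "experiment": 2, "region_length": 360, "n_baits": 12},
    {"exon": "e2", "experiment": 1, "region_length": 120, "n_baits": 3},
]
selected = select_bait_regions(example)
```

In this toy input, exon `e1` is represented by the 360 bp region from experiment 2, because the longest-region rule is applied only after the minimum-bait rule has been applied within each experiment.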
Tissue preservation, total genomic DNA extraction, next-generation sequencing library preparation and hybrid enrichment of protein-coding exons
Most specimens used for hybrid-enrichment of target genomic DNA (gDNA) were freshly collected and preserved in 96% ethanol but we also used a few dry pinned museum specimens (File S1: Table S4). Total gDNA was extracted from 95 specimens of Adephaga (File S1: Tables S2, S4) using the DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany) and eluted in 100 μL nuclease-free water. Whenever available, voucher material has been deposited at the Zoologische Staatssammlung München (Zoological State Collection in Munich, Germany, see File S1: Table S4). Quality and quantity of the extracted gDNA were assessed with a Fragment Analyzer (Agilent Technologies Inc., Santa Clara, CA, U.S.A.) and a Quantus Fluorometer (Promega, Fitchburg, WI, U.S.A.). Whenever a sufficient amount of extracted DNA was available, we used 100 ng of DNA diluted in 10 μL for fragmentation before library preparation; otherwise, less than 100 ng was used. First, gDNA was sheared into fragments of 150–400 bp using a Bioruptor Pico sonication device (Diagenode s.a., Seraing, Belgium). Multiple shearing steps were performed for each sample until at least ∼90% of fragments were within the desired length interval. The quality and quantity of the fragmented gDNA were assessed with a Fragment Analyzer at the end of each shearing step. For library preparation, we followed the SureSelectXT2 Target Enrichment System Protocol for Illumina Paired-End Multiplexed Sequencing (Version E1 published in June 2015 by Agilent Technologies Inc.) with some minor modifications (Bank et al., 2017). Specifically, in the library preparation steps ‘End Repair’ and ‘A-tailing’, we reduced the reaction volume specified in Agilent’s protocol (pages 43–49 for 100 ng DNA samples) by 50% as described by Bank et al. (2017).
Subsequently, adapter ligation was performed with the NEBNext Quick Ligation Module and the adapters from the NEBNext Multiplex Oligos for Illumina (Dual Index Set1) kit. Next-generation sequencing (NGS) library PCR was then performed with the NEBNext Multiplex Oligos for Illumina and the NEBNext Q5 HotStart HiFi PCR Master Mix, to dual-index the libraries. Cycles of the NGS library PCR were adjusted as follows (based on the concentration measurements after ‘A-tailing’): 98 °C for 30 s, followed by 8–10 cycles of 98 °C for 10 s and 65 °C for 75 s, followed by 5 min at 65 °C and a hold at 4 °C until the samples were removed from the thermocycler. Subsequently, all steps of the hybrid enrichment followed the protocol given by Bank et al. (2017) with modifications adjusted to the number of library pools and volume concentrations in our study (see File S2).
Fig. 1. Summarized workflow of the steps that were followed to sequence, clean, assemble and combine the hybrid-capture sequence data with transcriptomes and to generate individual COGs. A short workflow for calculating the hybrid-enrichment statistics is also provided. [Workflow figure. Sequencing, cleaning, assembly, orthology assignment: sequencing of enriched genomic libraries (Illumina, NextSeq 500); quality-based filtering of raw reads and adapter trimming (Trimmomatic); assembly of sequenced genomic libraries (IDBA-UD); cross-contamination check of genomic assemblies (CroCo); transcriptomes (previously published and newly assembled); cross-contamination check of assembled transcriptomes (CroCo) and vector contamination screening; orthology assignment (Orthograph). Processing of individual COGs before concatenation: keep only 651 genes (COGs) for which baits were designed; remove non-homologous fragments (MACSE, PREQUAL); multiple sequence alignment of amino-acid sequences (FSA); removal of amino-acid residues that do not align to transcriptomes and reference genomes and manual curation of MSAs; removal of reference taxa, outlier sequences and masking of randomly similar sections (ALISCORE). Calculation of hybrid-enrichment statistics: map baits to filtered assemblies (BWA-mem); map clean pairs of reads to filtered assemblies (BWA-mem); calculate average coverage depth of different assembly regions (e.g., target vs. non-target, SAMtools); calculate enrichment statistics (i.e., Ct/Cn, Ct/Ca).]
Sequencing and assembly of the enriched genomic libraries
The enriched genomic libraries for the 95 samples of Adephaga were paired-end sequenced (150 bp) on a single flow cell of an Illumina NextSeq 500 sequencer (Illumina Inc., San Diego, CA, U.S.A., Fig. 1). Sequenced raw reads of each genomic library were trimmed to remove Illumina adapter sequences and low-quality reads with Trimmomatic v. 0.38 (Bolger et al., 2014, see File S2 for options). Only full pairs of trimmed reads were used for de novo assembly of the enriched genomic libraries (File S1: Table S2). De novo assembly of each genomic library was performed with the software IDBA-UD v. 1.1.3 (see File S2 and Fig. 1) that is optimized to assemble genomic data with highly unequal coverage depth (Peng et al., 2012).
Calculation of hybrid-enrichment statistics
We calculated the ratio of the average per-base coverage depth of the target regions (Ct) to the average per-base coverage depth of the nontarget regions (Cn; Ct/Cn, File S1: Table S2 and Fig. 2) as an approximate measure of the enrichment success for each genomic library in our analyses. To identify the target regions, we first identified bait-binding regions in each assembled genomic library by mapping the bait nucleotide sequences to the clean assembly files (i.e., after putative cross-contaminated contigs had been removed) using the software BWA-mem v. 0.7.17 (Li & Durbin, 2009). Subsequently, we separately mapped the trimmed reads to the assemblies with the same version of BWA-mem. A summarized file with the coverage depth of each assembly position in each assembly was generated with SAMtools v. 1.7 (Li et al., 2009). We used a custom Python script and the IDs of the contigs that contained orthologous sequence (contigs assigned to any of the 651 target COGs, see below) to calculate the average per-base coverage depth of the bait-binding regions but only on those contigs that contained orthologous sequence (i.e., target regions, Ct, Fig. 1). We subsequently calculated the average per-base coverage depth of all remaining regions in the assembly for each genomic library (i.e., nontarget regions, Cn). Lastly, we calculated the average per-base coverage depth of the whole assembly for each assembled genomic library (Ca). Positions with zero coverage depth were excluded from the above calculations to avoid the inflation of enrichment statistics. We considered the statistics Ct/Cn and Ct/Ca as approximate measures of the enrichment success for each of the 95 genomic libraries (File S1: Table S2 and Figs 1, 2). We generated box-plots of these statistics separately for each adephagan family and performed pairwise Mann–Whitney–Wilcoxon tests between families in order to assess whether or not the values for different families were drawn from the same underlying distribution. The pairwise statistical tests were performed in R v. 3.6.3 (File S1: Table S5; R Core Team, 2020).
Fig. 2. Box-plots of Ct/Cn ratios inferred separately for each family of Adephaga. The plots were calculated by pooling the ratios for species of the same family into the same box-plot. [Box-plot figure. y-axis: Ct/Cn ratio (0–200); x-axis: Family (Carabidae, Cicindelidae, Dytiscidae, Gyrinidae, Haliplidae, Noteridae).]
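As an illustration of how Ct, Cn and Ca relate to one another, the following Python sketch computes the three averages and the two ratios from per-base depths. It is not the authors' custom script: the input layout (a dict of per-base depths per contig, and half-open 0-based target intervals restricted to ortholog-bearing contigs) and all function names are assumptions made for this example.

```python
def mean_nonzero(depths):
    """Average depth over positions with coverage > 0 (zero-depth positions are excluded)."""
    covered = [d for d in depths if d > 0]
    return sum(covered) / len(covered) if covered else float("nan")


def enrichment_stats(depth, target_intervals):
    """Compute Ct, Cn, Ca and the ratios Ct/Cn and Ct/Ca for one assembly.

    depth            -- hypothetical dict: contig ID -> list of per-base depths
    target_intervals -- hypothetical dict: contig ID -> list of (start, end)
                        bait-binding regions (0-based, half-open) on contigs
                        that carry orthologous sequence
    """
    target, nontarget = [], []
    for contig, values in depth.items():
        is_target = [False] * len(values)
        for start, end in target_intervals.get(contig, []):
            for pos in range(max(start, 0), min(end, len(values))):
                is_target[pos] = True
        for d, flag in zip(values, is_target):
            (target if flag else nontarget).append(d)

    ct = mean_nonzero(target)               # target regions
    cn = mean_nonzero(nontarget)            # all remaining regions
    ca = mean_nonzero(target + nontarget)   # whole assembly
    return {"Ct": ct, "Cn": cn, "Ca": ca, "Ct/Cn": ct / cn, "Ct/Ca": ct / ca}


toy_depth = {"contig1": [0, 2, 100, 200, 4], "contig2": [2, 0, 4]}
toy_targets = {"contig1": [(2, 4)]}
stats = enrichment_stats(toy_depth, toy_targets)
```

For the toy input, the target positions have depths 100 and 200 (Ct = 150), the covered nontarget positions average 3 (Cn), and the covered positions of the whole assembly average 52 (Ca), giving Ct/Cn = 50.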
Cross-contamination checks and orthology assignment
Putative cross-contaminated sequences or sequences of ambiguous origin within the assembled sequence-capture data were identified with the software package CroCo v. 1.1 (Simion et al., 2018). CroCo is primarily designed to screen RNA-seq data for contamination but can also potentially identify cross-contaminants from genomic data based on the assumption that the coverage of the contaminated contigs differs between the source library of contamination and the contaminated library respectively (see Simion et al., 2018 and also Mayer et al., 2016 for a similar approach). We considered contigs that were 99% similar over a fragment of 200 nucleotides as suspicious for cross-contamination (option: -tool K and otherwise default options). Contigs that were identified as putative contaminants as well as those of ambiguous origin were deleted from the assemblies before downstream analyses (see File S1: Table S6 and File S2 for cross-contamination checks applied for some of the transcriptomes). Orthology assignment of genomic fragments to each of the COGs of the ortholog set was performed with Orthograph v. 0.6.3 (Petersen et al., 2017). From the 3,085 COGs of the ortholog set, we conservatively chose to analyse only the 651 COGs for which we had originally designed baits (File S1: Tables S1, S2). Orthograph-reporter script was run with the ‘protein2dna’ exonerate model for all hybrid-capture data (File S1: Table S2), whereas the default ‘protein2genome’ model was used for all transcriptomes in the dataset (File S1: Table S1, see also File S2 for additional options).
Data filtering, MSA, outlier-sequence removal and masking of randomly similar sections
The output of Orthograph could still possibly contain non-exonic residues due to random extension of open reading frames beyond the protein-coding regions (Bank et al., 2017). Therefore, we followed additional procedures for filtering sequences within each COG. Specifically, we used the software MACSE v. 2.03 (Ranwez et al., 2018, option: -trimNonHomologous) to remove long individual sequence fragments that shared no homology with other sequences in each COG, such as those of possibly unidentified intronic fragments (Ranwez et al., 2018). The software PREQUAL v. 1.02 was subsequently used to remove shorter nonhomologous fragments such as those resulting from assembly artefacts or annotation errors (default parameters, Whelan et al., 2018). These filtering steps were applied at the nucleotide sequence level, and the resulting COGs (amino-acid COGs: aaCOGs, nucleotide COGs: nCOGs) were used for further downstream filtering. We used the software FSA v. 1.15.9 (option: -fast) to infer amino-acid MSAs for each filtered COG (Bradley et al., 2009). We selected the software FSA because it shows higher accuracy (i.e., lower false-positive alignment rate) than other MSA software and tends to leave nonhomologous amino-acid residues unaligned (Bradley et al., 2009). By aligning the amino-acid sequences with FSA, we greatly reduced the possibility of aligning nonhomologous fragments to each other. Subsequently, we filtered the aligned aaCOGs so that amino-acid residues from hybrid-enrichment data that did not align to amino-acid residues of at least one reference species (i.e., official gene set) and at least one transcriptome were masked with an ‘X’. Transcriptomic amino-acid residues that did not align to the protein-coding sequences of at least one reference taxon were also masked with an ‘X’. As a last quality check, we manually curated all aligned aaCOGs to mask putative nonhomologous amino-acid fragments (see File S2). We used these filtered amino-acid alignments as a blueprint to generate corresponding codon-based nucleotide alignments with a modified version of PAL2NAL (Suyama et al., 2006) as described by Misof et al. (2014a). A custom Python script was then used to mask all corresponding codons of the previously masked amino acids with ‘NNN’. We performed additional identification and removal of individual outlier sequences in each aligned aaCOG based on BLOSUM62 expected distances among taxa (see Dietz et al., 2019 and File S2). We then removed all sequences of the reference taxa, except for the sequences of the two hymenopteran species (Harpegnathos saltator, Nasonia vitripennis) and those of Tribolium castaneum that we included as outgroups. Lastly, alignment sections of random similarity within each aaCOG were identified with ALISCORE v. 1.2 (Misof & Misof, 2009; Kück et al., 2010), as described by Vasilikopoulos et al. (2019), and were subsequently removed with ALICUT v. 2.31 (https://github.com/PatrickKueck/AliCUT, access 16 June 2020) both at the amino-acid and the nucleotide sequence levels. The filtered and aligned aaCOGs were finally concatenated into a supermatrix with FASconCAT-G v. 1.04 (Kück & Longo, 2014).
Supermatrix evaluation and optimization for phylogenetic analyses

We opted for an informative subset of the above-described amino-acid supermatrix by using the software MARE v. 0.1.2rc and by removing partitions with an information content of zero (IC = 0, Misof et al., 2013). After careful visual inspection of the resulting supermatrix (supermatrix A, Table 1) we observed that it still contained hypervariable alignment blocks. In addition, supermatrix A contained a large proportion of missing data (∼50%, Table 1), which can bias phylogenetic reconstructions if missing characters are not randomly distributed (Lemmon et al., 2009; Misof et al., 2014b). Additionally, supermatrix A showed evidence for deviation from the assumption of stationarity, reversibility and homogeneity (SRH) as measured with the Bowker’s and Stuart’s tests of symmetry in SymTest v. 2.0.47 (Bowker, 1948; Stuart, 1955; Misof et al., 2014a, see Table 1). Therefore, we chose to filter supermatrix A by applying strategies designed to select conserved alignment sites, reduce the degree of missing data and the potential effects of model violations in phylogenetic reconstructions (Misof et al., 2001; Sharma et al., 2014; Laumer et al., 2019). First, we identified and removed individual gene partitions that deviated from model assumptions using the -symtest option in IQ-TREE v. 2.0.4 (Naser-Khdour et al., 2019; Minh et al., 2020). The resulting filtered amino-acid supermatrix was then trimmed with the software BMGE v. 1.12 (h = 0.5, amino-acid replacement matrix: BLOSUM62) to remove hypervariable alignment sites (resulting in supermatrix D). We selected the software BMGE because it selects informative sites by inferring biologically realistic variability for each column of the alignment (Criscuolo & Gribaldo, 2010; Cai et al., 2020).
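For readers unfamiliar with the symmetry tests mentioned above, the statistic of Bowker's test for one pair of aligned sequences can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the code of SymTest or IQ-TREE: the function name and the choice to skip every non-standard character are assumptions.

```python
def bowker_statistic(seq_a, seq_b, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Bowker's test of symmetry for one pair of aligned sequences.

    Builds the divergence matrix n[x, y] (state x in seq_a paired with state y
    in seq_b) and returns (statistic, degrees of freedom). Under symmetry the
    statistic is approximately chi-square distributed with df degrees of freedom.
    """
    counts = {}
    for x, y in zip(seq_a, seq_b):
        # Gaps and ambiguous characters are skipped.
        if x in alphabet and y in alphabet:
            counts[(x, y)] = counts.get((x, y), 0) + 1

    statistic, df = 0.0, 0
    letters = sorted(alphabet)
    for i, x in enumerate(letters):
        for y in letters[i + 1:]:
            n_xy = counts.get((x, y), 0)
            n_yx = counts.get((y, x), 0)
            if n_xy + n_yx > 0:  # only informative off-diagonal pairs contribute
                statistic += (n_xy - n_yx) ** 2 / (n_xy + n_yx)
                df += 1
    return statistic, df
```

A P value would follow from the chi-square survival function, for example `scipy.stats.chi2.sf(statistic, df)` if SciPy is available.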
We also generated five additional and independent amino-acid supermatrices by directly trimming supermatrix A or the partitions of supermatrix A with BMGE in order to examine the effects of progressively more aggressive filtering on the phylogenetic results (see Table 1). Additional supermatrices were generated by using three degrees of stringency (h = 0.5, h = 0.4 and h = 0.3, see Table 1 and File S3: Fig. S1).
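The effect of lowering the threshold h can be illustrated with a deliberately simplified column filter. The sketch uses plain Shannon entropy normalized to the range 0 to 1; BMGE's own score is weighted by the substitution matrix and is not reproduced here, so the function names and the exact scoring are assumptions.

```python
import math
from collections import Counter

def column_entropy(column, ignore="-X?"):
    """Shannon entropy of one alignment column, in units of log base 20."""
    residues = [c for c in column if c not in ignore]
    if not residues:
        return 1.0  # assumption: columns without residues count as uninformative
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log(n / total, 20) for n in counts.values())

def trim_by_entropy(alignment, h_max):
    """Keep only the columns whose entropy does not exceed h_max."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if column_entropy(col) <= h_max]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}
```

In this toy scheme, as in the study, a smaller threshold removes more of the variable columns and therefore yields a shorter and more conserved matrix.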
© 2021 The Authors. Systematic Entomology published by John Wiley & Sons Ltd on behalf of Royal Entomological Society. doi: 10.1111/syen.12508
Table 1. Summarized statistics and description for each generated and analysed amino-acid supermatrix (see File S3: Fig. S1). Saturation statistics of each supermatrix (adjusted R² and slope) based on the patristic and p-distances are also presented. Saturation of each supermatrix was also measured with the average pairwise lambda score (see text).

| Statistic | A | B | C | Dᵃ | D-recodedᵃ | E | Fᵃ | Gᵃ | Hᵃ | Iᵃ | Jᵃ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of species | 136 | 136 | 136 | 136 | 136 | 136 | 136 | 120 | 100 | 136 | 136 |
| No. of alignment sites | 200 017 | 49 468 | 55 521 | 49 797 | 49 797 | 50 614 | 36 511 | 36 511 | 36 511 | 29 361 | 23 442 |
| P.I. sites | 104 221 | 21 917 | 26 220 | 21 401 | 12 699 | 21 773 | 14 143 | 10 879 | 9658 | 11 711 | 7684 |
| Percent. (%) of P.I. sites | 52.1% | 44.3% | 47.2% | 43.0% | 25.5% | 43.0% | 38.7% | 29.8% | 26.5% | 39.9% | 32.8% |
| Average pairwise λ (lambda) score | 0.163 | 0.118 | 0.135 | 0.116 | 0.069 | 0.116 | 0.095 | 0.079 | 0.074 | 0.104 | 0.069 |
| Adjusted R² (SHETU) | – | 0.425 | 0.369 | 0.451 | – | 0.454 | 0.510 | 0.396 | 0.570 | 0.418 | 0.556 |
| Slope (SHETU) | – | 0.126 | 0.111 | 0.133 | – | 0.133 | 0.155 | 0.230 | 0.247 | 0.135 | 0.177 |
| Adjusted R² (SHOMU) | – | 0.486 | 0.403 | 0.512 | – | 0.515 | 0.569 | 0.393 | 0.575 | 0.480 | 0.642 |
| Slope (SHOMU) | – | 0.213 | 0.182 | 0.226 | – | 0.227 | 0.256 | 0.272 | 0.306 | 0.225 | 0.299 |
| Adjusted R² (SHOMP) | – | 0.479 | 0.405 | N/A | – | N/A | N/A | N/A | N/A | N/A | N/A |
| Cₐ | 0.504 | 0.831 | 0.790 | 0.846 | 0.846 | 0.846 | 0.882 | 0.880 | 0.892 | 0.857 | 0.911 |
| Average p-dist | 0.154 | 0.111 | 0.127 | 0.109 | 0.052 | 0.109 | 0.089 | 0.074 | 0.070 | 0.098 | 0.065 |
| Median pairwise P value, Bowker’s test | 2.14E-02 | 1.07E-01 | 9.46E-02 | 1.26E-01 | 2.16E-01 | 1.22E-01 | 1.99E-01 | 2.27E-01 | 2.34E-01 | 1.75E-01 | 2.96E-01 |
| Median pairwise P value, Stuart’s test | 7.38E-05 | 1.15E-02 | 6.73E-03 | 1.19E-02 | – | 1.14E-02 | 4.15E-02 | 6.99E-02 | 8.53E-02 | 4.51E-02 | 1.51E-01 |
| IC | 0.672 | 0.620 | 0.599 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Percent. (%) of pairwise P-values < 0.05, Bowker’s test | 58.92% | 37.44% | 40.10% | 34.69% | 24.67% | 35.02% | 24.98% | 20.35% | 18.85% | 25.59% | 13.97% |
| Percent. (%) of pairwise P-values < 0.05, Stuart’s test | 82.94% | 64.07% | 68.27% | 64.11% | – | 64.19% | 51.94% | 45.27% | 41.92% | 50.94% | 35.21% |

Description of each supermatrix:

A: Concatenated supermatrix of masked genes with ALISCORE after partitions with IC = 0 had been removed.
B: Trimmed each gene partition of supermatrix A with BMGE, BLOSUM62, h = 0.4, keep only genes with length ≥50 amino-acid sites.
C: Trimmed each partition of supermatrix A with BMGE, BLOSUM62, h = 0.5, keep only genes with length ≥80 amino-acid sites and ≤30% missing data.
Dᵃ: Removed genes that fail symmetry tests (IQ-TREE) from supermatrix A. Subsequently, trimmed resulting supermatrix with BMGE (h = 0.5, BLOSUM62).
D-recodedᵃ: Dayhoff-6 recoded version of supermatrix D.
E: Trimmed supermatrix A with BMGE, BLOSUM62, h = 0.5.
Fᵃ: Trimmed supermatrix A with BMGE, BLOSUM62, h = 0.4.
Gᵃ: Removed distantly related outgroup species from supermatrix F.
Hᵃ: Removed fast evolving ingroup species (20 ingroup species with highest LB scores) from supermatrix G.
Iᵃ: Removed 50% of genes with the highest RCFV value from supermatrix A. Trimmed resulting supermatrix with BMGE, BLOSUM62, h = 0.5.
Jᵃ: Trimmed supermatrix A with BMGE, BLOSUM62, h = 0.3.

ᵃ Analysed under the Bayesian site-heterogeneous model CAT + GTR + G4 (BSHETU). P.I.: parsimony informative, Cₐ: overall alignment completeness score, IC: information content (MARE), p-dist: observed pairwise distances, N/A: not applicable, SHETU: site-heterogeneous unpartitioned, SHOMU: site-homogeneous unpartitioned, SHOMP: site-homogeneous partitioned.
Among-species compositional heterogeneity is a potential source of systematic error that is frequently associated with fast evolving sites (Foster, 2004; Kocot et al., 2017). We generated two amino-acid supermatrices by using two different approaches for reducing among-species compositional heterogeneity (i.e., Dayhoff6-recoding and removal of genes with high relative composition frequency variation, RCFV, see Table 1 and File S2). We also tested whether the removal of distantly related
outgroup species or the removal of long-branched ingroup taxa (based on long-branch scores, LB scores, see File S2 and File S1: Table S7) affected the phylogenetic relationships. We performed a large number of statistical tests on each generated supermatrix in order to evaluate its suitability for phylogenetic reconstruction (Table 1). First, we inferred substitution saturation plots for most analysed supermatrices (Table 1 and File S2, see Misof et al., 2001; Nosenko et al., 2013) by
calculating pairwise amino-acid p-distances and pairwise patristic distances. We also inferred an alternative measure of substitution saturation that is independent of the patristic distances of the inferred trees: the average lambda score for each supermatrix (i.e., λ, ranging from 0.0 to 1.0) that was recently introduced for pairs of aligned sequence data (higher values indicate a higher degree of saturation, Jermiin & Misof, 2020). All pairwise λ scores in each supermatrix were calculated with the software SatuRation v. 1.0 (available from: https://github.com/lsjermiin/SatuRation.v1.0, last access: 5 January 2021, Jermiin & Misof, 2020). We also measured the overall deviation from SRH conditions with the software SymTest v. 2.0.47 (current version available at https://github.com/ottmi/symtest, last access 20 April 2020, see Misof et al., 2014a) for each filtered supermatrix and for the original supermatrix A by applying the Bowker’s and Stuart’s tests of symmetry (Table 1). Additionally, we calculated the overall completeness scores of the analysed supermatrices and generated heatmaps of pairwise completeness scores with AliStat v. 1.11 (Wong et al., 2020, Table 1). Lastly, we screened each generated supermatrix for taxa with heterogeneous sequence divergence by generating heatmaps of pairwise mean similarity scores with ALIGROOVE v. 1.06 (Kück et al., 2014).
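The saturation statistics reported in Table 1 (slope and adjusted R²) rest on a simple regression, which can be sketched as follows. The sketch assumes that observed p-distances are regressed on patristic distances and that sites with gaps or ambiguous characters are ignored; the function names are hypothetical.

```python
def p_distance(seq_a, seq_b, ignore="-X?"):
    """Proportion of differing sites among the sites compared in both sequences."""
    compared = differing = 0
    for x, y in zip(seq_a, seq_b):
        if x in ignore or y in ignore:
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else float("nan")

def regress(x, y):
    """Least-squares fit y = a + b*x; returns (slope b, adjusted R2).

    x: patristic distances taken from the inferred tree (assumed predictor),
    y: observed p-distances for the same pairs of taxa.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    # Adjustment for a single predictor: (n - 1) / (n - 2).
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, adjusted_r2
```

Because the patristic distances depend on the model used to infer the tree, the same alignment yields different slopes and R² values under different models, which is why Table 1 reports them separately for the SHETU, SHOMU and SHOMP models.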
Concatenation-based phylogenetic analyses of amino-acid supermatrices

Modelling site-specific propensities of amino acids has been shown to be more important than modelling partition-wise heterotachy in concatenation-based phylogenomic analyses (Feuda et al., 2017; Wang et al., 2019). In order to account for site-specific amino-acid preferences in the supermatrices, we analysed most amino-acid supermatrices under the site-heterogeneous model CAT + GTR + G4 (Bayesian site-heterogeneous model, BSHETU) using the software PhyloBayes MPI v. 1.8 (Table 1, Lartillot et al., 2013). Two independent MCMC chains were run for each dataset until more than 20 000 samples were collected or until convergence (File S1: Table S8). We also analysed the amino-acid supermatrices using a maximum likelihood approach (ML) with IQ-TREE v. 1.6.12 (Nguyen et al., 2015). We first selected the best-fitting substitution models in ModelFinder based on the AICc criterion on the unpartitioned matrices (File S1: Table S9; Akaike, 1974; Kalyaanamoorthy et al., 2017). In order to test the relative fit of site-heterogeneous versus site-homogeneous models, we also included empirical site-heterogeneous profile mixture models in our model-selection procedure (i.e., C20, C40, C60, Quang et al., 2008). In total, more than 270 models were tested on each of supermatrices B–J (unpartitioned data) except for the recoded dataset, which was only analysed with the BSHETU model. For the partitioned supermatrices (B, C, Table 1), we also calculated an optimal partitioning scheme using an edge-linked partition model using the same version of IQ-TREE (File S2, Chernomor et al., 2016; Lanfear et al., 2014). For these supermatrices, we assessed the relative model fit of site-homogeneous unpartitioned (SHOMU), site-homogeneous partitioned (SHOMP) and
site-heterogeneous unpartitioned (SHETU) models by using a fixed neighbour-joining tree (File S1: Table S10 and File S2). Phylogenetic tree inference was performed for each supermatrix with the SHOMU, SHETU, PMSF (posterior mean-site frequency profile model as an approximation to the C60 SHETU model, File S2, Wang et al., 2018) and SHOMP models (where applicable). This was done in order to explore the extent to which using a suboptimal model affected phylogenetic reconstructions (File S1: Tables S9, S10). Statistical branch support of the inferred relationships in all concatenation-based ML analyses was estimated based on 2,000 ultrafast bootstrap (UFB) replicates (Hoang et al., 2018). As a complementary measure of support, we inferred quartet-concordance scores (QC) with Quartet Sampling v. 1.3.1 (Pease et al., 2018, option: -nreps 150) on the tree that resulted from the SHETU-based analysis of supermatrix D (Fig. 3). For inferring QC, we used a site-heterogeneous but less complex model than the one used to infer the tree (i.e., JTT + C20 + F + R8 instead of JTT + C60 + F + R8 due to computational limitations and using the same version of IQ-TREE, File S3: Fig. S2). Lastly, we calculated pairwise RF distances among the inferred trees under the same model (SHOMU, SHETU and PMSF) for amino-acid datasets with full taxon sampling using ETE v. 3.1.1 (Huerta-Cepas et al., 2016).
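The normalized RF distance used to compare trees can be sketched without any external library. The study used ETE; the version below is an independent illustration in which trees are given as nested tuples of leaf names and the distance is normalized by the total number of non-trivial splits in both trees. Both of these choices are assumptions.

```python
def splits(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        clades.add(below)
        return below

    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # each split is stored as the side without this leaf
    result = set()
    for clade in clades:
        side = clade if anchor not in clade else all_leaves - clade
        if 2 <= len(side) <= len(all_leaves) - 2:  # drop trivial splits
            result.add(side)
    return result

def normalized_rf(tree_a, tree_b):
    """Symmetric difference of the split sets, scaled to the range 0 to 1."""
    a, b = splits(tree_a), splits(tree_b)
    maximum = len(a) + len(b)
    return len(a ^ b) / maximum if maximum else 0.0
```

For two fully resolved trees with n leaves the denominator equals 2(n − 3), the usual maximum of the RF distance for unrooted binary trees.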
Phylogenetic analyses of nucleotide sequence data

To assess the stability of phylogenetic results among analyses of different types of data, we also generated and analysed four supermatrices at the nucleotide sequence level (File S1: Table S11). Analyses of these supermatrices were performed with the same version of IQ-TREE and by selecting best-fitting SHOMP and SHOMU models (see File S2 and File S1: Table S11). We also inferred phylogenetic relationships using a model that accounts for heterotachy among sequences but has only been extensively tested in analyses of nucleotide sequence data (see File S1: Table S11, Crotty et al., 2020).
Estimating alternative and confounding signals in supermatrices via four-cluster likelihood mapping and data permutations

In addition to the quartet-concordance measure, we applied the four-cluster likelihood mapping approach (FcLM, Strimmer & von Haeseler, 1997) to assess the robustness of phylogenetic results, and to measure the strength of alternative phylogenetic signals with respect to specific phylogenetic hypotheses that resulted from the analyses of supermatrix D (Fig. 3 and Table 2). The hypotheses that we tested were the following: (i) Hygrobiidae is sister to a clade of Amphizoidae + Aspidytidae (hypothesis 1) and (ii) Cicindelidae is the sister group of Carabidae (hypothesis 2). FcLM analyses were performed on different amino-acid supermatrices that were trimmed with different degrees of stringency and were based on both SHETU and SHOMU models, in an attempt to assess whether model misspecification affected the phylogenetic signal in favour
Fig. 3. Phylogenetic relationships of Adephaga as they resulted from the analysis of supermatrix D under the JTT + C60 + F + R8 site-heterogeneous model (i.e., SHETU model). Circles on tree nodes indicate branch support based on 2,000 ultrafast bootstraps (UFB). All beetle photos were provided by M. Balke. Figure legend: ultrafast bootstrap support categories 100%, 95–99%, 85–94%, 75–84% and 65–74%; labelled groups are Outgroups, Gyrinidae, Trachypachidae, Cicindelidae, Carabidae, Haliplidae, Noteridae, Hygrobiidae, Aspidytidae, Amphizoidae and Dytiscidae, with brackets for GEADEPHAGA and "HYDRADEPHAGA".
Table 2. Detailed results of the four-cluster likelihood mapping analyses (FcLM) for the two examined phylogenetic hypotheses. Results (i.e., percentages) are shown only for the fully resolved quartets (i.e., quartets falling within the corner areas of the triangular Voronoi diagrams, see Strimmer & von Haeseler, 1997). Hypo1: 25 296 quartets; Hypo2: 30 912 quartets.

| Model (data) | Hypothesis | Quartet topology | Supermatrix D | Supermatrix E | Supermatrix F | Supermatrix J |
|---|---|---|---|---|---|---|
| SHETU model (original data) | Hypo1 | Given topology: (Hyg. + Amp. + Asp.), (Dyt. + Rem.) | 59.80% | 58.80% | 53.70% | 44.00% |
| SHETU model (original data) | Hypo1 | Alternative topology 1: (Hyg. + Dyt.), (Rem. + Amp. + Asp.) | 8.10% | 8.30% | 11.70% | 10.60% |
| SHETU model (original data) | Hypo1 | Alternative topology 2: (Hyg. + Rem.), (Dyt. + Amp. + Asp.) | 28.80% | 29.60% | 29.00% | 36.00% |
| SHETU model (original data) | Hypo1 | Total resolved quartets (%) | 96.70% | 96.70% | 94.40% | 90.60% |
| SHETU model (original data) | Hypo2 | Given topology: (Cici. + Cara.), (Tr. + Rem.) | 76.50% | 75.90% | 68.10% | 50.20% |
| SHETU model (original data) | Hypo2 | Alternative topology 1: (Tr. + Cara.), (Cici. + Rem.) | 12.60% | 12.80% | 16.60% | 21.90% |
| SHETU model (original data) | Hypo2 | Alternative topology 2: (Tr. + Cici.), (Cara. + Rem.) | 5.90% | 6.20% | 7.30% | 13.80% |
| SHETU model (original data) | Hypo2 | Total resolved quartets (%) | 95.00% | 94.90% | 92.00% | 85.90% |
| SHETU model (permuted data) | Hypo1 | Given topology: (Hyg. + Amp. + Asp.), (Dyt. + Rem.) | 10.70% | 8.80% | 36.60% | 3.30% |
| SHETU model (permuted data) | Hypo1 | Alternative topology 1: (Hyg. + Dyt.), (Rem. + Amp. + Asp.) | 43.60% | 48.00% | 34.50% | 2.40% |
| SHETU model (permuted data) | Hypo1 | Alternative topology 2: (Hyg. + Rem.), (Dyt. + Amp. + Asp.) | 43.40% | 39.40% | 18.40% | 3.90% |
| SHETU model (permuted data) | Hypo1 | Total resolved quartets (%) | 97.70% | 96.20% | 89.50% | 9.60% |
| SHETU model (permuted data) | Hypo2 | Given topology: (Cici. + Cara.), (Tr. + Rem.) | 32.50% | 29.10% | 28.80% | 2.50% |
| SHETU model (permuted data) | Hypo2 | Alternative topology 1: (Tr. + Cara.), (Cici. + Rem.) | 28.60% | 24.20% | 22.00% | 3.50% |
| SHETU model (permuted data) | Hypo2 | Alternative topology 2: (Tr. + Cici.), (Cara. + Rem.) | 35.10% | 42.60% | 38.00% | 11.20% |
| SHETU model (permuted data) | Hypo2 | Total resolved quartets (%) | 96.20% | 95.90% | 88.80% | 17.20% |
| SHOMU model (original data) | Hypo1 | Given topology: (Hyg. + Amp. + Asp.), (Dyt. + Rem.) | 65.30% | 64.40% | 61.50% | 51.00% |
| SHOMU model (original data) | Hypo1 | Alternative topology 1: (Hyg. + Dyt.), (Rem. + Amp. + Asp.) | 6.30% | 6.60% | 8.60% | 9.10% |
| SHOMU model (original data) | Hypo1 | Alternative topology 2: (Hyg. + Rem.), (Dyt. + Amp. + Asp.) | 27.30% | 27.90% | 28.20% | 36.60% |
| SHOMU model (original data) | Hypo1 | Total resolved quartets (%) | 98.90% | 98.90% | 98.30% | 96.70% |
| SHOMU model (original data) | Hypo2 | Given topology: (Cici. + Cara.), (Tr. + Rem.) | 67.20% | 66.80% | 60.90% | 48.30% |
| SHOMU model (original data) | Hypo2 | Alternative topology 1: (Tr. + Cara.), (Cici. + Rem.) | 24.80% | 25.10% | 29.90% | 35.00% |
| SHOMU model (original data) | Hypo2 | Alternative topology 2: (Tr. + Cici.), (Cara. + Rem.) | 5.70% | 5.80% | 6.10% | 11.40% |
| SHOMU model (original data) | Hypo2 | Total resolved quartets (%) | 97.70% | 97.70% | 96.90% | 94.70% |

Note: for the FcLM analyses we only included supermatrices that are comparable with respect to the effects of data trimming because they resulted from direct trimming of supermatrix A. Supermatrix D resulted from trimming a slightly different version of supermatrix A from which only 12 genes had been removed. Amp.: Amphizoidae, Asp.: Aspidytidae, Hyg.: Hygrobiidae, Dyt.: Dytiscidae, Rem.: remaining species, Cici.: Cicindelidae, Cara.: Carabidae, Tr.: Trachypachidae, SHETU: site-heterogeneous unpartitioned, SHOMU: site-homogeneous unpartitioned.
of specific hypotheses (Table 2). In addition, FcLM analyses under the better-fitting SHETU models were performed with permutations of data (i.e., randomization of phylogenetic signal, permutation no. I, see Misof et al., 2014a), in order to assess whether or not the FcLM support for a particular inferred relationship under the SHETU models resulted from misleading signal (Table 2, Misof et al., 2014a).
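How a quartet comes to be counted as "fully resolved" can be sketched as below. The sketch converts the log-likelihoods of the three possible quartet topologies into posterior weights and assigns the quartet to the nearest of seven attractors in the likelihood-mapping triangle. The nearest-attractor rule is our reading of the basins of attraction in Strimmer & von Haeseler (1997), and all names in the code are hypothetical.

```python
import math

# Three corners (fully resolved), three edge midpoints (partly resolved)
# and the centre (unresolved, star-like).
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0),
    "T2": (0.0, 1.0, 0.0),
    "T3": (0.0, 0.0, 1.0),
    "T1|T2": (0.5, 0.5, 0.0),
    "T1|T3": (0.5, 0.0, 0.5),
    "T2|T3": (0.0, 0.5, 0.5),
    "star": (1 / 3, 1 / 3, 1 / 3),
}

def posterior_weights(log_likelihoods):
    """Normalize three log-likelihoods to weights that sum to one."""
    best = max(log_likelihoods)
    raw = [math.exp(value - best) for value in log_likelihoods]  # avoids underflow
    total = sum(raw)
    return tuple(r / total for r in raw)

def region(weights):
    """Name of the attractor closest to the weight vector."""
    return min(ATTRACTORS, key=lambda name: math.dist(weights, ATTRACTORS[name]))

def resolved_percentages(quartet_log_likelihoods):
    """Percentage of all quartets that fall into each corner region."""
    counts = {"T1": 0, "T2": 0, "T3": 0}
    for log_likelihoods in quartet_log_likelihoods:
        name = region(posterior_weights(log_likelihoods))
        if name in counts:
            counts[name] += 1
    n = len(quartet_log_likelihoods)
    return {name: 100.0 * count / n for name, count in counts.items()}
```

Because the percentages are taken over all quartets, the three corner values need not sum to 100; their sum corresponds to the "total resolved quartets" rows of Table 2.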
Summary coalescent phylogenetic analyses To explore the sensitivity of our concatenation-based analyses to the putative effects of incomplete lineage sorting (ILS), we conducted summary coalescent phylogenetic analyses (SCAs) with ASTRAL III v. 5.7.3 (Zhang et al., 2018a). As SCAs are prone to gene-tree estimation errors (Mirarab et al., 2016; Sayyari et al., 2017) we took steps to reduce these effects on our analyses. Alignment trimming methods have been shown to be detrimental in phylogenetic inference of gene trees (Tan et al., 2015) and therefore we selected the unmasked amino-acid alignments for these analyses (before trimming with ALISCORE, Fig. 1, File S3: Fig. S1). However, in order to reduce the negative effects of fragmentary sequences (Sayyari et al., 2017), which are common for hybrid-capture data (Hosner et al., 2016), we (i) removed alignment sites with more than or equal to 50% ambiguous characters, and then (ii) removed sequences for which more than 75% of sequence length contained ambiguous characters. Finally, we kept only genes that had a length of at least 150 amino acids and less than 50% missing data. The filtering tasks were performed with custom Perl scripts. In total, 348 filtered gene alignments were used for SCA. Gene trees were inferred after selecting the best-fitting models (SHOMU models) with the same version of IQ-TREE (see File S2). Branch support of individual gene trees was calculated based on 10 000 SH-aLRT replicates (Guindon et al., 2010; Simmons & Kessenich, 2020). SCAs were then conducted with ASTRAL after collapsing weakly supported branches ( 0.30), whereas analyses of supermatrix F have reached the convergence value of maxdiff. = 0.307 (considered acceptable in our study). N.I.: not inferred, N.A.: not applicable, SHETU: site-heterogeneous unpartitioned, SHOMP: site-homogeneous partitioned, SHOMU: site-homogeneous unpartitioned, BSHETU: Bayesian CAT+ GTR + G4 model.
(Fig. 3 and File S3: Figs S20–S24). Within Geadephaga, the monophyly of tiger beetles (Cicindelidae) and their placement as sister to monophyletic ground beetles (Carabidae) were inferred in most analyses under the site-heterogeneous models (BSHETU, SHETU, PMSF, Table 3; File S3: Figs S5–S19, S34–S42) and were also supported by analyses of nucleotide sequence data (File S3: Figs S20–S24). In contrast, a clade Trachypachidae + Carabidae was only obtained in analyses of supermatrix J and only under conditions of model misspecification (i.e., SHOMU model) or under the PMSF approximation, yet with no strong statistical branch support (Table 3; File S3: Figs S33, S42). Concerning the inferred position of the family Hygrobiidae, all ML analyses under the better-fitting SHETU models supported a clade of Hygrobiidae + (Amphizoidae + Aspidytidae) and most of them with strong UFB support (e.g., Fig. 3). The QC score also strongly supported this clade (File S3: Fig. S2, QC = 0.33). UFB support in favour of this clade under SHETU models was lower when more stringent trimming criteria were applied, but the inference of this clade remained robust to the selection of dataset when a SHETU model was applied (Table 3). On the other hand, analyses under the SHOMU and SHOMP models were inconsistent regarding this hypothesis (Table 3). Specifically, SHOMU analyses of the most stringently trimmed supermatrix under full taxon sampling (supermatrix J) supported a clade Dytiscidae + (Amphizoidae + Aspidytidae) as sister to Hygrobiidae but not with strong statistical branch support (Table 3). In general, progressive trimming with more stringent criteria resulted in a shift from a strongly or moderately supported Hygrobiidae + (Amphizoidae + Aspidytidae) clade (supermatrix D and E) to a poorly supported Dytiscidae + (Amphizoidae + Aspidytidae) clade (supermatrix F and J) but only in conditions of model misspecification (SHOMU models). This pattern is also observed under the BSHETU model but only
for the most stringently trimmed supermatrix (supermatrix J, Table 3). Phylogenetic analyses with the PMSF approximation to the SHETU model (using a SHOMU-based guide tree with a different topology, Table 3) restored the monophyly of Hygrobiidae + (Amphizoidae + Aspidytidae) for most supermatrices (e.g., supermatrices F, G, I, J, but not for supermatrix C). This suggests that the clade Dytiscidae + (Amphizoidae + Aspidytidae) inferred under SHOMU models for these supermatrices is likely an artefact due to model misspecification. Overall, a clade that includes Amphizoidae, Aspidytidae and Dytiscidae is never strongly supported even in the few instances that it is inferred under a site-heterogeneous model (BSHETU or PMSF, Table 3; File S3: Figs S16, S18, S19, S35, S43). Additional support for a clade Hygrobiidae + (Amphizoidae + Aspidytidae) comes from the results after removing distant outgroups and long-branched ingroup taxa from supermatrix F. Specifically, removing only distantly related outgroup taxa did not result in strong UFB support for this clade under the SHETU model (93%, Table 3) but when long-branched ingroup taxa were also removed, UFB support for the above-mentioned clade increased under the same model (98%). Additionally, the topology flipped from the clade Dytiscidae + (Amphizoidae + Aspidytidae) to the clade Hygrobiidae + (Amphizoidae + Aspidytidae) under the SHOMU and BSHETU models when long-branched ingroup species were removed (although not with strong support under the BSHETU, Table 3). This suggests that removal of distant outgroups without also accounting for branch-length heterogeneity of the ingroup might result in erroneous topology even when a site-heterogeneous model is used. 
Phylogenetic analyses of the Dayhoff6-recoded matrix D recovered unexpected and poorly supported clades with respect to the internal phylogeny of Dytiscoidea and more generally Adephaga (e.g., Gyrinidae + Geadephaga and Amphizoidae + Dytiscidae with low support, File S3: Fig. S14). Although the BSHETU
analyses of the recoded matrix did not reach convergence (File S3: Fig. S14 and File S1: Table S8, maxdiff = 0.49, more than 29 000 samples per MCMC chain), these observations suggest that amino-acid data-recoding might be detrimental when excessive alignment trimming and data filtering have been applied before recoding the data.
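Dayhoff-6 recoding, as applied to supermatrix D, collapses the 20 amino acids into six groups of chemically similar residues. A minimal sketch is given below; the use of the digits 0 to 5 as state labels and of ‘?’ for ambiguous characters are assumptions made for illustration.

```python
# The six Dayhoff groups; the digit labels are arbitrary.
DAYHOFF6 = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "ILMV",
    "4": "FWY",
    "5": "C",
}
RECODE = {aa: code for code, group in DAYHOFF6.items() for aa in group}

def dayhoff6(sequence):
    """Recode an aligned amino-acid sequence into six states.

    Gaps are kept as '-'; any other non-standard character becomes '?'.
    """
    return "".join(aa if aa == "-" else RECODE.get(aa, "?")
                   for aa in sequence.upper())
```

Recoding reduces compositional heterogeneity at the cost of information, which is consistent with the drop in parsimony-informative sites between supermatrix D and its recoded version (Table 1).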
Internal phylogeny of Carabidae, Cicindelidae, Dytiscidae and Gyrinidae based on analyses of concatenated sequence data

Analyses of amino-acid and nucleotide supermatrices in a concatenation framework yielded the monophyly of all subfamilies in Dytiscidae (Fig. 3). However, phylogenetic relationships among these dytiscid subunits were unstable and not consistently resolved in all analyses except for a few cases. For instance, the subfamily Hydrodytinae was always inferred as sister to Hydroporinae with strong UFB and QC support (Fig. 3 and File S3: Fig. S2). The subfamilies Coptotominae and Lancetinae were always inferred as sister groups (Fig. 3; File S3: Figs S5–S19). In addition, all concatenation-based analyses resulted in a clade that includes all subfamilies of Dytiscidae excluding Coptotominae, Laccophilinae and Lancetinae with strong UFB and QC support (Fig. 3 and File S3: Figs S2, S5–S44). Specifically, most analyses with the SHETU models recovered Lancetinae + Coptotominae as sister to Laccophilinae + remaining Dytiscidae (Fig. 3; File S3: Figs S5–S12). In addition, most analyses of amino-acid supermatrices suggested the placement of Copelatinae as sister to a clade Matinae + (Hydrodytinae + Hydroporinae) (Fig. 3; File S3: Figs S5–S19, S25–S44). Lastly, the clades Agabinae + Colymbetinae and Cybistrinae + Dytiscinae were inferred consistently with strong support (Fig. 3; File S3: Figs S5–S44). Concerning the internal phylogeny of Cicindelidae, the tribe Manticorini was inferred as sister to all other subfamilies in all concatenation-based analyses (Fig. 3; File S3: Figs S5–S44). This result received high QC or high UFB support across concatenation-based analyses (File S3: Figs S2, S5–S44). Although a paraphyletic Manticorini was inferred in a few instances, this result was likely an artefact due to the extremely high degree of missing data for the species Manticora latipennis (File S1: Table S2).
The tribe Megacephalini was placed as sister to all remaining Cicindelidae except Manticorini, whereas the tribe Collyridini was inferred as sister to a clade that included Cicindelini and Oxycheilini (Fig. 3 and File S3: Figs S5–S19, S25–S44). In contrast to Cicindelidae, the internal phylogeny of the megadiverse Carabidae remained largely unstable across analyses of different supermatrices and models (File S3: Figs S5–S44). However, some relationships were robustly inferred. For instance, the subfamily Trechinae was always inferred as sister to Brachininae + monophyletic Harpalinae, whereas the subfamilies Paussinae, Rhysodinae and Siagoninae were placed in a monophyletic group close to the base of the tree of Carabidae in analyses of amino-acid supermatrices (Fig. 3 and File S3: Figs S5–S19, S25–S44). Lastly, Carabinae was inferred as
sister to Nebriinae in most phylogenetic analyses of amino-acid supermatrices (Fig. 3 and File S3: Figs S5–S19, S25–S44). Within Gyrinidae, a strongly supported clade Dineutini + Orectochilini (as sister to Gyrinini) was inferred in different concatenation analyses of different types of data and models (Fig. 3 and File S3: Figs S5–S44). Dineutini was inferred as paraphyletic with respect to Orectochilini in analyses of amino-acid supermatrices but not always with strong UFB support (Fig. 3). Additionally, the inferred QC score did not support a paraphyletic Dineutini (QC = −0.1, File S3: Fig. S2). Analyses of nucleotide sequence data mostly suggested a monophyletic Dineutini as sister to Orectochilini, but monophyly of Dineutini was only strongly supported in one analysis of nucleotide sequence data (supermatrix D_nt, File S1: Table S11 and File S3: Fig. S24).
Comparison of different schemes of evolutionary modelling and the predictability of substitution saturation

In total, 277 models were tested on each unpartitioned amino-acid supermatrix with ModelFinder. The results show that SHETU models significantly outperformed the best SHOMU models for all supermatrices in an unpartitioned context (File S1: Table S9). All the best-fitting SHETU models included 60 categories of fixed empirical amino-acid frequencies (i.e., C60 site-heterogeneous models) suggesting that the most complex SHETU models fitted the data better even for the most stringently trimmed supermatrices (e.g., supermatrices F and J, File S1: Table S9). Comparison of the optimal partitioning schemes (SHOMP) for supermatrices B and C with the complex SHETU models showed that site-heterogeneous models (SHETU) fitted these datasets better than both partitioned and unpartitioned site-homogeneous models (SHOMP and SHOMU, File S1: Table S10). Based on the observation that SHETU models fit the data better, the inferred saturation statistics showed that using a site-homogeneous model (SHOMP or SHOMU) resulted in underestimation of the amount of substitution saturation in the amino-acid supermatrices when a measure that is dependent on patristic distances was used (i.e., adjusted R², Table 1).
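The model comparison above rests on the corrected Akaike information criterion (AICc), which penalizes the number of free parameters relative to the number of alignment sites. A minimal sketch follows; the function names are hypothetical and any log-likelihoods passed to it are placeholders rather than values from the study.

```python
def aicc(log_likelihood, n_params, n_sites):
    """Corrected Akaike information criterion (lower values indicate better fit)."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    correction = (2.0 * n_params * (n_params + 1)) / (n_sites - n_params - 1)
    return aic + correction

def rank_models(fits, n_sites):
    """Sort models by AICc.

    fits: dict model name -> (log-likelihood, number of free parameters).
    Returns a list of (model name, AICc), best model first.
    """
    scored = {name: aicc(lnl, k, n_sites) for name, (lnl, k) in fits.items()}
    return sorted(scored.items(), key=lambda item: item[1])
```

A more parameter-rich model is preferred only if its gain in log-likelihood outweighs the penalty, which is the sense in which the C60 models are said to fit better here.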
Stability of inferred relationships of Adephaga across analyses with different evolutionary models We calculated all pairwise normalized RF distances among trees inferred under the same model (SHOMU, SHETU or PMSF) for those amino-acid datasets with full taxon sampling (seven trees per model, supermatrices B, C, D, E, F, I, J, Fig. 5). We assessed whether topological distances between inferred trees differ when using different evolutionary models. Although RF distances of inferred trees did not significantly differ between PMSF and SHOMU models (P value = 0.237, Mann–Whitney–Wilcoxon test with continuity correction) or between PMSF and SHETU models (P value = 0.136, Mann–Whitney–Wilcoxon test with continuity correction),
[Figure 5 plot: box-plots of pairwise normalized RF distance (y-axis, 0.00–0.15) for each model (x-axis: PMSF, SHETU, SHOMU).]
Fig. 5. Box-plots of all pairwise normalized Robinson–Foulds distances among trees that were inferred from different amino-acid supermatrices under the same type of model (normalized distances, only maximum likelihood analyses). We only included distances among trees that were inferred with full taxon sampling (i.e., supermatrices: B, C, D, E, F, I, J). SHETU: site-heterogeneous unpartitioned model, PMSF: posterior mean-site frequency profile model, SHOMU: site-homogeneous unpartitioned model.
RF distances of inferred trees were lower in analyses under SHETU models than in analyses under SHOMU models (P value = 0.013, Mann–Whitney–Wilcoxon test with continuity correction, Fig. 5). This result is congruent with the consistent inference of the clade Hygrobiidae + (Amphizoidae + Aspidytidae) under SHETU models, a clade that was not consistently inferred under the SHOMU models, and constitutes further evidence that full site-heterogeneous empirical mixture models (C60, ML-based) result in greater stability of the inferred relationships than the less complex SHOMU models (Table 3 and Fig. 5).
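As an illustration of the distance measure used in this comparison, the following is a minimal sketch of a normalized Robinson–Foulds distance. It assumes unrooted, fully resolved trees on the same set of taxa, encoded as nested tuples (a hypothetical input format chosen for brevity), and it normalizes by the maximum value 2(n − 3); the software used in the study may define the normalization differently.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    found = set()
    for child in tree:
        found |= leaves(child)
    return found


def splits(tree):
    """Return the non-trivial bipartitions of a tree.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so the result does not depend on where the tuple is rooted.
    """
    all_taxa = frozenset(leaves(tree))
    reference = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset()
        for child in node:
            clade |= visit(child)
        side = all_taxa - clade if reference in clade else clade
        if 1 < len(side) < len(all_taxa) - 1:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of bipartitions divided by 2(n - 3)."""
    n = len(leaves(tree_a))
    return len(splits(tree_a) ^ splits(tree_b)) / (2 * (n - 3))


def all_pairwise(trees):
    """Normalized RF distances for every pair of trees in a list."""
    return [normalized_rf(a, b) for a, b in combinations(trees, 2)]


# Hypothetical five-taxon trees.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = ((("A", "D"), "B"), ("C", "E"))
```

With seven trees per model, as in the comparison above, `all_pairwise` would return 21 distances per model.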
Effects of removing hypervariable sites, distantly related outgroups and long-branched taxa on the statistical properties of amino-acid supermatrices Removal of hypervariable sites had a positive impact on the statistical properties of amino-acid supermatrices in terms of eliminating potential confounding factors (Table 1). In particular, trimming the supermatrices with BMGE resulted in reduction of total and pairwise missing data (Table 1 and File S3: Figs S45–S54) and reduced deviation from SRH conditions as indicated by the reduced percentage of pairwise comparisons that failed the corresponding symmetry tests in the analysed supermatrices (Table 1, Bowker’s test: 35.02%, 24.98% and 13.97% failed tests in supermatrices E, F and J respectively, see File S3: Figs S55–S65). Additionally, progressive removal of hypervariable sites resulted in progressively increasing
completeness of the supermatrices (Ca scores: 0.846, 0.882 and 0.911 for supermatrices E, F and J respectively, Table 1, File S3: Figs S45–S54). Supermatrices D and E did not differ significantly in their statistical properties because only 12 genes from supermatrix A failed the symmetry tests in IQ-TREE and had therefore been removed before trimming (Table 1). Pairwise alignment similarity scores of taxa and indices for substitution saturation also improved with BMGE trimming (supermatrices D, E, F and J, Table 1 and File S3: Figs S66–S95), suggesting that progressively removing hypervariable sites results in progressively less saturated supermatrices (supermatrices D, E, F and J). The average λ scores within each supermatrix also showed that progressive removal of hypervariable sites resulted in supermatrices with less decay of potential historical signal (i.e., lower average λ scores, supermatrices D, E, F and J in Table 1). On the other hand, progressively more aggressive trimming of hypervariable sites resulted in progressive reduction of total parsimony-informative sites and a reduced percentage of parsimony-informative sites (from 43.00% in supermatrix E to 32.80% in supermatrix J, Table 1). In a similar fashion, Dayhoff6-recoding resulted in removal of 40.66% of parsimony-informative sites from supermatrix D (Table 1). Removal of distantly related outgroups from supermatrix F resulted in a less saturated supermatrix according to the average λ score, whereas the linear regression of p- and patristic distances under the SHOMU and SHETU models showed a reduced adjusted R² value (i.e., suggesting higher saturation) compared with the dataset before removing distantly related outgroups (i.e., supermatrix F). Comparisons of saturation statistics among datasets and models showed that conventional statistics of substitution saturation (R² and slope of the regression) are dependent on the applied model (Table 1).
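The loss of parsimony-informative sites through recoding can be made concrete with a small sketch. It assumes the commonly used Dayhoff6 grouping (AGPST, DENQ, HKR, ILMV, FWY, C) and the usual definition of a parsimony-informative site (at least two states, each present in at least two sequences); the alignment and function names are hypothetical and not taken from the study.

```python
from collections import Counter

# Commonly used Dayhoff6 groups; each amino acid is replaced by its group code.
DAYHOFF6 = {"AGPST": "0", "DENQ": "1", "HKR": "2", "ILMV": "3", "FWY": "4", "C": "5"}
RECODE = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def recode(seq):
    """Recode an amino-acid sequence; unknown symbols and gaps become '-'."""
    return "".join(RECODE.get(aa, "-") for aa in seq)


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(c for c in column if c not in "-?X")
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_informative(alignment):
    """Number of parsimony-informative columns in a list of aligned sequences."""
    return sum(is_parsimony_informative(col) for col in zip(*alignment))


# Hypothetical toy alignment (five taxa, four sites).
aln = ["AKLC", "AKLC", "GRIC", "GRVC", "SKMW"]
before = count_informative(aln)
after = count_informative([recode(s) for s in aln])
```

In this toy alignment, the two informative columns differ only by substitutions within a Dayhoff group, so recoding removes both of them. This is the same trade-off reported for supermatrix D: recoding reduces some sources of bias but also discards informative variation.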
Despite this, removal of distantly related outgroups from supermatrix F resulted in reduced proportion of failed pairwise symmetry tests (Bowker’s test: 24.98%, 20.35%, 18.85% failed tests in supermatrices F, G, H respectively). Removal of long-branched ingroup taxa (see File S1: Table S7) resulted in further decrease in potential deviations from SRH conditions and also in further reduction in the degree of saturation (Bowker’s test: 24.98%, 20.35%, 18.85% failed tests, λ scores: 0.095, 0.079, 0.074 in supermatrices F, G, H, respectively).
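For readers unfamiliar with the symmetry test behind the percentages above, the following minimal sketch computes Bowker's test statistic for one pair of aligned sequences. The function name and the toy sequences are hypothetical; the sketch returns only the statistic and its degrees of freedom and leaves the chi-square P value to a statistics library.

```python
from collections import Counter


def bowker_statistic(seq_a, seq_b, skip="-?X"):
    """Bowker's (1948) test of symmetry for a pair of aligned sequences.

    Builds the divergence matrix n[a][b] (state a in seq_a aligned with state b
    in seq_b) and sums (n_ab - n_ba)^2 / (n_ab + n_ba) over all unordered
    pairs of states with n_ab + n_ba > 0. Returns (statistic, degrees of freedom).
    """
    counts = Counter()
    for a, b in zip(seq_a, seq_b):
        if a in skip or b in skip:
            continue  # ignore gaps and ambiguous characters
        counts[(a, b)] += 1

    states = sorted({s for pair in counts for s in pair})
    stat, df = 0.0, 0
    for i, a in enumerate(states):
        for b in states[i + 1:]:
            n_ab, n_ba = counts[(a, b)], counts[(b, a)]
            if n_ab + n_ba > 0:
                stat += (n_ab - n_ba) ** 2 / (n_ab + n_ba)
                df += 1
    return stat, df


# Hypothetical toy pair: two A->G and one G->A difference.
stat, df = bowker_statistic("AAAAGGGG", "AAGGGGGA")
```

The percentage of failed tests reported for a supermatrix would then be the share of sequence pairs whose P value falls below the chosen significance level.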
Effects of removing hypervariable sites on the branch support statistics of well-established adephagan relationships We examined how removing hypervariable sites with BMGE using different degrees of stringency affected phylogenetic branch support for previously well-established clades of Adephaga and their outgroups. A clade that includes all adephagan families except Gyrinidae was strongly supported when using a moderate trimming strategy (supermatrices D, E, Fig. 6B) but UFB support for this relationship decreased with more aggressive trimming of the data under the SHETU and PMSF models (SHETU: 93% and 87% support in supermatrices F and J respectively, Fig. 6B). This pattern is also observed under the
[Figure 6 plot. Panel (A): bar chart of the percentage of branches (%) with < 100% support and with < 95% support per dataset (supermatrices D, E, F and J). Panel (B): branch support (UFB or PP) under the SHOMU, SHETU, PMSF and BSHETU models for the clades Adephaga excluding Gyrinidae, Haliplidae + Dytiscoidea, Coleoptera, Aspidytidae and Coleopterida in supermatrices D, E, F and J; support classes: 100 or 1.00, 95–99 or 0.95–0.99, 90–94 or 0.90–0.94, 80–89 or 0.80–0.89, 50–79 or 0.50–0.79, not analyzed, not inferred.]
Fig. 6. Effects of removing hypervariable sites on the branch support statistics when different degrees of trimming stringency were applied (i.e., h = 0.5 for supermatrices D and E, h = 0.4 for supermatrix F, h = 0.3 for supermatrix J). For these comparisons we only included supermatrices that are directly comparable because they resulted from direct trimming of supermatrix A. Supermatrix D resulted from trimming a slightly different version of supermatrix A from which only 12 genes had been removed. (A) Percentage of branches with UFB support lower than 100% (red bars) and lower than 95% (blue bars) under the SHETU model in analyses of selected amino-acid supermatrices. (B) Branch support (UFB or posterior probability) for specific well-established phylogenetic clades of Adephaga and outgroups (based on morphology and other molecular phylogenetic studies) depending on the dataset that was analysed. SHOMU: site-homogeneous unpartitioned, SHETU: site-heterogeneous unpartitioned, PMSF: posterior mean-site frequency profile model, BSHETU: Bayesian site-heterogeneous CAT + GTR + G4 model. The BSHETU analyses of supermatrix D did not reach convergence (maxdiff. = 1), whereas the BSHETU analyses of supermatrix F reached a maxdiff. value of 0.307 (considered here as marginally acceptable, see File S1: Table S8). The BSHETU analyses of supermatrix J also reached an acceptable convergence value (maxdiff. = 0.293).
complex BSHETU model (0.94 and 0.78 posterior probability in supermatrices F and J, respectively, Fig. 6B), whereas analyses under a mis-specified model (SHOMU) still gave strong support for this relationship (98% in supermatrix J). A similar pattern is observed for the monophyly of a clade Haliplidae + Dytiscoidea, which is inferred under all models but receives lower support in the analyses of supermatrices that were trimmed more aggressively (99% and 92% UFB support in supermatrices F and J under the SHETU model respectively, Fig. 6B). Additionally, excessive trimming of the supermatrix A resulted in very low UFB support for the monophyly of Coleoptera under the better-fitting SHETU model and even resulted in nonmonophyletic beetles in cases of model misspecification (Fig. 6B, supermatrix F). The monophyly of Aspidytidae is also less well-supported in the analyses of supermatrices that were produced by very stringent trimming (supermatrices F and J, 81% and 99% respectively under the SHETU model, Fig. 6B). Lastly, trimming of the data with progressively more stringent criteria resulted in the increase of the proportion of clades that are poorly supported under the better-fitting SHETU models (total proportion of branches with 20 000 amino-acid sites) to be considered phylogenomic datasets, yet the proportion of well-supported clades in their inferred trees is drastically reduced in comparison to less stringently trimmed datasets. These observations suggest that a balance between removing data-driven bias and phylogenetic information should be pursued in phylogenomic analyses.
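The proportion of poorly supported branches discussed above can be tallied directly from the support labels of a tree file. The following minimal sketch assumes a Newick string in which support values are written as internal node labels directly after each closing parenthesis; the example tree and the function names are hypothetical.

```python
import re


def support_values(newick):
    """Extract numeric internal-node labels (e.g. UFB values) from a Newick string."""
    return [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]


def percent_below(supports, threshold):
    """Percentage of branches whose support is lower than the threshold."""
    return 100.0 * sum(s < threshold for s in supports) / len(supports)


# Hypothetical tree with four labelled internal branches.
tree = "((A,B)100,(C,(D,E)87)94,(F,G)98);"
supports = support_values(tree)
below_100 = percent_below(supports, 100)
below_95 = percent_below(supports, 95)
```

Applying the same two thresholds to trees inferred from differently trimmed supermatrices gives the kind of comparison summarized in Fig. 6A.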
Site-heterogeneous models result in greater stability of phylogenetic relationships of Adephaga Models that account for site-specific amino-acid propensities in the supermatrices, by incorporating heterogeneity in the amino-acid equilibrium frequencies among sites, have been shown to provide a better fit to the data than site-homogeneous models (partitioned or unpartitioned, Feuda et al., 2017). Our analyses confirm these results although our model selection procedure was not performed in a Bayesian framework to include the most complex site-heterogeneous models (i.e., CAT, Lartillot & Philippe, 2004). Nevertheless, recent research shows that when the number of amino-acid equilibrium frequency categories is fixed (e.g., C60 models), the model can potentially describe heterogeneous processes in the data as well as the unconstrained CAT model (Li et al., 2021). Therefore, the use of an unconstrained number of amino-acid equilibrium frequency categories in phylogenetic analyses is not justified (Li et al., 2021). An interesting outcome of our study is that C60 site-heterogeneous models result in more stable phylogenetic relationships than unpartitioned site-homogeneous models. Specifically, we observed that irrespective of the inferred phylogenetic position of Hygrobiidae under SHOMU model, analyses under the SHETU model (and most analyses under the PMSF model) resulted in a clade Hygrobiidae + (Amphizoidae + Aspidytidae). In addition, comparison of the pairwise RF distances of inferred trees among different models suggests that SHETU models result in more stable phylogenetic relationships of Adephaga and are potentially less affected by the trimming or gene selection regimes. Due to computational limitations, we were not able to test this hypothesis for the CAT + GTR + G4 model as not all analyses reached convergence and given that we were not able to perform BSHETU analyses for all datasets. 
Nevertheless, we suggest that SHETU models may help to reduce incongruence in analyses of different amino-acid supermatrices. Lastly, we corroborate previous claims that site-homogeneous models underestimate substitution saturation (Song et al., 2016; Lozano-Fernandez et al., 2019) for a wide selection of amino-acid datasets and trimming regimes.
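The comparison of pairwise RF distances between models relies on the Mann–Whitney–Wilcoxon test with continuity correction. A minimal sketch of the normal-approximation version of this test is given below, including the usual correction for ties; the function names and example values are hypothetical, and an established statistics package should be preferred for real analyses.

```python
import math


def rank_with_ties(values):
    """Return average ranks (1-based) and the sizes of all tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney(x, y):
    """Two-sided Mann-Whitney-Wilcoxon test (normal approximation,
    continuity correction, tie-corrected variance). Returns (U, P value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, ties = rank_with_ties(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    diff = u - mu
    correction = 0.5 if diff > 0 else -0.5 if diff < 0 else 0.0
    z = (diff - correction) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p_value


# Hypothetical toy samples with complete separation.
u_stat, p_val = mann_whitney([1, 2, 3], [4, 5, 6])
```

In the setting of this study, `x` and `y` would each hold the 21 pairwise normalized RF distances obtained under one model.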
GTD analyses and locus-subsampling strategies highlight gene-tree errors in the data GTD analyses on the complete set of loci but also on the selected subsets of loci suggest that our inferred gene trees are characterized by widespread gene-tree errors. The vast majority of gene trees strongly rejected any given well-known clade in Adephaga or in their outgroup but also any alternative phylogenetic hypothesis for the controversial groupings of Adephaga. Further indirect evidence for the extent of gene-tree errors in our dataset is provided by observing the distribution of phylogenetic information among the inferred gene trees. It is frequently assumed that GTDs are mainly due to biological factors such as ILS (Linkem et al., 2016; Cloutier et al., 2019). Despite this, we consider it unlikely that ILS has affected all possible deep nodes in the phylogeny of Adephaga and their outgroups and therefore suggest that the observed GTD patterns are very likely due to gene-tree errors. This is more apparent when considering that our GTD analyses mostly show strongly rejected alternative phylogenetic hypotheses in the vast majority of relevant gene trees, rather than strongly supported discordance among alternative phylogenetic hypotheses. Our results confirm the views of other authors who suggest that biasing effects of biological GTD are possible but might be less important than other biasing factors such as model misspecification and gene-tree errors at deep evolutionary timescales (Gatesy & Springer, 2014; Bryant & Hahn, 2020). Although there is no direct evidence from our analyses that the errors affect specific branches of our inferred species tree, our observations suggest that the results of the different SCAs cannot be trusted with confidence. This is further corroborated by comparing the distances of the selected concatenation-based tree to the trees inferred with SCA using different subsets of genes. 
These comparisons show that the coalescent method is sensitive to the set of input gene trees. It is, however, encouraging that the SCA recovered many well-established relationships of Adephaga when all genes are sampled, although some of them with low support. It should be noted that the inability of the SCA to infer results that are strongly supported or congruent with the concatenation-based tree might also be due to the small number of genes in the selected gene subsets. Specifically, we observed that species trees inferred using the four smallest subsets of genes had the highest topological distance from the concatenation-based tree. In addition, a recent study showed that the ASTRAL method can infer species trees more accurately when thousands of loci are sampled (Tilic et al., 2020). In our study, we investigated whether using genes with higher phylogenetic information reduces potential gene-tree estimation error, yet the potential of increasing the accuracy of SCA by reducing systematic error remains to be explored (e.g., by using empirical site-heterogeneous models, such as C10, for inferring individual gene trees; Quang et al., 2008). Despite this, our results show that selecting genes based on the likelihood mapping criterion may be a better approach than selecting genes based on number of parsimony-informative sites or the average branch support when aiming at reducing incongruence between SCA and
concatenation-based analyses. This result is in accordance with previous research that suggests likelihood mapping may be a good a priori estimator of phylogenetic informativeness (Klopfstein et al., 2017).
Conclusions We provide a new set of DNA-hybridization baits that show great promise in recovering protein-coding exons for evolutionary genomic investigations in Adephaga. Using an extensive sampling of species, by combining hybrid-capture sequence data and transcriptomes, we are able to clarify the phylogenetic relationships of the major groups such as the sister group relationship of Gyrinidae to all other families, a clade Haliplidae + Dytiscoidea, and the sister group relationship of Trachypachidae to a clade Carabidae + Cicindelidae. Furthermore, our extensive analyses under different trimming regimes and models shed light on the evolution of the families in Dytiscoidea. We show that moderate supermatrix trimming and a better-fitting site-heterogeneous model place Hygrobiidae as sister to a clade Amphizoidae + monophyletic Aspidytidae. Excessive removal of hypervariable sites using stringent trimming strategies should be avoided as it can lead to potential reduction in phylogenetic signal and reduced resolution of phylogenetic relationships. Site-heterogeneous models fit the data better but most importantly our results show that analyses with C60 site-heterogeneous models result in increased stability of inferred phylogenetic relationships of Dytiscoidea and Adephaga in general. Hence, incongruence between analyses of different subsets of amino-acid supermatrices may be ameliorated by using C60 models. Moreover, our analyses of a carefully curated set of genes suggest that gene-tree errors are prominent in the data and possibly responsible for poorly supported or incongruent species trees in SCA or for incongruent results between concatenation and SCA. Thus, our results show that scientists should take measures to eliminate or minimize gene-tree errors before attributing GTD and phylogenomic incongruence to other factors (e.g., ILS). 
As we have shown, a promising solution for reducing incongruence between coalescent-based and concatenation-based analyses is to select informative genes based on the likelihood mapping criterion.
Supporting Information
Additional supporting information may be found online in the Supporting Information section at the end of the article. File S1. All supplementary tables (Tables S1–S12) that include: (1) an overview of transcriptomes used in the study and results of orthology assignment for transcriptomes, (2) detailed statistics of the hybrid-enrichment data including orthology assignment results, (3) results of tiling design experiments for bait design, (4) collection and voucher information for the insect samples processed in this
study, (5) Mann-Whitney-Wilcoxon tests for pairwise tests of hybrid-enrichment statistics, (6) summarized results of cross-contamination checks for the hybrid-enrichment data, (7) LB scores of species in supermatrix G, (8) convergence statistics of the Bayesian phylogenetic analyses, (9) model selection results for all amino-acid supermatrices, (10) model comparison statistics for partitioned amino-acid supermatrices based on a fixed neighbor-joining tree, (11) description of analyzed nucleotide supermatrices and their inferred statistics, (12) summarized statistics of the different subsets of genes that were used in SCA. File S2. Supplementary experimental procedures that are not described in detail in the main materials and methods section. File S3. All supplementary figures (Figs S1–S106) that include: (1) the summarized workflow for generating amino-acid supermatrices after the MARE step, (2) quartet-concordance results for the tree resulted from the SHETU-based analysis of supermatrix D, (3) box-plots of number of recovered loci per family for the hybrid-enrichment data, (4) box-plots of Ct/Ca ratio for each family of Adephaga, (5) all inferred phylogenetic trees with branch support values, (6) AliStat, SymTest and ALIGROOVE heatmaps for each amino-acid dataset, (7) all saturation plots inferred under different models, (8) all trees inferred from SCA and (9) GTD analyses for analyzed subsets of genes (LM, PI and SH subsets).
Acknowledgements MB, ON, RGB and BM acknowledge the German Research Foundation (DFG) for funding the project ‘Die Integration von Phylogenomik, Sammlungsbeständen, innovativer Morphologie und umfangreicher paläontologischer Daten - Phylogenie und Evolution der Adephaga (Coleoptera) als Modellfall’ (BA 2152/24-1, NI 1387/7-1, BE 1789/11-1, MI 649/19-1). MB acknowledges support from the SNSB-Innovative scheme. We thank Wendy Moore (University of Arizona, U.S.A.), for granting access to the transcriptome of Metrius contractus before its official release. The authors would like to thank Claudia Etzbauer, Panagiotis Provataris, Jan Philip Oeyen, Malte Petersen and Lars Podsiadlowski for helpful discussions in various steps of the analyses. The authors would also like to thank Dr. Alexandr Prokin for providing valuable background information on fossils of Adephaga. The authors declare that there are no conflicts of interest.
Author contributions MB, ON, RGB and BM conceived the initial project. MB, DRM, ON, RGB and BM contributed to funding acquisition. AV, MB, ON, RGB and BM designed the study. MB and LH collected and provided insect samples. AV and SK performed all
molecular laboratory experiments. AV and JMP performed assembly and cross-contamination checks for transcriptomes. SM and JMP performed NCBI sequence submissions. AV performed assembly, contamination checks and further processing of combined hybrid-enrichment data and transcriptomes. AV and KM performed phylogenetic analyses. CM provided bioinformatic methods for outlier detection. AV, MB and RGB drafted the manuscript with AV taking the lead. All authors contributed with comments in later versions of the manuscript.
Data availability statement The datasets supporting the conclusions of this article have been deposited in the figshare digital repository (doi: https://doi.org/10.6084/m9.figshare.14838390, ortholog set, bait nucleotide sequences, assemblies of hybrid-enrichment data, filtered multiple sequence alignments, supermatrices, treefiles and custom scripts). New hybrid-enrichment genetic data are deposited in GenBank (Bioproject ID: PRJNA645047, see also File S1: Table S2). Raw reads for the transcriptome of Chlaenius tricolor have been deposited in GenBank (SRA: SRR13633634, see also File S1: Table S1). Open Access funding enabled and organized by Projekt DEAL.
References Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723. Baca, S.M., Alexander, A., Gustafson, G.T. & Short, A.E.Z. (2017a) Ultraconserved elements show utility in phylogenetic inference of Adephaga (Coleoptera) and suggest paraphyly of ‘Hydradephaga’. Systematic Entomology, 42, 786–795. Baca, S.M., Toussaint, E.F.A., Miller, K.B. & Short, A.E.Z. (2017b) Molecular phylogeny of the aquatic beetle family Noteridae (Coleoptera: Adephaga) with an emphasis on data partitioning strategies. Molecular Phylogenetics and Evolution, 107, 282–292. Balke, M., Ribera, I. & Beutel, R.G. (2003) ASPIDYTIDAE: on the discovery of a new beetle family: detailed morphological analysis, description of a second species, and key to fossil and extant adephagan families (Coleoptera). Water Beetles of China (ed. by M.A. Jäch and L. Ji), pp. 53–66. Wien: Zoologisch-Botanische Gesellschaft & Wiener Coleopterologenverein. Balke, M., Ribera, I., Beutel, R., Viloria, A., Garcia, M. & Vogler, A.P. (2008) Systematic placement of the recently discovered beetle family Meruidae (Coleoptera: Dytiscoidea) based on molecular data. Zoologica Scripta, 37, 647–650. Ballesteros, J.A. & Sharma, P.P. (2019) A critical appraisal of the placement of Xiphosura (Chelicerata) with account of known sources of phylogenetic error. Systematic Biology, 68, 896–917. Bank, S., Sann, M., Mayer, C. et al. (2017) Transcriptome and target DNA enrichment sequence data provide new insights into the phylogeny of vespid wasps (Hymenoptera: Aculeata: Vespidae). Molecular Phylogenetics and Evolution, 116, 213–226. Belkaceme, T. (1991) Skelet und Muskulatur des Kopfes und Thorax von Noterus laevis Sturm. Ein Beitrag zur Morphologie und Phylogenie der Noteridae (Coleoptera: Adephaga). Stuttgarter Beiträge zur Naturkunde, 462, 1–94. Beutel, R.G. (1992a) Larval head structures of Omoglymmius hamatus and their implications for the relationships of Rhysodidae
(Coleoptera: Adephaga). Insect Systematics & Evolution, 23, 169–184. Beutel, R.G. (1992b) Phylogenetic analysis of thoracic structures of Carabidae (Coleoptera: Adephaga). Journal of Zoological Systematics and Evolutionary Research, 30, 53–74. Beutel, R.G. (1994) On the systematic position of Hydrotrupes palpalis Sharp (Coleoptera: Dytiscidae). Aquatic Insects, 16, 157–164. Beutel, R.G. & Roughley, R.E. (1987) On the systematic position of the genus Notomicrus Sharp (Hydradephaga, Coleoptera). Canadian Journal of Zoology, 65, 1898–1905. Beutel, R.G. & Roughley, R.E. (1988) On the systematic position of the family Gyrinidae (Coleoptera: Adephaga). Journal of Zoological Systematics and Evolutionary Research, 26, 380–400. Beutel, R.G. & Roughley, R.E. (1993) Phylogenetic analysis of Gyrinidae based on characters of the larval head (Coleoptera: Adephaga). Insect Systematics & Evolution, 24, 459–468. Beutel, R.G. & Ruhnau, S. (1990) Phylogenetic analysis of the genera of Haliplidae (Coleoptera) based on characters of adults. Aquatic Insects, 12, 1–17. Beutel, R.G., Balke, M. & Steiner, W.E. (2006) The systematic position of Meruidae (Coleoptera, Adephaga) and the phylogeny of the smaller aquatic adephagan beetle families. Cladistics, 22, 102–131. Beutel, R.G., Wang, B., Tan, J.J., Ge, S.Q., Ren, D. & Yang, X.K. (2013) On the phylogeny and evolution of Mesozoic and extant lineages of Adephaga (Coleoptera, Insecta). Cladistics, 29, 147–165. Beutel, R.G., Yan, E., Richter, A., Büsse, S., Miller, K.B., Yavorskaya, M. & Wipfler, B. (2017) The head of Heterogyrus milloti (Coleoptera: Gyrinidae) and its phylogenetic implications. Arthropod Systematics and Phylogeny, 75, 261–280. Beutel, R.G., Pohl, H., Yan, E.V. et al. (2019a) The phylogeny of Coleopterida (Hexapoda) – morphological characters and molecular phylogenies. Systematic Entomology, 44, 75–102. Beutel, R.G., Yan, E., Yavorskaya, M., Büsse, S., Gorb, S.N. & Wipfler, B. 
(2019b) On the thoracic anatomy of the Madagascan Heterogyrus milloti and the phylogeny of Gyrinidae (Coleoptera). Systematic Entomology, 44, 336–360. Beutel, R.G., Ribera, I., Fikáček, M., Vasilikopoulos, A., Misof, B. & Balke, M. (2020) The morphological evolution of the Adephaga (Coleoptera). Systematic Entomology, 45, 378–395. Bi, K., Vanderpool, D., Singhal, S., Linderoth, T., Moritz, C. & Good, J.M. (2012) Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales. BMC Genomics, 13, 403. Bolger, A.M., Lohse, M. & Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120. Bowker, A.H. (1948) A test for symmetry in contingency tables. Journal of the American Statistical Association, 43, 572–574. Bradley, R.K., Roberts, A., Smoot, M. et al. (2009) Fast statistical alignment. PLoS Computational Biology, 5, e1000392. Bragg, J.G., Potter, S., Bi, K. & Moritz, C. (2016) Exon capture phylogenomics: efficacy across scales of divergence. Molecular Ecology Resources, 16, 1059–1068. Bryant, D. & Hahn, M.W. (2020) The concatenation question. Phylogenetics in the genomic era (ed. by C. Scornavacca, F. Delsuc and N. Galtier), pp. 3.4:1–3.4:23. https://hal.archives-ouvertes.fr/hal-02535651. Burmeister, E.-G. (1976) Der Ovipositor der Hydradephaga (Coleoptera) und seine phylogenetische Bedeutung unter besonderer Berücksichtigung der Dytiscidae. Zoomorphologie, 85, 165–257. Cai, C., Tihelka, E., Pisani, D. & Donoghue, P.C.J. (2020) Data curation and modeling of compositional heterogeneity in insect phylogenomics: a case study of the phylogeny of Dytiscoidea (Coleoptera:
Adephaga). Molecular Phylogenetics and Evolution, 147, 106782. Chernomor, O., Von Haeseler, A. & Minh, B.Q. (2016) Terrace aware data structure for phylogenomic inference from supermatrices. Systematic Biology, 65, 997–1008. Cloutier, A., Sackton, T.B., Grayson, P., Clamp, M., Baker, A.J. & Edwards, S.V. (2019) Whole-genome analyses resolve the phylogeny of flightless birds (Palaeognathae) in the presence of an empirical anomaly zone. Systematic Biology, 68, 937–955. Criscuolo, A. & Gribaldo, S. (2010) BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology, 10, 210. Crotty, S.M., Minh, B.Q., Bean, N.G., Holland, B.R., Tuke, J., Jermiin, L.S. & Von Haeseler, A. (2020) GHOST: recovering historical signal from heterotachously evolved sequence alignments. Systematic Biology, 69, 249–264. Crowson, R.A. (1960) The phylogeny of Coleoptera. Annual Review of Entomology, 5, 111–134. Delsuc, F., Brinkmann, H. & Philippe, H. (2005) Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics, 6, 361–375. Désamoré, A., Laenen, B., Miller, K.B. & Bergsten, J. (2018) Early burst in body size evolution is uncoupled from species diversification in diving beetles (Dytiscidae). Molecular Ecology, 27, 979–993. Dietz, L., Dömel, J.S., Leese, F., Mahon, A.R. & Mayer, C. (2019) Phylogenomics of the longitarsal Colossendeidae: the evolutionary history of an Antarctic sea spider radiation. Molecular Phylogenetics and Evolution, 136, 206–214. Dressler, C. & Beutel, R.G. (2010) The morphology and evolution of the adult head of Adephaga (Insecta: Coleoptera). Arthropod Systematics and Phylogeny, 68, 239–287. Duran, D.P. & Gough, H.M. (2020) Validation of tiger beetles as distinct family (Coleoptera: Cicindelidae), review and reclassification of tribal relationships. Systematic Entomology, 45, 723–729. 
Evangelista, D., Simon, S., Wilson, M.M. et al. (2021) Assessing support for Blaberoidea phylogeny suggests optimal locus quality. Systematic Entomology, 46, 157–171. Faircloth, B.C., McCormack, J.E., Crawford, N.G., Harvey, M.G., Brumfield, R.T. & Glenn, T.C. (2012) Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Systematic Biology, 61, 717–726. Feuda, R., Dohrmann, M., Pett, W. et al. (2017) Improved modeling of compositional heterogeneity supports sponges as sister to all other animals. Current Biology, 27, 3864–3870. Forsyth, D.J. (1970) The structure of the defence glands of the Cicindelidae, Amphizoidae, and Hygrobiidae (Insecta: Coleoptera). Journal of Zoology, 160, 51–69. Foster, P. (2004) Modeling compositional heterogeneity. Systematic Biology, 53, 485–495. Freitas, F.V., Branstetter, M.G., Griswold, T. & Almeida, E.A.B. (2021) Partitioned gene-tree analyses and gene-based topology testing help resolve incongruence in a phylogenomic study of host-specialist bees (Apidae: Eucerinae). Molecular Biology and Evolution, 38, 1090–1100. Gatesy, J. & Springer, M.S. (2014) Phylogenetic analysis at deep timescales: unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Molecular Phylogenetics and Evolution, 80, 231–266. Gontcharov, A.A., Marin, B. & Melkonian, M. (2004) Are combined analyses better than single gene phylogenies? A case study using SSU rDNA and rbcL sequence comparisons in the Zygnematophyceae (Streptophyta). Molecular Biology and Evolution, 21, 612–624.
Gough, H.M., Duran, D.P., Kawahara, A.Y. & Toussaint, E.F.A. (2019) A comprehensive molecular phylogeny of tiger beetles (Coleoptera, Carabidae, Cicindelinae). Systematic Entomology, 44, 305–321. Gough, H.M., Allen, J.M., Toussaint, E.F.A., Storer, C.G. & Kawahara, A.Y. (2020) Transcriptomics illuminate the phylogenetic backbone of tiger beetles. Biological Journal of the Linnean Society, 129, 740–751. Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W. & Gascuel, O. (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology, 59, 307–321. Gustafson, G.T., Bergsten, J., Ranarilalatiana, T., Randriamihaja, J.H. & Miller, K.B. (2017a) The morphology and behavior of the endemic Malagasy whirligig beetle Heterogyrus milloti Legros, 1953 (Coleoptera: Gyrinidae: Heterogyrinae). The Coleopterists Bulletin, 71, 315–328. Gustafson, G.T., Prokin, A.A., Bukontaite, R., Bergsten, J. & Miller, K.B. (2017b) Tip-dated phylogeny of whirligig beetles reveals ancient lineage surviving on Madagascar. Scientific Reports, 7, 8619. Gustafson, G.T., Alexander, A., Sproul, J.S., Pflug, J.M., Maddison, D.R. & Short, A.E.Z. (2019) Ultraconserved element (UCE) probe set design: base genome and initial design parameters critical for optimization. Ecology and Evolution, 9, 6933–6948. Gustafson, G.T., Baca, S.M., Alexander, A.M. & Short, A.E.Z. (2020) Phylogenomic analysis of the beetle suborder Adephaga with comparison of tailored and generalized ultraconserved element probe performance. Systematic Entomology, 45, 552–570. Gustafson, G.T., Miller, K.B., Michat, M.C., Alarie, Y., Baca, S.M., Balke, M. & Short, A.E.Z. (2021) The enduring value of reciprocal illumination in the era of insect phylogenomics: a response to Cai et al. (2020). Systematic Entomology, 46, 473–486. Hawlitschek, O., Hendrich, L. & Balke, M. 
(2012) Molecular phylogeny of the squeak beetles, a family with disjunct Palearctic-Australian range. Molecular Phylogenetics and Evolution, 62, 550–554. Hoang, D.T., Chernomor, O., von Haeseler, A., Minh, B.Q. & Le, S.V. (2018) UFBoot2: improving the ultrafast bootstrap approximation. Molecular Biology and Evolution, 35, 518–522. Hosner, P.A., Faircloth, B.C., Glenn, T.C., Braun, E.L. & Kimball, R.T. (2016) Avoiding missing data biases in phylogenomic inference: an empirical study in the landfowl (Aves: Galliformes). Molecular Biology and Evolution, 33, 1110–1125. Huerta-Cepas, J., Serra, F. & Bork, P. (2016) ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular Biology and Evolution, 33, 1635–1638. Hunt, T., Bergsten, J., Levkanicova, Z. et al. (2007) A comprehensive phylogeny of beetles reveals the evolutionary origins of a superradiation. Science, 318, 1913–1916. Irisarri, I. & Meyer, A. (2016) The identification of the closest living relative(s) of tetrapods: phylogenomic lessons for resolving short ancient internodes. Systematic Biology, 65, 1057–1075. Jäch, M.A. & Balke, M. (2008) Global diversity of water beetles (Coleoptera) in freshwater. Hydrobiologia, 595, 419–442. Jermiin, L.S. & Misof, B. (2020) Measuring historical and compositional signals in phylogenetic data. bioRxiv. https://doi.org/10.1101/2020.01.03.894097. Kalyaanamoorthy, S., Minh, B.Q., Wong, T.K.F., von Haeseler, A. & Jermiin, L.S. (2017) ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods, 14, 587–589. Kapli, P. & Telford, M.J. (2020) Topology dependent asymmetry in systematic errors affects phylogenetic placement of Ctenophora and Xenacoelomorpha. Science Advances, 6, eabc5162.
Kapli, P., Yang, Z. & Telford, M.J. (2020) Phylogenetic tree building in the genomic age. Nature Reviews Genetics, 21, 428–444. Karin, B.R., Gamble, T., Jackman, T.R. & Vidal, N. (2020) Optimizing phylogenomics with rapidly evolving long exons: comparison with anchored hybrid enrichment and ultraconserved elements. Molecular Biology and Evolution, 37, 904–922. Kavanaugh, D.H. (1986) A systematic review of amphizoid beetles (Amphizoidae: Coleoptera) and their phylogenetic relationships to other Adephaga. Proceedings of the California Academy of Sciences, 44, 67–109. Klopfstein, S., Massingham, T. & Goldman, N. (2017) More on the best evolutionary rate for phylogenetic analysis. Systematic Biology, 66, 769–785. Kocot, K.M., Struck, T.H., Merkel, J. et al. (2017) Phylogenomics of Lophotrochozoa with consideration of systematic error. Systematic Biology, 66, 256–282. Kück, P. & Longo, G.C. (2014) FASconCAT-G: extensive functions for multiple sequence alignment preparations concerning phylogenetic studies. Frontiers in Zoology, 11, 81. Kück, P. & Struck, T.H. (2014) BaCoCa - a heuristic software tool for the parallel assessment of sequence biases in hundreds of gene and taxon partitions. Molecular Phylogenetics and Evolution, 70, 94–98. Kück, P., Meusemann, K., Dambach, J., Thormann, B., von Reumont, B.M., Wägele, J.W. & Misof, B. (2010) Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Frontiers in Zoology, 7, 10. Kück, P., Meid, S.A., Groß, C., Wägele, J.W. & Misof, B. (2014) AliGROOVE - visualization of heterogeneous sequence divergence within multiple sequence alignments and detection of inflated branch support. BMC Bioinformatics, 15, 294. Lanfear, R., Calcott, B., Kainer, D., Mayer, C. & Stamatakis, A. (2014) Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evolutionary Biology, 14, 82. Lartillot, N. & Philippe, H. 
(2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution, 21, 1095–1109. Lartillot, N., Brinkmann, H. & Philippe, H. (2007) Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7, S4. Lartillot, N., Rodrigue, N., Stubbs, D. & Richer, J. (2013) PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Systematic Biology, 62, 611–615. Laumer, C.E., Fernández, R., Lemer, S. et al. (2019) Revisiting metazoan phylogeny with genomic sampling of all phyla. Proceedings of the Royal Society B: Biological Sciences, 286, 20190831. Lawrence, J.F. & Newton, A.F. Jr (1982) Evolution and classification of beetles. Annual Review of Ecology and Systematics, 13, 261–290. Lawrence, J.F., Ślipiński, A., Seago, A.E., Thayer, M.K., Newton, A.F. & Marvaldi, A.E. (2011) Phylogeny of the Coleoptera based on morphological characters of adults and larvae. Annales Zoologici, 61, 1–217. Lemmon, E.M. & Lemmon, A.R. (2013) High-throughput genomic data in systematics and phylogenetics. Annual Review of Ecology, Evolution, and Systematics, 44, 99–121. Lemmon, A.R., Brown, J.M., Stanger-Hall, K. & Lemmon, E.M. (2009) The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Systematic Biology, 58, 130–145. Lemmon, A.R., Emme, S.A. & Lemmon, E.M. (2012) Anchored hybrid enrichment for massively high-throughput phylogenomics. Systematic Biology, 61, 727–744. Li, H. & Durbin, R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25, 1754–1760.
Li, H., Handsaker, B., Wysoker, A. et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079. Li, Y., Shen, X.-X., Evans, B., Dunn, C.W. & Rokas, A. (2021) Rooting the animal tree of life. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msab170. Linkem, C.W., Minin, V.N. & Leaché, A.D. (2016) Detecting the anomaly zone in species trees and evidence for a misleading signal in higher-level skink phylogeny (Squamata: Scincidae). Systematic Biology, 65, 465–477. López-López, A. & Vogler, A.P. (2017) The mitogenome phylogeny of Adephaga (Coleoptera). Molecular Phylogenetics and Evolution, 114, 166–174. Lorenz, W. (2020) CarabCat: global database of ground beetles (version Oct 2017) [WWW document]. Species 2000 ITIS Cat. Life, 2020-12-01. URL https://www.catalogueoflife.org/ [accessed on 12 November 2020]. Lozano-Fernandez, J., Tanner, A.R., Giacomelli, M., Carton, R., Vinther, J., Edgecombe, G.D. & Pisani, D. (2019) Increasing species sampling in chelicerate genomic-scale datasets provides support for monophyly of Acari and Arachnida. Nature Communications, 10, 2295. Maddison, D.R., Moore, W., Baker, M.D., Ellis, T.M., Ober, K.A., Cannone, J.J. & Gutell, R.R. (2009) Monophyly of terrestrial adephagan beetles as indicated by three nuclear genes (Coleoptera: Carabidae and Trachypachidae). Zoologica Scripta, 38, 43–62. Mayer, C., Sann, M., Donath, A. et al. (2016) BaitFisher: a software package for multispecies target DNA enrichment probe design. Molecular Biology and Evolution, 33, 1875–1886. Mayer, C., Dietz, L., Call, E., Kukowka, S., Martin, S. & Espeland, M. (2021) Adding leaves to the Lepidoptera tree: capturing hundreds of nuclear genes from old museum specimens. Systematic Entomology, 46, 649–671. McCormack, J.E., Hird, S.M., Zellmer, A.J., Carstens, B.C. & Brumfield, R.T. (2013) Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and Evolution, 66, 526–538.
McKenna, D.D., Wild, A.L., Kanda, K. et al. (2015) The beetle tree of life reveals that Coleoptera survived end-Permian mass extinction to diversify during the Cretaceous terrestrial revolution. Systematic Entomology, 40, 835–880. McKenna, D.D., Shin, S., Ahrens, D. et al. (2019) The evolution and genomic basis of beetle diversity. Proceedings of the National Academy of Sciences, 116, 24729–24737. Michat, M.C., Alarie, Y. & Miller, K.B. (2017) Higher-level phylogeny of diving beetles (Coleoptera: Dytiscidae) based on larval characters. Systematic Entomology, 42, 734–767. Miller, K.B. (2001) On the phylogeny of the Dytiscidae (Insecta: Coleoptera) with emphasis on the morphology of the female reproductive system. Insect Systematics & Evolution, 32, 45–92. Miller, K.B. (2009) On the systematics of Noteridae (Coleoptera: Adephaga: Hydradephaga): phylogeny, description of a new tribe, genus and species, and survey of female genital morphology. Systematics and Biodiversity, 7, 191–214. Miller, K.B. & Bergsten, J. (2012) Phylogeny and classification of whirligig beetles (Coleoptera: Gyrinidae): relaxed-clock model outperforms parsimony and time-free Bayesian analyses. Systematic Entomology, 37, 706–746. Miller, K.B. & Bergsten, J. (2014) The phylogeny and classification of predaceous diving beetles. Ecology, systematics, and the natural history of predaceous diving beetles (Coleoptera: Dytiscidae) (ed. by D. Yee), pp. 49–172. Springer, Dordrecht. Minh, B.Q., Schmidt, H.A., Chernomor, O., Schrempf, D., Woodhams, M.D., von Haeseler, A. & Lanfear, R. (2020) IQ-TREE 2: new models
and efficient methods for phylogenetic inference in the genomic era. Molecular Biology and Evolution, 37, 1530–1534. Mirarab, S., Bayzid, M.S. & Warnow, T. (2016) Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Systematic Biology, 65, 366–380. Misof, B. & Misof, K. (2009) A Monte Carlo approach successfully identifies randomness in multiple sequence alignments: a more objective means of data exclusion. Systematic Biology, 58, 21–34. Misof, B., Rickert, A.M., Buckley, T.R., Fleck, G. & Sauer, K.P. (2001) Phylogenetic signal and its decay in mitochondrial SSU and LSU rRNA gene fragments of Anisoptera. Molecular Biology and Evolution, 18, 27–37. Misof, B., Meyer, B., von Reumont, B.M., Kück, P., Misof, K. & Meusemann, K. (2013) Selecting informative subsets of sparse supermatrices increases the chance to find correct trees. BMC Bioinformatics, 14, 348. Misof, B., Liu, S., Meusemann, K. et al. (2014a) Phylogenomics resolves the timing and pattern of insect evolution. Science, 346, 763–767. Misof, B., Meusemann, K., von Reumont, B.M., Kück, P., Prohaska, S.J. & Stadler, P.F. (2014b) A priori assessment of data quality in molecular phylogenetics. Algorithms for Molecular Biology, 9, 22. Mongiardino Koch, N. & Thompson, J.R. (2021) A total-evidence dated phylogeny of Echinoidea combining phylogenomic and paleontological data. Systematic Biology, 70, 421–439. Naser-Khdour, S., Minh, B.Q., Zhang, W., Stone, E.A., Lanfear, R. & Bryant, D. (2019) The prevalence and impact of model violations in phylogenetic analysis. Genome Biology and Evolution, 11, 3341–3352. Nguyen, L.-T., Schmidt, H.A., Von Haeseler, A. & Minh, B.Q. (2015) IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32, 268–274. Nilsson, A.N. (1989) On the genus Agabetes Crotch (Coleoptera, Dytiscidae), with a new species from Iran.
Annales Entomologici Fennici, 55, 35–40. Nosenko, T., Schreiber, F., Adamska, M. et al. (2013) Deep metazoan phylogeny: when different genes tell different stories. Molecular Phylogenetics and Evolution, 67, 223–233. Pease, J.B., Brown, J.W., Walker, J.F., Hinchliff, C.E. & Smith, S.A. (2018) Quartet sampling distinguishes lack of support from conflicting support in the green plant tree of life. American Journal of Botany, 105, 385–403. Peng, Y., Leung, H.C.M., Yiu, S.M. & Chin, F.Y.L. (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28, 1420–1428. Petersen, M., Meusemann, K., Donath, A. et al. (2017) Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. BMC Bioinformatics, 18, 111. Philippe, H., de Vienne, D.M., Ranwez, V., Roure, B., Baurain, D. & Delsuc, F. (2017) Pitfalls in supermatrix phylogenomics. European Journal of Taxonomy, 283, 1–25. Phillips, M.J., Delsuc, F. & Penny, D. (2004) Genome-scale phylogeny and the detection of systematic biases. Molecular Biology and Evolution, 21, 1455–1458. Portik, D.M. & Wiens, J.J. (2021) Do alignment and trimming methods matter for phylogenomic (UCE) analyses? Systematic Biology, 70, 440–462. Quang, L.S., Gascuel, O. & Lartillot, N. (2008) Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics, 24, 2317–2323. R Core Team (2020) R: A Language and Environment for Statistical Computing.
Ranwez, V., Douzery, E.J.P., Cambon, C., Chantret, N. & Delsuc, F. (2018) MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Molecular Biology and Evolution, 35, 2582–2584. Ribera, I., Hogan, J.E. & Vogler, A.P. (2002) Phylogeny of hydradephagan water beetles inferred from 18S rRNA sequences. Molecular Phylogenetics and Evolution, 23, 43–62. Ribera, I., Vogler, A.P. & Balke, M. (2008) Phylogeny and diversification of diving beetles (Coleoptera: Dytiscidae). Cladistics, 24, 563–590. Robinson, D.F. & Foulds, L.R. (1981) Comparison of phylogenetic trees. Mathematical Biosciences, 53, 131–147. Rodríguez-Ezpeleta, N., Brinkmann, H., Roure, B., Lartillot, N., Lang, B.F. & Philippe, H. (2007) Detecting and overcoming systematic errors in genome-scale phylogenies. Systematic Biology, 56, 389–399. Sann, M., Niehuis, O., Peters, R.S. et al. (2018) Phylogenomic analysis of Apoidea sheds new light on the sister group of bees. BMC Evolutionary Biology, 18, 71. Sayyari, E., Whitfield, J.B. & Mirarab, S. (2017) Fragmentary gene sequences negatively impact gene tree and species tree reconstruction. Molecular Biology and Evolution, 34, 3279–3291. Sayyari, E., Whitfield, J.B. & Mirarab, S. (2018) DiscoVista: interpretable visualizations of gene tree discordance. Molecular Phylogenetics and Evolution, 122, 110–115. Sharma, P.P., Kaluziak, S.T., Pérez-Porro, A.R., González, V.L., Hormiga, G., Wheeler, W.C. & Giribet, G. (2014) Phylogenomic interrogation of Arachnida reveals systemic conflicts in phylogenetic signal. Molecular Biology and Evolution, 31, 2963–2984. Short, A.E.Z. (2018) Systematics of aquatic beetles (Coleoptera): current state and future directions. Systematic Entomology, 43, 1–18. Shull, V.L., Vogler, A.P., Baker, M.D., Maddison, D.R. & Hammond, P.M. 
(2001) Sequence alignment of 18S ribosomal RNA and the basal relationships of adephagan beetles: evidence for monophyly of aquatic families and the placement of Trachypachidae. Systematic Biology, 50, 945–969. Simion, P., Belkhir, K., François, C. et al. (2018) A software tool ‘CroCo’ detects pervasive cross-species contamination in next generation sequencing data. BMC Biology, 16, 28. Simion, P., Delsuc, F. & Philippe, H. (2020) To what extent current limits of phylogenomics can be overcome? Phylogenetics in the genomic era (ed. by C. Scornavacca, F. Delsuc and N. Galtier), pp. 2.1:1–2.1:34. https://hal.archives-ouvertes.fr/hal-02535366. Simmons, M.P. & Kessenich, J. (2020) Divergence and support among slightly suboptimal likelihood gene trees. Cladistics, 36, 322–340. Song, H., Sheffield, N.C., Cameron, S.L., Miller, K.B. & Whiting, M.F. (2010) When phylogenetic assumptions are violated: base compositional heterogeneity and among-site rate variation in beetle mitochondrial phylogenomics. Systematic Entomology, 35, 429–448. Song, F., Li, H., Jiang, P. et al. (2016) Capturing the phylogeny of Holometabola with mitochondrial genome data and Bayesian site-heterogeneous mixture models. Genome Biology and Evolution, 8, 1411–1426. Spangler, P.J. & Steiner, W.E. Jr (2005) A new aquatic beetle family, Meruidae, from Venezuela (Coleoptera: Adephaga). Systematic Entomology, 30, 339–357. Steenwyk, J.L., Buida, T.J. III, Li, Y., Shen, X.-X. & Rokas, A. (2020) ClipKIT: a multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS Biology, 18, e3001007. Strimmer, K. & von Haeseler, A. (1997) Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proceedings of the National Academy of Sciences, 94, 6815–6819. Stuart, A. (1955) A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412–416.
Suyama, M., Torrents, D. & Bork, P. (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research, 34, 609–612. Talavera, G. & Castresana, J. (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Systematic Biology, 56, 564–577. Tan, G., Muffato, M., Ledergerber, C., Herrero, J., Goldman, N., Gil, M. & Dessimoz, C. (2015) Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Systematic Biology, 64, 778–791. Tilic, E., Sayyari, E., Stiller, J., Mirarab, S. & Rouse, G.W. (2020) More is needed – thousands of loci are required to elucidate the relationships of the ‘flowers of the sea’ (Sabellida, Annelida). Molecular Phylogenetics and Evolution, 151, 106892. Toussaint, E.F.A., Beutel, R.G., Morinière, J. et al. (2016) Molecular phylogeny of the highly disjunct cliff water beetles from South Africa and China (Coleoptera: Aspidytidae). Zoological Journal of the Linnean Society, 176, 537–546. Toussaint, E.F.A., Seidel, M., Arriaga-Varela, E. et al. (2017) The peril of dating beetles. Systematic Entomology, 42, 1–10. Vasilikopoulos, A., Balke, M., Beutel, R.G. et al. (2019) Phylogenomics of the superfamily Dytiscoidea (Coleoptera: Adephaga) with an evaluation of phylogenetic conflict and systematic error. Molecular Phylogenetics and Evolution, 135, 270–285. Vasilikopoulos, A., Gustafson, G.T., Balke, M., Niehuis, O., Beutel, R.G. & Misof, B. (2021) Resolving the phylogenetic position of Hygrobiidae (Coleoptera: Adephaga) requires objective statistical tests and exhaustive phylogenetic methodology: a response to Cai et al. (2020). Molecular Phylogenetics and Evolution, 162, 106923.
van Vondel, B.J. (2019) Features of the metacoxal air-storage space as additional characters for reconstructing the phylogeny of Haliplidae (Coleoptera). Tijdschrift voor Entomologie, 162, 13–32. Wang, H.-C., Minh, B.Q., Susko, E. & Roger, A.J. (2018) Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Systematic Biology, 67, 216–235. Wang, H.-C., Susko, E. & Roger, A.J. (2019) The relative importance of modeling site pattern heterogeneity versus partition-wise heterotachy in phylogenomic inference. Systematic Biology, 68, 1003–1019. Whelan, S., Irisarri, I. & Burki, F. (2018) PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences. Bioinformatics, 34, 3929–3930. Wong, T.K.F., Kalyaanamoorthy, S., Meusemann, K., Yeates, D.K., Misof, B. & Jermiin, L.S. (2020) A minimum reporting standard for multiple sequence alignments. NAR Genomics and Bioinformatics, 2, lqaa024. https://doi.org/10.1093/nargab/lqaa024. Young, A.D. & Gillung, J.P. (2020) Phylogenomics – principles, opportunities and pitfalls of big-data phylogenetics. Systematic Entomology, 45, 225–247. Zhang, C., Rabiee, M., Sayyari, E. & Mirarab, S. (2018a) ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics, 19, 153. Zhang, S.-Q., Che, L.-H., Li, Y., Liang, D., Pang, H., Ślipiński, A. & Zhang, P. (2018b) Evolutionary history of Coleoptera revealed by extensive sampling of genes and species. Nature Communications, 9, 205.
Accepted 1 July 2021
Species
Agrilus planipennis
Aleochara curtula
Amphizoa insolens
Amphizoa lecontei
Anomala sp.
Aspidytes niobe
Batrachomatus nannup
Bembidion corgenoma
Calosoma frigidum
Carabus granulatus
Chlaenius tricolor
Cicindela hybrida
Clinidium baldufi
Cybister lateralimarginalis
Dineutus sp.
Elaphrus aureus
Fibla maclachlani
Gyrinus marinus
Haliplus (Haliplus) fluviatilis
Hydroscapha redfordi
Hygrobia hermanni
Hygrobia nigra
Lepicerus sp.
Liopterus haemorrhoidalis
Melanotus villosus
Micromalthus debilis
Noterus clavicornis
Oxoplatypus quadridentatus
Pogonus chalceus
Priacma serrata
Protohermes xanthodes
Pseudimares aphrodite
Puncha ratzeburgi
Sinaspidytes wrasei
Stylops melittae
Thermonectus intermedius
Trachypachus gibbsii
Xenos vesparum
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
Strepsiptera
Coleoptera
Coleoptera
Strepsiptera
Coleoptera
Raphidioptera
Neuroptera
Megaloptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Raphidioptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Coleoptera
Order
Elateridae
Dytiscidae
Lepiceridae
Hygrobiidae
Hygrobiidae
Hydroscaphidae
Haliplidae
Gyrinidae
Raphidiidae
Carabidae
Gyrinidae
Dytiscidae
Carabidae
Carabidae
Carabidae
Carabidae
Carabidae
Carabidae
Dytiscidae
Amphizoidae
Scarabaeidae
Amphizoidae
Amphizoidae
Staphylinidae
Buprestidae
Family
Carabidae
Curculionidae
Noteridae
Adephaga
Adephaga
Adephaga
Xenidae
Trachypachidae
Dytiscidae
Stylopidae
Aspidytidae
Raphidiidae
Myrmeleontidae
Corydalidae
Archostemata Cupedidae
Adephaga
Adephaga
Adephaga
Archostemata Micromalthidae
Polyphaga
Adephaga
Myxophaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Adephaga
Polyphaga
Adephaga
Adephaga
Polyphaga
Polyphaga
Suborder
McKenna et al. (2019)
Source of data, see File S2 for references
Corydalinae
Trechinae
Copelatinae
Gyrininae
Elaphrinae
Gyrininae
Cybistrinae
Rhysodinae
Cicindelinae
Harpalinae
Carabinae
Carabinae
Trechinae
Matinae
Rutelinae
McKenna et al. (2019)
McKenna et al. (2019)
Bousseau et al. (2014)
Misof et al. (2014)
Vasilikopoulos et al. (2019)
Vasilikopoulos et al. (2020)
Vasilikopoulos et al. (2020)
Vasilikopoulos et al. (2020)
Peters et al. (2014)
Van Belleghem et al. (2012)
McKenna et al. (2019)
Vasilikopoulos et al. (2019)
McKenna et al. (2019)
McKenna et al. (2019)
Vasilikopoulos et al. (2019)
Misof et al. (2014)
Vasilikopoulos et al. (2019)
Vasilikopoulos et al. (2019)
McKenna et al. (2019)
Vasilikopoulos et al. (2019)
Misof et al. (2014)
Vasilikopoulos et al. (2020)
McKenna et al. (2019)
Vasilikopoulos et al. (2019)
Vasilikopoulos et al. (2019)
McKenna et al. (2019)
McKenna et al. (2019)
this study
Peters et al. (2014)
Seppey et al. (2019), McKenna et al. (2019)
Pflug et al. (2020)
Vasilikopoulos et al. (2019)
Vasilikopoulos et al. (2019)
McKenna et al. (2019)
Vasilikopoulos et al. (2019)
Vasilikopoulos et al. (2019)
Aleocharinae
Misof et al. (2014), Pauli et al. (2016)
Agrilinae
Subfamily
SRS1130134
SRS4551443
SRS462933
not submitted, available from figshare (link in Bousseau et al., 2014)
SRS976415
SRS851413
SRS851414
SRS851415
SRS369694
SRS295765, SRS295764, SRS295760
SRS976364
SRS976372
SRS976335
SRS976331
SRS2403778
SRS462869
SRS2401245
SRS2401244
SRS976387
SRS976318
SRS462864
SRS851455
SRS976395
SRS976399
SRS976403
SRS976313
SRS976316
SRS8198743
SRS369485
SRS976409
SRS4551447
SRS2403779
SRS2401200
SRS976424
SRS2403774
SRS2428194
SRS462784
SRS976425
yes
Not performed cross-contamination check and vector cont. check performed in this study
yes
yes
yes
yes
yes
Not performed
Not performed
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
yes
Not performed cross-contamination check and vector cont. check performed in this study
yes
yes cross-contamination check and vector cont. check performed in this study
yes
yes
yes
yes cross-contamination check and vector cont. check performed in this study
yes
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
581
651
651
587
651
647
645
647
533
651
639
651
646
651
651
646
650
649
650
651
651
632
651
651
651
651
651
651
593
650
650
651
651
649
650
650
645
635
NCBI-SRS number
Cross-contaminations and vector contaminations previously performed
Used for bait design
No. of orthologous hits
186,208
239,692
232,631
186,413
237,248
234,676
230,237
235,991
84,831
234,803
232,576
237,432
227,128
246,187
217,978
222,557
233,754
240,424
232,794
239,552
227,318
209,371
241,439
217,974
239,346
209,510
239,549
241,593
135,447
224,402
233,116
230,926
239,432
239,180
230,613
237,776
233,241
217,257
4
0
0
6
20
4
4
1
0
0
3
3
9
23
0
7
5
7
2
27
6
21
21
26
6
9
7
0
0
1
0
0
6
5
0
0
1
1
Total no. of amino-acids
No. of Xs
4
2
1
4
2
5
5
4
13
9
7
6
6
3
10
10
8
1
3
6
3
5
7
2
5
6
5
3
4
6
8
3
4
2
5
4
3
5
No. of stop signs
381
431
422
372
429
419
416
428
179
420
430
433
418
445
388
407
432
438
426
438
407
389
438
395
435
372
436
438
269
406
420
412
438
439
431
436
423
399
N50 of lengths
320
368
357
317
364
362
356
364
159
360
363
364
351
378
334
344
359
370
358
367
349
331
370
334
367
321
367
371
228
345
358
354
367
368
354
365
361
342
Average length
273.0
323.0
316.0
261.0
317.0
315.0
307.0
312.0
155.0
314.0
313.0
314.0
310.0
328.0
297.0
305.0
317.5
322.0
315.0
324.0
315.0
290.0
322.0
308.0
320.0
288.0
322.0
322.0
208.0
304.0
304.5
317.0
320.0
318.0
309.5
318.5
318.0
306.0
Median length
1,849
2,366
2,365
2,362
2,366
2,372
2,052
2,370
553
2,069
2,370
2,366
1,061
2,132
1,092
1,222
1,448
2,366
1,419
2,366
1,223
1,220
2,366
1,387
2,366
1,094
2,272
2,365
790
1,267
2,365
1,291
1,678
2,154
1,153
1,915
1,217
1,222
Max. length
Table S1: Overview of the transcriptomes that were used for bait design and for downstream phylogenetic reconstructions. Statistics of the orthology assignment with Orthograph, based on the 651 genes for which baits were originally designed, are also provided (amino-acid output of Orthograph). Note that the initial bait design also included the transcriptome of Metrius contractus (kindly provided by Wendy Moore), which was not included in the downstream orthology assignment or phylogenetic analyses.
21
77
52
26
61
62
50
59
28
73
16
66
55
57
58
65
74
76
75
76
85
63
67
66
71
44
65
77
33
74
73
49
73
69
47
60
77
61
Min. length
Noteridae
Dytiscidae
Dytiscidae
Carabidae
Dytiscidae
Dytiscidae
Carabidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Carabidae
Carabidae
Carabidae
Dytiscidae
Haliplidae
Haliplidae
Haliplidae
Dytiscidae
Dytiscidae
Noteridae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Dytiscidae
Carabidae
Dytiscidae
Carabidae
Gyrinidae
Cicindelidae
Dytiscidae
Dytiscidae
14 Canthydrus sp.
15 Caperhantus cicurius
16 Celina imitatrix
17 Clivina sp.
18 Copelatus caelatipennis
19 Coptotomus sp.
20 Cychrus sp.
21 Derovatellus peruanus
22 Dytiscus marginalis
23 Eretes griseus
24 Exocelina sp.
25 Galerita sp.
26 Glyptolenus sp.
27 Goniotropis sp.
28 Graptodytes pictus
29 Haliplus (Haliplidius) confinis
30 Haliplus (Liaphlus) laminatus
31 Haliplus (Neohaliplus) lineatocollis
32 Hydaticus pacificus
33 Hyderodes shuckardi
34 Hydrocanthus oblongus
35 Hydrodytes opalinus
36 Hydroglyphus geminus
37 Hydroporus erythrocephalus
38 Hydrotrupes palpalis
39 Hydrovatus fasciatus
40 Hygrotus (Leptolambus) impressopunctatus
41 Hyphydrus ovatus
42 Ilybius fenestratus
43 Laccodytes sp.
44 Laccophilus poecilus
45 Laccornis oblongus
46 Lachnophorini sp.
47 Lancetes sp.
48 Loricera pilicornis
49 Macrogyrus sp.
50 Manticora latipennis
51 Matus sp.
52 Megadytes sp.
Carabidae
9 Broscus cephalotes
Carabidae
Dytiscidae
8 Bidessus unistriatus
13 Calophaena bicincta
Gyrinidae
7 Andogyrus sp.
Carabidae
Cicindelidae
6 Amblycheila cylindriformis
12 Calathus sp.
Haliplidae
5 Algophilus lathridioides
Dytiscidae
Dytiscidae
4 Agabus undulatus
11 Bunites distigma
Dytiscidae
3 Agabetes acuductus
Haliplidae
Carabidae
2 Adelotopus paroensis
10 Brychius elevatus
Dytiscidae
Family
1 Acilius canaliculatus
Species_name
Cybistrinae
Matinae
Gyrininae
Loricerinae
Lancetinae
Harpalinae
Hydroporinae
Laccophilinae
Laccophilinae
Agabinae
Hydroporinae
Hydroporinae
Hydroporinae
Agabinae
Hydroporinae
Hydroporinae
Hydroporinae
Noterinae
Dytiscinae
Dytiscinae
Haliplinae
Haliplinae
Haliplinae
Hydroporinae
Paussinae
Harpalinae
Harpalinae
Copelatinae
Dytiscinae
Dytiscinae
Hydroporinae
Carabinae
Coptotominae
Copelatinae
Scaritinae
Celinae
Colymbetinae
Noterinae
Harpalinae
Harpalinae
Colymbetinae
Haliplinae
Broscinae
Hydroporinae
Gyrininae
Haliplinae
Agabinae
Laccophilinae
Harpalinae
Dytiscinae
Subfamily
Cybistrini
Matini
Manticorini
Dineutini
Loricerini
Lancetini
Lebiitae, Lachnophorini
Laccornini
Laccophilini
Laccophilini
Agabini
Hyphydrini
Hygrotini
Hydrovatini
Hydrotrupini
Hydroporini
Bidessini
Hydrodytinae
Noterini
Hyderodini
Hydaticini
Neohaliplus
Liaphlus
Haliplidius
Hydroporini
Ozaenini
Platynitae, Platynini
Dryptitae, Galeritini
Copelatini
Eretini
Dytiscini
Vatellini
Cychrini
Coptotomini
Copelatini
Clivinini
Methlini
Colymbetini
Noterini
Lebiitae, Calophaenini
Pterostichitae, Sphodrini
Colymbetini
Broscini
Bidessini
Dineutini
Manticorini
Agabini
Agabetini
Pseudomorphini
Aciliini
Supertribe, Tribe or equivalent
SRR12339100
SRR12339101
SRR12339103
SRR12339104
SRR12339105
SRR12339106
SRR12339107
SRR12339108
SRR12339109
SRR12339110
SRR12339111
SRR12339112
SRR12339114
SRR12339115
SRR12339116
SRR12339117
SRR12339118
SRR12339119
SRR12339120
SRR12339121
SRR12339122
SRR12339123
SRR12339125
SRR12339126
SRR12339127
SRR12339128
SRR12339129
SRR12339130
SRR12339131
SRR12339132
SRR12339133
SRR12339134
SRR12339136
SRR12339137
SRR12339138
SRR12339139
SRR12339140
SRR12339141
SRR12339142
SRR12339143
SRR12339144
SRR12339050
SRR12339058
SRR12339069
SRR12339080
SRR12339091
SRR12339102
SRR12339113
SRR12339124
SRR12339135
SRR12339051
SRR12339052
NCBI-SRA
SAMN15489395
SAMN15489380
SAMN15489308
SAMN15489302
SAMN15489323
SAMN15489379
SAMN15489373
SAMN15489334
SAMN15489358
SAMN15489389
SAMN15489357
SAMN15489356
SAMN15489355
SAMN15489342
SAMN15489312
SAMN15489354
SAMN15489339
SAMN15489372
SAMN15489315
SAMN15489329
SAMN15489387
SAMN15489365
SAMN15489364
SAMN15489363
SAMN15489362
SAMN15489371
SAMN15489327
SAMN15489326
SAMN15489301
SAMN15489383
SAMN15489353
SAMN15489370
SAMN15489322
SAMN15489319
SAMN15489314
SAMN15489369
SAMN15489313
SAMN15489311
SAMN15489386
SAMN15489324
SAMN15489341
SAMN15489309
SAMN15489361
SAMN15489340
SAMN15489352
SAMN15489382
SAMN15489321
SAMN15489310
SAMN15489343
SAMN15489318
SAMN15489337
SAMN15489351
1,475,199
1,539,138
730,929
1,728,525
1,701,165
1,213,049
1,308,631
1,463,810
1,906,079
2,424,354
1,600,231
1,303,854
1,081,345
814,217
1,316,434
1,078,618
1,816,009
1,612,204
1,820,881
1,482,895
1,823,398
1,888,669
2,730,884
2,709,057
1,268,679
2,551,143
1,110,028
2,222,895
1,773,505
1,695,897
1,955,387
1,518,126
1,840,940
1,610,693
1,565,817
1,695,940
1,464,339
1,433,542
2,770,449
1,786,710
2,015,071
1,507,272
2,190,501
1,795,325
1,338,798
1,493,247
1,445,537
1,918,412
1,422,789
1,602,973
1,562,994
1,576,264
547.673
199.939
181.105
232.014
296.420
116.655
42.503
304.470
691.159
294.752
96.605
614.181
453.718
1,260,612
1,341,092
654,267
1,443,307
1,512,258
979,646
1,190,884
1,204,671
1,730,790
2,053,239
1,263,608
1,083,435
911,425
627,714
1,176,182
980,897
1,619,428
1,461,245
1,554,289
1,207,930
1,571,295
1,541,106
2,457,971
2,366,957
1,100,057
2,234,816
986,807
1,943,362
1,579,672
1,465,209
1,764,884
1,222,816
1,587,342
1,213,822
1,371,847
1,452,885
1,282,909
1,237,392
64.472
440.181
7.645
151.402
928.646
594.908
245.433
96.602
463.023
909.407
133.371
120.035
224.192
61.574
126.149
272.325
136.549
281.174
621.068
103.816
283.611
323.927
508.727
705.845
415.541
305.124
483.622
779.495
161.499
654.268
127.983
173.729
213.863
120.610
420.437
213.503
66.323
428.929
48.906
31.732
11.512
32.735
22.156
25.687
24.311
21.861
8.924
12.626
28.583
33.098
22.890
22.918
16.670
26.488
13.976
17.919
14.837
22.354
21.739
17.450
11.930
17.401
27.600
14.784
26.219
10.930
29.478
21.418
26.021
41.856
11.950
15.412
26.871
21.223
20.697
14.157
9.442
13.951
14.149
26.160
12.204
13.451
22.998
49.334
17.061
9.091
21.196
17.215
23.550
33.657
48.553
27.614
11.515
31.942
18.803
19.574
22.599
21.363
8.308
10.749
27.722
32.373
21.920
22.438
15.986
24.792
13.687
17.017
13.147
21.608
20.463
15.604
10.128
13.761
25.158
14.161
21.942
9.413
28.393
18.585
25.333
40.860
11.443
14.756
24.937
20.227
20.477
12.213
8.027
11.960
13.580
25.184
11.653
12.472
22.405
49.449
16.018
7.787
19.626
16.877
20.008
30.009
Average per base coverage depth of target regions (Ct); Average per base coverage depth of assembly (Ca); Average per base coverage depth of non-target regions (Cn)
2,491,285 1860.950
1,633,195
1,830,923
1,298,746
1,743,167
1,477,295
1,102,301
662,713
1,261,902
1,615,555
1,272,001
1,403,858
1,324,609
1,314,765
No. of sequenced reads (pairs); No. of clean reads (pairs); NCBI-Biosample
1.318
13.872
0.664
4.625
41.913
23.160
10.095
4.419
51.886
72.026
4.666
3.627
9.795
2.687
7.568
10.281
9.770
15.691
41.859
4.644
13.046
18.563
42.644
40.563
15.056
20.639
18.445
71.316
5.479
30.547
4.919
4.151
17.897
7.826
15.647
10.060
3.204
30.297
197.090
39.257
14.131
6.923
19.011
22.037
5.072
0.862
17.846
76.023
13.906
5.612
26.080
13.481
Ct / Ca
1.328
15.941
0.664
4.740
49.388
30.394
10.860
4.522
55.734
84.605
4.811
3.708
10.228
2.744
7.891
10.984
9.977
16.523
47.239
4.804
13.860
20.759
50.231
51.292
16.517
21.546
22.041
82.812
5.688
35.204
5.052
4.252
18.689
8.174
16.860
10.556
3.239
35.122
231.825
45.793
14.723
7.191
19.910
23.767
5.207
0.860
19.007
88.755
15.019
5.724
30.697
15.119
Ct / Cn
605
569
177
479
550
531
545
471
577
576
516
471
508
430
542
532
508
527
556
533
567
621
642
639
514
602
542
607
556
547
548
450
565
484
565
534
458
561
605
571
573
523
574
557
446
273
523
596
557
502
554
564
92.93%
87.40%
27.19%
73.58%
84.49%
81.57%
83.72%
72.35%
88.63%
88.48%
79.26%
72.35%
78.03%
66.05%
83.26%
81.72%
78.03%
80.95%
85.41%
81.87%
87.10%
95.39%
98.62%
98.16%
78.96%
92.47%
83.26%
93.24%
85.41%
84.02%
84.18%
69.12%
86.79%
74.35%
86.79%
82.03%
70.35%
86.18%
92.93%
87.71%
88.02%
80.34%
88.17%
85.56%
68.51%
41.94%
80.34%
91.55%
85.56%
77.11%
85.10%
86.64%
76,250
76,463
32,834
65,359
86,705
67,213
80,523
60,948
96,784
97,792
64,048
61,163
59,105
50,466
72,792
69,853
63,647
66,604
83,026
66,183
73,760
86,546
102,507
98,929
70,052
92,050
80,285
113,958
73,651
84,167
75,020
56,428
77,784
60,471
69,287
73,222
56,192
87,315
124,538
93,878
85,635
70,583
86,422
81,606
53,028
27,225
70,874
94,576
78,933
63,282
78,779
71,865
No. of orthologous hits (genes recovered); Percentage (%) of genes recovered; Total No. of amino acids
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
43
46
5
31
25
34
26
28
39
35
32
25
25
11
31
21
23
29
26
30
25
33
30
42
24
41
13
36
28
22
50
26
24
29
16
16
16
39
37
39
38
23
38
34
23
12
23
36
23
26
31
28
No. of Xs; No. of stop symbols
Table S2: Overview of species used for target sequence capture of protein-coding exons with the newly designed bait set. NCBI accession numbers for the raw reads are given for each species. The total number of sequenced reads, the number of reads after quality trimming and statistics of the enrichment efficiency are also provided (see description in the main text), as are statistics of the orthology assignment of the sequenced genomic libraries with Orthograph (amino-acid output of Orthograph). Note: for the species Hydrocanthus oblongus two different samples were processed and sequenced, but only one of them was included in this study. All data are available on NCBI under the BioProject number PRJNA645047.
143
155
261
153
183
142
168
142
200
195
142
143
127
132
156
146
138
143
176
139
144
161
187
181
152
175
174
222
149
183
157
139
155
142
132
156
135
181
247
194
171
153
171
167
128
107
152
189
165
143
160
143
N50 of protein length
126
134
185
136
157
126
147
129
167
169
124
129
116
117
134
131
125
126
149
124
130
139
159
154
136
152
148
187
132
153
136
125
137
124
122
137
122
155
205
164
149
134
150
146
118
99
135
158
141
126
142
127
Average length
117
122
122
123
139.5
117
138
115
147
153.5
114.5
119
109
107.5
120
118
115.5
113
136
111
113
123
140
135
123
138.5
132.5
169
117
129
121
116
123
114
112
122
112
137
181
144
137
119
128
129
114
91
117
142.5
122
118
127
112
Median length
633
637
1136
620
643
605
643
625
743
643
568
598
633
524
643
636
493
409
540
547
676
630
608
640
626
643
643
877
643
643
639
533
584
631
505
523
557
999
779
848
664
618
643
643
407
324
626
681
643
569
571
643
Max. length
28
33
11
38
39
32
40
32
31
38
7
32
42
28
31
32
27
32
19
33
33
36
6
35
24
36
23
33
35
44
23
32
35
38
36
31
33
28
11
35
32
38
39
32
33
33
10
31
15
40
35
33
Min. length
Dytiscidae
Carabidae
Noteridae
Carabidae
Carabidae
Dytiscidae
Noteridae
Dytiscidae
Carabidae
Carabidae
Noteridae
Carabidae
Carabidae
Carabidae
Dytiscidae
Carabidae
Gyrinidae
Haliplidae
Haliplidae
Carabidae
Dytiscidae
Carabidae
Dytiscidae
Dytiscidae
Carabidae
Dytiscidae
Gyrinidae
Noteridae
Cicindelidae
Carabidae
Carabidae
Carabidae
Carabidae
Dytiscidae
Dytiscidae
Dytiscidae
Noteridae
Noteridae
Cicindelidae
Cicindelidae
Dytiscidae
Dytiscidae
Cicindelidae
53 Meridiorhantus calidus
54 Mesacanthina cribata
55 Mesonoterus laevicollis
56 Morion sp.
57 Nebria picicornis
58 Necterosoma penicillatum
59 Neohydrocoptus sp.
60 Neptosternus brevior
61 Notiobia sp.
62 Notiophilus sp.
63 Notomicrus sp.
64 Odacantha melanura
65 Omophron sp.
66 Ozaena sp.
67 Pachydrus sp.
68 Panagaeus bipustulatus
69 Patrus sp.
70 Peltodytes (Peltodytes) caesus
71 Peltodytes (Neopeltodytes) oppositus
72 Pheropsophus sp.
73 Philaccolilus sp.
74 Pinacodera sp.
75 Platambus maculatus
76 Platynectes sp.
77 Platynus sp.
78 Porhydrus lineatus
79 Porrorhynchus sp.
80 Suphisellus (Pronoterus) semipunctatus
81 Pseudoxicheila tarsalis
82 Pterostichus burmeisteri
83 Scarites subterraneus
84 Siagona sp.
85 Sphallomorpha suturalis
86 Sternhydrus atratus
87 Sternhydrus scutellaris
88 Stictotarsus duodecimpustulatus
89 Suphisellus gibbulus
90 Suphisellus tenuicornis
91 Tetracha carolina
92 Therates labiatus
93 Thermonectus basillaris
94 Thermonectus margineguttatus
95 Tricondyla aptera
Dytiscinae
Dytiscinae
Noterinae
Noterinae
Hydroporinae
Cybistrinae
Cybistrinae
Harpalinae
Siagoninae
Scaritinae
Harpalinae
Noterinae
Gyrininae
Hydroporinae
Harpalinae
Agabinae
Agabinae
Harpalinae
Laccophilinae
Brachininae
Haliplinae
Haliplinae
Gyrininae
Harpalinae
Hydroporinae
Paussinae
Omophroninae
Harpalinae
Noterinae
Nebriinae
Harpalinae
Laccophilinae
Noterinae
Hydroporinae
Nebriinae
Harpalinae
Noterinae
Colymbetinae
Collyridini, Tricondylina
Aciliini
Aciliini
Cicindelini, Theratina
Megacephalini
Noterini
Noterini
Hydroporini
Cybistrini
Cybistrini
Pseudomorphini
Siagonini
Scaritini
Pterostichitae, Pterostichini
Oxycheilini, Oxychilina
Pronoterini
Dineutini
Hydroporini
Platynitae, Platynini
Hydrotrupini
Agabini
Lebiitae, Lebiini
Laccophilini
Brachinini
Orectochilini
Chlaeniitae, Panagaeini
Hyphydrini
Ozaenini
Omophronini
Lebiitae, Odacanthini
Notomicrini
Nebriini
Harpalitae, Harpalini
Laccophilini
Neohydrocoptini
Hydroporini
Nebriini
Pterostichitae, Morionini
Noterini
Cicindelini, Prothymina
Colymbetini
SRR12339053
SRR12339054
SRR12339055
SRR12339056
SRR12339057
SRR12339059
SRR12339060
SRR12339061
SRR12339062
SRR12339063
SRR12339064
SRR12339065
SRR12339066
SRR12339067
SRR12339068
SRR12339070
SRR12339071
SRR12339072
SRR12339073
SRR12339074
SRR12339075
SRR12339076
SRR12339077
SRR12339078
SRR12339081
SRR12339079
SRR12339082
SRR12339083
SRR12339084
SRR12339085
SRR12339086
SRR12339087
SRR12339088
SRR12339089
SRR12339090
SRR12339092
SRR12339093
SRR12339094
SRR12339095
SRR12339096
SRR12339097
SRR12339098
SRR12339099
SAMN15489305
SAMN15489391
SAMN15489320
SAMN15489304
SAMN15489346
SAMN15489393
SAMN15489316
SAMN15489368
SAMN15489331
SAMN15489330
SAMN15489336
SAMN15489306
SAMN15489345
SAMN15489338
SAMN15489325
SAMN15489390
SAMN15489385
SAMN15489360
SAMN15489317
SAMN15489303
SAMN15489367
SAMN15489344
SAMN15489381
SAMN15489378
SAMN15489359
SAMN15489347
SAMN15489384
SAMN15489335
SAMN15489377
SAMN15489376
SAMN15489348
SAMN15489333
SAMN15489349
SAMN15489388
SAMN15489328
SAMN15489307
SAMN15489350
SAMN15489332
SAMN15489366
SAMN15489375
SAMN15489392
SAMN15489374
SAMN15489394
1,650,429
1,846,851
2,491,394
1,576,403
1,433,030
2,049,206
1,952,185
1,659,600
1,810,908
1,493,014
1,445,528
1,450,414
1,501,382
1,903,256
1,388,885
1,789,803
1,604,396
1,351,638
1,737,438
1,472,236
2,106,958
1,699,972
2,397,075
1,548,712
1,886,465
1,340,576
1,608,735
2,043,803
804,152
2,335,528
2,971,071
1,024,622
1,751,584
1,718,030
1,584,000
1,558,597
1,474,981
1,311,993
1,210,450
970,629
1,772,500
1,444,865
1,845,516
276.182
817.633
245.323
285.128
238.113
266.421
130.661
397.181
132.751
263.525
519.492
172.154
1,411,203
1,591,903
2,181,339
1,438,276
1,227,851
1,831,814
1,634,373
1,462,934
1,668,154
1,280,011
980,824
1,254,299
1,257,010
1,628,733
1,280,020
1,417,133
1,239,356
1,204,643
1,499,585
1,288,752
1,840,616
1,517,882
2,181,057
1,333,876
1,715,827
1,128,334
1,388,097
1,643,688
632,912
2,085,232
63.331
143.850
788.522
111.868
404.714
547.022
206.998
91.818
245.587
276.288
492.590
387.087
290.951
137.420
236.504
271.938
149.935
146.408
876.327
198.243
774.915
648.978
603.801
440.383
963.459
549.174
630.939
390.336
47.151
602.793
2,614,000 1764.372
887,304
1,493,215
1,541,055
1,453,974
1,330,957
1,310,717
1,153,769
1,020,650
850,325
1,242,472
1,238,473
1,543,715
24.854
28.384
42.701
14.366
28.388
8.507
6.716
31.056
18.020
36.526
25.716
21.102
10.072
17.317
22.052
10.705
23.444
29.140
20.343
23.199
22.239
28.403
10.231
25.813
21.159
27.371
16.632
12.106
8.350
20.847
16.940
17.435
9.384
17.747
13.302
14.442
15.435
12.084
22.067
21.949
12.514
37.933
17.247
24.648
26.906
33.861
13.988
26.255
7.808
6.553
30.518
16.779
33.188
22.103
18.678
9.466
16.842
20.906
10.149
22.632
28.346
16.072
21.818
18.907
24.639
9.424
23.159
18.327
24.268
14.843
10.651
8.133
19.024
14.151
15.829
8.682
17.073
12.458
13.781
14.510
11.746
20.008
20.975
11.552
29.676
16.661
9.982
2.548
5.068
18.466
7.787
14.257
64.306
30.822
2.956
13.628
7.564
19.155
18.344
28.887
7.936
10.725
25.403
6.395
5.024
43.078
8.545
34.844
22.849
59.017
17.060
45.535
20.064
37.935
32.243
5.647
28.915
104.156
15.841
87.134
13.824
21.435
16.488
17.261
10.812
17.999
6.048
21.058
13.695
2.569
5.346
23.287
7.997
15.415
70.062
31.590
3.009
14.637
8.325
22.287
20.724
30.736
8.159
11.313
26.794
6.625
5.165
54.526
9.086
40.986
26.339
64.068
19.016
52.570
22.630
42.507
36.648
5.797
31.686
124.685
17.448
94.178
14.369
22.886
17.278
18.361
11.124
19.851
6.329
22.812
17.505
10.333
470
635
638
534
512
566
445
489
542
498
514
535
539
553
545
543
515
485
586
548
577
538
544
545
558
515
523
579
412
594
604
534
588
510
545
516
529
494
535
517
514
602
540
72.20%
97.54%
98.00%
82.03%
78.65%
86.94%
68.36%
75.12%
83.26%
76.50%
78.96%
82.18%
82.80%
84.95%
83.72%
83.41%
79.11%
74.50%
90.02%
84.18%
88.63%
82.64%
83.56%
83.72%
85.71%
79.11%
80.34%
88.94%
63.29%
91.24%
92.78%
82.03%
90.32%
78.34%
83.72%
79.26%
81.26%
75.88%
82.18%
79.42%
78.96%
92.47%
82.95%
56,471
84,862
87,717
68,553
70,898
99,222
68,549
64,839
75,037
62,376
68,412
75,824
70,909
84,413
77,767
87,649
65,881
61,681
92,385
76,705
84,302
80,971
82,044
84,753
86,425
76,167
76,739
86,581
51,670
90,638
103,319
71,634
119,546
77,089
84,056
73,689
84,795
72,652
80,874
75,032
81,750
78,180
72,096
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
19
40
34
25
20
25
17
31
89
32
22
35
20
22
22
32
17
20
38
35
24
27
24
31
38
19
27
45
22
41
40
18
26
29
36
23
33
25
28
26
16
27
36
134
154
159
142
165
204
174
151
163
143
149
165
148
172
166
189
141
139
192
160
172
178
177
177
178
170
166
170
134
177
195
152
239
172
180
161
184
166
176
169
184
151
150
120
133
137
128
138
175
154
132
138
125
133
141
131
152
142
161
127
127
157
139
146
150
150
155
154
147
146
149
125
152
171
134
203
151
154
142
160
147
151
145
159
129
133
108
117
118
117
117
156
135
119
120
113
119.5
127
120
137
130
143
116
115
139
124
127
135
130
139
135
133
129
138
114
136.5
148.5
118
183.5
139
140
130
145
123
133
125
142.5
116
120
642
643
801
643
643
666
720
643
643
643
681
643
531
629
643
575
608
643
470
617
802
643
639
805
787
448
619
642
550
786
1045
533
941
735
643
591
684
761
643
670
751
643
643
40
37
32
36
39
31
31
28
29
34
30
33
32
11
8
26
44
35
33
30
30
32
29
44
13
40
27
29
12
14
30
32
31
24
36
32
9
6
30
38
17
11
32
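The enrichment and orthology statistics in Table S2 can be recomputed from the raw columns. The following is a minimal sketch, not the authors' pipeline: the denominator of 651 target genes is an assumption (it is the largest "No. of genes after filtering" value in Table S3 and reproduces the reported percentages, e.g. 605 hits give 92.93%), and the function names and the coverage values passed to `enrichment_ratios` are hypothetical.

```python
from typing import Tuple

# Assumption: 651 target genes; this value reproduces the percentages in Table S2.
TARGET_GENES = 651


def percent_genes_recovered(n_hits: int, n_targets: int = TARGET_GENES) -> float:
    """Percentage of target genes with an orthologous hit, rounded to two decimals."""
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")
    return round(100.0 * n_hits / n_targets, 2)


def enrichment_ratios(ct: float, ca: float, cn: float) -> Tuple[float, float]:
    """Return (Ct / Ca, Ct / Cn) from the three average per-base coverage depths.

    ct: coverage depth of target regions
    ca: coverage depth of the whole assembly
    cn: coverage depth of non-target regions
    """
    if ca <= 0 or cn <= 0:
        raise ValueError("coverage depths must be positive")
    return ct / ca, ct / cn


if __name__ == "__main__":
    # Gene counts taken from the "No. of orthologous hits" column.
    for hits in (605, 569, 177):
        print(hits, percent_genes_recovered(hits))
    # Hypothetical coverage depths, for illustration only.
    print(enrichment_ratios(ct=40.0, ca=10.0, cn=8.0))
```

Ratios above 1 indicate that target regions were sequenced more deeply than the assembly as a whole (Ct / Ca) or than non-target regions (Ct / Cn).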
50
40
30
40
20
20
2
3
4
5
6
7
120
120
120
120
120
120
120
Bait length (bp)
1
2
2
3
3
3
4
Number of baits per bait region
120
140
160
180
200
220
240
0.15
0.15
0.15
0.15
0.15
0.15
0.15
Length of bait regions (bp); Cluster threshold
725
673
614
552
487
426
381
No. of genes
1,210
1,023
886
742
616
519
453
165,330
141,002
120,376
103,233
89,084
77,256
67,199
651
601
531
460
389
327
280
1040*
673
740
605
479
388
325
122,037
98,219
78,999
63,093
50,849
41,279
34,132
Results of BaitFilter analyses after removing baits with hits to multiple genomic regions in the genome of Bembidion sp. nr. transversale
No. of CDS features; No. of bait regions; No. of genes after filtering; No. of CDS features after filtering; No. of bait regions after filtering
Results of BaitFisher for each of the different tiling design experiments
*Note: Because the bait kit we used allowed a maximum size of only 6 Mbp for the baits, not all exons from experiment no. 7 were included in the final bait set (i.e. 923 exons out of a potential total of 1040 exons were targeted).
40
Bait offset
1
No. of tiling design experiment
Parameters used for different tiling design experiments with BaitFisher
Table S3: Summarized results from the different tiling design experiments with BaitFisher. The same set of species was specified in all experiments (see File S2).
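The tiling parameters in Table S3 appear to be linked by a simple relation: the length of a bait region equals the bait length plus the offset times the number of additional baits. The sketch below checks this relation against the listed values. The pairing of each offset with an experiment number is an assumption recovered from the flattened table, and the function name is hypothetical.

```python
BAIT_LENGTH = 120  # bp, listed for every experiment in Table S3


def bait_region_length(n_baits: int, offset: int, bait_length: int = BAIT_LENGTH) -> int:
    """Length (bp) spanned by n_baits baits tiled with a fixed offset between starts."""
    if n_baits < 1:
        raise ValueError("a bait region needs at least one bait")
    return bait_length + (n_baits - 1) * offset


# (experiment no., baits per bait region, bait offset in bp).
# Assumption: offsets are paired with experiments as inferred from the table.
EXPERIMENTS = [
    (1, 1, 20),
    (2, 2, 20),
    (3, 2, 40),
    (4, 3, 30),
    (5, 3, 40),
    (6, 3, 50),
    (7, 4, 40),
]

if __name__ == "__main__":
    for experiment, n_baits, offset in EXPERIMENTS:
        print(experiment, bait_region_length(n_baits, offset))
```

Under these assumptions the computed region lengths run from 120 bp (experiment 1) to 240 bp (experiment 7) in steps of 20 bp, matching the "Length of bait regions (bp)" column.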
2013
2017
2017
2011
2013
2017
2017
2007
2017
2018
2013
2017
2017
2017
2017
2017
2006
2017
2013
2017
2017
2010
2017
2017
2017
2017
2017
Clivina sp.
Copelatus caelatipennis
Coptotomus sp.
Cychrus sp.
Derovatellus peruanus
Dytiscus marginalis
Eretes griseus
Exocelina sp.
Galerita sp.
Glyptolenus sp.
Goniotropis sp.
Graptodytes pictus
Haliplus (Haliplidius) confinis
Haliplus (Liaphlus) laminatus
Haliplus (Neohaliplus) lineatocollis
Hydaticus pacificus
Hyderodes shuckardi
Hydrocanthus oblongus
Hydrodytes opalinus
Hydroglyphus geminus
Hydroporus erythrocephalus
Hydrotrupes palpalis
Hydrovatus fasciatus
Hygrotus (Leptolambus) impressopunctatus
Hyphydrus ovatus
Ilybius fenestratus
Laccodytes sp.
1980s
Bunites distigma
2017
2017
Brychius elevatus
Celina imitatrix
2017
Broscus cephalotes
2017
2017
Bidessus unistriatus
Caperhantus cicurius
2012
Andogyrus sp.
2017
1994
Amblycheila cylindriformis
Canthydrus sp.
2010
Algophilus lathridioides
2017
2017
Agabus undulatus
Calophaena bicincta
2017
Agabetes acuductus
2017
1990
Adelotopus paroensis
Calathus sp.
2017
Year collected
Acilius canaliculatus
Species
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
kept in 70% or less ethanol for >15 years
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 70% ethanol
pinned collection specimen, most likely initially preserved with ethyl acetate
kept in low grade ethanol for >5 years
preserved in 96% ethanol
preserved in 96% ethanol
pinned collection specimen, most likely initially preserved with ethyl acetate
preserved in 96% ethanol
Preservation
x (dry collection)
x
x
x
x
x
x
x
x (dry collection)
x
x
x
x
x
x
x
x
x
x
x
x
x
x (dry collection)
x
x
x
x
x
x
x
x
x
x
x
x
x (dry collection)
x
x
x
X
Existing voucher (DNA or tissue)
Table S4: Collection information for the processed samples used for hybrid enrichment (voucher deposited at Zoological State Collections Munich, Germany).
Collection ID
Adephaga_091
Adephaga_026
Adephaga_033
Adephaga_034
Adephaga_077
Adephaga_085
Adephaga_064
Adephaga_028
Adephaga_087
Adephaga_057
Adephaga_078
Adephaga_079
Adephaga_017
Adephaga_015
Adephaga_016
Adephaga_023
Adephaga_069
Adephaga_067
Adephaga_065
Adephaga_030
Adephaga_040
Adephaga_009
Adephaga_088
Adephaga_082
Adephaga_022
Adephaga_061
Adephaga_073
Adephaga_058
Adephaga_080
Adephaga_076
Adephaga_063
Adephaga_047
Adephaga_012
Adephaga_014
Adephaga_052
Adephaga_027
Adephaga_008
Adephaga_097
Adephaga_018
Adephaga_037
Adephaga_021
Adephaga_096
Adephaga_011
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
Biomaterial provider
2017
2010
2013
1990
2012
2007
2004
2014
2017
2017
2013
2016
2013
2017
2014
2017
2009
2017
2017
2017
2015
2017
2013
2013
2017
2017
2017
2017
2013
2013
2017
2017
2007
2017
2017
2017
2016
2017
2017
2017
2010
1998
2006
2006
2017
2017
Laccophilus poecilus
Laccornis oblongus
Lachnophorini sp.
Lancetes sp.
Loricera pilicornis
Macrogyrus sp.
Manticora latipennis
Matus sp.
Megadytes sp.
Meridiorhantus calidus
Mesacanthina cribata
Mesonoterus laevicollis
Morion sp.
Nebria picicornis
Necterosoma penicillatum
Neohydrocoptus sp.
Neptosternus brevior
Notiobia sp.
Notiophilus sp.
Notomicrus sp.
Odacantha melanura
Omophron sp.
Ozaena sp.
Pachydrus sp.
Panagaeus bipustulatus
Patrus sp.
Peltodytes (Peltodytes) caesus
Peltodytes (Neopeltodytes) oppositus
Pheropsophus sp.
Philaccolilus sp.
Pinacodera sp.
Platambus maculatus
Platynectes sp.
Platynus sp.
Porhydrus lineatus
Porrorhynchus sp.
Suphisellus (Pronoterus) semipunctatus
Pseudoxicheila tarsalis
Pterostichus burmeisteri
Scarites subterraneus
Siagona sp.
Sphallomorpha suturalis
Sternhydrus atratus
Sternhydrus scutellaris
Stictotarsus duodecimpustulatus
Suphisellus gibbulus
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
pinned collection specimen, most likely initially preserved with ethyl acetate
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
pinned collection specimen, most likely initially preserved with ethyl acetate
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
x
x (dry collection)
x
X
x
x
x
x
x (dry collection)
x
x
x
x
x
x
x (dry collection)
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x (dry collection)
x
x
x
x (dry collection)
x
x
Adephaga_059
Adephaga_035
Adephaga_093
Adephaga_102
Adephaga_099
Adephaga_053
Adephaga_045
Adephaga_046
Adephaga_066
Adephaga_006
Adephaga_010
Adephaga_024
Adephaga_051
Adephaga_029
Adephaga_025
Adephaga_048
Adephaga_092
Adephaga_071
Adephaga_020
Adephaga_019
Adephaga_013
Adephaga_038
Adephaga_107
Adephaga_070
Adephaga_044
Adephaga_054
Adephaga_055
Adephaga_049
Adephaga_068
Adephaga_105
Adephaga_056
Adephaga_101
Adephaga_043
Adephaga_074
Adephaga_004
Adephaga_041
Adephaga_003
Adephaga_001
Adephaga_090
Adephaga_098
Adephaga_031
Adephaga_083
Adephaga_095
Adephaga_072
Adephaga_081
Adephaga_075
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
2016
2017
2017
2017
2016
2017
Suphisellus tenuicornis
Tetracha carolina
Therates labiatus
Thermonectus basillaris
Thermonectus margineguttatus
Tricondyla aptera
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
preserved in 96% ethanol
x
x
x
x
x
x
Adephaga_036
Adephaga_002
Adephaga_060
Adephaga_062
Adephaga_042
Adephaga_007
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
M. Balke
0.00282
Dytiscidae
Gyrinidae
Haliplidae
Noteridae
Cicindelidae
Dytiscidae
Gyrinidae
Cicindelidae
Dytiscidae
Gyrinidae
Cicindelidae
Gyrinidae
Gyrinidae
Carabidae
Carabidae
Carabidae
Carabidae
Carabidae
Noteridae
Noteridae
Noteridae
Haliplidae
Haliplidae
Haliplidae
Dytiscidae
Dytiscidae
Cicindelidae
0.78790
0.40570
0.89130
0.04242
0.00038
0.00058
0.04848
0.00008
0.00062
0.04282
0.09944
0.07101
0.00004
p-value (Ct / Cn)
Cicindelidae
Paired test
0.91430
0.40570
0.68790
0.04242
0.00024
0.00117
0.04848
0.00004
0.00133
0.02713
0.06624
0.07138
0.00003
0.00404
p-value (Ct / Ca)
Table S5: Summarized results of the Mann-Whitney-Wilcoxon tests for the enrichment statistics in different pairs of families of Adephaga (see also Fig. 2, Fig. S4).
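For readers who want to reproduce this kind of comparison, the following is a minimal sketch of an exact two-sided Mann-Whitney-Wilcoxon test for two small independent samples. It is not the authors' implementation: the source does not state whether exact or approximate p-values were used, the function name is hypothetical, and the enrichment ratios in the usage example are invented for illustration.

```python
from itertools import combinations


def mann_whitney_exact(x, y):
    """Exact two-sided Mann-Whitney-Wilcoxon test for small samples.

    Returns (U statistic of x, two-sided p-value). The p-value is obtained by
    enumerating every way of assigning the pooled ranks to a group of size len(x).
    """
    values = list(x) + list(y)
    pooled = sorted(values)
    n, m = len(x), len(y)

    # Assign midranks so that tied values share the average of their ranks.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    ranks = [rank_of[v] for v in values]

    def u_from(rank_sum):
        return rank_sum - n * (n + 1) / 2.0

    u_obs = u_from(sum(ranks[:n]))
    mean_u = n * m / 2.0

    extreme = 0
    total = 0
    for idx in combinations(range(n + m), n):
        u = u_from(sum(ranks[k] for k in idx))
        total += 1
        # Count assignments at least as far from the mean as the observed one.
        if abs(u - mean_u) >= abs(u_obs - mean_u) - 1e-12:
            extreme += 1
    return u_obs, extreme / total


if __name__ == "__main__":
    # Hypothetical Ct / Cn ratios for two families, for illustration only.
    family_a = [4.7, 15.9, 49.4, 30.4]
    family_b = [1.3, 0.7, 4.5, 3.2]
    print(mann_whitney_exact(family_a, family_b))
```

The enumeration grows combinatorially with sample size, so for larger groups a normal approximation or a statistics library would be the more practical choice.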
23,096
20,782
64,056
22,380
Agabetes acuductus
Agabus undulatus
Algophilus lathridioides
Amblycheila cylindriformis
51,544
17,391
43,622
38,082
Brychius elevatus
Bunites distigma
Calathus sp.
Calophaena bicincta
16,989
35,135
23,426
34,156
12,119
21,540
7,306
Hyderodes shuckardi
Hydrocanthus oblongus
Hydrodytes opalinus
Hydroglyphus geminus
Hydroporus erythrocephalus
Hydrotrupes palpalis
Hydrovatus fasciatus
11,819
25,814
Hydaticus pacificus
27,230
31,603
Haliplus (Neohaliplus) lineatocollis
Loricera pilicornis
72,706
Haliplus (Liaphlus) laminatus
Lancetes sp.
51,834
Haliplus (Haliplidius) confinis
20,702
14,791
Graptodytes pictus
Lachnophorini sp.
67,211
Goniotropis sp.
17,468
13,366
Glyptolenus sp.
66,510
72,529
Galerita sp.
Laccornis oblongus
19,504
Exocelina sp.
Laccophilus poecilus
27,077
Eretes griseus
56,800
25,412
Dytiscus marginalis
15,641
10,588
Derovatellus peruanus
Laccodytes sp.
46,647
Cychrus sp.
Ilybius fenestratus
20,477
Coptotomus sp.
11,087
18,944
Copelatus caelatipennis
Hyphydrus ovatus
23,744
Clivina sp.
13,318
18,217
Celina imitatrix
Hygrotus (Leptolambus) impressopunctatus
30,306
Caperhantus cicurius
100,981
38,321
Broscus cephalotes
Canthydrus sp.
14,175
Bidessus unistriatus
7,033
25,652
Adelotopus paroensis
Andogyrus sp.
15,151
Acilius canaliculatus
Species / Target enrichment assembly
26,273
11,806
18,914
16,733
64,560
56,659
15,308
9,725
13,172
7,231
21,093
11,916
34,000
23,353
34,523
16,319
25,251
30,902
71,452
51,028
14,276
59,757
13,134
70,845
17,851
26,888
23,701
10,507
46,062
20,246
18,697
22,733
17,990
29,931
100,317
37,862
43,161
14,520
49,918
37,350
13,920
5,792
21,987
63,455
20,581
22,793
24,889
14,982
No. of assembled contigs; No. of contigs never suspected as being contaminated
957
13
1,788
735
1,950
141
333
1,362
146
75
447
203
156
73
612
670
563
701
1,254
806
515
7,454
232
1,684
1,653
189
1,711
81
585
231
247
1,011
227
375
664
220
461
2,871
1,626
971
255
1,241
393
601
201
303
763
169
No. of putatively contaminated contigs (suspects)
715
7
524
405
1,017
86
167
666
90
58
309
116
108
56
352
182
206
283
602
511
285
1,852
96
802
723
85
702
59
218
91
81
308
126
91
352
106
182
1,497
660
318
153
333
76
249
93
140
149
100
188
0
1,194
219
731
31
67
368
8
1
69
16
12
3
226
457
332
372
460
176
57
5,555
121
824
772
80
924
9
268
110
33
675
57
268
245
94
205
1,268
782
606
38
875
64
64
72
103
155
32
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
No. of clean contigs; No. of contigs with low coverage; No. of contigs with high coverage
24
3
38
64
167
17
52
212
20
5
43
31
20
8
18
3
9
18
129
62
102
29
4
26
75
12
31
4
57
7
13
4
12
6
54
2
31
26
117
12
23
4
17
97
13
23
41
21
30
3
32
47
35
7
47
116
28
11
26
40
16
6
16
28
16
28
63
57
69
18
11
32
82
12
54
8
42
23
120
24
32
10
13
18
43
80
67
35
41
29
236
191
23
37
418
15
No. of contigs of dubious origin; No. of contaminated contigs
CroCo results
Table S6: Summarized statistics of the cross-contamination checks for each species (hybrid-enrichment data) and summarized length statistics of the clean assemblies.
99.11
99.94
93.89
98.11
98.59
99.90
98.93
93.72
99.57
99.76
99.35
99.28
99.85
99.92
99.25
97.12
98.61
98.67
99.10
99.43
98.44
91.66
98.98
98.78
95.23
99.61
96.02
99.79
99.21
99.31
99.12
97.03
99.44
99.06
99.69
99.70
99.36
92.09
98.12
98.29
99.28
87.08
98.58
99.45
99.48
99.29
97.60
99.54
0.69
0.00
5.76
1.25
1.09
0.05
0.42
3.31
0.06
0.01
0.32
0.13
0.03
0.01
0.64
2.68
1.28
1.17
0.63
0.33
0.38
8.26
0.90
1.13
3.95
0.29
3.63
0.08
0.57
0.53
0.17
2.84
0.31
0.88
0.24
0.24
0.46
7.29
1.51
1.58
0.26
12.44
0.28
0.09
0.34
0.44
0.60
0.21
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.01
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
Percentage (%) of clean contigs; Percentage (%) of contigs with low coverage; Percentage (%) of contigs with high coverage
0.08
0.02
0.18
0.36
0.25
0.02
0.33
1.91
0.15
0.06
0.19
0.25
0.05
0.03
0.05
0.01
0.03
0.05
0.17
0.11
0.68
0.04
0.02
0.03
0.38
0.04
0.12
0.03
0.12
0.03
0.06
0.01
0.06
0.01
0.05
0.00
0.07
0.14
0.22
0.03
0.16
0.05
0.07
0.15
0.06
0.09
0.15
0.13
Percentage of contigs of dubious origin
0.11
0.02
0.15
0.26
0.05
0.01
0.30
1.04
0.21
0.15
0.12
0.33
0.04
0.02
0.04
0.16
0.06
0.08
0.08
0.10
0.46
0.02
0.08
0.04
0.42
0.04
0.21
0.07
0.09
0.11
0.63
0.10
0.17
0.03
0.01
0.04
0.09
0.46
0.12
0.09
0.28
0.41
1.05
0.29
0.11
0.16
1.62
0.09
26,988
11,813
19,438
17,138
65,577
56,745
15,475
10,391
13,262
7,289
21,402
12,032
34,108
23,409
34,875
16,501
25,457
31,185
72,054
51,539
14,561
61,609
13,230
71,647
18,574
26,973
24,403
10,566
46,280
20,337
18,778
23,041
18,116
30,022
100,669
37,968
43,343
16,017
50,578
37,668
14,073
6,125
22,063
63,704
20,674
22,933
25,038
15,082
578
466
519
501
549
534
574
568
571
639
483
578
508
539
498
453
475
504
548
526
549
521
585
568
525
493
482
566
518
448
510
549
578
472
598
601
549
510
503
516
558
357
730
492
580
516
467
499
Percentage (%) of putatively contaminated contigs; No. of clean contigs; Mean contig length
433
400
416
414
465
456
441
440
449
465
419
444
429
428
430
401
414
415
457
437
434
448
455
481
425
433
416
424
438
400
422
439
442
424
506
473
443
413
436
431
430
347
516
437
459
422
396
409
Median contig length
Clean assembly statistics
589
446
551
478
545
535
590
566
591
673
470
588
491
517
484
449
470
485
543
507
560
530
613
581
522
492
475
586
504
432
503
555
571
469
623
610
534
488
493
493
547
397
864
490
610
496
488
487
N50 of contig lengths
15,435
15,935
10,360
9,650
10,095
16,161
11,568
11,265
9,822
6,683
14,583
9,022
8,292
9,562
16,013
9,734
11,261
12,775
11,618
13,589
8,188
16,503
8,340
8,482
11,233
15,930
12,364
8,798
13,174
9,017
10,545
10,623
14,645
11,129
14,637
8,688
15,282
11,036
9,832
11,239
7,261
8,859
19,968
11,504
12,564
13,975
8,536
15,211
Max. contig length
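The count columns of Table S6 can be checked for internal consistency: the five CroCo categories should sum to the number of suspect contigs, and suspects plus never-suspected contigs should give the assembly size. The sketch below is an illustration under stated assumptions. The function and key names are hypothetical, and the example uses the first entry of each count column on the assumption that they belong to the same species (they do sum consistently).

```python
def summarise_contamination(never_suspected, clean, low_cov, high_cov,
                            dubious, contaminated):
    """Derive totals and percentages from per-category contig counts."""
    suspects = clean + low_cov + high_cov + dubious + contaminated
    assembled = never_suspected + suspects
    if assembled == 0:
        raise ValueError("assembly contains no contigs")
    # Contigs kept as clean: never suspected, or suspected but classified clean.
    clean_total = never_suspected + clean

    def pct(count):
        return 100.0 * count / assembled

    return {
        "suspects": suspects,
        "assembled": assembled,
        "clean_total": clean_total,
        "pct_clean": pct(clean_total),
        "pct_low_coverage": pct(low_cov),
        "pct_high_coverage": pct(high_cov),
        "pct_dubious": pct(dubious),
        "pct_contaminated": pct(contaminated),
    }


if __name__ == "__main__":
    # First entries of the count columns in Table S6 (assumed to be one species).
    summary = summarise_contamination(
        never_suspected=26273, clean=715, low_cov=188,
        high_cov=0, dubious=24, contaminated=30,
    )
    for key, value in summary.items():
        print(key, value)
```

With these inputs the totals agree with the table (957 suspects and 26,988 clean contigs). Computed percentages can differ from the tabulated ones in the last digit, depending on how the values were rounded.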
37,243
26,345
63,641
23,175
32,196
18,577
29,228
14,662
19,071
42,153
25,353
30,580
36,840
21,940
13,651
Peltodytes (Neopeltodytes) oppositus
Pheropsophus sp.
Philaccolilus sp.
Pinacodera sp.
Platambus maculatus
Platynectes sp.
Platynus sp.
Porhydrus lineatus
Porrorhynchus sp.
Suphisellus (Pronoterus) semipunctatus
Pseudoxicheila tarsalis
Pterostichus burmeisteri
Scarites subterraneus
Siagona sp.
Sphallomorpha suturalis
19,604
19,897
Peltodytes (Peltodytes) caesus
Tricondyla aptera
28,916
Patrus sp.
22,541
52,533
Panagaeus bipustulatus
Thermonectus margineguttatus
14,651
Pachydrus sp.
24,332
40,889
Ozaena sp.
29,215
63,799
Omophron sp.
Thermonectus basillaris
15,053
Odacantha melanura
Therates labiatus
59,413
Notomicrus sp.
18,167
30,797
Notiophilus sp.
68,301
32,881
Notiobia sp.
Tetracha carolina
31,168
Neptosternus brevior
Suphisellus tenuicornis
36,044
Neohydrocoptus sp.
69,300
22,201
Necterosoma penicillatum
Suphisellus gibbulus
16,808
Nebria picicornis
14,361
13,906
Morion sp.
Stictotarsus duodecimpustulatus
28,518
Mesonoterus laevicollis
11,807
11,796
Mesacanthina cribata
26,440
31,081
Meridiorhantus calidus
Sternhydrus scutellaris
11,736
Sternhydrus atratus
15,450
Megadytes sp.
9,254
20,010
Matus sp.
Manticora latipennis
Macrogyrus sp.
18,625
20,536
23,057
28,761
16,179
58,315
59,663
13,150
24,123
9,527
13,307
21,684
36,762
29,925
22,411
40,949
17,143
13,434
28,538
17,842
30,905
20,978
63,513
20,835
36,798
17,236
28,812
50,148
14,496
38,924
62,398
14,862
59,303
30,324
32,649
30,648
34,269
22,030
15,616
13,189
28,342
11,726
29,585
9,868
15,319
9,228
18,192
979
2,005
1,275
454
1,988
9,986
9,637
1,211
2,317
2,280
344
256
78
655
2,942
1,204
1,928
1,228
690
735
1,291
2,197
128
5,510
445
2,661
104
2,385
155
1,965
1,401
191
110
473
232
520
1,775
171
1,192
717
176
70
1,496
1,868
131
26
1,818
354
767
386
113
601
6,314
6,610
486
905
1,101
139
103
42
267
589
674
560
712
286
361
559
810
68
2,272
141
1,107
67
856
60
620
679
166
83
113
106
301
416
95
424
229
81
47
606
733
79
12
677
578
997
615
325
1,328
746
564
590
423
444
54
119
9
345
2,325
382
1,302
195
345
343
640
1,365
33
3,121
284
1,488
6
1,490
88
1,296
665
1
3
334
110
135
1,315
38
742
453
14
8
852
1,023
31
1
988
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
19
148
115
3
8
2649
2259
84
786
634
7
12
8
9
11
87
28
196
35
8
62
8
17
33
7
57
16
20
3
14
43
11
12
5
7
50
26
21
8
8
39
6
5
60
6
1
83
28
93
159
13
51
277
204
51
203
101
144
22
19
34
17
60
38
124
24
23
30
14
10
84
13
9
15
19
4
35
14
13
12
21
9
34
18
17
18
27
42
9
33
52
15
12
70
96.81
94.50
96.34
98.83
92.36
94.62
95.63
94.95
94.65
90.01
98.49
99.30
99.90
98.73
90.71
98.74
92.82
96.48
98.61
97.98
97.72
94.01
99.90
87.70
99.18
92.18
99.87
97.08
99.35
96.71
98.86
99.83
99.95
98.83
99.61
99.29
96.22
99.65
95.43
96.49
99.66
99.80
97.13
90.32
99.66
99.84
94.29
2.94
4.42
2.52
1.11
7.30
1.09
0.81
4.10
1.59
3.76
0.39
0.54
0.02
1.12
9.17
0.90
6.82
1.32
1.18
1.84
1.98
5.88
0.05
11.84
0.76
7.47
0.02
2.83
0.60
3.16
1.04
0.00
0.00
1.08
0.33
0.43
3.64
0.17
4.41
3.25
0.04
0.06
2.74
8.71
0.20
0.01
4.93
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.09
0.65
0.47
0.01
0.04
3.87
3.25
0.58
2.97
5.36
0.05
0.05
0.02
0.02
0.04
0.20
0.14
1.33
0.11
0.04
0.19
0.03
0.02
0.12
0.01
0.28
0.05
0.03
0.02
0.03
0.06
0.07
0.02
0.01
0.02
0.16
0.07
0.09
0.04
0.05
0.13
0.05
0.01
0.51
0.03
0.01
0.41
0.14
0.41
0.65
0.04
0.28
0.40
0.29
0.35
0.76
0.85
1.05
0.10
0.05
0.11
0.06
0.14
0.19
0.84
0.08
0.12
0.09
0.06
0.01
0.31
0.03
0.04
0.05
0.03
0.02
0.08
0.02
0.08
0.02
0.06
0.02
0.10
0.04
0.07
0.10
0.19
0.14
0.07
0.10
0.44
0.09
0.12
0.34
18,979
21,303
23,443
28,874
16,780
64,629
66,273
13,636
25,028
10,628
13,446
21,787
36,804
30,192
23,000
41,623
17,703
14,146
28,824
18,203
31,464
21,788
63,581
23,107
36,939
18,343
28,879
51,004
14,556
39,544
63,077
15,028
59,386
30,437
32,755
30,949
34,685
22,125
16,040
13,418
28,423
11,773
30,191
10,601
15,398
9,240
18,869
436
534
472
452
514
505
553
499
511
495
475
577
492
480
605
523
509
560
524
510
494
490
567
513
514
511
506
487
511
586
551
537
607
594
609
596
535
474
617
599
653
477
476
450
488
520
1,615
428
416
403
429
415
471
437
425
427
407
432
425
418
438
432
437
438
425
422
422
424
429
447
421
420
417
423
434
469
452
446
448
485
460
449
429
413
482
429
465
417
400
408
410
420
689
392
445
530
469
457
494
491
552
488
495
481
456
568
485
462
586
511
496
575
510
494
488
477
600
506
509
522
505
478
499
604
576
538
618
605
639
579
523
484
644
619
739
457
455
445
501
507
3,227
11,608
15,968
14,009
16,030
13,833
16,412
7,526
14,529
15,811
16,636
12,302
27,098
8,715
14,169
164,411
16,023
11,948
11,103
6,533
16,089
10,583
12,654
14,638
17,129
13,614
18,784
15,259
8,621
13,697
8,232
14,165
23,311
11,055
19,704
11,585
10,515
11,590
16,556
14,797
20,868
13,311
10,695
16,526
10,096
12,033
13,466
267,990
14.9879553005658 14.9095415607033 14.6262646776515 14.0169271717379 13.7453785344514 12.9845136243213 12.5516818518748 12.2625950880255 12.1261851428834 11.0016461065822 10.1081691569445 9.23998836614435 8.56925720515656 8.50467351425688 8.42795722618228
Noterus clavicornis
Bidessus unistriatus
Peltodytes caesus
Hydroglyphus geminus
Derovatellus peruanus
Mesacanthina cribata
Canthydrus sp.
Pseudoxicheila tarsalis
Peltodytes oppositus
Porrorhynchus sp.
Cicindela hybrida
Andogyrus sp.
Hygrotus (Leptolambus) impressopunctatus
Hyphydrus ovatus
Therates labiatus
4.61849225302622
15.9976172935927
Macrogyrus sp.
Hydrovatus sp.
16.2812441570849
5.2603106696089
16.5908715138856
Gyrinus marinus
4.63082083280957
17.0917760376562
Hydrocanthus oblongus
Tetracha carolina
17.8361334480708
Clinidium baldufi
Tricondyla aptera
18.4343336403483
Manticora latipennis
20.086940972042
Mesonoterus laevicollis
Neohydrocoptus sp.
21.0047843064449
23.1634883014588
Dineutus sp.
Suphisellus tenuicornis
23.5273688528818
21.9476118164781
28.0875072726362
Suphisellus (Pronoterus) semipunctatus
21.0625551931415
35.5918036760022
Notomicrus sp.
Suphisellus gibbulus
52.5222714837805
Patrus sp.
Pachydrus sp.
81.8293488629726
Priacma serrata
LB score
Micromalthus debilis
Species
Table S7: LB-score statistics of each species in supermatrix G based on the tree inferred under the SHETU model.
1.62364890946678
Hydroporus erythrocephalus
-5.50246986082786 -5.53795566066612 -5.55048753133522 -5.89915978514883 -6.15294705032785
Haliplus laminatus
Calophaena bicincta
Haliplus confinis
-4.1377474171876
Copelatus caelatipennis
Ozaena sp.
-4.04777931386015
Panagaeus bipustulatus
Scarites subterraneus
-3.59625411767508
Morion sp.
-5.33529238277406
-3.39165167752329
Odacantha melanura
-5.03450592251712
-3.18355006468255
Siagona sp.
Cychrus sp.
-2.93680802725671
Laccornis oblongus
Pheropsophus sp.
-2.74719850897069
Notiobia sp.
-4.21832565846587
-2.41460947784844
Haliplus lineatocollis
-1.66343455684053
Brychius elevatus
Galerita sp. -1.03862428904037
-0.902693888635164
Nebria picicornis
Adelotopus paroensis
-0.397318319569817
Eretes griseus
Celina imitatrix
0.552158453354612 -0.255889253109809
Bembidion corgenoma
1.56172563603556
1.83842247486454
Philaccolilus sp.
0.697554609830475
2.00321504504037
Porhydrus lineatus
Pinacodera sp.
2.07604755068354
Neptosternus sp.
Notiophilus sp.
2.14975525611938 2.10280584058766
Hydrodytes opalinus
2.42665718053103
Laccodytes sp.
Graptodytes pictus
2.7650535680152
2.86842991567915
Algophilus lathridioides
2.55131110143494
3.40288567571252
Stictotarsus duodecimpustulatus
Laccophilus poecilus
3.55377854622276
Loricera pilicornis
Sphallomorpha suturalis
4.18652492550238
Necterosoma penicillatum
-14.287415643254
-14.8461912820432 -14.9856870011585 -15.3084359566286
Matus sp.
Ilybius fenestratus
Broscus cephalotes
-13.8944029708267
Sternhydrus atratus
-14.8307056656241
-13.1448225043927
Elaphrus aureus
Platambus maculatus
-12.3913476059428
Amphizoa insolens
Agabetes acuductus
-12.2111425723117
-11.5835538654435
Sternhydrus scutellaris
-12.3871909509032
-11.3350674309252
Trachypachus gibbsii
Coptotomus sp.
-11.2261819104233
Platynus sp.
Amphizoa lecontei
-11.1116265535797 -11.1807823954766
-10.9461426825717
Clivina sp.
Liopterus haemorrhoidalis
-10.9292185965262
Megadytes sp.
Cybister lateralimarginalis
-9.6687544591983 -10.4889506742851
Calathus sp.
-9.48225910086887
Hygrobia nigra
Lancetes sp.
-9.08444551105834
-8.88398442684992
Hygrobia hermanni
-9.10812530864255
-8.64422426667182
Chlaenius tricolor
Sinaspidytes wrasei
-8.33851823087746
Glyptolenus sp.
Acilius canaliculatus
-8.17266250905532
-7.36803497351586
Thermonectus basillaris
-7.92106642129098
-7.30510043127757
Pterostichus burmeisteri
Exocelina sp.
-7.07418993055707
Carabus granulatus
Lachnophorini sp.
-6.92382739240991
Haliplus fluviatilis
-7.47367304119996
-6.77185663026029
Pogonus chalceus
-7.81985616884988
-6.54023200108131
Thermonectus intermedius
Goniotropis sp.
-6.37474826198299
Amblycheila cylindriformis
Omophron sp.
-6.30657933159926
Thermonectus margineguttatus
-16.1707226151063 -16.1919408399064 -16.4135486272736 -16.5091780230715 -16.6117986302361 -17.8199723143495 -17.8995553967374 -19.979759209938 -20.008885139661
Bunites distigma
Hydrotrupes palpalis
Batrachomatus nannup
Aspidytes niobe
Hydaticus pacificus
Platynectes sp.
Caperhantus cicurius
Hyderodes shuckardi
Meridiorhantus calidus -20.3963050509039
-16.1705342699442
Calosoma frigidum
Dytiscus marginalis
-15.5200822594503
Agabus undulatus
29,617, 30,692
24,276, 24,277
20,324, 22,806
20,547, 20,295
23,265, 22,931
20,813, 22,094
Supermatrix D - recoded
Supermatrix F
Supermatrix G
Supermatrix H
Supermatrix I
Supermatrix J
2,000
2,000
2,000
2,000
2,000
2,000
2,000
Burn-in
5
5
5
5
5
5
5
Sampling frequency
0.29331
0.33291
0.16444
0.22626
0.30707
0.49131
1
Maxdiff**
bpcomp
67
629
50
73
119
532
90
loglik
166
150
519
332
123
415
37
length
89
1,250
71
171
371
2,227
221
alpha
1,115
621
623
702
747
1,142
1,430
Nmode
78
272
53
82
127
1,610
315
statent
150
202
183
124
63
740
33
statalpha
tracecomp (effsize***)
***Note: According to the manual of Phylobayes values > 50 are acceptable.
937
42
557
59
74
3,183
75
rrent
**Note: According to the manual of Phylobayes, if maxdiff = 1 after 10,000 generations the chains have likely become stuck in a local optimum, and values < 0.3 are acceptable. We considered the maxdiff value for supermatrix F as marginally acceptable because it was estimated at 0.307.
*Note: For supermatrix D fewer cycles were run due to computational limitations.
10,030, 10,246
Supermatrix D*
No. of MCMC cycles (run1 , run2)
Table S8: Summarized convergence statistics of the Bayesian phylogenetic analyses performed with the software Phylobayes under the CAT+GTR+G4 model.
18,813
20,931
17,571
18,324
22,276
17,410
6,223
rrmean
JTT+C60+F+R8
JTT+C60+F+R8
Supermatrix C
Supermatrix D
JTTDCMUT+C60+F+R7
JTTDCMUT+C60+F+R6
JTT+C60+F+R5
JTT+C60+F+R7
JTTDCMUT+C60+F+R5
Supermatrix F
Supermatrix G
Supermatrix H
Supermatrix I
Supermatrix J
**Note: It was not analyzed with the maximum-likelihood approach.
*Note: Not analyzed.
JTTDCMUT+C60+F+R9
Supermatrix E
-
JTT+C60+F+R8
Supermatrix B
Supermatrix D – recoded**
-
Best model
Supermatrix A*
Dataset
LG+R9
LG+R10
LG+R10
LG+R10
LG+R10
LG+R10
-
LG+R10
LG+F+R10
LG+R10
-
Best site-homogeneous model
843160.811
1447045.90
1081638.86
1263178.26
1667656.21
2706783.69
-
2661340.94
3365688.93
2699060.74
-
AICc for best model (AICc)
891371.387
1531968.58
1142790.64
1337483.23
1766780.96
2864913.19
-
2816918.86
3545261.41
2852218.04
-
AICc for best site-homogeneous model
Table S9: Summarized results of the model selection procedure in ModelFinder for all partitioned and unpartitioned supermatrices (IQ-TREE v. 1.6.12).
-48210.58
-84922.68
-61151.77
-74304.97
-99124.75
-158129.50
-
-155577.92
-179572.48
-153157.30
-
ΔAICcAICc
2854344.989
3549562.162
Supermatrix B
Supermatrix C
AICc
Site-homogeneous (SHOMU)
3552120.506
2856869.834
BIC
3369952.821
2701618.882
AICc
Site-heterogeneous (SHETU)
3373178.732
2704802.417
BIC
3513279.535
2821151.198
AICc
Partitioned site-homogeneous (SHOMP)
Table S10: Comparison of different models (and partitioning schemes) for the analyzed partitioned matrices (see Table 1). Comparisons were based on a fixed neighbour-joining tree constructed with MEGA X (JTT + uniform rates).
BIC
3531601.641
2838600.738
136
136
136
136
Supermatrix B_nt
Supermatrix C_nt
Supermatrix D_nt
56,812
54,175
100,900
127,584
No. of species; No. of nucleotide alignment sites
Supermatrix A_nt
Nucleotide dataset
17,630
17,282
37,788
48,680
31.0%
31.9%
37.5%
38.2%
0.789
0.868
0.814
0.788
0.037
0.054
0.041
0.035
Parsimony informative sites; Percentage of parsimony informative sites; Overall alignment completeness score (Ca); Minimum completeness score for pairs of sequences (Cr ij)
Statistics
0.061
0.074
0.090
0.091
4.19E-02
4.70E-10
7.39E-16
1.51E-17
Average p-distance between sequences; Median pairwise p-value to the Bowker’s test
Table S11: Summarized statistics of the analyzed nucleotide supermatrices and information on what analyses were performed for each supermatrix.
3.96E-02
1.96E-08
52.02%
88.98%
Partitioned (15 tree searches, partitioned)
-
-
-
Unpartitioned under the GTR+FO*H4 model (15 tree searches)
Maximum likelihood analyses with the GHOST model (i.e., accounting for heterotachy)
Corresponding supermatrix A after keeping only second codon positions and trimming each partition with BMGE (h=0.5). Only partitions with >= 4 52.21% species, >= 80 sites, =150 nucleotide sites and > 6.60E-13 93.85% 92.56% 30% missing data from supermatrix A_nt
126.6
130.3
156.8
165.4
155.2
169.0
170.9
175.5
All genes (n = 348)
LM subset (n = 174)
SH subset (n = 174)
PI subset (n = 175)
LM+SH subset (n = 104)
LM+PI subset (n = 87)
SH+PI subset (n = 130)
LM+PI+SH (n = 74)
Set of genes
Average no. of parsimony informative sites per gene
158.5
157.0
148.0
142.5
150.0
144.5
118.5
119.0
Median no. of parsimony informative sites per gene
245.0
239.0
244.0
226.0
242.0
226.0
209.0
214.0
261.7
253.8
259.6
239.3
260.6
240.7
225.3
231.3
Median length of genes; Mean length of genes
643
643
643
643
643
643
643
643
Max. length of genes
154
150
154
150
150
150
150
150
0.168
0.219
0.170
0.161
0.227
0.205
0.154
0.206
Min. length of genes; Median proportion of missing data
0.176
0.213
0.176
0.162
0.224
0.200
0.159
0.204
Mean proportion of missing data
0.300
0.375
0.300
0.300
0.386
0.375
0.300
0.386
Max. proportion of missing data
0.073
0.073
0.073
0.050
0.073
0.050
0.044
0.044
Min. proportion of missing data
111.0
87.5
117.0
116.5
111.0
104.0
123.0
122.0
101.3
92.1
104.2
105.8
99.4
96.1
112.8
109.1
133
134
134
134
136
134
135
136
41
41
41
41
41
41
41
41
0.114
0.114
0.114
0.117
0.119
0.114
0.117
0.122
Median no. of sequences; Mean no. of sequences; Max. no. of sequences; Min. no. of sequences; Median RCFV
Table S12: Summarized statistics for the different groups (i.e., sets) of genes that were analyzed with the summary coalescent phylogenetic method. RCFV values were calculated with BaCoCa v. 1.105 (see materials and methods).
0.114
0.116
0.114
0.117
0.123
0.116
0.117
0.124
Mean RCFV
0.194
0.230
0.194
0.194
0.324
0.230
0.210
0.324
Max. RCFV
0.039
0.039
0.039
0.039
0.039
0.039
0.039
0.039
Min. RCFV
File S2
Supplementary experimental procedures
Bait design
We used the previously inferred ortholog set of Vasilikopoulos et al. (2019) and 24 transcriptomes of Adephaga (File S1: Table S1) to generate codon-based nucleotide sequence alignments of the genes in the ortholog set (see Misof et al., 2014 for details on generating the codon-based nucleotide alignments). The sequences of the reference species of the ortholog set (Harpegnathos saltator, Nasonia vitripennis, Bombyx mori, Danaus plexippus, Anopheles gambiae) were removed before generating the codon-based nucleotide sequence alignments, except for the sequences of Tribolium castaneum. Subsequently, we screened these alignments for regions suitable for bait design. The sequences of T. castaneum (already part of the reference taxon set, Vasilikopoulos et al., 2019) were kept in the alignments to allow BaitFisher to cut these alignments according to CDS features (i.e., coding exons, Mayer et al., 2016). We used the genome assembly of T. castaneum v. 5.2 (Herndon et al., 2020; Richards et al., 2008) (scaffolds downloaded from BeetleBase, last access 28 October 2019, Kim et al., 2010) and the same version of the gene annotation (GFF file downloaded from iBeetle-Base, Dönitz et al., 2014, last access 28 October 2019) to identify CDS boundaries within each codon-based nucleotide sequence alignment (Mayer et al., 2016). The sequences of T. castaneum were automatically excluded by BaitFisher before inferring the DNA sequences of the baits. The required taxonomic group string was specified as follows in all tiling design experiments: (Clinidium_baldufi), (Gyrinus_marinus), (Aspidytes_niobe), (Sinaspidytes_wrasei), (Cybister_lateralimarginalis), (Haliplus_fluviatilis), (Dineutus_sp), (Hygrobia_hermanni, Hygrobia_nigra), (Cicindela_hybrida), (Elaphrus_aureus), (Noterus_clavicornis), (Thermonectus_intermedius), (Carabus_granulatus, Calosoma_frigidum), (Pogonus_chalceus), (Amphizoa_lecontei, Amphizoa_insolens), (Liopterus_haemorrhoidalis), (Batrachomatus_nannup), (Metrius_contractus), (Bembidion_corgenoma), (Chlaenius_tricolor), (Trachypachus_gibbsii). The cluster threshold was set to 0.15. The rest of the options were specified according to the parameters in File S1: Table S3 separately for each tiling design experiment.
Baits with multiple hits to a reference genome (Bembidion corgenoma, Gustafson et al., 2019) were removed with BaitFilter v. 1.0.5 (Mayer et al., 2016) using the following options: --blast-min-hit-coverage-of-baits-in-tiling-stack 0.80 --blast-first-hit-evalue 0.0000001 --blast-second-hit-evalue 0.00001, in combination with BLAST+ v. 2.6.0 (Camacho et al., 2009). This filtering was performed separately for the set of baits that resulted from each tiling design experiment. Subsequently, the best bait region per protein-coding exon in each tiling design experiment was kept (option in BaitFilter: --mode fb). For those CDS regions that were captured in multiple tiling-design experiments, only the longest bait regions among experiments were considered. This was accomplished by adding the bait regions from the different experiments to a combined file of baits (non-redundantly for coding exons, starting from the results of experiments that allowed longer regions and then adding regions from experiments that allowed shorter bait regions) until the maximum size of ~5.99 Mbp of DNA was reached (i.e., the maximum size of bait sequences for the DNA target enrichment kit that was used: SureSelectXT2 Target Enrichment System, Agilent Technologies Inc., Santa Clara, U.S.A.). This last task was performed with a custom Perl script.
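The custom Perl script is not reproduced here. The following Python sketch only illustrates the merging logic described above, under stated assumptions: the `BaitRegion` record, its field names and the function `combine_bait_regions` are hypothetical, the size budget is counted in bait bases, and a region that would exceed the budget is skipped rather than ending the procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaitRegion:
    """Hypothetical record for one bait region of one tiling-design experiment."""
    exon_id: str     # coding exon (CDS feature) targeted by the region
    length: int      # length of the bait region in bp
    bait_bases: int  # summed length of all baits tiled across the region


def combine_bait_regions(experiments, max_bases):
    """Merge bait regions across experiments without exceeding max_bases.

    `experiments` is a list of lists of BaitRegion, ordered from the
    experiment that allowed the longest bait regions to the one that allowed
    the shortest. Each exon is taken only once, from the first experiment in
    which it occurs (i.e., with its longest available region).
    """
    chosen = {}
    used = 0
    for regions in experiments:
        for region in regions:
            if region.exon_id in chosen:
                continue  # exon already covered by a longer region
            if used + region.bait_bases > max_bases:
                continue  # assumption: skip regions that do not fit the budget
            chosen[region.exon_id] = region
            used += region.bait_bases
    return list(chosen.values()), used
```

With a budget of roughly 5.99 million bases, this would keep the longest region per exon and fill the remaining capacity with regions from the more permissive (shorter-region) experiments.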
Hybrid enrichment of target genomic DNA sequences
For enriching the target gDNA in the indexed libraries, we followed the procedure for a capture library size of > 3.0 Mb outlined in Agilent's SureSelectXT2 Target Enrichment System Protocol for Illumina Paired-End Multiplexed Sequencing (Version E1 published in June 2015, pages 55–74), with minor modifications (see Bank et al., 2017). We used a SureSelectXT2 Custom 5.99 Mbp library of 49,786 120-bp-long baits and pooled the indexed libraries (8 samples per pool) before the hybridization reaction. After pooling the libraries, the total volume of the pools was reduced to 3.5 μl with a SpeedVac SPD 111V (ThermoFisher Scientific, Waltham, MA, U.S.A.). Hybridization with the baits was allowed for 48 h at 65 °C in a GeneAmp PCR System 2720. We then physically separated the target DNA from the remaining DNA by adding 50 μl Dynabeads MyOne Streptavidin T1 beads and incubating the mixture for 30 min at room temperature (Bank et al., 2017). After washing the beads, the captured DNA was re-suspended in 30 μl nuclease-free water and post-amplified in an on-bead PCR reaction (Bank et al., 2017). For post-amplification, we followed Agilent's protocol by applying the PCR cycling program for a capture library size of > 1.5 Mb with a slightly increased cycle number as described by Bank et al. (2017). We purified the amplified target DNA with AMPure XP beads in a ratio of 1:0.75 to remove oligonucleotide primer dimers and to further select fragments with a length between 200 and 500 bp (Bank et al., 2017). Each of the twelve library pools was eluted in 40 μl nuclease-free water and checked for quality and quantity with a Fragment Analyzer and a Quantus Fluorometer.
Quality trimming of raw genomic reads and assembly of the target-DNA enrichment data
Raw reads and bases of poor quality as well as Illumina adapter sequences were removed with Trimmomatic v. 0.38 (Bolger et al., 2014) using the following options: ILLUMINACLIP:TruSeq3PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25 (note: the core Illumina adapter sequence is the same in the TruSeq and NEBNext library preparation kits). Genomic assemblies were generated with a customized compilation of IDBA-UD v. 1.1.3 (Peng et al., 2012), as described by Mayer et al. (2016), so that the software could handle read lengths of 150 bp, using the options: --step 5 --maxk 120.
Sequencing, assembly and cleaning of new transcriptomes and exploitation of previously published transcriptomes
We included 38 transcriptomes in our combined dataset for inferring the phylogeny of Adephaga. Thirty-six of these transcriptomes have been used in other phylogenetic studies (File S1: Table S1) (Boussau et al., 2014; McKenna et al., 2019; Misof et al., 2014; Pauli et al., 2016; Peters et al., 2014; Pflug et al., 2020; Seppey et al., 2019; Van Belleghem et al., 2012; Vasilikopoulos et al., 2020, 2019). The transcriptomes of Chlaenius tricolor and B. corgenoma were included here for the first time in a phylogenetic analysis, although the transcriptome of B. corgenoma has already been published in another study (Pflug et al., 2020; as Bembidion sp. nr. transversale, see File S1: Table S1).
Libraries for C. tricolor were prepared at Oregon State University. In short, mRNA was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module (New England Biolabs, Ipswich, MA, U.S.A.), and libraries were constructed with the NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs). The fragment size distribution of each library was characterized with a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, U.S.A.) using the High Sensitivity DNA Analysis Kit and 1 μl of sample. These libraries were then sequenced on an Illumina HiSeq 2000 at Oregon State University.
Reads for the transcriptome of C. tricolor were trimmed with Trimmomatic and assembled with Trinity (v. r20140413) (Grabherr et al., 2011), both implemented within the Agalma v. 0.5 pipeline, using default parameters for all parts of the pipeline (Dunn et al., 2013).
Cross-contamination checks and vector contamination screening for transcriptomes
Most analyzed transcriptomes have been previously checked for cross-species contaminations and vector contaminations (see File S1: Table S1). Cross-contamination checks were only performed here for the transcriptomes of C. tricolor, Trachypachus gibbsii, B. corgenoma and Amphizoa insolens because these transcriptomes were initially processed in the same laboratory at Oregon State University. The cross-contamination check was performed among the transcriptomes of these four species and also included other transcriptomes processed in the same laboratory (not included in our analyses) using Croco v. 1.1 (Simion et al., 2018) with the following options: --tool: Kallisto, --fold-threshold: 2, --trim5: 0, --trim3: 0, --minimum-coverage: 0.2, --suspect_id: 99, --suspect_len: 200, --overexpression: 300. The filtered transcriptome assemblies were subsequently screened for vector contaminations using the UniVec v. 10.0 database as described by Misof et al. (2014). The raw sequenced reads of the transcriptome of C. tricolor have been deposited at the NCBI-SRA database (File S1: Table S1).
Orthology assignment for target DNA-enrichment data and transcriptomes
Because Orthograph is designed to process transcriptomic data (Petersen et al., 2017), we changed the default exonerate model for orthograph-reporter to “protein2dna” for processing the genomic libraries (option: --exonerate-alignment-model). The protein2dna model aligns the query protein sequence to a DNA sequence, incorporating all the appropriate gaps and frameshifts but without modelling introns (Slater & Birney, 2005). The orthology assignment for the transcriptomes was performed with the default exonerate model (i.e., “protein2genome”). The rest of the options were identical for all datasets: max-blast-searches = 50, blast-max-hits = 50, orf-overlap-minimum = 0.5, extend-orf = 1, minimum-transcript-length = 30, substitute-u-with = X and otherwise default options.
Manual inspection and curation of amino-acid alignments
When manually inspecting the amino-acid alignments, we observed in a few instances that the hybrid-capture data of very few species shared the same few amino-acid residues that were not observed in the same column in sequences of transcriptome or genome reference data. When this phenomenon was observed at the borders of the captured regions and all transcriptome or genome data were homogeneous in terms of amino-acid residues for that column, we considered that these residues could represent sequencing errors, frameshift errors or intronic residues (see Bank et al., 2017), and we masked them manually with an X. We acknowledge that this might have masked a few true amino-acid residues at the borders of the captured regions, but we suggest that it is better to follow a conservative approach intended to remove erroneous sequence data rather than to include data with potentially erroneous phylogenetic signal.
Outlier sequence removal based on pairwise BLOSUM62 distances using a sliding window approach in the multiple sequence alignments of individual genes
We screened the amino-acid multiple sequence alignments for outlier sequences before removing randomly similar sections with ALISCORE (Fig. 1). Outlier identification and removal was performed with the same procedure described by Dietz et al. (2019) (the OliInSeq program is available upon request from C. Mayer). Sequences identified as outliers in 25% or more of the sequence windows were removed completely from the multiple sequence alignment. The window size for the pairwise comparisons was set to 20 amino acids. Corresponding outlier sequences were subsequently also removed from the codon-based nucleotide sequence data.
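The per-window outlier criterion itself follows Dietz et al. (2019) and is not restated here. The sketch below therefore covers only the final removal rule (outlier in 25% or more of the windows), assuming that per-window outlier flags have already been computed; the function name and the input layout are hypothetical and do not reflect the OliInSeq interface.

```python
def sequences_to_remove(window_flags, min_fraction=0.25):
    """Return IDs of sequences flagged as outliers in >= min_fraction of windows.

    `window_flags` maps a sequence ID to a list of booleans, one per
    alignment window (True = the sequence was an outlier in that window).
    How the flags are derived from BLOSUM62 distances is not modelled here.
    """
    removed = []
    for seq_id, flags in window_flags.items():
        if flags and sum(flags) / len(flags) >= min_fraction:
            removed.append(seq_id)
    return removed
```

The same list of IDs could then be used to drop the corresponding sequences from the codon-based nucleotide alignment of that gene.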
Controlling for among-species compositional heterogeneity, removal of distantly related outgroup species and removal of ingroup species with long branches
In order to reduce the sensitivity of our phylogenetic analyses to compositional heterogeneity among species, we generated and analyzed a Dayhoff6-recoded version of supermatrix D. As an alternative approach to reduce among-species compositional heterogeneity in the data, we generated another independent supermatrix by keeping only the 50% of genes with the lowest degree of among-species compositional heterogeneity (RCFV values calculated with BaCoCa v. 1.105, Kück & Struck, 2014). The 322 compositionally homogeneous genes (i.e., those with reduced RCFV values) were then concatenated into a new supermatrix, which was subsequently trimmed with BMGE (h = 0.5, BLOSUM62) to remove hypervariable sites (supermatrix I, Table 1 of the main text).
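The two approaches to compositional heterogeneity can be illustrated with a short Python sketch. It is not the pipeline used in the study: the recoding assumes the standard six Dayhoff groups (AGPST, DENQ, HKR, ILMV, FWY, C) and maps every other character to a gap, and the RCFV function assumes the usual definition (summed absolute deviations of per-taxon amino-acid frequencies from their mean across taxa, divided by the number of taxa) rather than calling BaCoCa. All function names are hypothetical.

```python
from collections import Counter

# Assumed standard Dayhoff-6 groups, recoded as the digits 0-5.
DAYHOFF6_GROUPS = ("AGPST", "DENQ", "HKR", "ILMV", "FWY", "C")
DAYHOFF6 = {aa: str(i) for i, grp in enumerate(DAYHOFF6_GROUPS) for aa in grp}
AMINO_ACIDS = "".join(DAYHOFF6_GROUPS)


def dayhoff6_recode(seq):
    """Recode an amino-acid sequence; gaps, X and other symbols become '-'."""
    return "".join(DAYHOFF6.get(c, "-") for c in seq.upper())


def rcfv(alignment):
    """Relative composition frequency variability of one gene alignment.

    `alignment` maps taxon name -> aligned amino-acid sequence. Taxa without
    any unambiguous residue are ignored (an assumption of this sketch).
    """
    freqs = []
    for seq in alignment.values():
        counts = Counter(c for c in seq.upper() if c in DAYHOFF6)
        total = sum(counts.values())
        if total:
            freqs.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    if not freqs:
        raise ValueError("alignment contains no amino-acid residues")
    n = len(freqs)
    score = 0.0
    for aa in AMINO_ACIDS:
        mean = sum(f[aa] for f in freqs) / n
        score += sum(abs(f[aa] - mean) for f in freqs) / n
    return score


def least_heterogeneous_half(gene_alignments):
    """Names of the 50% of genes with the lowest RCFV values."""
    ranked = sorted(gene_alignments, key=lambda g: rcfv(gene_alignments[g]))
    return ranked[: len(ranked) // 2]
```

Recoding collapses substitutions within a group, which trades phylogenetic information for reduced compositional bias, whereas gene filtering by RCFV keeps the full amino-acid alphabet but discards half of the loci.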
We tested whether removal of distantly related outgroup species affected the phylogenetic relationships, as has been previously suggested for other taxonomic groups (Philippe et al., 2009; Pisani et al., 2015). For this reason, we generated one additional matrix by removing distantly related outgroup species from supermatrix F (Table 1). Species-specific LB scores were calculated with TreSpEx v. 1.1 (Struck, 2014), using the tree inferred from supermatrix F under the site-heterogeneous model (SHETU). Subsequently, all outgroup species were removed except for the two species of Archostemata (16 removed species). From the putative closest outgroup clade of Adephaga, which includes the suborders Archostemata + Myxophaga (McKenna et al., 2019), we kept the two species of Archostemata because they had lower LB scores than the two species of Myxophaga. This filtering resulted in supermatrix G (n = 120 species).
In a second step, we tested whether removal of long-branched ingroup species affected phylogenetic reconstructions. Specifically, we repeated the calculation of LB scores based on the tree inferred under the SHETU model for supermatrix G and removed the 20 ingroup species with the highest LB scores (File S1: Table S7). This removal resulted in supermatrix H (n = 100 species).
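For orientation, the LB score of a species can be computed from a matrix of patristic distances. The sketch below assumes the definition given by Struck (2014), i.e., the percentage deviation of a taxon's mean patristic distance to all other taxa from the mean over all pairwise distances; it does not call TreSpEx, and the function names and the dictionary-based distance matrix are hypothetical.

```python
from itertools import combinations


def lb_scores(dist):
    """LB score per taxon from a symmetric patristic distance matrix.

    `dist[a][b]` is the patristic distance between taxa a and b on the
    inferred tree. Score = (mean distance of the taxon to all others /
    mean of all pairwise distances - 1) * 100.
    """
    taxa = sorted(dist)
    pairs = list(combinations(taxa, 2))
    overall = sum(dist[a][b] for a, b in pairs) / len(pairs)
    scores = {}
    for taxon in taxa:
        mean_t = sum(dist[taxon][o] for o in taxa if o != taxon) / (len(taxa) - 1)
        scores[taxon] = (mean_t / overall - 1.0) * 100.0
    return scores


def longest_branched(scores, candidates, k):
    """The k candidate taxa (e.g., ingroup species) with the highest LB scores."""
    return sorted(candidates, key=lambda t: scores[t], reverse=True)[:k]
```

Calling `longest_branched(scores, ingroup_species, 20)` on the scores of supermatrix G would correspond to the selection step described above; positive scores indicate taxa on longer-than-average branches.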
Calculation of supermatrix statistics
We inferred substitution saturation plots for most analyzed amino-acid supermatrices. Pairwise patristic and p-distances were calculated with TreSpEx v. 1.1 (Struck, 2014) by providing the best maximum-likelihood (ML) trees and their corresponding amino-acid supermatrices. Substitution saturation plots were then inferred in R v. 3.6.3 (R Core Team, 2020).
Model selection and phylogenetic inference in concatenation-based analyses
Phylogenetic tree reconstructions and model selection analyses were conducted with IQ-TREE v. 1.6.12 (Nguyen et al., 2015). Model selection for all unpartitioned supermatrices (B–J) also tested the relative fit of empirical profile site-heterogeneous mixture models (option: -mfreq FU, F, C20+F, C40+F, C60+F). During model selection, these amino-acid frequency profiles were combined with the exchange rates of the most commonly used single-matrix amino-acid models: LG (Le & Gascuel, 2008), WAG (Whelan & Goldman, 2001), JTT (Jones et al., 1992), JTTDCMUT, and DCMUT (Kosiol & Goldman, 2005). Additionally, the model selection procedure on the unpartitioned matrices (B–J) involved all possible combinations for modelling among-site rate heterogeneity in the data (options: -mrate E,I,G,I+G,R -cmin 4) (Kalyaanamoorthy et al., 2017). Lastly, we also included the LG4X and LG4M mixture models in our model selection procedure (Le et al., 2012). For the partitioned supermatrices (B, C), an optimal partitioning scheme with site-homogeneous models was inferred using the rcluster algorithm (Lanfear et al., 2014) in IQ-TREE 1.6.12 (options: -m MFP+MERGE -rcluster 10 -rcluster-max 5000 -merit AICc -mset LG, WAG, JTT, JTTDCMUT, DCMUT -madd LG4X, LG4M -mrate E, I, G, I+G, R). We inferred 15 maximum-likelihood trees under the site-homogeneous partitioned models and 15 trees under the unpartitioned site-homogeneous models (i.e., the site-homogeneous models that showed the best fit to the datasets when site-heterogeneous models were excluded). Only one tree search for each supermatrix was conducted under the better-fitting site-heterogeneous models due to computational limitations. Lastly, we performed 15 independent tree searches with the approximation to the site-heterogeneous models (PMSF) (Wang et al., 2018), using the best tree that resulted from the analyses under the unpartitioned site-homogeneous model as a guide tree. All maximum-likelihood tree searches were performed using random starting trees (option: -t RANDOM), and the best tree in each analysis (among the 15 inferred trees wherever applicable) was selected based on the log-likelihood scores. The relative fit of partitioned, unpartitioned site-homogeneous and unpartitioned site-heterogeneous models was assessed on a fixed neighbor-joining tree computed with MEGA X v. 10.0.5 (Kumar et al., 2018) using the JTT model with uniform rates. Comparison of partitioned with unpartitioned models was then performed in IQ-TREE (same version) based on this fixed neighbor-joining tree using both the AICc and BIC criteria (File S1: Table S10).
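The information criteria used for these comparisons can be written out explicitly. The sketch below uses the standard definitions of AICc and BIC, with the number of alignment sites taken as the sample size (an assumption of this sketch); the function names are hypothetical, and IQ-TREE reports these quantities itself.

```python
import math


def aicc(log_likelihood, n_params, n_sites):
    """Corrected Akaike information criterion (lower is better)."""
    if n_sites - n_params - 1 <= 0:
        raise ValueError("AICc is undefined when n_sites <= n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / (n_sites - n_params - 1)


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


def delta(score_candidate, score_reference):
    """Difference between two scores; negative values favour the candidate."""
    return score_candidate - score_reference
```

As a check against File S1: Table S9, `delta(843160.811, 891371.387)` gives about -48210.58, the ΔAICc listed there for the same pair of AICc values (best model versus best site-homogeneous model).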
Optimization of the partitioning scheme and model selection for the nucleotide sequence data (option: -st DNA) was performed in IQ-TREE v. 1.6.12 with the following options: -m MFP+MERGE -mset GTR, K2P, F81, TN, JC, HKY -rcluster-max 5000 -rcluster 10 -mrate E, I, G, I+G, R. The unpartitioned nucleotide sequence datasets were analyzed using the same combinations of models as above. We performed 15 independent maximum-likelihood tree searches under the best-fitting models in all cases and calculated branch support based on 2,000 ultrafast bootstrap replicates (Hoang et al., 2018). For supermatrix B_nt we also performed 15 independent tree searches using the unpartitioned GTR+FO*H4 model, which accounts for heterotachy among sequences (Crotty et al., 2020). Deviation of the nucleotide supermatrices from stationary, reversible and homogeneous conditions, as well as completeness scores, were assessed using the same statistical criteria and software that were used to analyze the amino-acid supermatrices (File S1: Table S11, see main text).
Single gene-tree phylogenetic inference and likelihood mapping analyses for individual genes
For inferring the gene trees, we first selected the best-fit substitution model for each of the 348 genes in IQ-TREE 1.6.12 based on the AICc criterion, using the same set of site-homogeneous models as above (-merit AICc, cmax, cmin: default), and inferred 10 gene trees per gene using the random starting tree option (-t RANDOM). We then selected the gene tree with the best log-likelihood score for each gene for downstream analyses. Branch support for each gene tree was estimated based on 10,000 SH-aLRT replicates (Guindon et al., 2010), because this measure was previously shown to outperform other branch support measures for identifying dubious clades in single gene-tree analyses (Simmons & Kessenich, 2020). Branches with support values lower than 50% were collapsed before performing summary coalescent analyses for all subsets of gene trees (see main text). To measure the phylogenetic informativeness of individual genes, we performed likelihood-mapping analyses with the same version of IQ-TREE and the best-fitting models. We considered all possible quartets for each gene (options -lmap ALL -wql), and the proportion of fully resolved quartets was calculated with a custom Python script.
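The custom Python script is not part of this file. The following is a minimal sketch of how the proportion of fully resolved quartets could be computed, assuming the classical likelihood-mapping scheme in which each quartet is assigned to the nearest of seven attractors of the simplex (three corners, three edge midpoints and the centre). It starts from the three log-likelihoods per quartet and deliberately does not parse IQ-TREE output files; all names are hypothetical.

```python
import math

# Attractors of the likelihood-mapping triangle (assumed seven-region scheme).
ATTRACTORS = {
    "resolved_1": (1.0, 0.0, 0.0),
    "resolved_2": (0.0, 1.0, 0.0),
    "resolved_3": (0.0, 0.0, 1.0),
    "partly_12": (0.5, 0.5, 0.0),
    "partly_13": (0.5, 0.0, 0.5),
    "partly_23": (0.0, 0.5, 0.5),
    "star": (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0),
}


def quartet_weights(log_likelihoods):
    """Posterior weights of the three quartet topologies from their lnL."""
    best = max(log_likelihoods)
    rel = [math.exp(x - best) for x in log_likelihoods]  # avoids underflow
    total = sum(rel)
    return tuple(r / total for r in rel)


def nearest_region(weights):
    """Name of the attractor closest (Euclidean distance) to the weights."""
    return min(ATTRACTORS, key=lambda name: math.dist(weights, ATTRACTORS[name]))


def proportion_fully_resolved(quartet_lnls):
    """Fraction of quartets falling into one of the three corner regions."""
    regions = [nearest_region(quartet_weights(lnl)) for lnl in quartet_lnls]
    if not regions:
        raise ValueError("no quartets supplied")
    return sum(r.startswith("resolved") for r in regions) / len(regions)
```

A gene whose quartets mostly fall into the corner regions carries a clear signal for one topology per quartet, which is the sense in which the proportion serves as a measure of phylogenetic informativeness.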
References
Bank, S., Sann, M., Mayer, C. et al. (2017) Transcriptome and target DNA enrichment sequence data provide new insights into the phylogeny of vespid wasps (Hymenoptera: Aculeata: Vespidae). Molecular Phylogenetics and Evolution, 116, 213–226.
Bolger, A.M., Lohse, M. & Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.
Boussau, B., Walton, Z., Delgado, J.A. et al. (2014) Strepsiptera, phylogenomics and the long branch attraction problem. PLoS One, 9, e107709.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. & Madden, T.L. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
Crotty, S.M., Minh, B.Q., Bean, N.G., Holland, B.R., Tuke, J., Jermiin, L.S. & von Haeseler, A. (2020) GHOST: recovering historical signal from heterotachously evolved sequence alignments. Systematic Biology, 69, 249–264.
Dietz, L., Dömel, J.S., Leese, F., Mahon, A.R. & Mayer, C. (2019) Phylogenomics of the longitarsal Colossendeidae: the evolutionary history of an Antarctic sea spider radiation. Molecular Phylogenetics and Evolution, 136, 206–214.
Dönitz, J., Schmitt-Engel, C., Grossmann, D. et al. (2014) iBeetle-Base: a database for RNAi phenotypes in the red flour beetle Tribolium castaneum. Nucleic Acids Research, 43, D720–D725.
Dunn, C.W., Howison, M. & Zapata, F. (2013) Agalma: an automated phylogenomics workflow. BMC Bioinformatics, 14, 330.
Grabherr, M.G., Haas, B.J., Yassour, M. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29, 644–652.
Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W. & Gascuel, O. (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology, 59, 307–321.
Gustafson, G.T., Alexander, A., Sproul, J.S., Pflug, J.M., Maddison, D.R. & Short, A.E.Z. (2019) Ultraconserved element (UCE) probe set design: base genome and initial design parameters critical for optimization. Ecology and Evolution, 9, 6933–6948.
Herndon, N., Shelton, J., Gerischer, L. et al. (2020) Enhanced genome assembly and a new official gene set for Tribolium castaneum. BMC Genomics, 21, 47.
Hoang, D.T., Chernomor, O., von Haeseler, A., Minh, B.Q. & Le, S.V. (2018) UFBoot2: improving the ultrafast bootstrap approximation. Molecular Biology and Evolution, 35, 518–522.
Jones, D.T., Taylor, W.R. & Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences, 8, 275–282.
Kalyaanamoorthy, S., Minh, B.Q., Wong, T.K.F., von Haeseler, A. & Jermiin, L.S. (2017) ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods, 14, 587–589.
Kim, H.S., Murphy, T., Xia, J. et al. (2010) BeetleBase in 2010: revisions to provide comprehensive genomic information for Tribolium castaneum. Nucleic Acids Research, 38, D437–D442.
Kosiol, C. & Goldman, N. (2005) Different versions of the Dayhoff rate matrix. Molecular Biology and Evolution, 22, 193–199.
Kück, P. & Struck, T.H. (2014) BaCoCa - a heuristic software tool for the parallel assessment of sequence biases in hundreds of gene and taxon partitions. Molecular Phylogenetics and Evolution, 70, 94–98.
Kumar, S., Stecher, G., Li, M., Knyaz, C. & Tamura, K. (2018) MEGA X: molecular evolutionary genetics analysis across computing platforms. Molecular Biology and Evolution, 35, 1547–1549.
Lanfear, R., Calcott, B., Kainer, D., Mayer, C. & Stamatakis, A. (2014) Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evolutionary Biology, 14, 82.
Le, S.Q., Dang, C.C. & Gascuel, O. (2012) Modeling protein evolution with several amino acid replacement matrices depending on site rates. Molecular Biology and Evolution, 29, 2921–2936.
Le, S.Q. & Gascuel, O. (2008) An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25, 1307–1320.
Mayer, C., Sann, M., Donath, A. et al. (2016) BaitFisher: a software package for multispecies target DNA enrichment probe design. Molecular Biology and Evolution, 33, 1875–1886.
McKenna, D.D., Shin, S., Ahrens, D. et al. (2019) The evolution and genomic basis of beetle diversity. Proceedings of the National Academy of Sciences, 116, 24729–24737.
Misof, B., Liu, S., Meusemann, K. et al. (2014) Phylogenomics resolves the timing and pattern of insect evolution. Science, 346, 763–767.
Nguyen, L.-T., Schmidt, H.A., von Haeseler, A. & Minh, B.Q. (2015) IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32, 268–274.
Pauli, T., Vedder, L., Dowling, D. et al. (2016) Transcriptomic data from panarthropods shed new light on the evolution of insulator binding proteins in insects. BMC Genomics, 17, 861.
Peng, Y., Leung, H.C.M., Yiu, S.M. & Chin, F.Y.L. (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28, 1420–1428.
Peters, R.S., Meusemann, K., Petersen, M. et al. (2014) The evolutionary history of holometabolous insects inferred from transcriptome-based phylogeny and comprehensive morphological data. BMC Evolutionary Biology, 14, 52.
Petersen, M., Meusemann, K., Donath, A. et al. (2017) Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. BMC Bioinformatics, 18, 111.
Philippe, H., Derelle, R., Lopez, P. et al. (2009) Phylogenomics revives traditional views on deep animal relationships. Current Biology, 19, 706–712.
Pflug, J.M., Holmes, V.R., Burrus, C., Johnston, J.S. & Maddison, D.R. (2020) Measuring genome sizes using read-depth, k-mers, and flow cytometry: methodological comparisons in beetles (Coleoptera). G3 Genes|Genomes|Genetics, 10, 3047–3060.
Pisani, D., Pett, W., Dohrmann, M. et al. (2015) Genomic data do not support comb jellies as the sister group to all other animals. Proceedings of the National Academy of Sciences, 112, 15402–15407.
R Core Team (2020) R: A Language and Environment for Statistical Computing.
Richards, S., Gibbs, R.A., Weinstock, G.M. et al. (2008) The genome of the model beetle and pest Tribolium castaneum. Nature, 452, 949–955.
Seppey, M., Ioannidis, P., Emerson, B.C. et al. (2019) Genomic signatures accompanying the dietary shift to phytophagy in polyphagan beetles. Genome Biology, 20, 98.
Simion, P., Belkhir, K., François, C. et al. (2018) A software tool “CroCo” detects pervasive cross-species contamination in next generation sequencing data. BMC Biology, 16, 28.
Simmons, M.P. & Kessenich, J. (2020) Divergence and support among slightly suboptimal likelihood gene trees. Cladistics, 36, 322–340.
Slater, G.S.C. & Birney, E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6, 31.
Struck, T.H. (2014) TreSpEx - detection of misleading signal in phylogenetic reconstructions based on tree information. Evolutionary Bioinformatics, 10, 51–67.
Van Belleghem, S.M., Roelofs, D., Van Houdt, J. & Hendrickx, F. (2012) De novo transcriptome assembly and SNP discovery in the wing polymorphic salt marsh beetle Pogonus chalceus (Coleoptera, Carabidae). PLoS One, 7, e42605.
Vasilikopoulos, A., Balke, M., Beutel, R.G. et al. (2019) Phylogenomics of the superfamily Dytiscoidea (Coleoptera: Adephaga) with an evaluation of phylogenetic conflict and systematic error. Molecular Phylogenetics and Evolution, 135, 270–285.
Vasilikopoulos, A., Misof, B., Meusemann, K. et al. (2020) An integrative phylogenomic approach to elucidate the evolutionary history and divergence times of Neuropterida (Insecta: Holometabola). BMC Evolutionary Biology, 20, 64.
Wang, H.-C., Minh, B.Q., Susko, E. & Roger, A.J. (2018) Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Systematic Biology, 67, 216–235.
Whelan, S. & Goldman, N. (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular Biology and Evolution, 18, 691–699.
Fig. S1: Flowchart summarizing the steps for generating the different amino-acid supermatrices after processing of the individual COGs (see also Fig. 1 of the main text).
[Flowchart; box labels are listed below, the original layout and arrows are not recoverable]
Datasets: Starting supermatrix; Supermatrices A, B, C, D, E, F, G, H, I and J; Supermatrix D - recoded (Recoded supermatrix).
Dataset categories: Partition boundaries, full taxon sampling; No partition boundaries, full taxon sampling; No partition boundaries, reduced taxon sampling.
Processing steps:
Filtering of COGs and of initial supermatrix.
Remove outlier sequences; trim COGs with ALISCORE; remove partitions with zero information content (MARE).
Trim with BMGE (BLOSUM62, h=0.5).
Trim with BMGE (BLOSUM62, h=0.4).
Trim with BMGE (BLOSUM62, h=0.3).
Trimmed gene partitions with BMGE (BLOSUM62, h=0.4). Keep genes with length >= 50 sites.
Keep only half of the genes with the lowest RCFV value. Trim resulting matrix with BMGE (BLOSUM62, h=0.5).
Remove partitions that failed symmetry tests (genes violating model assumptions) with IQ-TREE and trim with BMGE (BLOSUM62, h=0.5).
Recode states with Dayhoff-6 strategy.
Remove distantly related outgroups.
Remove long-branched ingroup species from supermatrix G (TreSpEx).
Trimmed gene partitions with BMGE (BLOSUM62, h =0.5). Keep genes with >= 80 sites and