Flexible protein database based on amino acid k-mers

The database engine of kAAmer is based on log-structured merge-tree (LSM-tree) Key-Value (KV) storeseleven. LSM-trees are used in data-intensive operations such as web indexing12,13social networking14 and online gaming15.16. KAAmer uses Badger17an efficient implementation in Golang (https://golang.org/) of a WiscKey KV (key-value) store16. WiscKey’s LSM-tree design is optimized for solid state drives (SSD) and separates keys from values ​​to minimize data movement during the creation of the key-value store. KAAmer will obtain peak performance with modern hardware such as solid-state drives that offer good throughput in input/output (I/O) operations per second (IOPS) and will effectively accommodate use cases where many queries are sent simultaneously. A kAAmer database includes three KV stores (see Fig. 1a): one to provide the information on proteins (protein store) and two to enable the search functionalities (k-mer store and combination store). The k-mer store contains all the 7-mers found in the sequence dataset and the keys to the combination store, which uniquely serves the combination of proteins held by k-mers. The fixed k-mer size of 7 was chosen to fit on 4 bytes and keep a manageable database size while offering good specificity over protein targets. The k-merized design of a kAAmer database provides an interesting simplicity for the search tasks which will give an exact match count of all 7-mers between a protein query and all targets from a protein database. This strategy is not guaranteed to return the same homologous targets that would be obtained with alignment or HMM search and is therefore less suitable for distant homology retrieval.

Figure 1

(a) Design of a kAAmer database. Three key-value stores are created within a database (K-mer Store, Combination Store, Protein Store). Colors indicate the combination (hash) values ​​that are reused in the combination store. Proteins are numbered (p01, p02, p03) and k-mers are numbered (k01, k02, …, k08). (b) Protein search speed benchmark. Software include Blastp (v2.9.0+), Ghostz (v1.0.2), Diamond (v0.9.25) and kAAmer (v0.6) with (-aln) and without (-kmatch) alignment. (c) Protein search precision and recall benchmark with the ECOD database. The blue bars indicate the precision results and the red bars indicate the recall results. Software include Blastp (v2.9.0+), Ghostz (v1.0.2), Diamond (v0.9.25) and kAAmer (v0.6) with alignment.

In order to evaluate the performance and precision of kAAmer, we built a speed and a sensitivity benchmark against protein families of the ECOD database (homology groups)18. We evaluated four software: Blastp (v2.9.0+)3Ghostz (v1.0.2)19Diamond (v0.9.25)5 and kAAmer (v0.6). Other interesting software that make use of web servers for remote analyses, such as Sequenceservertwenty and MMseqs2twenty-one, are worth mentioning for the functionality that they offer. However, we limited our benchmark to the previously mentioned software for their alignment efficiency and their computational resource requirements. Note that we have tested two modes for Diamond, the more-sensitive and fast ones. Similarly, we tested two sensitivity modes with kAAmer, based on the minimal number of shared k-mers, which are k10 (at least 10 shared k-mers between protein query and target) and k1 (at least 1 shared k-mer). For kAAmer, each sensitivity mode was tested without alignment—the ratio of shared k-mers serving as a scoring function and also with a subsequent alignment form. The alignment purpose in kAAmer is to improve the scoring metrics while using the same result set as the raw alignment-free method.

Figure 1c illustrates the ROC curve results of the sensitivity benchmark. We observed that Ghostz, Blastp and Diamond-sensitive reported respectively the highest number of true positives regardless of false negatives. Then follows, kAAmer with the minimal number of shared k-mers, Diamond-fast and kAAmer with at least 10 shared k-mers. The ROC curve also shows the difference in precision of the alignment-free mode of kAAmer compared to the alignment modes. One of the main reasons would be the scoring scheme that uses the percentage of shared k-mers in contrast to bit score with the alignment results. Note that the minimal k-mer matches is an option provided by the user to tune the sensitivity of the protein search. We also compared our database engine with the aforementioned software for their execution time with different query dataset size. Thirteen different protein query datasets were randomly and uniquely chosen from the original ECOD database, with size ranging from 1 protein to 50,000 proteins. Figure 1b illustrates the wallclock times of the alignment software in comparison with kAAmer for protein homology searches. See the Methods section for the hardware used in the benchmarks. We observed with the larger query datasets (50,000 proteins) that kAAmer k10 in alignment-free mode completed the search in 46.5 s, while the alignment mode for kAAmer-k10 did it in 390.7 s. When using only one shared k-mers (kAAmer k1), the most sensitive mode in kAAmer, the execution times were 78.8 s without alignment and 966.6 s with alignment. The fast mode of Diamond completed the same task in 64.7 s, while it took 287.7 s with the sensitive mode. Ghost yielded results similar to the Diamond sensitive mode while Blastp reported results that were significantly slower than the other tested software. When comparing the speed results with the maximum number of queries (50,000 proteins), kAAmer in its alignment-free mode achieves performance comparable to the fast mode of Diamond, although the results will vary with the parameter of the minimal number of k-mer matches. used. The alignment mode of kAAmer obviously adds an overhead that will impact the running time results. Yet in combination with the minimal k-mer match of 1, it will offer better sensitivity at the detriment of speed.

In order to accommodate real-use cases, we built relevant kAAmer databases and investigated their usage in typical bacterial genomics analyses. It should be noted that annotation of genomes and gene identification rely heavily on the quality of the underlying database. What kAAmer has to offer is the inclusion of the protein information within the database combined with an efficient search functionality to facilitate downstream analyses. Therefore, we also provide utility scripts to demonstrate these use cases. The first use case was to identify antibiotic resistance genes (ARGs) in a bacterial genome and test its accuracy related to other ARG finder software. For ARG identification we used the NCBI Bacterial Antimicrobial Resistance Reference Gene Database (v2020-01-06.1)22 and compared the kAAmer results with the ResFinder (v3.2 and database 2019-10-01)23 and CARD (v5.1.0)24 software and database. The query genome is a pan-resistant Pseudomonas aeruginosa strain E613095225. Table 1 shows the results of the ARG identification within the query genome by the three software/databases tested. For the majority of antibiotic classes, the results are in agreement between the three databases. Interestingly, three aminoglycoside genes (aac(6′)-Il, ant(2”)-Ia and aacA8) were only found with kAAmer (NCBI-ARG) and ResFinder. On the other hand, several more antibiotic efflux systems are annotated in CARD and the number of identified efflux proteins in E6130952 goes up to 36 while only 3 were reported by kAAmer (NCBI-ARG) and none by ResFinder. Also 2 genes associated with resistance to peptide antibiotics (arnA, basS) and 2 other (soxR, carA) associated with multiple antibiotic classes were only reported by CARD. Other tested use cases include genome annotation and metagenome profiling as shown in the Methods section.

Table 1 Report of the antibiotic resistance gene identification in the pan-resistant Pseudomonas aeruginosa E6130952 strain from kAAmer+NCBI-arg, ResFinder and CARD databases.

In summary, kAAmer introduces a fast and flexible protein database engine to accommodate different genomics analyzes use cases. It can be hosted on-premise or in the cloud and be queried remotely while offering a flexible protein annotation scheme. Although it can be adapted to find more distant homology, it is best suited to quickly find close sequence homology with its k-mer matching functionality, while providing rich annotations on the identified protein targets.

Leave a Comment

Your email address will not be published.