Comparative analysis of transcription factor binding sites assessment approaches

ISCB Africa ASBCB Conference on Bioinformatics 2015, Dar es Salaam, Tanzania, March 2015 (abstract only)

Caleb K. Kibet and Philip Machanick

Transcription factor binding site (TFBS) prediction remains a challenge in gene regulatory research owing to the degeneracy of binding sites and the many potential binding sites in the genome. Dozens of algorithms have been designed to learn TFBS models, generating multiple models that are available in research papers, with a few making it into databases like JASPAR, UniPROBE and TRANSFAC. The presence of many versions of motifs for a single TF across the various databases makes it difficult for biologists to make an informed choice of which motif models to use, and for algorithm developers to benchmark, test and continually improve their models. Currently, diverse assessment techniques are used with very little standardization. In fact, conflicting results are obtained depending on who performs the assessment, which technique is used, which data type is used and how the parameters are set. Therefore, there is a need for a standardized motif assessment approach. In this study, we reviewed and tested the approaches that various algorithms have used in motif assessment, and we highlight the discrepancies and weaknesses that arise from using diverse approaches and data. We identify the motif scoring approach, motif length, the data used for assessment and the type of performance metric used as some of the factors that influence the outcome of a motif assessment. In light of this, we are developing an approach that allows quick assessment of position weight matrix-based motif models against a variety of models for the same transcription factor using ChIP-seq, PBM and other data sets. Preliminary output challenges the results of some recent comparative motif assessments from a variety of data.

Extra background information and justification

This paper is motivated by the difficulty of choosing the best motif models among the variety available in various databases and on research groups’ websites. In an effort to improve in vivo prediction by the currently available in vitro-derived models, we were unable to confidently judge and determine the benefit of combining transcription factor-specific in vitro data and general in vivo data. The use of performance metrics like AUC on ChIP-seq data produced varied results depending on the metrics used. Certain tools like BEEML perform better when an energy scoring approach is used than when occupancy scoring is used. This can be attributed to their use of an energy model; comparing them with techniques that use sum occupancy or GOMER scoring may introduce a different bias. The advancement of technologies utilizing high-throughput sequencing methods like ChIP-seq offers an opportunity to create benchmarking data for motif assessment. Some of the new algorithms may generate good models, but these do not necessarily make their way into the currently available motif databases, which either focus on motifs generated by a specific research group or method (e.g. UniPROBE for PBM data) or only contain motifs that have been rigorously curated. Therefore, in addition to providing a standardized motif assessment approach, we will provide an access point to multiple motifs generated using a variety of algorithms, accompanied by their performance on various data and uses, addressing a need in transcription factor research. This work answers a functional and regulatory genomics problem. It also fits into the database and resources development theme of the conference.
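
The dependence of an assessment on the scoring function, noted above for energy versus occupancy scoring, can be illustrated with a minimal Python sketch. It is not the approach under development. The toy motif, the sequences and all function names are hypothetical, and the "sum occupancy" function is a simplified stand-in (a sum of likelihood ratios over forward-strand windows). Energy and GOMER scoring are not implemented here.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts (A, C, G, T) into log-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column) + 4 * pseudocount
        pwm.append([math.log(((c + pseudocount) / total) / background)
                    for c in column])
    return pwm

def window_scores(pwm, seq):
    """Log-odds score of every motif-length window (forward strand only)."""
    width = len(pwm)
    return [sum(pwm[i][BASES.index(seq[start + i])] for i in range(width))
            for start in range(len(seq) - width + 1)]

def max_score(pwm, seq):
    """Score a sequence by its single best window."""
    return max(window_scores(pwm, seq))

def sum_occupancy_score(pwm, seq):
    """Simplified occupancy-style score: sum of likelihood ratios of all windows."""
    return sum(math.exp(s) for s in window_scores(pwm, seq))

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical toy data: a 4 bp motif with consensus ACGT, plus made-up
# "bound" (positive) and "unbound" (negative) sequences.
counts = [[8, 0, 0, 0], [0, 8, 0, 0], [0, 0, 8, 0], [0, 0, 0, 8]]
pwm = log_odds_pwm(counts)
positives = ["TTACGTTT", "GACGTGAA"]
negatives = ["TTTTTTTT", "GGCCGGCC"]

for name, scorer in [("max", max_score), ("sum occupancy", sum_occupancy_score)]:
    pos = [scorer(pwm, s) for s in positives]
    neg = [scorer(pwm, s) for s in negatives]
    print(name, auc(pos, neg))
```

Because the same motif and sequences are passed through interchangeable scoring functions, any difference in the reported AUC on real data would come from the scoring choice alone, which is the kind of effect a standardized assessment needs to control for.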