Comparative Assessment Suite for Transcription Factor Binding Motifs

RECOMB/ISCB Conference on Regulatory and Systems Genomics, with DREAM Challenges, Philadelphia, November 2015 (poster)


Caleb K. Kibet and Philip Machanick


Predicting transcription factor (TF) binding sites remains an open challenge because binding motifs are degenerate and the genome contains many potential binding sites. With the advent of high-throughput sequencing, several experimental approaches, including ChIP-seq, DNase-seq and ChIP-exo, and dozens of algorithms have been developed to address this challenge. The number of published motif models keeps growing, and the number held in databases has more than doubled in the last two years. However, there is no standardized means of motif assessment, let alone a computational tool to rank the available motifs for a given TF. This makes it hard for users to choose the best models and for algorithm developers to benchmark, test, quantify and improve their tools. We introduce a web server hosting a suite of tools that assesses PWM-based motif models using scoring, comparison and enrichment approaches. Given that there is no agreed standard for motif quality assessment, we present a range of measures so users can apply their own judgement. An assess-by-scoring approach uses motif models to score benchmark data partitioned into positive and background sets, then uses AUC, Pearson, MNCP and Spearman’s rank statistics to quantify their performance; the available scoring functions are energy, GOMER, sum occupancy and sum log-odds. An assess-by-comparison approach ranks the motifs for a given TF by their similarity to all available motifs in the database, using TOMTOM’s Euclidean distance function and FISim. It assumes the best model should be representative of the information in the others, provided a variety of data and algorithms was used to derive them. This is a quick, data-independent approach that has proved powerful, reproducing assess-by-scoring ranks with an average correlation above 0.7. A web interface to the tools uses the Django framework with a MySQL back end. The database contains 6,530 human and mouse motif models and benchmark data derived from available databases and publications.
A user-entered test motif for a given TF is ranked against motifs for the same TF in the database, using the available benchmark data as well as user-supplied data in BED or FASTA format. Results are presented as interactive visualisations that provide further information on motif clustering, similarity and ranks, with options to download publication-ready figures and ranked motif data. We have demonstrated the benefit of our web server in motif choice and ranking as well as in motif discovery. Web server and command-line versions will be made available (link to be added; estimated mid-October 2015).
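
As an illustration of the assess-by-scoring idea described above, the following minimal Python sketch scores positive and background sequences with a simplified sum log-odds function and summarises the separation with AUC. It is not the suite's implementation: the function names, the toy motif, the pseudocount, the uniform background and the choice to sum only positive window scores are all assumptions made for this example, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def log_odds_matrix(probabilities, background=0.25, pseudocount=0.01):
    """Turn per-position base probabilities into log2 odds against a uniform background."""
    matrix = []
    for column in probabilities:
        total = sum(column[base] for base in BASES) + 4 * pseudocount
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in BASES
        })
    return matrix


def sum_log_odds(matrix, sequence):
    """Sum the positive window scores along the forward strand.

    Assumption of this sketch: windows with a negative log-odds score
    contribute nothing. Sequences must be uppercase A/C/G/T only.
    """
    width = len(matrix)
    total = 0.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        total += max(score, 0.0)
    return total


def auc(positive_scores, background_scores):
    """Fraction of (positive, background) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(positive_scores) * len(background_scores))


# Hypothetical toy motif with consensus TGAC and made-up benchmark sequences.
consensus = "TGAC"
probabilities = [
    {base: 1.0 if base == letter else 0.0 for base in BASES}
    for letter in consensus
]
matrix = log_odds_matrix(probabilities)

positives = ["AATGACAA", "CCTGACGG"]     # contain the consensus
backgrounds = ["AAAAAAAA", "CGCGCGCG"]   # do not contain it

positive_scores = [sum_log_odds(matrix, s) for s in positives]
background_scores = [sum_log_odds(matrix, s) for s in backgrounds]
print("AUC:", auc(positive_scores, background_scores))
```

Swapping `sum_log_odds` for another scoring function, or `auc` for another statistic, leaves the rest of the sketch unchanged, which is the sense in which several measures can be reported side by side.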
Web server under development at http://bioinf.ict.ru.ac.za/.
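
The assess-by-comparison idea, ranking each motif by how similar it is to the other motifs for the same TF, can be sketched in the same way. The example below is a simplified stand-in rather than TOMTOM or FISim: it uses a plain per-column Euclidean distance, assumes all motifs have the same width and are already aligned, and the motif names and probabilities are invented for illustration.

```python
import math


def column_distance(col_a, col_b):
    """Euclidean distance between two columns of (A, C, G, T) probabilities."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(col_a, col_b)))


def motif_distance(motif_a, motif_b):
    """Mean per-column distance between two equal-width, pre-aligned motifs."""
    if len(motif_a) != len(motif_b):
        raise ValueError("this sketch only compares motifs of equal width")
    per_column = [column_distance(a, b) for a, b in zip(motif_a, motif_b)]
    return sum(per_column) / len(per_column)


def rank_by_similarity(motifs):
    """Order motif names by mean distance to all other motifs, most representative first.

    Requires at least two motifs.
    """
    mean_distance = {}
    for name, motif in motifs.items():
        distances = [
            motif_distance(motif, other)
            for other_name, other in motifs.items()
            if other_name != name
        ]
        mean_distance[name] = sum(distances) / len(distances)
    return sorted(mean_distance, key=mean_distance.get)


# Hypothetical motifs for one TF; each column is (A, C, G, T).
motifs = {
    "motif_x": [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)],
    "motif_y": [(0.8, 0.2, 0.0, 0.0), (0.0, 0.8, 0.2, 0.0)],
    "motif_z": [(0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 0.0)],
}

ranking = rank_by_similarity(motifs)
print(ranking)
```

In this toy set the motif that sits closest to the other two is ranked first and the outlier last, without any benchmark sequence data being consulted.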