Comparative Assessment Suite for Transcription Factor Binding Motifs
Caleb K. Kibet and Philip Machanick
Predicting transcription factor (TF) binding sites remains an active challenge
because binding motifs are degenerate and the genome contains many potential
binding sites. The advent of high-throughput sequencing has brought several
experimental approaches, including ChIP-seq, DNase-seq and ChIP-exo, and dozens
of algorithms developed to address
the challenge. An increasing number of motif models has been published, and those
in databases have more than doubled in the last two years. However, there is no
standardized means of motif assessment, let alone a computational tool to rank the
available motifs for a given TF. This makes it hard for users to choose the best
models and for algorithm developers to benchmark, test, quantify and improve their tools.
We introduce a web server hosting a suite of tools that assesses PWM-based motif
models using scoring, comparison and enrichment approaches. Given that there is
no agreed standard for motif quality assessment, we present a range of measures
so users can apply their own judgement. An assess-by-scoring approach uses motif
models to score benchmark data partitioned into positive and background sets, then
quantifies their performance with AUC, MNCP, and Pearson and Spearman’s rank
correlation statistics. The available scoring functions are energy, GOMER, sum
occupancy and sum log-odds.
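The assess-by-scoring idea can be illustrated with a minimal Python sketch. It is not the suite's implementation: the function names, the pseudocount, the uniform background and the toy sequences are all assumptions made for illustration, "sum log-odds" is read here as the sum of log-odds scores over every forward-strand window, and AUC is computed by direct pairwise comparison.

```python
import math

BASES = "ACGT"  # assumed column order for every matrix row


def log_odds_matrix(counts, pseudocount=0.25, background=0.25):
    """Turn per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        matrix.append(
            [math.log2(((c + pseudocount) / total) / background) for c in row]
        )
    return matrix


def sum_log_odds(sequence, matrix):
    """Sum the log-odds score of every motif-width window (forward strand only).

    Assumes the sequence contains only uppercase A, C, G and T.
    """
    width = len(matrix)
    score = 0.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score += sum(matrix[i][BASES.index(base)] for i, base in enumerate(window))
    return score


def auc(positive_scores, background_scores):
    """AUC as the fraction of positive/background pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in positive_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(positive_scores) * len(background_scores))


# Hypothetical toy data: a two-column motif that prefers "AC".
toy_matrix = log_odds_matrix([[10, 0, 0, 0], [0, 10, 0, 0]])
positives = ["GACT", "ACAC"]
background = ["GGTT", "TTGG"]
toy_auc = auc(
    [sum_log_odds(s, toy_matrix) for s in positives],
    [sum_log_odds(s, toy_matrix) for s in background],
)
```

A real assessment would also scan the reverse strand and use an empirical background model; both are left out to keep the sketch short.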
An assess-by-comparison approach seeks to rank, for a given TF, motifs based on
similarity to all available motifs in the database using TOMTOM’s Euclidean
distance function and FISim. It assumes the best model should be representative
of information in the others, provided a variety of data and algorithms is used.
This is a quick, data-independent approach that has proved to be powerful,
reproducing assess-by-scoring ranks with an average correlation above 0.7. A web
interface to the tools uses the Django framework with a MySQL back end. The
database contains 6,530 human and mouse motif models and benchmark data derived
from available databases and publications. A user-entered test motif for a given
TF is ranked against motifs for the same TF in the database using the available
benchmark data as well as user-supplied data in BED or FASTA format. Results are
returned in interactive visuals providing further information on motif clustering,
similarity and ranks, with options to download publication-ready figures and
ranked motif data. We have demonstrated the benefit of our web server in motif
choice and ranking as well as in motif discovery. Web server and command-line
versions are available (link to be added once available, estimated mid-October
2015).
Web server under development at http://bioinf.ict.ru.ac.za/.
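
The assess-by-comparison ranking described in the abstract can be sketched in the same spirit. This is a simplified stand-in, not TOMTOM or FISim: it assumes motifs of equal width that are already aligned, ignores offsets and reverse complements, and uses a plain mean per-column Euclidean distance. All names and the toy motifs are hypothetical.

```python
import math


def column_distance(col_a, col_b):
    """Euclidean distance between two base-frequency columns (A, C, G, T)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(col_a, col_b)))


def motif_distance(motif_a, motif_b):
    """Mean per-column Euclidean distance between two pre-aligned motifs."""
    if len(motif_a) != len(motif_b):
        raise ValueError("this sketch only compares motifs of equal width")
    distances = [column_distance(a, b) for a, b in zip(motif_a, motif_b)]
    return sum(distances) / len(distances)


def rank_by_similarity(motifs):
    """Rank motifs by mean distance to all other motifs, most representative first.

    `motifs` maps a motif name to a list of frequency columns.
    """
    if len(motifs) < 2:
        raise ValueError("need at least two motifs to compare")
    ranking = []
    for name, motif in motifs.items():
        others = [
            motif_distance(motif, other)
            for other_name, other in motifs.items()
            if other_name != name
        ]
        ranking.append((name, sum(others) / len(others)))
    return sorted(ranking, key=lambda item: item[1])


# Hypothetical one-column motifs for a single TF.
toy_motifs = {
    "m1": [[1.0, 0.0, 0.0, 0.0]],
    "m2": [[0.8, 0.2, 0.0, 0.0]],
    "m3": [[0.6, 0.4, 0.0, 0.0]],
}
toy_ranking = rank_by_similarity(toy_motifs)
```

In this toy set the motif that lies between the other two comes out on top, which mirrors the assumption that the best model is the one most representative of the information in the others.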