Comparative analysis of transcription factor binding sites assessment approaches
Caleb K. Kibet and Philip Machanick
Transcription factor binding site (TFBS) prediction remains a challenge in gene
regulatory research due to degeneracy and multiple potential binding sites in
the genome. Dozens of algorithms have been designed to learn TFBS models,
generating multiple models available in research papers with a few making it to
databases like JASPAR, UniPROBE and Transfac. The presence of many versions of
motifs from the various databases for a single TF makes it difficult for
biologists to make an informed choice of the motif models to use and for the
algorithm developers to benchmark, test and continually improve on their models.
Currently, diverse techniques have been used with very little standardization.
In fact, conflicting results are obtained depending on who is performing the
assessment, which technique is used, which data type is used and the parameters
set in the assessment. Therefore, there is a need for a standardized motif
assessment approach. In this study, we reviewed and tested the approaches that
have been used by various algorithms in motif assessment and highlighted the
discrepancies and weaknesses of using diverse approaches and data for motif
assessment. We identify the motif scoring approach, motif length, the data
used for assessment and the type of performance metric as some of the
factors that influence the outcome of a motif assessment. In light of this, we
are developing an approach that allows for the quick assessment of position
weight matrix-based motif models against a variety of models for the same
transcription factor using ChIP-seq, PBM and other data sets. Preliminary results
challenge the findings of some recent comparative motif assessments across a
variety of data.
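To make the role of the performance metric concrete, the following is a minimal sketch of one widely used metric, the area under the ROC curve (AUC), computed as the probability that a bound (positive) sequence outscores an unbound (negative) one. The function name `auc` and the score lists are hypothetical and chosen for illustration only; they are not results from this study.

```python
def auc(positive_scores, negative_scores):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive sequence scores higher; ties count as half a win."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up motif scores for three bound and three unbound sequences.
bound = [0.9, 0.8, 0.4]
unbound = [0.5, 0.3, 0.2]
print(auc(bound, unbound))
```

Because the metric depends only on the ranking of scores, any change in the scoring function or in the choice of negative sequences that reorders the sequences will change the reported AUC.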
Extra background information and justification
This paper is motivated by the difficulty of choosing the best motif models from
the variety available in various databases and on research groups’ websites. In an
effort to improve in vivo prediction by the currently available in vitro-derived
models, we were unable to confidently judge and determine the benefit of
combining transcription factor-specific in vitro data and general in vivo data.
The use of performance metrics like AUC on ChIP-seq data produced varied results
depending on the metrics set. Certain tools, such as BEEML, perform better when
an energy scoring approach is used than when occupancy scoring is used. This can
be attributed to their use of an energy model; comparison with techniques that use
sum occupancy or GOMER scoring may introduce a different bias. The advancement
of technologies utilizing high-throughput sequencing methods like ChIP-seq
offers an opportunity to create benchmarking data for motif assessment. Some of the
new algorithms may generate good models but they do not necessarily make their
way to the currently available motif databases, which either focus on motifs
generated by a specific research group or method (e.g. UniPROBE for PBM data),
or only contain motifs that have been rigorously curated. Therefore, in addition
to providing a standardized motif assessment approach, we will provide an access
point to multiple motifs generated using a variety of algorithms, accompanied by
their performance on various data and uses, addressing a need in transcription
factor research. This work addresses a functional and regulatory genomics problem.
It also fits into the database and resources development theme of the conference.
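The dependence on the scoring approach described in this section can be illustrated with a minimal sketch that contrasts two simplified scoring functions: the best single-window log-odds score and a sum occupancy score, taken here as the sum of motif probabilities over all windows. The matrix, the sequences and the function names are hypothetical; the sketch scans one strand only, assumes a uniform background, and is not the energy or GOMER model used by any particular tool.

```python
import math

# Hypothetical 4-position motif (rows = positions; columns = A, C, G, T).
# The consensus is AGCT; all values are made up for illustration.
PFM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
BACKGROUND = 0.25  # assumed uniform background frequency
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def window_probabilities(pfm, seq):
    """Probability of each motif-length window under the motif model."""
    width = len(pfm)
    probs = []
    for start in range(len(seq) - width + 1):
        p = 1.0
        for offset in range(width):
            p *= pfm[offset][INDEX[seq[start + offset]]]
        probs.append(p)
    return probs


def max_log_odds(pfm, seq):
    """Score a sequence by its single best window (log-odds vs. background)."""
    width = len(pfm)
    return max(math.log(p / BACKGROUND ** width)
               for p in window_probabilities(pfm, seq))


def sum_occupancy(pfm, seq):
    """Score a sequence by summing window probabilities over all windows."""
    return sum(window_probabilities(pfm, seq))


# Two toy sequences of equal length (40 bp).
ONE_STRONG_SITE = "T" * 18 + "AGCT" + "T" * 18  # one perfect match
MANY_WEAK_SITES = "AGCA" * 10                   # ten one-mismatch matches

for name, seq in [("one strong site", ONE_STRONG_SITE),
                  ("many weak sites", MANY_WEAK_SITES)]:
    print(name, round(max_log_odds(PFM, seq), 3), round(sum_occupancy(PFM, seq), 4))
```

In this toy case the two scoring functions rank the sequences in opposite order: the best-window score favours the sequence with one perfect site, while sum occupancy favours the sequence with many weak sites. A rank-based metric computed on such scores would therefore differ with the scoring function alone.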