Improving transcription factor binding site prediction using a combination of universal protein binding microarray and DNase-seq data
SASBi-SAGS, Kwalata Game Ranch, Gauteng, South Africa, September 2014 (abstract only)
Caleb K. Kibet and Philip Machanick
Accurate prediction of transcription factor binding sites (TFBS) remains a
challenge in gene regulatory research. One in vitro technique, a protein binding
microarray (PBM), produces high-throughput binding data that covers all possible
10-mers and has been used to predict TFBS. Although comprehensive, a universal
PBM models binding without considering whether chromatin leaves the binding
sites accessible to transcription factors (TFs). In vivo techniques like
chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) model TF
binding more accurately, but are experimentally expensive and can only study a
single TF in a specific cell type. DNase hypersensitivity data
that combine chromatin accessibility data from multiple cell types are available
from the ENCODE database, but cannot be used independently to predict binding as
they are not specific to any TF. A technique that merges in vivo DNase
hypersensitivity data and in vitro PBM data could allow for better TFBS
prediction than use of PBM data alone. In this study, we infer the probability
that a k-mer is located in an open chromatin region based on its frequency count
in DNase-seq data. We then use a modified seed and wobble algorithm that
re-ranks the k-mer binding intensity data based on both the signal intensity
of each k-mer and the probability that it is located in an open chromatin
region. Preliminary analysis shows that our technique produces motifs with
higher information content than the tested reference motifs from the JASPAR
core database, which were learned from SELEX or
ChIP-seq data. Our motifs are also more enriched in ChIP-seq data than pure
PBM-derived models and, for some TFs such as Max, more enriched than the
reference motifs. These preliminary results suggest that our technique has a
good chance of improving TFBS prediction by combining in vivo and in vitro data.
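
The re-ranking idea described above can be illustrated with a minimal Python
sketch. The abstract does not state how the open-chromatin probability is
estimated or how it is combined with the PBM signal, so everything specific
here is an assumption made for illustration: the probability is taken to be a
smoothed, strand-merged k-mer count scaled by the largest such count among the
scored k-mers, and the combined score is taken to be intensity multiplied by
that probability. The function names (count_kmers, rerank_kmers) and the
pseudocount parameter are hypothetical. The sketch covers only the re-ranking
step, not the seed and wobble motif construction that would follow it.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an upper-case DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(sequences, k):
    """Count every k-mer occurring in the given DNase-seq region sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing N or other ambiguity codes.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def rerank_kmers(intensities, dnase_sequences, pseudocount=1.0):
    """Re-rank PBM k-mers by intensity weighted by an accessibility estimate.

    ASSUMPTION: the accessibility estimate and the multiplicative combination
    are illustrative choices, not the rule used in the study.
    intensities maps each k-mer (all of the same length) to its PBM signal.
    Returns (kmer, score) pairs sorted from highest to lowest score.
    """
    k = len(next(iter(intensities)))
    counts = count_kmers(dnase_sequences, k)

    merged = {}
    for kmer in intensities:
        rc = reverse_complement(kmer)
        # A palindromic k-mer is its own reverse complement; count it once.
        n = counts[kmer] if rc == kmer else counts[kmer] + counts[rc]
        merged[kmer] = n + pseudocount

    top = max(merged.values())
    scored = {
        kmer: intensities[kmer] * merged[kmer] / top for kmer in intensities
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy 4-mers and sequences, invented purely to exercise the functions.
    toy_intensities = {"ACGT": 10.0, "TTTT": 12.0, "GGCC": 8.0}
    toy_dnase = ["ACGTACGT", "GGACGTCC"]
    for kmer, score in rerank_kmers(toy_intensities, toy_dnase):
        print(kmer, round(score, 3))
```

In the toy data, TTTT has the highest raw intensity but never occurs in the
accessible sequences, so it drops below ACGT after re-ranking. A multiplicative
combination is only one option; a weighted sum or a rank-based combination
would fit the same description.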