Improving transcription factor binding site prediction using a combination of universal protein binding microarray and DNase-seq data

SASBi-SAGS, Kwalata Game Ranch, Gauteng, South Africa, September 2014 (abstract only)

Caleb K. Kibet and Philip Machanick

Accurate prediction of transcription factor binding sites (TFBS) remains a challenge in gene regulatory research. One in vitro technique, the protein binding microarray (PBM), produces high-throughput binding data that cover all possible 10-mers and has been used to predict TFBS. Although comprehensive, a universal PBM models binding without considering whether chromatin leaves the binding sites accessible to transcription factors (TFs). In vivo techniques such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) model TF binding more accurately, but they are experimentally expensive and each experiment can only study a single TF in a specific cell type. DNase hypersensitive site data that combine chromatin accessibility from multiple cell types are available from the ENCODE database, but cannot be used on their own to predict binding because they are not specific to any TF. A technique that merges in vivo DNase hypersensitivity data with in vitro PBM data could allow better TFBS prediction than PBM data alone. In this study, we infer the probability that a k-mer is located in an open chromatin region from its frequency count in DNase-seq data. We then use a modified seed-and-wobble algorithm that re-ranks the k-mer binding intensity data using both the k-mer's signal intensity and the probability that it is located in an open chromatin region. Preliminary analysis shows that our technique produces motifs with higher information content than the tested reference motifs in the JASPAR core database, which were learned from SELEX or ChIP-seq data. Our motifs are also more enriched in ChIP-seq data than pure PBM-derived models and, for some TFs such as Max, more enriched than the reference motifs. These results suggest that combining in vivo and in vitro data has a good chance of improving TFBS prediction.
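The re-ranking idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the open-chromatin probability is estimated or how it is combined with PBM intensity. The sketch assumes a pseudocounted frequency scaled by the most frequent k-mer, and a simple product of intensity and probability. The function names (count_kmers, open_chromatin_probability, rerank_kmers) and the toy sequences are hypothetical, and reverse complements and the seed-and-wobble step itself are left out.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every k-mer made only of A, C, G, T in the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def open_chromatin_probability(counts, pseudocount=1.0):
    """Return a function mapping a k-mer to a score in (0, 1].

    Assumption: the score is the pseudocounted count divided by the
    pseudocounted count of the most frequent k-mer.
    """
    max_count = max(counts.values(), default=0)

    def prob(kmer):
        return (counts.get(kmer, 0) + pseudocount) / (max_count + pseudocount)

    return prob

def rerank_kmers(pbm_intensity, prob):
    """Rank k-mers by PBM intensity weighted by open-chromatin score.

    Assumption: the two quantities are combined by multiplication.
    """
    weighted = {kmer: score * prob(kmer) for kmer, score in pbm_intensity.items()}
    return sorted(weighted, key=weighted.get, reverse=True)

# Toy data standing in for DNase-seq regions and PBM k-mer intensities.
dnase_regions = ["ACGTACGT", "CGTACGTT"]
pbm_intensity = {"ACGT": 1.0, "TTTT": 1.5, "CGTT": 1.2}

counts = count_kmers(dnase_regions, k=4)
prob = open_chromatin_probability(counts)
ranking = rerank_kmers(pbm_intensity, prob)
print(ranking)
```

In this toy input, TTTT has the highest PBM intensity but never occurs in the accessible regions, so it falls below ACGT and CGTT once the accessibility weight is applied.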