Results on a carefully constructed test set of verified binding sites in the mouse genome demonstrate that our new multiple data fusion method can reduce false positive rates, and that DNA duplex stability and nucleosome occupancy data can improve the accuracy of transcription factor target gene predictions, especially when combined with other genome-level data sources. Cross-validation and other randomization tests confirm the predictive performance of our method. Our results also show that nonredundant data sources provide the most efficient data fusion.

A central problem in molecular systems biology is to understand the manner in which a cell operates its complex transcriptional machinery.

At the molecular level, transcriptional processes are largely controlled by transcription factors (TFs) that bind to gene promoters in a sequence-specific manner and thereby inhibit or promote the expression of their target genes. Collectively, these DNA-binding proteins and other molecules work together to implement the complex regulatory machinery that controls gene expression. Since large-scale understanding of transcriptional regulation is still severely limited even in lower organisms, it is of great importance to reveal these regulatory protein-DNA interactions genome-wide.

However, it is not possible to obtain sufficient coverage, that is, to screen all TFs under all conditions, using experimental methods alone. Therefore, the binding site prediction problem calls for computational methods. Computational predictions rely on sequence specificities that are typically taken from a database [4] or obtained as output from a motif discovery method [6]. Recent progress on the experimental side has also made it possible to measure TF-binding specificities in a high-throughput manner [7].

The advent of these experimental techniques equips TF target gene prediction methods with much more accurate binding specificity models and, indeed, opens a whole new avenue for computational analysis of TF-DNA binding. Sequence specificities alone, however, are not sufficiently informative to accurately predict TF binding sites (TFBSs), simply because the probability of observing, by chance, an exact copy of a presumably functional binding motif somewhere in a genome is remarkably high.
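As a rough, back-of-the-envelope illustration of why chance matches are so common, the sketch below estimates the expected number of exact occurrences of one fixed motif in random DNA. It assumes independent, uniformly distributed bases and an order-of-magnitude genome length; both are simplifying assumptions made here for illustration, and the function name is hypothetical.

```python
def expected_chance_matches(motif_length, sequence_length, both_strands=True):
    """Expected number of exact matches to one fixed motif in random DNA.

    Assumes independent bases with probability 0.25 each (a simplification).
    When both strands are scanned, a reverse-palindromic motif would be
    counted twice by this estimate.
    """
    positions = max(sequence_length - motif_length + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** motif_length


if __name__ == "__main__":
    # Assumed order of magnitude for a mammalian genome, not a measured value.
    genome_length = 2_500_000_000
    for k in (6, 8, 10, 12):
        print(k, expected_chance_matches(k, genome_length))
```

Under these assumptions, short motifs are expected to match by chance at a very large number of positions, which is why additional evidence is needed to separate functional sites from spurious hits.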

A natural way to improve TF target gene predictions is to incorporate additional information into statistical inference of TFBSs. A number of additional data sources can be useful for this purpose, including information on coregulated genes, evolutionary conservation, physical binding locations as measured by ChIP-chip or ChIP-seq, nucleosome occupancies, CpG islands, regulatory potential, and DNase hypersensitive sites. Incorporating additional information sources to guide statistical inference has been used successfully in the context of motif discovery [8-11], but has not attracted enough attention in TF target gene prediction.

We have recently developed a probabilistic TF target gene prediction method, ProbTF, which can incorporate practically any additional genome-level information source to predict TF target genes [12]. Statistical data fusion for TF target gene prediction becomes more challenging in the case of multiple information sources. Here we develop a new method for multiple data fusion and incorporate novel data sources into TF target gene prediction. Four genome-level additional information sources i. Some of these and other individual data sources have already been shown to improve de novo motif discovery [8-11].

Here we demonstrate how multiple data sources can be combined to make joint statistical inference about TF target genes.

Integration of data sources that have a probabilistic interpretation is relatively straightforward [12]. For other data sources, we convert the raw data into probabilities, or prior distributions, by extending a previously proposed Bayesian transformation method [11]. In addition, for efficient use of DNA duplex stability data, we propose a simple heuristic that can assess the binding preference (single- versus double-stranded DNA) of a TF from a set of known binding sites. Results on a carefully constructed set of verified binding sites in the mouse genome [3, 5, 12] demonstrate that the new data fusion method proposed here improves the performance of TF target gene prediction methods.

We also demonstrate that a number of genome-level data sources, either alone or especially in combination, are highly informative about TF target genes. Consequently, our statistical data fusion method can provide valuable new insights into genome-wide models of transcriptional regulatory networks. Given the fundamental role of TFs in transcriptional regulation, we focus on predicting TF target genes.

Because each individual data source is noisy and gives only a partial view of the underlying regulatory mechanisms, we focus on making statistical inference for TFBSs from multiple information sources. The essence of the data fusion problem that we encounter is illustrated in Figure 1, which shows four examples of verified binding sites from the test data set together with the associated additional genome-level data sources [12].

The first row in each subplot shows the annotated binding site(s) for a TF in a gene promoter. The following five rows show the additional data sources: probability of conservation con. The joint prior combined from all the explored additional data sources is shown in black in the last row. The median and mean of the scores for each data type applied to the sequences shown in Figure 1 are recorded in Table in supplementary material available online at doi:

Figure 1: Illustration of the data fusion problem in TF target gene prediction.

The promoter sequence names are shown above the arrow, and the arrow marks the transcription start site (TSS).

The horizontal axis corresponds to position relative to the TSS. The red bar(s), together with a TF name on the first line of each panel, represent the known binding site(s). Evolutionary conservation (green), regulatory potential (blue), two nucleosome positioning signals [1, 2] (magenta), and DNA duplex stability data (red) are shown in the following five rows abbreviated with con. The joint prior from all four additional data sources (black) is shown in the last row.

TFs shown in panels (a) and (b) are assumed to bind to their corresponding sequences in a double-stranded manner, while TFs in panels (c) and (d) bind in a single-stranded manner. All plotted data are for the mouse genome.

Figure 1 shows that the highest log-likelihood score is not always obtained at the annotated binding site.

This issue can be addressed by, for example, the ProbTF method, which implements an intuitive way of combining predictions from multiple position-specific frequency matrices (PSFMs): ProbTF considers all possible numbers of nonoverlapping TFBSs in all possible locations and configurations and weights each configuration according to its probability. A more difficult problem is to decide which of the peaks predicted by PSFMs correspond to real, functional binding sites.

This lack of specificity can be greatly mitigated by genome-level data fusion, which is the focus of this study.

Consistent with what is known about transcriptional regulation, many of the verified binding sites have a high degree of conservation [8] and high regulatory potential scores [14] and are typically free of stable nucleosomes i. Moreover, DNA double helix destabilization energies at TF binding sites are different from those at random sites [11].

However, correlation between TFBSs and any of the additional data sources cannot be expected to be perfect even from a biological point of view. The additional information sources are also noisy, regardless of whether they are experimental measurements or computational predictions.

The only possibility is to make statistical inference from multiple genome-level data sources in a way that takes the inherent randomness into account. The rationale is that the accuracy of computational TF target gene predictions naturally improves when more useful information is incorporated into the statistical analysis.

We first describe the TF target gene prediction algorithm employed in this study (full details can be found in [12]). Let denote a single strand of a promoter sequence, where and is the length of the sequence (generalization to double-stranded DNA sequences is also possible but omitted here).

Let denote the number of unknown binding sites and the hidden start positions of nonoverlapping binding sites in sequence ; that is, if then. Nonbinding site i.

Assuming that we have access to the previous nucleotides before the start of the actual sequence , the likelihood of a sequence having no binding sites for any TF is , where. We set since that value provides the best results in [12]. Define as the configuration of motif models from in ; that is, specifies the motif model , which begins from location and has a length. Further, the probability of sequence , given nonoverlapping motif positions and the motif and background models, is.

The probability that a sequence has binding sites is obtained with Bayes' rule.

As proposed in [12], the prior of the number of motif instances, , is assumed to be independent of and and has an exponential form. We use. This formula defines the user-definable prior expectation of the number of binding sites in a given DNA sequence.

Importantly, it does not incorporate any of the informative data sources studied here. This prior primarily only increases or decreases the estimated binding probabilities and, as such, has little effect on, for example, the ROC curves. The probability is obtained with the assumption that, for a fixed value of , the prior over binding site positions and configurations is uniform and inversely proportional to the number of different binding site positions and configurations. The probability is obtained by summing over all possible positions and configurations, and can be computed efficiently using a recursive formula [12].
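The summation over all nonoverlapping binding site positions and configurations can be organized as a dynamic program over sequence prefixes. The following is a minimal sketch of that idea, not the exact recursion of [12]: it assumes a zero-order background model (rather than a higher-order Markov model), a single strand, a uniform prior over placements for each fixed number of sites, and a caller-supplied prior over the number of sites. All function and variable names are hypothetical.

```python
def site_likelihood(psfm, seq, start):
    """Likelihood of the window seq[start:start+len(psfm)] under one PSFM."""
    p = 1.0
    for offset, column in enumerate(psfm):
        p *= column[seq[start + offset]]
    return p


def sequence_likelihoods(seq, psfms, background, max_sites):
    """Return P(seq | q sites) for q = 0..max_sites.

    total[q][i]: summed likelihood of seq[:i] over every placement of exactly
                 q nonoverlapping sites (any motif model) inside seq[:i].
    count[q][i]: number of such placements, used for the uniform prior.
    """
    n = len(seq)
    total = [[0.0] * (n + 1) for _ in range(max_sites + 1)]
    count = [[0] * (n + 1) for _ in range(max_sites + 1)]
    total[0][0] = 1.0
    count[0][0] = 1
    for i in range(1, n + 1):
        for q in range(max_sites + 1):
            # Case 1: position i-1 is background.
            t = total[q][i - 1] * background[seq[i - 1]]
            c = count[q][i - 1]
            # Case 2: a site from some motif model ends at position i-1.
            if q > 0:
                for psfm in psfms:
                    start = i - len(psfm)
                    if start >= 0:
                        t += total[q - 1][start] * site_likelihood(psfm, seq, start)
                        c += count[q - 1][start]
            total[q][i] = t
            count[q][i] = c
    return [total[q][n] / count[q][n] if count[q][n] else 0.0
            for q in range(max_sites + 1)]


def posterior_site_counts(likelihoods, prior):
    """Bayes' rule: posterior over the number of sites given the sequence."""
    joint = [l * p for l, p in zip(likelihoods, prior)]
    z = sum(joint)
    return [j / z for j in joint]


if __name__ == "__main__":
    bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    motif = [{"A": 0.9, "C": 0.03, "G": 0.03, "T": 0.04},
             {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05}]
    lik = sequence_likelihoods("GGACTT", [motif], bg, max_sites=2)
    post = posterior_site_counts(lik, prior=[0.7, 0.2, 0.1])
    print("P(at least one site | sequence) =", 1.0 - post[0])
```

The probability that the sequence contains at least one binding site is one minus the posterior probability of zero sites, which corresponds to the binding probability discussed in the text under the assumptions stated above.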

Finally, the probability that a TF which is characterized by binds to a promoter , , is defined as the probability that at least one of the motif models in has a binding site in. Integration of additional data sources into the aforementioned probabilistic TF target gene prediction framework is carried out by assuming that the data sources are in the form of where is the probability that the base pair location is a binding site. Assuming that and are conditionally independent and the probability of does not depend on the PSFM and background models, the probability of and given , , , and is.

Following 1 , the probability is modeled as. Consequently, the same efficient recursive algorithm can be used to compute (see [12] for more details). Also note that, since additional data are incorporated using probabilities of binding over the promoter sequence, we could also employ methods other than ProbTF.

Define the additional genome-level data source for a single gene promoter having length as. Denote the probabilities for position from different data sources as ,. Further, define a thresholded version of probabilities as.

Then the thresholded scores for position can be written as ,. Let be the number of data sources that exceed their thresholds at location , then the integrated probability for position , , is calculated as. The data integration method is parameterized by and. Note that and. It is also worth noting that the resulting probabilities do not include hard thresholding for any of the genomic locations although thresholding is involved in integration, and the use of thresholding during the construction is motivated by its simple yet powerful parametrization.
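The counting step described above, namely how many data sources exceed their thresholds at each location, can be sketched as follows. This covers only the counting step; the formula that turns the counts and weights into the integrated probability is not reproduced here. The function name, the argument layout (one probability list per data source, one threshold per source), and the strict "greater than" comparison are assumptions made for illustration.

```python
def count_sources_above_threshold(source_probs, thresholds):
    """For each position, count the data sources whose probability exceeds
    that source's threshold.

    source_probs: list of per-source probability lists, all the same length.
    thresholds:   one threshold per data source (repeat the same value to
                  model a single shared threshold).
    """
    if len(source_probs) != len(thresholds):
        raise ValueError("need exactly one threshold per data source")
    length = len(source_probs[0]) if source_probs else 0
    if any(len(probs) != length for probs in source_probs):
        raise ValueError("all data sources must cover the same positions")
    return [
        sum(1 for probs, thr in zip(source_probs, thresholds) if probs[i] > thr)
        for i in range(length)
    ]


if __name__ == "__main__":
    conservation = [0.1, 0.6, 0.9]   # made-up values for illustration
    nucleosome = [0.7, 0.2, 0.95]    # made-up values for illustration
    print(count_sources_above_threshold([conservation, nucleosome], [0.5, 0.5]))
```

Positions where several independent sources agree receive higher counts, which is the quantity the integration method uses to emphasize the most informative binding locations.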

The data integration method is illustrated in Figure 2 for the case of two additional data sources with parameters , , , and. For illustration purposes, both data sources are assumed to have uniform distribution and hence.

Figure 2: An illustration of the prior integration method for the case of two additional data sources.

In the above genome-level data integration method there are ( is the number of additional data sources) weighting parameters , and one threshold for emphasizing the most informative binding locations. There are also two scaling parameters, a multiplicative factor , and a bias term , for each additional data source, and one scaling parameter, , for combining other data sources with the TF target gene prediction analysis.