Drosophila melanogaster dataset sources

This 4th release of ReMap (2022) presents the analysis of 1,205 quality-controlled ChIP-seq datasets (n=1,315 before QC) from public sources (GEO, ENCODE). These ChIP-seq datasets were mapped to the dm6 Drosophila assembly. Here we define a “dataset” as a ChIP-seq experiment in a given series (e.g. GSE107059), for a given TF (e.g. Trl), in a particular biological condition (i.e. cell line, tissue type, disease state or experimental conditions; e.g. Schneider-2). Datasets were labeled by concatenating these three pieces of information, such as GSE107059.Trl.Schneider-2.
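The labeling convention above can be illustrated with a minimal Python sketch. The names `DatasetLabel`, `make_label` and `parse_label` are hypothetical helpers introduced here for illustration and are not part of any ReMap tool. The sketch assumes that the series and TF names contain no dot, while the biological condition may.

```python
from typing import NamedTuple


class DatasetLabel(NamedTuple):
    """Hypothetical container for the three parts of a dataset label."""
    series: str   # experiment series, e.g. a GEO accession
    target: str   # transcription factor / regulator name
    biotype: str  # biological condition (cell line, tissue, ...)


def make_label(series: str, target: str, biotype: str) -> str:
    """Concatenate the three pieces of information with dots."""
    return ".".join((series, target, biotype))


def parse_label(label: str) -> DatasetLabel:
    """Split a label back into its parts.

    maxsplit=2 keeps any further dots inside the biotype field
    (assumption: series and target never contain a dot).
    """
    series, target, biotype = label.split(".", 2)
    return DatasetLabel(series, target, biotype)


label = make_label("GSE107059", "Trl", "Schneider-2")
parsed = parse_label(label)
```

Splitting on the first two dots only is a design choice that tolerates condition names with dots; it would mis-parse a label whose TF name itself contains one.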

Statistics

ChIP-seq
Datasets (QC1+2 pass) 1,205
Targets 550
Peaks 16,634,486

Integration of ChIP-seq

In this ReMap 2022 Fly release we manually curated and annotated 1,315 ChIP-seq experiments, of which 1,205 datasets were retained after quality control. We applied both of our quality control steps as described in the NAR publication.

After consistent peak calling, we identified a total of 16.6 million peaks bound by transcriptional regulators (TRs). These numbers include overlapping sites for identical TRs that were studied in various conditions. To address this, we merged overlapping peaks for the same TR, obtaining a catalog of 12.9 million non-redundant peaks.
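To make the idea of a non-redundant catalog concrete, here is a minimal Python sketch that merges overlapping peaks belonging to the same TR on the same chromosome. It is a plain interval union under stated assumptions (0-based, half-open coordinates; any overlap of at least 1 bp triggers a merge) and is not necessarily the procedure used to build the ReMap non-redundant peaks. The function name `merge_peaks_per_tr` is hypothetical.

```python
from collections import defaultdict


def merge_peaks_per_tr(peaks):
    """Merge overlapping peaks per (TR, chromosome).

    peaks: iterable of (chrom, start, end, tr) tuples,
           with 0-based, half-open coordinates (assumption).
    Returns a list of merged (chrom, start, end, tr) tuples,
    ordered by TR, then chromosome, then start.
    """
    grouped = defaultdict(list)
    for chrom, start, end, tr in peaks:
        grouped[(tr, chrom)].append((start, end))

    merged = []
    for (tr, chrom), intervals in sorted(grouped.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:
                # Overlap of at least 1 bp: extend the current region.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end, tr))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, tr))
    return merged


example = [
    ("chr2L", 100, 200, "Trl"),
    ("chr2L", 150, 300, "Trl"),   # overlaps the previous Trl peak
    ("chr2L", 150, 300, "CTCF"),  # different TR, kept separate
    ("chr2L", 400, 500, "Trl"),
]
non_redundant = merge_peaks_per_tr(example)
```

In this toy input the two overlapping Trl peaks collapse into a single region, while the CTCF peak at the same position stays separate because merging is done per TR.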

Datasets quality assessment

Because ChIP-seq datasets are not all of equal quality, we used four different metrics based on ENCODE ChIP-seq guidelines to retain high-quality datasets for downstream analyses. First, to exclude low-quality datasets, we used the normalized strand cross-correlation coefficient (NSC), a normalized ratio between the fragment-length cross-correlation peak and the background cross-correlation, and the relative strand cross-correlation coefficient (RSC), a ratio between the fragment-length peak and the read-length peak. We also used the fraction of reads in peaks (FRiP) and the number of peaks identified in each dataset to filter datasets.
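The three ratio-based metrics can be written down as a short Python sketch. The formulas follow the usual ENCODE definitions, in which the background is the minimum cross-correlation value and RSC subtracts that background from both peaks; the function and parameter names are hypothetical, the input values are invented for illustration, and no ReMap-specific thresholds are implied.

```python
def nsc(cc_fragment: float, cc_min: float) -> float:
    """Normalized strand cross-correlation coefficient.

    cc_fragment: cross-correlation at the fragment-length peak
    cc_min:      minimum (background) cross-correlation
    """
    return cc_fragment / cc_min


def rsc(cc_fragment: float, cc_read: float, cc_min: float) -> float:
    """Relative strand cross-correlation coefficient.

    Background-subtracted ratio of the fragment-length peak
    to the read-length ("phantom") peak.
    """
    return (cc_fragment - cc_min) / (cc_read - cc_min)


def frip(reads_in_peaks: int, total_mapped_reads: int) -> float:
    """Fraction of mapped reads that fall inside called peaks."""
    return reads_in_peaks / total_mapped_reads


# Invented example values, for illustration only.
example_nsc = nsc(cc_fragment=0.30, cc_min=0.20)
example_rsc = rsc(cc_fragment=0.30, cc_read=0.25, cc_min=0.20)
example_frip = frip(reads_in_peaks=150_000, total_mapped_reads=10_000_000)
```

Higher NSC and RSC values indicate a stronger fragment-length signal relative to background and to the read-length peak, respectively.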

Datasets are plotted in a 2D visualization with NSC and RSC as the x- and y-axes; colours highlight the datasets retained in (green) or excluded from (red) the catalogue of binding sites.

In 2022, two sets of filters were added to our QC steps. The first filter removes peaks whose length falls outside set cutoffs. For Fly we defined the range as between 50 bp and 2 kb. For each species, the upper cutoff was selected based on the percentage of peaks: we searched for the length below which 98% of the catalogue falls, and removed peaks above this cutoff. The second filter removes datasets whose number of peaks is greater than two times the number of annotated genes (see Ensembl Drosophila melanogaster gene annotation statistics). As of late 2021, fewer than 20,000 coding and non-coding genes had been identified, giving a cutoff of at most 40,000 peaks per Drosophila melanogaster ChIP-seq dataset.
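The two filters can be sketched in Python as follows. The constants come from the text (50 bp, 2 kb, 40,000 peaks); the function names are hypothetical, inclusive boundaries are an assumption, and the percentile helper is only one plausible reading of "the length below which 98% of the catalogue falls".

```python
MIN_PEAK_LEN = 50               # bp, lower cutoff for Fly
MAX_PEAK_LEN = 2_000            # bp, upper cutoff for Fly (2 kb)
MAX_PEAKS_PER_DATASET = 40_000  # two times ~20,000 annotated genes


def filter_peaks_by_length(peaks, min_len=MIN_PEAK_LEN, max_len=MAX_PEAK_LEN):
    """Keep (chrom, start, end) peaks whose length is within the cutoffs.

    Boundaries are treated as inclusive (assumption).
    """
    return [(c, s, e) for c, s, e in peaks if min_len <= e - s <= max_len]


def dataset_passes_peak_count(n_peaks, max_peaks=MAX_PEAKS_PER_DATASET):
    """A dataset is removed when it has more peaks than the cutoff."""
    return n_peaks <= max_peaks


def upper_length_cutoff(lengths, percent=98):
    """Smallest observed length L such that at least `percent` % of
    peaks have length <= L (integer arithmetic avoids float rounding)."""
    ordered = sorted(lengths)
    rank = -(-percent * len(ordered) // 100)  # ceiling division
    return ordered[max(rank, 1) - 1]


peaks = [("chr3R", 0, 30), ("chr3R", 100, 400), ("chr3R", 1_000, 4_000)]
kept = filter_peaks_by_length(peaks)
cutoff = upper_length_cutoff(range(1, 101))
```

In practice the empirical 98% length would presumably be rounded to a convenient value such as 2 kb, which is why the fixed constant and the percentile helper are kept separate here.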

Datasets quality plot

ChIP-exo post-processing

For this ReMap 2022 Drosophila release, no large ChIP-exo datasets were clearly identified for Drosophila melanogaster. A few smaller ChIP-exo GEO datasets have been published, but they were neither processed nor included in this release. They could be added for the 2024 release, or as a mid-release update.

Annotation and classification of transcription factors

The function and description of the transcriptional regulators present in this catalog (GEO, ENCODE) were retrieved from the FlyBase, Ensembl and RefSeq databases. When possible, each transcription factor was also annotated using the classification of human transcription factors, allowing users to filter specific TFs based on the characteristics of their DNA-binding domains.

Genomic visualization of peaks and analyses

To perform a de novo motif analysis for each TF present in our catalogue, we provide a link to the Regulatory Sequence Analysis Tools (RSAT).

A link to the UCSC Genome Browser was also added to facilitate genomic integration of the binding sites with other genome annotations. Our BED tracks allow for the visualization of our catalogues of binding sites on the Fly genome. Finally, different analyses such as the quality of datasets and DNA constraint analysis are provided for each transcription factor.

Downloading peaks

The ReMap BED files are available to download for a given transcriptional regulator, by biotype, or for the entire catalog as one very large BED file.

For Homo sapiens, GRCh38/hg38 is currently the supported assembly, and we also provide peaks lifted to hg19 with liftOver. We provide archives of previous ReMap catalogs.

For Mus musculus we provide BED files for transcriptional regulators. The mm10 assembly is the assembly supported by ReMap; we also provide peaks lifted to mm39.

For Drosophila melanogaster we provide BED files for transcriptional regulators. The dm6 assembly is the only assembly supported by ReMap.

For Arabidopsis thaliana we provide BED files for transcriptional regulators, histone marks, ecotypes and biotypes coupled with a given ecotype. The TAIR10 assembly is the only assembly supported by ReMap.