Data preparation
We need to incorporate new fetal-brain-related data into the model. This involves two subtasks:
- Annotate DeepSEA sequence with new labels.
- See pipeline memo here for E081, E082, Noonan, E129 (DeepSEA label)
- See pipeline memo here for ATAC-seq data from Noonan’s lab
- See pipeline memo here for d27-NSC, d41-neuron data from GSE70823
- See pipeline memo here for ATAC-seq data obtained from Yifan 5/30/18
- Extract new sequences of interest based on new data set.
- See pipeline memo here for E081, E082, Noonan, E129 (DeepSEA label)
- See pipeline memo here for ATAC-seq data from Noonan’s lab
- See pipeline memo here for d27-NSC, d41-neuron data from GSE70823
- See pipeline memo here for ATAC-seq data obtained from Yifan 5/30/18
- Define positive and negative sequences
- See memo here for FANTOM fetal brain CAGE peaks from Zhongshan (06/02/18)
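The annotation step above can be sketched as follows. This is a minimal illustration, not the pipeline described in the memos: it assumes sequences are fixed-width bins on one chromosome, that peaks are non-overlapping half-open intervals, and that a bin is positive when more than half of it is covered by a peak (the DeepSEA-style convention; the threshold actually used here should be checked against the memos). The names `overlap_bp`, `label_bins`, and `min_fraction` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def overlap_bp(bin_iv: Interval, peaks: List[Interval]) -> int:
    """Bases of bin_iv covered by peaks (peaks are assumed not to overlap each other)."""
    start, end = bin_iv
    return sum(
        max(0, min(end, p_end) - max(start, p_start))
        for p_start, p_end in peaks
    )


def label_bins(bins: List[Interval], peaks: List[Interval],
               min_fraction: float = 0.5) -> List[int]:
    """Label a bin 1 if more than min_fraction of it overlaps a peak, else 0."""
    labels = []
    for b in bins:
        covered = overlap_bp(b, peaks) / (b[1] - b[0])
        labels.append(1 if covered > min_fraction else 0)
    return labels


# Toy example: three 200-bp bins and two peaks (assumed coordinates).
bins = [(0, 200), (200, 400), (400, 600)]
peaks = [(150, 380), (590, 700)]
labels = label_bins(bins, peaks)
print(labels)
```

Running this once per new data set (E081, E082, Noonan, and so on) would give one label column per annotation; real peak files would normally be intersected with a tool such as bedtools rather than in pure Python.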
Genomic Data
- Stage1
- Fetal brain DNase-seq from Roadmap (E081, E082)
- H3K27ac mark from Noonan’s lab (Noonan)
- One label from the DeepSEA model (E129)
- Stage2
- ATAC-seq data from Noonan’s lab (hNSC-50, hNSC-P15-1, hNSC-P5-1, hNSC-P5-2)
- paperNSC
- NSC data from this paper; the data were obtained from this link
- d27, d41-1, d41-2 data are used
- Referred to as paperNSC
- ATAC053018
- Obtained from Yifan on 5/30/18
- Contains cell types labelled as CN, DN, GA, ips, NSC
Train new models
There are three strategies:
- Baseline classifier. here
- Train an SVM or logistic classifier using JASPAR as annotation
- A gkm-SVM classifier is trained on 5000 positive and 5000 negative sequences, one annotation at a time.
- Use DeepSEA sequences and DeepSEA feature representation.
- Use DeepSEA sequences but train a new model using the same architecture as DeepSEA.
- Include new sequences and train a new model using the same architecture. Left for future work.
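For the baseline classifier, the "5000 positive and 5000 negative sequences, one annotation at a time" setup can be sketched as a balanced sampling loop. This is an illustrative sketch only: the helper `balanced_sample`, the toy `label_matrix`, and the choice to cap the sample at the size of the smaller class are assumptions, not the project's actual code.

```python
import random
from typing import Dict, List, Sequence, Tuple


def balanced_sample(labels: Sequence[int], n_per_class: int,
                    seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (positive_indices, negative_indices), each of equal size.

    At most n_per_class indices are drawn per class; if either class is
    smaller than that, both samples shrink to the smaller class size.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(n_per_class, len(pos), len(neg))
    return sorted(rng.sample(pos, n)), sorted(rng.sample(neg, n))


# Toy label matrix: one 0/1 column per annotation (hypothetical values).
label_matrix: Dict[str, List[int]] = {
    "E081": [1, 0, 0, 1, 0, 0, 0, 1],
    "Noonan": [0, 0, 1, 0, 0, 0, 0, 0],
}

# One balanced training set per annotation; in the real setting
# n_per_class would be 5000.
splits = {
    name: balanced_sample(column, n_per_class=2)
    for name, column in label_matrix.items()
}
for name, (pos_idx, neg_idx) in splits.items():
    print(name, pos_idx, neg_idx)
```

Each pair of index lists would then be used to pull the matching sequences and train one classifier for that annotation.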
Results
- ROC/PR curve
- Stage1, JASPAR logistic, deepsea+extracted, no reweight
- Stage1 JASPAR logistic, deepsea+extracted, reweight
- Stage1 gkmsvm, deepsea, no reweight
- Stage1 gkmsvm, deepsea, reweight
- Stage1, train_deepsea, deepsea+extracted, no reweight
- Stage1, train_deepsea, deepsea+extracted, reweight
- Stage1 Stage2, all methods, deepsea sequence, reweight
- Stage1 Stage2, all methods, extracted sequence, reweight
- Stage1 Stage2, all methods, extracted sequence, reweight, 080917 For Xin
- paperNSC, all methods, deepsea sequences, reweighted GC
- paperNSC, all methods, extracted sequences, reweighted GC
- ATACseq 05/30/18 from Yifan, DeepSEA sequences
- ATACseq 05/30/18 from Yifan, extracted sequences
- AUC scores
- Stage1, train_deepsea vs. JASPAR, no reweight
- Stage1, train_deepsea vs. JASPAR, reweight
- Stage1, train_deepsea vs. gkmsvm, reweight
- Stage1 Stage2, compare all methods, reweight
- Stage1 Stage2, compare all methods, reweight, by group
- paperNSC, all methods, reweighted GC
- ATACseq 05/30/18 from Yifan, all methods, reweighted by GC content
- Cross validation
Other analysis
There are other directions and issues we need to explore and tackle. The analyses are listed here.
- Motif Analysis:
- Examine which filter, or set of filters, contributes most to a particular prediction task. See details for Stage1 here. ONGOING
- Training Strategy
- How can we make the model specialize in a particular prediction task, for example a certain tissue type? (Change the architecture? How should overfitting be handled?) TODO
- Performance Evaluation
- How to measure performance on imbalanced data. For instance, how should ROC AUC and PR AUC be interpreted? See the related discussion here.
- Redo DeepSEA evaluation on DNase footprint data (See Allelic Imbalance Analysis below)
- Application
- IBD GWAS SNPs. See here. The R Markdown script is here.
- Data Quality:
- Allelic Imbalance Analysis:
- DeepSEA results (Table S3). Presented by data set here
- Rerun of the analysis considering both imbalanced and balanced variants here
- Detecting Regulatory Grammar:
- Critical windows are used to detect important “words” in the sequence and, potentially, combinations of “words” (“phrases”). Here, we first explore the distribution of critical windows along the sequence. E081, Noonan
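On the performance-evaluation question in the list above (how to read ROC AUC and PR AUC on imbalanced data), a small worked example may help. The sketch below uses made-up scores and plain-Python implementations of ROC AUC (the pairwise rank statistic) and average precision as a PR AUC summary; it is not the evaluation code used for the results. It keeps the same positives and the same negative score distribution, then multiplies the number of negatives by ten.

```python
from typing import List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of precision@rank over the ranks of the positives.

    Assumes no positive ties with a negative; tied scores would make the
    ranking, and hence the result, depend on input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def make_data(neg_copies: int):
    """Two positives plus neg_copies copies of two negatives (made-up scores)."""
    pos_scores: List[float] = [0.9, 0.6]
    neg_scores: List[float] = [0.7, 0.2] * neg_copies
    scores = pos_scores + neg_scores
    labels = [1] * len(pos_scores) + [0] * len(neg_scores)
    return scores, labels


balanced = make_data(neg_copies=1)     # 2 positives, 2 negatives
imbalanced = make_data(neg_copies=10)  # 2 positives, 20 negatives

for name, (scores, labels) in [("balanced", balanced), ("imbalanced", imbalanced)]:
    print(name, round(roc_auc(scores, labels), 4), round(average_precision(scores, labels), 4))
```

In this toy case ROC AUC is identical in both settings, because it depends only on how positives rank against the negative score distribution, while average precision drops as negatives are added. This is why PR AUC values are only comparable between data sets with a similar positive fraction.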
TODO list
- 4/9/17:
ROC/PR AUC interpretation link
Extract sequences from new data sets link
- 4/10/17:
- New loss function? Weighted hinge loss? link TODO
- Motif analysis of current model link ONGOING
- 4/14/17:
Build up pipeline using snakemake. See issue1 at link
- 4/26/17:
Build baseline classifier link
- 5/29/17:
- GC content bias in training and testing data
- To overcome this, we can either use an unbiased sequence set or reweight the performance evaluation function based on GC content.
- At first, do reweighting (testing - training - motif analysis). ONGOING. See testing with reweight and motif analysis with GC-matched sequences
- Construct a sequence set with matched GC content in positive and negative instances (use Homer?) TODO
- 5/31/17:
- Search for motif hits using top 20 motifs in motif activation analysis by GC matched sequences TODO
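One way to picture the GC-content reweighting from the 5/29/17 entry is sketched below. It is an assumed scheme, not necessarily the one used for the "reweight" results: sequences are binned by GC fraction, each negative is weighted so that the negatives' GC-bin distribution matches the positives', and ROC AUC is computed over weighted positive-negative pairs. All function names and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def gc_bin(seq: str, n_bins: int = 10) -> int:
    """Index of the GC-fraction bin for a sequence (the last bin includes 1.0)."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return min(int(gc * n_bins), n_bins - 1)


def gc_matching_weights(seqs: Sequence[str], labels: Sequence[int],
                        n_bins: int = 10) -> List[float]:
    """Weight 1 for positives; negatives get P(bin | positive) / P(bin | negative)."""
    bins = [gc_bin(s, n_bins) for s in seqs]
    pos = Counter(b for b, y in zip(bins, labels) if y == 1)
    neg = Counter(b for b, y in zip(bins, labels) if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return [
        1.0 if y == 1 else (pos[b] / n_pos) / (neg[b] / n_neg)
        for b, y in zip(bins, labels)
    ]


def weighted_roc_auc(scores, labels, weights) -> float:
    """ROC AUC where each positive-negative pair counts with weight w_pos * w_neg.

    Raises ZeroDivisionError if no negative shares a GC bin with any positive.
    """
    num = den = 0.0
    for sp, yp, wp in zip(scores, labels, weights):
        if yp != 1:
            continue
        for sn, yn, wn in zip(scores, labels, weights):
            if yn != 0:
                continue
            den += wp * wn
            num += wp * wn * (1.0 if sp > sn else 0.5 if sp == sn else 0.0)
    return num / den


# Toy data: positives are mostly GC-rich, negatives mostly AT-rich.
seqs = ["GGCC", "GCGC", "ATAT", "CCGG", "ATTA", "AATT", "TTAA"]
labels = [1, 1, 1, 0, 0, 0, 0]
# A "classifier" that only looks at GC content (score = GC fraction).
scores = [(s.count("G") + s.count("C")) / len(s) for s in seqs]

weights = gc_matching_weights(seqs, labels)
plain_auc = weighted_roc_auc(scores, labels, [1.0] * len(seqs))
gc_adjusted_auc = weighted_roc_auc(scores, labels, weights)
print(round(plain_auc, 4), round(gc_adjusted_auc, 4))
```

In this toy case the GC-only scorer looks better than chance without reweighting and falls to chance level once negatives are reweighted, which is the behaviour the reweighting is meant to expose. Constructing a GC-matched negative set (the TODO above) addresses the same bias by resampling instead of weighting.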