1 Data preparation

We need to incorporate new fetal-brain-related data into the model. This involves the following subtasks:

Annotate DeepSEA sequence with new labels.
- See pipeline memo here for E081, E082, Noonan, E129 (DeepSEA label)
- See pipeline memo here for ATAC-seq data from Noonan’s lab
- See pipeline memo here for d27-NSC, d41-neuron data from GSE70823
- See pipeline memo here for ATAC-seq data obtained from Yifan 5/30/18
Extract new sequences of interest based on new data set.
- See pipeline memo here for E081, E082, Noonan, E129 (DeepSEA label)
- See pipeline memo here for ATAC-seq data from Noonan’s lab
- See pipeline memo here for d27-NSC, d41-neuron data from GSE70823
- See pipeline memo here for ATAC-seq data obtained from Yifan 5/30/18
Define positive and negative sequences
- See memo here for FANTOM fetal brain CAGE peaks from Zhongshan (06/02/18)

2 Genomic Data

Stage1
- Fetal brain DNase-seq from Roadmap (E081, E082)
- H3K27ac mark from Noonan’s lab (Noonan)
- One label from the DeepSEA model (E129)
Stage2
- ATAC-seq data from Noonan’s lab (hNSC-50, hNSC-P15-1, hNSC-P5-1, hNSC-P5-2)
paperNSC
- NSC data from this paper; the data were obtained from this link
- d27, d41-1, and d41-2 data are used
- Referred to as paperNSC
ATAC053018
- Obtained from Yifan on 5/30/18
- Contains cell types labelled as CN, DN, GA, ips, NSC

There are three strategies:

Baseline classifier. here
- Train an SVM or logistic regression classifier using JASPAR as annotation
- A gkm-SVM classifier is trained on 5,000 positive and 5,000 negative sequences, one annotation at a time.
Use DeepSEA sequences and DeepSEA feature representation.
- See training result Stage1 here.
- See performance on new sequences Stage1 here
- Evaluation using allelic imbalance sites Stage1 here
Use DeepSEA sequences but train a new model using the same architecture as DeepSEA.
- See results and analysis Stage1 here. Leave for future
Include new sequences and train a new model using the same architecture. Leave for future
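
As a rough illustration of the features behind the gkm-SVM baseline mentioned above, the sketch below counts gapped k-mers (every window of length `l` with `k` informative positions, the rest masked) and compares two sequences by the inner product of their count vectors. It is not the gkm-SVM software itself: the function names, the defaults `l=4, k=3`, and the unnormalized kernel are assumptions chosen for readability.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=4, k=3):
    """Count gapped k-mers: each l-mer with k kept positions, others masked as '-'."""
    seq = seq.upper()
    counts = Counter()
    masks = list(combinations(range(l), k))  # which positions stay informative
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        for keep in masks:
            key = "".join(window[i] if i in keep else "-" for i in range(l))
            counts[key] += 1
    return counts


def gapped_kmer_kernel(seq_a, seq_b, l=4, k=3):
    """Unnormalized similarity: inner product of the two gapped k-mer count vectors."""
    counts_a = gapped_kmer_counts(seq_a, l, k)
    counts_b = gapped_kmer_counts(seq_b, l, k)
    return sum(n * counts_b[key] for key, n in counts_a.items())


if __name__ == "__main__":
    print(gapped_kmer_counts("ACGT"))
    print(gapped_kmer_kernel("ACGT", "ACGA"))
```

A kernel matrix built this way could be passed to any SVM implementation that accepts precomputed kernels; in practice the dedicated gkm-SVM tools use much longer words and a more efficient algorithm.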

There are some other directions and issues we need to explore and tackle. The analyses are listed here.

Motif Analysis:
- Examine which filter or set of filters contributes most to a particular prediction task. See details Stage1 here. ONGOING
Training Strategy
- How can we boost the model to specialize in a particular prediction task, e.g. a certain tissue type? (Change the architecture? How to deal with overfitting?) TODO
Performance Evaluation
- How to measure performance on imbalanced data, for instance how to interpret ROC AUC and PR AUC. See related discussion here
- Redo DeepSEA evaluation on DNase footprint data (See Allelic Imbalance Analysis below)
Application
- IBD GWAS SNPs. See here. The R Markdown script is here
Data Quality:
- GC content.
- Similarity of data sets Stage1 Stage2 to Normal Brain data from DeepSEA
Allelic Imbalance Analysis:
- DeepSEA results (Table S3). Presented by data set here
- Rerun analysis considering both imbalanced variants and balanced variants here
Detecting Regulatory Grammar:
- Critical windows are used to detect important “words” in the sequence and potentially combinations of “words” (“phrases”). Here, we first explore the distribution of critical windows along the sequence. E081, Noonan
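
To make the ROC AUC versus PR AUC question under Performance Evaluation concrete, here is a minimal sketch that computes both from scratch on a small imbalanced toy example. The labels and scores are made up for illustration, PR AUC is approximated by average precision, and tied scores are not handled specially in `average_precision`.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


if __name__ == "__main__":
    # Toy data: 2 positives among 10 examples (20% prevalence).
    labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    print("ROC AUC:", roc_auc(labels, scores))
    print("Average precision:", average_precision(labels, scores))
    print("Prevalence (random-classifier precision):", sum(labels) / len(labels))
```

A single negative ranked above a positive barely moves ROC AUC when negatives are plentiful, but it lowers precision directly. The chance level also differs: about 0.5 for ROC AUC, versus the positive fraction for PR AUC, so PR AUC values should be read relative to the prevalence of each label.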

4/9/17:
1. ~~ROC/PR AUC interpretation~~ link
2. ~~Extract sequences from new data sets~~ link
4/10/17:
1. New loss function? Weighted hinge loss? link TODO
2. Motif analysis of current model link ONGOING
4/14/17:
1. ~~Build up pipeline using snakemake.~~ See issue1 at link
4/26/17:
1. ~~Build baseline classifier~~ link
5/29/17:
1. GC content bias in training and testing data
  - To overcome this, we can either use an unbiased sequence set or reweight the performance evaluation function based on GC content.
  - First, do reweighting (~~testing~~ - training - ~~motif analysis~~). ONGOING. See testing with reweighting and motif analysis with GC-matched sequences
  - Construct a sequence set with matched GC content in positive and negative instances (use Homer?) TODO
5/31/17:
1. Search for motif hits using the top 20 motifs from the motif activation analysis with GC-matched sequences TODO
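
The GC-matched sequence set listed as a TODO under 5/29/17 could be built roughly as follows. This is a hedged sketch, not the Homer approach mentioned there: it bins sequences by GC fraction and, for each positive, draws one unused negative from the same bin. The function names, the 20-bin resolution, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def gc_bin(seq, n_bins=20):
    """Map a sequence to a GC bin index in [0, n_bins - 1]."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def match_negatives_by_gc(positives, negative_pool, n_bins=20, seed=0):
    """For each positive, draw one unused negative from the same GC bin, if any."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for seq in negative_pool:
        pool[gc_bin(seq, n_bins)].append(seq)
    for seqs in pool.values():
        rng.shuffle(seqs)  # random draw order within each bin
    matched = []
    for seq in positives:
        candidates = pool.get(gc_bin(seq, n_bins))
        if candidates:  # positives with no available match are skipped
            matched.append(candidates.pop())
    return matched


if __name__ == "__main__":
    positives = ["GGCC", "ATAT"]
    negative_pool = ["CCGG", "TTAA", "ACGT"]
    print(match_negatives_by_gc(positives, negative_pool))
```

Positives whose bin has run out of negatives are dropped from the matched set, so it is worth checking how many positives remain unmatched before relying on the result.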