Get the input and background peaks from ENCSR000DXD, a CTCF ChIP-seq dataset from fetal kidney proximal epithelial cells.
wget https://encode-public.s3.amazonaws.com/2016/12/16/d944f665-0b23-418b-b297-c36bc585942b/ENCFF932EHP.bed.gz
bgzip -f -d ENCFF932EHP.bed.gz
cut -f 1-3 ENCFF932EHP.bed > ENCFF932EHP_cut.bed
sed -i "s/$/\tRPTEC|bg|None/" ENCFF932EHP_cut.bed
sort -k1,1V -k2,2n -k3,3n ENCFF932EHP_cut.bed > ENCFF932EHP_cut_sorted.bed
bgzip -c ENCFF932EHP_cut_sorted.bed > ENCFF932EHP_cut_sorted.bed.gz
tabix -p bed ENCFF932EHP_cut_sorted.bed.gz
wget https://encode-public.s3.amazonaws.com/2016/12/16/f5a83928-69e1-4d5f-ba89-e8ea0124261c/ENCFF192UQS.bed.gz
bgzip -f -d ENCFF192UQS.bed.gz
cut -f 1-3 ENCFF192UQS.bed > ENCFF192UQS_cut.bed
sed -i "s/$/\tRPTEC|CTCF|None/" ENCFF192UQS_cut.bed
sort -k1,1V -k2,2n -k3,3n ENCFF192UQS_cut.bed > ENCFF192UQS_cut_sorted.bed
bgzip -c ENCFF192UQS_cut_sorted.bed > ENCFF192UQS_cut_sorted.bed.gz
tabix -p bed ENCFF192UQS_cut_sorted.bed.gz
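For readers who want to check the intermediate files, the cut, label, and sort steps above can be sketched in plain Python. This is only an illustration of what those commands do to each row, not a replacement for them: the helper names `natural_key` and `label_and_sort` and the three example rows are hypothetical, and the chromosome ordering only approximates `sort -V`.

```python
import re


def natural_key(chrom):
    """Split 'chr10' into ['chr', 10, ''] so that chr2 sorts before chr10."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", chrom)]


def label_and_sort(bed_lines, label):
    """Keep columns 1-3, append a label column, sort by chrom, start, end."""
    rows = []
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        chrom, start, end = line.split("\t")[:3]
        rows.append((chrom, int(start), int(end), label))
    rows.sort(key=lambda row: (natural_key(row[0]), row[1], row[2]))
    return ["\t".join(str(field) for field in row) for row in rows]


# Made-up rows with extra columns, as a peak file would have.
example = [
    "chr10\t500\t700\tpeak1\t0",
    "chr2\t900\t1100\tpeak2\t0",
    "chr2\t100\t300\tpeak3\t0",
]
labelled = label_and_sort(example, "RPTEC|CTCF|None")
for row in labelled:
    print(row)
```

Each output row has four tab-separated columns, with the label in the fourth, which is the layout the bgzip and tabix steps then compress and index.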
model: {
    path: <absolute path>/deeperdeepsea.py,
    class: DeeperDeepSEA,
    class_args: {
        sequence_length: 1000,
        n_targets: 1
    },
    non_strand_specific: mean
}
features: !obj:selene_sdk.utils.load_features_list {
    input_path: <absolute path>/distinct_features.txt
}
# ops: [train, analyze]  # alternative; a YAML file can hold only one ops key
ops: [train, evaluate]
import torch

def criterion():
    return torch.nn.BCELoss()

def get_optimizer(lr):
    # Optimizer class and its keyword arguments, returned as a tuple.
    return (torch.optim.SGD, {"lr": lr, "weight_decay": 1e-6, "momentum": 0.9})
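To make the loss concrete, here is a plain-Python sketch of the quantity that `torch.nn.BCELoss` computes with its default mean reduction. The function name `binary_cross_entropy` is hypothetical and the inputs are made-up numbers. The sketch assumes the model outputs are already probabilities in [0, 1], which is what `BCELoss` expects.

```python
import math


def binary_cross_entropy(predictions, targets):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all elements."""
    total = 0.0
    for p, y in zip(predictions, targets):
        # PyTorch clamps each log term at -100 so the loss stays finite.
        log_p = max(math.log(p), -100.0) if p > 0 else -100.0
        log_not_p = max(math.log(1.0 - p), -100.0) if p < 1 else -100.0
        total += -(y * log_p + (1.0 - y) * log_not_p)
    return total / len(predictions)


# A confident correct prediction costs little, an uninformative one costs ln(2).
confident = binary_cross_entropy([0.9], [1.0])
uninformative = binary_cross_entropy([0.5], [1.0])
print(confident, uninformative)
```

With `n_targets: 1` there is a single probability per sequence, so the per-batch loss is this mean taken over the sequences in the batch.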
Use the target genomic intervals from the TF binding dataset, with background drawn from the DeepSEA intervals that contain at least one TF binding event.
sampler: !obj:selene_sdk.samplers.IntervalsSampler {
    reference_sequence: !obj:selene_sdk.sequences.Genome {
        input_path: male.hg19.fasta
    },
    features: !obj:selene_sdk.utils.load_features_list {
        input_path: <absolute path>/distinct_features.txt
    },
    target_path: <absolute path>/sorted_GM12878_CTCF.bed.gz,
    intervals_path: <absolute path>/ENCFF192UQS_cut_sorted.bed,
    seed: 127,
    sample_negative: True,
    sequence_length: 1000,
    center_bin_to_predict: 200,
    test_holdout: [chr8, chr9],
    validation_holdout: [chr6, chr7],
    feature_thresholds: 0.5,
    mode: train,
    save_datasets: [test]
}
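The sampler settings above describe a 1000 bp input window with a 200 bp center bin, a 0.5 feature threshold, and chromosome-level holdouts. The sketch below illustrates that geometry in plain Python. It is not Selene's implementation: the helper names are hypothetical, the bin is assumed to sit exactly in the middle of the window, and the threshold is assumed to mean the fraction of the center bin covered by a peak.

```python
SEQUENCE_LENGTH = 1000
CENTER_BIN = 200
FEATURE_THRESHOLD = 0.5
TEST_HOLDOUT = {"chr8", "chr9"}
VALIDATION_HOLDOUT = {"chr6", "chr7"}


def window_and_bin(position):
    """Return ((window_start, window_end), (bin_start, bin_end)) around a position."""
    bin_radius = CENTER_BIN // 2
    flank = (SEQUENCE_LENGTH - CENTER_BIN) // 2
    bin_start, bin_end = position - bin_radius, position + bin_radius
    return (bin_start - flank, bin_end + flank), (bin_start, bin_end)


def is_positive(bin_interval, peak, threshold=FEATURE_THRESHOLD):
    """Assumed rule: positive if the peak covers >= threshold of the center bin."""
    bin_start, bin_end = bin_interval
    peak_start, peak_end = peak
    overlap = max(0, min(bin_end, peak_end) - max(bin_start, peak_start))
    return overlap / (bin_end - bin_start) >= threshold


def partition(chrom):
    """Assign a chromosome to the test, validate, or train split."""
    if chrom in TEST_HOLDOUT:
        return "test"
    if chrom in VALIDATION_HOLDOUT:
        return "validate"
    return "train"


window, center = window_and_bin(10_000)
print(window, center)                          # (9500, 10500) (9900, 10100)
print(is_positive(center, (9950, 10400)))      # 150 of 200 bp covered
print(is_positive(center, (10050, 10400)))     # 50 of 200 bp covered
print(partition("chr8"), partition("chr6"), partition("chr1"))
```

Under these assumptions the model sees 400 bp of flanking sequence on each side of the bin, but the label depends only on what overlaps the central 200 bp.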
train_model: !obj:selene_sdk.TrainModel {
    batch_size: 64,
    # typically the number of steps is much higher
    max_steps: 16000,
    # the number of mini-batches the model should sample before reporting performance
    report_stats_every_n_steps: 2000,
    n_validation_samples: 32000,
    n_test_samples: 120000,
    cpu_n_threads: 32,
    use_cuda: False,
    data_parallel: False
}
lr: 0.01
random_seed: 1447
output_dir: ./training_outputs
create_subdirectory: True
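As a quick sanity check on the `train_model` settings above, the arithmetic below works out how much data the run touches. The variable names are only for illustration, and the interpretation assumes that validation is run each time statistics are reported.

```python
batch_size = 64
max_steps = 16000
report_every = 2000
n_validation_samples = 32000
n_test_samples = 120000

# Total training examples drawn over the whole run.
samples_drawn = batch_size * max_steps

# Number of times performance is reported during training.
n_reports = max_steps // report_every

# Mini-batches needed to cover the validation and test sets once.
validation_batches = n_validation_samples // batch_size
test_batches = n_test_samples // batch_size

print(samples_drawn, n_reports, validation_batches, test_batches)
# 1024000 8 500 1875
```

So roughly one million training examples are sampled, with eight reporting points, and both held-out sets divide evenly into batches of 64.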
Save the test dataset so that it does not have to be regenerated if the run doesn't finish.
load_test_set: False
In [1]:
%matplotlib inline
from selene_sdk.utils import load_path
from selene_sdk.utils import parse_configs_and_run
In [2]:
configs = load_path("./CTCF_kidney_train.yml")
In [ ]:
parse_configs_and_run(configs, lr=0.01)