Notebook to train and evaluate NN trained against CTCF binding in fetal proximal tubule cells

This is a minimal example of training a neural network using selene against a transcription-specific peak file.

Steps:

Get human genome annotation

Get human fasta file using 2bit from UCSC

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa 
chmod 755 twoBitToFa
./twoBitToFa hg38.2bit hg38.fa

Download and format the data

Get input and background peaks from ENCSR000DXD which is a fetal kidney proximal epithelial CTCF chip-seq dataset

Get and process background peaks

wget https://encode-public.s3.amazonaws.com/2016/12/16/d944f665-0b23-418b-b297-c36bc585942b/ENCFF932EHP.bed.gz
bgzip -f -d ENCFF932EHP.bed.gz
cut -f 1-3 ENCFF932EHP.bed > ENCFF932EHP_cut.bed
sed -i "s/$/\tRPTEC|bg|None/" ENCFF932EHP_cut.bed
sort -k1V -k2n -k3n ENCFF932EHP_cut.bed > ENCFF932EHP_cut_sorted.bed
bgzip -c ENCFF932EHP_cut_sorted.bed > ENCFF932EHP_cut_sorted.bed.gz
tabix -p bed ENCFF932EHP_cut_sorted.bed.gz

Get and process CTCF peaks

wget https://encode-public.s3.amazonaws.com/2016/12/16/f5a83928-69e1-4d5f-ba89-e8ea0124261c/ENCFF192UQS.bed.gz
bgzip -f -d ENCFF192UQS.bed.gz
cut -f 1-3 ENCFF192UQS.bed > ENCFF192UQS_cut.bed
sed -i "s/$/\tRPTEC|CTCF|None/" ENCFF192UQS_cut.bed
sort -k1V -k2n -k3n ENCFF192UQS_cut.bed > ENCFF192UQS_cut_sorted.bed
bgzip -c ENCFF192UQS_cut_sorted.bed > ENCFF192UQS_cut_sorted.bed.gz
tabix -p bed ENCFF192UQS_cut_sorted.bed.gz

Get genomic features to predict

Distinct features = genomic features to predict

cut -f 4 ENCFF192UQS_cut_sorted.bed | sort -u > distinct_features.txt

Neural network configuration

Literal arguments

List arguments

Dictionary arguments

model: {
    path: <absolute path>/deeperdeepsea.py,
    class: DeeperDeepSEA,
    class_args: {
        sequence_length: 1000,
        n_targets: 1
    },
    non_strand_specific: mean
}

Function arguments

features: !obj.selene_sdk.utils.load_features_list {
    input_path: <absolute path>/distinct_features.txt
}

Training a model and analyzing sequences with it

ops

ops: [train, analyze]

ops: [train, evaluate]

model

def criterion():
    return torch.nn.BCELoss()

def get_optimizer(lr):
    return (torch.optim.SGD, {"lr": lr, "weight_decay": 1e-6, "momentum": 0.9})

sampler

Using target genomic intervals from the TF binding dataset and background from DeepSea at least 1 TF binding set

sampler: !obj:selene_sdk.samplers.IntervalsSampler {
    reference_sequence: !obj:selene_sdk.sequences.Genome {
        input_path: male.hg19.fasta
    },
    features: !obj:selene_sdk.utils.load_features_list {
        input_path: <absolute path>/distinct_features.txt
    },
    target_path: <absolute path>/sorted_GM12878_CTCF.bed.gz,
    intervals_path: <absolute path>/ENCFF192UQS_cut_sorted.bed,
    seed: 127,
    sample_negative: True,
    sequence_length: 1000,
    center_bin_to_predict: 200,
    test_holdout: [chr8, chr9],
    validation_holdout: [chr6, chr7],
    feature_thresholds: 0.5,
    mode: train,
    save_datasets: [test]
}

train_model

train_model: !obj:selene_sdk.TrainModel {
    batch_size: 64,
    # typically the number of steps is much higher
    max_steps: 16000,  
    # the number of mini-batches the model should sample before reporting performance
    report_stats_every_n_steps: 2000,
    n_validation_samples: 32000,
    n_test_samples: 120000,
    cpu_n_threads: 32,
    use_cuda: False,
    data_parallel: False
}

other arguments

lr: 0.01
random_seed: 1447
output_dir: ./training_outputs
create_subdirectory: True

Save test dataset to save time if it doesnt finish running

load_test_set: False

Train and evaluate NN



In [1]:

    
%matplotlib inline

from selene_sdk.utils import load_path
from selene_sdk.utils import parse_configs_and_run



In [2]:

    
configs = load_path("./CTCF_kidney_train.yml")



In [ ]:

    
parse_configs_and_run(configs, lr=0.01)









    



Outputs and logs saved to /input_dir/ml_testing/kidney_test/training_outputs
2020-04-06 21:12:14,450 - Creating validation dataset.
2020-04-06 21:14:07,039 - 112.58383131027222 s to load 32000 validation examples (500 validation batches) to evaluate after each training step.
2020-04-06 21:31:03,462 - [STEP 1000] average number of steps per second: 1.0
2020-04-06 21:32:42,640 - validation roc_auc: 0.7525198756563279
2020-04-06 21:32:42,641 - validation average_precision: 0.04557017384717893
2020-04-06 21:32:44,159 - training loss: 0.1443811058998108
2020-04-06 21:32:44,160 - validation loss: 0.0470037256591022
2020-04-06 21:47:08,922 - [STEP 2000] average number of steps per second: 1.2
2020-04-06 21:48:48,752 - validation roc_auc: 0.8042294552957752
2020-04-06 21:48:48,753 - validation average_precision: 0.06160666347778087
2020-04-06 21:48:50,699 - training loss: 0.010772843845188618
2020-04-06 21:48:50,700 - validation loss: 0.04442669021524489
2020-04-06 22:01:39,554 - [STEP 3000] average number of steps per second: 1.3
2020-04-06 22:03:19,367 - validation roc_auc: 0.9054715232369117
2020-04-06 22:03:19,369 - validation average_precision: 0.2898709659797808
2020-04-06 22:03:21,283 - training loss: 0.011397086083889008
2020-04-06 22:03:21,284 - validation loss: 0.03605182409705594



In [ ]: