By Roman Sasik (rsasik@ucsd.edu) This Notebook describes the sequence of commands used in TCR-seq analysis.
The multiplexing barcodes are assumed to follow the design described in this paper: "Linking T-cell receptor sequence to functional phenotype at the single-cell level", A Han, J Glanville and MD Davis, Nature Biotechnology, 2014, 32 (7), p.684-92
In addition to original perl scripts below, you need to install the superfast TCR repertoir processing java program mitcr.jar
, which can be downloaded at http://mitcr.milaboratory.com/. The relevant paper is MiTCR: software for T-cell receptor sequencing data analysis by DA Bolotin et al., Nature Methods 10, 813-814 (2013).
Perl and java are assumed to be installed.
Processing starts with demultiplexing the reads from a single pair of large fastq files:
In [22]:
!perl demultiplex_fastq_TCRplates.pl Sample_S1_L001_R1_001.fastq Sample_S1_L001_R2_001.fastq
In [23]:
!ls *[A,B].fastq
This script demultiplexes reads multiplexed in a single pair of large fastq files and saves them into separate fastq files whose names indicate Plate, Well, and TCR isoform (A or B), for instance 01H12B.fastq. Up to one mismatch is allowed in any of the Plate, Well Row, Well Column, and TCR Isoform barcodes.
It will create 2x96 files (one per TCR isoform) per each Plate (a lot of files!)
This script will ignore all reads from plates whose code is commented out (see below in source code). This is useful when there is a mixture of TCR genotyping reads and phenotyping reads.
There is a separate demultiplex script for the phenotyping reads (see below). This is demultiplex_fastq_TCRplates.pl
:
In [ ]:
#!/usr/bin/perl
$fileR1 = $ARGV[0];
$fileR2 = $ARGV[1];
open(F1,$fileR1);
open(F2,$fileR2);
%plate = (
"GCAGA" => "01", #uncomment this line if plate code 01 is among the sequences to be demultiplexed
# "TCGAA" => "02",
# "AACAA" => "03",
# "GGTGC" => "04",
# "TTGGT" => "05",
# "CATTC" => "06",
# "ATTGG" => "07",
# "CGGTT" => "08",
# "ATCCT" => "09",
# "ATGTC" => "10",
# "TCACG" => "11",
# "AGACC" => "12",
# "CCCCA" => "13",
# "GCGCT" => "14",
# "TCCTT" => "15",
# "TATAT" => "16",
# "CGTAA" => "17",
# "AAGGT" => "18",
# "AGCTC" => "19",
# "CTTGC" => "20",
# "GTATC" => "21",
# "TATGA" => "22",
# "CACAC" => "23",
# "ACACT" => "24",
# "ACTAC" => "25",
# "GTTAC" => "26",
);
%row = ( #if you want output for all rows, leave them all uncommented
"TAAGC" => "A",
"TGCAC" => "B",
"CTCAG" => "C",
"GGAAT" => "D",
"CGAGG" => "E",
"AGGAG" => "F",
"TGTTG" => "G",
"CAACT" => "H",
);
%col = ( #if you want output for all columns, leave them all uncommented
"GTTCA" => "01",
"CAGGA" => "02",
"TTATA" => "03",
"CCTGT" => "04",
"ACCGC" => "05",
"ACTTA" => "06",
"GCTAG" => "07",
"GACGT" => "08",
"GGCTA" => "09",
"GAATG" => "10",
"CCAAC" => "11",
"GAGAC" => "12",
);
%TCR = (
"GTCAC" => "A", # TCRA
"GAGAT" => "B",
);
foreach $plateID (keys(%plate)) {
foreach $rowID (keys(%row)) {
foreach $colID (keys(%col)) {
foreach $TCRID (keys(%TCR)) {
$fh = $plate{$plateID}.$row{$rowID}.$col{$colID}.$TCR{$TCRID};
open $fh, '>', $fh.".fastq"; #open file for writing at the end
}
}
}
}
while($A1 = <F1>) { #read 4 lines from R1 and 4 lines from R2
$A2 = <F1>;
$A3 = <F1>;
$A4 = <F1>;
$B1 = <F2>;
$B2 = <F2>;
$B3 = <F2>;
$B4 = <F2>;
$ID = substr($A2, 2, 5); #plate ID barcode
# now find what the true bar code should have been if imperfect match
$score = 0;
$trueID = "";
foreach $key (keys(%plate)) {
my $count = ($ID^$key) =~ tr/\0//;
if ($count > $score) {
$score = $count;
$trueID = $key
}
}
if ($score >= 4) {#accept $true_plateID as the true plate ID
$rowID = $trueID;
} else {#leave $plateID blank - sequence won't be output
$rowID = ""
}
$ID = substr($B2, 2, 5); #column ID
# now find what the true bar code should have been if imperfect match
$score = 0;
$trueID = "";
foreach $key (keys(%col)) {
my $count = ($ID^$key) =~ tr/\0//;
if ($count > $score) {
$score = $count;
$trueID = $key
}
}
if ($score >= 4) {#accept $true_plateID as the true plate ID
$colID = $trueID;
} else {#leave $plateID blank - sequence won't be output
$colID = ""
}
$ID = substr($B2, 7, 5); #TCR ID
# now find what the true bar code should have been if imperfect match
$score = 0;
$trueID = "";
foreach $key (keys(%TCR)) {
my $count = ($ID^$key) =~ tr/\0//;
if ($count > $score) {
$score = $count;
$trueID = $key
}
}
if ($score >= 4) {
$TCRID = $trueID;
} else {
$TCRID = ""
}
if (exists $plate{$plateID} and exists $row{$rowID} and exists $col{$colID} and exists $TCR{$TCRID}) {
$fh = $plate{$plateID}.$row{$rowID}.$col{$colID}.$TCR{$TCRID};
print $fh $A1.$A2.$A3.$A4.$B1.$B2.$B3.$B4;
};
}
close(F1);
close(F2);
After demultiplexing, each individual fastq file will be processed by mitcr
. The output is a separate result file for each well, e.g., 01A06A_result.txt
. The example below will produce reports for plate 01, row A and columns 06 through 09 (see source code below).
In [24]:
!perl analyze_wells.pl
!ls *_result.txt
The output is a tab-delimited file whose main components are these (this is the content of file 01A06A_result.txt):
The first column is the number of times this sequence is seen; the second column is the fraction (not a percentage) of the total count of sequences in the well. This is especially useful when there are two species of TCRA expressed in a single cell (as in this case). It does not happen with TCRB.The v- j- and d- alleles of the TCR are listed. The last two lines (a tiny fraction of the number of reads) are a result of sequencing/PCR errors. The program mitcr has an error-checking algorithm that reduces these calls. For details see MiTCR: software for T-cell receptor sequencing data analysis by DA Bolotin et al., Nature Methods 10, 813-814 (2013).
This is the source of analyze_wells.pl
:
In [ ]:
#!/usr/bin/perl
%plate = (
"GCAGA" => "01",
# "TCGAA" => "02",
# "AACAA" => "03",
# "GGTGC" => "04",
# "TTGGT" => "05",
# "CATTC" => "06",
# "ATTGG" => "07",
# "CGGTT" => "08",
# "ATCCT" => "09",
# "ATGTC" => "10",
# "TCACG" => "11",
# "AGACC" => "12",
# "CCCCA" => "13",
# "GCGCT" => "14",
# "TCCTT" => "15",
# "TATAT" => "16",
# "CGTAA" => "17",
# "AAGGT" => "18",
# "AGCTC" => "19",
# "CTTGC" => "20",
# "GTATC" => "21",
# "TATGA" => "22",
# "CACAC" => "23",
# "ACACT" => "24",
# "ACTAC" => "25",
# "GTTAC" => "26",
);
%row = ( #uncomment line if you want output for row A, etc.
"TAAGC" => "A",
# "TGCAC" => "B",
# "CTCAG" => "C",
# "GGAAT" => "D",
# "CGAGG" => "E",
# "AGGAG" => "F",
# "TGTTG" => "G",
# "CAACT" => "H",
);
%col = ( #uncomment line if you want output for column 01, etc.
# "GTTCA" => "01",
# "CAGGA" => "02",
# "TTATA" => "03",
# "CCTGT" => "04",
# "ACCGC" => "05",
"ACTTA" => "06",
"GCTAG" => "07",
"GACGT" => "08",
"GGCTA" => "09",
# "GAATG" => "10",
# "CCAAC" => "11",
# "GAGAC" => "12",
);
%TCR = (
"GTCAC" => "A", # TCRA
"GAGAT" => "B",
);
foreach $plateID (sort (keys(%plate))) {
foreach $rowID (sort (keys(%row))) {
foreach $colID (sort (keys(%col))) {
foreach $TCRID (sort (keys(%TCR))) {
$fh = $plate{$plateID}.$row{$rowID}.$col{$colID}.$TCR{$TCRID};
print "$fh\n";
system("java -Xmx10g -jar ./mitcr.jar -pset flex -gene TR$TCR{$TCRID} $fh.fastq $fh\_result.txt")
}
}
}
}
The following command demultiplexes phenotyping reads multiplexed in a single pair of large fastq files and saves them into separate fastq files whose names indicate Plate, Well, and "R1" or "R2" for left or right read, for instance 03H12R1.fastq. Up to one mismatch is allowed in any of the Plate, Well Row, or Well Column barcodes.
It will create 2x96 files per each Plate.
This script will ignore all reads from plates whose code is commented out (see below in source code). This is useful when there is a mixture of TCR genotyping reads and phenotyping reads.
In [25]:
!perl demultiplex_fastq_phenoplates.pl Sample_S1_L001_R1_001.fastq Sample_S1_L001_R2_001.fastq
!ls 03*.fastq
The source code of demultiplex_fastq_phenoplates.pl is here (in this example, Plate 03 contains phenotyping reads):
In [ ]:
#!/usr/bin/perl
$fileR1 = $ARGV[0];
$fileR2 = $ARGV[1];
open(F1,$fileR1);
open(F2,$fileR2);
%plate = (
# "GCAGA" => "01",
# "TCGAA" => "02",
"AACAA" => "03",
# "GGTGC" => "04",
# "TTGGT" => "05",
# "CATTC" => "06",
);
%row = (
"TAAGC" => "A",
"TGCAC" => "B",
"CTCAG" => "C",
"GGAAT" => "D",
"CGAGG" => "E",
"AGGAG" => "F",
"TGTTG" => "G",
"CAACT" => "H",
);
%col = (
"GTTCA" => "01",
"CAGGA" => "02",
"TTATA" => "03",
"CCTGT" => "04",
"ACCGC" => "05",
"ACTTA" => "06",
"GCTAG" => "07",
"GACGT" => "08",
"GGCTA" => "09",
"GAATG" => "10",
"CCAAC" => "11",
"GAGAC" => "12",
);
foreach $plateID (keys(%plate)) {
foreach $rowID (keys(%row)) {
foreach $colID (keys(%col)) {
$fh = $plate{$plateID}.$row{$rowID}.$col{$colID};
$fh1 = $plate{$plateID}.$row{$rowID}.$col{$colID}."1";
$fh2 = $plate{$plateID}.$row{$rowID}.$col{$colID}."2";
open $fh1, '>', $fh."R1.fastq";
open $fh2, '>', $fh."R2.fastq";
}
}
}
while($A1 = <F1>) { #read 4 lines from R1 and 4 lines from R2
$A2 = <F1>;
$A3 = <F1>;
$A4 = <F1>;
$B1 = <F2>;
$B2 = <F2>;
$B3 = <F2>;
$B4 = <F2>;
# now find out if the bar codes make sense
$ID = substr($A2, 2, 5); #plate ID
# now find what the true bar code should have been if imperfect match
$score = 0;
$trueID = "";
foreach $key (keys(%plate)) {
my $count = ($ID^$key) =~ tr/\0//;
if ($count > $score) {
$score = $count;
$trueID = $key
}
}
if ($score >= 4) {#accept $true_plateID as the true plate ID
$plateID = $trueID;
} else {#leave $plateID blank - sequence won't be output
$plateID = ""
}
$ID = substr($A2, 9, 5); #row ID
# now find what the true bar code should have been if imperfect match
$score = 0;
$trueID = "";
foreach $key (keys(%row)) {
my $count = ($ID^$key) =~ tr/\0//;
if ($count > $score) {
$score = $count;
$trueID = $key
}
}
if ($score >= 4) {
$rowID = $trueID;
} else {
$rowID = ""
}
$ID = substr($B2, 2, 5); #column ID
# now find what the true bar code should have been if imperfect match
$score = 0;
$trueID = "";
foreach $key (keys(%col)) {
my $count = ($ID^$key) =~ tr/\0//;
if ($count > $score) {
$score = $count;
$trueID = $key
}
}
if ($score >= 4) {
$colID = $trueID;
} else {
$colID = ""
}
if (exists $plate{$plateID} and exists $row{$rowID} and exists $col{$colID} ) {
$fh1 = $plate{$plateID}.$row{$rowID}.$col{$colID}."1";
$fh2 = $plate{$plateID}.$row{$rowID}.$col{$colID}."2";
print $fh1 $A1.$A2.$A3.$A4;
print $fh2 $B1.$B2.$B3.$B4;
};
}
close(F1);
close(F2);
The following command will produce expression counts for all 17 cytokines and TF's, separately for each well:
In [26]:
!perl count_cytokines.pl
!ls *.count
The output is a set of tab-delimited files such as 03F03R1.count. Only the R1 read is used for counting; the R2 read is redundant (and lower quality anyway). The content of this file looks something close to this:
The source code of count_cytokines.pl is here (Plate 03 has pheno reads):
In [ ]:
#!/usr/bin/perl
%plate = (
# "GCAGA" => "01",
# "TCGAA" => "02",
"AACAA" => "03",
# "GGTGC" => "04",
# "TTGGT" => "05",
# "CATTC" => "06",
);
%row = (
"TAAGC" => "A",
"TGCAC" => "B",
"CTCAG" => "C",
"GGAAT" => "D",
"CGAGG" => "E",
"AGGAG" => "F",
"TGTTG" => "G",
"CAACT" => "H",
);
%col = (
"GTTCA" => "01",
"CAGGA" => "02",
"TTATA" => "03",
"CCTGT" => "04",
"ACCGC" => "05",
"ACTTA" => "06",
"GCTAG" => "07",
"GACGT" => "08",
"GGCTA" => "09",
"GAATG" => "10",
"CCAAC" => "11",
"GAGAC" => "12",
);
%cyt = (
"GCCGGAGGAGGTGGATGTGC" => "GATA3",
"CCCAACACAGGAGCGCACTG" => "TBET",
"GGCAGCCAAGGCCCTGTCGT" => "FOXP3",
"AGAGGAAGTCCATGTGGGAG" => "RORC",
"GCGAGCTGGTGCGCACCGAC" => "RUNX1",
"GGACCACGCAGGCGAGCTCG" => "RUNX3",
"CCTACACGGCCCCACCTGCC" => "BCL6",
"CCACAGAACTGAAACATCTT" => "IL2",
"CCCAAGCTGAGAACCAAGAC" => "IL10",
"AGACCTCTTTTATGATGGCC" => "IL12A",
"GGTATGGAGCATCAACCTGA" => "IL13",
"CAACCTGAACATCCATAACC" => "IL17A",
"GGGTTCTCTTGGCTGTTACT" => "IFNG",
"GGAGGCGCTCCCCAAGAAGA" => "TNFA",
"CCGAGAAGCGGTACCTGAAC" => "TGFB",
"GCCAACTTTGCAGCCCAGAA" => "PRF1",
"CCACAATATCAAAGAACAGG" => "GZMB",
);
foreach $plateID (sort (keys(%plate))) {
foreach $rowID (sort (keys(%row))) {
foreach $colID (sort (keys(%col))) {
$fh = $plate{$plateID}.$row{$rowID}.$col{$colID};
open(F1,$fh."R1.fastq");
open $fh, '>', $fh."R1.count";
print $fh "\t$fh\n"; #print header
# zero out counters
foreach $key (keys(%cyt)) {$count{$cyt{$key}} = 0};
while($A1 = <F1>) { #read 4 lines from R1 and 4 lines from R2
$A2 = <F1>;
$A3 = <F1>;
$A4 = <F1>;
# now find out if the bar codes make sense
$seq = substr($A2, 36, 20);
if (exists $cyt{$seq}) {$count{$cyt{$seq}}++}; #add to count
};
foreach $key (keys(%cyt)) {
print $fh $cyt{$key}."\t".$count{$cyt{$key}}."\n"
};
close(F1);
close($fh);
}
}
}
Cleanup after exercize:
In [27]:
!rm 0*