DSPR RIL Allele Table

I have made allele tables for the DSPR founder strains, but want to look at the RILS and see if there is any interesting patterns. The DSPR population has 15 founder strains from around the world. These were mixed for 50 generations in two subpopulations, RILs were created by inbreeding for an additional 25 gnerations, for a total of 1700 RILs. In other words, each RIL is a combination of alleles from up to 8 founder strains.

We are arguing that the DSPR does not add any genes to the SD pathway, because it has less allelic variation than the CEGS population. As reviewer 1 points out, it is really the combination of alleles that matters. Are genes in the SD made up of a few combinations of founder alleles? Or is each RIL a unique combination of alleles?

This notebook is trying to get to the bottom of these question.


In [ ]:
%run 'ipython_startup.py'

Figure out RIL make up

First I want to count to estimate the number of haplotypes present by looking at SNPs in the coding sequence. I am guessing SNP density for the RILs is low because of the expense, so later I will use King's founder assignment of 10kb blocks.

RIL SNP table

King et al. 2014 provide SNPs for each RIL.

Allelic varaition in the coding sequence of genes in SD

Here I import SNP information from King et al. and pull out genomic regions that correspond to coding sequence for genes in the sex hierarchy. I then count the number of unique haplotypes present across the RIL population.


In [ ]:
import bed as mcbed
from collections import defaultdict

In [ ]:
# Import bed with CDS for genes in sd
bedName = os.path.join(PROJ, 'exported_data/sd_coding_sequence.bed')
bed = mcbed.Bed(bedName)

# Create data frame with positions of interest (poi)
poi = list()
for row in bed:
    start = min(row[1], row[2])
    end = max(row[1], row[2])
    poi.append(pd.DataFrame([(row[3], row[0], x) for x in xrange(start, end)], columns=['gene', 'chrom', 'pos']))

dfPoi = pd.concat(poi)

In [ ]:
# Create a list of RILs used in the F1-hybrid population
hybName = os.path.join(MCLAB, 'dspr_data/mel_expression_head_F/FemaleHeadExpression.txt')
RILs = list()
with open(hybName, 'r') as FH:
    for row in FH:
        cols = row.strip().split('\t')
        RILs.append(cols[0])
        RILs.append(cols[1])
        
hybRILs = set(RILs)

In [ ]:
# Import RIL SNP calls
fname = '/home/jfear/storage/dspr_variants/RILSNP_R2.txt'

out = list()
with open(fname, 'r') as FH:
    for row in FH:
        chrom, pos, RILID, minor, major, minorCnt, majorCnt = row.strip().split('\t')
        
        if RILID in hybRILs:
            # Only keep the RILs that were in the Expression experiment
            pos = int(pos)
            minorCnt = int(minorCnt)
            majorCnt = int(majorCnt)

            if minorCnt > 0:
                out.append((chrom, pos, RILID, minor))
            else:
                out.append((chrom, pos, RILID, major))
df = pd.DataFrame(out, columns=['chrom', 'pos', 'RILID', 'base'])

In [ ]:
# Merge positions of interest to SNP data
dfFilter = df.merge(dfPoi, how='inner', on=['chrom', 'pos'])

# Make wide data set of filtered data
dfFilter.set_index(['gene', 'chrom', 'pos', 'RILID'], inplace=True)
dfWide = dfFilter.unstack(level='RILID')

# Iterate over gene and count the number of haplotypes
grp = dfWide.groupby(level='gene')
out = list()
for name, g in grp:
    uniq = set([tuple(x) for x in g.T.values])
    out.append([name, len(uniq)])
    
dspr_ril_cds = pd.DataFrame(out, columns=['gene', 'dspr_ril_cds'])
dspr_ril_cds.set_index('gene', inplace=True)

There are a lot of missing values present in this dataset. Unfortunately, I don't have easy access to the reference base, instead I will set missing values to the most frequent base in that row. This has a lot of caveats, but will give me a better estimate of haplotype frequency.


In [ ]:
# Fill missing values with most frequent value
dfFull = dfWide.apply(lambda x: x.fillna(x.mode()[0]), axis=1)

# Iterate over gene and count the number of haplotypes
grp = dfFull.groupby(level='gene')
out = list()
for name, g in grp:
    uniq = set([tuple(x) for x in g.T.values])
    out.append([name, len(uniq)])
    
dspr_ril_cds_no_miss = pd.DataFrame(out, columns=['gene', 'dspr_ril_cds_no_miss'])
dspr_ril_cds_no_miss.set_index('gene', inplace=True)

In [ ]:
# merge results
dspr_ril_cds.join(dspr_ril_cds_no_miss)

Most genes were not present in the SNP calls. The RIL SNP density is very low. Fru is the only gene that appears to have a signal, and its haplotype count is similar to the CEGS data. I think the other counts are underestimated.

Looking at the King et al. paper they use a HMM with a 10kb sliding windows to determine the parent of origin. This may be a better measure.

Parent of origin for each RIL

Here I use King's HMM data to assign parent of origin to each gene in the RIL population. I first identify which 10kb window(s) each gene in the sex hierarchy is in, I then call the RIL genotype based on the HMM results.


In [ ]:
# Import bed with Gene Region
bedName = os.path.join(PROJ, 'exported_data/sd_gene_sequence.bed')
bed = mcbed.Bed(bedName)

In [ ]:
# Create data frame with positions of interest (poi). The HMM results are done on 10kb
# chunks. Here I round the start down to the nearest 10kb and round the end up to the nearest
# 10kb. I then figure out how many 10kb chunks that gene overlapps and split them up.
poi = defaultdict(lambda : defaultdict(list))
for row in bed:
    # If on minus strand, make sure start is the smallest.
    start = min(int(row['start']), int(row['end']))
    end = max(int(row['start']), int(row['end']))
    
    # Round to nearest 10kb
    coordLow = np.floor(start / 10000.0) * 10000
    coordHigh = np.ceil(end / 10000.0) * 10000

    # In a dictionary relate each 10kb chunk to its overlapping genes.
    for pos in range(coordLow.astype(int), coordHigh.astype(int) + 1, 10000):
        poi[row['chrom']][pos].append(row['name'])

In [ ]:
# Parental combos from http://wfitch.bio.uci.edu/~dspr/DatFILES/Release3README.txt
combosA = ['A1A1','A1A2','A1A3','A1A4','A1A5','A1A6','A1A7','A1A8','A2A2','A2A3','A2A4',
          'A2A5','A2A6','A2A7','A2A8','A3A3','A3A4','A3A5','A3A6','A3A7','A3A8','A4A4',
          'A4A5','A4A6','A4A7','A4A8','A5A5','A5A6','A5A7','A5A8','A6A6','A6A7','A6A8',
          'A7A7','A7A8','A8A8']

combosB = ['B1B1', 'B1B2', 'B1B3', 'B1B4', 'B1B5', 'B1B6', 'B1B7', 'B1B8', 'B2B2', 'B2B3', 
           'B2B4', 'B2B5', 'B2B6', 'B2B7', 'B2B8', 'B3B3', 'B3B4', 'B3B5', 'B3B6', 'B3B7', 
           'B3B8', 'B4B4', 'B4B5', 'B4B6', 'B4B7', 'B4B8', 'B5B5', 'B5B6', 'B5B7', 'B5B8', 
           'B6B6', 'B6B7', 'B6B8', 'B7B7', 'B7B8', 'B8B8']

In [ ]:
# Create genotype dictionary from HMM results from flyrils.org
def genotypeDict(fname, genotypes, combos):
    """ Parse the HMM results.
    
    Assigns founder strain combo to each gene based 
    on the 10kb chunks where it is located.
    
    """ 
    with open(fname, 'r') as FH:
        # Iterate over HMM file
        for row in FH:
            cols = row.strip().split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            
            if chrom in poi and pos in poi[chrom]:
                gene = poi[chrom][pos]
                RILID = cols[2]

                # What is the most-likely genotype
                best = np.argmax([float(x) for x in cols[3:39]])
                geno = combos[best]

                # Iterate over genes in the region and assign the genotype
                for g in gene:
                    genotypes[g][RILID].add(geno)

In [ ]:
# Import HMM results population A
genotypesA = defaultdict(lambda : defaultdict(set))
genotypeDict('/home/jfear/storage/dspr_variants/HMMregA_R2.txt', genotypesA, combosA)

# Import HMM results population B
genotypesB = defaultdict(lambda : defaultdict(set))
genotypeDict('/home/jfear/storage/dspr_variants/HMMregB_R2.txt', genotypesB, combosB)

In [ ]:
# Count the number of times a gene had a specific genotype.
def genoCnt(genotypes, combos):
    """ Count the number of times a gene had a specific genotype.
    
    For each gene, look at all of the RILS and count the number of times
    a RIL had a given genotype. 
    
    """
    out = list()

    for gene in genotypes.keys():
        # Create counter Series
        cnt = pd.Series(data=[0]*len(combos), index=combos)
        cnt.name = gene
        for ril in genotypes[gene]:
            for geno in genotypes[gene][ril]:
                cnt[geno] += 1
        out.append(cnt)
    return pd.concat(out, axis=1)

In [ ]:
# Plot the number of times a founder strain is ID in SD genes

# Count the number of RILS had a given founder strain at a location
genoCountsA = genoCnt(genotypesA, combosA)
genoCountsB = genoCnt(genotypesB, combosB)

# Plot these counts
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8))

genoCountsA.plot(kind='bar', stacked=True, ax=ax1, rot=90, title='Pop A', sharex=False, cmap='Set1')
lgd = ax1.get_legend_handles_labels()
ax1.legend(lgd[0], lgd[1], loc='center', bbox_to_anchor=(1.1, 0.5), fontsize=9)

genoCountsB.plot(kind='bar', stacked=True, ax=ax2, rot=90, title='Pop B', sharex=False, cmap='Set1')
lgd = ax2.get_legend_handles_labels()
ax2.legend(lgd[0], lgd[1], loc='center', bbox_to_anchor=(1.1, 0.5), fontsize=9)

plt.tight_layout()

It looks like most genes are homozygous for one of the parental lines, this is not surprising in a RIL population. When only looking at the genes in SD, there is not an equal frequency of use of the different parental strains. (A3A3, A4A4, A7A7) are the most common in the A population, while (B1B1, B2B2, B3B3, B6B6, B7B7) are all frequent in the B population.

Genes are also not evenly distributed, for example most RILs had ps from (A3; A4, B2; B6).

F1-hybrid parent of origin make-up


In [ ]:
# Iterate over each F1-hybrid and Pull genotypes.

# Create list of genes in SD
genes = genotypesA.keys()

# Create a dictionary with F1-hybrid genotypes
hybName = os.path.join(MCLAB, 'dspr_data/mel_expression_head_F/FemaleHeadExpression.txt')
F1Geno = defaultdict(dict)
with open(hybName, 'r') as FH:
    # skip header
    FH.next()
    
    for row in FH:
        cols = row.strip().split('\t')
        F1ID = '_'.join([str(cols[0]), str(cols[1])])
                        
        for gene in genes:
            g1 = list(genotypesA[gene][cols[1]])[0]
            g2 = list(genotypesB[gene][cols[0]])[0]
            F1Geno[gene][F1ID] = (g1, g2)

In [ ]:
# For each gene create a set of pA-pB unique combinations

outGeno = defaultdict(set)
outCnt = list()
for gene in genes:
    for key in F1Geno[gene]:
        outGeno[gene].add(F1Geno[gene][key])
    cnt = len(outGeno[gene])
    outCnt.append((gene, cnt))
dfF1hap = pd.DataFrame(outCnt, columns=['gene', 'haplo_cnt'])
dfF1hap.set_index('gene', inplace=True)
dfF1hap

In [ ]: