Overview of kvector features


In [1]:
import kvector


Read HOMER Motifs

Read a HOMER motif file into a pandas series keyed by motif name, where each value is a pandas dataframe holding one position weight matrix (PWM).


In [2]:
motifs = kvector.read_motifs('kvector/tests/data/example_rbps.motif', residues='ACGT')
motifs.head()


Out[2]:
M001_0.6_A1CF_ENSG00000148584_Homo_sapiens\tM001_0.6_A1CF_ENSG00000148584_Homo_sapiens\t5.0                                          A         C         G         T
0  0...
M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens\tM002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens\t5.0                                    A         C         G         T
0  0...
M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster\tM003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster\t5.0              A         C         G         T
0  0...
M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\tM004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t5.0                                    A         C         G         T
0  0...
dtype: object
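
To make the structure of the result concrete, here is a minimal, pandas-free sketch of how HOMER-style motif text could be parsed. It is not kvector's implementation. The function name parse_homer_motifs and the header "motif_a" are hypothetical, and the two matrix rows are copied from the BRUNOL4 motif shown in this notebook. The sketch assumes each motif starts with a ">" header line followed by one whitespace-separated row of probabilities per position.

```python
def parse_homer_motifs(text, residues="ACGT"):
    """Parse HOMER-style motif text into {header: [row, ...]}.

    Each row is a dict mapping a residue to its probability at that position.
    The full header line (minus the leading '>') is used as the motif name.
    """
    motifs = {}
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:]
            motifs[current] = []
        elif current is not None:
            values = [float(x) for x in line.split()]
            motifs[current].append(dict(zip(residues, values)))
    return motifs


# Illustrative input only: a hypothetical header with two rows of real values.
example_text = (
    ">motif_a\tmotif_a\t5.0\n"
    "0.085063\t0.085063\t0.175952\t0.653921\n"
    "0.013046\t0.013046\t0.776577\t0.197330\n"
)

parsed = parse_homer_motifs(example_text)
for name, rows in parsed.items():
    print(repr(name), len(rows), "positions")
```

Using the whole header line as the key mirrors the tab-containing motif names seen in Out[2].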

You can access individual motifs with the usual pandas indexing:


In [3]:
# the motif at position 3 (counting from 0), i.e. the fourth motif
motifs.iloc[3]


Out[3]:
A C G T
0 0.085063 0.085063 0.175952 0.653921
1 0.013046 0.013046 0.776577 0.197330
2 0.013046 0.013046 0.013046 0.960861
3 0.013046 0.013046 0.764576 0.209331
4 0.013046 0.013046 0.104634 0.869273
5 0.013046 0.013046 0.666799 0.307108
6 0.083101 0.083101 0.264548 0.569250

In [4]:
# Specific motif name
motifs['M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\tM004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t5.0']


Out[4]:
A C G T
0 0.085063 0.085063 0.175952 0.653921
1 0.013046 0.013046 0.776577 0.197330
2 0.013046 0.013046 0.013046 0.960861
3 0.013046 0.013046 0.764576 0.209331
4 0.013046 0.013046 0.104634 0.869273
5 0.013046 0.013046 0.666799 0.307108
6 0.083101 0.083101 0.264548 0.569250

Convert motifs to kmer vectors

Comparing motifs represented as position-specific weight matrices requires aligning them first. Instead, you can convert each motif to a vector of kmers, where the value for each kmer is the score of that kmer in the motif.

Citation: Xu and Su, PLoS Computational Biology (2010)


In [6]:
motif_kmer_vectors = kvector.motifs_to_kmer_vectors(
    motifs, residues='ACGT', kmer_lengths=(3, 4))
motif_kmer_vectors


Out[6]:
M001_0.6_A1CF_ENSG00000148584_Homo_sapiens M001_0.6_A1CF_ENSG00000148584_Homo_sapiens 5.0 M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens 5.0 M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster 5.0 M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens 5.0
AAA 0.442114 0.310285 0.012428 0.022518
AAC 0.301068 0.278607 0.012428 0.022518
AAG 0.323372 0.273450 0.133575 0.134406
AAT 0.424485 0.271529 0.212117 0.207887
ACA 0.312890 0.301837 0.012428 0.022518
ACC 0.171844 0.270159 0.012428 0.022518
ACG 0.194148 0.265001 0.133575 0.134406
ACT 0.295261 0.263081 0.212117 0.207887
AGA 0.312890 0.360837 0.174895 0.173211
AGC 0.171844 0.329159 0.174895 0.173211
AGG 0.194148 0.324001 0.296043 0.285099
AGT 0.295261 0.322080 0.374585 0.358580
ATA 0.506726 0.273832 0.177188 0.187763
ATC 0.365680 0.242154 0.177188 0.187763
ATG 0.387985 0.236997 0.298336 0.299650
ATT 0.489097 0.235076 0.376878 0.373132
CAA 0.293569 0.263778 0.012428 0.022518
CAC 0.152523 0.232100 0.012428 0.022518
CAG 0.174827 0.226943 0.133575 0.134406
CAT 0.275940 0.225022 0.212117 0.207887
CCA 0.164345 0.255330 0.012428 0.022518
CCC 0.023298 0.223652 0.012428 0.022518
CCG 0.045603 0.218495 0.133575 0.134406
CCT 0.146715 0.216574 0.212117 0.207887
CGA 0.164345 0.314330 0.174895 0.173211
CGC 0.023298 0.282652 0.174895 0.173211
CGG 0.045603 0.277494 0.296043 0.285099
CGT 0.146715 0.275573 0.374585 0.358580
CTA 0.358181 0.227325 0.177188 0.187763
CTC 0.217135 0.195647 0.177188 0.187763
... ... ... ... ...
TGAG 0.242964 0.274452 0.336151 0.345814
TGAT 0.337757 0.272651 0.371112 0.355464
TGCA 0.172563 0.239641 0.233917 0.240919
TGCC 0.100906 0.271367 0.233917 0.240919
TGCG 0.121816 0.266532 0.336151 0.345814
TGCT 0.216609 0.264731 0.371112 0.355464
TGGA 0.172563 0.233529 0.342024 0.334473
TGGC 0.100906 0.265255 0.342024 0.334473
TGGG 0.121816 0.260420 0.444258 0.439368
TGGT 0.216609 0.258619 0.479219 0.449018
TGTA 0.293710 0.213386 0.371231 0.384318
TGTC 0.222054 0.245112 0.371231 0.384318
TGTG 0.242964 0.240277 0.473465 0.489213
TGTT 0.337757 0.238476 0.508427 0.498863
TTAA 0.414858 0.124713 0.276984 0.277042
TTAC 0.343201 0.156439 0.276984 0.277042
TTAG 0.364112 0.151604 0.379219 0.381936
TTAT 0.458905 0.149803 0.414180 0.391587
TTCA 0.293710 0.116793 0.276984 0.277042
TTCC 0.222054 0.148519 0.276984 0.277042
TTCG 0.242964 0.143684 0.379219 0.381936
TTCT 0.337757 0.141883 0.414180 0.391587
TTGA 0.293710 0.110682 0.385091 0.370596
TTGC 0.222054 0.142408 0.385091 0.370596
TTGG 0.242964 0.137572 0.487326 0.475491
TTGT 0.337757 0.135771 0.522287 0.485141
TTTA 0.414858 0.090538 0.414299 0.420441
TTTC 0.343201 0.122264 0.414299 0.420441
TTTG 0.364112 0.117429 0.516533 0.525335
TTTT 0.458905 0.115628 0.551494 0.534986

320 rows × 4 columns
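
The 320 rows are the 64 possible 3-mers plus the 256 possible 4-mers. The sketch below shows one scoring scheme that reproduces several values in the BRUNOL4 column above (for example AAA, TGTG and TTTT): slide the kmer along the PWM, sum the probabilities of its residues in each window, and average over all windows and positions. This was inferred from the numbers shown here, not taken from kvector's source, so treat it as an assumption. The names kmer_score and motif_to_kmer_vector are hypothetical.

```python
from itertools import product

# PWM of the BRUNOL4 motif, copied from Out[4]; one dict per position.
PWM = [
    {"A": 0.085063, "C": 0.085063, "G": 0.175952, "T": 0.653921},
    {"A": 0.013046, "C": 0.013046, "G": 0.776577, "T": 0.197330},
    {"A": 0.013046, "C": 0.013046, "G": 0.013046, "T": 0.960861},
    {"A": 0.013046, "C": 0.013046, "G": 0.764576, "T": 0.209331},
    {"A": 0.013046, "C": 0.013046, "G": 0.104634, "T": 0.869273},
    {"A": 0.013046, "C": 0.013046, "G": 0.666799, "T": 0.307108},
    {"A": 0.083101, "C": 0.083101, "G": 0.264548, "T": 0.569250},
]


def kmer_score(pwm, kmer):
    """Mean per-position probability of `kmer`, averaged over every window."""
    k = len(kmer)
    n_windows = len(pwm) - k + 1
    total = 0.0
    for start in range(n_windows):
        total += sum(pwm[start + i][residue] for i, residue in enumerate(kmer))
    return total / (n_windows * k)


def motif_to_kmer_vector(pwm, kmer_lengths=(3, 4), residues="ACGT"):
    """Score every possible kmer of the requested lengths against one PWM."""
    return {
        "".join(letters): kmer_score(pwm, "".join(letters))
        for k in kmer_lengths
        for letters in product(residues, repeat=k)
    }


vector = motif_to_kmer_vector(PWM)
for kmer in ("AAA", "TGTG", "TTTT"):
    print(kmer, round(vector[kmer], 6))
```

Because every motif is scored against the same fixed set of kmers, the resulting vectors have equal length and can be compared directly, with no alignment step.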

Count kmers in fasta files

You may also want to simply count how many times each DNA word (kmer) occurs in a fasta file. count_kmers does just that, returning a pandas dataframe with one column per kmer.


In [9]:
kmer_vector = kvector.count_kmers('kvector/tests/data/example.fasta', kmer_lengths=(3, 4))
kmer_vector.head()


Out[9]:
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC ... TTCG TTCT TTGA TTGC TTGG TTGT TTTA TTTC TTTG TTTT
0 2 3 1 0 0 2 1 2 1 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 3 1 0 0 2 1 2 1 0 ... 0 0 0 0 0 0 0 0 0 0
2 6 5 6 0 0 3 2 4 3 5 ... 2 0 0 0 0 0 0 0 0 0
3 6 6 4 1 12 2 0 5 3 9 ... 0 0 0 1 2 0 0 1 0 1
4 19 9 1 7 7 8 0 8 4 1 ... 0 2 1 1 1 3 1 4 2 11

5 rows × 320 columns
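
For intuition, counting kmers in a single sequence can be sketched in plain Python as below. This is an illustration, not kvector's implementation. The function name count_kmers_in_sequence is hypothetical, and the sketch assumes overlapping occurrences are counted and that windows containing other characters (such as N) are skipped. Whether count_kmers makes the same choices is not shown in this notebook.

```python
from itertools import product


def count_kmers_in_sequence(sequence, kmer_lengths=(3, 4), residues="ACGT"):
    """Count overlapping occurrences of every possible kmer in one sequence."""
    # Start every possible kmer at zero so all sequences share the same keys.
    counts = {
        "".join(letters): 0
        for k in kmer_lengths
        for letters in product(residues, repeat=k)
    }
    sequence = sequence.upper()
    for k in kmer_lengths:
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if kmer in counts:  # ignore windows with unexpected characters
                counts[kmer] += 1
    return counts


counts = count_kmers_in_sequence("AAAACGT")
print(len(counts), counts["AAA"], counts["ACGT"])
```

Initialising every kmer to zero is what gives each sequence a vector of the same length (320 entries for kmer lengths 3 and 4), in the same sorted order as the columns above.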

Since this is a pandas dataframe, you can do convenient things like compute the mean and standard deviation of each kmer's counts.


In [10]:
kmer_vector.mean()


Out[10]:
AAA      7.0
AAC      5.2
AAG      2.6
AAT      1.6
ACA      3.8
ACC      3.4
ACG      0.8
ACT      4.2
AGA      2.4
AGC      3.0
AGG      2.4
AGT      1.4
ATA      0.6
ATC      1.6
ATG      2.0
ATT      1.2
CAA      3.6
CAC      4.8
CAG      3.2
CAT      2.0
CCA      2.6
CCC     10.8
CCG      2.8
CCT      6.2
CGA      0.6
CGC      1.4
CGG      3.2
CGT      1.2
CTA      2.2
CTC      6.8
        ... 
TGAG     0.6
TGAT     0.2
TGCA     2.2
TGCC     0.4
TGCG     0.8
TGCT     0.6
TGGA     1.2
TGGC     0.4
TGGG     0.4
TGGT     1.0
TGTA     0.6
TGTC     0.8
TGTG     0.4
TGTT     0.4
TTAA     0.4
TTAC     0.4
TTAG     0.2
TTAT     0.2
TTCA     0.6
TTCC     0.6
TTCG     0.4
TTCT     0.4
TTGA     0.2
TTGC     0.4
TTGG     0.6
TTGT     0.6
TTTA     0.2
TTTC     1.0
TTTG     0.4
TTTT     2.4
dtype: float64

In [11]:
kmer_vector.std()


Out[11]:
AAA     7.000000
AAC     2.489980
AAG     2.302173
AAT     3.049590
ACA     5.495453
ACC     2.607681
ACG     0.836660
ACT     2.489980
AGA     1.341641
AGC     3.937004
AGG     0.894427
AGT     1.516575
ATA     0.894427
ATC     2.607681
ATG     2.738613
ATT     2.683282
CAA     3.435113
CAC     4.969909
CAG     2.949576
CAT     2.828427
CCA     1.516575
CCC     4.381780
CCG     2.588436
CCT     1.303840
CGA     1.341641
CGC     1.140175
CGG     1.643168
CGT     1.643168
CTA     0.447214
CTC     1.643168
          ...   
TGAG    0.894427
TGAT    0.447214
TGCA    3.033150
TGCC    0.894427
TGCG    0.836660
TGCT    0.894427
TGGA    1.095445
TGGC    0.547723
TGGG    0.547723
TGGT    1.414214
TGTA    0.894427
TGTC    0.447214
TGTG    0.894427
TGTT    0.894427
TTAA    0.894427
TTAC    0.894427
TTAG    0.447214
TTAT    0.447214
TTCA    1.341641
TTCC    0.894427
TTCG    0.894427
TTCT    0.894427
TTGA    0.447214
TTGC    0.547723
TTGG    0.894427
TTGT    1.341641
TTTA    0.447214
TTTC    1.732051
TTTG    0.894427
TTTT    4.827007
dtype: float64
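
Note that pandas computes the sample standard deviation by default (ddof=1), not the population standard deviation. The sketch below checks this with the standard library, using the five AAA counts visible in the head() output above. It assumes those five rows are the whole table, which is consistent with the reported AAA mean of 7.0.

```python
import statistics

# AAA counts for rows 0-4, copied from the head() output above.
aaa_counts = [2, 2, 6, 6, 19]

mean_aaa = statistics.mean(aaa_counts)
sample_std_aaa = statistics.stdev(aaa_counts)       # divides by n - 1
population_std_aaa = statistics.pstdev(aaa_counts)  # divides by n

print("mean:", mean_aaa)
print("sample std:", sample_std_aaa)
print("population std:", population_std_aaa)
```

The sample standard deviation matches the AAA value reported by kmer_vector.std(), while the population version is smaller.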