In [1]:
# Import the packages that will be useful for this part of the lesson
from collections import OrderedDict, Counter, namedtuple
import pandas as pd
from pprint import pprint
# Small trick to get a larger display
from IPython.display import display, HTML, Image
display(HTML("<style>.container { width:90% !important; }</style>"))
In [ ]:
file = "../data/gencode_sample.gff3"
In [ ]:
# Count all the lines in python
i = 0
with open(file, "r") as fp:
    for line in fp:
        i += 1
print(i, "lines found")
In [ ]:
# Count all the lines in bash
!wc -l {file}
In [ ]:
# Read the first 2 lines in python
with open(file, "r") as fp:
    for _ in range(2):
        print(next(fp).strip())
In [ ]:
# Read the first 2 lines in Bash
!head -n 2 {file}
There is no central database: you have to find the format specification by yourself.
Wikipedia is usually a good starting point, but try to identify the original source, which is usually more up to date.
How are you going to access the field(s) of interest? (you can test that on 1 line before starting with the whole file; see the sketch below)
A real-life file will contain millions of lines and file reading is usually slow. Try to read the file only 1 time, even if you need to parse multiple elements per line.
Do you know how many lines and how many fields per line you will collect?
How are you going to collect the information (dictionary, tuple, list, dataframe...)?
... Now you can parse the file
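As an illustration of the one-line test mentioned above, a minimal sketch (the tab delimiter and a quick look at the first fields are assumptions to adapt to your format):
In [ ]:
# Sketch: inspect the first line to check where the fields of interest are
# before writing the full parser (tab delimiter assumed here)
with open(file, "r") as fp:
    first_line = next(fp)
fields = first_line.strip().split("\t")
print(len(fields), "fields found:", fields)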
The Python standard library contains a very interesting package to simplify the parsing of a file: "collections" (see the documentation for detailed information)
This module implements specialized container datatypes providing alternatives to Python’s native data structures (dict, list, set...)
Collection | Functionality |
---|---|
Counter | Dict subclass for counting hashable objects |
OrderedDict | Dict subclass that remembers the order entries were added |
namedtuple | Factory function for creating tuple subclasses with named fields |
deque | List-like container with fast appends and pops on either end |
ChainMap | Dict-like class for creating a single view of multiple mappings |
defaultdict | Dict subclass that calls a factory function to supply missing values |
UserDict | Wrapper around dictionary objects for easier dict subclassing |
UserList | Wrapper around list objects for easier list subclassing |
UserString | Wrapper around string objects for easier string subclassing |
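As a quick illustration of one of the containers not covered in detail in this lesson, a minimal deque sketch:
In [ ]:
from collections import deque
# A deque supports fast appends and pops on both ends
d = deque([1, 2, 3])
d.appendleft(0)     # add to the left end
d.append(4)         # add to the right end
print(d)            # deque([0, 1, 2, 3, 4])
print(d.popleft())  # remove and return the leftmost element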
A Counter container is provided to support convenient and rapid counting of specific occurrences.
See also defaultdict for a generalization to value types other than integers.
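For instance, a minimal defaultdict sketch, here with a list factory to group words by length instead of counting them:
In [ ]:
from collections import defaultdict
# defaultdict calls the factory (here list) to supply a value for missing keys,
# so there is no need to test for key existence before appending
d = defaultdict(list)
for word in "a small cat and a big dog".split():
    d[len(word)].append(word)
dict(d)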
Example: counting words in a text
In [ ]:
random_text = """Ukip is likely to be asked to repay tens of thousands of euros by European parliament finance chiefs
who have accused the party of misspending EU funds on party workers and Nigel Farage’s failed bid to win a seat in
Westminster.The Alliance for Direct Democracy in Europe (ADDE), a Ukip-dominated political vehicle, will be asked to
repay €173,000 (£148,000) in misspent funds and denied a further €501,000 in EU grants for breaking European rules
that ban spending EU money on national election campaigns and referendums. According to a European parliament audit
report seen by the Guardian, Ukip spent EU funds on polling and analysis in constituencies where they hoped to win a
seat in the 2015 general election, including the South Thanet seat that party leader Farage contested. The party also
funded polls to gauge the public mood on leaving the EU, months before the official campaign kicked off in April 2016"""
In [ ]:
# Example with a Counter
c = Counter()
# Iterate over each word of the string
for word in random_text.split():
    # Increment the counter for the current element
    c[word.lower()] += 1
# Order by most frequent element
c.most_common(10)
In [ ]:
# Same thing but with native dict
d = {}
# Iterate over each word of the string
for word in random_text.split():
    word = word.lower()
    # If the element is not in the dict we have to create an entry first
    if word not in d:
        d[word] = 0
    # Increment the counter for the current element
    d[word] += 1
# Order by most frequent element
sorted(d.items(), key=lambda t: t[1], reverse=True)[:10]
In a standard Python dictionary, the order of the elements is not guaranteed and can change between 2 successive calls (note: since Python 3.7, the built-in dict does preserve insertion order). This can be a problem, particularly if the order of elements in a parsed file matters (fastq, fasta...)
Ordered dictionaries are just like regular dictionaries but they remember the order that items were inserted, like lists.
When iterating over an ordered dictionary, the items are returned in the order their keys were first added.
In [ ]:
fruit_str = "banana ripe:banana unripe:banana ripe:banana rotten:apple ripe:apple ripe:apple ripe:apple unripe:orange unripe:orange unripe:orange unripe:pear rotten:pear rotten:pear ripe"
Parsing with a normal dictionary
In [ ]:
d = {}
for element in fruit_str.split(":"):
    fruit, status = element.split(" ")
    if fruit not in d:
        d[fruit] = Counter()
    d[fruit][status] += 1
d
Parsing with an OrderedDict
In [ ]:
d = OrderedDict()
for element in fruit_str.split(":"):
    fruit, status = element.split(" ")
    if fruit not in d:
        d[fruit] = Counter()
    d[fruit][status] += 1
d
Since an ordered dictionary remembers its insertion order, it can be used in conjunction with sorting to make a sorted dictionary (by key or value) from a standard dictionary
In [ ]:
print ("\nStandard unsorted dictionary")
d = {'banana':3, 'apple':4, 'pear':1, 'orange':2, "peach":10, "apricot":2}
pprint (d)
In [ ]:
print("\nDictionary sorted by key")
d_per_key = OrderedDict(sorted(d.items(), key=lambda t: t[0]))
pprint (d_per_key)
In [ ]:
print("\nDictionary sorted by value")
d_per_val = OrderedDict(sorted(d.items(), key=lambda t: t[1]))
pprint (d_per_val)
Namedtuples are tuples with customizable field names, which makes them more intuitive to use.
Similar to normal tuples, they are immutable and can be indexed by position.
Contrary to normal tuples, their fields can also be accessed by name, and they have a readable __repr__() method.
Simple example of initialisation and instantiation
In [ ]:
# Create the new type
Point = namedtuple('Point', ['x', 'y'])
In [ ]:
# Create a new Point Object
p = Point(x=10, y=23)
p
In [ ]:
# Access the fields by index or by name
print(p[0])
print(p.x)
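Namedtuples also come with a few convenient helper methods; a short sketch with the Point created above:
In [ ]:
# Namedtuples are immutable (p.x = 5 would raise an AttributeError)
# _replace returns a new namedtuple with some fields changed
print(p._replace(x=5))
# _asdict converts the namedtuple to a dictionary
print(p._asdict())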
Example of file parsing
In [ ]:
!head "../data/abundance.tsv"
!wc -l "../data/abundance.tsv"
We will represent each line with a namedtuple, themselves stored in a list representing the entire file.
We will store only the fields we are interested in: target_id, length and tpm.
In addition, only lines with a length greater than 50,000 will be saved.
In [ ]:
abundance_line = namedtuple("abundance_line", ["target_id", "length", "tpm"])
abundance_line
In [ ]:
with open ("../data/abundance.tsv", "r") as fp:
# Flush the first header line
header = next(fp)
# Init an empty list to store each parsed line
line_list = []
# Iterate over all the lines and save the information in a list of namedtuples
for line in fp:
line = line.split()
# Select lines with a length > 50000, create an abundance_line namedtuple and append to the list
if int(line[1]) > 50000:
a = abundance_line(target_id = line[0], length = int(line[1]), tpm = float(line[4])) # Cast in numeric types
line_list.append(a)
pprint(line_list)
Access to the stored elements is straightforward and intuitive
In [ ]:
# Access the target_id of the second line
line_list[1].target_id
A generator is an object that returns values computed on the fly. The series of values can be of indefinite size and can encapsulate significant computation.
Contrary to containers (dict, list, ...), the values of the series are not stored in memory, but generated only when required.
A generator can be used in place of a list to save memory when the series is very long.
A good example is the randint function from the random package.
In [ ]:
from random import randint
randint(10, 100)
A generator is a function (or a class method) using a yield statement in place of a return statement
Example of a generator function to parse gencode gff3 file and extract the feature ID field if the feature matches specific requirements
First we write the generator function
In [ ]:
def gencode_ID_generator(file, max_lines=10):  # max_lines controls the maximal number of lines generated
    # Open the file using a with statement
    with open(file, "r") as fp:
        # The loop provides the iteration for the generator.
        # Lines are counted by the enumerate statement and a break exits the loop once max_lines lines have been generated
        for i, line in enumerate(fp):
            if i >= max_lines:
                break
            # Extract the ID from the attribute field (attributes = 9th field, ID = first attribute, stripped of the field name)
            ID = line.split("\t")[8].split(";")[0].split("=")[1]
            # The yield statement will return a value each time the generator is called
            yield ID
Now we can create the generator from a gencode gff3 file
In [ ]:
gen = gencode_ID_generator("../data/gencode_sample.gff3", max_lines=10)
Elements can be generated by calling the generator with the next function
In [ ]:
next(gen)
In [ ]:
next(gen)
Or using a for loop until the generator is exhausted
In [ ]:
for ID in gen:
print (ID)
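Once exhausted, a generator cannot be rewound: calling next on it raises a StopIteration exception. A small sketch of how to guard against that:
In [ ]:
# The generator is now exhausted, so a bare next(gen) would raise StopIteration.
# Passing a default value to next avoids the exception
print(next(gen, "generator exhausted"))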
1D labeled array capable of holding any data type (integers, strings, floats...)
Similar to a Python standard dictionary, but faster (because it is based on C datatypes) and more user-friendly in Jupyter
Adding/removing new elements is inefficient, as a Series is stored as a block in memory and thus requires resizing and reallocation each time an element is added or removed
From 2 lists or tuples
In [ ]:
Base = ('A','T','C','G','N')
Freq = (0.21, 0.24, 0.27, 0.25, 0.03)
pd.Series(data=Freq, index=Base)
From a python dictionary
In [ ]:
d = {'A':0.21, 'T':0.24, 'C':0.27, 'G':0.25, 'N':0.03}
pd.Series(d)
The data type and the series name can be specified
In [ ]:
d = {'A':21.0, 'T':24.0, 'C':27.0, 'G':25.0, 'N':3.0}
pd.Series(d, name="Percent", dtype=int)
From a file containing 2 columns with the squeeze option
In [ ]:
pd.read_table("../data/DNA_distrib.tsv", index_col=0, squeeze=True, sep="\t")
Support list methods
In [ ]:
s = pd.Series({'A':0.21, 'T':0.24, 'C':0.27, 'G':0.25})
In [ ]:
# Concatenate 2 series (note: Series.append was removed in pandas 2.0, so pd.concat is used here)
s2 = pd.Series({'Y':0.01, 'N':0.03})
s3 = pd.concat([s, s2])
print(s3)
In [ ]:
# Slicing
print(s[2:4])
In [ ]:
# Extraction by position (iloc avoids ambiguity with label-based access)
print(s.iloc[2])
In [ ]:
# the "for" loop works as for a list
for i in s:
print (i)
Support dictionary methods
In [ ]:
s = pd.Series({'A':21.0, 'T':24.0, 'C':27.0, 'G':25.0, 'N':3.0}, name="Percent", dtype=int)
# Update value
s["A"] = 22
print(s)
# Named indexing
print(s["A"])
# Test for existence
print ("A" in s)
print ("V" in s)
Support a wide range of mathematical operations (thanks to numpy)
In [ ]:
s = pd.Series({'A':21, 'T':24, 'C':27, 'G':25, 'N':3}, name="Percent")
print(s.max())
print(s.mean())
print((s > 20).all())
print(s.sem())
In [ ]:
# Addition of 2 series is aligned on the index and returns a result for all labels present in either series
s2 = pd.Series({'A':0.2, 'T':0.7, 'C':0.4, 'G':1.5, 'N':-3}, name="Percent")
print (s + s2)
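Labels present in only one of the 2 series yield NaN; if that is not desirable, the add method accepts a fill_value. A sketch with 2 small series built for the purpose:
In [ ]:
# With partially overlapping labels, missing values become NaN by default;
# the add method can substitute a fill_value instead
sa = pd.Series({'A': 1, 'T': 2})
sb = pd.Series({'T': 10, 'G': 20})
print(sa + sb)                   # 'A' and 'G' are NaN
print(sa.add(sb, fill_value=0))  # 'A' -> 1.0, 'G' -> 20.0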
Like Series, adding/removing new elements is inefficient because of memory reallocation.
The default DataFrame rendering in Jupyter can be tweaked with pd.options.display
In [ ]:
pd.options.display.max_colwidth = 200
pd.options.display.max_rows = 50
pd.options.display.max_columns = 50
You can optionally pass index (row labels) and columns (column labels) arguments
From a pandas Series
In [ ]:
s = pd.Series({'A':21.0, 'T':24.0, 'C':27.0, 'G':25.0, 'N':3.0})
pd.DataFrame(s, columns=["Percent"])
From a list of pandas Series
In [ ]:
series_list = [
pd.Series({'A':21, 'T':24, 'C':27, 'G':25, 'N':3}, name="Percent"),
pd.Series({'A':331.2, 'T':322.2, 'C':307.2, 'G':347.2, 'N':None}, name="MolecularWeight"),
pd.Series({'A':259, 'T':267, 'C':271, 'G':253, 'N':None}, name="AbsorbanceMax")]
pd.DataFrame(series_list)
From a simple list of lists
In [ ]:
list_list = [[21, 24, 27, 25, 3], [331.2, 322.2, 307.2, 347.2, None], [259, 267, 271, 253, None]]
column_list = ['A', 'T', 'C', 'G', 'N']
index_list = ["Percent", "MolecularWeight", "AbsorbanceMax"]
pd.DataFrame(list_list, index=index_list, columns=column_list)
From a simple list of namedtuples ++
In [ ]:
ni = namedtuple("nucleotide_info", ["Variable", 'A', 'T', 'C', 'G', 'N'])
list_namedtuple = [
    ni(Variable="Percent", A=21.0, T=24.0, C=27.0, G=25.0, N=3.0),
    ni(Variable="MolecularWeight", A=331.2, T=322.2, C=307.2, G=347.2, N=None),
    ni(Variable="AbsorbanceMax", A=259, T=267, C=271, G=253, N=None)]
df = pd.DataFrame(list_namedtuple)
df.set_index("Variable", inplace=True)
df
The DataFrame creation is very versatile and can also be done from dictionaries of lists, dicts, or Series, and from numpy.ndarray...
The orientation may differ depending on the source data (columns or index orientation)
In [ ]:
dict_list = { "Percent":[21.0, 24.0, 27.0, 25.0, 3.0], "MolecularWeight":[331.2, 322.2, 307.2, 347.2, None], "AbsorbanceMax":[259, 267, 271, 253, None]}
pd.DataFrame(dict_list)
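The orientation can be controlled explicitly with the from_dict constructor; a sketch reusing the dictionary above:
In [ ]:
# orient="index" uses the dictionary keys as row labels instead of column labels
pd.DataFrame.from_dict(dict_list, orient="index", columns=['A', 'T', 'C', 'G', 'N'])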
One of the major strengths of Pandas is its ability to perform "moderately" complex file parsing into a comprehensive DataFrame format
Example with a gff3 file
In [ ]:
file = "../data/gencode_sample.gff3"
In [ ]:
!wc -l $file
In [ ]:
!head -2 $file
The file is a standard genomic/proteomic format
In this case the gff3 format consists of 9 tab-delimited fields
The standard fields are named "seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"
The hard way, with a list of dictionaries... But you control everything and you can parse very complex structured files
The output is not very user-friendly, but once again you can do hardcore stuff. In this case it is possible to parse the file completely, including the attribute field
In [ ]:
file = "../data/gencode_sample.gff3"
# Empty list to collect each line
l = []
with open(file, "r") as fp:
    for line in fp:
        # Split the line by tabulations
        sl = line.strip().split("\t")
        # Using an ordered dictionary because we want to retain the order in the file
        line_dict = OrderedDict({
            "seqid": sl[0],
            "source": sl[1],
            "type": sl[2],
            "start": int(sl[3]),
            "end": int(sl[4]),
            "score": sl[5],
            "strand": sl[6],
            "phase": sl[7]})
        # Then you can entirely parse the attributes
        for attribute in sl[8].split(";"):
            k, v = attribute.split("=")
            line_dict[k] = v
        # Add the line to the general list
        l.append(line_dict)
# Print only the first elements of the list (because 100 lines would be a bit long)
l[0:2]
The easy way, with a pandas DataFrame... Works most of the time for standard formats
Much more user-friendly, but in this case the attribute field cannot easily be parsed completely
In [ ]:
df = pd.read_table("../data/gencode_sample.gff3", sep="\t", names =["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"])
In [ ]:
# print only the 10 first rows of the table (well, you know why now...)
df.head(10)
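That said, a simple attribute such as the ID can still be pulled out of the attributes column with the pandas string methods; a sketch, assuming every line carries an ID=... attribute (rows without it would yield NaN):
In [ ]:
# Extract the ID attribute with a regular expression; expand=False returns a Series
df["ID"] = df["attributes"].str.extract(r"ID=([^;]+)", expand=False)
df[["type", "ID"]].head()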
In [ ]:
df = pd.read_table("../data/gencode_sample.gff3", sep="\t", names =["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"])
See the top & bottom rows of the frame
In [ ]:
df.head(2)
In [ ]:
df.tail(2)
Sampling random rows or columns: Awesomeness level +++
In [ ]:
df.sample(n=10, axis=0)
In [ ]:
df = pd.read_table("../data/gencode_sample.gff3", sep="\t", names =["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"])
Rename columns or index
In [ ]:
# inplace=True will affect the current df
df.rename(columns={"score":"Score", "strand":"Strand"}, inplace=True)
df.head()
In [ ]:
# inplace=False (default) will return a modified df, i.e. the original DataFrame remains unchanged
df = df.rename(index={0:-1}, inplace=False)
df.head()
Discard a subset of columns
In [ ]:
df = df.drop(labels=["Score", "Strand"], axis=1)
df.head()
Discard a subset of rows
In [ ]:
df = df.drop(labels=[-1, 2, 4], axis=0)
df.head()
Sort by values or index
In [ ]:
# With 1 value key
df.sort_values(by="type", inplace=True)
df.head()
In [ ]:
# With several value keys
df.sort_values(by=["seqid", "end"], inplace=True)
df.head()
In [ ]:
# With the index
df.sort_index(inplace=True, ascending=False)
df.head()
Reset a numeric index after deleting or reordering rows
In [ ]:
df = df.reset_index(drop=True)
df.head()
Transposing your data
In [ ]:
# Example of method chaining; the result is not assigned, so it is only displayed
df.transpose().head()
In [ ]:
display(Image("../pictures/merging_concat_basic.png"))
In [ ]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7'], 'C': ['C4', 'C5', 'C6', 'C7'], 'D': ['D4', 'D5', 'D6', 'D7']}, index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'], 'B': ['B8', 'B9', 'B10', 'B11'], 'C': ['C8', 'C9', 'C10', 'C11'], 'D': ['D8', 'D9', 'D10', 'D11']}, index=[8, 9, 10, 11])
df4 = pd.concat([df1, df2, df3], axis=0)
df4
In [ ]:
display(Image("../pictures/merging_concat_axis1.png"))
In [ ]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}, index=[0, 1, 2, 3])
df2= pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'], 'D': ['D2', 'D3', 'D6', 'D7'], 'F': ['F2', 'F3', 'F6', 'F7']}, index=[2, 3, 6, 7])
df3 = pd.concat([df1, df2], axis=1)
df3
In [ ]:
df1 = pd.read_table("../data/gencode_sample_transcript.gff3", sep="\t", names =["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"])
df1
In [ ]:
df2 = pd.read_table("../data/gencode_sample_gene.gff3", sep="\t", names =["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"])
df2
In [ ]:
df = pd.concat([df1, df2])
# Sorting by genomic coordinates
df = df.sort_values(["seqid", "start", "end"])
# Reseting the index after concat and sorting
df = df.reset_index(drop=True)
df
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
DataFrame.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
In [ ]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']}, index=['K0', 'K2', 'K3'])
In [ ]:
display(Image("../pictures/merging_merge_index_outer.png"))
pd.merge(left, right, left_index=True, right_index=True, how='outer')
In [ ]:
display(Image("../pictures/merging_merge_index_inner.png"))
pd.merge(left, right, left_index=True, right_index=True, how='inner')
Example where it could be useful in bioinformatics
In [ ]:
g2t = pd.DataFrame(
{'gene': ['DGB1', 'AHY', 'AUHD2'],
'transcript': ['ENST7454647', 'ENST9856565', 'ENST875650']})
t2e = pd.DataFrame(
{'transcript': ['ENST7454647', 'ENST9856565', 'ENST8565667'],
'value': [8374, 44, 876]})
In [ ]:
pd.merge(left=g2t, right=t2e, left_on="transcript", right_on="transcript", how='outer')
Similar to Series, many basic statistics are available.
The results will only contain the relevant columns (for example, mean() will be applied only to numeric columns).
Function | Description |
---|---|
count | Number of non-null observations |
sum | Sum of values |
mean | Mean of values |
mad | Mean absolute deviation |
median | Arithmetic median of values |
min | Minimum |
max | Maximum |
mode | Mode |
abs | Absolute Value |
prod | Product of values |
std | Bessel-corrected sample standard deviation |
var | Unbiased variance |
sem | Standard error of the mean |
skew | Sample skewness (3rd moment) |
kurt | Sample kurtosis (4th moment) |
unique | List the unique elements |
quantile | Sample quantile (value at %) |
cumsum | Cumulative sum |
cumprod | Cumulative product |
cummax | Cumulative maximum |
cummin | Cumulative minimum |
In [ ]:
df = pd.read_table("../data/gencode_sample.gff3", sep="\t", names =["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"])
In [ ]:
# numeric_only=True restricts the computation to the numeric columns (required in recent pandas versions)
df.mean(numeric_only=True)
In [ ]:
df.count()
One can also use the describe() method to get a simple table report of the DataFrame
In [ ]:
df.describe(include="all")
Operation | Syntax | Result |
---|---|---|
Select column | df[col] or df.col | Series |
Select row by label | df.loc[label] | Series |
Select row by integer location | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Loading a new example file in SAM format
In [ ]:
file = "../data/sample_alignment.sam"
In [ ]:
!wc {file}
In [ ]:
!head -n 5 "../data/sample_alignment.sam"
In [ ]:
file = "../data/sample_alignment.sam"
columns_names = ['QNAME', 'FLAG', 'RNAME', 'POS', 'MAPQ', 'CIGAR', 'RNEXT', 'PNEXT', 'TLEN', 'SEQ', 'QUAL']
df = pd.read_table(file, sep="\t", names = columns_names, skiprows=[0,1], index_col=0)
df.head()
Examples of column slicing
In [ ]:
df["FLAG"]
In [ ]:
df.FLAG
In [ ]:
df[["FLAG", "POS"]]
Examples of index slicing
In [ ]:
df.loc["HWI-1KL149:87:HA58EADXX:1:1101:1531:2163"]
In [ ]:
df.loc["HWI-1KL149:87:HA58EADXX:1:1101:1531:2163":"HWI-1KL149:87:HA58EADXX:1:1101:1744:2169"]
In [ ]:
df.iloc[0]
In [ ]:
df.iloc[10:12]
Examples of combination of column and index slicing to select a specific item or a range of items
In [ ]:
df.loc["HWI-1KL149:87:HA58EADXX:1:1101:1531:2163":"HWI-1KL149:87:HA58EADXX:1:1101:1744:2169", "FLAG"]
In [ ]:
df.loc["HWI-1KL149:87:HA58EADXX:1:1101:1531:2163"]["FLAG"]
In [ ]:
df.iloc[0,0]
In [ ]:
df.iloc[0][0]
With a single condition
In [ ]:
df[df.FLAG == 163]
In [ ]:
df[df.FLAG.isin([99,147])]
In [ ]:
df.query('FLAG == 141')
In [ ]:
df[df.SEQ.str.startswith('ACAG')]
With multiple conditions
In [ ]:
df[(df.FLAG == 163) & (df.TLEN > 180)]
In [ ]:
df.query('FLAG == 99 and TLEN > 200')
Iterate row by row with the iterrows or itertuples methods
itertuples is faster, but the index is merged with the other values of each line
In [ ]:
for index_name, row_values in df.iterrows():
    if row_values.CIGAR != "101M":
        print(index_name)
        print(row_values)
In [ ]:
for values in df.itertuples():
    if values.CIGAR != "101M":
        print(values)
Iterate by group of values with the groupby method
In [ ]:
for group, group_df in df.groupby("FLAG"):
    print("\nGroup with flag {}".format(group))
    display(group_df)
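groupby can also be combined directly with aggregation functions instead of explicit iteration; a short sketch:
In [ ]:
# Number of rows and mean mapping quality per FLAG value, without an explicit loop
df.groupby("FLAG")["MAPQ"].agg(["count", "mean"])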
Pandas documentation is extensive and updated frequently