General GFF structure

| Position index | Position name | Description |
|---|---|---|
| 1 | sequence | The name of the sequence where the feature is located. |
| 2 | source | Keyword identifying the source of the feature, like a program (e.g. Augustus or RepeatMasker) or an organization (like TAIR). |
| 3 | feature | The feature type name, like "gene" or "exon". In a well-structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project. |
| 4 | start | Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED files. |
| 5 | end | Genomic end of the feature, with a 1-base offset. This is the same end coordinate as in 0-offset half-open sequence formats, like BED files. |
| 6 | score | Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value. |
| 7 | strand | Single character that indicates the sense strand of the feature; it can take the values "+" (positive, or 5'->3'), "-" (negative, or 3'->5'), or "." (undetermined). |
| 8 | frame (GTF, GFF2) or phase (GFF3) | Frame or phase of CDS features; it can be one of 0, 1, 2 (for CDS features) or "." (for everything else). Frame and phase are not the same; see the following subsection. |
| 9 | attributes | All the other information pertaining to this feature. The format, structure and content of this field is the one which varies the most between the three competing file formats. |
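
The coordinate conventions for the start and end columns can be illustrated with a small conversion sketch. The helper names `gff_to_bed`, `bed_to_gff` and `gff_length` are hypothetical and not part of any library; they only restate the 1-based closed versus 0-based half-open rule described above.

```python
def gff_to_bed(start, end):
    """Convert 1-based, closed GFF coordinates to 0-based, half-open BED coordinates."""
    # Only the start shifts by one; the end value is numerically identical.
    return start - 1, end


def bed_to_gff(start, end):
    """Convert 0-based, half-open BED coordinates to 1-based, closed GFF coordinates."""
    return start + 1, end


def gff_length(start, end):
    """Length in bases of a feature given in 1-based, closed coordinates."""
    return end - start + 1


# A feature covering the first 100 bases of a sequence
bed_start, bed_end = gff_to_bed(1, 100)
print(bed_start, bed_end)   # 0 100
print(gff_length(1, 100))   # 100
```

The `end - start + 1` length is the same expression used for contig lengths in the conversion loop further down.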



In [7]:

    
import pandas as pd
# Note: fields are split on a single space here; standard GFF is tab-delimited (sep='\t')
df = pd.read_table('/home/cmb-panasas2/skchoudh/genomes/S_cerevisiae_BY4741/annotation/BY4741_JRIS00000000.gff',
                   sep=' ',
                   names=['sequence', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes'])



In [8]:

    
## Rows 11348 (1-based) onwards are sequences
df = df  # .iloc[:11348-1]  (slice is disabled, so all rows are kept)
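
Rather than hard-coding the row at which the sequences begin, the boundary could be detected. This is a minimal sketch, assuming the sequence section is introduced by a `##FASTA` directive (as in GFF3) or by a FASTA header line starting with `>`; whether this particular file follows that layout has not been checked. `find_sequence_start` is a hypothetical helper and the example lines are made up.

```python
def find_sequence_start(lines):
    """Return the 0-based index of the first line of the sequence section, or None."""
    for index, line in enumerate(lines):
        stripped = line.strip()
        # Assumption: sequences start at a '##FASTA' directive or a '>' header
        if stripped == '##FASTA' or stripped.startswith('>'):
            return index
    return None


# Made-up lines, for illustration only
example = [
    'chrI src gene 1 10 . + . UNDEF',
    'chrI src gene 20 30 . - . UNDEF',
    '##FASTA',
    '>chrI',
    'ACGTACGT',
]
cut = find_sequence_start(example)
annotation_lines = example if cut is None else example[:cut]
print(cut)                    # 2
print(len(annotation_lines))  # 2
```

The returned index plays the role of the hard-coded `11348-1` in the cell above.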



In [29]:

    
lines_to_write = """##description: modified gtf from BY4741_JRIS00000000.gff
##provider: saketkc
##format: gtf
##date: 2017-10-27
"""
contigs_info = ''
undef_count = 1
with open('/home/cmb-panasas2/skchoudh/genomes/S_cerevisiae_BY4741/annotation/BY4741_JRIS00000000.gff') as fh:
    for line in fh:
        line_splitted = line.strip().split(' ')
        chrom, source, feature, start, end, score, strand, frame, attributes = line_splitted
        start = int(start)
        end = int(end)
        if feature == 'gene':
            feature = 'exon'
        if feature == 'contig':
            contigs_info += '{}\t{}\n'.format(chrom, end-start+1)
            continue
        ## Process attributes
        attribute = attributes.split(';')[0]
        ## The attributes are separated by ';', indicating multiple BLAST hits.
        ## We keep only the hit with the maximum percent identity,
        ## which is always the first one.
        if attribute == 'UNDEF':
            mod_attribute = 'gene_id "UNDEF-{}"; transcript_id "UNDEF-{}-T"; gene_name "UNDEF-{}"'\
            .format(undef_count, undef_count, undef_count)
            undef_count += 1
        else:
            gene_id, genome, chromRef, startRef, startEnd, gene_name, evalue, percent_identity = attribute.split(',')
            mod_attribute = 'gene_id "{}"; transcript_id "{}-T"; gene_name "{}"'\
            .format(gene_id, gene_id, gene_name)
        lines_to_write += '{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{};\n'\
        .format(chrom, source, feature, start, end, score, strand, frame, mod_attribute)
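
The attribute handling in the loop above can also be read in isolation. The following is a sketch only: `to_gtf_attributes` is a hypothetical helper, the eight-value comma-separated layout is taken from the unpacking in the loop, and the hit values in the example are made up.

```python
def to_gtf_attributes(attributes, undef_count):
    """Build a GTF attribute string from the raw ninth column.

    Returns the attribute string and the (possibly incremented) UNDEF counter.
    """
    # Keep only the first ';'-separated hit, as in the loop above
    first_hit = attributes.split(';')[0]
    if first_hit == 'UNDEF':
        name = 'UNDEF-{}'.format(undef_count)
        text = 'gene_id "{0}"; transcript_id "{0}-T"; gene_name "{0}"'.format(name)
        return text, undef_count + 1
    # Assumed layout: eight comma-separated values per hit
    (gene_id, genome, chrom_ref, start_ref, end_ref,
     gene_name, evalue, percent_identity) = first_hit.split(',')
    text = 'gene_id "{0}"; transcript_id "{0}-T"; gene_name "{1}"'.format(gene_id, gene_name)
    return text, undef_count


# Made-up hit values, for illustration only
raw = 'GENE1,REF,chrI,100,200,NAME1,0.0,99.5;GENE2,REF,chrI,300,400,NAME2,1e-50,80.0'
attrs, count = to_gtf_attributes(raw, 1)
print(attrs)  # gene_id "GENE1"; transcript_id "GENE1-T"; gene_name "NAME1"
undef_attrs, count = to_gtf_attributes('UNDEF', count)
print(undef_attrs)  # gene_id "UNDEF-1"; transcript_id "UNDEF-1-T"; gene_name "UNDEF-1"
```

Returning the counter alongside the string keeps the function free of global state, at the cost of the caller having to thread `undef_count` through each call.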



In [31]:

    
with open('/home/cmb-panasas2/skchoudh/genomes/S_cerevisiae_BY4741/annotation/BY4741_JRIS00000000.modified.gtf', 'w') as fh:
    fh.write(lines_to_write)