Dedurre le caratteristiche UTR mancanti nel file GFF3

l0110

2017-12-29 23:49:41 UTC

view on stackexchange narkive permalink

Sto lavorando a un file GFF in cui mancano le informazioni 5'UTR e 3'UTR. Ad esempio:

  ctg123. gene 1050 9000. +. ID = gene00001; Nome = EDEN ctg123. mRNA 1050 9000. +. ID = mRNA00001; Genitore = gene00001; Nome = EDEN.1 ctg123. esone 1050 1500. +. ID = exon00002; Parent = mRNA00001 ctg123. esone 3000 3902. +. ID = exon00003; Parent = mRNA00001 ctg123. esone 5000 5500. +. ID = exon00004; Parent = mRNA00001 ctg123. esone 7000 9000. +. ID = exon00005; Parent = mRNA00001 ctg123. CDS 1201 1500. + 0 ID = cds00001; Parent = mRNA00001; Name = edenprotein.1 ctg123. CDS 3000 3902. + 0 ID = cds00001; Parent = mRNA00001; Name = edenprotein.1 ctg123. CDS 5000 5300. + 0 ID = cds00001; Parent = mRNA00001; Name = edenprotein.1

C'è un modo per aggiungere righe nel GFF con gli intervalli 5'UTR e 3'UTR?

Dovremmo anche occuparci del filone negativo? Voglio dire, puoi avere linee gff che si riferiscono al filamento negativo e, in tal caso, possiamo presumere che saranno corrette? Che le posizioni saranno date rispetto alla linea "+" e la posizione finale sarà "prima" di quella iniziale?

sì, dobbiamo occuparci anche del filone negativo. Ecco solo un esempio semplificato.

Quindi [modifica] la tua domanda e includi anche un esempio del filone negativo. Dovremmo anche sapere se ogni gff avrà una singola voce o se puoi avere più geni nella stessa gff. Più informazioni specifiche ci fornisci, più è probabile che saremo in grado di darti una risposta utile. Tuttavia, tieni presente che gli UTR possono essere uniti, quindi questo non è un buon modo per ottenere le informazioni di cui hai bisogno. Sarebbe meglio ottenere un'annotazione completa da dove l'hai scaricata.

#! / usr / bin / env pythonimport argparsedef parseAttributes (kvps): d = dict () per kvp in kvps: k, v = kvp.split ("=") d [k] = v return ddef parseGFF3 (fname, topLevel = "gene", secondLevel = "mRNA", CDS = "CDS", exon = "exon"): f = aperto (fname) genes = dict () transcripts = dict () g2t = dict () # Un dizionario con ID gene associati ai loro ID di trascrizione per la riga in f: if line.startswith ("#"): continue cols = line.strip () .split ("\ t") attribs = parseAttributes (cols [8] .split (";")) ID = attribs ["ID"] if cols [2] == topLevel: genes [ID] = riga g2t [ID] = [] elif cols [2] == secondLevel: parent = attribs ["Parent"] g2t [parent ] .append (ID) # Genera un elenco di [riga, [voci esone], [voci CDS]] se l'ID non è presente nelle trascrizioni: trascrizioni [ID] = [riga, [], []] altro: trascrizioni [ID] [0] = line else: parent = attribs ["Parent"] if parent non in transcripts: transcripts [parent] = [None, [], []] if cols [2] == exon: transcripts [parent] [1 ] .append ((int (cols [3]), int (cols [4]), line)) elif cols [2] == CDS: transcripts [parent] [2] .append ((int (cols [3]), int (cols [4]), line)) f.close () return geni, transcripts, g2tdef findUTRs (transcripts, fivePrime = True): for k, t in transcripts.items (): # Se non ci sono voci CDS, salta se len (t [2]) == 0: continue # Ottieni il filamento ("." sarà trattato come "+") strand = t [0] .split ("\ t") [6] # Per 3 'UTR, scambia semplicemente il filo se non cinquePrime: strand = "+" if strand == "- "else" - "esoni = [(s, e) per s, e, _ in t [1]] CDS = [(s, e) per s, e, _ in t [2]] esoni.sort () CDSs.sort () UTRs = [] if strand! = "-": final = CDSs [0] [0] per s, e in esoni: if e < final: UTRs.append ((s, e)) elif s < final: UTRs.append ((s, final - 1)) else: break else: final = CDS [-1] [- 1] per s, e in esoni: if e < final: continue elif s > final: UTRs.append ((s, e)) else: UTRs.append ((final + 1, e) ) t.append (UTRs) def saveGFF (oname, geni, transcripts, g2t, fivePrime, threePrime): o = open (oname, "w") o.write ("## gff-version 3 \ n") for geneID , geneLine in genes.items (): o.write (geneLine) # Scorri ogni trascrizione per transID in g2t [geneID]: t = transcripts [transID] se t [0] è Nessuno: continua o.write (t [0 ]) cols = t [0] .strip (). split ("\ t") # Scrivi gli esoni, poi CDS, poi 5'UTR, poi 3'UTR per l'esone in t [1]: o.write (esone [2]) per CDS in t [2]: o.write (CDS [2]) per idx, UTR in enumerate (t [3]): o.write ("{} \ t {} \ t {} \ t {} \ t {} \ t. \ t {} \ t. \ t" .format (cols [0], cols [1], fivePrime , UTR [0], UTR [1], cols [6])) o.write ("ID = {}. FivePrimeUTR {}; Parent = {} \ n" .format (transID, idx, transID)) per idx , UTR in enumerate (t [4]): o.write ("{} \ t {} \ t {} \ t {} \ t {} \ t. \ T {} \ t. \ T" .format ( cols [0], cols [1], threePrime, UTR [0], UTR [1], cols [6])) o.write ("ID = {}. threePrimeUTR {}; Parent = {} \ n". format (transID, idx, transID)) o.close () parser = argparse.ArgumentParser (description = "Analizza un file GFF3 privo di voci UTR e aggiungile. Nota che questo programma assume un file GFF3 ragionevolmente formattato") parser.add_argument ("--fiveUTRname", default = "five_prime_UTR", help = "L'etichetta per 5 voci UTR (default:% (default) s)") parser.add_argument ("- threeUTRname", default = "three_prime_UTR", help = "L'etichetta per 3 voci 'UTR (default:% (default) s)") parser.add_argument ("- topLevelID", default = "gene", help = "Il' tipo 'che designa la voce di primo livello (predefinito:% (predefinito) s) ") parser .add_argument ("- secondLevelID", default = "mRNA", help = "Il 'tipo' che designa la voce di secondo livello, tipicamente qualcosa come 'mRNA' o 'transcript' (default:% (default) s)") parser.add_argument ("- exonID", default = "exon", help = "Il 'tipo' che designa gli esoni (default:% (default) s)") parser.add_argument ("- CDSID", default = "CDS ", help =" Il 'tipo' che designa CDS (default:% (default) s) ") parser.add_argument (" input ", metavar =" input.gff3 ", help =" Input file name ") parser.add_argument ( "output", metavar = "output.gff3", help = "Nome file di output") args = parser.parse_args () geni, transcripts, g2t = parseGFF3 (args.input, topLevel = args.topLevelID, secondLevel = args.secondLevelID , exon = args.exonID, CDS = args.CDSID) # Aggiungi gli UTRsfindUTRs (transcripts) findUTRs (transcripts, fivePrime = False) # Scrivi il outputaveGFF (args.output, geni, transcripts, g2t, args.fiveUTRname, args.threeUTRname )

./add_utrs_to_gff.py example.gff3ctg123. gene 1050 9000. +. ID = gene00001; Nome = EDENctg123. mRNA 1050 9000. +. ID = mRNA00001; Parent = gene00001; Name = EDEN.1ctg123. esone 1050 1500. +. ID = exon00002; Parent = mRNA00001ctg123. esone 3000 3902. +. ID = exon00003; Parent = mRNA00001ctg123. esone 5000 5500. +. ID = exon00004; Parent = mRNA00001ctg123. esone 7000 9000. +. ID = exon00005; Parent = mRNA00001ctg123. CDS 1201 1500. + 0 ID = cds00001; Parent = mRNA00001; Name = edenprotein.1ctg123. CDS 3000 3902. + 0 ID = cds00001; Parent = mRNA00001; Name = edenprotein.1ctg123. CDS 5000 5300. + 0 ID = cds00001; Parent = mRNA00001; Name = edenprotein.1ctg123. five_prime_UTR 1050 1200. +. ID = utr00002; Parent = mRNA00001ctg123. tre_prime_UTR 5301 5500. +. ID = utr00004; Parent = mRNA00001ctg123. three_prime_UTR 7000 9000. +. ID = utr00005; Parent = mRNA00001