Untouched nature, glorified nature

This summer I went to a place completely new to me: the Alps. I had been planning to go somewhere sunny, but southern Germany turned out to be my fortunate reality. I live in Berlin now and, because of corona, staying in the country during the summer break was a sensible precaution. A great choice. It is the most beautiful place I have ever seen. Definitely breathtaking.

The whole journey made me reflect on such incredible and “inaccessible” nature. Leaving aside the natural inaccessibility that comes with high altitude and low temperatures, why are there such sacred places around the world where nature is so well preserved and so inaccessible at the same time? Maybe it is cause and effect. Or maybe not everyone shares this thought. It is probably a feeling that exists in big cities, where we confine nature to green areas, natural parks and legally delimited forests.

I am from Rio de Janeiro, Brazil. If you are from outside Brazil, you have probably heard that Brazil is very green, and you have a picture of beaches and the amazing Amazon forest in your mind. And yes, that is true. But what about the rest? Brazil is the size of a continent: most of Europe would fit inside it. Well, let me say that the rest often does not match this picture.

Of course, the discussion is much more complicated and involves a lot of important historical facts. There is a reason and an explanation for each scenario. But what seems clear nowadays is that it is certainly possible to balance the natural, or wild, and the civilized. Countries such as Germany have a lot to teach us about how to proceed. They are still learning, but they have already built an amazing relationship with nature.

I can only talk about my own experience. Based on that, I can say how surprised I am every single day when cycling to work, in the centre of Berlin, along green paths. Berlin and the state of Brandenburg have more than 3000 parks and green areas, and many lakes and forests. And they are not outside the city. They are everywhere. In Rio, or in Petrópolis, where I was born, I can easily count how many parks there are. Normally you need to drive to get somewhere really green.

None of this means that there is no nature around us, even in places like the ones I described in Brazil. Certainly not. Life is everywhere in urban areas. It depends on how we look at it. But of course, we can make it look even more like home. Our real home.

I always remember some of my college professors saying how important it is to increase and grow the green spaces within cities, since legally protected natural spaces are shrinking. While we work to avoid deforestation, we can also turn our cities into real urban forests, including technology and everything we have acquired to bring us comfort. By being creative, it is possible to come up with new ideas and to progressively leave old-fashioned, unsustainable styles of production and consumption behind.

Lastly, I would like to cite a nice post about nature integration in European cities. Take a look at this interesting text to see how we still have a long way to go. The publication presents an interesting map showing a green gradient across the capitals.

Maps (left) and ranking (right) of European capitals according to their ecological integrity performance. The colour gradient represents the amount of area under different ecological integrity values, with the lowest integrity in dark orange and the highest in dark green. Image from rewildingeurope.com.

Thanks for reading. Feel free to leave your feedback below 😉

Genome Assembly Tutorial

“[…] knowledge of sequences could contribute much to our understanding of living matter.” Frederick Sanger, 1980


When we talk about genome assembly, we necessarily need to mention sequencing “generations”. If you are a biologist, you probably already know that the sequencing generations are not independent of each other. To get a great assembly, biologists generally use more than one technology at the same time to generate DNA reads. So, before defining assembly and going through some tutorials, let us briefly talk about sequencing generations. Keep in mind, though, that all these generations overlap, mainly the second and third ones, which together are also called Next Generation Sequencing.

First Sequencing Generation

In only two decades, modern biology has revolutionized the whole of science, following the Human Genome Project, whose draft sequence was published in 2001 and whose finished sequence followed (Consortium 2004). Almost all research areas are being influenced by genetics, such as energy, agroindustry, medicine and engineering.

The first completely sequenced genome was that of bacteriophage MS2, an RNA virus, completed in 1976 by Walter Fiers and colleagues; the first DNA genome, that of bacteriophage φX174, followed a year later (Sanger et al., 1977). The DNA technology available at that moment was based on the “plus and minus” method (Sanger & Coulson, 1975), a variant of the Sanger methodology in which deoxyribonucleotides (dNTPs) are used in different reactions to generate sequences of assorted lengths, later fractionated by gel electrophoresis. That chemistry was able to read out a whole DNA fragment and was responsible for the beginning of the Bioinformatics Era.

Frederick Sanger kept improving his technology, creating in 1977 the famous chain-termination, or dideoxy, technique (Sanger and Nicklen 1977), which is still used in many places today. It uses dideoxynucleotides (ddNTPs), dNTP analogues that lack the 3’ hydroxyl group required for DNA extension during synthesis. Radiolabelled ddNTPs are mixed into four parallel synthesis reactions, and the original sequence is read from an autoradiograph.

Several other changes have been made to improve this technique, such as replacing phosphorus or tritium radiolabelling with fluorometric detection and detecting fragments through capillary electrophoresis (Heather and Chain 2016). However, the machines could not produce reads longer than about one kilobase, which is why the shotgun sequencing technique was developed later: overlapping DNA fragments are cloned and sequenced separately, and then assembled into long contiguous sequences (contigs).

In addition, technologies such as the polymerase chain reaction (PCR) (Saiki et al., 1988) and recombinant DNA (Jackson, Symons and Berg, 1972) made it possible to obtain far larger quantities of pure DNA to sequence.

Second Sequencing Generation

New technologies (Shendure & Ji, 2008) appeared that used a luminescent method (Nyrén & Lundin, 1985) for measuring pyrophosphate synthesis (Ronaghi et al., 1998): pyrosequencing was licensed to 454 Life Sciences, the first big company to launch next-generation sequencing (NGS) technology. A great transition came when they started the mass parallelization of sequencing reactions, increasing the amount of DNA sequenced and producing millions of reads 400-500 base pairs (bp) long (Heather & Chain, 2016).

Techniques comparable to 454 emerged in the following years, among them Solexa, later acquired by Illumina, which used the ‘bridge amplification’ method (Fedurco et al., 2006). Although the first Genome Analyzer (GA) machine (by Solexa) could only generate very short reads, about 35bp long, it could produce paired-end (PE) data (forward and reverse DNA information), improving the accuracy of mapping reads to a reference genome (Heather & Chain, 2016). The second GA version was later replaced by HiSeq/MiSeq, with longer reads of ~150bp (Quail et al., 2012).

Another influential company was Applied Biosystems (which merged with Invitrogen to form Life Technologies, now part of Thermo Fisher Scientific), owner of SOLiD (Mckernan et al., 2009), a sequencing system based on ligation and detection. SOLiD was followed by Ion Torrent, a platform in which nucleotide incorporation is detected through the change in pH caused by the release of protons during DNA synthesis (Rothberg et al., 2011); it can generate reads ~200bp long (Quail et al., 2012). However, interpreting homopolymer sequences is not an easy task in Ion Torrent, because the signal becomes hard to resolve when several identical dNTPs are incorporated at once (Loman et al., 2012).

The cost of sequencing has been dramatically changed by these companies, which transformed the complexity of the microchips involved and increased the number of chemical methods available for sequencing (Heather & Chain, 2016). Illumina, though, is considered the most successful sequencing platform, giving the company a near monopoly (Greenleaf & Sidow, 2014; Heather & Chain, 2016).

Third Sequencing Generation

Currently, we are living in the era of third-generation DNA sequencing (Schadt, Turner and Kasarskis, 2010; Heather & Chain, 2016), a step towards longer reads, real-time sequencing and new technologies. These technologies can sequence single molecules without the DNA amplification required by all previous sequencers (Heather & Chain, 2016).

The first single molecule sequencing (SMS) machine was commercialized by Helicos BioSciences (Harris et al., 2008). It worked much as Illumina does, but with no bridge amplification, which avoids the biases and errors associated with amplified DNA.

Today, one of the most famous third-generation technologies is Single Molecule Real Time (SMRT) sequencing from Pacific Biosciences (PacBio). Despite the cost, PacBio has been used to generate much longer reads, up to 10 kilobases (Van Dijk et al., 2014), which are necessary to assemble big genomes such as the 32-gigabase-pair axolotl genome, the biggest genome ever assembled at the time of writing (Nowoshilow et al., 2018). Its high base-detection error rate, however, is an issue still to be settled.

Nanopore technologies have also appeared as a promise for the future of sequencing (Haque et al., 2013). The first nanopore sequencers, GridION and MinION, were developed by Oxford Nanopore Technologies (ONT) (Eisenstein, 2012; Clarke et al., 2009), and the latter was innovative in its size, similar to a USB drive (Loman & Quinlan, 2014). Nanopore sequencers are expected to become a solution for fast, low-cost and compact machines with long and accurate reads (Heather & Chain, 2016). For the moment, their long reads can be used in combination with current accurate technologies (Madoui et al., 2015; Karlsson et al., 2015).

Pre-processing the reads

After sequencing comes the quality control step. It can be done in several ways, including with tools offered by the sequencer company on the sequencing machine. What we need to do is visualize the quality of our reads. Here we are going to use FastQC, the most popular tool for quality visualization.

First of all, download FastQC here. Then, follow our instructions: we’ll use the same dataset in all tests here. Please download the reference genome and the raw reads from the European Nucleotide Archive (ENA). You can do it manually, or through FTP using the wget command as follows. Remember to extract the files.

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR181/000/SRR1816870/SRR1816870_subreads.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR181/008/SRR1818128/SRR1818128_subreads.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR572/SRR572209/SRR572209_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR572/SRR572209/SRR572209_2.fastq.gz

gzip -d SRR1816870_subreads.fastq.gz
gzip -d SRR1818128_subreads.fastq.gz
gzip -d SRR572209_1.fastq.gz
gzip -d SRR572209_2.fastq.gz

Type “fastqc read_file.fastq -t number_of_threads” in the terminal for each file. It will generate an HTML file, which you can open to detect any issues that need handling.

./fastqc SRR1816870_subreads.fastq -t 8
./fastqc SRR1818128_subreads.fastq -t 8
./fastqc SRR572209_1.fastq -t 8
./fastqc SRR572209_2.fastq -t 8

1. All of this will take some time. It’s okay, we are doing science here 🙂 . Anyway, if you have any trouble, you can try the same process on only one or two files, or even take another file from ENA or NCBI.

2. The following video will help you to interpret the FastQC HTML file.

FastQC tells us about the quality of our sequencing. Sometimes we need to consider re-sequencing our datasets. However, some issues can be solved with support tools that, basically, trim your data. For example, for the file “SRR572209_1.fastq” we see problems related to “Per tile sequence quality”, “Per base sequence content”, “Sequence Duplication Levels” and “Kmer Content”. All of those problems seem to be located at the beginning and at the end of our reads. So, maybe we should eliminate the first and the last bases of our reads. Let’s do it.
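Before trimming, it can help to see where the numbers behind a per-base quality plot come from. The sketch below is only an illustration, not how FastQC itself is implemented: it assumes plain four-line FASTQ records and Phred+33 quality encoding, and the function name and toy reads are made up for this example.

```python
def mean_quality_per_position(fastq_lines, offset=33):
    """Return the mean Phred score at each read position.

    Assumes four-line FASTQ records and Phred+33 encoding (both assumptions).
    """
    totals, counts = [], []
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, char in enumerate(line.strip()):
            if pos == len(totals):  # first time we see this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two made-up reads whose first base has a very low quality
toy_fastq = [
    "@read1", "ACGT", "+", "!5II",
    "@read2", "ACGT", "+", "#5II",
]
profile = mean_quality_per_position(toy_fastq)
```

A low average at the first or last positions of `profile` is the kind of pattern that suggests trimming the read ends.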

Download Trimmomatic; extract it; trim the start and end of the reads; and re-run FastQC. The directory where I’m storing my files is /mnt/data-assemblies/guia/. I renamed the paired-end files to R1.fastq and R2.fastq to make things simpler.

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.38.zip
unzip Trimmomatic-0.38.zip

java -jar trimmomatic-0.38.jar PE -threads 8 -phred33 /mnt/data-assemblies/guia/R1.fastq /mnt/data-assemblies/guia/R2.fastq /mnt/data-assemblies/guia/pe_1_paired.fastq /mnt/data-assemblies/guia/pe_1_unpaired.fastq /mnt/data-assemblies/guia/pe_2_paired.fastq /mnt/data-assemblies/guia/pe_2_unpaired.fastq LEADING:9 TRAILING:9 MINLEN:50

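So here, I’m basically trimming the start and end of each read, and also telling Trimmomatic that it can discard reads shorter than 50bp. Note that LEADING:9 and TRAILING:9 take a quality threshold: they remove bases from each end while their quality is below 9, rather than removing exactly 9 bases. Cutting a fixed number of bases is done with HEADCROP and CROP instead. For a complete list of actions possible in Trimmomatic, visit http://www.usadellab.org/cms/?page=trimmomatic.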
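To make the difference between the two kinds of trimming concrete, here is a minimal sketch. It is not Trimmomatic's code: the function names are hypothetical, Phred+33 encoding is assumed, and the logic only mimics the idea behind a fixed crop (as HEADCROP/CROP do), quality-based end trimming (as LEADING/TRAILING do) and a minimum-length filter (as MINLEN does).

```python
def crop_fixed(seq, qual, head=9, tail=9):
    """Drop a fixed number of bases from both ends, whatever their quality."""
    end = len(seq) - tail
    return seq[head:end], qual[head:end]


def trim_low_quality_ends(seq, qual, threshold=9, offset=33):
    """Drop bases from each end while their Phred score is below threshold."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < threshold:
        start += 1
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]


def keep_read(seq, min_len=50):
    """Length filter applied after trimming."""
    return len(seq) >= min_len


# Made-up read: two poor bases at the start, two at the end
seq, qual = "ACGTACGTAC", "!!IIIIII#!"
by_quality = trim_low_quality_ends(seq, qual)
by_position = crop_fixed(seq, qual, head=1, tail=1)
```

With quality-based trimming the number of bases removed differs from read to read, while a fixed crop always removes the same positions.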

We can also use the FASTX-Toolkit. Take a deeper look at these tools, and choose the one you like the most. To give you another example:

./fastx_trimmer -f 9 -l 90 -i /mnt/data-assemblies/guia/R1.fastq -o /mnt/data-assemblies/guia/new_R1.fastq

It does a similar job to the previous command, keeping only bases 9 to 90 of each read. Take a look at FastQC again using these new files. Enjoy your time learning it 🙂

When we consider PacBio, it’s more complicated. PacBio generates longer reads, but with lower quality. We can still analyze them using FastQC, but then we need to consider other tools to handle our reads. PacBio indicates some of these tools here. For now, let’s use the reads without any change.

When we work with Illumina, we also come across mate-pair reads. Although we are not going to use them here, let us define single, paired-end and mate-pair reads.

According to Illumina, paired-end sequencing is a strategy to sequence both ends of a DNA/RNA fragment (200-800bp), in the 5’ to 3’ direction and in the 3’ to 5’ direction. It facilitates not only the detection of genomic rearrangements and repetitive sequence elements, but also the detection of gene fusions and novel transcripts. Single-read sequencing, as the name suggests, sequences only one end (Illumina Inc., 2018). However, it is no longer widely used.

Mate-pair sequencing, on the other hand, generates long-insert paired-end reads, with inserts longer than 800bp. This strategy relies on biotin, which is attached to the ends of the fragments before they are circularized (figure 3). The circularized DNA is fragmented, enriched and ligated to adapters. Thus, the final fragment contains the ends of the original longer fragment (ecSeq Bioinformatics, 2018).

Comparison of sample preparation for Illumina Paired-End Sequencing and Illumina Mate Pair Sequencing

The mate-pair reads can now link the paired-end reads across great distances (figure 4), since the length of the original long fragment is known. This can reveal long repetitive regions and also expose problems generated during the assembly of paired-end reads.

Genome assembly using paired-end (short) and mate-pair (long) reads. Each line in black represents a forward and a reverse DNA fragment. The orange line represents biotinylated ends. The dashed line refers to the known original mate-pair size, which will be used to validate the sequence generated with the short paired-end reads.

Assembly Approaches

So, as you noticed, we have reads from Illumina as well as reads from PacBio. Let us make our guide more interesting, though. We are going to assemble the PacBio reads using Canu, the Illumina reads using SPAdes and then the PacBio and Illumina reads together using SPAdes again. At the end we’ll be able to compare all three strategies. However, before going further, I’ll briefly present some assembly approaches.

After sequencing, genome assembly is needed, even when using Sanger platforms. However, distinguishing between de novo and mapping approaches is very important for selecting the best assembly algorithm.

De novo genome assembly aims to reconstruct DNA or RNA molecules for which no previously sequenced reference genome exists (an NP-hard problem). The mapping, or comparative (re-sequencing), approach, on the other hand, uses a sequenced genome from the same or a related species as a guide during assembly (alignment), which makes it much easier than de novo genome assembly (Pop 2009; Miller, Koren and Sutton, 2010).

Assembling genomes became a much bigger challenge with NGS, which produces millions of short reads. Many algorithms and tools were developed to improve de novo genome assembly, considering both the quality of the assembled genomes and computational efficiency. The main algorithms are Greedy, Overlap-Layout-Consensus (OLC), De Bruijn graph (DBG) and string graph, which are summarized below.

Greedy Assembly Algorithm

Like any other greedy algorithm, the greedy assembly algorithm always selects the best available option at each operation, according to an ordering. In this case, a basic operation means: “given any read or contig, add one more read or contig. Each operation uses the next highest-scoring overlap to make the next join” (Miller, Koren and Sutton, 2010). Contigs and, later, scaffolds are therefore assembled to be as large as possible (both terms are defined below).

The greedy approach was widely used for assembling Sanger data, in assemblers such as phrap, TIGR Assembler and CAP3. However, more recent software has used different greedy strategies (Pop 2009). OLC and DBG graphs may also be used by greedy algorithms (Chen et al., 2017).
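The greedy idea can be sketched in a few lines. This is a toy illustration under strong assumptions (error-free reads, all on the same strand, exact suffix-prefix overlaps scored by length), not the strategy of phrap, TIGR Assembler or CAP3; the function names and reads are invented for the example.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # no overlap left: what remains are the final contigs
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Three made-up reads sampled from the toy genome ATGGCGTGCAATCC
contigs = greedy_assemble(["ATGGCGTG", "GCGTGCAA", "TGCAATCC"])
```

Because each join is chosen locally, a repeat longer than the overlaps can mislead this procedure, which is one reason graph-based methods were developed.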


Contig derives from the word contiguous; it means a set of overlapping DNA fragments that together produce a consensus region of DNA (Staden, 1980).

Scaffold is a series of contigs separated by gaps of known length.

Overlap-layout-consensus (OLC)

As in the greedy algorithm, OLC starts from a list of the highest-scoring overlaps for each read (Staden, 1980). The list is used to create an overlap graph, in which each read corresponds to a node and each edge represents an overlap between the reads it connects (Figure 1).

A layout step is responsible for identifying paths through the graph in order to generate genome fragments, or contigs. The ideal path would traverse each node in the graph exactly once, reconstructing the whole genome (Pop, 2009). Finding this path, known as a Hamiltonian path, is computationally difficult (an NP-hard problem). The overlap step has time complexity O(n²) (Chen et al., 2017).
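As a rough sketch of the overlap and layout steps, the code below builds an overlap graph with an all-against-all comparison (the O(n²) part) and then searches for Hamiltonian paths by brute force. The brute-force search is my own simplification to show why the layout is hard: it tries every ordering of the reads, so it is only usable on a handful of reads. Names and reads are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len=3):
    """Longest proper suffix of a that is also a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; an edge a -> b stores the overlap length."""
    graph = {r: {} for r in reads}
    for a, b in permutations(reads, 2):  # all ordered pairs of reads
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph


def hamiltonian_layouts(graph):
    """Every ordering that follows edges and visits each read exactly once."""
    return [
        order
        for order in permutations(list(graph))
        if all(order[i + 1] in graph[order[i]] for i in range(len(order) - 1))
    ]


reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC"]
graph = build_overlap_graph(reads)
layouts = hamiltonian_layouts(graph)
```

Real OLC assemblers avoid this exhaustive search with heuristics and graph simplification.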

The consensus step is the final stage, in which reads overlapping the same genome positions are used to identify the correct bases, detect polymorphisms and generate sequence quality values (Li et al., 2004).
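The simplest possible consensus is a majority vote per column. The sketch below assumes the layout step has already placed the reads on a common coordinate system, padded with "-" where a read does not cover a position; real consensus callers also weigh base qualities, which this toy version ignores. The function name and reads are made up.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base per column; '-' marks positions a read does not cover."""
    result = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]
        # 'N' when no read covers the column at all
        result.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(result)


# The second read carries a made-up sequencing error at the 4th column
aligned = [
    "ACGTAC--",
    "-CGAACGT",
    "--GTACGT",
]
sequence = consensus(aligned)
```

Here the erroneous A is outvoted by the two reads that agree on T.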

However, in the whole genome shotgun (WGS) technique, the ideal path does not exist. The strategy is instead to assemble contigs while trying to remove gaps and errors, resolve repeats in the genome and resolve forks. According to Pop (2009), a fork means “a read A that overlaps two other reads, B and C; however, B and C do not overlap each other. Such a situation often represents the boundary between a repeat and the genomic regions adjacent to the copies of this repeat throughout the genome; however, forks can also be caused by sequencing errors”.

The list of identified overlaps is generated from the input reads. The overlaps are used as edges connecting the reads, which are the nodes. The resulting graph is then used to find the best path that reconstructs the whole genome (called a Hamiltonian path). Figure taken from Commins, Toft and Fares, 2009.

de Bruijn graph algorithm

The de Bruijn graph is another graph approach, widely used for short reads, which implements a k-mer strategy (Idury & Waterman, 1995). In a de Bruijn graph, k-mers are nodes, and exact overlaps of length k-1 between two adjacent nodes are edges (Pop, 2009); each repeat is represented only once in the graph, with links to its different start and end positions (Zerbino & Birney, 2008).

Here, a path in the graph is found using an Eulerian path algorithm (O(n)), which traverses every edge in the graph exactly once. There are several efficient algorithms for finding an Eulerian path, although a graph may contain an exponential number of Eulerian paths (Pop, 2009). In addition, finding a Hamiltonian path may be reduced to finding an Eulerian path in a (k-1)-mer DBG (Chen et al., 2017).
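Here is a small sketch of the idea. Note the assumption: I use the common edge-centric formulation, in which each distinct k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, because that is the form in which an Eulerian path spells the sequence. The path is found with Hierholzer's algorithm. The reads are made up, error-free and repeat-free, so the path is unique; real data needs error correction and graph simplification first.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Each distinct k-mer becomes one edge: (k-1)-mer prefix -> (k-1)-mer suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1), next(iter(graph))
    )
    edges = {u: list(vs) for u, vs in graph.items()}  # working copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finished
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
assembled = spell(eulerian_path(de_bruijn_edges(reads, k=4)))
```

Every edge is visited once, so the running time grows linearly with the number of k-mers, in contrast to the Hamiltonian search needed on an overlap graph.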

A drawback of the k-mer approach is a loss of information, namely the “long-range connectivity information implied by each read” (Pop, 2009). To incorporate read information, Pevzner and colleagues (2001) created a variation of the Eulerian path problem, called the Eulerian superpath problem. This superpath is built from sub-paths corresponding to the given reads.

String graph algorithm

The string graph approach was first presented explicitly in the Euler algorithms. However, Myers (2005) introduced this graph as a new concept that does away with the k-mer idea, in order to obtain a more efficient algorithm (O(n)), scalable to mammalian genomes.

The String Graph Assembler (SGA) combines a compressed data structure, the Ferragina-Manzini (FM) index, with a collection of assembly algorithms (Simpson & Durbin, 2010; Chen et al., 2017). The graph is built from pairwise overlaps between reads, and transitive edges are removed. Just as in the de Bruijn graph, repeats are collapsed into a single unit, but without the need to generate k-mers from the reads. Error correction is performed on the reads before assembly; the FM-index is then constructed to compute the string graph, and finally the contigs are assembled (Chen et al., 2017).
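To illustrate what removing transitive edges means, here is a deliberately naive sketch: if read a overlaps b and c, and b also overlaps c, then the edge a -> c adds no information and can be dropped. This is not Myers's linear-time procedure nor SGA's FM-index based one, and it ignores overlap lengths; the read names are placeholders.

```python
def remove_transitive_edges(graph):
    """Drop edge a -> c whenever a -> b and b -> c are also present."""
    reduced = {node: set(targets) for node, targets in graph.items()}
    for a, targets in graph.items():
        for b in targets:
            for c in graph.get(b, ()):
                if c in targets and c != b:
                    reduced[a].discard(c)  # a -> c is implied by a -> b -> c
    return reduced


# r1 overlaps r2 and r3, and r2 overlaps r3
overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
string_graph = remove_transitive_edges(overlaps)
```

After the reduction, the three reads form a simple chain r1 -> r2 -> r3.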



Let’s use Canu to assemble the PacBio data. First, rename your files to something simpler. Here, we keep only the accession number. After installing Canu, type:

./canu -p Pfermentans -d /mnt/data-assemblies/bacteria/Pfermentans genomeSize=5.03362m -pacbio-raw SRR1818128.fastq SRR1816870.fastq

As you can see, -p sets the prefix used for the output files (here, the name of your organism), -d the directory where you want to store the results, genomeSize the estimated length of the genome, and -pacbio-raw the PacBio reads.


And now SPAdes for Illumina, and for PacBio + Illumina, respectively. SPAdes automatically tries several k-mer lengths and chooses the best one according to some heuristics. You’ll see it’s very simple. To learn more, type “./spades.py -h”.

./spades.py -1 SRR572209_1.fastq -2 SRR572209_2.fastq -o /dir_out
./spades.py -1 SRR572209_1.fastq -2 SRR572209_2.fastq --pacbio SRR1818128.fastq --pacbio SRR1816870.fastq -o /dir_out

So, the assembly process looks simpler than it is. Actually, the big problem comes next.

Evaluation Process

The genomics field is still fairly young and presents a large number of practical issues to tackle. Assembly quality is one of the urgent issues that have emerged, particularly in personalized medicine scenarios. By now, many metrics and strategies have been proposed to evaluate an assembly.

Taking into account the proposed evaluation metrics (summarized in the figure below), we now face the problem of choosing those that give us a better understanding of assembly quality. In many contexts, this choice will directly influence which assemblies are selected and, consequently, how the genome assembly can be applied.

Main Assembly Evaluation Strategies. Contiguity is the blue circle, Base Analysis yellow, and Genes Analysis the brown one.

So, you will probably spend the majority of your time looking for a good-quality assembly. You will need to come back to the assembly step many times, trying new parameters and assemblers. Of course, it’s so much easier when you already have good reads, so also consider generating the best reads possible.

Here we are going to use QUAST to generate many quality metrics. Download QUAST or use it on the web: http://quast.bioinf.spbau.ru/. If you use its command line:

./quast.py --gene-finding --rna-finding -R GCF_000271665.2_ASM27166v2_genomic.fna.gz --est-ref-size 5033620 --pacbio SRR1816870.fastq --pacbio SRR1818128.fastq Pfermentans.unitigs.fasta --threads 8 -o /quast

You’ll do the same for the three assemblies. The command above generates quality metrics for Canu’s assembly. We used the unitigs file. --gene-finding and --rna-finding estimate the number of genes and RNAs. -R indicates a reference genome; here we used the one from ENA, GCF_000271665. If you do not have a reference genome, QUAST will generate the metrics that are available without a reference. You can also give the reference genome length in --est-ref-size. And if you give QUAST the raw reads, it will return more quality measures.

Now take some time to interpret the metrics from the three assemblies.

Considering a list of contigs sorted by decreasing length, Nx (e.g. N50, N90) is the length of the shortest contig in the smallest set of longest contigs whose lengths sum to at least x% of the total assembly length. NGx uses the length of the original genome instead of the total assembly length. NAx does the same job as Nx but uses a list of aligned contigs, in which contigs containing misassemblies are broken into new contigs [Gurevich et al. 2013].
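The definition is easier to follow with numbers. The sketch below is my own minimal version of the Nx/NGx calculation described above, not QUAST's code; the function name and the contig lengths are invented for the example.

```python
def nx(lengths, x=50, total=None):
    """Nx of a list of contig lengths.

    Pass the genome length as `total` to get NGx instead of Nx.
    """
    target = (total if total is not None else sum(lengths)) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None  # the assembly is shorter than x% of the given genome length


# Made-up assembly: six contigs, 290 bp in total
contig_lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(contig_lengths, 50)               # 80 + 70 = 150 >= 145
n90 = nx(contig_lengths, 90)               # needs the five longest contigs
ng50 = nx(contig_lengths, 50, total=400)   # against a 400 bp genome
```

In this toy case N50 is 70 but NG50 is only 50, because the assembly covers less than the assumed genome length; that gap is exactly what NGx is meant to expose.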

In terms of the number of contigs, the best assembly is the one made with Canu, with 4 contigs, and the worst is the one from SPAdes using only Illumina. N50 and L50 are also better for Canu. However, when we look at Reference Mapped, for example, we see Canu with 69.91%, while SPAdes has more than 99%. You will probably agree with me that the assembly using both PacBio and Illumina in SPAdes looks like the best option here.

Did you enjoy it?

Draft’s Improvement

There’s a plethora of tools available on the internet to improve your assembly. You should consider looking around and applying some of them. I would like to cite here GapFiller, CISA and KmerGenie.

GapFiller, as its name suggests, is a tool that seeks to eliminate the gaps inside the assembly, according to some strategies. CISA, like many others, generates hybrid assemblies by gathering the outputs of many distinct assemblers. And KmerGenie suggests the best k value for your data.

There’s so much more to learn and apply, but it’s not the focus here. As a beginner, you should first learn the basics.


It’s always challenging to speak about the future. Maybe everything we have talked about here will become unnecessary in the years to come, given a technology capable of sequencing a whole DNA molecule at once. But what we do know is that, for the moment, we need to assemble DNA reads.

However, we can extrapolate from what we know about the third sequencing generation. Nanopore, for example, sequencing giant reads in the middle of nowhere, is a good indication of what we can expect.

We are moving towards a more inclusive, cheaper and quicker science. Our work now is to produce good-quality assemblies in order to better understand and manage the molecule of life.

CHEN, Qingfeng et al. Recent advances in sequence assembly: principles and applications. Briefings In Functional Genomics, [s.l.], v. 16, n. 6, p.361-378, 26 abr. 2017. Oxford University Press (OUP). http://dx.doi.org/10.1093/bfgp/elx006.
CHIKHI, R.; MEDVEDEV, P.. Informed and automated k-mer size selection for genome assembly. Bioinformatics, [s.l.], v. 30, n. 1, p.31-37, 3 jun. 2013. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/btt310.
CLARKE, James et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nature Nanotechnology, [s.l.], v. 4, n. 4, p.265-270, 22 fev. 2009. Springer Nature. http://dx.doi.org/10.1038/nnano.2009.12.
COMMINS, Jennifer; TOFT, Christina; FARES, Mario A.. Computational Biology Methods and Their Application to the Comparative Genomics of Endocellular Symbiotic Bacteria of Insects. Biological Procedures Online, [s.l.], v. 11, n. 1, p.52-78, 11 mar. 2009. Springer Nature. http://dx.doi.org/10.1007/s12575-009-9004-1.
CONSORTIUM, International Human Genome Sequencing. Finishing the euchromatic sequence of the human genome. Nature, [s.l.], v. 431, n. 7011, p.931-945, 21 out. 2004. Springer Nature. http://dx.doi.org/10.1038/nature03001.
EISENSTEIN, Michael. Oxford Nanopore announcement sets sequencing sector abuzz. Nature Biotechnology, [s.l.], v. 30, n. 4, p.295-296, abr. 2012. Springer Nature. http://dx.doi.org/10.1038/nbt0412-295.
ECSEQ BIOINFORMATICS. What is mate pair sequencing for? Disponível em: . Acesso em: 09 mar. 2018.
FEDURCO, M. et al. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Research, [s.l.], v. 34, n. 3, 6 fev. 2006. Oxford University Press (OUP). http://dx.doi.org/10.1093/nar/gnj023.
GREENLEAF, William J; SIDOW, Arend. The future of sequencing: convergence of intelligent design and market Darwinism. Genome Biology, [s.l.], v. 15, n. 3, p.303-310, 2014. Springer Nature. http://dx.doi.org/10.1186/gb4168.
HAQUE, Farzin et al. Solid-state and biological nanopore for real-time sensing of single chemical and sequencing of DNA. Nano Today, [s.l.], v. 8, n. 1, p.56-74, fev. 2013. Elsevier BV. http://dx.doi.org/10.1016/j.nantod.2012.12.008.
HARRIS, T. D. et al. Single-Molecule DNA Sequencing of a Viral Genome. Science, [s.l.], v. 320, n. 5872, p.106-109, 4 abr. 2008. American Association for the Advancement of Science (AAAS). http://dx.doi.org/10.1126/science.115042
HEATHER, James M.; CHAIN, Benjamin. The sequence of sequencers: The history of sequencing DNA. Genomics, [s.l.], v. 107, n. 1, p.1-8, jan. 2016. Elsevier BV. http://dx.doi.org/10.1016/j.ygeno.2015.11.003.
IDURY, Ramana M.; WATERMAN, Michael S.. A New Algorithm for DNA Sequence Assembly. Journal Of Computational Biology, [s.l.], v. 2, n. 2, p.291-306, jan. 1995. Mary Ann Liebert Inc. http://dx.doi.org/10.1089/cmb.1995.2.291.
ILLUMINA INC.. Advantages of paired-end and single-read sequencing: Understand the key differences between these sequencing read types. Disponível em: . Acesso em: 09 mar. 2018.
JACKSON, D. A.; SYMONS, R. H.; BERG, P. Biochemical method for inserting new genetic information into DNA of Simian Virus 40: circular SV40 DNA molecules containing lambda phage genes and the galactose operon of Escherichia coli. PMC, USA, v. 10, n. 69, p.2904-2909, oct. 1972.
KARLSSON, E. et al. Scaffolding of a bacterial genome using MinION nanopore sequencing. Scientific Reports, [s.l.], v. 5, n. 1, 7 july 2015. Springer Nature. http://dx.doi.org/10.1038/srep11996.
LI, M. et al. Adjust quality scores from alignment and improve sequencing accuracy. Nucleic Acids Research, [s.l.], v. 32, n. 17, p.5183-5191, 23 set. 2004. Oxford University Press (OUP). http://dx.doi.org/10.1093/nar/gkh850.
LOMAN, Nicholas J et al. Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology, [s.l.], v. 30, n. 5, p.434-439, 22 abr. 2012. Springer Nature. http://dx.doi.org/10.1038/nbt.2198.
LOMAN, N. J.; QUINLAN, A. R.. Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics, [s.l.], v. 30, n. 23, p.3399-3401, 20 ago. 2014. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/btu555.
MADOUI, Mohammed-amin et al. Genome assembly using Nanopore-guided long and error-free DNA reads. Bmc Genomics, [s.l.], v. 16, n. 1, 20 april 2015. Springer Nature. http://dx.doi.org/10.1186/s12864-015-1519-z.
MCKERNAN, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research, [s.l.], v. 19, n. 9, p.1527-1541, 22 jun. 2009. Cold Spring Harbor Laboratory. http://dx.doi.org/10.1101/gr.091868.109.
MILLER, Jason R.; KOREN, Sergey; SUTTON, Granger. Assembly algorithms for next-generation sequencing data. Genomics, [s.l.], v. 95, n. 6, p.315-327, jun. 2010. Elsevier BV. http://dx.doi.org/10.1016/j.ygeno.2010.03.001.
MYERS, E. W. The fragment assembly string graph. Bioinformatics, [s.l.], v. 21, n. 2, p.79-85, 1 sept. 2005. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/bti1114.
NOWOSHILOW, Sergej et al. The axolotl genome and the evolution of key tissue formation regulators. Nature, [s.l.], v. 554, n. 7690, p.50-55, 24 jan. 2018. Springer Nature. http://dx.doi.org/10.1038/nature25458.
NYRÉN, Pål; LUNDIN, Arne. Enzymatic method for continuous monitoring of inorganic pyrophosphate synthesis. Analytical Biochemistry, [s.l.], v. 151, n. 2, p.504-509, dec. 1985. Elsevier BV. http://dx.doi.org/10.1016/0003-2697(85)90211-8.
PENG, Yu et al. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. Lecture Notes In Computer Science, [s.l.], p.426-440, 2010. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-12683-3_28.
PEVZNER, P. A.; TANG, H.; WATERMAN, M. S. An Eulerian path approach to DNA fragment assembly. Proceedings Of The National Academy Of Sciences, [s.l.], v. 98, n. 17, p.9748-9753, 14 aug. 2001. Proceedings of the National Academy of Sciences. http://dx.doi.org/10.1073/pnas.171285098.
POP, M. Genome assembly reborn: recent computational challenges. Briefings In Bioinformatics, [s.l.], v. 10, n. 4, p.354-366, 29 may 2009. Oxford University Press (OUP). http://dx.doi.org/10.1093/bib/bbp026.
QUAIL, Michael et al. A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers. BMC Genomics, [s.l.], v. 13, n. 1, p.341-354, 2012. Springer Nature. http://dx.doi.org/10.1186/1471-2164-13-341.
RONAGHI, M. et al. DNA SEQUENCING: A Sequencing Method Based on Real-Time Pyrophosphate. Science, [s.l.], v. 281, n. 5375, p.363-365, 17 jul. 1998. American Association for the Advancement of Science (AAAS). http://dx.doi.org/10.1126/science.281.5375.363.
ROTHBERG, Jonathan M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature, [s.l.], v. 475, n. 7356, p.348-352, jul. 2011. Springer Nature. http://dx.doi.org/10.1038/nature1024
SAIKI, R. K. et al. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science, v. 239, n. 4839, p.487-491, jan. 1988.
SANGER, F.; COULSON, A. R. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal Of Molecular Biology, [s.l.], v. 94, n. 3, p.441-448, may 1975. Elsevier BV. http://dx.doi.org/10.1016/0022-2836(75)90213-2.
SANGER, F.; NICKLEN, S.; COULSON, A. R. DNA sequencing with chain-terminating inhibitors: (DNA polymerase/nucleotide sequences/bacteriophage φX174). Proc. Natl. Acad. Sci.: Biochemistry, USA, v. 74, n. 12, p.5463-5467, dec. 1977.
SANGER, F. et al. Nucleotide sequence of bacteriophage φX174 DNA. Nature, [s.l.], v. 265, n. 5596, p.687-695, 1977. http://dx.doi.org/10.1038/265687a0. PMID 870828.
SCHADT, E. E.; TURNER, S.; KASARSKIS, A. A window into third-generation sequencing. Human Molecular Genetics, [s.l.], v. 19, n. 2, p.227-240, 21 sept. 2010. Oxford University Press (OUP). http://dx.doi.org/10.1093/hmg/ddq416.
SHENDURE, Jay; JI, Hanlee. Next-generation DNA sequencing. Nature Biotechnology, [s.l.], v. 26, n. 10, p.1135-1145, oct. 2008. Springer Nature. http://dx.doi.org/10.1038/nbt1486.
STADEN, R. A new computer method for the storage and manipulation of DNA gel reading data. Nucleic Acids Research, [s.l.], v. 8, n. 16, p.3673-3694, 1980. Oxford University Press (OUP). http://dx.doi.org/10.1093/nar/8.16.3673.
SNUSTAD, D. Peter; SIMMONS, Michael J. Fundamentos de Genética. 4. ed. Rio de Janeiro: Guanabara Koogan, 2012. 903 p. Translated by Paulo A. Motta.
TREANGEN, Todd J.; SALZBERG, Steven L.. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics, [s.l.], v. 13, n. 1, p.36-46, 29 nov. 2011. Springer Nature. http://dx.doi.org/10.1038/nrg3117
VAN DIJK, Erwin L. et al. Ten years of next-generation sequencing technology. Trends In Genetics, [s.l.], v. 30, n. 9, p.418-426, sept. 2014. Elsevier BV. http://dx.doi.org/10.1016/j.tig.2014.07.001.
ZERBINO, D. R.; BIRNEY, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, [s.l.], v. 18, n. 5, p.821-829, 21 feb. 2008. Cold Spring Harbor Laboratory. http://dx.doi.org/10.1101/gr.074492.107.

Copyright © 2020 Guilherme Neumann.

PhD in Germany :)

Last year I started my PhD in Berlin, and since then I have learned A LOT. I could write about so many things, including the many challenges and fears of living abroad and starting a doctorate. However, in this first post I think it would be more interesting to start by telling you how I got here in the first place.

I do not intend to tell you every detail, nor to persuade you to come to Germany or to live abroad. But I think it might be helpful for some of you to see how much there is to plan, and how important your dreams are for staying really motivated about your goals.

Okay, let’s get started. I will summarize the main points of my experience.

The Dream of Living Abroad

I would say it all began during my school years. I remember searching for universities in the U.S. and feeling amazed by people who could study at Harvard or Yale. But when I looked for scholarships, I realized I would not be able to go for it. Not every opportunity is financially viable, especially when your family has to be part of the equation. Anyway, I applied to two federal universities (no fees) in Rio de Janeiro (yep, I’m Brazilian) and to two private ones (BIG fees). I was accepted by three of the four and chose one of the private ones. The best, in my opinion. I got a full scholarship there, plus financial support for food and public transportation. That does not mean I was a super nerd, maybe a little bit, but I was very dedicated and persistent. I only got in through the waiting lists, not on the very first try.

THE DREAM of going abroad only got stronger. I applied twice for a scholarship during college (once for the U.S. and once for Germany), but neither worked out. To keep it short, at some point I realized I did not know why I wanted the U.S., while I became more and more confident about Germany and about what this country had meant to me my whole life. My last name is German, and I come from a city in the mountains of Rio where the main cultural influence is German. And I have German roots. But this was not the only thing. I had the opportunity to visit Berlin in 2016, and after reading a lot about the city I became convinced that this was the place where I would like to start my life and career. Berlin is the kind of place where you feel that the future of humanity is being planned, dreamed and discussed.


After 4 years of searching and reflecting, I had a very clear plan for the future. Maybe it was too much planning. It brought me so much stress. But it worked in the end. And then reality makes itself felt. You need money. It does not matter whether you have a scholarship or not. You need to pay for your flights, health insurance (mandatory in Germany), your food, accommodation, and daily costs (at least for the first months). Maybe your dream does not demand money, but it for sure demands some sort of effort. Otherwise, it would already be reality, right? My main effort was saving and earning more money. I come from a poor family and I did not have much in savings. I mean, I needed some extra income, or extra jobs. I planned my Master's with that in mind. In the end, I basically had 3 jobs in order to have the money to make my dream come true, and also to marry my love :).


I think organizing the move was even more difficult than the move to Germany itself. But looking back, it was fun. That’s the most important thing. Of course, not everything brings joy. But it is important to enjoy the journey, since most of the time we are on the way, pursuing a goal or a dream.

I am a biologist who did a Master's in Informatics, married another guy, and had 3 jobs. Imagine how intense it was. But so delightful and inspiring. To this day I am inspired by how I managed all of those things. So, I believe this is another point. Sometimes we forget the things we dealt with in the past and we really believe we are failing miserably. But when you remember your past, you notice you have already faced similar issues. I am repeating this to myself right now, hahah. It is difficult to stay focused and to study so many things in a row. But it is just a matter of strategy and love. As I said, we have to enjoy the journey. No stress, please. Be careful about being too precise in all your plans and schedules. You are a human being. Be happy!

I think I can finish here. Have faith and you will get there.

Thank you for reading this post. Any feedback is welcome.

Sincerely, Guilherme.