Population Genomics


Population genomics is the large-scale comparison of DNA sequences of populations. Population genomics is a neologism that is associated with population genetics. Population genomics studies genome-wide effects to improve our understanding of microevolution so that we may learn the phylogenetic history and demography of a population.

Population Genetic Analysis

Genetic diversity is the amount of variation observed between DNA sequences from distinct individuals of a given species. This pivotal concept of population genetics has implications for species health, domestication, management and conservation. Levels of genetic diversity seem to vary greatly in natural populations and species, but the determinants of this variation, and particularly the relative influences of species biology and ecology versus population history, are still largely mysterious.
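Genetic diversity in this sense is often summarized as nucleotide diversity (π), the average proportion of sites that differ between two sequences drawn from the sample. The following is a minimal sketch, assuming the sequences are already aligned to equal length; the function name and the toy alignment are illustrative only.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average proportion of differing sites over all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for every pair of individuals.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Hypothetical toy alignment: three individuals, eight sites.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
pi = nucleotide_diversity(toy_alignment)
print(f"pi = {pi:.4f}")
```

Here the three pairs differ at 1, 1 and 2 sites, so π is 4 / (3 × 8).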

Pedigree Reconstruction

Prest-plus

Detects pedigree errors, cryptic relatedness and relationship misspecification in genome-wide association study (GWAS) or linkage data. Using an optimized maximum likelihood estimator (MLE) of identity-by-descent (IBD) probabilities, prest-plus computes accurate estimates of IBD0/1/2 from any number and combination of single nucleotide polymorphism (SNP) or microsatellite markers. It works as efficiently and accurately with a microsatellite linkage panel as with SNP data from a GWAS.

Official Website

Documentation

Publications

Institution(s)

Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada; Department of Clinical Biochemistry, University Health Network, Toronto, ON, Canada


FRANz

A package for pedigree reconstruction in natural populations using co-dominant genomic markers such as microsatellites and single nucleotide polymorphisms (SNPs). FRANz makes use of prior information such as known relationships (sub-pedigrees) or the age and sex of individuals. The accuracy of the algorithm is demonstrated for simulated data as well as an empirical dataset with known pedigree. The parentage inference is robust even in the presence of genotyping errors.

Official Website

Documentation

Publications

Institution(s)

Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany; RNomics Group, Fraunhofer Institut for Cell Therapy and Immunology (IZI), Leipzig, Germany; Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse, Vienna, Austria; The Santa Fe Institute, Santa Fe, NM, USA


RELPAIR

Infers the most likely relationship of a pair of putative sibs. RELPAIR considers all possible pairs of individuals in the sample, tests for additional relationships, allows explicitly for genotyping error, and includes X-linked data. Using autosomal genome scan data, the method has excellent power to differentiate monozygotic twins, full sibs, parent-offspring pairs, second-degree (2°) relatives, first cousins, and unrelated pairs, but it cannot distinguish accurately among the 2° relationships of half sibs, avuncular pairs, and grandparent-grandchild pairs.
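The difficulty with half sibs, avuncular pairs, and grandparent-grandchild pairs is consistent with their expected genome-wide IBD sharing being identical. The sketch below tabulates the textbook expected probabilities of sharing 0, 1 or 2 alleles IBD and picks the closest relationship(s) for an observed estimate. It is an illustration only, not the likelihood method used by RELPAIR or prest-plus; the function names and the nearest-match rule are assumptions.

```python
# Expected (IBD0, IBD1, IBD2) probabilities for outbred pairs.
EXPECTED_IBD = {
    "monozygotic twins": (0.0, 0.0, 1.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "half sibs": (0.5, 0.5, 0.0),
    "avuncular": (0.5, 0.5, 0.0),
    "grandparent-grandchild": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def kinship(k0, k1, k2):
    """Kinship coefficient implied by the IBD sharing probabilities."""
    return k1 / 4 + k2 / 2


def closest_relationships(k0, k1, k2):
    """Return every relationship whose expected IBD vector is nearest."""
    observed = (k0, k1, k2)

    def squared_distance(label):
        expected = EXPECTED_IBD[label]
        return sum((o - e) ** 2 for o, e in zip(observed, expected))

    best = min(squared_distance(label) for label in EXPECTED_IBD)
    return sorted(
        label for label in EXPECTED_IBD
        if abs(squared_distance(label) - best) < 1e-12
    )


print(closest_relationships(0.24, 0.51, 0.25))  # a single best match
print(closest_relationships(0.52, 0.47, 0.01))  # three tied second-degree labels
```

The second call returns all three second-degree labels, because genome-wide averages alone cannot separate them.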

Official Website

Publications

Institution(s)

Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA


Forward Simulation

SeqSIMLA

A simulation tool that efficiently generates sequence data under user-specified disease or quantitative trait models. SeqSIMLA is useful for evaluating the statistical properties of new study designs and new statistical methods that use NGS data.

Official Website

Publications

Institution(s)

Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Miaoli, Taiwan


SeDuS

A flexible and user-friendly forward-in-time simulator of patterns of molecular evolution within segmental duplications undergoing interlocus gene conversion and crossover. SeDuS introduces known features of interlocus gene conversion such as biased directionality and dependence on local sequence identity. Additionally, it includes aspects such as different selective pressures acting upon copy number and flexible crossover distributions. A graphical user interface allows fast fine-tuning of relevant parameters and straightforward real-time analysis of the evolution of duplicates.

Official Website

Documentation

Publications

Institution(s)

Institute of Evolutionary Biology (Universitat Pompeu Fabra – CSIC), PRBB, Barcelona, Spain


SFS_CODE

Generates samples from populations with complex demographic histories under various models of natural selection. SFS_CODE performs simulations under a general Wright-Fisher model with arbitrary demographic, selective, and mutational effects. It allows the user to simulate realistic genomic regions with several loci evolving according to a variety of mutation models (from simple to context-dependent), and takes into account insertions and deletions. Each locus can be annotated as either coding or non-coding, sex-linked or autosomal, selected or neutral, and have an arbitrary linkage structure.
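To make the idea of a forward-in-time Wright-Fisher simulation concrete, here is a deliberately tiny sketch: one biallelic locus in a haploid population of constant size, with optional selection. It is not SFS_CODE's implementation, which handles demography, multiple loci, linkage and indels; the function name and parameters are assumptions chosen for illustration.

```python
import random


def wright_fisher(pop_size, init_freq, generations, s=0.0, seed=None):
    """Track one allele's frequency forward in time.

    s is the selection coefficient of the tracked allele (relative
    fitness 1 + s, with s > -1); s = 0 gives pure genetic drift.
    """
    rng = random.Random(seed)
    freq = init_freq
    trajectory = [freq]
    for _ in range(generations):
        # Selection: reweight the allele by its relative fitness.
        p_sel = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
        # Drift: binomial sampling of pop_size offspring.
        count = sum(rng.random() < p_sel for _ in range(pop_size))
        freq = count / pop_size
        trajectory.append(freq)
        if freq in (0.0, 1.0):
            break  # allele lost or fixed
    return trajectory


neutral = wright_fisher(pop_size=100, init_freq=0.5, generations=50, seed=1)
print(len(neutral) - 1, "generations simulated; final frequency", neutral[-1])
```

Running it repeatedly with different seeds gives a distribution of trajectories, which is the basic output a forward simulator builds on.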

Official Website

Documentation

Publications

Institution(s)

Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, Institute for Quantitative Biosciences, University of California San Francisco, CA, USA


Population Structure Inference

STRUCTURE

A free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs. A related tool, fastSTRUCTURE, estimates approximate posterior distributions on ancestry proportions two orders of magnitude faster than STRUCTURE, with ancestry estimates and prediction accuracies comparable to those of ADMIXTURE.

Official Website

Publications

Institution(s)

Department of Genetics, Stanford University, Stanford, CA, USA


PCAngsd

Aims to infer population structure and admixture proportions in low depth next generation sequencing (NGS) data. PCAngsd utilizes genotype likelihoods to iteratively estimate individual allele frequencies. This tool is able to overcome the observed bias of low and variable sequencing depth by using individual allele frequencies as prior information. Moreover, it can push the lower boundaries of sequencing depth required to perform population genetic analyses using NGS data of large-scale genetic studies.
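As a rough illustration of working from genotype likelihoods rather than called genotypes, the sketch below estimates a single population allele frequency at one site with an EM loop under Hardy-Weinberg proportions. This is a simpler, population-level estimator and not PCAngsd's individual allele frequency algorithm; the function name, starting value and input layout are assumptions.

```python
def em_allele_frequency(genotype_likelihoods, n_iter=100, tol=1e-8):
    """Estimate the alternate allele frequency at one site.

    genotype_likelihoods: one (L0, L1, L2) tuple per individual, where
    Lg is P(reads | individual carries g copies of the alternate allele).
    """
    freq = 0.25  # arbitrary starting value (assumption)
    n = len(genotype_likelihoods)
    for _ in range(n_iter):
        # Hardy-Weinberg genotype prior given the current frequency.
        prior = ((1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2)
        expected_dosage = 0.0
        for likes in genotype_likelihoods:
            # E-step: posterior genotype weights for this individual.
            weights = [l * p for l, p in zip(likes, prior)]
            total = sum(weights)
            expected_dosage += (weights[1] + 2 * weights[2]) / total
        # M-step: the new frequency is the mean expected dosage over 2.
        new_freq = expected_dosage / (2 * n)
        converged = abs(new_freq - freq) < tol
        freq = new_freq
        if converged:
            break
    return freq


# Hypothetical low-depth data: likelihoods are flat where reads are few.
site = [(0.9, 0.1, 0.0), (0.3, 0.4, 0.3), (0.0, 0.2, 0.8), (0.5, 0.5, 0.0)]
print(round(em_allele_frequency(site), 3))
```

Individuals with flat likelihoods contribute little beyond the prior, which is how uncertainty from low depth is carried into the estimate.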

Official Website

Publications

Institution(s)

Department of Biology, Section for Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark


PSIKO

A software tool written in C++ for quick and accurate estimation of individual ancestry coefficients in a dataset exhibiting population structure. PSIKO takes as input a file in the .geno format, in which each row corresponds to a SNP and each column to an individual. It then estimates the number of founder populations and outputs ancestry estimates as well as the principal components of the dataset for subsequent use in association studies.

Official Website

Publications

Institution(s)

School of Computing Sciences, University of East Anglia, Norwich Research Park, Norwich, Norfolk, UK; Centre for Novel Agricultural Products, Department of Biology, University of York, York, UK; Department of Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich, UK


Genotyping by Sequencing data analysis (GBS Analysis)

Genotyping by sequencing (GBS) is a next generation sequencing based method that takes advantage of reduced representation to enable high throughput genotyping of large numbers of individuals at a large number of SNP markers. The relatively straightforward, robust, and cost-effective GBS protocol is currently being applied in numerous species by a large number of researchers.

Base Calling

PyroBayes

A base calling program for pyrosequencing reads from 454 Life Sciences sequencing machines. PyroBayes enables single-nucleotide polymorphism (SNP) calling in resequencing applications, including at shallow read coverage.

Official Website

Publications

Institution(s)

Department of Biology, Boston College, Chestnut Hill, MA, USA


Alta-Cyclic

Provides an Illumina genome-analyzer base caller. Alta-Cyclic is an application that uses machine learning to compensate for noise factors. It was developed to improve the number of accurate reads for sequencing runs up to 70 bases. It also permits users to reduce systematic biases, simplifying confident identification of sequence variants. This method works in two stages: the training stage and the base-calling stage.

Official Website

Publications

Institution(s)

Watson School of Biological Sciences, Cold Spring Harbor, NY, USA; Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA


freeIbis

An efficient basecaller for Illumina sequencers with calibrated quality scores. freeIbis offers significant improvements in sequence accuracy owing to the use of a novel multiclass support vector machine (SVM) algorithm. This approach produces more accurate basecalls than the default Illumina basecaller. freeIbis can use the control sequences to calibrate the output of the SVM to produce directly calibrated quality scores. For instance, freeIbis can produce quality scores that correlate highly with observed ones.
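A quality score is calibrated when it matches the error rate actually observed. On the standard Phred scale, Q = -10 log10(p), where p is the probability that a base call is wrong. The sketch below, with hypothetical helper names, converts between the two and computes an observed quality from control reads whose true sequence is known; it does not reproduce freeIbis's SVM-based calibration.

```python
import math


def phred_from_error_prob(p_error):
    """Phred quality score for a given base-call error probability."""
    return -10 * math.log10(p_error)


def error_prob_from_phred(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def observed_quality(called, truth):
    """Empirical Phred score of base calls against a known control sequence."""
    if len(called) != len(truth):
        raise ValueError("called and true sequences must have equal length")
    mismatches = sum(c != t for c, t in zip(called, truth))
    if mismatches == 0:
        return float("inf")  # no observed errors in this sample
    return phred_from_error_prob(mismatches / len(called))


print(phred_from_error_prob(0.001))   # close to 30
print(error_prob_from_phred(20))      # close to 0.01
print(observed_quality("ACGTACGTAC", "ACGTACGTAA"))  # 1 error in 10 bases
```

Comparing the predicted score of a bin of bases with the observed score for the same bin is the usual way to check calibration.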

Official Website

Publications

Institution(s)

Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany


Read Quality Control

Trim Galore!

Automates quality trimming, adapter trimming and quality control (QC). Trim Galore! is a wrapper script that also provides functionality to remove biased methylation positions from reduced representation bisulfite sequencing (RRBS) sequence files (directional, non-directional or paired-end). It can remove sequences that become too short during trimming.

Official Website

Publications

Institution(s)

TNLIST/Department of Automation, Tsinghua University, Beijing, China


MultiQC

A tool to create a single report visualizing output from multiple tools across many samples, enabling global trends and biases to be quickly identified. MultiQC allows accurate comparison between samples, allowing detection of subtle differences not noticeable when switching between different files. Data visualization aids batch effect detection and minimizes the risk of confounding factors affecting the results of the study.

Official Website

Github

Publications

Institution(s)

Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden; Department of Molecular Medicine and Surgery, Science for Life Laboratory, Center for Molecular Medicine, Karolinska Institute, Stockholm, Sweden; Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, Royal Institute of Technology, Stockholm, Sweden


Octopus-toolkit

Examines epigenomic and transcriptomic next generation sequencing (NGS) data. Octopus-toolkit can be used for antibody- or enzyme-mediated experiments and studies for the quantification of gene expression. It can accelerate the data mining of public epigenomic and transcriptomic NGS data for basic biomedical research. This tool provides a private and a public mode: one to process the user’s own data, and the other to analyze public NGS data by retrieving raw files from the GEO database.

Official Website

Documentation

Publications

Institution(s)

Department of Biological Sciences, Korea Advanced Institute of Science and Technology, Daejeon, Korea; Laboratory of Genetics and Physiology, National Institute of Diabetes, Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA; Department of Microbiology, Dankook University, Cheonan, Korea


Adapter trimming

Trimmomatic

Performs a variety of trimming tasks for Illumina paired-end and single-end data. Trimmomatic is a flexible, pair-aware preprocessing tool optimized for Illumina next-generation sequencing (NGS) data. The software includes several processing steps for read trimming and filtering. It uses a pipeline-based architecture, allowing individual ‘steps’ (adapter removal, quality filtering, etc.) to be applied to each read or read pair in the order specified by the user.
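The pipeline idea, where each step receives a read and passes it on in a user-chosen order, can be sketched in a few lines. This is a simplified single-end illustration with exact adapter matching, not Trimmomatic's algorithms; every function name, the adapter string and the thresholds are assumptions.

```python
from functools import partial


def clip_adapter(seq, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, quals
    return seq[:idx], quals[:idx]


def sliding_window_trim(seq, quals, window, min_mean_q):
    """Cut the read where a window's mean quality first drops too low."""
    for start in range(len(seq) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


def drop_if_short(seq, quals, min_len):
    """Discard the read (return None) if it became too short."""
    return (seq, quals) if len(seq) >= min_len else None


def run_pipeline(read, steps):
    """Apply each step in the order given; stop if a step drops the read."""
    for step in steps:
        read = step(*read)
        if read is None:
            return None
    return read


steps = [
    partial(clip_adapter, adapter="AGATCGGAAG"),
    partial(sliding_window_trim, window=4, min_mean_q=15),
    partial(drop_if_short, min_len=4),
]
read = ("ACGTACGTAGATCGGAAG", [30] * 5 + [2] * 13)
print(run_pipeline(read, steps))
```

Because the steps are just an ordered list, reordering them (for example, quality trimming before adapter clipping) changes the result, which is why the order is left to the user.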

Official Website

Documentation

Publications

Institution(s)

Department Metabolic Networks, Max Planck Institute of Molecular Plant Physiology, Golm, Germany; Institut für Biologie I, RWTH Aachen, Aachen, Germany; Institute of Bio- and Geosciences: Plant Sciences, Forschungszentrum Jülich, Jülich, Germany


Trim Galore!

Automates quality trimming, adapter trimming and quality control (QC). Trim Galore! is a wrapper script that also provides functionality to remove biased methylation positions from reduced representation bisulfite sequencing (RRBS) sequence files (directional, non-directional or paired-end). It can remove sequences that become too short during trimming.

Official Website

Publications

Institution(s)

TNLIST/Department of Automation, Tsinghua University, Beijing, China


dDocent

Offers a platform for population-level analyses. dDocent is open-source software dedicated to processing individually barcoded restriction-site associated DNA sequencing (RADseq) data. The application employs data reduction techniques and interacts with other programs to offer features such as de novo assembly of RAD loci, single nucleotide polymorphism (SNP) and indel calling, and quality trimming or baseline data filtering.

Official Website

Documentation

Github

Publications

Institution(s)

Marine Genomics Laboratory, Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, TX, USA


Germline SNP Detection

GATK

Focuses on variant discovery and genotyping. GATK is a toolkit, developed at the Broad Institute, that comprises several tools and can support projects of any size. It bundles an assortment of command-line tools for analyzing high-throughput sequencing (HTS) data in formats such as SAM, BAM, CRAM and VCF. The website includes extensive documentation to guide users.

[Official Website](https://software.broadinstitute.org/gatk/)

Forum

Publications

Institution(s)

Broad Institute, Cambridge, MA, USA; Massachusetts General Hospital, Boston, MA, USA


SAMtools

Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting between them. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.
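Much of this manipulation revolves around the FLAG field of a SAM record, a bitmask that records properties such as strand, pairing and duplicate status. The sketch below decodes it in plain Python for illustration. The bit values follow the SAM specification, while the label strings, function names and example record are made up here; it is not a substitute for SAMtools itself.

```python
# Bit values from the SAM specification; label strings are illustrative.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the properties set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_sam_line(line):
    """Extract the first six mandatory fields of a SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "cigar": fields[5],
    }


# Hypothetical record: a duplicate-marked, properly paired second mate.
record = parse_sam_line("read1\t1187\tchr1\t100\t60\t8M\t=\t50\t-58\tACGTACGT\tIIIIIIII")
print(decode_flag(record["flag"]))
is_duplicate = bool(record["flag"] & 0x400)
```

Filtering on `flag & 0x400` is the kind of test a duplicate-removal step performs once duplicates have been marked.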

Official Website

Publications

Institution(s)

Medical Population Genetics Program, Broad Institute, Cambridge Center, Cambridge, MA, USA


dDocent

Offers a platform for population-level analyses. dDocent is open-source software dedicated to processing individually barcoded restriction-site associated DNA sequencing (RADseq) data. The application employs data reduction techniques and interacts with other programs to offer features such as de novo assembly of RAD loci, single nucleotide polymorphism (SNP) and indel calling, and quality trimming or baseline data filtering.

Official Website

Documentation

Github

Publications

Institution(s)

Marine Genomics Laboratory, Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, TX, USA


Somatic SNV detection

GATK

Focuses on variant discovery and genotyping. GATK is a toolkit, developed at the Broad Institute, that comprises several tools and can support projects of any size. It bundles an assortment of command-line tools for analyzing high-throughput sequencing (HTS) data in formats such as SAM, BAM, CRAM and VCF. The website includes extensive documentation to guide users.

[Official Website](https://software.broadinstitute.org/gatk/)

Forum

Publications

Institution(s)

Broad Institute, Cambridge, MA, USA; Massachusetts General Hospital, Boston, MA, USA


SAMtools

Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting between them. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.

Official Website

Publications

Institution(s)

Medical Population Genetics Program, Broad Institute, Cambridge Center, Cambridge, MA, USA


SNooPer

A versatile machine learning approach that uses Random Forest classification models to accurately call somatic variants in low-depth sequencing data. SNooPer trains a data-specific model on a subset of variant positions from the sequencing output for which the class (true variation or sequencing error) is known. During the training phase, using a real dataset of 40 childhood acute lymphoblastic leukemia patients, it was shown that the SNooPer algorithm is not affected by low coverage or low variant allele frequencies, and that it can be used to reduce overall sequencing costs while maintaining high specificity and sensitivity in somatic variant calling.

Official Website

Publications

Institution(s)

CHU Sainte-Justine Research Center, Université de Montréal, Montreal, Canada; Department of Pediatrics, Faculty of Medicine, Université de Montréal, Montreal, Canada; Division of Hematology-Oncology, CHU Sainte-Justine Research Center, Montreal, Canada


Restriction Site Associated DNA Sequencing Data Analysis (RAD-seq Analysis)

RAD markers were first implemented using microarrays (Miller et al., 2007) and later adapted for NGS (Baird et al., 2008). RAD-seq can generate genetic markers at genome-wide density and has been used to study population differentiation and selection (Emerson et al., 2010; Hohenlohe et al., 2011).


Phylogenetic Inference

MEGA

An integrated tool for conducting sequence alignment, inferring phylogenetic trees, estimating divergence times, mining online databases, estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses.
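As a small illustration of what estimating an evolutionary distance from aligned sequences involves, the sketch below computes the proportion of differing sites (p-distance) and its Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which accounts for multiple substitutions at the same site. This is a generic textbook calculation with hypothetical function names, not MEGA's code, and MEGA offers many other substitution models.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


def jukes_cantor_distance(seq1, seq2):
    """Expected substitutions per site under the Jukes-Cantor (JC69) model."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


a = "ACGTACGTAC"
b = "ACGTACGTAA"
print(p_distance(a, b), jukes_cantor_distance(a, b))
```

The corrected distance is always at least as large as the raw p-distance, and the gap widens as sequences diverge.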

Official Website

Publications

Institution(s)

Institute for Genomics and Evolutionary Medicine, Temple University; Department of Biology, Temple University; Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia


GPSit

Facilitates phylogenomic analyses on microeukaryotes. GPSit is an automated method that is compatible with data from genome sequencing and transcriptome sequencing, including that from single cells. The software can contribute to the automated process and scalability of collection of extended DNA barcodes and specimen identification after genome skimming or single-cell sequencing. It is useful for molecular systematics and molecular ecological investigations.

Official Website

Publications

Institution(s)

Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao, China; Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China; Department of Life Sciences, Natural History Museum, London, UK; College of Marine Life Sciences, Ocean University of China, Qingdao, China


AnGST

Allows analysis of gene and species trees. AnGST is a phylogenomic method that compares individual gene phylogenies with the phylogeny of organisms. The tool works from the topology of the genealogy tree and can infer the direction of gene transfer in addition to gene duplication. It accounts for uncertainty in gene trees by incorporating reconciliation into the tree-building process.

Official Website

Documentation

Github

Publications

Institution(s)

Computational & Systems Biology Initiative, Massachusetts Institute of Technology, Cambridge, MA, USA; Departments of Biological Engineering & Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA; The Broad Institute, Cambridge, MA, USA


SNP/SNV Annotation

ANNOVAR

An efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others). Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day.

Official Website

Publications

Institution(s)

Center for Applied Genomics, Children’s Hospital of Philadelphia, Department of Biostatistics and Epidemiology and Department of Pediatrics, University of Pennsylvania, Philadelphia, PA, USA


wANNOVAR

ANNOVAR is a rapid, efficient tool to annotate functional consequences of genetic variation from high-throughput sequencing data. wANNOVAR provides easy, web-based access to the most popular functionalities of the ANNOVAR software through a simple and intuitive interface that helps users determine the functional significance of variants. These functionalities include annotating single nucleotide variants and insertions/deletions for their effects on genes, reporting their conservation levels (such as PhyloP and GERP++ scores), calculating their predicted functional importance scores (such as SIFT and PolyPhen scores), retrieving allele frequencies in public databases (such as the 1000 Genomes Project and NHLBI-ESP 5400 exomes), and implementing a ‘variants reduction’ protocol to identify a subset of potentially deleterious variants/genes.

[Official Website](http://wannovar.wglab.org/)

Publications

Institution(s)

Zilkha Neurogenetic Institute, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA


DNAscan

Offers a platform dedicated to DNA next-generation sequencing (NGS) data analysis, annotation and visualization. DNAscan can be run in various modes that adapt its performance to focus on a specific subregion or material. The application can detect a wide range of genetic variants, including single nucleotide variants (SNVs), repeat expansions and structural variants (SVs). It can be run through Docker and Singularity.

Official Website

Publications

Institution(s)

Department of Biostatistics and Health Informatics, King’s College London, London, UK; Maurice Wohl Clinical Neuroscience Institute, King’s College London, London, UK


Find more tools: OMICTOOLS
