Protein Sequence/Expression

Apr 11, 2018

Protein Sequence Analysis

Protein sequence analysis tools predict specific functions, activities, origin, or localization of proteins from their amino acid sequence. Software tools are also used to analyze high-throughput proteomics data obtained by mass spectrometry.

Amino Acid Repeat Prediction

RADAR

A tool to identify short, composition-biased and gapped approximate repeats, as well as complex repeat architectures involving many different types of repeats, in a query sequence. RADAR is useful because many large proteins have evolved by internal duplication and many internal sequence repeats correspond to functional and structural units.

Official Website

Documentation

Publications:

(Li et al., 2015) The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res.
(Heger and Holm, 2000) Rapid automatic detection and alignment of repeats in protein sequences. Proteins.

Institution(s):

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

REPRO

The program is able to recognize distant repeats in a single query sequence. The technique relies on a variation of the Smith-Waterman local alignment strategy to find non-overlapping top-scoring local alignments, followed by a graph-based iterative clustering procedure to delineate the repeat set(s) based on consistency of the pairwise top-alignments.
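The local alignment step that REPRO builds on can be illustrated with a minimal Smith-Waterman sketch. This is not REPRO's code: the function name and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen for illustration, only the single top-scoring alignment is returned, and the graph-based clustering stage is omitted.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the top local alignment.

    Toy scoring scheme (identity match/mismatch, linear gaps); real tools
    use substitution matrices and more elaborate gap penalties.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic-programming score matrix
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("ACDEFG", "XXCDEYY")
```

For repeat detection, both arguments would be segments of the same query sequence, and the search would be repeated to collect several non-overlapping alignments.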

Official Website

Documentation

Publications:

(George and Heringa, 2000) The REPRO server: finding protein internal sequence repeats through the Web. Trends Biochem Sci.
(Heringa and Argos, 1993) A method to recognize distant repeats in protein sequences. Proteins.

Institution(s):

European Molecular Biology Laboratory, Heidelberg, Germany

Subcellular Localization Prediction

TargetP

Predicts the subcellular location of eukaryotic proteins. TargetP is a web application that scores N-terminal presequences in a submitted protein and reports whether a chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP) is predicted. Its parameters let users choose between the Plant and Non-plant versions, set custom cutoffs and optionally predict cleavage sites.

Official Website

Documentation

Publications:

(Emanuelsson et al., 2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc.

Institution(s):

Stockholm Bioinformatics Center, Albanova, Stockholm University, Stockholm, Sweden; Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark

pRoloc

Provides machine learning and visualization methods for interrogating and analyzing quantitative mass spectrometry (MS) data to infer protein sub-cellular localization. pRoloc is an R package suited to spatial proteomics data analysis; it performs sub-cellular localization prediction from experimental and condition-specific MS-based quantitative proteomics data. The software allows classification of proteins to tens of sub-cellular compartments.

Official Website

Documentation

Publications:

(Breckels et al., 2016) Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics. PLoS Comput Biol.
(Crook et al., 2018) A Bayesian Mixture Modelling Approach For Spatial Proteomics. BioRxiv.

Institution(s):

Computational Proteomics Unit, Department of Biochemistry, University of Cambridge, Cambridge, UK

Protein Clustering

CD-HIT

A widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques has been developed to allow efficient clustering of such datasets.
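The idea of clustering sequences to reduce redundancy can be sketched with a toy greedy procedure: process sequences from longest to shortest and attach each one to the first cluster representative it resembles closely enough. This is an illustration only, not CD-HIT's implementation. The `identity` measure (ungapped, left-aligned comparison), the function names and the 0.9 threshold are all assumptions made for the example.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the shorter sequence.

    Sequences are compared left-aligned with no gaps, which is far cruder
    than the alignment-based identity used by real clustering tools.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / shorter


def greedy_cluster(seqs, threshold=0.9):
    """Map each representative sequence to the list of its cluster members."""
    clusters = {}  # insertion-ordered: representative -> members
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(rep, seq) >= threshold:
                clusters[rep].append(seq)  # redundant with an existing cluster
                break
        else:
            clusters[seq] = [seq]  # no match: seq starts a new cluster
    return clusters


clusters = greedy_cluster(["MKTAYIAKQR", "MKTAYIAKQ", "GGGGGGGG", "MKTAYIAKQR"])
representatives = list(clusters)  # the non-redundant set
```

Keeping only the representatives gives a smaller data set for downstream analyses, which is the use case described above.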

Official Website

Publications:

(Fu et al., 2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics.
(Li and Godzik, 2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics.

Institution(s):

Center for Research in Biological Systems, University of California San Diego, La Jolla, CA, USA

USEARCH

Provides search and clustering algorithms that can be orders of magnitude faster than BLAST. USEARCH is a sequence analysis package that combines several algorithms in a single program. It searches a database for top global hits and provides several NGS read processing features, such as dereplication, paired read overlapping, quality filtering, FASTQ file statistics and chimeric sequence filtering.
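Of the read processing features listed, dereplication is the simplest to illustrate: identical sequences are collapsed into one record that carries an abundance count. The sketch below shows the concept in plain Python and does not reproduce USEARCH's commands, options or output format; the function name and the tie-breaking order are assumptions.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads and return (sequence, count) pairs.

    Pairs are ordered by decreasing abundance; ties are broken
    alphabetically so the result is deterministic.
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


unique_reads = dereplicate(["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "CCCC"])
```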

Official Website

Publications:

(Edgar, 2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics.
(Edgar and Flyvbjerg, 2015) Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics.

Institution(s):

Department of Micro- and Nanotechnology, Technical University of Denmark, Lyngby, Denmark

MS-based (Mass Spectrometry) Untargeted Proteomics

A number of technologies can be used to study proteomes, but arguably none is more powerful than mass spectrometry. There are two fundamentally different MS-based strategies for analyzing proteomes: discovery-based identification and targeted quantification (Doerr, 2013; Picotti and Aebersold, 2012). With a discovery-based strategy, the goal is usually to identify as many proteins as possible. The goal of a targeted proteomics experiment is to monitor a select few proteins of interest with high sensitivity, reproducibility and quantitative accuracy. (Source: Wikipedia)

2D gel Image Analysis

digeR

Provides a graphical user interface (GUI) for the visualization of protein post-translational modification (PTM) changes between different biological states. digeR is an R package for analyzing potential protein PTM changes in 2D-DIGE (2D gel) studies using spot correlation. It can also support other ‘omic’ data analyses in a similar manner and help biologists look for panels of biomarkers that would improve disease diagnosis and prognosis.

Official Website

Publications:

(Fan et al., 2009) digeR: a graphical user interface R package for analyzing 2D-DIGE data. Bioinformatics.

Institution(s):

UCD School of Medicine and Medical Science, University College Dublin, Dublin, Ireland; UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland

Plasmo2D

Performs gel alignment and protein identification computationally. Plasmo2D assists in the identification of proteins in the P. falciparum 3D7 proteome from a 2-DE gel image. It allows users to overlay the markers of their gel image on the markers provided by the software using the image resize buttons. The tool can display a list of potential candidate proteins corresponding to a spot of interest.

Official Website

Publications:

(Khachane et al., 2005) “Plasmo2D”: an ancillary proteomic tool to aid identification of proteins from Plasmodium falciparum. J Proteome Res.

Institution(s):

Department of Biochemistry, Indian Institute of Science, Bangalore, India

In Silico Analysis

OpenMS

Allows users to manage and analyze liquid chromatography-mass spectrometry (LC-MS) data. OpenMS is a programming library and tool collection integrated into full-featured workflow systems, such as KNIME, Galaxy and WS-PGRADE, to facilitate bioinformatics research in the field of MS on all levels. The software provides pre-built and ready-to-use tools for analysis of both proteomics and non-targeted metabolomics data.

Official Website

Publications:

(Lange et al., 2007) A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics.
(Pfeuffer et al., 2017) OpenMS - A platform for reproducible analysis of mass spectrometry data. J Biotechnol.

Institution(s):

Applied Bioinformatics, Department for Computer Science, University of Tuebingen, Tuebingen, Germany; Center for Bioinformatics, University of Tuebingen, Tuebingen, Germany

multiplierz

Manages proteomic mass spectrometry workflows and data analysis. Multiplierz provides a toolset of multiple methods for peptide identification, quantitation and reporting, as well as tools for easily manipulating standard data formats. The software is a Python library that supports new reporting formats and offers high-level tools to perform post-search proteomic analyses. Its architecture integrates seamlessly with native data files via mzAPI.

Official Website

Documentation

Publications:

(Parikh et al., 2009) multiplierz: an extensible API based desktop environment for proteomics data analysis. BMC Bioinformatics.
(Alexander et al., 2017) multiplierz v2.0: A Python-based ecosystem for shared access and analysis of native mass spectrometry data. Proteomics.

Institution(s):

Department of Cancer Biology and Blais Proteomics Center, Dana-Farber Cancer Institute, Boston, MA, USA

Phosphorylation Site Assignment

Ascore

Computes a localization-specific probability for every phosphorylation site within a data set. Ascore allows users to detect peptides and reports the degree of certainty as well as the non-localized peptides. The software determines the proper phosphorylation site assignment by calculating the differences between candidate site placements at the level of the site-determining ions. It avoids the potential inconsistencies that can arise from manual validation.
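A probability-based site score of this kind can be sketched with a cumulative binomial model: the more site-determining ions a candidate placement matches beyond what chance would explain, the higher its score, and the final value is the difference between the two best candidates. The code below is a simplified illustration under that assumption and is not the Ascore software. The function names, the fixed random-match probability `p` and the example ion counts are hypothetical.

```python
from math import comb, log10


def binomial_tail(n_trials, n_matched, p):
    """Probability of matching at least n_matched of n_trials ions by chance."""
    return sum(comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
               for k in range(n_matched, n_trials + 1))


def site_score(n_trials, n_matched, p):
    """Convert the chance probability into a score (higher is more confident)."""
    return -10 * log10(binomial_tail(n_trials, n_matched, p))


def ascore_like(n_ions, matched_best, matched_second, p):
    """Score difference between the best and second-best site placements.

    n_ions counts the site-determining ions that distinguish the two
    placements; p is the assumed probability that a predicted ion matches
    a spectrum peak at random (an illustrative value is used below).
    """
    return site_score(n_ions, matched_best, p) - site_score(n_ions, matched_second, p)


# Hypothetical example: 4 site-determining ions, the best placement matches
# all 4 and the runner-up matches 2, with a 50% random-match probability.
delta = ascore_like(4, 4, 2, 0.5)
```

A large difference indicates that the spectrum supports one placement much better than the other; a difference near zero corresponds to a non-localized site.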

Official Website

Publications:

(Beausoleil et al., 2006) A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol.

Institution(s):

Department of Cell Biology, Harvard Medical School, Boston, MA, USA; Department of Genetics and Norris Cotton Cancer Center, Lebanon, NH, USA

Phosphonormalizer

Provides pairwise normalization of phosphoproteomics data. Phosphonormalizer allows analysis of large-scale experiments in various treatment conditions. It incorporates non-enriched data as a reference for normalizing the enriched data. The tool can also be applied to studies of post-translational modifications (PTMs) other than phosphorylation that are commonly studied using enrichment and that can also be detected in non-enriched samples with sensitive methods.

Official Website

Publications:

(Saraei et al., 2017) Phosphonormalizer: an R package for normalization of MS-based label-free phosphoproteomics. Bioinformatics.

Institution(s):

Turku Centre for Biotechnology, University of Turku and Åbo Akademi, Turku, Finland; Department of Future Technologies, University of Turku, Turku, Finland

PTM Identification

PIQED

Provides a complete, automated workflow for post-translational modification (PTM) identification, quantification, and statistical testing from exclusively data-independent acquisition-mass spectrometry (DIA-MS) data. PIQED is an open-source workflow that enables a two-fold reduction of acquisition time because both identification and quantification are achieved with a single DIA analysis. PTM-specific capabilities of this package include site localization scoring and filtering, peptide consolidation to modification site-level, and optional local or global total-ion chromatogram (TIC) normalization.
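The idea behind TIC normalization can be sketched as rescaling each run so that its summed intensity matches a common target. The example below is a generic illustration and does not reproduce PIQED's procedure; in particular, how PIQED distinguishes local from global normalization is not specified here. The function name, the data layout and the choice of the mean run total as the target are assumptions.

```python
def tic_normalize(runs):
    """Rescale intensities so that every run has the same total signal.

    runs maps a run name to a dict of {peptide: intensity}. Each run is
    scaled by (mean total across runs) / (its own total).
    """
    totals = {name: sum(values.values()) for name, values in runs.items()}
    if any(total <= 0 for total in totals.values()):
        raise ValueError("every run needs a positive total intensity")
    target = sum(totals.values()) / len(totals)
    return {
        name: {pep: inten * target / totals[name] for pep, inten in values.items()}
        for name, values in runs.items()
    }


# Hypothetical data: run "b" was loaded at twice the amount of run "a".
normalized = tic_normalize({
    "a": {"pep1": 10.0, "pep2": 30.0},
    "b": {"pep1": 20.0, "pep2": 60.0},
})
```

After scaling, the two runs report the same intensity for each peptide, so the loading difference no longer appears as a change in abundance.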

Official Website

Documentation

Publications:

(Meyer et al., 2017) DIA-Pipe: Identification and Quantification of Post-Translational Modifications using exclusively Data-Independent Acquisition. BioRxiv.
(Meyer et al., 2017) PIQED: automated identification and quantification of protein modifications from DIA-MS data. Nat Methods.

Institution(s):

Buck Institute for Research on Aging, Novato, CA, USA; Department of Pathology, Boston Children’s Hospital and Harvard Medical School, Boston, MA, USA

SUMmOn

An automated pattern recognition tool that detects diagnostic PTM fragment ion series within complex MS/MS spectra, to identify modified peptides and modification sites within these peptides. The results of the SUMmOn analysis are stored in an XML instance document. A dynamically generated XSLT style sheet is then used to create an HTML file that is formatted via CSS (cascading style sheet), and viewed in a web-browser.

Official Website

Publications:

(Pedrioli et al., 2006) Automated identification of SUMOylation sites using mass spectrometry and SUMmOn pattern recognition software. Nat Methods.

Institution(s):

Institute for Systems Biology, Seattle, WA, USA

Glycosylation Identification

SugarQb

Enables genome-wide insights into protein glycosylation and glycan modifications in complex biological systems. SugarQb identifies intact glycopeptides, and maps and quantifies changes in protein glycosylation at a proteome scale. The method can assist users in investigating the glycoproteome of mouse embryonic stem cells.

Official Website

Documentation

Publications:

(Stadlmann et al., 2017) Comparative glycoproteomics of stem cells identifies new players in ricin toxicity. Nature.

Institution(s):

IMBA, Institute of Molecular Biotechnology of the Austrian Academy of Sciences, Vienna, Austria; Institute of Molecular Pathology (IMP), Vienna, Austria

MAGIC

Identifies intact N-glycosylated peptides from a public protein database without requiring any prior information about proteins or glycans. MAGIC supports untargeted glycopeptide analysis, while the newly implemented MAGIC+ performs targeted glycopeptide analysis and allows users to upload their own protein sequence file to find glycopeptides in their data. Search results from Mascot can be integrated with the results from MAGIC-web via Reports Integrator to generate a complete protein/peptide-glycan summary report. Independent of these three modules (MAGIC, MAGIC+ and Reports Integrator), a fourth module, Glycan Search, allows users to find various glycans in a large glycan database stored on the web server, regardless of the type of glycosylation. MAGIC-web has a user-friendly visualization interface for easy data uploading, processing and result interpretation. MAGIC-web is free and open to all users, with no login requirement.

Official Website

Publications:

(Lih et al., 2016) MAGIC-web: a platform for untargeted and targeted N-linked glycoprotein identification. Nucleic Acids Res.
(Lynn et al., 2015) MAGIC: an automated N-linked glycoprotein identification tool using a Y1-ion pattern matching algorithm and in silico MS² approach. Anal Chem.

Institution(s):

Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan; Institute of Information Science, Academia Sinica, Taipei, Taiwan

MS-based Metaproteomics

Unipept

An open-source web application designed for metaproteomics analysis, with a focus on interactive data visualization. Unipept is underpinned by a fast index built from UniProtKB and the NCBI taxonomy that enables quick retrieval of all UniProt entries in which a given tryptic peptide occurs. Unipept also provides programmatic access to its metaproteomics analysis features, which enables integration of Unipept functionality in custom applications and data processing pipelines.
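The peptide-to-entry lookup described here can be illustrated with a small in-memory index: digest each protein in silico and record which entries contain each tryptic peptide. This is a toy sketch, not Unipept's index or API. The cleavage rule (cut after K or R unless the next residue is P), the minimum peptide length, the function names and the accessions are assumptions made for the example.

```python
from collections import defaultdict


def tryptic_digest(seq, min_len=5):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R
    return [p for p in peptides if len(p) >= min_len]


def build_index(proteins, min_len=5):
    """Map each tryptic peptide to the set of entries that contain it."""
    index = defaultdict(set)
    for accession, seq in proteins.items():
        for peptide in tryptic_digest(seq, min_len):
            index[peptide].add(accession)
    return index


# Hypothetical accessions and sequences, for illustration only.
index = build_index({
    "ENTRY_1": "MAGICKPEPTIDERLLLLLK",
    "ENTRY_2": "GGGGGRLLLLLK",
})
entries_with_peptide = index["LLLLLK"]
```

In a metaproteomics setting, the set of entries returned for a peptide is what allows the peptide to be assigned to a taxon shared by all of them.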

Official Website

Publications:

(Mesuere et al., 2015) The Unipept metaproteomics analysis pipeline. Proteomics.
(Mesuere et al., 2016) Unipept web services for metaproteomics analysis. Bioinformatics.

Institution(s):

Department of Applied Mathematics, Computer Science and Statistics, Faculty of Sciences, Ghent University, Ghent, Belgium

MPA

An intuitive open-source tool for metaproteomics data analysis and interpretation, which includes multiple search engines and a feature that decreases data redundancy by grouping protein hits into so-called meta-proteins. The functionality of the MetaProteomeAnalyzer is demonstrated using a sample of a microbial community taken from a biogas plant.
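Grouping protein hits into meta-proteins can be sketched as finding connected groups of proteins under some similarity rule. The example below assumes one simple rule, that two hits belong together if they share at least one identified peptide, and joins them transitively with a union-find structure. It is an illustration of the idea, not the MetaProteomeAnalyzer's implementation; the rule, the function name and the sample data are assumptions.

```python
def group_meta_proteins(protein_peptides):
    """Group proteins that are linked by shared peptides.

    protein_peptides maps a protein identifier to its set of identified
    peptides. Returns a sorted list of sorted groups.
    """
    parent = {protein: protein for protein in protein_peptides}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # peptide -> first protein seen with that peptide
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            if peptide in first_owner:
                # Shared peptide: merge the two proteins' groups.
                parent[find(protein)] = find(first_owner[peptide])
            else:
                first_owner[peptide] = protein

    groups = {}
    for protein in protein_peptides:
        groups.setdefault(find(protein), []).append(protein)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical hits: A and B share pep2, B and C share pep3, D is unrelated.
meta_proteins = group_meta_proteins({
    "A": {"pep1", "pep2"},
    "B": {"pep2", "pep3"},
    "C": {"pep3"},
    "D": {"pep9"},
})
```

Here four protein hits collapse into two groups, which is the kind of redundancy reduction the description refers to.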

Official Website

Documentation

Publications:

(Muth et al., 2015) The MetaProteomeAnalyzer: a powerful open-source software suite for metaproteomics data analysis and interpretation. J Proteome Res.
(Muth et al., 2017) MPA Portable: a Stand-alone Software Package for Analyzing Metaproteome Samples on the Go. Anal Chem.

Institution(s):

Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany

Antibody Array Analysis

PANDA

A web-based software program developed at Emory University for analyzing phosphorylation antibody arrays.

Official Website

Publications:

(Zhong et al., 2008) LKB1 is necessary for Akt-mediated phosphorylation of proapoptotic proteins. Cancer Res.

Institution(s):

The Winship Cancer Institute, Emory University School of Medicine, Atlanta, Georgia, USA

Find more tools: OMICTOOLS
