Difference between revisions of "Research"

From RiversWiki
(Alignment-free genome sequence analysis)
Line 48: Line 48:
  
 
Synonymous codons do not occur at equal frequencies. Codon usage and codon bias have been extensively studied. However, the sequential order in which synonymous codons appear within a gene has not been studied until now. Here we describe an ''in silico'' method, which is the first attempt to tackle this problem: to what extent this sequential order is unique, and to what extent the succession of synonymous codons is important. This method, which we called Intragenic, Stochastic Synonymous Codon Occurrence Replacement (ISSCOR), generates, by a Monte Carlo approach, a set of genes which code for the same amino acid sequence, and display the same codon usage, but have random permutations of the synonymous codons, and therefore different sequential codon orders from the original gene. We analyze the complete genome of the bacterium ''Helicobacter pylori'' (containing 1574 protein coding genes), and show by various, alignment-free computational methods (e.g., frequency distribution of codon-pairs, as well as that of nucleotide bigrams in codon-pairs), that: (i) not only the succession of adjacent synonymous codons is far from random, but also, which is totally unexpected, the occurrences of non-adjacent synonymous codon-pairs are highly constrained, at strikingly long distances of dozens of nucleotides; (ii) the statistical deviations from the random synonymous codon order are overwhelming; and (iii) the pattern of nucleotide bigrams in codon-pairs can be used in a novel way for characterizing and comparing genes and genomes. Our results demonstrate that the sequential order of synonymous codons within a gene must be under a strong selective pressure, which is superimposed on the classical codon usage. This new dimension can be measured by the ISSCOR method, which is simple, robust, and should be useful for comparative and functional genomics.  
 
 +
<!--[[Genetic:isscor | '''More..''']]-->
 +
</blockquote>
 +
 +
==== Correlating sequential and antigenic information ====
 +
 +
[[Image:Antigenic distances 1.jpg |frameless|300px|left]]
 +
ISSCOR analyses and visualizations of influenza virus hemagglutinin genes from three different A-subtypes revealed striking temporal relationships (for A/H3N3) and spatial relationships (for A/H5N1) between groups of individual gene subsets. Applied to the A/H1N1 set, the method also quickly and unambiguously revealed relationships between the seasonal H1 and the novel swine-like 2009 H1v variants. Based on these examples, we consider the ISSCOR method a worthwhile addition to the genomics toolbox for analyzing large sets of homologous genes: it allows rapid diagnosis of trends, and may even aid early warning of newly emerging epidemiological threats.
 +
Antibodies against hemagglutinin (HA) provide protective immunity to influenza virus infection, so HA is the primary component of influenza vaccines. Because the antigenic structure of HA changes significantly over time, the vaccine has to be updated to ensure adequate efficacy against emerging viral variants. The WHO network of influenza centers routinely characterizes the antigenic properties of influenza viruses using inhibition assays; these, combined with sequence data on variability in the antigenic HA-1 domain of HA, are necessary to select strains for the seasonal vaccines. Smith ''et al''. used antigenic data from 35 years of influenza surveillance between 1968 and 2003. The resulting dataset consisted of a table of 79 post-infection ferret antisera by 273 viral isolates, with 4215 individual hemagglutination inhibition (HI) measurements. From it they constructed a 2D antigenic map, a projection of the high-dimensional antigenic data space, to determine the antigenic evolution of the influenza A/H3N2 virus.
 +
[[Image:Antigenic distances 2.jpg |frameless|300px|right]]
 +
The accuracy of their predictions suggested that the map could serve as a target for describing antigenic relationships on the basis of the ISSCOR descriptors. Partial least squares regression (PLS-regression) is a technique for finding relationships between two data sets (X and Y), using a latent variable (LV) approach to model the covariance that may be present between these two spaces. Rather than finding hyperplanes of maximum variance between the response and independent variables, as PCA-regression does, it finds a linear model by projecting the predicted variables together with the observables onto a newly constructed space. In this way it tries to uncover the multidimensional direction in the X space that explains the direction of maximum multidimensional variance in the Y space. Therefore, a PLS-regression of the ISSCOR descriptors of the antigenic cluster centers on the 2D map of Smith’s antigenic clusters was performed; the results are shown in the figure above. The model with six LVs was found to be optimal (RMS = 0.12); a model with only five LVs did not achieve sufficiently small prediction errors. Table II lists fifty of the most strongly contributing ISSCOR variables used by each of the six LVs. The bottom of each column gives the explained variance for the ISSCOR descriptors [X-matrix] and for the antigenic map’s cluster centers [Y-matrix]. Notably, the two major LVs alone suffice to explain 95% of the variance in Y space, whereas all six LVs are needed to explain the corresponding variance in X: the 5th and 6th LVs contribute almost equally strongly. The same PLS model was then used to project the positions of H3N2 hemagglutinin strains isolated during each of the respective years (figure on the right). 
Of particular interest is the rather wide spread of the 2005 sequences, some of which apparently reverse the general trend that was observed earlier and subsequently continued by the majority of strains isolated in the other 2003-2006 yearly seasonal clusters.
 +
 
<!--[[Genetic:isscor | '''More..''']]-->
 
<!--[[Genetic:isscor | '''More..''']]-->
 
</blockquote>
 
</blockquote>

Revision as of 14:15, 28 March 2013

Rivers



EPIDEMIC SIMULATIONS

A virtual Polish society

Gosp bw cap.png



A virtual society exists only in computer memory. In our model, however, it represents and reproduces the real Polish society. The virtual society consists of individual (distinguishable) agents, each assigned to a specific geo-referenced household. Depending on their age, household inhabitants are either retired, employed, or attending school (kindergarten, primary school, secondary school, college, university). All these basic relations should be reflected in the virtual model. The virtual society was built from accessible data. We developed a set of methods that depend on the data available and on the particular infrastructure element to be incorporated into the virtual society. More...


Agent-based model of infection spread within a small population of guinea pigs, dependent on the temperature and humidity of the surroundings

Flu spread model zuk rakowski radomski Fig2b.gif


The influence that atmospheric conditions may have on the efficiency of influenza virus spread is important for epidemiological and evolutionary research, but so far it has not been satisfactorily recognized or quantified. Here we provide a statistical model of influenza transmission between individuals, derived from the results of recent experiments in which guinea pigs were infected with influenza at various temperatures and relative air humidity levels. The wide range of transmission rates in those experiments reflects the ensemble-independent phenomena. The correlation between most of our simulations and the experimental results is satisfactory. For several different conditions, we obtained transmissibility values that seem sufficiently accurate to provide partial input for a planned large-scale epidemiological study. More...

GENETIC STUDIES

Alignment-free genome sequence analysis

Occurence probability.gif

Synonymous codons do not occur at equal frequencies. Codon usage and codon bias have been extensively studied. However, the sequential order in which synonymous codons appear within a gene has not been studied until now. Here we describe an in silico method, which is the first attempt to tackle this problem: to what extent this sequential order is unique, and to what extent the succession of synonymous codons is important. This method, which we called Intragenic, Stochastic Synonymous Codon Occurrence Replacement (ISSCOR), generates, by a Monte Carlo approach, a set of genes which code for the same amino acid sequence, and display the same codon usage, but have random permutations of the synonymous codons, and therefore different sequential codon orders from the original gene. We analyze the complete genome of the bacterium Helicobacter pylori (containing 1574 protein coding genes), and show by various, alignment-free computational methods (e.g., frequency distribution of codon-pairs, as well as that of nucleotide bigrams in codon-pairs), that: (i) not only the succession of adjacent synonymous codons is far from random, but also, which is totally unexpected, the occurrences of non-adjacent synonymous codon-pairs are highly constrained, at strikingly long distances of dozens of nucleotides; (ii) the statistical deviations from the random synonymous codon order are overwhelming; and (iii) the pattern of nucleotide bigrams in codon-pairs can be used in a novel way for characterizing and comparing genes and genomes. Our results demonstrate that the sequential order of synonymous codons within a gene must be under a strong selective pressure, which is superimposed on the classical codon usage. This new dimension can be measured by the ISSCOR method, which is simple, robust, and should be useful for comparative and functional genomics.
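The shuffling step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it only shows one way to permute codons among positions that encode the same amino acid, so that the protein sequence and the gene's codon usage are preserved while the order of synonymous codons is randomized. It assumes the standard genetic code; the function names (shuffle_synonymous, codon_pair_counts), the toy gene and the seed are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(gene):
    """Split an in-frame coding sequence into a list of codons."""
    if len(gene) % 3:
        raise ValueError("gene length must be a multiple of 3")
    return [gene[i:i + 3] for i in range(0, len(gene), 3)]


def translate(gene):
    """Translate a coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[c] for c in split_codons(gene))


def shuffle_synonymous(gene, rng):
    """Return a gene with the same protein and the same codon usage,
    but with synonymous codons randomly permuted among their positions."""
    codons = split_codons(gene)
    positions = defaultdict(list)  # amino acid -> codon positions
    for i, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(i)
    shuffled = list(codons)
    for idx in positions.values():
        pool = [codons[i] for i in idx]
        rng.shuffle(pool)  # permute only within one amino acid class
        for i, codon in zip(idx, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def codon_pair_counts(gene, distance=1):
    """Count codon pairs separated by `distance` codons (1 = adjacent)."""
    codons = split_codons(gene)
    return Counter(zip(codons, codons[distance:]))


rng = random.Random(2013)                    # fixed seed, for a repeatable demo
gene = "ATGCTTCTGAAAGCTTTAAAGGCCCTCGCATAA"   # hypothetical toy gene
for _ in range(5):
    replicate = shuffle_synonymous(gene, rng)
    print(replicate, translate(replicate) == translate(gene))
```

Comparing codon_pair_counts for the native gene with the distribution obtained from many shuffled replicates is one possible way to quantify deviations from a random synonymous codon order; the exact statistics used by ISSCOR are not specified in this summary.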

Correlating sequential and antigenic information

Antigenic distances 1.jpg

ISSCOR analyses and visualizations of influenza virus hemagglutinin genes from three different A-subtypes revealed striking temporal relationships (for A/H3N3) and spatial relationships (for A/H5N1) between groups of individual gene subsets. Applied to the A/H1N1 set, the method also quickly and unambiguously revealed relationships between the seasonal H1 and the novel swine-like 2009 H1v variants. Based on these examples, we consider the ISSCOR method a worthwhile addition to the genomics toolbox for analyzing large sets of homologous genes: it allows rapid diagnosis of trends, and may even aid early warning of newly emerging epidemiological threats. Antibodies against hemagglutinin (HA) provide protective immunity to influenza virus infection, so HA is the primary component of influenza vaccines. Because the antigenic structure of HA changes significantly over time, the vaccine has to be updated to ensure adequate efficacy against emerging viral variants. The WHO network of influenza centers routinely characterizes the antigenic properties of influenza viruses using inhibition assays; these, combined with sequence data on variability in the antigenic HA-1 domain of HA, are necessary to select strains for the seasonal vaccines. Smith et al. used antigenic data from 35 years of influenza surveillance between 1968 and 2003. The resulting dataset consisted of a table of 79 post-infection ferret antisera by 273 viral isolates, with 4215 individual hemagglutination inhibition (HI) measurements. From it they constructed a 2D antigenic map, a projection of the high-dimensional antigenic data space, to determine the antigenic evolution of the influenza A/H3N2 virus.

Antigenic distances 2.jpg

The accuracy of their predictions suggested that the map could serve as a target for describing antigenic relationships on the basis of the ISSCOR descriptors. Partial least squares regression (PLS-regression) is a technique for finding relationships between two data sets (X and Y), using a latent variable (LV) approach to model the covariance that may be present between these two spaces. Rather than finding hyperplanes of maximum variance between the response and independent variables, as PCA-regression does, it finds a linear model by projecting the predicted variables together with the observables onto a newly constructed space. In this way it tries to uncover the multidimensional direction in the X space that explains the direction of maximum multidimensional variance in the Y space. Therefore, a PLS-regression of the ISSCOR descriptors of the antigenic cluster centers on the 2D map of Smith’s antigenic clusters was performed; the results are shown in the figure above. The model with six LVs was found to be optimal (RMS = 0.12); a model with only five LVs did not achieve sufficiently small prediction errors. Table II lists fifty of the most strongly contributing ISSCOR variables used by each of the six LVs. The bottom of each column gives the explained variance for the ISSCOR descriptors [X-matrix] and for the antigenic map’s cluster centers [Y-matrix]. Notably, the two major LVs alone suffice to explain 95% of the variance in Y space, whereas all six LVs are needed to explain the corresponding variance in X: the 5th and 6th LVs contribute almost equally strongly. The same PLS model was then used to project the positions of H3N2 hemagglutinin strains isolated during each of the respective years (figure on the right). 
Of particular interest is the rather wide spread of the 2005 sequences, some of which apparently reverse the general trend that was observed earlier and subsequently continued by the majority of strains isolated in the other 2003-2006 yearly seasonal clusters.
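The latent-variable decomposition described above can be sketched in plain Python. The code below is a hedged illustration, not the analysis reported here: it uses the NIPALS iteration, which is one common way to compute PLS components, and the text does not state which algorithm or software was actually used. It extracts LVs one at a time and reports the cumulative fraction of variance explained in X and in Y after each LV, analogous to the values given at the bottom of each column of Table II. The function name pls_explained_variance and the small toy matrices are assumptions made for the example; they are not the ISSCOR descriptors or the antigenic map coordinates.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def column_center(M):
    """Subtract the column mean from every column of a row-major matrix."""
    means = [sum(col) / len(M) for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]


def sum_sq(M):
    return sum(v * v for row in M for v in row)


def mat_vec(M, v):
    """Compute M v."""
    return [dot(row, v) for row in M]


def mat_t_vec(M, v):
    """Compute M^T v."""
    return [dot(col, v) for col in zip(*M)]


def pls_explained_variance(X, Y, n_lv, tol=1e-12, max_iter=500):
    """NIPALS PLS: cumulative (X, Y) explained variance after each LV."""
    X, Y = column_center(X), column_center(Y)
    ssx, ssy = sum_sq(X), sum_sq(Y)
    explained = []
    for _lv in range(n_lv):
        # Start from the Y column that currently has the largest variance.
        u = list(max(zip(*Y), key=lambda col: dot(col, col)))
        t_old = None
        for _it in range(max_iter):
            w = mat_t_vec(X, u)
            norm = dot(w, w) ** 0.5
            if norm == 0.0:              # nothing left to extract
                return explained
            w = [v / norm for v in w]    # X weights
            t = mat_vec(X, w)            # X scores
            tt = dot(t, t)
            c = [v / tt for v in mat_t_vec(Y, t)]      # Y loadings
            u = [v / dot(c, c) for v in mat_vec(Y, c)]  # Y scores
            if t_old is not None and max(
                    abs(a - b) for a, b in zip(t, t_old)) < tol:
                break
            t_old = t
        p = [v / tt for v in mat_t_vec(X, t)]          # X loadings
        # Deflate both blocks by the part explained by this LV.
        X = [[x - ti * pj for x, pj in zip(row, p)] for row, ti in zip(X, t)]
        Y = [[y - ti * cj for y, cj in zip(row, c)] for row, ti in zip(Y, t)]
        explained.append((1 - sum_sq(X) / ssx, 1 - sum_sq(Y) / ssy))
    return explained


# Toy data: 6 samples, 3 descriptors (X) and 2 map coordinates (Y).
X_toy = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 0.0],
         [4.0, 3.0, 2.5], [5.0, 6.0, 1.0], [6.0, 5.0, 3.0]]
Y_toy = [[r[0] + r[1], r[1] - 2 * r[2]] for r in X_toy]
for k, (ex, ey) in enumerate(pls_explained_variance(X_toy, Y_toy, 3), 1):
    print(f"LV {k}: X explained {ex:.3f}, Y explained {ey:.3f}")
```

Because each LV is chosen to capture covariance with Y, the Y variance can approach its maximum after only a few LVs while more LVs are still needed to account for X, which is the pattern noted above for the six-LV model.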

</blockquote>