Alignment-free genome sequence analysis
Synonymous codons do not occur at equal frequencies. Codon usage and codon bias have been extensively studied. However, the sequential order in which synonymous codons appear within a gene has not been studied until now. Here we describe an in silico method, which is the first attempt to tackle this problem: to what extent this sequential order is unique, and to what extent the succession of synonymous codons is important. This method, which we called Intragenic, Stochastic Synonymous Codon Occurrence Replacement (ISSCOR), generates, by a Monte Carlo approach, a set of genes which code for the same amino acid sequence, and display the same codon usage, but have random permutations of the synonymous codons, and therefore different sequential codon orders from the original gene. We analyze the complete genome of the bacterium Helicobacter pylori (containing 1574 protein coding genes), and show by various, alignment-free computational methods (e.g., frequency distribution of codon-pairs, as well as that of nucleotide bigrams in codon-pairs), that: (i) not only the succession of adjacent synonymous codons is far from random, but also, which is totally unexpected, the occurrences of non-adjacent synonymous codon-pairs are highly constrained, at strikingly long distances of dozens of nucleotides; (ii) the statistical deviations from the random synonymous codon order are overwhelming; and (iii) the pattern of nucleotide bigrams in codon-pairs can be used in a novel way for characterizing and comparing genes and genomes. Our results demonstrate that the sequential order of synonymous codons within a gene must be under a strong selective pressure, which is superimposed on the classical codon usage. This new dimension can be measured by the ISSCOR method, which is simple, robust, and should be useful for comparative and functional genomics.
Correlating sequential and antigenic information
Analyses and visualizations by the ISSCOR method of influenza virus hemagglutinin genes of three different A-subtypes revealed some rather striking temporal (for A/H3N3), and spatial relationships (for A/H5N1) between groups of individual gene subsets. The application to the A/H1N1 set revealed also relationships between the seasonal H1, and the swine-like novel 2009 H1v variants in a quick and unambiguous manner. Based on these examples we consider application of the ISSCOR method for analysis of large sets of homologous genes as a worthwhile addition to a toolbox of genomics – it allows a rapid diagnostics of trends, and possibly can even aid an early warning of newly emerging epidemiological threats. Antibodies against hemagglutinin provide protective immunity to influenza virus infection, and the HA is therefore the primary component of influenza vaccines, and as the antigenic structure of HA changes significantly over time, vaccine has to be updated to ensure adequate efficacy against emerging viral variants. The WHO network of influenza centers routinely characterizes the antigenic properties of influenza viruses using inhibition assays, which combined with sequential data of variability in the antigenic HA-1 domain of the HA, are necessary to select strains for use in the seasonal vaccines. Smith et al. used antigenic data from 35 years of influenza surveillance between 1968 and 2003, with the resulting antigenic dataset consisting of a table of 79 post-infection ferret antisera by 273 viral isolates, and 4215 individual HA inhibition (HI) measurements, and then constructed an antigenic 2D map, to determine the antigenic evolution of influenza A/H3N2 virus, using projection from the obtained high-dimensional antigenic data hyperspace.
The accuracy of the predictions has shown that their map might serve as a possible target of an attempt to describe antigenic relationships on a basis of the ISSCOR descriptors. Partial least squares regression (PLS-regression) is a technique used to find relationships between two data sets (X and Y), utilizing a latent variable (LV) approach to modeling the covariance – possibly present in these two spaces. Rather than finding hyperplanes of maximum variance between the response and independent variables, like is the case in the PCA-regression, it finds a linear model by projecting the predicted variables together with the observables to a newly constructed space. In this way trying to uncover the multidimensional direction in the X space, that explains the maximum multidimensional variance direction in the Y space. Therefore, the PLS-regression of the antigenic cluster centers’ ISSCOR descriptors, on the 2D map of the Smith’s antigenic clusters, was performed, and the results are shown on the Figure above. The model utilizing six LVs was found to be optimal (RMS = 0.12), considering that the regression model obtained with only five LVs was not sufficient to achieve prediction errors small enough. Table II lists fifty of the most contributing ISSCOR variables used by each of the six LVs. At the bottom of each column there are values of explained variance for the ISSCOR descriptors [X-matrix], and the antigenic map’s cluster centers [Y-matrix]. It is noteworthy that only two major LV would already suffice to explain 95% of variance in Y space but at the same time all six LV are necessary to explain the respective variance in X – the 5th and the 6th LV are both contributing almost equally strong. The same PLS model was then used to project positions of H3N2 hemagglutinin strains isolated during each of the respective years (Fig. on the right). Of quite an interest is rather wide spread of the year’s 2005 sequences, some of which are apparently reversing the general trend observed earlier, and continued subsequently by the majority of strains isolated in the other 2003-2006 yearly seasonal clusters.
Time and spatial evolution of the Influenza-A hemagglutinin genes
Two sets of the H3N2 influenza virus hemagglutinin (HA) sequences were collected – the set A comprising of 2217 full length HA gene sequences (1701 nucleotides each); and the set B of 1810 sequences of the HA-1 antigenic region of the H3N2 hemagglutinin gene (987 nucleotides each). The compositions of the sets A and B were almost entirely different as only two virus isolates were the same in both. The ISSCOR descriptors were calculated for these two sets of sequences, using. codon spacer values from 0 to 16, and creating two matrices (MA and MB) of 2217 and 1810 rows respectively, and 2448 columns each. It was observed that the G+C content in H3N2 hemagglutinin sequences of the set A decreased linearly over the period of 40 years, by approx. 5.93% – which corresponds to the changes in the relative ratios of all four nucleotides, although by different amount for each. The linear regression parameters of the equation Nucleotide_count = Pn * year + bn were PA = 0.9608 and bA = -1.3513 * 103; PC = -0.2751 and bC = 0.8903 * 103; PG = -0.8387 and bG = 2.0607 * 103; PT = 0.1463 and bT = 0.1148 * 103 – respectively for the A, C, G, and T nucleotides. This observation is consistent with the findings of Rabadan et al., who examined ratios of A, C, G and T in the PB1, PB2 and PA segments of genomes H1N1 and H5N1. For periods of up to 90 years, they found systematic and significant changes in GàA and CàT ratios in the both H1 and H5 cases, corresponding to hosts from which viral isolates were obtained – this was also usually associated with striking differences between sequence isolates of avian and of human origin.
Results of principal component analysis (PCA) for the matrix MA, are depicted on Figure at the left (the bottom panel). There is a clear timeline trend of the sequences from the years 1968-1970, which are located at the upper left corner, going down towards the minimum at about the years 2001-2002, and then up again to reach the years 2007-2008 at the end of the right arm of the curve. Therefore, to examine this trend more synthetically, the yearly clusters were considered separately for each year between 1968 and 2008. For each yearly cluster their centers, as well as the corresponding standard deviations, were calculated, and are shown. The horizontal and vertical bars are proportional in sizes to the respective standard deviations of PC-1 and PC-2 values. Noteworthy, there are three regions of rather increased variability: first – around the year 1968, then the years 2000-2003, and finally the year 2008, with some less diverse periods in between as well. It was found earlier, based on extensive analyses of influenza HA sequences of the H1 and H3 subtypes, that the evolution of H3N2 hemagglutinin included long intervals of mostly neutral sequence evolution without noticeable antigenic change showing an excess of synonymous over nonsynonymous substitutions, punctuated by shorter intervals of rapid evolution during which newly dominant lineages quickly displaced previously coexisting ones. The stasis intervals showed rather uniform distribution of replacements over the whole HA sequence’s length, not favoring epitope regions. The analogous PCA results for the matrix MB, showing 1810 data points, are depicted on the top panel, marking also members of the antigenic clusters, described earlier as forming WHO vaccine clusters. Of special interest is the location of the A/Fujian/411/2002 sequence on this plot (the red diamond at about PC1=5 and PC2=0) as this is the oldest of the six strains constituting the Fu02 cluster, and yet all remaining sequences of this cluster (each isolated in 2003) have their PC2 values below it. Holmes et al. concluded that one clade of H3N2 viruses present at least since 2000 had provided the hemagglutinin gene for all the H3N2 viruses sampled after the 2002–2003 influenza season, and that a reassortment event was the likely progenitor of the antigenicaly variant influenza strains that caused the A/Fujian/411/2002 epidemic of the 2003–2004 season. It is possible that a significant factor of such adaptation involved optimizing the functional compatibility of reassorting segments. Such a phenomenon might be a possible explanation why the lineage leading to the FU02 antigenic type did not dominate the viral population until a few years after its initial appearance, coincidental with an HA reassortment event.
Another set C, comprising of 613 orthologous genes of the H5N1 avian flu virus’ hemagglutinin (HA), of 1707 nucleotides each – isolated from avian hosts during the 1996-2009 period – was acquired from NCBI in January 2010. All sequences were unique – in cases when two (or several) sequences were identical, only the earliest one was included. The ISSCOR descriptors were calculated for this set of sequences, using. codon spacer values from 0 to 16. The Figure on the right shows the PCA scatter plot of PC1 vs. PC2 for this set, divided for twelve territorial regions, characterized by the most numerous populations of isolates. Their plots are presented as separate panels, there are also separate panels showing remaining strains obtained from Africa, Asia and Europe. Already the analogous results for the A/H1N1 have shown some territorial clustering in several regions. In case of H5N1 the groupings are remarkably well defined and in many instances also rather well separated between themselves. This is especially evident for strains isolated in several Far Eastern, and South-East Asian countries: China, Vietnam, Hong Kong, Indonesia, and Thailand. On the other hand, isolates from Europe co-cluster together independently of their country of origin, which is also true for African and Middle-East countries – all of which form the large gathering at around PC1 = -3, and PC2 = 0 position. In roughly the same location there are also points for several Asian countries (all of which are different from the already mentioned five): Afghanistan, Pakistan, Bangladesh, India, Mongolia, Kazakhstan, and Japan. There are also two exceptions from Asia (well separated blue triangles in the panel "Asia") – one sequence from Japan (A/duck/Yokohama/aq10/2003, AB212280), and one from Myanmar (A/chicken/Pyigyitagon/204/2006, AB474081), both of them collocated together within the bottom cluster from China. The former H5N1 strain was isolated from duck meat processed for human consumption, imported to Japan from Shandong Province, China in 2003. That virus was antigenically different from other H5 viruses, including the Hong Kong H5N1 viruses isolated from humans in 1997 and 2003. The latter strain was isolated during the 1-st outbreak of bird flu in Myanmar in March 2006, and it was subsequently found to belong to the clade-7 (the WHO H5N1 evolution nomenclature) of the highly pathogenic avian influenza (HPAI). There is also one, rather unexpected case of the isolate from a Belgian crested eagle (A/crested_eagle/Belgium/01/2004, DQ182483; the green square, shown here for clarity at the edge of the top cluster in the panel "Vietnam"). On a closer examination this highly pathogenic H5N1 strain turned out to have been isolated from smuggled eagles, confiscated at the national airport, after an attempt to illegally enter them into Belgium from Thailand by a traveler. Phylogeography of H5N1 viruses was actively researched recently in the regions of Southeast Asia and Far East which were repeatedly subjected to outbreaks of bird flu in poultry, esp. in Vietnam, Thailand, and China, with conclusions rather similar ours, concerning the yearly origins and seasonal spread of the A/H3N2 viruses. Wallace et al. studied the geographic diffusion of H5N1 following the migration paths of the virus, by a way of a genetic phylogeography of H5N1’s HA and NA sequences, and shown that the Chinese province of Guangdong was the source of multiple H5N1 strains spreading at both regional and international scales. In contrast, Indochina appeared to be a regional sink, at the same time demonstrating bidirectional dispersal among localities within the region. They have shown further that the virus migration was filtered out at some international borders, like between China and Vietnam or Thailand, but not at e.g. the one between China and Japan. Their results were recently reanalyzed by Hovmoeller et al. who alleged that using a single tree, and a single optimization path, miss-estimates the frequency of transmission events, and moreover that the use of a single tree can fail to detect possible transmission events.
Virtual society exists only in the computer memory. However, in our model, it represents and reproduces the real, Polish, society. Virtual society consists of individual (distinguishable) agents, each assigned to certain, geo-referenced, household. Further, household inhabitants, depending on their age should be either retired, employed or going to school (kindergarten, primary school, secondary school, college, university). All these basic relations should be reflected in the virtual model. This virtual society was based on accessible data. Depending on the data available and on the particular infrastructure element that was to be incorporated into the virtual society, we developed a set of methods. More...
Agent Based Model of the infection spread within a small population of Guinea pigs, dependent on temperature and humidity conditions of the surroundings
The influence that atmospheric conditions might have on the efficiency of the spread of influenza virus is important for epidemiological and evolutionary research. However, it has not been satisfactorily recognized and quantified so far. Here we provide a statistical model of influenza transmission between individuals. It has been derived from the results of recent experiments, which involved infecting guinea pigs with influenza at various temperatures and relative air humidity levels. The wide range of transmission rates in those experiments reflects the ensemble-independent phenomena. The correlation between most of our simulations and the experimental results is satisfactory. For several different conditions, we obtained transmissibility values which seem to be sufficiently accurate to provide partial input for an intended large-scale epidemiological study in the near future. More..