Alignment-free genome sequence analysis by the ISSCOR method
Synonymous codons do not occur at equal frequencies. Codon usage and codon bias have been extensively studied. However, the sequential order in which synonymous codons appear within a gene has not been studied until now. Here we describe an in silico method, which is the first attempt to tackle this problem: to what extent this sequential order is unique, and to what extent the succession of synonymous codons is important. This method, which we called Intragenic, Stochastic Synonymous Codon Occurrence Replacement (ISSCOR), generates, by a Monte Carlo approach, a set of genes which code for the same amino acid sequence, and display the same codon usage, but have random permutations of the synonymous codons, and therefore different sequential codon orders from the original gene. We analyze the complete genome of the bacterium Helicobacter pylori (containing 1574 protein coding genes), and show by various, alignment-free computational methods (e.g., frequency distribution of codon-pairs, as well as that of nucleotide bigrams in codon-pairs), that: (i) not only the succession of adjacent synonymous codons is far from random, but also, which is totally unexpected, the occurrences of non-adjacent synonymous codon-pairs are highly constrained, at strikingly long distances of dozens of nucleotides; (ii) the statistical deviations from the random synonymous codon order are overwhelming; and (iii) the pattern of nucleotide bigrams in codon-pairs can be used in a novel way for characterizing and comparing genes and genomes. Our results demonstrate that the sequential order of synonymous codons within a gene must be under a strong selective pressure, which is superimposed on the classical codon usage. This new dimension can be measured by the ISSCOR method, which is simple, robust, and should be useful for comparative and functional genomics.
Correlating sequential and antigenic information
Analyses and visualizations by the ISSCOR method of influenza virus hemagglutinin genes of three different A-subtypes revealed some rather striking temporal (for A/H3N3), and spatial relationships (for A/H5N1) between groups of individual gene subsets. The application to the A/H1N1 set revealed also relationships between the seasonal H1, and the swine-like novel 2009 H1v variants in a quick and unambiguous manner. Based on these examples we consider application of the ISSCOR method for analysis of large sets of homologous genes as a worthwhile addition to a toolbox of genomics – it allows a rapid diagnostics of trends, and possibly can even aid an early warning of newly emerging epidemiological threats. Antibodies against hemagglutinin provide protective immunity to influenza virus infection, and the HA is therefore the primary component of influenza vaccines, and as the antigenic structure of HA changes significantly over time, vaccine has to be updated to ensure adequate efficacy against emerging viral variants. The WHO network of influenza centers routinely characterizes the antigenic properties of influenza viruses using inhibition assays, which combined with sequential data of variability in the antigenic HA-1 domain of the HA, are necessary to select strains for use in the seasonal vaccines. Smith et al. used antigenic data from 35 years of influenza surveillance between 1968 and 2003, with the resulting antigenic dataset consisting of a table of 79 post-infection ferret antisera by 273 viral isolates, and 4215 individual HA inhibition (HI) measurements, and then constructed an antigenic 2D map, to determine the antigenic evolution of influenza A/H3N2 virus, using projection from the obtained high-dimensional antigenic data hyperspace.
The accuracy of the predictions has shown that their map might serve as a possible target of an attempt to describe antigenic relationships on a basis of the ISSCOR descriptors. Partial least squares regression (PLS-regression) is a technique used to find relationships between two data sets (X and Y), utilizing a latent variable (LV) approach to modeling the covariance – possibly present in these two spaces. Rather than finding hyperplanes of maximum variance between the response and independent variables, like is the case in the PCA-regression, it finds a linear model by projecting the predicted variables together with the observables to a newly constructed space. In this way trying to uncover the multidimensional direction in the X space, that explains the maximum multidimensional variance direction in the Y space. Therefore, the PLS-regression of the antigenic cluster centers’ ISSCOR descriptors, on the 2D map of the Smith’s antigenic clusters, was performed, and the results are shown on the Figure above. The model utilizing six LVs was found to be optimal (RMS = 0.12), considering that the regression model obtained with only five LVs was not sufficient to achieve prediction errors small enough. Table II lists fifty of the most contributing ISSCOR variables used by each of the six LVs. At the bottom of each column there are values of explained variance for the ISSCOR descriptors [X-matrix], and the antigenic map’s cluster centers [Y-matrix]. It is noteworthy that only two major LV would already suffice to explain 95% of variance in Y space but at the same time all six LV are necessary to explain the respective variance in X – the 5th and the 6th LV are both contributing almost equally strong. The same PLS model was then used to project positions of H3N2 hemagglutinin strains isolated during each of the respective years (Fig. on the right). Of quite an interest is rather wide spread of the year’s 2005 sequences, some of which are apparently reversing the general trend observed earlier, and continued subsequently by the majority of strains isolated in the other 2003-2006 yearly seasonal clusters.
Temporal and spatial evolution of the Influenza-A hemagglutinin genes
Two sets of the H3N2 influenza virus hemagglutinin (HA) sequences were collected – the set A comprising of 2217 full length HA gene sequences (1701 nucleotides each); and the set B of 1810 sequences of the HA-1 antigenic region of the H3N2 hemagglutinin gene (987 nucleotides each). The compositions of the sets A and B were almost entirely different as only two virus isolates were the same in both. The ISSCOR descriptors were calculated for these two sets of sequences, using. codon spacer values from 0 to 16, and creating two matrices (MA and MB) of 2217 and 1810 rows respectively, and 2448 columns each. It was observed that the G+C content in H3N2 hemagglutinin sequences of the set A decreased linearly over the period of 40 years, by approx. 5.93% – which corresponds to the changes in the relative ratios of all four nucleotides, although by different amount for each. The linear regression parameters of the equation Nucleotide_count = Pn * year + bn were PA = 0.9608 and bA = -1.3513 * 103; PC = -0.2751 and bC = 0.8903 * 103; PG = -0.8387 and bG = 2.0607 * 103; PT = 0.1463 and bT = 0.1148 * 103 – respectively for the A, C, G, and T nucleotides. This observation is consistent with the findings of Rabadan et al., who examined ratios of A, C, G and T in the PB1, PB2 and PA segments of genomes H1N1 and H5N1. For periods of up to 90 years, they found systematic and significant changes in GàA and CàT ratios in the both H1 and H5 cases, corresponding to hosts from which viral isolates were obtained – this was also usually associated with striking differences between sequence isolates of avian and of human origin.
Results of principal component analysis (PCA) for the matrix MA, are depicted on Figure at the left (the bottom panel). There is a clear timeline trend of the sequences from the years 1968-1970, which are located at the upper left corner, going down towards the minimum at about the years 2001-2002, and then up again to reach the years 2007-2008 at the end of the right arm of the curve. Therefore, to examine this trend more synthetically, the yearly clusters were considered separately for each year between 1968 and 2008. For each yearly cluster their centers, as well as the corresponding standard deviations, were calculated, and are shown. The horizontal and vertical bars are proportional in sizes to the respective standard deviations of PC-1 and PC-2 values. Noteworthy, there are three regions of rather increased variability: first – around the year 1968, then the years 2000-2003, and finally the year 2008, with some less diverse periods in between as well. It was found earlier, based on extensive analyses of influenza HA sequences of the H1 and H3 subtypes, that the evolution of H3N2 hemagglutinin included long intervals of mostly neutral sequence evolution without noticeable antigenic change showing an excess of synonymous over nonsynonymous substitutions, punctuated by shorter intervals of rapid evolution during which newly dominant lineages quickly displaced previously coexisting ones. The stasis intervals showed rather uniform distribution of replacements over the whole HA sequence’s length, not favoring epitope regions. The analogous PCA results for the matrix MB, showing 1810 data points, are depicted on the top panel, marking also members of the antigenic clusters, described earlier as forming WHO vaccine clusters. Of special interest is the location of the A/Fujian/411/2002 sequence on this plot (the red diamond at about PC1=5 and PC2=0) as this is the oldest of the six strains constituting the Fu02 cluster, and yet all remaining sequences of this cluster (each isolated in 2003) have their PC2 values below it. Holmes et al. concluded that one clade of H3N2 viruses present at least since 2000 had provided the hemagglutinin gene for all the H3N2 viruses sampled after the 2002–2003 influenza season, and that a reassortment event was the likely progenitor of the antigenicaly variant influenza strains that caused the A/Fujian/411/2002 epidemic of the 2003–2004 season. It is possible that a significant factor of such adaptation involved optimizing the functional compatibility of reassorting segments. Such a phenomenon might be a possible explanation why the lineage leading to the FU02 antigenic type did not dominate the viral population until a few years after its initial appearance, coincidental with an HA reassortment event.
Another set C, comprising of 613 orthologous genes of the H5N1 avian flu virus’ hemagglutinin (HA), of 1707 nucleotides each – isolated from avian hosts during the 1996-2009 period – was acquired from NCBI in January 2010. All sequences were unique – in cases when two (or several) sequences were identical, only the earliest one was included. The ISSCOR descriptors were calculated for this set of sequences, using. codon spacer values from 0 to 16. The Figure on the right shows the PCA scatter plot of PC1 vs. PC2 for this set, divided for twelve territorial regions, characterized by the most numerous populations of isolates. Their plots are presented as separate panels, there are also separate panels showing remaining strains obtained from Africa, Asia and Europe. Already the analogous results for the A/H1N1 have shown some territorial clustering in several regions. In case of H5N1 the groupings are remarkably well defined and in many instances also rather well separated between themselves. This is especially evident for strains isolated in several Far Eastern, and South-East Asian countries: China, Vietnam, Hong Kong, Indonesia, and Thailand. On the other hand, isolates from Europe co-cluster together independently of their country of origin, which is also true for African and Middle-East countries – all of which form the large gathering at around PC1 = -3, and PC2 = 0 position. In roughly the same location there are also points for several Asian countries (all of which are different from the already mentioned five): Afghanistan, Pakistan, Bangladesh, India, Mongolia, Kazakhstan, and Japan. There are also two exceptions from Asia (well separated blue triangles in the panel "Asia") – one sequence from Japan (A/duck/Yokohama/aq10/2003, AB212280), and one from Myanmar (A/chicken/Pyigyitagon/204/2006, AB474081), both of them collocated together within the bottom cluster from China. The former H5N1 strain was isolated from duck meat processed for human consumption, imported to Japan from Shandong Province, China in 2003. That virus was antigenically different from other H5 viruses, including the Hong Kong H5N1 viruses isolated from humans in 1997 and 2003. The latter strain was isolated during the 1-st outbreak of bird flu in Myanmar in March 2006, and it was subsequently found to belong to the clade-7 (the WHO H5N1 evolution nomenclature) of the highly pathogenic avian influenza (HPAI). There is also one, rather unexpected case of the isolate from a Belgian crested eagle (A/crested_eagle/Belgium/01/2004, DQ182483; the green square, shown here for clarity at the edge of the top cluster in the panel "Vietnam"). On a closer examination this highly pathogenic H5N1 strain turned out to have been isolated from smuggled eagles, confiscated at the national airport, after an attempt to illegally enter them into Belgium from Thailand by a traveler. Phylogeography of H5N1 viruses was actively researched recently in the regions of Southeast Asia and Far East which were repeatedly subjected to outbreaks of bird flu in poultry, esp. in Vietnam, Thailand, and China, with conclusions rather similar ours, concerning the yearly origins and seasonal spread of the A/H3N2 viruses. Wallace et al. studied the geographic diffusion of H5N1 following the migration paths of the virus, by a way of a genetic phylogeography of H5N1’s HA and NA sequences, and shown that the Chinese province of Guangdong was the source of multiple H5N1 strains spreading at both regional and international scales. In contrast, Indochina appeared to be a regional sink, at the same time demonstrating bidirectional dispersal among localities within the region. They have shown further that the virus migration was filtered out at some international borders, like between China and Vietnam or Thailand, but not at e.g. the one between China and Japan. Their results were recently reanalyzed by Hovmoeller et al. who alleged that using a single tree, and a single optimization path, miss-estimates the frequency of transmission events, and moreover that the use of a single tree can fail to detect possible transmission events.
Improving the readability and an in silico automatic interpretation of Neighbor Joining phylogenetic reconstructions
One of major drawbacks of traditional methods the phylogenetic trees reconstruction is that all molecular sequences are considered as leaves of the tree. Nodes in a phylogenetic tree can be divided in two types: [a] internal nodes, and [b] leaves. Internal nodes are sequences, which are distinct but have at least one offspring. In particular, an offspring can be an assemblage of internal nodes and/or leaves. Nodes with no offspring are ought to be presented as leaves, unfortunately, in traditionally build phylogenetic trees all sequences are treated as leaves, no matter if they have any offspring or not. Such trees treat internal nodes as auxiliary nodes to hold other nodes, or as information about potential ancestor sequence, which is not present in the analyzed set (Figure a on the left). This could pose a potential trap in designing algorithm analyzing evolutionary pathways, as one would expect ancestors to be closer to root than descendants (in the number of the edges between them). Only sequences on terminal branches (leaves), and the root are observed. All internal branches of the tree are not observed – if necessary, such sequences have to be estimated by the reconstruction (Ren et al.). Therefore, a possible other role of internal nodes in traditionally build trees might be to represent potential ancestor sequence[s]. This situation can happen only when all adjacent edges’ weights are greater than an unit distance – a necessary condition, as in any other edge weights’ configuration such an internal node will be just an ordinary auxiliary node. The ambiguity of leaves’ character, and internal nodes’ meaning in traditional phylogenetic trees contributes to difficulties in efficient analysis of evolutionary pathways in sillico. To alleviate this problem was the goal of the present study. The next section describes an algorithmic recipe, which considers character of all nodes in a traditional tree, and then attributes them with an appropriate role in a translated evolutionary tree (Figure b). In a schematic phylogenetic tree shown (panel a) typical reconstructed phylogenetic tree, which contains internal nodes without labels – written in Newick notationa as: ((A:d1, B:d2):d0, C:0); and (panel b) phylogenetic tree with all nodes labeled obtained from the tree in (a), written in Newick notation as: ((A:d0+d1, B:d0+d2)C:0); in both panels C is the ancestor of A and B.
Ever increasing amounts of genetic information stored in sequential databases require efficient methods to automatically reveal their phylogenetic relationships. We have implemented a framework for in silico unambiguous analysis of phylogenetic trees, based on information contained in tree’s topology, together with its branches length. The resulting, translated tree has all nodes labeled, with no constraints on nodes’ degree, and the subsequent finding of evolutionary pathways from the QPF-translated tree is robust and straightforward. Main features of the method are: small demands on computational time, and the ability to analyze phylogenies obtained prior to the proposed QPF analysis by any traditional tree-building technique. In the QPF approach edge’s weights distribution plays a crucial role as it is used to infer family character of nodes. In a perfect phylogeny’s tree all edge weights should be multiples of a unitary distance, so the pathway length between the nodes strictly represents distance between them. In a perfect phylogeny there are no two identical mutations at the same position. However, when analyzing a real sequence set such situation is not uncommon – the same mutations can occur at the same positions. Therefore, the edges’ weights distribution will be disturbed, and not all weights would be multiples of unitary distance. Such disturbed distribution depends on few factors. A bigger chance for a noisy tree occurs when: (i) the analyzed set consist a large number of sequences; (ii) sequences chain’s lengths are small; (iii) the number of characters which code each position in the chain is small; (iv) the mutations tend to be not distributed randomly throughout the chain (and also throughout a time domain – mostly due to biases in the time of samples isolation, and the numbers of isolated sequences). The proposed QPF algorithm translates a traditional phylogenetic tree, to a tree with all nodes labeled as to their phylogenetic character – in the resulting tree there are no auxiliary nodes, nor there are any nodes’ graph-theoretical degree constraints. Translated tree forms an adequate data structure, optimized for a quick evolutionary pathways finding. Therefore, although a resulting, translated tree might lose some information, as a side effect – should there be in particular any missing ancestors (not present in an analyzed sequences set), it is not necessarily a drawback, as for evolutionary pathways analysis only available sequences can be used anyway. The QPF is a robust, novel technique for an unambiguous labeling and analysis of phylogenetic trees generated by any traditional method of user’s choice. The assumption that the threshold value h is optimal when is set to the ½-unit distance, has confirmed the role of such threshold as a decision classifier distinguishing nodes of traditional phylogenetic tree into two classes – with a parent’s, or a child’s character. From a practical viewpoint, the proper choice of the h threshold value provide well over 95% accuracy, even for larger trees, despite a noise always present in traditionally generated trees' reconstructions.
In another extension of the classic NJ algorithm (which we've called NJ+), also enabling a correct handling of node assignments independently of node’s degree, the emphasis was given to a problem of correct handling sequences which might be missing in the analyzed set (the problem of missing ancestors). Again the NJ+ method reconstructs phylogenetic tree, by assigning the sequences with offspring as internal nodes, and the sequences without offspring as leaf nodes. In the resulting tree there are no constraints for the number of adjacent nodes, which means that the solution tree needs not to be a binary graph only. The subsequent derivation of evolutionary pathways, and pair-wise mutations, are then an algorithmically straightforward, with edge’s length corresponding directly to the number of mutations. The performance of the NJ+ method was examined on some synthetically generated artificial data, and two real data sets (Tryponosoma cruzi, and influenza 2009 pandemic virus hemagglutinin of H1N1). The performance tests on artificial and real data sets confirmed that the NJ+ method reconstructs trees with higher accuracy than the classic NJ algorithm. Obtaining evolutionary pathways and pair-wise mutations from the tree is straightforward, to facilitate further phylogenetic analyses. Additionally, other tree reconstruction algorithms can be extended in the proposed manner, to also give unbiased topologies.
In order to illustrate the problem of inadequacy of existing phylogenetic trees to cope with ancestor sequences producing multiple (that is not binary) offspring, a small fragment of just few branches (67 sequences total) was culled from a large tree for the 2009 pandemic H1N1 virus hemagglutinin. The input tree, built from 3243 unique sequences of 1701 nucleotides each, contained many places where it was clear that respective mother sequences had multiple offspring (in the most striking case 41 unique daughter sequences). Each offspring sequence was distinct, and differed by just a SNP from its ancestor – therefore there were no ambiguity as to their respective ancestry. The trees, reconstructed for this small set of 67 sequences, by the NJ-classic algorithm, and by our modified NJ+ approach are presented on the Figure at the left (panel A: NJ-classic cladogram; panel B: NJ-classic phylogram; panel C: NJ+ cladogram; panel D: NJ+ phylogram). In the NJ-classic tree there were 133 nodes, and in the NJ+ tree – 67 nodes (every NJ+ node was labeled). The drawback of binary constraint of the NJ-classic is well observed in NJ-classic cladogram (panel A). It can not be as clearly observed in the NJ-classic phylogram (panel B) as distances between the respective auxiliary internal nodes are zero. However, because of binary limitations in the NJ-classic tree there are several edges distancing each mother and daughter pairs of nodes. In contract, the NJ+ reconstructions provide that there is always a spacer of exactly one edge between each daughter and mother sequences.
A virtual Polish society
The individual agent-based models (IABMs) are devised to describe explicitly the interactions among individuals and their environment, thus providing a process-oriented alternative to strictly analytical mathematical models. The construction of such models requires, however, a generation of the system of the, so called, software agents, as well as the geo-referenced network of public objects (like schools, factories, offices, hospitals, transportation systems, etc.), as well as with the selected elements of infrastructure. Due to high dimensionality of this virtual society (multiple agents operating on multiple contexts), the evolution of such system is typically performed using the stochastic Monte-Carlo methods. Moreover, sometimes even the preliminary creation of such complex system, mostly because of the scarcity of the real data, calls for the usage of stochastic methods. The issues involved in the construction of virtual society stem mostly from a need to recreate details of the information contained, usually buried in aggregated data as delivered by national census bureaus, and/or other such services. There is a need, then, to devise techniques allowing for either deconvolution of various sources of the aggregated data into their constituting components, or reconstruction of original data, by building geo-referenced software constructs, which can serve as a reasonable approximation of the missing data with detailed enough granulation.
As an example of application of the AB system constructed here, the simple analysis of the school accessibility in Malopolskie voivodeship was performed, namely, the results obtained within the model described were collated with available statistical data. Although the quantitative calibration was not possible, qualitative correlation turned out to be noticeable. Another use of the model was to simplify analysis of the number and locations of rescue service units (RSUs) in local communes. Although it certainly requires a significant development, the model opens the way to a country-wide, immediate analysis of the effectiveness of possible location of RSUs. Based on the model system, a selected panel of statistical tests was used to verify and validate the assumed input criteria and boundary conditions, such as e.g. the accessibility of schools or the location of rescue service units.
A particular emphasis was given to the contact patterns arising from daily commuting to schools or workplaces. The platform served as a base for further large scale epidemiological and transportation simulation studies. However, an application scope of the platform extends beyond the simulations of epidemic or transportation, and pertains to any situation where there are no easily available means, other than computer simulations, to conduct large scale investigations of complex population dynamics. While a virtual society exists only in the computer's memory, it reproduced the real Polish society rather well: consisting of individual (distinguishable) agents used to mimic the real society of 38 millions of citizens, each assigned to a certain, geo-referenced, household. Moreover, household inhabitants, depending on their age were either retired, employed or going to school (kindergarten, primary school, secondary school, college, university). All these basic relations were reflected in the virtual model as this virtual society was based on accessible census data. (more: http://jasss.soc.surrey.ac.uk/13/1/13.html)
Influenza epidemic spread simulation for Poland - a very large scale, individual based model study
The individual agent-based model (IABM) described above was used for studying the effects of influenza epidemic in large scale (38 million individuals) stochastic simulations, together with the resulting various scenarios of disease spread in Poland. Simple transportation rules were employed to mimic individuals' travels in dynamic route-changing schemes, allowing for the infection spread during their journeys.
Parameter space was checked for stable behavior, especially towards the effective infection transmission rate variability. Following the assumptions employed in this work, a two-dimensional scan of the alpha and f parameters was performed in order to determine feasible scenarios of epidemic evolution. It should be noted, that the same R0 values may be obtained with different sets of alpha and f. In our research the R0 value varies from 1.44 to 2.42, which is a range comparable to other ranges studied in this field. In principle, R0 values are linearly related to alpha for each of the f values (R0 monotonically increases with alpha when other parameters, here f, are fixed), and therefore, the properties of the model system may be analyzed using alpha classification for each of the f ratios used. A series of possible epidemic courses for alpha ranging from 0.1 to 0.2 and for f taking values of 0.2, 0.4, 0.6, 0.8 and 1.0 were obtained by carrying out the simulations. In this parameter range we have encountered several similar R0 values which correspond to different epidemic courses. One can see that the definition of R0 does not fully account for the dynamically changing structure of the contact network or for the dynamic threshold of the "saturation" of the contexts with already infected or recovered agents. Therefore, the dynamics of the epidemic outbreak depends on the population, structure-type and infectiousness of the contexts. A rather atypical epidemic simulation run scenario for f=0.4 is depicted here as an example - atypical as for other f values epidemics run their courses much quicker (ranging from 100 to 250 days on average).
Although the reported model was based on quite simple assumptions, it allowed to observe two different types of epidemic scenarios: characteristic for urban and rural areas. This differentiates it from the results obtained in the analogous studies for the UK or US, where settlement and daily commuting patterns are both substantially different and more diverse. The resulting epidemic scenarios from these IABM simulations were compared with simple, differential equations based, SIR models: both types of the results displaying strong similarities. The pDYN software platform developed here was used in the next stage of the project, employed to study various epidemic mitigation strategies.
Agent Based Model of the infection spread within a small population of Guinea pigs, dependent on temperature and humidity conditions of the surroundings
The influence that atmospheric conditions might have on the efficiency of the spread of influenza virus is important for epidemiological and evolutionary research. However, it has not been satisfactorily recognized and quantified so far. Here we provide a statistical model of influenza transmission between individuals. It has been derived from the results of recent experiments, which involved infecting guinea pigs with influenza at various temperatures and relative air humidity levels. The wide range of transmission rates in those experiments reflects the ensemble-independent phenomena. The correlation between most of our simulations and the experimental results is satisfactory. For several different conditions, we obtained transmissibility values which seem to be sufficiently accurate to provide partial input for an intended large-scale epidemiological study in the near future.