Surajit Ray | Publications

10 most cited publications

Evaluation of MHC class I peptide binding prediction servers: Applications for vaccine research
Lin H.H., Ray S., Tongchusak S., Reinherz E.L., and Brusic V. BMC Immunology. 9

Abstract

Background: Protein antigens and their specific epitopes are formulation targets for epitope-based vaccines. A number of prediction servers are available for identification of peptides that bind major histocompatibility complex class I (MHC-I) molecules. The lack of standardized methodology and large number of human MHC-I molecules make the selection of appropriate prediction servers difficult. This study reports a comparative evaluation of thirty prediction servers for seven human MHC-I molecules. Results: Of 147 individual predictors 39 have shown excellent, 47 good, 33 marginal, and 28 poor ability to classify binders from non-binders. The classifiers for HLA-A*0201, A*0301, A*1101, B*0702, B*0801, and B*1501 have excellent, and for A*2402 moderate classification accuracy. Sixteen prediction servers predict peptide binding affinity to MHC-I molecules with high accuracy; correlation coefficients ranging from r = 0.55 (B*0801) to r = 0.87 (A*0201). Conclusion: Non-linear predictors outperform matrix-based predictors. Most predictors can be improved by non-linear transformations of their raw prediction scores. The best predictors of peptide binding are also best in prediction of T-cell epitopes. We propose a new standard for MHC-I binding prediction - a common scale for normalization of prediction scores, applicable to both experimental and predicted data. The results of this study provide assistance to researchers in selection of most adequate prediction tools and selection criteria that suit the needs of their projects. © 2008 Lin et al; licensee BioMed Central Ltd.

A nonparametric statistical approach to clustering via mode identification
Li J., Ray S., and Lindsay B.G. Journal of Machine Learning Research. 8

Abstract

A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/~jiali/hmac.
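
The mode-ascent idea in this abstract can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian kernel density estimate with a single fixed bandwidth, in which case the EM-style update reduces to a kernel-weighted mean of the sample. The function names `modal_em` and `mode_clusters`, the tolerances, and the merging rule are hypothetical choices for illustration; this is not the authors' C package.

```python
import math


def modal_em(x0, data, bandwidth, tol=1e-8, max_iter=500):
    """Ascend from x0 to a local mode of a 1-D Gaussian kernel density estimate."""
    x = x0
    for _ in range(max_iter):
        # E-step: unnormalised posterior weight of each kernel at the current point.
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data]
        total = sum(weights)
        # M-step: with equal-bandwidth Gaussian kernels the maximiser is the
        # weighted mean of the kernel centres.
        x_new = sum(w * xi for w, xi in zip(weights, data)) / total
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


def mode_clusters(data, bandwidth, merge_tol=1e-3):
    """Group sample points by the mode they ascend to (merge_tol is an assumption)."""
    modes, labels = [], []
    for xi in data:
        m = modal_em(xi, data, bandwidth)
        for k, known in enumerate(modes):
            if abs(m - known) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


if __name__ == "__main__":
    sample = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    print(mode_clusters(sample, bandwidth=0.3))
```

Rerunning `mode_clusters` with a larger bandwidth merges nearby bumps, which is the intuition behind the hierarchical extension the abstract describes.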

The topography of multivariate normal mixtures
Ray S. and Lindsay B.G. Annals of Statistics. 33 (5)

Abstract

Multivariate normal mixtures provide a flexible method of fitting high-dimensional data. It is shown that their topography, in the sense of their key features as a density, can be analyzed rigorously in lower dimensions by use of a ridgeline manifold that contains all critical points, as well as the ridges of the density. A plot of the elevations on the ridgeline shows the key features of the mixed density. In addition, by use of the ridgeline, we uncover a function that determines the number of modes of the mixed density when there are two components being mixed. A followup analysis then gives a curvature function that can be used to prove a set of modality theorems. © Institute of Mathematical Statistics, 2005.
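
As a rough illustration of the elevation plot, the sketch below traces a two-component ridgeline in two dimensions and counts peaks of the mixture density along it. The abstract does not state the ridgeline formula; the parameterisation used here, x*(α) = [(1−α)Σ1⁻¹ + αΣ2⁻¹]⁻¹[(1−α)Σ1⁻¹μ1 + αΣ2⁻¹μ2] for α in [0, 1], is an assumption recalled from the paper, and all function names, the grid size, and the grid-based peak count are hypothetical.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def normal_pdf2(x, mu, cov):
    """Bivariate normal density at x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx = [x[0] - mu[0], x[1] - mu[1]]
    y = matvec(inv2(cov), dx)
    q = dx[0] * y[0] + dx[1] * y[1]
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def ridgeline_point(alpha, mu1, cov1, mu2, cov2):
    """Point on the assumed two-component ridgeline for alpha in [0, 1]."""
    p1, p2 = inv2(cov1), inv2(cov2)
    a = [[(1 - alpha) * p1[i][j] + alpha * p2[i][j] for j in range(2)] for i in range(2)]
    b1, b2 = matvec(p1, mu1), matvec(p2, mu2)
    b = [(1 - alpha) * b1[i] + alpha * b2[i] for i in range(2)]
    return matvec(inv2(a), b)


def elevation_profile(pi1, mu1, cov1, mu2, cov2, steps=200):
    """Mixture density evaluated on a grid of ridgeline points."""
    profile = []
    for k in range(steps + 1):
        alpha = k / steps
        x = ridgeline_point(alpha, mu1, cov1, mu2, cov2)
        h = pi1 * normal_pdf2(x, mu1, cov1) + (1 - pi1) * normal_pdf2(x, mu2, cov2)
        profile.append((alpha, h))
    return profile


def count_peaks(profile):
    """Count strict local maxima on the grid; endpoints count as candidates."""
    h = [value for _, value in profile]
    peaks = 0
    for k in range(len(h)):
        left = h[k - 1] if k > 0 else float("-inf")
        right = h[k + 1] if k < len(h) - 1 else float("-inf")
        if h[k] > left and h[k] > right:
            peaks += 1
    return peaks
```

Because the ridgeline is a one-dimensional curve, the peak count on this profile is a grid-level proxy for the number of modes of the full mixture; it is not a substitute for the modality theorems the abstract refers to.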

SARS-CoV-2 Omicron is an immune escape variant with an altered cell entry pathway
Willett B.J., Grove J., MacLean O.A., Wilkie C., De Lorenzo G., Furnon W., Cantoni D., Scott S., Logan N., Ashraf S., Manali M., Szemiel A., Cowton V., Vink E., Harvey W.T., Davis C., Asamaphan P., Smollett K., Tong L., Orton R., Hughes J., Holland P., Silva V., Pascall D.J., Puxty K., da Silva Filipe A., Yebra G., Shaaban S., Holden M.T.G., Pinto R.M., Gunson R., Templeton K., Murcia P.R., Patel A.H., Klenerman P., Dunachie S., Dunachie S., …, R.M., Moll R.J., McCarthy S.A., Lensing S.V., Leonard S., Farr B.W., Scott C., Beaver C., Ariani C.V., Weldon D., Jackson D.K., Betteridge E., Tonkin-Hill G., Johnston I., Martincorena I., Bonfield J., Barrett J.C., Sillitoe J., Keatley J.-P., Oliver K., James K., Shirley L., Prestwood L., Foulser L., Gourtovaia M., Dorman M.J., Quail M.A., Spencer Chapman M.H., Park N.R., Livett R., Amato R., Kay S., Goodwin S., Thurston S.A.J., Rajatileka S., Gonçalves S., Lo S., Sanderson T., Maclean A., Goldstein E.J., Ferguson L., Tomb R., Catalan J., Jones N., Haughney J., Robertson D.L., Palmarini M., Ray S., Thomson E.C., PITCH Consortium, and The COVID-19 Genomics UK (COG-UK) Consortium. Nature Microbiology. 7 (8)

Abstract

Vaccines based on the spike protein of SARS-CoV-2 are a cornerstone of the public health response to COVID-19. The emergence of hypermutated, increasingly transmissible variants of concern (VOCs) threatens this strategy. Omicron (B.1.1.529), the fifth VOC to be described, harbours multiple amino acid mutations in spike, half of which lie within the receptor-binding domain. Here we demonstrate substantial evasion of neutralization by Omicron BA.1 and BA.2 variants in vitro using sera from individuals vaccinated with ChAdOx1, BNT162b2 and mRNA-1273. These data were mirrored by a substantial reduction in real-world vaccine effectiveness that was partially restored by booster vaccination. The Omicron variants BA.1 and BA.2 did not induce cell syncytia in vitro and favoured a TMPRSS2-independent endosomal entry pathway, these phenotypes mapping to distinct regions of the spike protein. Impaired cell fusion was determined by the receptor-binding domain, while endosomal entry mapped to the S2 domain. Such marked changes in antigenicity and replicative biology may underlie the rapid global spread and altered pathogenicity of the Omicron variant. © 2022, The Author(s).

BIC and Alternative Bayesian Information Criteria in the Selection of Structural Equation Models
Bollen K.A., Harden J.J., Ray S., and Zavisca J. Structural Equation Modeling. 21 (1)

Abstract

Selecting between competing structural equation models is a common problem. Often selection is based on the chi-square test statistic or other fit indices. In other areas of statistical research Bayesian information criteria are commonly used, but they are less frequently used with structural equation models compared to other fit indices. This article examines several new and old information criteria (IC) that approximate Bayes factors. We compare these IC measures to common fit indices in a simulation that includes the true and false models. In moderate to large samples, the IC measures outperform the fit indices. In a second simulation we only consider the IC measures and do not include the true model. In moderate to large samples the IC measures favor approximate models that only differ from the true model by having extra parameters. Overall, SPBIC, a new IC measure, performs well relative to the other IC measures. © 2014 Copyright Taylor and Francis Group, LLC.

Use of Machine Learning and Artificial Intelligence to predict SARS-CoV-2 infection from Full Blood Counts in a population
Banerjee A., Ray S., Vorselaars B., Kitson J., Mamalakis M., Weeks S., Baker M., and Mackenzie L.S. International Immunopharmacology. 86

Abstract

Since December 2019 the novel coronavirus SARS-CoV-2 has been identified as the cause of the pandemic COVID-19. Early symptoms overlap with other common conditions such as common cold and Influenza, making early screening and diagnosis crucial goals for health practitioners. The aim of the study was to use machine learning (ML), an artificial neural network (ANN) and a simple statistical test to identify SARS-CoV-2 positive patients from full blood counts without knowledge of symptoms or history of the individuals. The dataset included in the analysis and training contains anonymized full blood count results from patients seen at the Hospital Israelita Albert Einstein, at São Paulo, Brazil, and who had samples collected to perform the SARS-CoV-2 rt-PCR test during a visit to the hospital. Patient data were anonymised by the hospital, and clinical data were standardized to have a mean of zero and a unit standard deviation. These data were made public with the aim to allow researchers to develop ways to enable the hospital to rapidly predict and potentially identify SARS-CoV-2 positive patients. We find that with full blood counts random forest, shallow learning and a flexible ANN model predict SARS-CoV-2 patients with high accuracy between populations on regular wards (AUC = 94–95%) and those not admitted to hospital or in the community (AUC = 80–86%). Here, AUC is the Area Under the receiver operating characteristics Curve and a measure for model performance. Moreover, a simple linear combination of 4 blood counts can be used to have an AUC of 85% for patients within the community. The normalised data of different blood parameters from SARS-CoV-2 positive patients exhibit a decrease in platelets, leukocytes, eosinophils, basophils and lymphocytes, and an increase in monocytes. SARS-CoV-2 positive patients exhibit a characteristic immune response profile pattern and changes in different parameters measured in the full blood count that are detected from simple and rapid blood tests. While symptoms at an early stage of infection are known to overlap with other common conditions, parameters of the full blood counts can be analysed to distinguish the viral type at an earlier stage than current rt-PCR tests for SARS-CoV-2 allow. This new methodology has potential to greatly improve initial screening for patients where PCR based diagnostic tools are limited. © 2020 The Authors
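
Since the abstract reports performance as AUC, a small sketch of how that quantity can be computed from classifier scores may help. It uses the pairwise (Mann-Whitney) form: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The function name `auc` and the 0/1 label coding are assumptions for illustration; nothing here reproduces the study's models or data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels are assumed to be coded 1 (positive) and 0 (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))
```

The quadratic loop is fine for small samples; for large datasets a rank-based computation gives the same value more efficiently.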

Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction
Shi P., Ray S., Zhu Q., and Kon M.A. BMC Bioinformatics. 12

Abstract

Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers. Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets. Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis. © 2011 Shi et al; licensee BioMed Central Ltd.
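
The ranking step based on relative expression ordering can be sketched as follows. The score used is the standard top-scoring-pair statistic, the absolute difference between the two classes in how often gene i is expressed below gene j. The function names, the tie-breaking order, and the rule of taking the union of genes in the top k pairs as the reduced feature set are assumptions for illustration; the abstract does not specify these details, and the downstream SVM is omitted.

```python
from itertools import combinations


def tsp_scores(samples, labels):
    """Rank gene pairs by |P(x_i < x_j | class A) - P(x_i < x_j | class B)|.

    samples: list of expression vectors (one per sample); labels: two classes.
    Returns (score, i, j) tuples sorted from highest to lowest score.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are required")
    n_genes = len(samples[0])
    scores = []
    for i, j in combinations(range(n_genes), 2):
        freqs = []
        for c in classes:
            rows = [s for s, y in zip(samples, labels) if y == c]
            freqs.append(sum(1 for r in rows if r[i] < r[j]) / len(rows))
        scores.append((abs(freqs[0] - freqs[1]), i, j))
    # Highest score first; ties broken by gene index (an arbitrary choice).
    scores.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scores


def selected_genes(scores, k):
    """Gene indices appearing in the k top-ranked pairs (hypothetical selection rule)."""
    genes = set()
    for _, i, j in scores[:k]:
        genes.update((i, j))
    return sorted(genes)
```

The indices returned by `selected_genes` can then be used to subset the expression matrix before training any classifier of choice.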

Model selection in high dimensions: A quadratic-risk-based approach
Ray S. and Lindsay B.G. Journal of the Royal Statistical Society. Series B: Statistical Methodology. 70 (1)

Abstract

We propose a general class of risk measures which can be used for data-based evaluation of parametric models. The loss function is defined as the generalized quadratic distance between the true density and the model proposed. These distances are characterized by a simple quadratic form structure that is adaptable through the choice of a non-negative definite kernel and a bandwidth parameter. Using asymptotic results for the quadratic distances we build a quick-to-compute approximation for the risk function. Its derivation is analogous to the Akaike information criterion but, unlike the Akaike information criterion, the quadratic risk is a global comparison tool. The method does not require resampling, which is a great advantage when point estimators are expensive to compute. The method is illustrated by using the problem of selecting the number of components in a mixture model, where it is shown that, by using an appropriate kernel, the method is computationally straightforward in arbitrarily high data dimensions. In this same context it is shown that the method has some clear advantages over the Akaike information criterion and Bayesian information criterion. © 2008 Royal Statistical Society.

Quadratic distances on probabilities: A unified foundation
Lindsay B.G., Markatou M., Ray S., Yang K.E., and Chen S.-C. Annals of Statistics. 36 (2)

Abstract

This work builds a unified framework for the study of quadratic form distance measures as they are used in assessing the goodness of fit of models. Many important procedures have this structure, but the theory for these methods is dispersed and incomplete. Central to the statistical analysis of these distances is the spectral decomposition of the kernel that generates the distance. We show how this determines the limiting distribution of natural goodness-of-fit tests. Additionally, we develop a new notion, the spectral degrees of freedom of the test, based on this decomposition. The degrees of freedom are easy to compute and estimate, and can be used as a guide in the construction of useful procedures in this class. © Institute of Mathematical Statistics, 2008.

Functional principal component analysis of spatially correlated data
Liu C., Ray S., and Hooker G. Statistics and Computing. 27 (6)

Abstract

This paper focuses on the analysis of spatially correlated functional data. We propose a parametric model for spatial correlation and the between-curve correlation is modeled by correlating functional principal component scores of the functional data. Additionally, in the sparse observation framework, we propose a novel approach of spatial principal analysis by conditional expectation to explicitly estimate spatial correlations and reconstruct individual curves. Assuming spatial stationarity, empirical spatial correlations are calculated as the ratio of eigenvalues of the smoothed covariance surface Cov(X_i(s), X_i(t)) and cross-covariance surface Cov(X_i(s), X_j(t)) at locations indexed by i and j. Then an anisotropic Matérn spatial correlation model is fitted to empirical correlations. Finally, principal component scores are estimated to reconstruct the sparsely observed curves. This framework can naturally accommodate arbitrary covariance structures, but there is an enormous reduction in computation if one can assume the separability of temporal and spatial components. We demonstrate the consistency of our estimates and propose hypothesis tests to examine the separability as well as the isotropy effect of spatial correlation. Using simulation studies, we show that these methods have some clear advantages over existing methods of curve reconstruction and estimation of model parameters. © 2016, The Author(s).