The use of functional imaging such as PET in radiotherapy (RT) is rapidly expanding with new cancer treatment techniques. A fundamental step in RT planning is the accurate segmentation of tumours based on clinical diagnosis. Furthermore, recent tumour control techniques such as intensity modulated radiation therapy (IMRT) dose painting require the accurate calculation of multiple nested contours of intensity values to optimise dose distribution across the tumour. Recently, convolutional neural networks (CNNs) have achieved tremendous success in image segmentation tasks, most of which present the output map at a pixel-wise level. However, their ability to accurately recognize precise object boundaries is limited by the loss of information in the successive downsampling layers. In addition, for the dose painting strategy, there is a need to develop image segmentation approaches that reproducibly and accurately identify the high recurrence-risk contours. To address these issues, we propose a novel hybrid-CNN that integrates a kernel smoothing-based probability contour approach (KsPC) to produce contour-based segmentation maps, which mimic expert behaviours and provide accurate probability contours designed to optimise dose painting/IMRT strategies. Instead of relying on user-supplied tuning parameters, our final model, named KsPC-Net, applies a CNN backbone to learn these parameters automatically and leverages the advantage of KsPC to simultaneously identify object boundaries and provide the corresponding probability contours. The proposed model demonstrated promising performance in comparison to state-of-the-art models on the MICCAI 2021 challenge dataset (HECKTOR).
With the increasing integration of functional imaging techniques like Positron Emission Tomography (PET) into radiotherapy (RT) practices, a paradigm shift in cancer treatment methodologies is underway. A fundamental step in RT planning is the accurate segmentation of tumours based on clinical diagnosis. Furthermore, novel tumour control methods, such as intensity modulated radiation therapy (IMRT) dose painting, demand the precise delineation of multiple intensity value contours to ensure optimal tumour dose distribution. Recently, convolutional neural networks (CNNs) have made significant strides in 3D image segmentation tasks, most of which present the output map at a voxel-wise level. However, because of information loss in successive downsampling layers, they frequently fail to identify precise object boundaries. Moreover, in the context of dose painting strategies, there is an imperative need for reliable and precise image segmentation techniques to delineate high recurrence-risk contours. To address these challenges, we introduce a 3D coarse-to-fine framework, integrating a CNN with a kernel smoothing-based probability volume contour approach (KsPC). This integrated approach generates contour-based segmentation volumes, mimicking expert-level precision and providing accurate probability contours crucial for optimizing dose painting/IMRT strategies. Our final model, named KsPC-Net, leverages a CNN backbone to automatically learn parameters in the kernel smoothing process, thereby obviating the need for user-supplied tuning parameters. The 3D KsPC-Net exploits the strength of KsPC to simultaneously identify object boundaries and generate corresponding probability volume contours, which can be trained within an end-to-end framework. The proposed model has demonstrated promising performance, surpassing state-of-the-art models when tested against the MICCAI 2021 challenge dataset (HECKTOR).
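The core idea of a kernel smoothing-based probability contour can be illustrated with a minimal sketch: smooth a CNN-style probability map with a Gaussian kernel and read off nested level sets. This is a toy illustration with a fixed, user-chosen bandwidth (whereas KsPC-Net learns the kernel parameters), and all names and values below are illustrative, not the published implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def nested_probability_contours(prob_map, bandwidth=2.0, levels=(0.3, 0.5, 0.7, 0.9)):
    """Smooth a (CNN-style) tumour probability map with a Gaussian kernel and
    return nested level-set masks that could drive a dose-painting prescription."""
    smoothed = gaussian_filter(prob_map, sigma=bandwidth)   # kernel smoothing step
    # Each level produces one nested region (higher level = inner contour).
    return smoothed, {lev: smoothed >= lev for lev in levels}

# Toy voxel-wise probability map standing in for a CNN softmax output.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:64, 0:64]
prob_map = np.clip(np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / 200)
                   + 0.05 * rng.normal(size=(64, 64)), 0, 1)

smoothed, masks = nested_probability_contours(prob_map)
for lev, mask in masks.items():
    print(f"level {lev:.1f}: {mask.sum()} voxels inside contour")
```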
Objectives: The SARS-CoV-2 Alpha variant was associated with increased transmission relative to other variants present at the time of its emergence, and several studies have shown an association between Alpha variant infection and increased hospitalisation and 28-day mortality. However, none have addressed the impact on maximum severity of illness in the general population classified by the level of respiratory support required, or death. We aimed to do this. Methods: In this retrospective multi-centre clinical cohort sub-study of the COG-UK consortium, 1475 samples from Scottish hospitalised and community cases collected between 1st November 2020 and 30th January 2021 were sequenced. We matched sequence data to clinical outcomes as the Alpha variant became dominant in Scotland and modelled the association between Alpha variant infection and severe disease using a 4-point scale of maximum severity by 28 days: 1. no respiratory support, 2. supplemental oxygen, 3. ventilation and 4. death. Results: Our cumulative generalised linear mixed model analyses found evidence (cumulative odds ratio: 1.40, 95% CI: 1.02, 1.93) of a positive association between increased clinical severity and lineage (Alpha variant versus pre-Alpha variants). Conclusions: The Alpha variant was associated with more severe clinical disease in the Scottish population than co-circulating lineages. © 2023 Pascall et al.
Objectives: To determine how the intrinsic severity of successively dominant SARS-CoV-2 variants changed over the course of the pandemic. Methods: A retrospective cohort analysis in the NHS Greater Glasgow and Clyde (NHS GGC) Health Board. All sequenced non-nosocomial adult COVID-19 cases in NHS GGC with relevant SARS-CoV-2 lineages (B.1.177/Alpha, Alpha/Delta, AY.4.2 Delta/non-AY.4.2 Delta, non-AY.4.2 Delta/Omicron, and BA.1 Omicron/BA.2 Omicron) during analysis periods were included. Outcome measures were hospital admission, ICU admission, or death within 28 days of a positive COVID-19 test. We report the cumulative odds ratio: the ratio of the odds that an individual experiences a severity event of a given level vs all lower severity levels for the resident and the replacement variant after adjustment. Results: After adjustment for covariates, the cumulative odds ratio was 1.51 (95% CI: 1.08–2.11) for Alpha versus B.1.177, 2.09 (95% CI: 1.42–3.08) for Delta versus Alpha, 0.99 (95% CI: 0.76–1.27) for AY.4.2 Delta versus non-AY.4.2 Delta, 0.49 (95% CI: 0.22–1.06) for Omicron versus non-AY.4.2 Delta, and 0.86 (95% CI: 0.68–1.09) for BA.2 Omicron versus BA.1 Omicron. Conclusions: The direction of change in intrinsic severity between successively emerging SARS-CoV-2 variants was inconsistent, reminding us that the intrinsic severity of future SARS-CoV-2 variants remains uncertain.
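A cumulative odds ratio of this kind comes from an ordinal (proportional-odds) regression of a severity scale on variant and covariates. The sketch below fits such a model with statsmodels on synthetic data; the variable names (severity, variant, age) and the plain cumulative-logit structure are illustrative assumptions, not the study's actual covariate set or mixed-effects specification.

```python
# Minimal sketch of a cumulative (proportional-odds) model of ordinal severity.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "variant": rng.integers(0, 2, n),   # 0 = resident variant, 1 = replacement variant
    "age": rng.normal(60, 15, n),
})
# Latent severity driven by age and variant, cut into 4 ordered levels.
latent = 0.03 * (df["age"] - 60) + 0.4 * df["variant"] + rng.logistic(size=n)
df["severity"] = pd.cut(latent, bins=[-np.inf, 0, 1, 2, np.inf],
                        labels=["none", "hospital", "ICU", "death"])

mod = OrderedModel(df["severity"], df[["variant", "age"]], distr="logit")
res = mod.fit(method="bfgs", disp=False)
# Cumulative odds ratio for the replacement vs the resident variant.
print("cumulative OR:", np.exp(res.params["variant"]))
```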
Understanding the spatiotemporal dynamics of river water chemistry from its source to sinks is critical for constraining the origin, transformation, and hotspots of contaminants in a river basin. To provide new spatiotemporal constraints on river chemistry, dissolved trace element concentrations were measured at 17 targeted locations across the Ramganga River catchment. River water samples were collected across three seasons (pre-monsoon, monsoon, and post-monsoon) between 2019 and 2021. To remove the dependency of trace element concentrations on discharge, we used molar ratios, as discharge data on Indian transboundary rivers are not publicly available. The dataset reveals significant spatiotemporal variability in dissolved trace element concentrations of the Ramganga River. Samples collected upstream of Moradabad, a major industrial city in western Uttar Pradesh, are characterized by ~1.2–2.5 times higher average concentrations of most of the trace elements except Sc, V, Cr, Rb, and Pb, likely due to intense water–rock interactions in the headwaters. Similar enrichment in trace metal concentrations was also observed at sites downstream of large cities and industrial centers. However, this enrichment was not enough to bring about a major change in the River Ganga chemistry, as the signals were diluted downstream of the Ramganga–Ganga confluence. The average water composition of the Ramganga River was comparable to worldwide river water composition, although a few sites were characterized by very high concentrations of dissolved trace elements. Finally, we provide an outlook that calls for an assessment of stable non-traditional isotopes that are ideally suited to track the origin and transformation of elements such as Li, Mg, Ca, Ti, V, Cr, Fe, Ni, Cu, Zn, Sr, Ag, Cd, Sn, Pt, and Hg in Indian rivers. © 2023, The Author(s).
Impaired water quality continues to be a serious problem in surface waters worldwide. Despite extensive regulatory water quality monitoring implemented by the Government of India over the past two decades, the spatial and temporal resolution of water quality observations, the range of monitored contaminants and data related to characterisation of point source effluents are still limited. In addition, discharge data for trans-boundary rivers is considered sensitive information and is not publicly available. Hence, quantifying and mitigating pollutant loads and planning effective mitigation strategies are hindered by data paucity, and there is an urgent need for the development of decision support tools (DST) that can account for these uncertainties. In this study, we tested the application of a probabilistic DST based on Bayesian Belief Networks to evaluate pollution risk from nutrients (phosphate, nitrate, ammonia), sediments and heavy metals (Cd, Cr, Cu, Pb, Zn) in the Ramganga river basin (30,839 km²), the first major tributary of the Ganga in the state of Uttar Pradesh, India, which is understood to be a significant source of pollution into the Ganga River, contributed by a range of industries, domestic sources and intensive farming practices. Bayesian belief networks are graphical causal models that enable the integration of observational data (both spatial and temporal) with data from literature and expert knowledge within a probabilistic framework, whilst accounting for uncertainty. The objectives of this study were to 1) develop a parsimonious conceptual model of the system that allows harnessing diverse but limited data, 2) evaluate the important components of the system to inform further data collection and management strategies, and 3) simulate plausible management scenarios. We simulated the impacts of point source management interventions on pollution risk, including provision of sufficient municipal sewage treatment plant (STP) capacity, enhanced STP treatment levels and sufficient industrial wastewater effluent treatment capacity. We found a clear effect of enhanced STP interventions on improved regulatory standard compliance for nitrate (from 92% to 95%) and phosphate (from 33% to 41%). However, the effect of interventions on heavy metal pollution risk was not clear, due to considerable uncertainties related to the lack of reliable discharge data and the characterisation of industrial effluent quality. The parsimonious DST helped to collate the available understanding of water quality impacts from multiple pollutants in the Ramganga river basin, while sensitivity analysis highlighted critical areas for further data collection.
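To make the Bayesian belief network idea concrete, here is a minimal discrete-network sketch using pgmpy. The nodes (STP_capacity, Effluent_load, Pollution_risk) and every probability are invented for illustration; they are not the study's actual DST structure or conditional probability tables, and the pgmpy class names assume a recent release of that library.

```python
# Hedged sketch of a discrete Bayesian belief network for scenario analysis.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("STP_capacity", "Effluent_load"),
                         ("Effluent_load", "Pollution_risk")])

cpd_stp = TabularCPD("STP_capacity", 2, [[0.6], [0.4]])         # 0 = insufficient, 1 = sufficient
cpd_load = TabularCPD("Effluent_load", 2,
                      [[0.3, 0.8],    # P(low load | STP_capacity = 0, 1)
                       [0.7, 0.2]],   # P(high load | STP_capacity = 0, 1)
                      evidence=["STP_capacity"], evidence_card=[2])
cpd_risk = TabularCPD("Pollution_risk", 2,
                      [[0.9, 0.35],   # P(compliant | Effluent_load = 0, 1)
                       [0.1, 0.65]],  # P(non-compliant | Effluent_load = 0, 1)
                      evidence=["Effluent_load"], evidence_card=[2])
model.add_cpds(cpd_stp, cpd_load, cpd_risk)
assert model.check_model()

# Scenario analysis: pollution risk given sufficient STP capacity.
infer = VariableElimination(model)
print(infer.query(["Pollution_risk"], evidence={"STP_capacity": 1}))
```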
An outbreak of acute hepatitis of unknown aetiology in children was reported in Scotland in April 2022 and has now been identified in 35 countries. Several recent studies have suggested an association with human adenovirus (HAdV), a virus not commonly associated with hepatitis. Here we report a detailed case-control investigation and find an association between adeno-associated virus 2 (AAV2) infection and host genetics in disease susceptibility. Using next-generation sequencing (NGS), reverse transcription-polymerase chain reaction (RT-PCR), serology and in situ hybridisation (ISH), we detected recent infection with AAV2 in the plasma and liver samples of 26/32 (81%) hepatitis cases versus 5/74 (7%) of controls. Further, AAV2 was detected within ballooned hepatocytes alongside a prominent T cell infiltrate in liver biopsies. In keeping with a CD4+ T-cell-mediated immune pathology, the Human Leucocyte Antigen (HLA) class II DRB1*04:01 allele was identified in 25/27 cases (93%), compared with a background frequency of 10/64 (16%; p = 5.49 × 10⁻¹²). In summary, we report an outbreak of acute paediatric hepatitis associated with AAV2 infection (most likely acquired as a coinfection with HAdV, which is required as a helper virus to support AAV2 replication) and HLA class II-related disease susceptibility.
Radionuclide ventriculography (RNVG) can be used to quantify mechanical dyssynchrony and may be a valuable adjunct in the assessment of heart failure with reduced ejection fraction (HFrEF). The study aims to investigate the effect of beta-blockers on mechanical dyssynchrony using novel RNVG phase parameters.
In March 2020 mathematics became a key part of the scientific advice to the UK government on the pandemic response to COVID-19. Mathematical and statistical modelling provided critical information on the spread of the virus and the potential impact of different interventions. The unprecedented scale of the challenge led the epidemiological modelling community in the UK to be pushed to its limits. At the same time, mathematical modellers across the country were keen to use their knowledge and skills to support the COVID-19 modelling effort. However, this sudden great interest in epidemiological modelling needed to be coordinated to provide much-needed support, and to limit the burden on epidemiological modellers already very stretched for time. In this paper we describe three initiatives set up in the UK in spring 2020 to coordinate the mathematical sciences research community in supporting mathematical modelling of COVID-19. Each initiative had different primary aims and worked to maximise synergies between the various projects. We reflect on the lessons learnt, highlighting the key roles of pre-existing research collaborations and focal centres of coordination in contributing to the success of these initiatives. We conclude with recommendations about important ways in which the scientific research community could be better prepared for future pandemics.
Numerous risk tools of diverse complexity have been developed to enable triaging of SARS-CoV-2 positive patients. Here we present a simplified risk tool based on minimal parameters and chest X-ray (CXR) image data that predicts the survival of adult SARS-CoV-2 positive patients at hospital admission. We analysed the NCCID database of patient blood variables and CXR images from 19 hospitals across the UK using multivariable logistic regression. The initial dataset was non-randomly split between development and internal validation datasets with 1434 and 310 SARS-CoV-2 positive patients, respectively. External validation of the final model was conducted on 741 Accident and Emergency (A&E) admissions with suspected SARS-CoV-2 infection from a separate NHS Trust. The LUCAS mortality score included the five strongest predictors (Lymphocyte count, Urea, C-reactive protein, Age, Sex), which are available at any point of care with rapid turnaround of results. Our simple multivariable logistic model showed high discrimination for fatal outcome, with an area under the receiver operating characteristic curve (AUC-ROC) of 0.765 (95% confidence interval (CI): 0.738-0.790) in the development cohort, 0.744 (CI: 0.673-0.808) in the internal validation cohort, and 0.752 (CI: 0.713-0.787) in the external validation cohort. The discriminatory power of LUCAS increased slightly when including the CXR image data. LUCAS can be used to stratify patients into low, moderate, high, or very high risk of fatality within 60 days of a SARS-CoV-2 RT-PCR result.
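A LUCAS-style score is, at heart, a multivariable logistic regression on a handful of routine measurements, evaluated by AUC-ROC. The sketch below reproduces that workflow on synthetic data; the column names, coefficients and split sizes are stand-ins, not the published model.

```python
# Minimal sketch of a five-predictor logistic mortality score evaluated by AUC-ROC.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 1744
X = pd.DataFrame({
    "lymphocytes": rng.gamma(2.0, 0.8, n),
    "urea": rng.gamma(3.0, 2.5, n),
    "crp": rng.gamma(1.5, 40.0, n),
    "age": rng.normal(65, 15, n),
    "sex": rng.integers(0, 2, n),
})
# Synthetic outcome loosely following the direction of the named predictors.
logit = (-6 + 0.05 * X["age"] + 0.01 * X["crp"] + 0.1 * X["urea"]
         - 0.5 * X["lymphocytes"] + 0.3 * X["sex"])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=310, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
print("validation AUC-ROC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```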
Vaccines based on the spike protein of SARS-CoV-2 are a cornerstone of the public health response to COVID-19. The emergence of hypermutated, increasingly transmissible variants of concern (VOCs) threatens this strategy. Omicron (B.1.1.529), the fifth VOC to be described, harbours multiple amino acid mutations in spike, half of which lie within the receptor-binding domain. Here we demonstrate substantial evasion of neutralization by Omicron BA.1 and BA.2 variants in vitro using sera from individuals vaccinated with ChAdOx1, BNT162b2 and mRNA-1273. These data were mirrored by a substantial reduction in real-world vaccine effectiveness that was partially restored by booster vaccination. The Omicron variants BA.1 and BA.2 did not induce cell syncytia in vitro and favoured a TMPRSS2-independent endosomal entry pathway, with these phenotypes mapping to distinct regions of the spike protein. Impaired cell fusion was determined by the receptor-binding domain, while endosomal entry mapped to the S2 domain. Such marked changes in antigenicity and replicative biology may underlie the rapid global spread and altered pathogenicity of the Omicron variant. © 2022, The Author(s).
Background: Accurate diagnostic tools to identify patients at risk of cancer therapy-related cardiac dysfunction (CTRCD) are critical. For patients undergoing cardiotoxic cancer therapy, ejection fraction assessment using radionuclide ventriculography (RNVG) is commonly used for serial assessment of left ventricular (LV) function. Methods: In this retrospective study, approximate entropy (ApEn), synchrony, entropy, and the standard deviation from the phase histogram (phase SD) were investigated as potential early markers of LV dysfunction to predict CTRCD. These phase parameters were calculated from the baseline RNVG phase image for 177 breast cancer patients before commencing cardiotoxic therapy. Results: Of the 177 patients, 11 had a decline in left ventricular ejection fraction (LVEF) of over 10% to an LVEF below 50% after treatment had commenced. This patient group had a significantly higher ApEn at baseline than those who maintained a normal LVEF throughout treatment. Of the parameters investigated, ApEn was superior for predicting the risk of CTRCD. Combining ApEn with the baseline LVEF further improved the discrimination between the groups. Conclusions: The results suggest that RNVG phase analysis using approximate entropy may aid in the detection of sub-clinical LV contraction abnormalities, not detectable by baseline LVEF measurement, predicting a subsequent decline in LVEF. © 2020, The Author(s).
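Approximate entropy itself is a standard, self-contained calculation. Below is a hedged sketch of the usual Pincus definition applied to a 1-D sequence of phase values; the paper computes ApEn from the RNVG phase image, and treating the image as a flattened sequence here is purely an illustrative simplification.

```python
# Sketch of approximate entropy ApEn(m, r) for a 1-D signal x.
import numpy as np

def approximate_entropy(x, m=2, r=None):
    """ApEn(m, r) = Phi_m(r) - Phi_{m+1}(r) using the standard definition."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if r is None:
        r = 0.2 * x.std()                      # common default tolerance

    def phi(m):
        # All overlapping templates of length m.
        templ = np.array([x[i:i + m] for i in range(n - m + 1)])
        # Chebyshev distance between every pair of templates.
        dist = np.max(np.abs(templ[:, None, :] - templ[None, :, :]), axis=2)
        c = (dist <= r).mean(axis=1)           # fraction of templates within tolerance
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

phase_values = np.random.default_rng(0).normal(140, 20, 500)   # toy phase-angle sample
print("ApEn:", approximate_entropy(phase_values))
```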
In the version of this article initially published, the author affiliation information was incomplete, neglecting to note that Brian J. Willett, Joe Grove, Oscar A. MacLean, Craig Wilkie, Giuditta De Lorenzo, Wilhelm Furnon, Diego Cantoni, Sam Scott, Nicola Logan and Shirin Ashraf contributed equally and that John Haughney, David L. Robertson, Massimo Palmarini, Surajit Ray and Emma C. Thomson jointly supervised the work, as now indicated in the HTML and PDF versions of the article. The Author(s) 2022.
Acute kidney injury (AKI) is a prevalent complication in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) positive inpatients and is linked to an increased mortality rate compared to patients without AKI. Here we analysed the difference in kidney blood biomarkers between SARS-CoV-2 positive patients with non-fatal and fatal outcomes, in order to develop a mortality prediction model for hospitalised SARS-CoV-2 positive patients. This was a retrospective cohort study including data from suspected SARS-CoV-2 positive patients admitted to a large National Health Service (NHS) Foundation Trust hospital in the Yorkshire and Humber regions, United Kingdom, between 1 March 2020 and 30 August 2020. Hospitalised adult patients (aged ≥ 18 years) with at least one confirmed positive RT-PCR test for SARS-CoV-2 and blood tests of kidney biomarkers within 36 h of the RT-PCR test were included. The main outcome measure was 90-day in-hospital mortality in SARS-CoV-2 infected patients. The logistic regression and random forest (RF) models incorporated six predictors, including three routine kidney function tests (sodium, urea; creatinine only in RF), along with age, sex, and ethnicity. The mortality prediction performance of the logistic regression model achieved an area under the receiver operating characteristic (AUROC) curve of 0.772 in the test dataset (95% CI: 0.694–0.823), while the RF model attained an AUROC of 0.820 in the same test cohort (95% CI: 0.740–0.870). The resulting validated prediction model is the first to focus specifically on kidney biomarkers for in-hospital mortality over a 90-day period. © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
The global pandemic of coronavirus disease 2019 (COVID-19) is continuing to have a significant effect on the well-being of the global population, thus increasing the demand for rapid testing, diagnosis, and treatment. As COVID-19 can cause severe pneumonia, early diagnosis is essential for correct treatment, as well as to reduce the stress on the healthcare system. Along with COVID-19, other etiologies of pneumonia and Tuberculosis (TB) constitute additional challenges to the medical system. Pneumonia (viral as well as bacterial) kills about 2 million infants every year and is consistently estimated as one of the most important factors in childhood mortality (according to the World Health Organization). Chest X-ray (CXR) and computed tomography (CT) scans are the primary imaging modalities for diagnosing respiratory diseases. Although CT scans are the gold standard, they are more expensive, time consuming, and are associated with a small but significant dose of radiation. Hence, CXR has become more widespread as a first line investigation. In this regard, the objective of this work is to develop a new deep transfer learning pipeline, named DenResCov-19, to diagnose patients with COVID-19, pneumonia, TB or healthy lungs based on CXR images. The pipeline consists of the existing DenseNet-121 and ResNet-50 networks. Since DenseNet and ResNet have orthogonal performances in some instances, in the proposed model we have created an extra layer with convolutional neural network (CNN) blocks to join these two models together to establish superior performance compared to the two individual networks. This strategy can be applied universally in cases where two competing networks are observed. We have tested the performance of our proposed network on two-class (pneumonia and healthy), three-class (COVID-19 positive, healthy, and pneumonia), as well as four-class (COVID-19 positive, healthy, TB, and pneumonia) classification problems. We have validated that our proposed network is able to successfully classify these lung diseases on our four datasets, which is one of our novel findings. In particular, the AUC-ROC values are 99.60%, 96.51%, 93.70%, and 96.40%, and the F1 values are 98.21%, 87.29%, 76.09%, and 83.17% on our Dataset X-Ray 1, 2, 3, and 4 (DXR1, DXR2, DXR3, DXR4), respectively. © 2021 The Authors
Since December 2019, the novel coronavirus SARS-CoV-2 has been identified as the cause of the COVID-19 pandemic. Early symptoms overlap with other common conditions such as the common cold and influenza, making early screening and diagnosis crucial goals for health practitioners. The aim of the study was to use machine learning (ML), an artificial neural network (ANN) and a simple statistical test to identify SARS-CoV-2 positive patients from full blood counts without knowledge of symptoms or history of the individuals. The dataset included in the analysis and training contains anonymized full blood count results from patients seen at the Hospital Israelita Albert Einstein in São Paulo, Brazil, who had samples collected to perform the SARS-CoV-2 RT-PCR test during a visit to the hospital. Patient data were anonymised by the hospital, and clinical data were standardized to have a mean of zero and a unit standard deviation. This data was made public to allow researchers to develop ways to enable the hospital to rapidly predict and potentially identify SARS-CoV-2 positive patients. We find that with full blood counts, random forest, shallow learning and a flexible ANN model predict SARS-CoV-2 patients with high accuracy between populations on regular wards (AUC = 94–95%) and those not admitted to hospital or in the community (AUC = 80–86%). Here, AUC is the Area Under the receiver operating characteristic Curve and a measure of model performance. Moreover, a simple linear combination of 4 blood counts can be used to achieve an AUC of 85% for patients within the community. The normalised data of different blood parameters from SARS-CoV-2 positive patients exhibit a decrease in platelets, leukocytes, eosinophils, basophils and lymphocytes, and an increase in monocytes. SARS-CoV-2 positive patients exhibit a characteristic immune response profile pattern and changes in different parameters measured in the full blood count that are detected from simple and rapid blood tests. While symptoms at an early stage of infection are known to overlap with other common conditions, parameters of the full blood count can be analysed to distinguish the viral type at an earlier stage than current RT-PCR tests for SARS-CoV-2 allow at present. This new methodology has the potential to greatly improve initial screening for patients where PCR-based diagnostic tools are limited. © 2020 The Authors
We present a new approach to model selection and Bayes factor determination, based on Laplace expansions (as in BIC), which we call Prior-based Bayes Information Criterion (PBIC). In this approach, the Laplace expansion is only done with the likelihood function, and then a suitable prior distribution is chosen to allow exact computation of the (approximate) marginal likelihood arising from the Laplace approximation and the prior. The result is a closed-form expression similar to BIC, but now involves a term arising from the prior distribution (which BIC ignores) and also incorporates the idea that different parameters can have different effective sample sizes (whereas BIC only allows one overall sample size n). We also consider a modification of PBIC which is more favourable to complex models. 2019, East China Normal University 2019.
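As background for how BIC-type criteria arise, the Laplace expansion of the marginal likelihood under a model with k parameters can be written as below; dropping the prior and the O(1) terms gives the familiar BIC. This is the textbook starting point that PBIC refines with a prior-dependent term and parameter-specific effective sample sizes, not the paper's exact PBIC formula.

```latex
% Laplace expansion of the marginal likelihood m(x) with MLE \hat\theta,
% prior \pi(\theta), and observed information \hat{I}(\hat\theta);
% BIC keeps only the first two terms: -2\log L(\hat\theta) + k\log n.
\log m(x) \;\approx\; \log L(\hat\theta)
  \;-\; \frac{k}{2}\log n
  \;+\; \frac{k}{2}\log(2\pi)
  \;-\; \frac{1}{2}\log\bigl|\hat{I}(\hat\theta)/n\bigr|
  \;+\; \log \pi(\hat\theta)
```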
Peatlands are spatially heterogeneous ecosystems that develop due to a complex set of autogenic physical and biogeochemical processes and allogenic factors such as the climate and topography. They are significant stocks of global soil carbon, and therefore predicting the depth of peatlands is an important part of establishing an accurate assessment of their magnitude. Yet there have been few attempts to account for both internal and external processes when predicting the depth of peatlands. Using blanket peatlands in Great Britain as a case study, we compare a linear and a geostatistical (spatial) model and several sets of covariates applicable to peatlands around the world that have developed over hilly or undulating terrain. We hypothesized that the spatial model would act as a proxy for the autogenic processes in peatlands that can mediate the accumulation of peat on plateaus or shallow slopes. Our findings show that the spatial model performs better than the linear model in all cases: root mean square errors (RMSE) are lower, and 95% prediction intervals are narrower. In support of our hypothesis, the spatial model also better predicts the deeper areas of peat, and we show that its predictive performance in areas of deep peat is dependent on depth observations being spatially autocorrelated. Where they are not, the spatial model performs only slightly better than the linear model. As a result, we recommend that practitioners carrying out depth surveys fully account for the variation of topographic features in prediction locations, and that the sampling approach adopted enables observations to be spatially autocorrelated. © 2018 Young et al.
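The linear-versus-spatial comparison can be sketched in a few lines: fit a covariate-only regression and a spatially aware model to the same depth data and compare RMSE. Here the spatial model is approximated by Gaussian process regression with a Matérn kernel on synthetic coordinates and a slope covariate; this mirrors the comparison in spirit only, not the paper's exact geostatistical formulation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 300
coords = rng.uniform(0, 10, size=(n, 2))           # easting/northing (arbitrary units)
slope = rng.uniform(0, 15, n)                      # topographic covariate (degrees)
# Depth decreases with slope, plus a smooth spatially autocorrelated field and noise.
spatial_field = np.sin(coords[:, 0]) + np.cos(0.7 * coords[:, 1])
depth = 3.0 - 0.12 * slope + spatial_field + rng.normal(0, 0.3, n)

X = np.column_stack([coords, slope])
X_tr, X_te, y_tr, y_te = train_test_split(X, depth, test_size=0.3, random_state=0)

linear = LinearRegression().fit(X_tr[:, 2:], y_tr)  # covariate-only model
gp = GaussianProcessRegressor(kernel=Matern(nu=1.5) + WhiteKernel(),
                              normalize_y=True).fit(X_tr, y_tr)

rmse = lambda m, Xs, ys: mean_squared_error(ys, m.predict(Xs)) ** 0.5
print("linear RMSE :", rmse(linear, X_te[:, 2:], y_te))
print("spatial RMSE:", rmse(gp, X_te, y_te))
```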
This paper focuses on the analysis of spatially correlated functional data. We propose a parametric model for spatial correlation, and the between-curve correlation is modeled by correlating functional principal component scores of the functional data. Additionally, in the sparse observation framework, we propose a novel approach of spatial principal analysis by conditional expectation to explicitly estimate spatial correlations and reconstruct individual curves. Assuming spatial stationarity, empirical spatial correlations are calculated as the ratio of eigenvalues of the smoothed covariance surface Cov(Xi(s), Xi(t)) and cross-covariance surface Cov(Xi(s), Xj(t)) at locations indexed by i and j. Then an anisotropic Matérn spatial correlation model is fitted to the empirical correlations. Finally, principal component scores are estimated to reconstruct the sparsely observed curves. This framework can naturally accommodate arbitrary covariance structures, but there is an enormous reduction in computation if one can assume the separability of temporal and spatial components. We demonstrate the consistency of our estimates and propose hypothesis tests to examine the separability as well as the isotropy effect of spatial correlation. Using simulation studies, we show that these methods have some clear advantages over existing methods of curve reconstruction and estimation of model parameters. © 2016, The Author(s).
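For reference, one common parametrization of the isotropic Matérn correlation function mentioned above is given below; an anisotropic version replaces the distance d by a direction-dependent transformed distance. This is the standard textbook form, not notation copied from the paper.

```latex
% Isotropic Matern correlation at separation distance d, with range \phi and
% smoothness \nu; K_\nu is the modified Bessel function of the second kind.
\rho(d) \;=\; \frac{2^{1-\nu}}{\Gamma(\nu)}
  \left(\frac{d}{\phi}\right)^{\nu} K_{\nu}\!\left(\frac{d}{\phi}\right),
  \qquad d > 0, \quad \rho(0) = 1
```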
This article develops methods of statistical monitoring of clinical trials with multiple co-primary endpoints, where success is defined as meeting both endpoints simultaneously. In practice, a group sequential design (GSD) method is used to stop trials early for promising efficacy, and conditional power (CP) is used for futility stopping rules. In this article, we show that stopping boundaries for the GSD with multiple co-primary endpoints should be the same as those for studies with single endpoints. Lan and Wittes proposed the B-value tool to calculate the CP of single-endpoint trials, and we extend this tool to calculate the CP for studies with multiple co-primary endpoints. We consider the case of two-arm studies with co-primary normal endpoints and provide an example of implementation with a simulated trial. A fixed-weight sample size re-estimation approach based on CP is introduced. © 2014 American Statistical Association.
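For orientation, the single-endpoint B-value relations of Lan and Wittes, which the article extends to co-primary endpoints, can be stated as follows; this is the standard single-endpoint formulation, not the paper's multi-endpoint extension.

```latex
% B-value at information fraction t, and conditional power for a one-sided
% level-\alpha test under an assumed drift \theta (setting \theta = B(t)/t
% gives the "current trend" version).
B(t) = Z(t)\sqrt{t}, \qquad
\mathrm{CP}(\theta) \;=\; 1 - \Phi\!\left(
  \frac{z_{1-\alpha} - B(t) - \theta\,(1 - t)}{\sqrt{1 - t}} \right)
```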
In this article, we study the power properties of quadratic-distance-based goodness-of-fit tests. First, we introduce the concept of a root kernel and discuss the considerations that enter the selection of this kernel. We derive an easy to use normal approximation to the power of quadratic distance goodness-of-fit tests and base the construction of a noncentrality index, an analogue of the traditional noncentrality parameter, on it. This leads to a method akin to the Neyman-Pearson lemma for constructing optimal kernels for specific alternatives. We then introduce a midpower analysis as a device for choosing optimal degrees of freedom for a family of alternatives of interest. Finally, we introduce a new diffusion kernel, called the Pearson-normal kernel, and study the extent to which the normal approximation to the power of tests based on this kernel is valid. Supplementary materials for this article are available online. 2014 American Statistical Association.
Selecting between competing structural equation models is a common problem. Often selection is based on the chi-square test statistic or other fit indices. In other areas of statistical research Bayesian information criteria are commonly used, but they are less frequently used with structural equation models compared to other fit indices. This article examines several new and old information criteria (IC) that approximate Bayes factors. We compare these IC measures to common fit indices in a simulation that includes the true and false models. In moderate to large samples, the IC measures outperform the fit indices. In a second simulation we only consider the IC measures and do not include the true model. In moderate to large samples the IC measures favor approximate models that only differ from the true model by having extra parameters. Overall, SPBIC, a new IC measure, performs well relative to the other IC measures. 2014 Copyright Taylor and Francis Group, LLC.
We extend the concept of the ridgeline from Ray and Lindsay (Ann Stat 33:2042-2065, 2005) to finite mixtures of general elliptical densities with possibly distinct density generators in each component. This can be used to obtain bounds for the number of modes of two-component mixtures of t distributions in any dimension. In case of proportional dispersion matrices, these have at most three modes, while for equal degrees of freedom and equal dispersion matrices, the number of modes is at most two. We also give numerical illustrations and indicate applications to clustering and hypothesis testing. Springer International Publishing Switzerland 2013.
Pattern discovery in sequences is an important unsolved problem in biology, with many applications, including detecting regulation of genes by transcription factors, and differentiating proteins of infecting organisms such as viruses from an animal's own genome. In this article we describe some of the recent statistical approaches developed to address these problems, and some possible future directions for progress in this field. 2012 Elsevier B.V.
The main result of this article states that one can get as many as D + 1 modes from just a two-component normal mixture in D dimensions. Multivariate mixture models are widely used for modeling homogeneous populations and for cluster analysis. Either the components directly or modes arising from these components are often used to extract individual clusters. Although in lower dimensions these strategies work well, our results show that high dimensional mixtures are often very complex and researchers should take extra precautions when using mixture models for cluster analysis. Further, our analysis shows that the number of modes depends on the component means and eigenvalues of the ratio of the two component covariance matrices, which in turn provides a clear guideline as to when one can use mixture analysis for clustering high dimensional data. © 2012 Elsevier Inc.
We present a new approach to factor rotation for functional data. This is achieved by rotating the functional principal components toward a predefined space of periodic functions designed to decompose the total variation into components that are nearly-periodic and nearly-aperiodic with a predefined period. We show that the factor rotation can be obtained by calculation of canonical correlations between appropriate spaces which make the methodology computationally efficient. Moreover, we demonstrate that our proposed rotations provide stable and interpretable results in the presence of highly complex covariance. This work is motivated by the goal of finding interpretable sources of variability in gridded time series of vegetation index measurements obtained from remote sensing, and we demonstrate our methodology through an application of factor rotation of this data. Institute of Mathematical Statistics, 2012.
Background: In recent years, intense research efforts have focused on developing methods for automated flow cytometric data analysis. However, while designing such applications, little or no attention has been paid to the human perspective that is absolutely central to the manual gating process of identifying and characterizing cell populations. In particular, the assumption of many common techniques that cell populations could be modeled reliably with pre-specified distributions may not hold true in real-life samples, which can have populations of arbitrary shapes and considerable inter-sample variation. Results: To address this, we developed a new framework flowScape for emulating certain key aspects of the human perspective in analyzing flow data, which we implemented in multiple steps. First, flowScape begins with creating a mathematically rigorous map of the high-dimensional flow data landscape based on dense and sparse regions defined by relative concentrations of events around modes. In the second step, these modal clusters are connected with a global hierarchical structure. This representation allows flowScape to perform ridgeline analysis for both traversing the landscape and isolating cell populations at different levels of resolution. Finally, we extended manual gating with a new capacity for constructing templates that can identify target populations in terms of their relative parameters, as opposed to the more commonly used absolute or physical parameters. This allows flowScape to apply such templates in batch mode for detecting the corresponding populations in a flexible, sample-specific manner. We also demonstrated different applications of our framework to flow data analysis and show its superiority over other analytical methods. Conclusions: The human perspective, built on top of intuition and experience, is a very important component of flow cytometric data analysis. By emulating some of its approaches and extending these with automation and rigor, flowScape provides a flexible and robust framework for computational cytomics. 2012 Ray, Pyne.
Bayes factors (BFs) play an important role in comparing the fit of statistical models. However, computational limitations or lack of an appropriate prior sometimes prevent researchers from using exact BFs. Instead, it is approximated, often using the Bayesian Information Criterion (BIC) or a variant of BIC. The authors provide a comparison of several BF approximations, including two new approximations, the Scaled Unit Information Prior Bayesian Information Criterion (SPBIC) and Information matrix-based Bayesian Information Criterion (IBIC). The SPBIC uses a scaled unit information prior that is more general than the BIC's unit information prior, and the IBIC utilizes more terms of approximation than the BIC. Through simulation, the authors show that several measures perform well in large samples, that performance declines in smaller samples, and that SPBIC and IBIC provide improvement to existing measures under some conditions, including small sample sizes. The authors illustrate the use of the fit measures with the crime data of Ehrlich and then conclude with recommendations for researchers. The Author(s) 2012.
Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers. Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets. Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis. © 2011 Shi et al; licensee BioMed Central Ltd.
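The pairing of a rank-based pair ranking with an SVM can be sketched compactly: score each gene pair by the between-class difference in P(X_i < X_j), keep the top-k pairs, and feed the binary pair indicators to an SVM. The code below is an illustrative toy version on synthetic data, not the authors' implementation or evaluation protocol.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n)
X[y == 1, :4] += 1.0                          # a few informative genes

def top_scoring_pairs(X, y, k=5):
    """Rank gene pairs by |P(Xi < Xj | class 0) - P(Xi < Xj | class 1)|."""
    scores = []
    for i, j in combinations(range(X.shape[1]), 2):
        p0 = np.mean(X[y == 0, i] < X[y == 0, j])
        p1 = np.mean(X[y == 1, i] < X[y == 1, j])
        scores.append((abs(p0 - p1), (i, j)))
    return [pair for _, pair in sorted(scores, key=lambda t: -t[0])[:k]]

def pair_features(X, pairs):
    return np.column_stack([(X[:, i] < X[:, j]).astype(float) for i, j in pairs])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pairs = top_scoring_pairs(X_tr, y_tr, k=5)    # rank pairs on training data only
clf = SVC(kernel="linear").fit(pair_features(X_tr, pairs), y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(pair_features(X_te, pairs))))
```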
Protein microarrays are a high-throughput technology capable of generating large quantities of proteomics data. They can be used for general research or for clinical diagnostics. Bioinformatics and statistical analysis techniques are required for interpretation and reaching biologically relevant conclusions from raw data. We describe essential algorithms for processing protein microarray data, including spot-finding on slide images, Z score, and significance analysis of microarrays (SAM) calculations, as well as the concentration dependent analysis (CDA). We also describe available tools for protein microarray analysis, and provide a template for a step-by-step approach to performing an analysis centered on the CDA method. We conclude with a discussion of fundamental and practical issues and considerations. 2011, Springer Science+Business Media, LLC.
This work builds a unified framework for the study of quadratic form distance measures as they are used in assessing the goodness of fit of models. Many important procedures have this structure, but the theory for these methods is dispersed and incomplete. Central to the statistical analysis of these distances is the spectral decomposition of the kernel that generates the distance. We show how this determines the limiting distribution of natural goodness-of-fit tests. Additionally, we develop a new notion, the spectral degrees of freedom of the test, based on this decomposition. The degrees of freedom are easy to compute and estimate, and can be used as a guide in the construction of useful procedures in this class. Institute of Mathematical Statistics, 2008.
We propose a general class of risk measures which can be used for data-based evaluation of parametric models. The loss function is defined as the generalized quadratic distance between the true density and the model proposed. These distances are characterized by a simple quadratic form structure that is adaptable through the choice of a non-negative definite kernel and a bandwidth parameter. Using asymptotic results for the quadratic distances we build a quick-to-compute approximation for the risk function. Its derivation is analogous to the Akaike information criterion but, unlike the Akaike information criterion, the quadratic risk is a global comparison tool. The method does not require resampling, which is a great advantage when point estimators are expensive to compute. The method is illustrated by using the problem of selecting the number of components in a mixture model, where it is shown that, by using an appropriate kernel, the method is computationally straightforward in arbitrarily high data dimensions. In this same context it is shown that the method has some clear advantages over the Akaike information criterion and Bayesian information criterion. 2008 Royal Statistical Society.
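One common way to write the generalized quadratic distance underlying these risk measures is given below, for distributions F and G and a non-negative definite (possibly bandwidth-dependent) kernel K; the notation is generic rather than copied from the paper.

```latex
% Generalized quadratic distance between F and G induced by a kernel K.
d_K(F, G) \;=\; \iint K(x, y)\, d(F - G)(x)\, d(F - G)(y)
```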
Background: Protein antigens and their specific epitopes are formulation targets for epitope-based vaccines. A number of prediction servers are available for identification of peptides that bind major histocompatibility complex class I (MHC-I) molecules. The lack of standardized methodology and large number of human MHC-I molecules make the selection of appropriate prediction servers difficult. This study reports a comparative evaluation of thirty prediction servers for seven human MHC-I molecules. Results: Of 147 individual predictors 39 have shown excellent, 47 good, 33 marginal, and 28 poor ability to classify binders from non-binders. The classifiers for HLA-A*0201, A*0301, A*1101, B*0702, B*0801, and B*1501 have excellent, and for A*2402 moderate classification accuracy. Sixteen prediction servers predict peptide binding affinity to MHC-I molecules with high accuracy; correlation coefficients ranging from r = 0.55 (B*0801) to r = 0.87 (A*0201). Conclusion: Non-linear predictors outperform matrix-based predictors. Most predictors can be improved by non-linear transformations of their raw prediction scores. The best predictors of peptide binding are also best in prediction of T-cell epitopes. We propose a new standard for MHC-I binding prediction - a common scale for normalization of prediction scores, applicable to both experimental and predicted data. The results of this study provide assistance to researchers in selection of most adequate prediction tools and selection criteria that suit the needs of their projects. 2008 Lin et al; licensee BioMed Central Ltd.
The advancing technology for automatic segmentation of medical images should be accompanied by techniques to inform the user of the local credibility of results. To the extent that this technology produces clinically acceptable segmentations for a significant fraction of cases, there is a risk that the clinician will assume every result is acceptable. In the less frequent case where segmentation fails, we are concerned that unless the user is alerted by the computer, she would still put the result to clinical use. By alerting the user to the location of a likely segmentation failure, we allow her to apply limited validation and editing resources where they are most needed. We propose an automated method to signal suspected non-credible regions of the segmentation, triggered by statistical outliers of the local image match function. We apply this test to m-rep segmentations of the bladder and prostate in CT images using a local image match computed by PCA on regional intensity quantile functions. We validate these results by correlating the non-credible regions with regions that have surface distance greater than 5.5mm to a reference segmentation for the bladder. A 6mm surface distance was used to validate the prostate results. Varying the outlier threshold level produced a receiver operating characteristic with area under the curve of 0.89 for the bladder and 0.92 for the prostate. Based on this preliminary result, our method has been able to predict local segmentation failures and shows potential for validation in an automatic segmentation pipeline.
Background. A key step in the development of an adaptive immune response to pathogens or vaccines is the binding of short peptides to molecules of the Major Histocompatibility Complex (MHC) for presentation to T lymphocytes, which are thereby activated and differentiate into effector and memory cells. The rational design of vaccines consists in part in the identification of appropriate peptides to effect this process. There are several algorithms currently in use for making such predictions, but these are limited to a small number of MHC molecules and have good but imperfect prediction power. Results. We have undertaken an exploration of the power gained by taking advantage of a natural representation of the amino acids in terms of their biophysical properties. We used several well-known statistical classifiers using either a naive encoding of amino acids by name or an encoding by biophysical properties. In all cases, the encoding by biophysical properties leads to substantially lower misclassification error. Conclusion. Representation of amino acids using a few important bio-physio-chemical properties provides a natural basis for representing peptides and greatly improves peptide-MHC class I binding prediction. © 2007 Ray and Kepler; licensee BioMed Central Ltd.
A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/~jiali/hmac.
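For a Gaussian kernel density estimate with a common bandwidth, the MEM ascent step reduces to a weighted mean of the sample points, and clustering amounts to grouping points that climb to the same mode. The sketch below illustrates only that simplified special case on synthetic data; the published method (including hierarchical MAC and the ridgeline-based merging) lives in the authors' C package linked above.

```python
# Minimal MEM-style mode ascent on a Gaussian KDE with common bandwidth h.
import numpy as np

def mem_ascend(x0, data, h, tol=1e-6, max_iter=500):
    x = x0.copy()
    for _ in range(max_iter):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * h ** 2))   # E-step weights
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()             # M-step update
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(2, 0.3, (100, 2))])
modes = np.array([mem_ascend(p, data, h=0.4) for p in data])
labels = np.unique(np.round(modes, 1), axis=0, return_inverse=True)[1]  # group by mode reached
print("number of clusters found:", len(np.unique(labels)))
```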
Multivariate normal mixtures provide a flexible method of fitting high-dimensional data. It is shown that their topography, in the sense of their key features as a density, can be analyzed rigorously in lower dimensions by use of a ridgeline manifold that contains all critical points, as well as the ridges of the density. A plot of the elevations on the ridgeline shows the key features of the mixed density. In addition, by use of the ridgeline, we uncover a function that determines the number of modes of the mixed density when there are two components being mixed. A followup analysis then gives a curvature function that can be used to prove a set of modality theorems. Institute of Mathematical Statistics, 2005.
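The ridgeline manifold referred to above has a closed form for a two-component normal mixture with means μ₁, μ₂ and covariances Σ₁, Σ₂: a one-dimensional curve, indexed by α, that contains every critical point (modes, antimodes and saddles) of the mixed density. The standard published form is:

```latex
% Ridgeline of a two-component normal mixture, \alpha \in [0, 1].
x^{*}(\alpha) \;=\;
  \bigl[(1-\alpha)\,\Sigma_1^{-1} + \alpha\,\Sigma_2^{-1}\bigr]^{-1}
  \bigl[(1-\alpha)\,\Sigma_1^{-1}\mu_1 + \alpha\,\Sigma_2^{-1}\mu_2\bigr]
```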
Pearson's X²- and the log-likelihood ratio X²-statistics are fundamental tools in goodness-of-fit testing. Cressie and Read constructed a general family of divergences which includes both statistics as special cases. This family is indexed by a single parameter, and divergences at either end of the scale are more powerful against alternatives of one type while being rather poor against the opposite type. Here we present several new goodness-of-fit testing procedures which have reasonably high powers for both kinds of alternative. Graphical studies illustrate the advantages of the new methods.
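The Cressie-Read family referred to above is, in its standard form for observed counts O_i and expected counts E_i, the power-divergence statistic below; λ = 1 recovers Pearson's X² and λ → 0 the log-likelihood ratio statistic G². This is the standard formula, included here for reference.

```latex
% Cressie-Read power-divergence statistic with index \lambda.
2n I^{\lambda} \;=\; \frac{2}{\lambda(\lambda + 1)}
  \sum_{i} O_i \left[ \left( \frac{O_i}{E_i} \right)^{\lambda} - 1 \right]
```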