**Mathematical Foundations of Data Science** – May 1-3, 2017

### Bradley Plenary Speaker:

**Aad van der Vaart** Leiden University

### Invited Speakers:

**Ben Adcock** Simon Fraser University**Alexandra Chouldechova** Carnegie Mellon University**Serkan Gugercin** Virginia Tech**Matthias Heinkenschloss** Rice University**Kody Law** Oak Ridge National Lab**Evelyn Lunasin** U.S. Naval Academy**Ekaterina Merkurjev** Michigan State University**Elizabeth Munch** University of Albany**Takashi Owada** Purdue University**Victor Panaretos** École Polytechnique Fédérale de Lausanne**Jose Perea** Michigan State University**Antonio Rieser** Centro de Investigación en Matemáticas

#### Organizing Committee

- Vasileios Maroulas (Chair)
- Jan Rosinski
- Clayton Webster
- Steve Wise

## Bradley Lecture

#### Aad van der Vaart, Leiden University

*Bayesian statistics in high dimensions*

We present an overview of Bayesian methods to estimate functions or high-dimensional parameter vectors, and discuss the validity (or not) of these methods from a non-Bayesian point of view. For instance, we consider using a Gaussian process as a prior for a function (such as a regression function, or the initial value in a pdf, or, after exponentiation and normalization, a density function). We characterize the rate at which the corresponding posterior distribution can recover a true function as the noise level tends to zero or the number of observations tends to infinity, and discuss how this rate can be improved by scaling the time axis, showing that an appropriate random scaling leads to adaptive recovery over a scale of smoothness levels. Recovery means that the posterior distribution concentrates most of its mass near the parameter that generates the data, for most observations. It refers mostly to the location of the posterior distribution. A second use of the posterior distribution is uncertainty quantification, and refers to the spread of the posterior distribution. In fact, it is at the core of the Bayesian method to use the full posterior distribution as an indication of remaining uncertainty. We discuss the general difficulties of uncertainty quantification in nonparametric statistics, from which Bayesian methods of course also cannot escape. We argue that these difficulties imply that the uncertainty quantification of adaptive Bayesian methods must be misleading for some true parameters, and present concrete examples. We next show that for so-called self-similar parameters the uncertainty quantification is valid.

The first talk is a general introduction to these aspects of nonparametric Bayesian statistics, focused mostly at curve or surface estimation. In the second talk we address similar issues in the Bayesian recovery of regression parameters in sparse high-dimensional models.

## Invited Talks

#### Ben Adcock, Simon Fraser University

*Sparse polynomial approximation of high-dimensional functions*

Many problems in scientific computing require the approximation of smooth, high-dimensional functions from limited amounts of data. For instance, a common problem in uncertainty quantification involves identifying the parameter dependence of the output of a computational model. Complex physical systems require computational models with many parameters, resulting in multivariate functions of many variables. Although the amount of data may be large, the curse of dimensionality essentially prohibits collecting or processing enough data to reconstruct the unknown function using classical approximation techniques. In this talk, I will give an overview of the approximation of smooth, high-dimensional functions by sparse polynomial expansions. I will focus on the recent application of techniques from compressed sensing to this problem, and demonstrate how such approaches can theoretically overcome the curse of dimensionality. If time, I will also discuss several extensions, including dealing with corrupted and/or unstructured data, the effect of model error and incorporating additional information such as gradient data. I will also highlight several challenges and open problems.

#### Alexandra Chouldechova, Carnegie Mellon University

*Data science in society: bias, discrimination and accountability*

In May 2016 the White House published a seminal report on data and ethics entitled, “Big Data: A Report on Algorithmic Systems, Opportunity, and Civil rights”. Highlighting key challenges, the report states, “big data techniques have the potential to enhance our ability to detect and prevent discriminatory harm. But, if these technologies are not implemented with care, they can also perpetuate, exacerbate, or mask harmful discrimination.” This talk will provide an overview of bias, discrimination and disparate impact in the context of criminal recidivism risk prediction. I will describe several fairness criteria that have recently been applied to assess the fairness of recidivism prediction instruments. We will see that the criteria cannot all be simultaneously satisfied when recidivism prevalence differs across groups, and will discuss some consequences of this result. The second part of the talk will touch on issues of accountability and transparency in societal applications of predictive modeling. Throughout the talk I will highlight multidisciplinary challenges in societal applications of data science.

#### Serkan Gugercin, Virginia Tech

*Data-Driven Reduced Order Models via Interpolation*

Dynamical systems are a principal tool in the modeling, prediction, and control of physical phenomena with applications ranging from wave propagation and vibration suppression in large structures to timely prediction of storm surges before an advancing hurricane. Direct numerical simulation of these models may be the only possibility for accurate prediction or control of complex physical phenomena. However, the ever increasing need for improved accuracy requires the inclusion of ever more detail in the modeling stage, leading inevitably to ever larger-scale, ever more complex dynamical systems. Simulations in such large-scale settings can be overwhelming and make unmanageably large demands on computational resources, which is the main motivation for model reduction. Using the systems-theoretic techniques, model reduction aims to produce a much lower dimensional system whose input/output behavior mimics that of the original as closely as possible.

Interpolatory model reduction methods have matured quickly in the last decade and emerged as extremely effective strategies for large scale problems. In this talk, we will focus on the data-driven framework for interpolatory model reduction in which the goal is to produce a reduced model using only accessible (simulation) inputs and outputs. For the first part of the talk, the data will correspond to evaluating underlying transfer function at a given frequency. We will show how to construct locally optimal reduced models in this framework, allowing us to construct high-fidelity rational approximants using frequency domain measurements. Then, we will extend this framework to the case of time-domain simulations where using a nonintrusive approach, we will show how construct reduced models directly from trajectories of the inputs and outputs of the full model, without requiring the full-model operators.

#### Matthias Heinkenschloss, Rice University

*Risk averse PDE constrained optimization*

Optimal control and optimal design problems governed by partial differential equations (PDEs) arise in many engineering and science applications. In these applications one wants to maximize the performance of the system subject to constraints. When problem data, such as material parameters, are not known exactly but are modeled as random fields, the system performance is a random variable. So-called risk measures are applied to this random variable to obtain the objective function for PDE constrained optimization under uncertainty. Instead of only maximizing expected performance, risk averse optimization also considers the deviation of actual performance below expected performance. The resulting optimization problems are difficult to solve, since a single objective function evaluation requires sampling of the governing PDE at many parameters. In addition, risk averse optimization requires sampling in the tail of the distribution.

This talk demonstrates the impact of risk averse optimization formulations on the solution and illustrates the shortcomings of (plain) Monte Carlo and sparse grid sampling methods in the context of risk averse optimization. New importance sampling schemes are introduced that exploit the structure of risk measures and use reduced order models to identify the small regions in parameter space which are important for the optimization. It is shown that these new sampling schemes substantially reduce the cost of solving the optimization problems.

#### Kody Law, Oak Ridge National Lab

*Multilevel Monte Carlo methods for Bayesian Inference*

For half a century computational scientists have been numerically simulating complex systems. Uncertainty is recently becoming a requisite consideration in complex applications which have been classically treated deterministically. This has led to an increasing interest in recent years in uncertainty quantification (UQ). Another recent trend is the explosion of available data. Bayesian inference provides a principled and well-defined approach to the integration of data into an a priori known distribution. The posterior distribution, however, is known only point-wise (possibly with an intractable likelihood) and up to a normalizing constant. Monte Carlo methods have been designed to sample such distributions, such as Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) samplers. Recently, the multilevel Monte Carlo (MLMC) framework has been extended to some of these cases, so that approximation error can be optimally balanced with statistical sampling error, and ultimately the Bayesian inverse problem can be solved for the same asymptotic cost as solving the deterministic forward problem. This talk will concern the recent development of multilevel SMC (MLSMC) samplers and the resulting estimators for standard quantities of interest as well as normalizing constants. MLMC data assimilation methods, which combine dynamical systems with data in an online fashion, will also be presented, including ML particle filters and ensemble Kalman filters. This class of algorithms are expected to become prevalent in the age of increasingly parallel emerging architecture, where resilience and reduced data movement will be crucial algorithmic considerations.

#### Evelyn Lunasin, US Naval Academy

*Data Assimilation Algorithm Based on Feedback Control Theory*

We investigate the effectiveness of a simple finite-dimensional feedback control scheme for globally stabilizing solutions of infinite-dimensional dissipative evolution equations introduced by Azouani and Titi. This feedback control algorithm overcomes some of the major difficulties in control of multi-scale processes: It does not require the presence of separation of scales nor does it assume the existence of a finite-dimensional globally invariant inertial manifold. We present a theoretical framework for a control algorithm which allows us to give a systematic stability analysis, and present the parameter regime where stabilization or control objective is attained. In addition, the number of observables and controllers that were derived analytically and implemented in our numerical studies is consistent with the finite number of determining modes that are relevant to the underlying physical system. We verify the results computationally in the context of the Chafee-Infante reaction-diffusion equation, the Kuramoto-Sivashinsky equation, and other applied control problems, and observe that the control strategy is robust and independent of the model equation describing the dissipative system.

#### Ekaterina Merkurjev, Michigan State University

*Convex variational methods for multiclass data segmentation on graphs*

Graph-based variational methods have recently shown to be highly competitive for various classification problems of high-dimensional data, but are inherently difficult to handle from an optimization perspective. In this talk, we will describe a convex relaxation for a certain set of graph-based multiclass data segmentation problems, featuring region homogeneity terms, supervised information and/or certain constraints or penalty terms acting on the class sizes. Particular applications include semi-supervised classification of high-dimensional data and unsupervised segmentation of unstructured 3D point clouds. Theoretical analysis indicates that the convex relaxation closely approximates the original NP-hard problems, and these observations are also confirmed experimentally. An efficient duality based algorithm is developed that handles all constraints on the labeling function implicitly. Experiments on semi-supervised classification indicate consistently higher accuracies than related local minimization approaches, and considerably so when the training data are not uniformly distributed among the data set. The accuracies are also highly competitive against a wide range of other established methods on three benchmark datasets. Experiments on 3D point clouds acquired by a LaDAR in outdoor scenes, demonstrate that the scenes can accurately be segmented into object classes such as vegetation, the ground plane and human-made structures.

#### Elizabeth Munch, University of Albany-SUNY

*Reeb graphs, Mapper graphs, and Metrics*

In order to investigate complex data, Topological Data Analysis (TDA) provides a suite of tools which give methods to investigate the structural properties of the data. These often come in the form of topological signatures which summarize some aspect of the data, allowing for a quantitative measure of the shape. One of these signatures, Mapper [Singh et al 2007], takes as input data points with pairwise distance information, along with a real-valued function on the data, and returns a graph which encodes information about the connected components of the space restricted to nearby function values. This tool has been shown to be incredibly powerful on real data sets, including breast cancer genetic data, disease recovery, and diabetes, to name a few. It has been particularly useful for exploratory data analysis by giving a visualize able summary of the data for use by domain scientists. In this talk we will discuss Mapper, its mathematical properties, and methods for comparison of these objects. We will also discuss the convergence of Mapper to the underlying ground truth information, the Reeb graph.

#### Takashi Owada, Purdue University

*Limit Theorems for Hyperbolic Random Geometric Graph Model*

We shall consider a geometric graph model on the “hyperbolic” space, which is characterized by a negative Gaussian curvature. Among several equivalent models representing the hyperbolic space, we treat the most commonly used d-dimensional Poincare ball model. One of the main characteristics of geometric graphs on the hyperbolic space is tree-like hierarchical structure. From this viewpoint, we discuss the asymptotic behavior of subtree counts. It then turns out that the spatial distribution of subtrees is crucially determined by an underlying curvature of the space. For example, if the space gets flatter and closer to the Euclidean space, subtrees are asymptotically scattered near the center of the Poincare ball. On the contrary, if the space becomes “more hyperbolic” (i.e., more negatively curved), the distribution of trees is asymptotically determined by those concentrated near the boundary of the Poincare ball. This is joint work with Yogeshwaran D. at Indian Statistical Institute.

#### Victor Panaretos, EPFL

*Fréchet Means and Procrustes Analysis in Wasserstein Space*

Given a collection of diffuse random measures on R^d, we consider two problems at the interface of functional and geometrical data analysis: constructing their optimal multicoupling in a mean square sense; and of determining their Fréchet mean with respect to the Wasserstein metric. Though these problems admit explicit solutions when d=1, they are considerably more challenging when d>1. To attack them, we exploit the geometry of Wasserstein space, and specifically the relation between its tangent bundle and the class of optimal coupling maps. This allows us to determine the gradient of the Fréchet functional and show that the two problems at hand represent two sides of the same coin: gradient descent for the Fréchet mean reduces to a Procrustes algorithm, requiring only successive solution of pairwise coupling problems, thus being practically feasible; and the optimal maps obtained as a by-product of the Procrustes algorithm can be used to explicitly construct the optimal multi-coupling. We show how these results can be used in order to consistently solve the problem of registration of warped spatio-temporal point processes in an entirely nonparametric fashion, and provide an asymptotic analysis. (Based on joint work with Yoav Zemel, EPFL).

#### Jose Perea, Michigan State University

*Topological time series analysis and learning*

Time varying observations are ubiquitous in today’s data rich world; examples include real-valued time series, video data, dynamic networks, etc. We will describe how techniques from nonlinear time series analysis and computational topology can be combined to generate topological summaries describing the dynamics underlying these systems. In order to leverage these topological features for learning purposes, we will describe theoretical results characterizing compactness in the space of persistence diagrams and present a Weierstrass approximation type theorem for compactly supported functions on persistence diagrams. Several applications will be presented.

#### Antonio Rieser, Centro de Investigación en Matemáticas

*Homotopy theory for data and simplicial complexes*

A nearly ubiquitous assumption in modern data analysis is that a given high-dimensional data set concentrates around a lower dimensional space. Recently, a great deal of attention has been focused on how to use point samples from a metric measure space to estimate the topological and geometric invariants of the space, and on applying the resulting algorithms to real data sets. Most techniques for studying the topology of data, in particular persistent homology, proceed by considering families of topological spaces which were in some way thicker than the original set. In this talk, we ask instead show that a non-trivial homotopy theory may be constructed directly on sets of points. We then show that similar constructions also apply to a variety of combinatorial objects, and give several computations for homotopy groups on point clouds, graphs, and simplicial complexes.

## Posters

#### Erik Jose Amezquita Morataya, Universidad de Guanajuato

*Efficient object classification using Euler Characteristic*

An important problem faced in archaeology is to establish an efficient computer-based object classification algorithm, however, most of such classifications might be dependent on users’ subjectivities, hampering its implementation. Hence, we propose classification criteria based on the object’s topological and geometrical properties, such criteria being more robust and objective. This particular project is based on the idea of Euler Characteristic Graph (ECG), which summarizes the topological-geometrical evolution of an object. Finally, based on it, a new possible classification of pre-Columbian masks is suggested.

#### Feng Bao, University of Tennessee, Chattanooga

*Complex optimization for big computational and experimental neutron datasets*

We present a framework to use high performance computing to determine accurate solutions to the inverse optimization problem of big experimental data against computational models. We demonstrate how image processing, mathematical regularization, and hierarchical modeling can be used to solve complex optimization problems on big data. We also demonstrate how both model and data information can be used to further increase solution accuracy of optimization by providing confidence regions for the processing and regularization algorithms. We use the framework in conjunction with the software package SIMPHONIES to analyze results from neutron scattering experiments on silicon single crystals, and refine first principles calculations to better describe the experimental data.

#### Gilberto Flores and Yair Adan Hernandez, Universidad de Guanajuato-CIMAT

*TDA of Regions with Highest Probability on Manifolds*

Simulating on manifolds is relevant in Topological Data Analysis (TDA) when testing the behavior of several algorithms. This is shown, for example, in Otter et al. [3].

The poster is divided in two. In the first part, we present the definition of uniformity with respect to a measure and we discuss three methods of simulating on a parametrized manifold. The first one is a simple and natural simulation method: to sample from the uniform distribution on the parametrization’s domain. The second method, proposed by Diaconis et al. [1], allows us to sample from the uniform distribution, with respect to the Hausdorff measure, on the manifold. The third method, based on the circular law for random matrices, produces point cloud data with repulsion. Next, we introduce the Hausdorff distance and a concentration inequality shown by Fasy et al. [2] in order to analyze how a cloud point covers the manifold as the number of points increases.

In the second part we show examples of simulations on the 3-petal rose, 2-Torus, 2-Sphere and Klein Bottle. We compare point clouds generated from the different methods using persistence diagrams, generated from the Rips filtration and a filtration on the upper level sets from the kernel density estimator. Finally, we discuss about the concentration inequality with simulation from diferent distributions on the Klein Bottle.

References[1] Diaconis, P., S. Holmes, and M. Shahshahani: Sampling from a Manifold. Advances in Modern Statistical Theory and Applications: a Festschrift in honor of Morris L. Eaton, 10:102–125, 2013.[2] Fasy, B.T., F. Lecci, A. Rinaldo , L. Wasserman, S. Balakrishnan, and A. Singh: Confidence Sets for Persistence Diagrams. The Annals of Statistics, 42(6):2301–2339, 2014.[3] Otter, N., M. A. Porter, U. Tillmann, P. Grindrod, and H. A. Harrington: A roadmap for the computation of persistent homology. ArXiv e-prints, June 2015.

#### David Keffer, University of Tennessee, Knoxville

*3D-3D Point Set Registration to Unknown References: Applications in Materials Science*

3D-3D Point Set Registration to Unknown References: Applications in Materials Science Keffer, D.J., McNutt, N.W., Rios, O., Maroulas, V. Point Set Registration in three dimensions is becoming increasingly relevant as a means of extracting structure at the atomic level from data sets containing tens of millions of atomic coordinates from experiment and simulation. Two challenges are investigated, which involve extracting a distribution of structures when the underlying reference, or average structure, is known (as possible from x-ray diffraction for crystalline materials) or unknown (as in the case of highly disordered or amorphous materials). The most complete experimental data sets come from atomic probe tomography (APT), but such data sets contain significant sources of error, including missing atoms and uncertainty in the coordinates. We describe efforts to account for this noise. We present two examples: obtaining the atomic-level structure from experimental APT data of high entropy alloys and from simulated data of lithium-ions in battery anodes composed of hierarchically structured carbon composites. McNutt, N.W., Rios, O., Maroulas, V., Keffer, D.J., “Interfacial Li-ion Localization in Hierarchical Carbon Anodes”, Carbon 111 2017 pp. 828-834. doi: 10.1016/j.carbon.2016.10.061. McNutt, N.W., McDonnell, M.T., Rios, O., Keffer, D.J., “Li-ion Localization and Energetics as a Function of Anode Structure”, ACS Appl. Mater. Interfaces, 9(8) 2017 pp. 6988–7002. doi: 10.1021/acsami.6b13748

#### Andrew Marchese, University of Tennessee, Knoxville

*Signal Classification with a point process distance on the space of Persistence Diagrams*

We consider the problem of signal classification. Through the use of delay-embedding and persistent homology, the signal classification problem is transformed into a persistence diagram classification problem. We propose a new distance on the space of persistence diagrams and introduce a classification scheme utilizing it. This distance takes into consideration the differences in small persistence points among persistence diagrams. Classification using this distance in benchmarked in both synthetic data and real acoustic signals, and outperforms current signal classification techniques. The work is joint with V. Maroulas.

#### Joshua Mike, University of Tennessee, Knoxville

*Combinatorial Hodge Theory for Equitable Kidney Paired Donation*

The problem of Kidney Paired Donation (KPD) has traditionally been approached within an integer programming framework. Here we adopt computational topology methods to find kidney exchange cycles. We employ combinatorial Helmholtz decomposition to split the edge flow describing the KPD pool into three parts. The curl portion of the flow represents local cycles and is trivial here. The gradient portion creates a scoring that we use to measure inequity in the kidney exchange. This scoring measures typical cases of disparity within a KPS pool, specifically that under-demanded pairs and highly sensitized patients have lower scores than typical patient-donor pairs. The last portion of the decomposition is used to guide our search for kidney exchange cycles by capturing the 1-cohomology of the kidney exchange graph and investigating the tendency of the donations to occur in cycles. Further results demonstrate that PD pair score and the change to obtain a kidney are positively correlated when using top trading cycles and chains; in contrast, we show that our method eliminates disparity in a KPD pool; i.e., the chance to obtain a kidney through our method is independent of score. This is joint work with V. Maroulas.

#### Xiaoyang Pan, University of Tennessee, Knoxville

*Asymptotic behaviors of the optimal filter for nonlinear dynamical systems with Levy noises*

The learning of spatio-temporal data under a partially observed framework modelled by stochastic dynamical systems has increasingly become an important direction in data science. With applications to finances, wireless networks, etc, a stochastic dynamical system usually consists of a latent process of interest and a related process that gives the observable data. For such a model, optimal filtering provides a manner of data assimilation to estimate the targeting processes. In this work, a systematic treatment of the nonlinear filter is concerned with establishing the uniqueness for the well-known Zakai equation of the optimal filter for a general Levy system. Upon the establishment we further study the probabilistic behavior of the optimal filter with small signal perturbations by deriving a large deviation principle, which offers asymptotic approximations of the probabilities as well as simulations of rare events regarding the objective spatial-temporal process.

#### Rolando Rojo, University of Guanajuato

*Estimation of linear-circular densities with an application to camera trap data*

Camera trap data contain detected activity times of individuals of species. These are typically analyzed via circular densities in order to study patterns of hourly activity and the overlapping between predators and preys. Here, we incorporate moonlight luminance, computed by astronomical considerations, and seek to understand its joint behavior in regard to species activity. The resulting object of interest is a linear-circular density (i.e. a density on a cylinder). A semiparametric family of densities is considered, and estimation results are shown for a specific predator-prey pair from the biological station in Chamela, Mexico. Biological insight appears to be gained from the results. Joint work with Miguel Nakamura (Centro de Investigación en Matemáticas, Guanajuato, Mexico), and Enrique Martínez-Meyer and Gemma Abisay Ortiz (Centro del Cambio Global y Sustentabilidad en el Sureste, Villahermosa, Tabasco, Mexico)

#### Érika B. Roldán Roa, CIMAT

*Random Lattice Animals*

Lattice Animals are combinatorial structures defined on a lattice in a metric space. We use some probability distributions on different models, including the uniform distribution, to be able to study their topology. This is a stochastic topological model over cubical complexes.

#### Ioannis Sgouralis, Arizona State University

*Biophysical inference with Bayesian non-parametric methods*

Single molecule experiments allow assessment of in vivo processes even at the molecular scale. Although of central importance in modern Biophysics, the interpretation of the provided measurements is challenging due to excessive noise levels and a lack of physical understanding. For this, single molecule measurements are commonly modeled and analyzed with Bayesian methods; however, the traditional methods entail a finite number of parameters and so they may resolve only the simplest biomolecules. Instead, non-parametric approaches are needed for complex biomolecular systems since only non-parametrics allow full posterior inference without assuming a pre-specified or fixed number of potential states. This characteristic makes them ideal alternatives to the existing methods which are based on model selection and information criteria. Here we adapt the infinite hidden Markov model for the analysis of sequential —time series— data. The infinite hidden Markov model identifies the number of states required for each dataset without recourse to any additional steps while self-consistently learns every parameter learned by the traditional hidden Markov model. Despite this generality, simple experimental features (such as drift), common to single molecule time traces result in data over-interpretation and the recruitment of artifactual states. Here we present also modifications that are required for the analysis of time traces from force-spectroscopy and FRET experiments that avoid these pitfalls.

#### Rupei Xu, University of Texas, Dallas

*A Systematic Study of the System with Noisy Data: Inverse Optimization, Machine Learning and Game Theory*

Given a convex optimization problem, the inverse optimization problem is to determine unknown parameters based on the knowledge of its optimal solutions. Due to prohibitively expensive or inaccuracy, bounded rationality, model mismatching, etc, there are noises in the problem. Traditional optimization and inverse optimization techniques may cause statistically inconsistent or computational inefficiency. In this paper, we use a systematic way to study the system in different aspects of Inverse Optimization, Machine Learning and Game Theory, investigating the connections of parameters and optimal solutions vertically and horizontally.