Why OCEAN?
Machine learning and artificial intelligence (AI) have made major strides in the last two decades. This progress has been driven by a dramatic increase in data and computing capacity, within a centralised paradigm that requires aggregating data in a single location where massive computing resources can be brought to bear.
This fully centralised machine learning paradigm is, however, increasingly at odds with real-world use cases, for reasons that are both technological and societal. In particular, centralised learning risks exposing user privacy, makes inefficient use of communication resources, creates data-processing bottlenecks, and may lead to a concentration of economic and political power.
It thus appears timely to develop the theory and practice of a new form of machine learning that targets heterogeneous, massively decentralised networks, involving self-interested agents who expect to receive value (rewards, incentives) for their participation in data exchanges.
In response to these challenges, OCEAN is an ERC-funded project that aims to develop statistical and algorithmic foundations for systems involving multiple incentive-driven learning and decision-making agents, with uncertainty quantification treated predominantly from a Bayesian perspective. OCEAN will study the interaction of learning with market constraints (scarcity, fairness, privacy), connecting adaptive microeconomics and market-aware machine learning. To achieve these goals, OCEAN will need to develop new statistical and machine-learning methodologies, together with algorithms for sampling and optimisation that are both scalable to large problems and backed by provable theoretical guarantees.
What scientific challenges do we aim to address?
The OCEAN vision draws together two notions of intelligence: the microeconomic and the statistical. There have been important historical linkages between computation, economics, and statistics, but these linkages have mostly been pairwise. The OCEAN challenge involves completing the triad, which means more than simply gluing together existing concepts.
Statistical inference and prediction within large network of agents
Performing statistical tasks in the presence of numerous, interlinked agents requires dealing with data complexity (data can be high-dimensional, with multi-scale temporal and spatial dependence), communication and computation constraints (due to limited bandwidth and hardware), and statistical heterogeneity (the volume of data may be imbalanced and its distribution may vary across agents). Finally, there is a tension between privacy, which necessitates minimising data transfer, and effective inference, which is facilitated by increased data communication. In some situations, privacy protection becomes a matter of degree, to be traded off against economic benefits and costs in a controlled manner.
Economic value of data and welfare-maximizing mechanisms in the presence of rational agents
Dealing with self-interested agents requires tailoring incentives so that agents are guaranteed to benefit from sharing data and taking part in a collaborative statistical model. This implies developing a measure of the economic and inferential value of data (for instance, based on improvement in accuracy or reduction of uncertainty in model parameters), understanding the cost of sharing data (including communication and privacy costs and loss of competitiveness), and designing mechanisms that fairly distribute the value of the analysis among participants, for example through a market or an information-sharing policy. Addressing these issues requires introducing tools and concepts from the contract theory and mechanism design literature.
Autonomous and adaptive decision-making within time-varying environments
Real-world decision-making under uncertainty is not merely the product of data analysis on a computer; rather, it is an interaction between data and the preferences, knowledge, and skills of multiple decision-makers. The ensuing decisions result from each agent’s local data (acquired knowledge) but also from other agents’ decisions. These scenarios add a Bayesian game-theoretic dimension to learning problems. OCEAN will build a holistic statistical infrastructure in which agents have to explore to learn their preferences. This implies tackling novel design-of-experiment problems that have seldom been considered so far, and creates a new challenge for the field of sequential statistics.
What is our research agenda?
The science behind OCEAN is a blend of new methods from numerical probability, Bayesian computational statistics, machine learning, distributed algorithms, multi-agent systems, and game theory. Advancing theory is critical to our proposal, as quantitative and rigorous statements about performance are essential to formulate meaningful trade-offs between computational, economic, and inferential goals.
Optimization and dynamic systems
The classical optimization toolbox offers an insufficiently rich corpus of methods for analyzing connections and trade-offs between computational, inferential, and strategic goals. A richer toolbox can be achieved by treating constraints as forces rather than as geometric regions in the configuration space. We want to extend previous work on first-order methods to the more powerful concepts of accelerated gradient descent and to the general setting of variational inequalities.
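To make the accelerated first-order methods mentioned above concrete, here is a minimal sketch of Nesterov-style accelerated gradient descent on a smooth quadratic. The objective, step size, and iteration count are illustrative choices for this sketch, not part of the OCEAN methodology.

```python
import numpy as np

def nesterov_agd(grad, x0, step, n_iter=500):
    """Accelerated gradient descent (Nesterov momentum) for a smooth convex function."""
    x = y = np.asarray(x0, dtype=float)
    t = 1.0
    for _ in range(n_iter):
        x_next = y - step * grad(y)                      # gradient step at the look-ahead point
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)   # momentum extrapolation
        x, t = x_next, t_next
    return x

# Toy example: minimise f(x) = 0.5 * x^T A x with A = diag(1, 10); step = 1/L with L = 10
A = np.diag([1.0, 10.0])
x_star = nesterov_agd(lambda x: A @ x, np.array([5.0, 5.0]), step=0.1)
```

The momentum term is what improves the worst-case convergence rate of plain gradient descent from O(1/k) to O(1/k²) on smooth convex problems.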
Bayesian inference and sampling
We will develop uncertainty quantification methods within a coherent Bayesian framework that are applicable to the general federated learning (FL) setting. The FL setting requires us to reframe the basic Bayesian inferential paradigm to cope with high dimensionality, insufficient information, and heterogeneity. We will also need to develop theory and methods for efficient approximate Bayesian computation, as well as devise new classes of communication- and computation-efficient stochastic gradient Markov chain Monte Carlo algorithms.
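The simplest member of the stochastic gradient Markov chain Monte Carlo family is stochastic gradient Langevin dynamics (SGLD), which replaces the full-data gradient of the log posterior by a rescaled minibatch gradient and injects Gaussian noise. A toy sketch on a one-dimensional Gaussian mean model (the model, prior, step size, and batch size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
data = rng.normal(2.0, 1.0, size=N)              # y_i ~ N(theta, 1), true theta = 2

def grad_log_post(theta, batch):
    """Stochastic gradient of the log posterior: N(0, 10) prior, N(theta, 1) likelihood."""
    grad_prior = -theta / 10.0
    grad_lik = (N / len(batch)) * np.sum(batch - theta)   # minibatch gradient, rescaled to full data
    return grad_prior + grad_lik

step = 1e-4
theta, samples = 0.0, []
for _ in range(5000):
    batch = rng.choice(data, size=50, replace=False)
    # Langevin update: half a gradient step plus injected Gaussian noise
    theta += 0.5 * step * grad_log_post(theta, batch) + np.sqrt(step) * rng.normal()
    samples.append(theta)

post_mean = np.mean(samples[1000:])              # discard burn-in
```

Only a minibatch is touched per iteration, which is what makes such samplers attractive when data are large or distributed; the federated and heterogeneous variants OCEAN targets go well beyond this sketch.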
Federated learning
Most federated learning methods focus on predictive approaches. Our objective is to extend their scope to embrace statistical inference on complex models. Our specific objectives are (a) communication efficiency beyond convex risk minimisation, with new compression strategies and novel aggregation rules; (b) FL beyond stochastic gradient descent, to address complex inference problems; and (c) Bayesian FL methods, to provide a complete inferential toolbox.
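For readers unfamiliar with the baseline that these objectives extend, here is a minimal sketch of federated averaging (FedAvg) on a least-squares problem: each client runs a few local gradient steps on its own data, and the server averages the resulting models. Client data, step sizes, and round counts are toy choices.

```python
import numpy as np

def local_sgd(w, X, y, step=0.1, epochs=10):
    """A few local gradient epochs on least-squares loss, run on one client's data."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - step * grad
    return w

def fed_avg(clients, w, rounds=50):
    """Server loop: broadcast the model, collect local updates, average by data size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    for _ in range(rounds):
        local = [local_sgd(w.copy(), X, y) for X, y in clients]
        w = np.average(local, axis=0, weights=sizes)     # data-size-weighted average
    return w

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(40, 2))
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=40)))
w_hat = fed_avg(clients, np.zeros(2))
```

Raw data never leave the clients; only model parameters are exchanged. The objectives above target precisely what this sketch lacks: compression of the exchanged updates, non-SGD local solvers, and full Bayesian uncertainty rather than a point estimate.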
Privacy
Designing methods that provide strong privacy guarantees is key in many real-world inference problems. This requires developing a framework for inference that unifies cryptographic and statistical concepts, in order to maximise the learning potential of data without compromising privacy.
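One standard statistical notion of a privacy guarantee is differential privacy, whose simplest instantiation is the Laplace mechanism: add noise calibrated to the query's sensitivity and to the privacy budget ε. A minimal sketch (the income data and clipping bound are made-up illustrative values):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release a scalar query answer with epsilon-differential privacy."""
    scale = sensitivity / epsilon        # Laplace noise scale b = Delta / epsilon
    return value + rng.laplace(0.0, scale)

rng = np.random.default_rng(0)
incomes = np.array([30.0, 45.0, 60.0, 52.0])
# Mean query on values clipped to [0, 100]: one individual can shift the mean by at most 100/n
n = len(incomes)
private_mean = laplace_mechanism(incomes.mean(), sensitivity=100.0 / n, epsilon=1.0, rng=rng)
```

Smaller ε means stronger privacy but noisier answers, which is exactly the kind of privacy-utility trade-off, here purely statistical, that a unified cryptographic-statistical framework must reason about.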
Economic value of data and incentives
This is an emerging domain that poses many exciting research challenges blending Bayesian statistics and economic concepts. A first challenge is to formalise the concept of the “economic” value of data, in terms of a particular inference or prediction problem, given the data already available. From there, we will be able to design and investigate structures of data-sharing markets, and to promote long-term stability in data federations as well as social welfare.
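One classical candidate for fairly dividing the value created by pooled data is the Shapley value from cooperative game theory: each participant is paid their average marginal contribution over all orders of joining the coalition. A toy sketch with a made-up coalition value function (accuracy gain growing with total data volume, with diminishing returns):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each data holder under coalition value function `value`."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                # Probability that exactly this coalition precedes p in a random ordering
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value(set(coalition) | {p}) - value(set(coalition)))
    return phi

# Hypothetical data holders and a toy value function v(S) = sqrt(total data volume in S)
data_sizes = {"A": 100, "B": 300, "C": 600}
v = lambda S: sum(data_sizes[p] for p in S) ** 0.5
phi = shapley_values(list(data_sizes), v)
```

The payments sum exactly to the value of the grand coalition (the efficiency axiom), which is one reason Shapley-style schemes are a natural starting point for data markets; exact computation is exponential in the number of participants, so scalable approximations are themselves a research question.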
Strategic experimentation
We want to address learning problems in which agents collect data relevant to decision-making and learn from others’ experiments. Specific objectives include devising strategies for multi-agent multi-armed bandits in which the agents’ action rewards are interdependent, due to scarcity and congestion, and for Markov games, with a special emphasis on scenarios in which data and control are decentralised and where multiple, possibly conflicting, objectives must be met.
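The single-agent building block underlying these objectives is the stochastic multi-armed bandit, for which UCB1 is the textbook exploration strategy: pull the arm whose empirical mean plus optimism bonus is largest. A minimal sketch on Bernoulli arms (arm means and horizon are illustrative):

```python
import numpy as np

def ucb1(means, horizon, rng):
    """UCB1 on Bernoulli arms: pull the arm with the highest upper confidence bound."""
    n_arms = len(means)
    counts = np.zeros(n_arms)
    totals = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:                                   # play each arm once to initialise
            arm = t
        else:
            ucb = totals / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        counts[arm] += 1
        totals[arm] += rng.random() < means[arm]         # Bernoulli reward
    return counts

rng = np.random.default_rng(0)
pulls = ucb1([0.2, 0.5, 0.8], horizon=2000, rng=rng)     # best arm is index 2
```

The multi-agent settings OCEAN targets break the key assumption of this sketch: when rewards are coupled through scarcity and congestion, one agent's exploration changes what the others observe.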
Online matching
We intend to devise matching processes for agents within a dynamic exchange network. The core challenges are integrating relevant local structures, improving algorithmic performance with ML-driven oracles, and building private and fair matching mechanisms.
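As a baseline for what "online" means here, consider greedy online bipartite matching: agents arrive one by one and must be matched irrevocably to a free compatible resource, without knowledge of future arrivals. A minimal sketch with hypothetical agent and resource names:

```python
def greedy_online_matching(arrivals):
    """Greedy online bipartite matching: each arriving agent takes the
    first compatible offline resource that is still free (irrevocably)."""
    match = {}                                  # resource -> arriving agent
    for agent, compatible in arrivals:
        for resource in compatible:
            if resource not in match:
                match[resource] = agent
                break                           # matched; decision cannot be revised
    return match

# Arriving agents and the offline resources each is compatible with
arrivals = [("u1", ["a", "b"]), ("u2", ["a"]), ("u3", ["b", "c"])]
m = greedy_online_matching(arrivals)
```

Greedy is 1/2-competitive against the best offline matching (here it matches u1 to a too early, leaving u2 unmatched); improving on such baselines with ML-driven oracles, while adding privacy and fairness constraints, is exactly the direction stated above.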