Learning pure quantum states (almost) without regret

Josep Lumbreras¹ josep.lumbreras@u.nus.edu Mikhail Terekhov² mikhail.terekhov@epfl.ch Marco Tomamichel^1,3 marco.tomamichel@nus.edu.sg ¹Centre for Quantum Technologies, National University of Singapore, Singapore ²School of Computer and Communication Sciences, EPFL, Switzerland ³Department of Electrical and Computer Engineering, National University of Singapore, Singapore

(June 5, 2025)

Abstract

We initiate the study of sample-optimal quantum state tomography with minimal disturbance to the samples. Can we efficiently learn a precise description of a quantum state through sequential measurements of samples while at the same time making sure that the post-measurement state of the samples is only minimally perturbed? Defining regret as the cumulative disturbance of all samples, the challenge is to find a balance between the most informative sequence of measurements on the one hand and measurements incurring minimal regret on the other. Here we answer this question for qubit states by exhibiting a protocol that for pure states achieves maximal precision while incurring a regret that grows only polylogarithmically with the number of samples, a scaling that we show to be optimal.

1 Introduction

In this work, we approach quantum state tomography from a new angle. Given sequential access to a finite number of samples of a quantum state, our goal is not only to accurately learn a classical description of the state but also to use measurements that disturb the samples as little as possible. Generally these two goals are incompatible, and we are thus interested in tomography algorithms that find an optimal balance between them. We call this setting quantum state tomography with minimal regret.

Minimizing disturbance is important in many real-world scenarios where the samples that we use for tomography are in fact resources for another tasks — and we thus want to learn the state in a way that is as non-intrusive as possible, ensuring that the post-measurement states remain useful for their intended purpose. An example of this occurs in quantum key distribution, where tomography can be used to keep reference frames aligned during a run but any disturbance due to tomographic measurements will induce bit errors in the correlations used to extract a secret key. Disturbance is also relevant for state-agnostic resource distillation where resourceful states might be destroyed by tomographic measurements but learning the unknown state is crucial since optimal extraction protocols generally depend on its description.

In both cases we encounter a fundamental trade-off between exploration (learning the state) and exploitation (using the samples for another purpose). These types of trade-offs are fundamental to the study of adaptive algorithms in machine learning, and our work establishes a strong link between quantum tomography and the classical multi-armed bandit model in reinforcement learning. In fact, one of the main technical ingredients of the present work is a classical bandit algorithm by some of us [16], originally inspired by shot noise in quantum mechanics.

To illustrate this connection from a physics perspective, it helps to reflect on how measurements disturb quantum systems. A defining feature of quantum theory is that measurements generally disturb the system being measured. But what does this disturbance intuitively mean? To illustrate this, consider a qubit prepared in the pure state

\displaystyle|\psi\rangle=\sqrt{1-\epsilon^{2}}|0\rangle+\epsilon|1\rangle,

(1)

with $\epsilon\in[0,1]$ . A projective measurement in the computational basis $\{|0\rangle,|1\rangle\}$ collapses the state to $|0\rangle$ with probability $1-\epsilon^{2}$ , and to $|1\rangle$ with probability $\epsilon^{2}$ . If $\epsilon=0$ or $\epsilon=1$ , the post-measurement state coincides with the initial state with certainty—there is no disturbance. More generally, when $\epsilon\approx 0$ or $\epsilon\approx 1$ , the post-measurement state remains close to the original one with high probability, indicating low disturbance. In contrast, for $\epsilon=1/\sqrt{2}$ , the state is “maximally far” in the measurement basis, and the outcome is maximally uncertain; the post-measurement state is always far to the initial state, signifying maximal disturbance. This simple example illustrates how disturbance is linked to randomness: measurements that induce minimal disturbance tend to yield more deterministic outcomes, while those that induce maximal disturbance produce outcomes with higher variance.

But how can one perform low-disturbance measurements without prior knowledge of the state? With access to only a single copy, it is fundamentally impossible to design a measurement that avoids disturbing the state while still extracting useful information. However, the situation changes when multiple identical copies are available. In that case, one could strategically use some of them to gain partial information about the state and adapt future measurements to be less disturbing. This naturally leads to the central question of our work:

Given access to a finite sequence of an unknown qubit system, what is the best strategy for performing single-copy projective measurements that extract as much information as possible while minimizing the overall disturbance?

The notion of disturbance is a foundational concept in quantum mechanics and has been explored from various perspectives, notably in works that reformulate the uncertainty principle to quantify the trade-off between measurement-induced disturbance and information gain [18, 5]. Another framework where disturbance is studied are weak measurements, which aim to minimally disturb the quantum system while still providing partial information about it. This idea dates back to the seminal work [2], and has since become a central tool in understanding the interplay between information gain and quantum disturbance. However, performing these measurements typically comes at the cost of low information gain: the less disturbing the measurement, the less informative it is about the quantum state, making weak measurements unsuitable for tasks that demand accurate estimation. In contrast, projective measurements—which are the focus of this work—provide maximal information about the system but often cause significant disturbance, collapsing the state entirely.

In our setting, we are not concerned with the disturbance of a single copy, but rather with the cumulative disturbance across a sequence of identically prepared quantum states. Our goal is to use each copy as effectively as possible to extract information and achieve sample-optimal estimation of the underlying state. This naturally motivates the design of adaptive measurement strategies that balance the tradeoff between information gain and disturbance over time.

Another related concept is that of gentle measurements, which were formalized recently in [1], but have their roots in earlier work, notably the “gentle measurement” lemma introduced in [20]. These are measurements that guarantee, for certain sets of states, that the post-measurement state remains close to the original one, while still allowing useful information to be extracted. Although this is related to weak measurements, an important distinction is that a gentle measurement is considered weak only if it remains non-disturbing across all states (not only a set). Moreover, this framework does not address how to adaptively learn an unknown state using a sequence of projective measurements, which are typically more informative than gentle measurements [6]. Our contribution does not lie in proposing a new class of measurements, but rather in developing adaptive strategies that employ projective measurements in a way that minimizes cumulative disturbance across the sequence of quantum states.

Formally, we consider a sequential decision-making scenario in which the learner has access to $T$ independent copies of an unknown qubit state $\rho$ . At each round $t\in[T]$ , the learner selects a measurement direction $|\psi_{t}\rangle$ and performs a projective measurement in the basis $\{|\psi_{t}\rangle,|\psi_{t}^{c}\rangle\}$ . The outcome $r_{t}\in\{0,1\}$ is sampled according to Born’s rule, $\Pr[r_{t}=1]=p_{t}:=\langle\psi_{t}|\rho|\psi_{t}\rangle.$ The measurement outcome determines the post-measurement state $\tilde{\psi}_{t}\in\{|\psi_{t}\rangle,|\psi_{t}^{c}\rangle\}$ , with $r_{t}=1$ corresponding to the state collapsing to $|\psi_{t}\rangle$ , and $r_{t}=0$ to its orthogonal complement.

We quantify the disturbance introduced by the learner through the cumulative expected infidelity between the unknown state $\rho$ and the resulting post-measurement state $\tilde{\psi}_{t}\in\{|\psi_{t}\rangle,|\psi^{c}_{t}\rangle\}$ , compared to the minimal possible disturbance, which occurs when the measurement direction $\psi_{t}$ is aligned with the eigenvector corresponding to the largest eigenvalue of $\rho$ . Formally, we define the cumulative disturbance as

\displaystyle\textnormal{Disturbance}(T):=\sum_{t=1}^{T}\left(\mathbb{E}[1-F(% \rho,\tilde{\psi}_{t})]-\min_{\tilde{\psi}}\operatorname{\mathbb{E}}[1-F(\rho,% \tilde{\psi})]\right),

(2)

where $F(\rho,\sigma):=\left(\operatorname{Tr}\sqrt{\sqrt{\rho}\sigma\sqrt{\rho}}% \right)^{2}$ denotes the quantum fidelity. Note that the second term in the definition of disturbance, $\min_{\psi}(1-F(\rho,\psi))$ , is constant across rounds and acts as a normalization ensuring that the disturbance vanishes when the learner selects the optimal, least-disturbing measurement direction. In particular, when $\rho$ is pure, this minimum is zero, as an optimal measurement does not alter the state. The expression for the disturbance simplifies to the following closed form,

\displaystyle\textnormal{Disturbance}(T)=\sum_{t=1}^{T}2(\lambda_{\max}(\rho)-% p_{t})(\lambda_{\max}(\rho)+p_{t}-1),

(3)

where $\lambda_{\max}(\rho)$ denotes the largest eigenvalue of $\rho$ .

While the above notion of disturbance is defined with respect to the observed outcome, we could also define it as the cumulative infidelity between the unknown state and the average post-measurement state $\rho_{t}=p_{t}\psi_{t}+(1-p_{t})\psi_{t}^{c}$ , i.e

\displaystyle\textnormal{Disturbance}^{*}(T):=

\displaystyle\sum_{t=1}^{T}1-F(\rho,\rho_{t}).

(4)

Although these two notions of disturbance differ in their interpretation—depending on whether the measurement outcomes are observed or not—their behavior is qualitatively the same: both vanish when the measurement direction $\psi_{t}$ is aligned with the state $\rho$ . In particular, when $\rho$ is a pure state, the two disturbances coincide. Both quantities are controlled by a simpler quantity, the regret, defined as

\displaystyle\textnormal{Regret}(T):=\sum_{t=1}^{T}\left(\lambda_{\max}(\rho)-% \langle\psi_{t}|\rho|\psi_{t}\rangle\right),

(5)

since it can be checked that both disturbances satisfy

\displaystyle\textnormal{Disturbance}(T)=\Theta(\textnormal{Regret}(T))\quad% \text{and}\quad\textnormal{Disturbance}^{*}(T)=\Theta(\textnormal{Regret}(T)),

(6)

which means that minimizing disturbance is essentially equivalent to minimizing regret. Intuitively, regret remains small when the chosen probe directions are closely aligned with the dominant eigenvector of $\rho$ , highlighting that learning the structure of the unknown state is necessary to keep the disturbance low. However, since we are also interested in reconstructing the state, we further require that the learner outputs a final estimate $\hat{\rho}_{T}$ with high fidelity to the true state after $T$ rounds. That is, in addition to minimizing cumulative disturbance or regret, the algorithm must also achieve low estimation error defined as

\displaystyle\textnormal{Err}(T):=1-F(\rho,\hat{\rho}_{T}).

(7)

The regret admits a direct physical interpretation in certain quantum thermodynamic scenarios. In particular in [15] some of the present authors established a connection between regret and the cumulative energy dissipation in quantum state-agnostic work extraction protocols.

Consider a setting in which an unknown source emits identically prepared quantum systems in a fixed state $\rho$ . One can design a battery system that sequentially interacts with each quantum copy to extract work from the system and transfer energy into the battery. If the state $\rho$ is known, the protocol can be tailored to extract work optimally at every step. However, when $\rho$ is unknown, each interaction entails a probability of failure due to the mismatch between the protocol and the true state.

This mismatch can be modeled as performing a projective measurement in a guessed direction (corresponding to the control applied to the battery), and success depends on the alignment of this direction with the actual state. In this context, the regret quantifies the cumulative free energy that is wasted due to not applying the optimal work extraction strategy. This interpretation shows the necessity of performing non-invasive tomography on the fly, using each quantum copy not only as a source of energy but also as a source of information about the unknown state.

We emphasize that the notion of regret defined in (6) is not merely a formal construction, but a meaningful quantity that captures the cumulative disturbance caused by projective measurements. Moreover, it admits a concrete physical interpretation in quantum thermodynamics, where it corresponds to the total energy dissipation in agnostic work-extraction protocols. Thus, regret serves as both an operational and physical measure of performance in settings that require minimally disturbing projective measurements.

Challenges. We note that the task of minimizing the regret (6) is captured by the multi-armed quantum bandit (MAQB) framework introduced in [14] (see also [3]). This framework was the first to formalize the exploration–exploitation trade-off in online learning of quantum state properties using classical algorithms. In particular, it was shown that when the unknown state $\rho$ is mixed, the regret suffers a fundamental lower bound of order $\textnormal{Regret}(T)=\Theta(\sqrt{T})$ , which is nearly tight, as there exist protocols achieving $\textnormal{Regret}(T)=\tilde{O}(\sqrt{T})$ by reducing the problem to a linear stochastic bandit [12] and applying classical bandit algorithms in that setting.

However, this lower bound does not apply when $\rho=|\psi\rangle\!\langle\psi|$ is a pure state. The reason is that the lower bound relies on having vanishing statistical noise in the random outcomes $r_{t}$ , whereas in our setting the shot noise vanishes as the measurement direction $|\psi_{t}\rangle$ approaches the target state $|\psi\rangle$ . In this case, the regret simplifies to

\displaystyle\textnormal{Regret}(T)=\sum_{t=1}^{T}\left(1-F(\psi,\psi_{t})% \right),

(8)

so minimizing regret becomes equivalent to performing online quantum state tomography with minimal infidelity, where the goal is to align each probe direction $\psi_{t}$ as closely as possible to the unknown pure state $\psi$ . It is important to emphasize that this notion of regret minimization is not addressed by standard quantum state tomography algorithms, which typically aim to design measurement schemes optimized to output a single classical estimator $\hat{\psi}_{T}$ minimizing the final estimation error (7), rather than controlling the cumulative error across all measurement rounds. This motivates the following question:

•

Question 1. Can we perform single copy sample-optimal state tomography in infidelity and achieve at the same time sub-linear regret for unknown pure states? How much adaptiveness is needed for this task?

It is important to note that adaptiveness plays a crucial role for algorithms that aim to minimise the cumulative disturbance of the post-measured state. One could try to use one of the existing sample-optimal algorithms in the incoherent setting such as [9, 11, 8], which for $T$ samples achieve infidelity $\textnormal{Err}(T)=O(1/T)$ . However, since these algorithms either use fixed bases or randomized measurements, this inevitably leads to a linear scaling $\textnormal{Regret}(T)=O(T)$ . A natural next step is to consider a simple strategy with one round of adaptiveness, where we use a fraction $\alpha\in[0,1]$ of the copies for state tomography to produce a good estimate $\hat{\psi}$ of the unknown $\psi$ , and use the remaining copies to measure along the estimated direction. Using sample-optimal state tomography algorithms this leads to a regret scaling

\displaystyle\textnormal{Regret}(T)=O\left(\alpha T+(T-\alpha T)\frac{1}{% \alpha T}\right),

(9)

which, optimized over $\alpha$ , gives $\text{Regret}(T)=O(\sqrt{T})$ , but results in a sub-optimal error $\textnormal{Err}(T)=O(1/\sqrt{T})$ .

In [14] it was left open the question whether if for pure states one can find an algorithms with a scaling better than $O(\sqrt{T})$ or find a matching lower bound. Since our problem is closely related to the MAQB framework; we name it the pure-state multi-armed quantum bandit (PSMAQB) and use it to address the following questions at the intersection of quantum state tomography and linear stochastic bandits [12]:

•

Question 2. Can we break the square root barrier for pure states by showing that $\textnormal{Regret}(T)=o(\sqrt{T})$ for the PSMAQB problem?

Achieving a better scaling for the PSMAQB problem would provide a physically motivated linear bandit setting where the square root barrier can be surpassed. The linear bandit model with a noise structure studied in [16] is inspired by shot noise in quantum mechanics; however, as we will discuss later, this setting does not align with the PSMAQB problem. The main challenge lies in designing a new algorithm and techniques that exploit the specific structure of the PSMAQB setting compared to the standard linear bandit problem.

2 Main results

In this work, we provide affirmative answers at the same time to Questions 1 and 2 through the following Theorem.

Theorem 1 (informal).

For any unknown pure qubit state $|\psi\rangle$ , we present an algorithm that achieves

\displaystyle\mathbb{E}\left[\textup{Regret}(T)\right]=O\big{(}\log^{2}(T)\big% {)}.

(10)

Moreover, at each time step $t\in[T]$ , our algorithm outputs an online estimate $|\hat{\psi}_{t}\rangle$ with infidelity scaling as

\displaystyle\mathbb{E}\left[1-|\langle\psi|\hat{\psi}_{t}\rangle|^{2}\right]=% \widetilde{O}\left(\frac{1}{t}\right)

(11)

Both statements also holds with high probability.

To prove Theorem 1 we provide an almost fully adaptive adaptive quantum state tomography algorithm that uses $O(T/\log(T))$ rounds of adaptiveness. The exact algorithm and Theorem can be found in Sections 4 and 5. We say that our algorithm is “online” because it is able to output at each time step $t\in[T]$ an estimator with the almost optimal infidelity scaling $O(\frac{1}{t})$ up to logarithmic factors. Now we sketch the main idea of how our algorithm updates the measurements.

1.

Estimation. At each time step $t\in[T]$ we use the past information of measurements on the direction of $\ket{\psi_{a_{1}}},...,\ket{\psi_{a_{t-1}}}$ and associated outcomes $r_{1},...,r_{t-1}\in\{0,1\}^{\otimes t-1}$ to build a high probability confidence region $\mathcal{C}_{t}$ for the unknown environment $|\psi\rangle$ .
2.

Exploration-exploitation. A batch of measurements is performed, given by the directions of maximum uncertainty of $\mathcal{C}_{t}$ such that they give enough information to construct $\mathcal{C}_{t+1}$ (exploration) and also minimise the regret (6) (exploitation).

For the estimation part, we work with the Bloch sphere representation of the unknown state $\Pi=|\psi\rangle\!\langle\psi|=\frac{I+\theta\cdot\sigma}{2}$ where $\theta\in\mathbb{S}^{2}$ and for $\sigma$ we can take the standard Pauli Basis i.e $\sigma=(\sigma_{x},\sigma_{y},\sigma_{z})$ . For each measurement direction $\Pi_{a_{t}}=|\psi_{a_{t}}\rangle\!\langle\psi_{a_{t}}|$ , our algorithm performs $k$ independent measurements using the same direction, and it builds the following $k$ online weighted least squares estimators of $\theta$ ,

\displaystyle\widetilde{\theta}_{t,i}=V_{t}^{-1}\sum_{s=1}^{t}\frac{1}{\hat{% \sigma}^{2}_{s}(a_{s})}r_{s,i}a_{s}\quad\text{for }i\in[k],

(12)

where $r_{s,i}\in\{0,1\}$ is the outcome of the measurement (up to some renormalization) using the projector $\Pi_{a_{s}}$ with Bloch vector $a_{s}\in\mathbb{R}^{3}$ , $V_{t}=\mathbb{I}+\sum_{s=1}^{t}\frac{1}{\hat{\sigma}^{2}_{s}(a_{s})}a_{s}a_{s}% ^{\mathsf{T}}$ is the design matrix and $\hat{\sigma}^{2}_{s}(a_{s})$ is a variance estimator of the real variance associated to the outcome $r_{s}$ . The key point where we take advantage from the structure of the quantum problem is that the variance of the outcome $r_{a}$ associated to the action $\Pi_{a}$ can be bounded as $\operatorname{\mathbb{V}}[r_{a}]\leq 1-|\langle\psi|\psi_{a}\rangle|^{2}$ . The idea is that through a careful choice of actions we can make the terms $1/\hat{\sigma}^{2}_{s}(a_{s})$ arbitrarily large and “boost” the confidence on the directions $a_{s}$ in the estimators (12) that are close to $\theta$ . However, this comes at a price, and is that in order to get good concentration bounds for our estimator we need to deal with unbounded random variables and finite variance. We address this issue using the new ideas of median of means (MoM) for online least squares estimators introduced in [4, 17, 19]. The construction takes inspiration from the old method of median of means [13, Chapter 3] for real random variables with unbounded support and bounded variance but requires non-trivial adaptation for online linear least squares estimators. Similarly to the real case we use the $k$ independent estimators (12) in order to construct the MoM estimator $\widetilde{\theta}^{\text{\tiny wMoM}}_{t}$ such that we can build a confidence region with concentration bounds scaling as $1-\exp(-k)$ . In particular, the reference we cite for the median of means [19] is a recent theoretical contribution that explicitly posed as an open question the range of settings where this approach could be applied; our work provides a concrete and significant answer in the quantum domain. We give the exact construction in Section 4.1.

\begin{overpic}[percent,width=216.81pt]{algorithm_psmaqb.png} \put(50.0,70.0){\rotatebox{80.0}{$|\psi^{+}_{a_{t}}\rangle$}} \put(70.0,42.0){\rotatebox{0.0}{$|\psi^{-}_{a_{t}}\rangle$}} \put(75.0,85.0){\rotatebox{0.0}{$\mathcal{C}_{t}$}} \put(82.0,70.0){{$|\widehat{\psi}_{t}\rangle$}} \put(90.0,60.0){{$|\psi\rangle$}} \par\end{overpic}

Figure 1: The algorithm at each time step outputs an estimator

|\widehat{\psi}_{t}\rangle

and builds a high-probability confidence region

\mathcal{C}_{t}

(shaded region) around the unknown state

|\psi\rangle

on the Bloch sphere representation. Then uses the optimistic principle to output measurement directions

|\psi^{\pm}_{a_{t}}\rangle

that are close the unknown state

\ket{\psi}

projecting into the Bloch sphere the extreme points of the largest principal axis of

\mathcal{C}_{t}

. This particular choice allows optimal learning of

\ket{\psi}

(exploration) and simultaneously minimizes the regret (exploitation).

For the exploration-exploitation part, we take the ideas that we develop in [16] (see Figure 1). We give the precise action choice in Section 4.2, and here we sketch the main points. We take inspiration from the optimistic principle for bandit algorithms which in short tells us to choose the most rewarding actions with the available information. In order to use this idea, we use the confidence region that we build in the estimation part and we select measurements that align with the (unknown) direction of $\ket{\psi}$ . See Figure 1. Our algorithm also achieves the relation $1-|\langle\psi|\psi_{a_{t}}\rangle|^{2}=O\left(1/\lambda_{\min}(V_{t})\right)$ , where the minimum eigenvalue $\lambda_{\min}(V_{t})$ quantifies the direction of maximum uncertainty (exploration) of our estimator. The maximum eigenvalue $\lambda_{\max}(V_{t})$ quantifies the amount of exploitation. We can relate these two concepts through the Theorem we formally state and prove in [16, Theorem 3], which states that for our particular measurement choice we have $\lambda_{\min}(V_{t})=\Omega(\sqrt{\lambda_{\max}(V_{t})})$ . Using this relation and a careful analysis, we can show that $\lambda_{\max}(V_{t})=\Omega(t^{2})$ which gives $\lambda_{\min}(V_{t})=\Omega(t)$ and the scaling $1-|\langle\psi|\psi_{a_{t}}\rangle|^{2}=O(1/t)$ . We emphasize that the key point that allows to achieve the rate $\lambda_{\min}(V_{t})=\Omega(t)$ is the fact that the variance estimators $\hat{\sigma}^{2}_{s}$ can get as close as possible to zero since the variance of the rewards goes to zero if we select measurements close to $|\psi\rangle$ .

To check the optimality of the regret, we derive a minimax expected regret lower bound based on the optimal quantum state tomography for pure state results in [10]. The proof does not follow directly from [10], and we have to adapt it to the bandit setting.

Theorem 2 (informal).

The cumulative expected regret for any strategy is bounded by

\displaystyle\mathbb{E}\left[\textup{Regret}(T)\right]=\Omega(\log T),

(13)

where the expectation is taken over the probability distribution of rewards and actions induced by the learner strategy and also uniformly over the set of pure state environments.

This result is formally derived in Section 6. There it is also generalized to the $d$ -dimensional case, in which case the bound is given by $\mathbb{E}\left[\textup{Regret}(T)\right]=\Omega(d\log(T/d))$ . The proof relies on the fact that individual actions of a strategy at time $t\in[T]$ can be viewed as quantum state tomographies using $t$ copies of the state. A relation between the fidelity of these tomographies and the regret of the strategy allows us to convert the fidelity upper bound from [10] to a regret lower bound. We use measure-theoretic tools to adapt the proof from [10] to a more general case where the tomography can output an arbitrary distribution of states. We remark that this is a noteworthy result since in [16] they argue how regret lower bound techniques for classical linear bandits fail for noise models with vanishing variance.

2.1 Outlook and open problems

From a quantum state tomography perspective, our work introduces completely new techniques for the adaptive setting, such as the median of means online least squares estimator or the optimistic principle. We expect these techniques to be useful in other quantum learning settings that require adaptiveness, particularly when quantum states serve as resources and must be minimally disturbed during the learning process, such as the state-agnostic work extraction protocols in [15]. Our algorithm achieves a polylogarithmic regret, which is an exponential improvement over all previously known algorithms for quantum tomography which can only achieve such a fidelity by accumulating a linear regret. At a fundamental level, our algorithm goes beyond traditional tomography ideas and shows that is enough to project near the state in order to optimally learn it with minimal disturbance to the samples. From a classical bandit perspective, it is surprising that the setting of learning pure quantum states gives the first non-trivial example of a linear bandit with continuous action sets that achieves polylogarithmic regret. This model motivated our classical work [16] and, jointly with the current work, we establish a bridge between the fields of quantum state tomography and linear stochastic bandits or, more generally, reinforcement learning.

We leave as an open problem the generalization of the algorithm beyond qubits. In particular, our approach relies on the one-to-one correspondence between pure qubit states and the Bloch sphere. While the chosen measurements are specifically designed to work with high-dimensional spheres, for $d>2$ this correspondence is no longer an isomorphism, and it is not straightforward how to generalize the measurement directions.

We also leave open the question of whether other state tomography algorithms, especially those designed to minimize disturbance, such as weak or gentle measurements, can achieve sublinear regret—particularly polylogarithmic regret. We believe that adaptiveness plays a crucial role in any algorithm aiming to minimize the regret.

3 The model

In this section first connect the notions of disturbance and regret we formally state the PSMAQB problem and make a connection with a linear stochastic bandit problem. Then we define a slightly more general model where the key feature is that the variance of the rewards vanishes with the same behaviour as the PSMAQB problem.

3.1 Notation

First, we introduce some basic notation and conventions. Let $[t]=\{1,2,...,t\}$ for $t\in\mathbb{N}$ . For real vectors $x,y\in\mathbb{R}^{d}$ we denote their inner product as $\langle x,y\rangle=x_{1}y_{1}+...+x_{d}y_{d}$ . Given a real vector $x\in\mathbb{R}^{d}$ we denote the 2-norm as $\|x\|_{2}$ and for a real semi-positive definite matrix $A\in\mathbb{R}^{d\times d}$ , $A\geq 0$ the weighted norm with $A$ as $\|x\|^{2}_{A}=\langle x,Ax\rangle$ . The set corresponding to the surface of the unit sphere is $\mathbb{S}^{d-1}=\{x\in\mathbb{R}^{d}:\|x\|_{2}=1\}$ . For a real symmetric matrix $A\in\mathbb{R}^{d\times d}$ we denote $\lambda_{\max}(A)$ , $\lambda_{\min}(A)$ its maximum and minimum eigenvalues respectively. We use the ordering $\lambda_{\min}(A)\leq\lambda_{2}(A),....,\lambda_{d-1}(A)\leq\lambda_{\max}(A)$ for the $i$ -th $\lambda_{i}(A)$ eigenvalue in increasing order. For a random variable $X$ (discrete or continuous) we denote $\operatorname{\mathbb{E}}[X]$ and $\operatorname{\mathbb{V}}[X]$ its expectation value and variance respectively. A random variable $X$ is $\sigma$ -subgaussian if $\forall\lambda\in\mathbb{R},\operatorname{\mathbb{E}}\left[\exp(\lambda X)% \right]\leq\exp\left(\lambda^{2}\sigma^{2}/2\right)$ .

Let $\mathcal{S}_{d}=\{\rho\in\mathbb{C}^{d\times d}:\operatorname{Tr}(\rho)=1,\rho% \geq 0\}$ be the set of quantum states in a $d$ -dimensional Hilbert space $\mathcal{H}=\mathbb{C}^{d}$ and $\mathcal{S}^{*}_{d}=\{\rho\in\mathcal{S}_{d}:\rho^{2}=\rho\}$ the set of pure states or rank-1 projectors. We will use the parametrization given in [7] of a $d$ -dimensional quantum state $\rho_{\theta}\in\mathcal{S}_{d}$ ,

\displaystyle\rho_{\theta}=\frac{\mathbb{I}}{d}+\left(\sqrt{\frac{d(d-1)}{2d^{% 2}}}\right)\theta\cdot\sigma

(14)

where $\theta\in\mathbb{R}^{d^{2}-1}$ , and $\sigma=(\sigma_{1},...,\sigma_{d^{2}-1})$ is a vector of orthogonal, traceless, Hermitian matrices with the normalization condition $\operatorname{Tr}(\sigma_{i}\sigma_{j})=2\delta_{i,j}$ . We will use the subscript $\theta$ in the quantum state $\rho_{\theta}$ in order to denote the vector of the parametrization (14). In particular the normalization is taken such that $\|\theta\|_{2}^{2}\leq 1$ with equality if $\rho_{\theta}$ is pure. Note that the parametrization enforces $\rho^{\dagger}_{\theta}=\rho_{\theta}$ and $\operatorname{Tr}(\rho_{\theta})=1$ . Also there are some extra conditions on the vector $\theta$ regarding the positivity of the density matrix $\rho_{\theta}$ but we will not use them. For two quantum states $\rho,\sigma\in\mathcal{S}_{d}$ the fidelity is $F(\rho,\sigma)=\left(\operatorname{Tr}(\sqrt{\sqrt{\sigma}\rho\sqrt{\sigma}})% \right)^{2}$ and the infidelity $1-F(\rho,\sigma)$ . For a Hilbert space $\mathcal{H}$ , the set of linear operators on it will be denoted by $\operatorname{End}(\mathcal{H})$ . The joint state of a system consisting of $n$ copies of a pure state $\Pi_{\theta}\in\mathcal{S}_{d}^{*}$ is given by the $n$ -th tensor power $\Pi_{\theta}^{\otimes n}\in\operatorname{End}(\mathcal{H}^{\otimes n})$ . Using Dirac notation, we can express $\Pi_{\theta}=|\psi_{\theta}\rangle\!\langle\psi_{\theta}|$ for some normalized $|\psi_{\theta}\rangle\in\mathcal{H}$ . Then, the span of all $n$ -copy states of the form $|\psi_{\theta}\rangle^{\otimes n}$ is called the symmetric subspace of $\mathcal{H}^{\otimes n}$ , denoted by $\mathcal{H}^{\otimes n}_{+}$ . Its dimension is $D_{n}=\binom{n+d-1}{d}$ . The symmetrization operator $\Pi^{+}_{n}\in\operatorname{End}(\mathcal{H}^{\otimes n})$ is the projector onto $\mathcal{H}^{\otimes n}_{+}$ .

3.2 Cumulative disturbance and regret

Here we formally show that the notions of disturbance (2) and (4) for qubits are indeed controlled by the regret defined as in (6).

Lemma 3.

Consider the notion of disturbance defined in (2) then we have that

\displaystyle\textnormal{Disturbance}(T)=\Theta(\textnormal{Regret}(T)),

(15)

where regret is defined as in (6).

Proof.

First, without loss of generality we define $\frac{1}{2}\leq p_{t}=\langle\psi_{t}|\rho|\psi_{t}\rangle$ and using $\psi^{c}=\mathbb{I}-\psi$ we can directly compute

\displaystyle\mathbb{E}[1-F(\rho,\tilde{\psi}_{t})]=2p_{t}\left(1-p_{t}\right),

(16)

and using $p_{t}\leq\lambda_{\max}(\rho)$ we have

\displaystyle\min_{\psi}\operatorname{\mathbb{E}}[1-F(\rho,\tilde{\psi})]=2% \lambda_{\max}(\rho)(1-\lambda_{\max}(\rho)).

(17)

Then using the identity $1=x^{2}+(1-x)^{2}+2x(1-x)$ we have

\displaystyle 2p_{t}\left(1-p_{t}\right)-2\lambda_{\max}(\rho)(1-\lambda_{\max% }(\rho))=2(\lambda_{\max}(\rho)-p_{t})(\lambda_{\max}(\rho)+p_{t}-1).

(18)

By using $p_{t}\leq\lambda_{\max}(\rho)\leq 1$ we have

\displaystyle 2(\lambda_{\max}(\rho)-p_{t})(\lambda_{\max}(\rho)+p_{t}-1)\leq 2% (\lambda_{\max}(\rho)-p_{t}),

(19)

which leads to $\textnormal{Disturbance}(T)\leq 2\textnormal{Regret}(T)$ . Then the converse bound follows simply by using $p_{t}\geq\frac{1}{2}$ in the second factor of (18). ∎

Lemma 4.

Consider the notion of disturbance defined in (4), then we have that

\displaystyle\textnormal{Disturbance}^{*}(T)=\Theta(\textnormal{Regret}(T)),

(20)

where regret is defined as in (6).

Proof.

First, without loss of generality we define $\frac{1}{2}\leq p_{t}=\langle\psi_{t}|\rho|\psi_{t}\rangle$ and then using the clossed formula for the fidelity for qubits and $\rho_{t}=p_{t}\psi+(1-p_{t})\psi^{c}_{t}$ , we have

	$\displaystyle F(\rho,\rho_{t})$	$\displaystyle=\operatorname{Tr}(\rho\rho_{t})+2\sqrt{\det\rho\det\rho_{t}}$		(21)
		$\displaystyle=p_{t}^{2}+(1-p_{t})^{2}+2\sqrt{\lambda_{\max}(\rho)(1-\lambda_{% \max}(\rho))p_{t}(1-p_{t})}.$		(22)

Using that $p_{t}\leq\lambda_{\max}(\rho)$ we have

	$\displaystyle F(\rho,\rho_{t})$	$\displaystyle\geq p_{t}^{2}+(1-p_{t})^{2}+2p_{t}(1-\lambda_{\max}(\rho))$		(23)
		$\displaystyle=1+2p_{t}(p_{t}-\lambda_{\max}(\rho)).$		(24)

Then we can upper bound the infidelity as

\displaystyle 1-F(\rho,\rho_{t})\leq(\lambda_{\max}(\rho)-p_{t})p_{t}\leq 2(% \lambda_{\max}(\rho)-p_{t}),

(25)

which leads to $\textnormal{Disturbance}^{*}(T)\leq 2\textnormal{Regret}(T)$ . For the other bound we can use the geometric mean $2\sqrt{xy}\leq x+y$ in (21) and we have

	$\displaystyle F(\rho,\rho_{t})$	$\displaystyle\leq p_{t}^{2}+(1-p_{t})^{2}+p_{t}(1-p_{t})+\lambda_{\max}(\rho)(% 1-\lambda_{\max}(\rho))$		(26)
		$\displaystyle=1+(p_{t}-\lambda_{\max}(\rho))(p_{t}+\lambda_{\max}(\rho)-1)\leq 1% +(p_{t}-\lambda_{\max}(\rho)),$		(27)

where we used $p_{t},\lambda_{\max}(\rho)\leq 1$ . This gives

\displaystyle 1-F(\rho,\rho_{t})\geq\lambda_{\max}(\rho)-p_{t},

(28)

which leads to $\textnormal{Disturbance}^{*}(T)\geq\textnormal{Regret}(T)$ . ∎

3.3 Multi-armed quantum bandit for pure states

The model that we are interested in is the general multi-armed quantum bandit model described in [14][Section 2.3] with the action set being all rank-1 projectors and with pure state environments. For completeness, we state the basic definitions for this particular case for any dimension.

Definition 5.

Let $d\in\mathbb{N}$ . A $d$ -dimensional pure state multi-armed quantum bandit (PSMAQB) is given by a measurable space $(\mathcal{A},\Sigma)$ , where $\mathcal{A}=\mathcal{S}_{d}^{*}$ is the action set and $\Sigma$ is a $\sigma$ -algebra of subsets of $\mathcal{A}$ . The bandit is in an environment, a quantum state $\Pi_{\theta}\in\mathcal{S}_{d}^{*}$ , that is unknown.

The interaction with the PSMAQB is done by a learner that interacts sequentially over $t\in[T]$ rounds with the unknown environment $\Pi_{\theta}\in\mathcal{S}^{*}_{d}$ . At each time step $t\in[T]$ :

1.

The learner selects an action $\Pi_{a_{t}}\in\mathcal{A}$ .

Performs a measurement on the unknown environment $\Pi_{\theta}$ using the two-outcome POVM $\{\Pi_{a_{t}},I_{d\times d}-\Pi_{a_{t}}\}$ and receives a reward $r_{t}\in\{0,1\}$ sampled according to the Born’s rule, i.e

\displaystyle\mathrm{Pr}_{\Pi_{\theta}}\left(r_{t}|\Pi_{a_{t}}\right)=\begin{% cases}\operatorname{Tr}(\Pi_{\theta}\Pi_{a_{t}})\quad\text{if}\quad r_{t}=1,\\ 1-\operatorname{Tr}(\Pi_{\theta}\Pi_{a_{t}})\quad\text{if}\quad r_{t}=0,\\ 0\quad\text{else}.\end{cases}

(29)

We note that the reward at time step $t$ after selecting $\Pi_{a_{t}}\in\mathcal{A}$ can be written as

\displaystyle r_{t}=\operatorname{Tr}(\Pi_{\theta}\Pi_{a_{t}})+\epsilon_{t},

(30)

where $\epsilon_{t}$ is a Bernoulli random variable with values $\epsilon_{t}\in\{1-\operatorname{Tr}(\Pi_{\theta}\Pi_{a_{t}}),-\operatorname{% Tr}(\Pi_{\theta}\Pi_{a_{t}})\}$ such that

	$\displaystyle\operatorname{\mathbb{E}}\left[\epsilon_{t}\|\mathcal{F}_{t-1}\right]$	$\displaystyle=0,$		(31)
	$\displaystyle\operatorname{\mathbb{V}}\left[\epsilon_{t}\|\mathcal{F}_{t-1}\right]$	$\displaystyle=\operatorname{Tr}(\Pi_{\theta}\Pi_{a_{t}})\left(1-\operatorname{% Tr}(\Pi_{\theta}\Pi_{a_{t}})\right),$		(32)

where $\mathcal{F}_{t-1}:=\{r_{1},\Pi_{a_{1}},...,r_{t-1},\Pi_{a_{t-1}},\Pi_{a_{t}}\}$ is a $\sigma$ -filtration.

Formally the learner is described by a policy.

Definition 6.

A policy $\pi$ is a set of conditional probability measures $\{\pi_{t}\}_{t\in\mathbb{N}}$ on the action set $\mathcal{A}$ of the form

\displaystyle\pi_{t}(\cdot|r_{1},\Pi_{a_{1}},...,r_{t-1},\Pi_{a_{t-1}}):\Sigma% \rightarrow[0,1].

(33)

Then the policy interacting with the environment $\Pi_{\theta}$ defines the probability measure over the set of actions and rewards $P_{\Pi_{\theta},\Pi}:\left(\Sigma\times\{0,1\}\right)^{\times T}\rightarrow[0,1]$ as

\displaystyle\int\cdots\int\mathrm{Pr}_{\Pi_{\theta}}\left(r_{T}|\Pi_{a_{T}}% \right)\pi_{T}(d\Pi_{T}|r_{1},\Pi_{a_{1}},...,r_{T-1},\Pi_{a_{T-1}})\cdots% \mathrm{Pr}_{\Pi_{\theta}}\left(r_{1}|\Pi_{a_{1}}\right)\pi_{1}\left(d\Pi_{a_{% 1}}\right),

(34)

where the integrals are taken with respect to the corresponding subsets of actions.

The goal of the learner is to efficiently learn a classical description of the environment $\Pi_{\theta}=\ket{\psi_{\theta}}\!\bra{\psi_{\theta}}\in\mathcal{S}_{d}^{*}$ while minimizing the disturbance of the post-measured state $\tilde{\psi}_{t}\in\mathcal{S}^{*}_{d}$ that is distributed accordingly to

\displaystyle\mathrm{Pr}\left(R_{t}|\Pi_{a_{t}}\right)=\begin{cases}% \operatorname{Tr}\left(\Pi_{\theta}\Pi_{a_{t}}\right)\quad\text{if }\quad R_{t% }=\Pi_{a_{t}}\\ 1-\operatorname{Tr}\left(\Pi_{\theta}\Pi_{a_{t}}\right)\quad\text{if }\quad R_% {t}=\Pi^{c}_{a_{t}}\\ 0\quad\text{else},\end{cases}

(35)

where $\Pi_{a_{t}}=|\psi_{a_{t}}\rangle\!\bra{\psi_{a_{t}}}$ and

\displaystyle\Pi^{c}_{a_{t}}=|\psi^{c}_{a_{t}}\rangle\!\bra{\psi^{c}_{a_{t}}},% \quad|\psi^{c}_{a_{t}}\rangle=\frac{\ket{\psi}-\langle\psi_{a_{t}}|\psi\rangle% \ket{\psi_{a_{t}}}}{\sqrt{1-|\langle\psi|\psi_{a_{t}}\rangle|^{2}}}.

(36)

The task of the learner is captured by minimizing the cumulative regret, our figure of merit that is defined as follows.

Definition 7.

Given a $d$ -dimensional pure state multi-armed quantum bandit, a policy $\pi$ , and unknown environment $\Pi_{\theta}\in\mathcal{S}^{*}_{d}$ and $T\in\mathbb{N}$ , the cumulative regret is defined as

\displaystyle\textup{Regret}(T,\pi,\Pi_{\theta})

\displaystyle:=\sum_{t=1}^{T}1-\operatorname{Tr}(\Pi_{\theta},\Pi_{a_{t}}).

(37)

We note that the regret quantifies the cumulative infidelity between the unknown environment and the post-measured state. And this notion of regret is consistent with the one introduced in the introduction (6) since $\lambda_{\max}(\Pi_{\theta})=1$ .

Note that indeed minimizing the regret (37) implies selecting actions $\Pi_{a_{t}}$ that have high fidelity respect to the environment (learning the environment) but at the same time minimizing the cumulative infidelity of the post-measured states. In general the goal of the learner is to minimize the expected cumulative regret that is simply defined as $\operatorname{\mathbb{E}}_{\Pi_{\theta}}[\text{Regret}(T,\pi,\Pi_{\theta})]$ where the expectation $\operatorname{\mathbb{E}}_{\Pi_{\theta}}$ is taken over the probability measure (34). When the context is clear, we will use the notation $\text{Regret}(T)$ . Moreover the expression of the regret (37) coincides with the notion of regret introduced for general multi-armed bandits [14, Section 2.3]. For that reason we refer to the PSMAQB problem as the task of finding a policy that minimizes the expected regret $\operatorname{\mathbb{E}}_{\Pi_{\theta}}[\text{Regret}(T,\pi,\Pi_{\theta})]$ . Minimizing the regret means achieving sublinear regret on $T$ since $\text{Regret}(T)\leq T$ holds for any policy.

3.4 Classical model

In order to study the PSMAQB it is helpful to study it using the linear stochastic bandit framework. The idea will be to express the actions and unknown quantum states as real vectors using the parametrization (14).

In the linear stochastic bandit model, the action set is a subset of real vectors i.e $\mathcal{A}\subseteq\mathbb{R}^{d}$ , and the reward at time step $t\in[T]$ after selecting action $a_{t}\in\mathcal{A}$ is given by

\displaystyle r_{t}=\langle a_{t},\theta\rangle+\epsilon_{t}

(38)

where $\theta\in\mathbb{R}^{d}$ is the unknown parameter and $\epsilon_{t}$ is some bounded $\sigma-$ subgaussian noise that in general can depend on $\theta$ and $a_{t}$ . The regret for this model is given by

\displaystyle\text{Regret}_{cl}(T,\pi,\theta):=\sum_{t=1}^{T}\max_{a\in% \mathcal{A}}\langle\theta,a\rangle-\langle\theta,a_{t}\rangle,

(39)

where the policy $\pi$ is defined analogously to Definition 6. We used the subscript $cl$ to differentiate between quantum and classical model.

In order to express the PSMAQB model as a linear stochastic bandit we can use the parametrization (14) and express the expected reward for action $\Pi_{a_{t}}\in\mathcal{S}^{*}_{d}$ as

\displaystyle\operatorname{Tr}(\Pi_{a_{t}}\Pi_{\theta})=\frac{1}{d}\left(1+% \left(d-1\right)\langle a_{t},\theta\rangle\right).

(40)

Inverting the above expression we have

\displaystyle\langle a_{t},\theta\rangle=\frac{d\operatorname{Tr}(\Pi_{\theta}% \Pi_{a_{t}})-1}{d-1}.

(41)

Let’s quickly revisit the regret expression and use the above identities in order to connect the quantum and classical versions of the regret. We denote $\Pi_{a^{*}}=\operatorname*{\arg\!\max}_{\Pi\in\mathcal{A}}\operatorname{Tr}(% \Pi\Pi_{\theta})$ the optimal action and recall that $1=\operatorname{Tr}(\Pi_{a^{*}}\Pi_{\theta})$ . Then we have

	$\displaystyle\text{Regret}(T,\pi,\Pi_{\theta})$	$\displaystyle=\sum_{t=1}^{T}\operatorname{Tr}(\Pi_{a^{*}}\Pi_{\theta})-% \operatorname{Tr}(\Pi_{a_{t}}\Pi_{\theta})$
		$\displaystyle=\frac{d-1}{d}\sum_{t=1}^{T}\langle\theta,a^{*}-a_{t}\rangle.$

Note that by the normalization (14) we have that for $\rho_{\theta}$ and $\Pi_{a_{t}}$ the corresponding real vecotrs are normalized $\|\theta\|_{2}=\|a_{t}\|=1$ . Thus, since $a^{*}=\theta$ the regret can be written as

	$\displaystyle\text{Regret}(T,\pi,\Pi_{\theta})$	$\displaystyle=\frac{d-1}{d}\sum_{t=1}^{T}\left(1-\langle\theta,a_{t}\rangle\right)$		(42)
		$\displaystyle=\frac{d-1}{2d}\sum_{t=1}^{T}\\|\theta-a_{t}\\|_{2}^{2}.$		(43)

Now we want to formulate a classical bandit such that the environment and actions are given by the real vectors that parameterize the quantum states (14). In order to have an expected linear reward that is linear with respect to $\theta$ and $a_{t}$ it is sufficient to define a renormalized reward as

\displaystyle\tilde{r}_{t}=\frac{dr_{t}-1}{d-1}\in\left\{1,\frac{-1}{d-1}% \right\},

(44)

where we used the reward of the quantum model $r_{t}\in\{0,1\}$ given by 29. Using $\operatorname{\mathbb{E}}[r_{t}|\mathcal{F}_{t-1}]=\operatorname{Tr}(\Pi_{a_{t% }}\rho_{\theta})$ and (40) it is easy to see that

\displaystyle\operatorname{\mathbb{E}}[\tilde{r}_{t}|\mathcal{F}_{t-1}]=% \langle\theta,a_{t}\rangle,

(45)

where naturally we use $\mathcal{F}_{t-1}=\{\tilde{r}_{1},a_{1},...,\tilde{r}_{t-1},a_{t-1},a_{t}\}.$ Thus, we can write the reward in the form (38)

	$\displaystyle\tilde{r_{t}}=\langle\theta,a_{t}\rangle+\epsilon_{t},\quad% \operatorname{\mathbb{E}}[\epsilon_{t}\|\mathcal{F}_{t-1}]=0,$
	$\displaystyle\operatorname{\mathbb{V}}[\epsilon_{t}\|\mathcal{F}_{t-1}]=\left(1% -\langle\theta,a_{t}\rangle\right)\left(1+(d-1)\langle\theta,a_{t}\rangle% \right),$

where the expectation and variance follow from a direct calculation. Then we can study a $d$ -dimensional PSMAQB as a linear stochastic bandit choosing the action set

\displaystyle\mathcal{A}^{\text{quantum}}_{d}:=\{a\in\mathbb{R}^{d^{2}-1}:\Pi_% {a}\in\mathcal{S}^{*}_{d}\}

(46)

with unknown parameter $\theta\in\mathbb{R}^{d^{2}-1}$ such that $\Pi_{\theta}\in\mathcal{S}^{*}_{d}$ . The regret of this linear model is given by $\text{Regret}_{cl}=\frac{1}{2}\sum_{t=1}^{T}\|\theta-a_{t}\|_{2}^{2}$ and we have the following relation with the quantum model:

\displaystyle\text{Regret}(T,\pi,\Pi_{\theta})=\frac{d-1}{d}\text{Regret}_{cl}% (T,\pi,\theta),

(47)

where we take the same strategy $\pi$ in both sides since we can identify the actions of both bandits through the parametrization (14) and the relation between rewards given by (44). When the context is clear we will simply use $\text{Regret}(T)$ for both quantum and classical model.

3.5 Linear bandit with linearly vanishing variance noise

In [16] some of the present authors introduced the framework of stochastic linear bandits with linear vanishing noise where the setting is a linear bandit with action set $\mathcal{A}=\mathbb{S}^{d}$ , unknown parameter $\theta\in\mathbb{S}^{d}$ and reward $r_{t}=\langle\theta,a_{t}\rangle+\epsilon_{t}$ such that $\epsilon_{t}$ is $\sigma_{t}$ -subgaussian with $\operatorname{\mathbb{E}}[\epsilon_{t}|\mathcal{F}_{t-1}]=0$ and the property of vanishing noise $\sigma^{2}_{t}\leq 1-\langle\theta,a_{t}\rangle^{2}$ . In order to study a PSMAQB we will relax the condition on the subgaussian noise and we will replace it by the following condition on the noise

\displaystyle\operatorname{\mathbb{E}}\left[\epsilon_{t}|\mathcal{F}_{t-1}% \right]=0,\quad\operatorname{\mathbb{V}}\left[\epsilon_{t}|\mathcal{F}_{t-1}% \right]\leq 1-\langle\theta,a_{t}\rangle^{2}.

(48)

As in the classical model of the previous section using that $\max_{a\in\mathcal{A}}\langle\theta,a\rangle=1$ we have that the regret is given by

\displaystyle\text{Regret}(T)=\sum_{t=1}^{T}1-\langle\theta,a_{t}\rangle=\frac% {1}{2}\sum_{t=1}^{T}\|\theta-a_{t}\|_{2}^{2}.

(49)

We note that finding a strategy that minimizes regret for the above model will also work for a $d=2$ PSMAQB with unknown $\Pi_{\theta}\in\mathcal{S}^{*}_{2}$ using the relations of last sections since

\displaystyle\mathcal{A}_{2}^{\text{quantum}}=\{a\in\mathbb{R}^{3}:\|a\|_{2}=1% \}=\mathbb{S}^{2},

(50)

and the variance of the PSMAQB (3.4) fullfills the relation (48).

4 Algorithm for bandits with linearly vanishing variance noise

In this Section we are going to present an algorithm for the linear bandit model explained in Section 3.5 that is based on the algorithm LINUCB-VN studied in [16] for linear bandits with linearly vanishing noise. Later we will show how to use this algorithm for the qubit PSMAQB problem.

4.1 Median of means for an online least squares estimator

First we discuss the medians of means method for the online linear least squares estimator introduced in [19]. We are going to use this estimator later in order to design a strategy that minimizes the regret for the model introduced in Section 3.5. The reason we need this estimator is that in the analysis of our algorithm we need concentration bounds for linear least squares estimators where the random variables have bounded variance and a possibly unbounded subgaussian parameter. The condition of bounded variance is weaker than the usual assumption of bounded subgaussian noise, however we can recover similar concentration bounds of the estimator if we implement a median of means.

In order to build the median of means online least squares estimator for linear bandits we need to sample $k$ independent rewards for each action. Specifically given an action set $\mathcal{A}\subset\mathbb{R}^{d}$ , an unknown parameter $\theta\in\mathbb{R}^{d}$ , at each time step $t$ we select an action $a_{t}\in\mathcal{A}$ and sample $k$ independent rewards using $a_{t}$ where the outcome rewards are distributed as

\displaystyle r_{t,i}=\langle\theta,a_{t}\rangle+\epsilon_{t,i}\quad\text{for % }i\in[k],

(51)

for some noise such that $\operatorname{\mathbb{E}}[\epsilon_{t,i}|\mathcal{F}_{t-1}]=0$ . We refer to $k$ as the number of subsamples per time step. Then at time step $t$ we define $k$ least squares estimators as

\displaystyle\widetilde{\theta}_{t,i}=V_{t}^{-1}\sum_{s=1}^{t}r_{s,i}a_{s}% \quad\text{for }i\in[k],

(52)

where $V_{t}$ is the design matrix defined as

\displaystyle V_{t}=\lambda\mathbb{I}+\sum_{s=1}^{t}a_{s}a_{s}^{\mathsf{T}},

(53)

with $\lambda>0$ being a parameter that ensures invertibility of $V_{t}$ . We note that the design matrix is independent of $i$ . Then the median of means for least squares estimator (MOMLSE) is defined as

\displaystyle\widetilde{\theta}_{t}^{\text{\tiny MoM}}:=\tilde{\theta}_{t,k^{*% }}\quad\text{where }k^{*}=\operatorname*{\arg\!\min}_{j\in[k]}y_{j},

(54)

where

\displaystyle y_{j}=\text{median}\{\|\tilde{\theta}_{t,j}-\tilde{\theta}_{t,i}% \|_{V_{t}}:i\in[k]/j\}\quad\text{for }j\in[k].

(55)

Using the results in [19] we have that the above estimator has the following concentration property around the true estimator.

Lemma 8 (Lemma 2 and 3 in [19]).

Let $\widetilde{\theta}_{t}^{\textup{\tiny MoM}}$ be the MOMLSE defined (54) in with $k$ subsamples with $\{r_{s,i}\}_{(s,i)\in[t]\times[k]}$ rewards and corresponding actions $\{a_{s}\}_{s\in[t]}$ . Assume that the noise of all rewards has bounded variance, i.e $\operatorname{\mathbb{E}}\left[\epsilon^{2}_{s,i}|\mathcal{F}_{t-1}\right]\leq 1$ for all $s\in[t]$ and $i\in[k]$ . Then we have

\displaystyle\mathrm{Pr}\left(\|\theta-\widetilde{\theta}_{t}^{\text{\tiny MoM% }}\|^{2}_{V_{t}}\leq 9\left(\sqrt{9d}+\lambda\|\theta\|_{2}\right)^{2}\right)% \geq 1-\exp\left(\frac{-k}{24}\right).

(56)

We will use a slight modification of the above result with a weighted least squares estimator like the one used in [16]. The weights will be related to a variance estimator of the noise for action $a\in\mathcal{A}$ that at each time step $t$ can be generally defined as

\displaystyle\hat{\sigma}^{2}_{t}:\mathcal{H}_{t-1}\times A\rightarrow\mathbb{% R}_{>0},

(57)

where $\mathcal{H}_{t-1}=\{r_{s,i}\}_{(s,i)\in[t-1]\times[k]}\cup\{a_{s}\}_{s\in[t-1]}$ contains the past information of rewards and actions played. For our purposes we will use only the information of the past actions and in order to simplify notation we will use $\hat{\sigma}^{2}_{t}(a)$ to denote an estimator of the variance for the reward associated action $a\in\mathcal{A}$ with the information collected up to time step $t-1$ . Then the corresponding weighted versions with $k$ subsamples are defined as

\displaystyle\widetilde{\theta}_{t,i}=V_{t}^{-1}\sum_{s=1}^{t}\frac{1}{\hat{% \sigma}^{2}_{s}(a_{s})}r_{s,i}a_{s}\quad\text{for }i\in[k],

(58)

with the weighted design matrix

\displaystyle V_{t}=\lambda\mathbb{I}+\sum_{s=1}^{t}\frac{1}{\hat{\sigma}^{2}_% {s}(a_{s})}a_{s}a_{s}^{\mathsf{T}}.

(59)

Then the weighted version of the median of means linear estimator is defined analogously to (54) with the corresponding weighted versions (58)(59) and we will denote it as $\widetilde{\theta}_{t}^{\text{\tiny wMOM}}$ . In our algorithm analysis we will use the following analogous concentration bound under the condition that the estimators $\hat{\sigma}^{2}_{t}$ overestimate the true variance.

Corollary 9.

Let $\widetilde{\theta}_{t}^{\textup{\tiny wMOM}}$ be the weighted version of the MOMLSE with $k$ subsamples, $\{r_{s,i}\}_{(s,i)\in[t]\times[k]}$ rewards with corresponding actions $\{a_{s}\}_{s\in[t]}$ and variance estimator $\hat{\sigma}^{2}_{t}$ . Define the following event

\displaystyle G_{t}:=\{\big{(}\mathcal{H}_{t-1},a_{t}\big{)}:\operatorname{% \mathbb{V}}[\epsilon_{s,i}]\leq\hat{\sigma}^{2}(a_{s})\ \forall s,i\in[t]% \times[k]\}.

(60)

Then we have

\displaystyle\mathrm{Pr}\left(\|\theta-\widetilde{\theta}_{t}^{\textup{\tiny wMOM% }}\|^{2}_{V_{t}}\leq\beta\mid G_{t}\right)\geq 1-\exp\left(\frac{-k}{24}\right),

(61)

where

\displaystyle\beta:=9\left(\sqrt{9d}+\lambda\|\theta\|_{2}\right)^{2}.

(62)

Proof.

The result follows from applying Lemma 8 to the sequences of re-normalized rewards $\{\frac{r_{s,i}}{\hat{\sigma}_{s}(a_{s})}\}_{(s,i)\in[t]\times[k]}$ and actions $\{\frac{a_{s,i}}{\hat{\sigma}_{s}(a_{s})}\}_{s\in[t]}$ . We only need to check that the sequence $\{\frac{\epsilon_{s,i}}{\hat{\sigma}_{s}(a_{s})}\}_{(s,i)\in[t]\times[k]}$ has finite variance. Conditioning with the event $G_{t}$ and the fact that by definition $\hat{\sigma}^{2}_{s}(a_{s})$ only depend on the past $s-1$ action and rewards we have that the re-normalized noise has bounded variance since

\displaystyle\operatorname{\mathbb{E}}\left[\left(\frac{\epsilon_{s,i}}{\hat{% \sigma}_{s}(a_{s})}\right)^{2}\Bigg{|}\mathcal{F}_{t-1}\right]=\frac{1}{\hat{% \sigma}^{2}_{s}(a_{s})}\operatorname{\mathbb{E}}[\epsilon^{2}_{s,i}|\mathcal{F% }_{t-1}]=\frac{\operatorname{\mathbb{V}}[\epsilon_{s,i}]}{\hat{\sigma}^{2}_{s}% (a_{s})}\leq 1.

(63)

∎

4.2 Algorithm

The algorithm that we design for linear bandits with linearly variance vanishing noise is LinUCB-VVN (LinUCB vanishing variance noise) stated in Algorithm 1. The algorithm updates the actions in batches of lenght $2k(d-1)$ . For every batch it outputs $2(d-1)$ actions and samples $k$ independent reward with each action. We use a slightly abuse of notation and label each batch with $t$ . For each batch $t\geq 1$ the actions are updated as

\displaystyle a^{\pm}_{t,i}:=\frac{\widetilde{a}^{\pm}_{t,i}}{\|\widetilde{a}^% {\pm}_{t,i}\|_{2}},\quad\tilde{a}^{\pm}_{t,i}=\frac{\widetilde{\theta}^{\text{% wMoM}}_{t-1}}{\|\widetilde{\theta}^{\text{wMoM}}_{t-1}\|_{2}}\pm\frac{1}{\sqrt% {\lambda_{\min}(V_{t-1})}}v_{t-1,i},

(64)

for $i\in[d-1]$ , $v_{t-1.i}$ is the normalized eigenvector of $V_{t-1}$ with eigenvalue $\lambda_{i}(V_{t-1})$ and $\widetilde{\theta}^{\text{wMoM}}_{t}$ is the weighted MOMLSE defined as in Section 4.1 that is build with the $k$ sampled rewards of each action. The design matrix $V_{t}$ is updated as

\displaystyle V_{t}=V_{t-1}+\omega(V_{t-1})\sum_{i=1}^{d-1}\left(a^{+}_{t,i}(a% ^{+}_{t,i})^{\mathsf{T}}+a^{-}_{t,i}(a^{-}_{t,i})^{\mathsf{T}}\right)

(65)

where the weights $\omega$ and variance estimator are chosen as

\displaystyle\omega(V_{t-1}):=\frac{\sqrt{\lambda_{\max}(V_{t-1})}}{12\sqrt{d-% 1}\beta},\quad\hat{\sigma}^{2}_{t}(a^{\pm}_{t,i}):=\frac{1}{\omega(V_{t-1})}.

(66)

We note that the definition for $\hat{\sigma}^{2}_{t}(a^{\pm}_{t,i})$ fulfills the definition of variance estimator (57) stated in the previous section since it only depends on the past history $\mathcal{H}_{t-1}$ .

Require:

\lambda_{0}\in\mathbb{R}_{>0}

k\in\mathbb{N}

\omega:\text{P}^{d}_{+}\rightarrow\mathbb{R}_{\geq 0}

Set initial design matrix

V_{0}\leftarrow\lambda_{0}\mathbb{I}_{d\times d}

Choose initial estimator

{\theta}_{0}\in\mathbb{S}^{d}

for

\theta

at random

for $t=1,2,\cdots$ do

Optimistic action selection

for $i=1,2,\cdots d-1$ do

Select actions

a^{+}_{t,i}

and

a^{-}_{t,i}

according to Eq. (64)

Sample

k

independent rewards for each

a^{\pm}_{t,i}

for $j=1,...,k$ do

Receive associated rewards

r^{+}_{t,i,j}

and

r^{-}_{t,i,j}

end for

Update variance estimator for

a^{+}_{t,i}

\hat{\sigma}^{2}_{t}\leftarrow\frac{1}{\omega(V_{t-1}(\lambda_{0}))}

for

t\geq 2

\hat{\sigma}^{2}_{t}\leftarrow 1

for

t=1

Update design matrix

V_{t}\leftarrow V_{t-1}+\frac{1}{\hat{\sigma}_{t}^{2}}\sum_{i=1}^{d-1}\left(a^% {+}_{t,i}(a^{+}_{t,i})^{\mathsf{T}}+a^{-}_{t,i}(a^{-}_{t,i})^{\mathsf{T}}\right)

Update LSE for each subsample

for $j=1,2,...,k$ : do

\widetilde{\theta}_{t,j}^{\text{w}}\leftarrow V_{t}^{-1}\sum_{s=1}^{t}\frac{1}% {\hat{\sigma}_{t}^{2}}\sum_{i=1}^{d-1}(a^{+}_{s,i}r^{+}_{s,i,j}+a^{-}_{s,i}r^{% -}_{s,i,j})

end for

Compute

\widetilde{\theta}_{t}^{\text{\tiny wMOM}}

using

\{\widetilde{\theta}_{t,j}^{\text{w}}\}_{j=1}^{k}

end for

Algorithm 1 LinUCB-VVN

4.3 Regret analysis

In this Section we present the analysis of the regret for Algorithm 1. The analysis is similar to the LinUCB-VN presented in [16][Appendix C.1]. Thus, we focus on the changes respect to LinUCB-VN and although we present a complete proof we refer to [16] for more detailed computations. The main result we use from [16] is a theorem that quantifies the growth of the maximum and minimum eigenvalues of the design matrix $V_{t}$ (65).

Theorem 10 (Theorem 3 in [16]).

Let $\{c_{t}\}_{t=0}^{\infty}\subset\mathbb{S}^{d-1}$ be a sequence of normalized vectors and $\omega:\textup{P}^{d}_{+}\rightarrow\mathbb{R}_{\geq 0}$ a function such that

\displaystyle\omega(X)\leq C\sqrt{\|X\|_{\infty}},

(67)

for a constant $C>0$ and any $X\in\textup{P}^{d}_{+}$ . Let $\lambda_{0}\geq\max\big{\{}2,\sqrt{\frac{2}{3(d-1)}}2dC+\frac{2}{3(d-1)}\big{\}}$ , and define a sequence of matrices $\{V_{t}\}_{t=0}^{\infty}\subset\mathbb{R}^{d\times d}$ as

\displaystyle V_{0}:=\lambda_{0}\mathbb{I}_{d\times d},\quad V_{t+1}:=V_{t}+% \omega(V_{t})\sum_{i=1}^{d-1}P_{t,i},

(68)

where

	$\displaystyle P_{t,i}:=a^{+}_{t+1,i}(a^{+}_{t+1,i})^{\mathsf{T}}+a^{-}_{t+1,i}% (a^{-}_{t+1,i})^{\mathsf{T}},$		(69)
	$\displaystyle a^{\pm}_{t+1,i}:=\frac{\tilde{a}^{\pm}_{t+1,i}}{\\|\tilde{a}^{\pm% }_{t+1,i}\\|_{2}},\quad\tilde{a}^{\pm}_{t+1,i}:=c_{t}\pm\frac{1}{\sqrt{\lambda_% {t,1}}}v_{t,i},$		(70)

with $\lambda_{t,i}=\lambda_{i}(V_{t})$ the eigenvalues of $V_{t}$ with corresponding normalized eigenvectors $v_{t,1},...,v_{t,d}\in\mathbb{S}^{d-1}$ . Then we have

\displaystyle\lambda_{\min}(V_{t})\geq\sqrt{\frac{2}{3(d-1)}\lambda_{\max}(V_{% t})}\quad\text{for all}\quad t\geq 0.

(71)

For the proof of the above Theorem we refer to the original reference. Then using this Theorem and the concentration bound for MOMLSE given in Corollary 9 we can provide the following regret analysis for a stochastic linear bandit with vanishing variance noise.

Theorem 11.

Let $d\geq 2$ , $k\in\mathbb{N}$ and $T=2(d-1)k\widetilde{T}$ for some $\widetilde{T}\in\mathbb{N}$ , $\widetilde{T}\geq 2$ . Let $\omega(X)$ defined as in (66) using $\lambda_{0}$ satisfying the constraints in Theorem 10. Then if we apply LinUCB-VVN 1( $\lambda_{0},k,\omega$ ) to a $d$ dimensional stochastic linear bandit with variance as in (48) with probability at least $(1-\exp(-k/24))^{\widetilde{T}}$ the regret satisfies

	$\displaystyle\textup{Regret}(T)\leq 4k(d-1)+144d(d-1)k\beta^{2}\log\left(\frac% {T}{2(d-1)k}\right)$		(72)
	$\displaystyle+24(d-1)^{\frac{3}{2}}k\beta\log\left(\frac{T}{2(d-1)k}\right),$		(73)

and at each time step $t\in[T]$ with the same probability it can output an estimator $\hat{\theta}_{t}\in\mathbb{S}^{d-1}$ such that

\displaystyle\|\theta-\hat{\theta}_{t}\|_{2}^{2}\leq\frac{576d^{2}\beta^{2}k+9% 6d\sqrt{d-1}\beta k}{t},

(74)

with $\beta$ defined as in (62).

From the above Theorem we have that if we set $k=\lceil 24\log\left(\frac{\widetilde{T}}{\delta}\right)\rceil$ for some $\delta\in\left(0,1\right)$ then with probability at least $1-\delta$ LinUCB-VNN achieves

\displaystyle\text{Regret}(T)=O\left(d^{4}\log^{2}(T)\right),\quad\|\theta-% \hat{\theta}_{t}\|_{2}^{2}=O\left(\frac{\log(T)}{t}\right).

(75)

Proof.

From the expression of the regret (49) we have that to give an upper bound it suffices to gives an upper bound between the distance of the unknown parameter $\theta$ and the actions $a^{\pm}_{t,i}$ selected by the algorithm (64). We denote the step $\tilde{t}\in[\widetilde{T}]$ to run over the batches the algorithm updates the MoM estimator $\widetilde{\theta}_{t}^{\text{\tiny wMOM}}$ . First we will do the computation assuming that the event

\displaystyle E_{\tilde{t}}:=\{\mathcal{H}_{\tilde{t}}:\forall s\in[\tilde{t}]% ,\theta\in\mathcal{C}_{s}\},

(76)

holds where $\mathcal{C}_{s}=\{\theta^{\prime}\in\mathbb{R}^{d}:\|\theta^{\prime}-% \widetilde{\theta}^{wMOM}_{\tilde{t}}\|^{2}_{V_{s}}\leq\beta\}$ . Here the history $\mathcal{H}_{\tilde{t}}$ is defined with the previous outcomes and actions of our algorithm i.e

\displaystyle\mathcal{H}_{\tilde{t}}:=\left(r^{+}_{s,i,j},a^{+}_{s,i},r^{-}_{s% ,i,j},a^{-}_{s,i}\right)_{(s,i,j)\in[\tilde{t}]\times[d-1]\times[k]}

(77)

Later we will quantify the probability that this event always hold. Using the definition of the actions (64), $\theta,\widetilde{\theta}^{\text{\tiny wMOM}}_{\tilde{t}}\in\mathbb{S}^{d-1}$ and the arguments from [16][Appendix C.1, Eq. (165)] we have that

\displaystyle\|\theta-a^{\pm}_{\tilde{t},i}\|_{2}^{2}\leq\frac{9\beta}{\lambda% _{\min}(V_{\tilde{t}-1})}.

(78)

Then using that the design matrix $V_{\tilde{t}}$ (65) is updated as in Theorem 10 and the choice of weights (66) we fix

\displaystyle\lambda_{0}\geq\max\left\{2,2\sqrt{\frac{2}{3(d-1)}}\frac{d}{12% \sqrt{d-1}\beta}+\frac{2}{3(d-1)}\right\}

(79)

and we have that $\lambda_{\min}(V_{\tilde{t}})\geq\sqrt{\frac{2}{3(d-1)}\lambda_{\max}(V_{% \tilde{t}})}$ applying Theorem 10. Inserting this into the above we have

\displaystyle\|\theta-a^{\pm}_{\tilde{t},i}\|_{2}^{2}\leq\frac{12\sqrt{d-1}% \beta}{\sqrt{\lambda_{\max}(V_{\tilde{t}})}}.

(80)

Thus, it remains to provide a lower bound on $\lambda_{\max}(V_{\tilde{t}})$ . We note that in [16][Appendix C.1] they also had to provide an upper bound but this was because the constant $\beta$ beta they use depends on $t$ . From the definition of $V_{t}$ (65) we can bound the trace as

	$\displaystyle\mathrm{Tr}(V_{\tilde{t}})$	$\displaystyle\geq\sum_{s=2}^{\tilde{t}}2(d-1)\omega(V_{s-1})$		(81)
		$\displaystyle=\frac{\sqrt{d-1}}{6\beta}\sum_{s=1}^{\tilde{t}-1}\sqrt{\lambda_{% \max}(V_{s})}.$		(82)

Then using the bound $\mathrm{Tr}(V_{\tilde{t}})\geq\lambda_{\max}(V_{\tilde{t}})/d$ and some algebra we arrive at

\displaystyle\lambda_{\max}(V_{\tilde{t}})\geq\frac{1}{1+6\frac{d}{\sqrt{d-1}}% \beta}\sum_{s=1}^{\tilde{t}}\sqrt{\lambda_{\max}(V_{s})}.

(83)

Now we have an inequality with the function $\lambda_{\max}(V_{s})$ at both sides. In order to solve it we use the technique from [16][Appendix C.1, Eqs. (197)–(208)] which consist on extending $\lambda_{\max}(V_{\tilde{t}})$ to the continuous with a linear interpolation and then transforming the sum to an integral which leads to a differential inequality. Solving this leads to

\displaystyle\lambda_{\max}(V_{\tilde{t}})\geq\frac{\tilde{t}^{2}}{4(1+6\frac{% d}{\sqrt{d-1}}\beta)^{2}}.

(84)

Now we can insert the above into (80) and we have

	$\displaystyle\\|\theta-a^{\pm}_{\tilde{t},i}\\|_{2}^{2}$	$\displaystyle\leq\frac{24\sqrt{d-1}\beta(1+6\frac{d}{\sqrt{d-1}}\beta)}{\tilde% {t}-1}$		(85)
		$\displaystyle=\frac{144d\beta^{2}+24\sqrt{d-1}\beta}{\tilde{t}-1}.$		(86)

Thus, we can inserted the above bound into the regret expression (49) and we have

	$\displaystyle\text{Regret}(T)=\frac{1}{2}\sum_{t=1}^{T}\\|\theta-a_{t}\\|_{2}^{2}$		(87)
	$\displaystyle=\frac{1}{2}\sum_{\tilde{t}=1}^{\tilde{T}}\sum_{i=1}^{d-1}\sum_{j% =1}^{k}\left(\\|\theta-a^{+}_{\tilde{t},i}\\|_{2}^{2}+\\|\theta-a^{-}_{\tilde{t},% i}\\|_{2}^{2}\right)$		(88)
	$\displaystyle\leq 4k(d-1)+\frac{1}{2}\sum_{\tilde{t}=2}^{\tilde{T}}\sum_{i=1}^% {d-1}\sum_{j=1}^{k}\left(\\|\theta-a^{+}_{\tilde{t},i}\\|_{2}^{2}+\\|\theta-a^{-}% _{\tilde{t},i}\\|_{2}^{2}\right)$		(89)
	$\displaystyle\leq 4k(d-1)+(144d(d-1)k\beta^{2}+24(d-1)^{\frac{3}{2}}k\beta)% \sum_{\tilde{t}=2}^{\tilde{T}}\frac{1}{t-1}$		(90)
	$\displaystyle\leq 4k(d-1)+144d(d-1)k\beta^{2}\log\widetilde{T}+24(d-1)^{\frac{% 3}{2}}k\beta\log\widetilde{T}$		(91)
	$\displaystyle=4k(d-1)+144d(d-1)k\beta^{2}\log\left(\frac{T}{2(d-1)k}\right)$
	$\displaystyle\quad+24(d-1)^{\frac{3}{2}}k\beta\log\left(\frac{T}{2(d-1)k}% \right).$		(92)

It remains to quantify the probability that the event $E_{\tilde{t}}$ holds. For that we will use the concentration bounds of the median of means for least squares estimator stated in Corollary 9. From the variance condition of our model (48) we have

\displaystyle\operatorname{\mathbb{V}}[\epsilon^{\pm}_{\tilde{t},i,j}|\mathcal% {F}_{\tilde{t}-1}]\leq 1-\langle\theta,a^{\pm}_{\tilde{t},i}\rangle^{2}\leq 2(% 1-\langle\theta,a^{\pm}_{\tilde{t},i}))=\|\theta-a^{\pm}_{\tilde{t},i}\|_{2}^{% 2},

(93)

where we used $1+\langle\theta,a^{\pm}_{\tilde{t},i}\rangle\leq 2$ . Thus from our choice of weights (66) and (85) we have that

\displaystyle\text{if }\theta\in\mathcal{C}_{s-1}\Rightarrow\operatorname{% \mathbb{V}}[\epsilon^{\pm}_{\tilde{t},i,j}|\mathcal{F}_{\tilde{t}-1}]\leq\hat{% \sigma}_{s}^{2}(a^{\pm}_{s,i}).

(94)

Then in order to apply Corollary 9 we note that from the choice $\hat{\sigma}_{s}^{2}(a^{\pm}_{1,i})=1$ the event $G_{\tilde{t}}$ at $\tilde{t}=1$ is always satisfied i.e $\mathrm{Pr}(G_{1})=1$ . Then applying Bayes theorem, union bound over the events $G_{1},E_{1},...,G_{t-1},E_{t}$ and Corollary 9 we have

\displaystyle\mathrm{Pr}(E_{\widetilde{T}}\cap G_{\widetilde{T}})\geq\left(1-% \exp(-k/24)\right)^{\widetilde{T}}.

(95)

This probability also quantifies the probability that (85) holds since the only assumption we used is $\theta\in\mathcal{C}_{\tilde{t}-1}$ . Then we can take simply one of the actions $a^{\pm}_{\tilde{t},i}$ as the estimator $\hat{\theta}_{t}$ and the result follows using the relabeling $t=2(d-1)k\tilde{t}$ and the inequality $1/(\tilde{t}-1)\leq 2/\tilde{t}$ for $\tilde{t}\geq 2$ . A more detailed analogous computation of the above probability can be found in [16][Appendix C.1]. ∎

In the previous Theorem we did not set a specific value for the parameter $k$ or the number of subsamples per action. We note that the regret scales linearly with $k$ but since the success probability scales exponentially with $k$ it will suffice to set $k\sim\log(T)$ such that in expectation we get the $\log^{2}(T)$ behavior. We formalize this in the following Corollary.

Corollary 12.

Under the same assumptions of Theorem 11 we can fix $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ and we have that for $t\in[T]$ ,

\displaystyle\operatorname{\mathbb{E}}\left[\textup{Regret}(T)\right]\leq 344(% d-1)\log\left(T\right)+\left(3546d(d-1)\beta^{2}+1152(d-1)^{\frac{3}{2}}\beta% \right)\log^{2}\left(T\right)

(96)

\displaystyle\operatorname{\mathbb{E}}\left[\|\theta-\hat{\theta}_{t}\|_{2}^{2% }\right]\leq\frac{27648d^{2}\beta^{2}\log(T)+4608d\sqrt{d-1}\beta\log(T)}{t}+% \frac{4(d-1)\log(T)}{T}.

(97)

Using that $\beta=O(d)$ gives

\displaystyle\operatorname{\mathbb{E}}\left[\textup{Regret}(T)\right]=O(d^{4}% \log^{2}(T)),\quad\operatorname{\mathbb{E}}\left[\|\theta-\hat{\theta}_{t}\|_{% 2}^{2}\right]=\tilde{O}\left(\frac{d^{4}}{t}\right).

(98)

Proof.

The result of Theorem 11 holds with probability at least $(1-\exp(-k/24))^{\widetilde{T}}$ . Setting $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ gives

\displaystyle(1-\exp(-k/24))^{\widetilde{T}}\geq\left(1-\frac{1}{\widetilde{T}% ^{2}}\right)^{\widetilde{T}}\geq 1-\frac{1}{\widetilde{T}}.

(99)

Then given the event $R_{T}$ such that Algorithm 1 achieves the bounds given by Theorem 11 we have that the probability of failure is bounded by

\displaystyle\mathrm{Pr}(R^{C}_{T})\leq\frac{1}{\widetilde{T}},

(100)

where we used $1=\mathrm{Pr}(R_{T})+\mathrm{Pr}(R^{C}_{T})$ . Then the expectation of the bad events can be bounded as

	$\displaystyle\operatorname{\mathbb{E}}\left[\text{Regret}(T)\mathbb{I}\{R^{C}_% {T}\}\right]$	$\displaystyle\leq 4(d-1)k\widetilde{T}\mathrm{Pr}(R^{C}_{T})\leq 4(d-1)k$		(101)
	$\displaystyle\operatorname{\mathbb{E}}\left[\\|\theta-\hat{\theta}_{t}\\|_{2}^{2% }\mathbb{I}\{R^{C}_{T}\}\right]$	$\displaystyle\leq 4\mathrm{Pr}(R^{C}_{T})\leq\frac{4}{\widetilde{T}}$		(102)

where we used $\text{Regret}(T)\leq 2T=4(d-1)k\widetilde{T}$ , $\|\theta-\hat{\theta}_{t}\|_{2}^{2}\leq 4$ . Finally the result follows inserting the value of $k=24\log(\widetilde{T}^{2})$ into the bounds of Theorem 11 and using $\widetilde{T}\leq T$ . ∎

5 Algorithm for qubit PSMAQB and numerical experiments

In this Section we prove our main result that is a regret bound for LinUCB-VVN when applied to the qubit PSMAQB problem.

\begin{overpic}[percent,width=260.17464pt]{psmaqb_regret_infidelity.png} \put(-6.0,35.0){\rotatebox{90.0}{Regret($T$)}} \put(50.0,-2.0){$T$} \put(32.0,25.0){\rotatebox{90.0}{\tiny$\log\left(1-F(\Pi,\Pi_{t})\right)$}} \put(60.0,8.0){\tiny$\log\left(\frac{t}{\log(t)}\right)$} \end{overpic}

Figure 2: Expected regret vs the number or rounds

T

for the LinUCB-VNN algorithm. We run

T=4\cdot 10^{4}

rounds with

k=10

subsamples for the median of means construction. We use

100

independents experiments and average over them. We obtain results for each round but only plot (red crosses) few for clarity of the figure. We fit the regression

\text{Regret}(T)=m_{1}\log^{2}T+b_{1}

with

m_{1}=3.2164\pm 0.0009

and

b_{1}=0.84\pm 0.016

. In the inset plot we plot the expected infidelity of the output estimator at each rounds

t\in[T]

versus the number of rounds

t

. We take

\Pi_{t}=\Pi_{\theta^{\text{wMoM}}_{t}}

as the estimator given by the median of means linear least squares estimator. We fit the regression

1-F(\Pi,\Pi_{t})=b_{2}\left(\frac{\log t}{t}\right)^{m_{2}}

and we obtain

m_{2}=-0.996\pm 0.002

b_{2}=0.112\pm 0.007

. We note that the number of subsamples of the theoretical results is very conservative in comparison with the value we take for the simulations.

Theorem 13.

Let $\widetilde{T}\in\mathbb{N}$ and fix $T=\lceil 96\widetilde{T}\log(\widetilde{T}^{2})\rceil$ . Then given a PSMAQB with action set $\mathcal{A}=\mathcal{S}^{*}_{2}$ and environment $\Pi_{\theta}\in\mathcal{S}^{*}_{2}$ (qubits) we can apply Algorithm 1 for $d=3$ and it achieves

\displaystyle\operatorname{\mathbb{E}}\left[\textup{Regret}(T)\right]\leq C_{1% }\log\left(T\right)+C_{2}\log^{2}\left(T\right).

(103)

for some universal constants $C_{1},C_{2}\geq 0$ . Also at each time step $t\in[T]$ it outputs an estimator $\hat{\Pi}_{t}\in\mathcal{S}^{*}_{2}$ of $\Pi_{\theta}$ with infidelity scaling

\displaystyle\operatorname{\mathbb{E}}\left[1-F\left(\Pi_{\theta},\hat{\Pi}_{t% }\right)\right]\leq\frac{C_{3}\log(T)}{t},

(104)

for some universal constant $C_{3}\geq 0$ .

Proof.

In order to apply Algorithm 1 to a PSMAQB we set $d=3$ (dimension for a classical linear stochastic bandit) and the actions that we select will be given by $\Pi_{a^{\pm}_{t,i}}$ where $a^{\pm}_{t,i}$ are updated as in (64). Note that they are valid action since $a^{\pm}_{t,i}\in\mathbb{S}^{2}$ imply $\Pi_{a^{\pm}_{t,i}}\in\mathcal{S}^{*}_{2}$ . The rewards received by the algorithm follow (3.4) with the normalization given in (44). This model fits into the linear bandit with linearly vanishing variance noise model explained in Section 3.5 and thus we can apply the guarantees established in Theorem 11 and Corollary 12.

The algorithm is set with $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ batches for the MoM construction. We set $\lambda_{0}=2$ , and using $\|\theta\|_{2}=1$ we have that the constant $\beta$ given in (62) has the value

\displaystyle\beta=9\left(3\sqrt{3}+2\right)^{2}=279+108\sqrt{3}.

(105)

Then we can check that for $d=3$ the condition (79) for the input parameter $\lambda_{0}$ for Theorem 11 to hold is satisfied since

\displaystyle\lambda_{0}=2\geq\max\left\{2,\frac{1}{3}+\frac{1}{2\sqrt{6}(279+% 108\sqrt{3})}\right\}=2.

(106)

In the above we just substituted all numerical values. Then we are under the assumptions of Theorem 11 and Corollary 12 and the result follows applying both results with the relation of regrets between the classical and quantum model given in (47), the relation

\displaystyle\|\theta-\hat{\theta}_{t}\|_{2}^{2}=4\left(1-F\left(\Pi_{\theta},% \Pi_{\hat{\theta}_{t}}\right)\right),

(107)

and substituting all numerical values. We take the estimator $\hat{\theta}_{t}$ given in Theorem 11 for $d=3$ . We use also the bound $\widetilde{T}\leq T$ and reabsorb all the constants into $C_{1},C_{2},C_{3}$ . ∎

Remark 1. The constant dependence can be slightly improved taking the estimator for $\Pi_{\theta}$ as $\Pi_{\theta^{\text{wMoM}}_{t}}$ with $\theta^{\text{wMoM}}_{t}$ defined in (64).

Remark 2. The result of Theorem 13 also holds with high probability. In particular for the choice of batches $k=24\log(\widetilde{T}^{2})$ with probability at least $1-\frac{1}{\widetilde{T}}$ .

6 Regret lower bound for PSMAQB

While the algorithm for PSMAQB presented above is inspired by classical bandit theory, the lower bound on the regret that we derive is essentially based on quantum information theory. The key insight here is that a policy for PSMAQB can be viewed as a sequence of state tomographies. The expected fidelity of these tomographies is linked to the regret. Hence, existing upper bounds on tomography fidelity also provide a lower bound for the expected regret of the policy.

6.1 Average fidelity bound for pure state tomography

In its most general form, a tomography procedure takes $n$ copies of an unknown state $\Pi\in\mathcal{S}_{d}^{*}$ and performs a joint measurement on the state $\Pi^{\otimes n}$ . This is captured in the following definition. Let $(\mathcal{S}_{d}^{*},\Sigma)$ be a $\sigma$ -algebra. A tomography scheme is a positive operator-valued measure (POVM) $\mathcal{T}:\Sigma\to\operatorname{End}(\mathcal{H}^{\otimes n})$ such that $\mathcal{T}(\mathcal{S}_{d}^{*})=\Pi^{+}_{n}$ , where $\Pi^{+}_{n}$ is the symmetrization operator on $\mathcal{H}^{\otimes n}$ . For any $\rho\in\operatorname{End}(\mathcal{H}^{\otimes n})$ , this POVM gives rise to a complex-valued measure

P_{\mathcal{T},\rho}(A)=\operatorname{Tr}(\mathcal{T}(A)\rho)

(108)

for $A\in\Sigma$ . $P_{\mathcal{T},\rho}$ becomes a probability measure if $\rho$ satisfies $\rho\geq 0,\ \Pi^{+}_{n}\rho=\rho\Pi^{+}_{n}=\rho$ , and $\operatorname{Tr}\rho=1$ . Given $n$ copies of $\Pi$ , the tomography scheme produces the distribution $P_{\mathcal{T},\Pi^{\otimes n}}$ of the predicted states. Note that $\Pi^{\otimes n}$ satisfies the properties above, so $P_{\mathcal{T},\Pi^{\otimes n}}$ is indeed a probability distribution. The fidelity of this distribution is given by

F(\mathcal{T},\Pi)=\int\operatorname{Tr}(\Pi\sigma)dP_{\mathcal{T},\Pi^{% \otimes n}}(\sigma).

(109)

Finally, the average fidelity of the tomography scheme is defined as

F(\mathcal{T})=\int F(\mathcal{T},|\psi\rangle\langle\psi|)d\psi,

(110)

where the integration is taken with respect to the normalized uniform measure over all pure states. In the following, $\int d\psi$ will always imply this measure. We will provide a lower bound on $F(\mathcal{T})$ in terms of $d$ and $n$ , following the proof technique from [10]. In [10], the proof is only presented for tomography schemes producing a finite number of predictions. For our definition, we will require more general measure-theoretic tools. Before we introduce the upper bound on the fidelity, we will prove some auxiliary lemmas about the nature of the measure $P_{\mathcal{T},\rho}$ .

Lemma 14.

Let $(\Omega,\Sigma)$ be a $\sigma$ -algebra, and let $O:\Sigma\to\operatorname{End}(\widetilde{\mathcal{H}})$ be a POVM with values acting on a finite-dimensional Hilbert space $\widetilde{\mathcal{H}}$ with $\operatorname{dim}\widetilde{\mathcal{H}}=\tilde{d}$ s.t. $O(\Omega)\leq\mathbbm{1}$ , where $\mathbbm{1}$ is the identity operator. Further, let $P_{O,\sigma}:\Sigma\to\mathbb{C}$ be a complex-valued measure, defined for any $\sigma\in\operatorname{End}(\widetilde{\mathcal{H}})$ as

P_{O,\sigma}(A)=\operatorname{Tr}[O(A)\sigma].

(111)

Then, there exists a set of functions $\{f_{\sigma}\}$ indexed by $\sigma\in\operatorname{End}\widetilde{\mathcal{H}}$ that are linear w.r.t. $\sigma$ for all $\omega$ and that satisfy

f_{\sigma}:\Omega\to\mathbb{C}\quad\text{s.t.}\quad\forall A\in\Sigma\ \ P_{O,% \sigma}(A)=\int_{A}f_{\sigma}(\omega)dP_{O,\mathbbm{1}}(\omega).

(112)

We purposefully formulated this lemma with slightly more general objects than the ones used in the definition of tomography. That is, $\Omega$ does not need to be $\mathcal{S}_{d}^{*}$ , and $\widetilde{\mathcal{H}}$ does not need to be the n-th power $\mathcal{H}^{\otimes n}$ , although we will focus on this case.

Proof.

Let $\{\ket{i}\}_{i=1}^{\tilde{d}}$ be a basis of $\widetilde{\mathcal{H}}$ We will first show that $P_{O,\sigma}$ is dominated by $P_{O,\mathbbm{1}}$ for all $\sigma$ . Indeed, let $A\in\Sigma$ . Assume that $P_{O,\mathbbm{1}}(A)=0$ . This gives us

\operatorname{Tr}[O(A)\mathbbm{1}]=\operatorname{Tr}[O(A)]=0,

(113)

and, because $O(A)\geq 0$ , we also have $O(A)=0$ . Therefore,

P_{O,\sigma}(A)=\operatorname{Tr}[O(A)\sigma]=0.

(114)

Hence, for any $\ket{i},\ket{j}$ from the basis we can introduce the Radon-Nikodym derivatives $f_{\ket{i}\bra{j}}$ , which will satisfy (112). Then, for any $\sigma\in\operatorname{End}\widetilde{\mathcal{H}}$ we can define

f_{\sigma}(\omega)=\sum_{i,j=1}^{\tilde{d}}\bra{i}\sigma\ket{j}f_{\ket{i}\bra{% j}}(\omega).

(115)

These $f_{\sigma}$ are linear in $\sigma$ by definition. A direct calculation shows that they also satisfy (112). ∎

Note that for $\sigma\geq 0$ , the measure $P_{O,\sigma}$ is finite and nonnegative, but nonnegativity (and even real-valuedness) do not hold for a general $\sigma\in\operatorname{End}(\widetilde{\mathcal{H}})$ .

By our definition of $f_{\sigma}(\omega)$ , it can be written as

f_{\sigma}(\omega)=\operatorname{Tr}\left[K(\omega)\sigma\right],\quad\text{% where}\ K(\omega)=\sum_{i,j=1}^{\tilde{d}}f_{|i\rangle\langle j|}(\omega)|j% \rangle\langle i|.

(116)

As the following lemma demonstrates, $K(\omega)\geq 0$ for $P_{O,\mathbbm{1}}$ -almost every $\omega$ :

Lemma 15.

Let $(\Omega,\Sigma,\mu)$ be a measurable space and $V:\Omega\to\operatorname{End}(\widetilde{\mathcal{H}})$ be a measurable operator-valued function with values acting on a finite-dimensional Hilbert space $\widetilde{\mathcal{H}}$ such that

\forall A\in\Sigma\quad\int_{A}V(\omega)d\mu(\omega)\geq 0.

(117)

Then, $V(\omega)\geq 0$ $\mu$ -almost everywhere.

Proof.

Let $\ket{\psi}\in\widetilde{\mathcal{H}}$ and define

g_{\psi}(\omega)=\bra{\psi}V(\omega)\ket{\psi}.

(118)

By the given condition, for any $A\in\Sigma$

\int_{A}g_{\psi}(\omega)d\mu(\omega)=\bra{\psi}\int_{A}V(\omega)d\mu(\omega)% \ket{\psi}\geq 0.

(119)

It follows that $g_{\psi}(\omega)\geq 0$ $\mu$ -almost everywhere. Let

Z_{\psi}=\{\omega\in\Omega\text{ s.t. }g_{\psi}(\omega)<0\}

(120)

We have shown that $\mu(Z_{\psi})=0$ . Next, since $\widetilde{\mathcal{H}}$ is finite-dimensional, it is separable. Therefore, there exists a countable set $\{\ket{\psi_{k}}\}_{k}$ dense in $\widetilde{\mathcal{H}}$ . Let

Z=\bigcup_{k}Z_{\psi_{k}}.

(121)

We have that $\mu(Z)=0$ . Finally, let $\omega\in\Omega\setminus Z$ and $\ket{\psi}\in\widetilde{\mathcal{H}}$ . Because $\{\ket{\psi_{k}}\}$ is dense in $\widetilde{\mathcal{H}}$ , there exists a sequence $\{\ket{\psi_{k_{i}}}\}$ converging to $\ket{\psi}$ . Then,

0\leq\bra{\psi_{k_{i}}}V(\omega)\ket{\psi_{k_{i}}}\xrightarrow{i\to\infty}\bra% {\psi}V(\omega)\ket{\psi}.

(122)

Overall, we get that

\forall\omega\in\Omega\setminus Z,\ \ket{\psi}\in\widetilde{\mathcal{H}}\quad% \bra{\psi}V(\omega)\ket{\psi}\geq 0.

(123)

Together with $\mu(Z)=0$ , this gives the desired result. ∎

Now we can apply this analysis to the POVM corresponding to our tomography scheme, and get the desired upper bound on the fidelity.

Theorem 16.

For any tomography scheme $\mathcal{T}$ utilizing $n$ copies of the input state, the average fidelity is bounded by

F(\mathcal{T})\leq\frac{n+1}{n+d}.

(124)

Proof.

We will introduce the density $K(\omega)$ from (116) for our tomography scheme $\mathcal{T}$ and the corresponding measure $P_{\mathcal{T},\sigma}$ . Lemma 14 allows us to introduce for any $\sigma\in\operatorname{End}(\mathcal{H}^{\otimes n})$ the density $f_{\sigma}:\Omega\to\mathbb{C}$ s.t.

\forall A\in\Sigma\ \ P_{\mathcal{T},\sigma}(A)=\int_{A}f_{\sigma}(\omega)dP_{% \mathcal{T},\mathbbm{1}}(\omega).

(125)

This density can be written as $f_{\sigma}(\omega)=\operatorname{Tr}\left(K(\omega)\sigma\right)$ for some $K(\omega)\in\operatorname{End}(\mathcal{H}^{\otimes n})$ . $K(\omega)$ can be considered as the operator-valued density of $\mathcal{T}$ w.r.t. $P_{\mathcal{T},\mathbbm{1}}$ :

\forall A\in\Sigma\quad\mathcal{T}(A)=\int_{A}K(\omega)dP_{\mathcal{T},% \mathbbm{1}}(\omega).

(126)

Since $\mathcal{T}(A)\geq 0$ , it follows by Lemma 15 that $K(\omega)\geq 0$ for $P_{\mathcal{T},\mathbbm{1}}$ -almost all $\omega$ . Furthermore, as $\mathcal{T}(\mathcal{S}_{d}^{*})=\Pi^{+}_{n}$ , we have that for all $A\in\Sigma$ , $\mathcal{T}(A)\leq\Pi^{+}_{n}$ . Therefore, $\mathcal{T}(A)\Pi^{+}_{n}=\Pi^{+}_{n}\mathcal{T}(A)=\mathcal{T}(A)$ . This means that $\tilde{K}(\omega)=\Pi^{+}_{n}K(\omega)\Pi^{+}_{n}$ would also satisfy (126). In the following, we will without loss of generality assume that

K(\omega)=\Pi^{+}_{n}K(\omega)=K(\omega)\Pi^{+}_{n}.

(127)

With these tools at hand, we are ready to adapt the proof from [10] to the general case of POVM tomography schemes. We begin by rewriting the expression (109) for average fidelity:

	$\displaystyle F(\mathcal{T})$	$\displaystyle=\int d\psi\int dP_{\mathcal{T},(\ket{\psi}\!\bra{\psi})^{\otimes n% }}(\sigma)\operatorname{Tr}(\sigma\ket{\psi}\!\bra{\psi})$		(128)
		$\displaystyle=\int d\psi\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)\operatorname% {Tr}(\ket{\psi}\!\bra{\psi}\sigma)\operatorname{Tr}\left(K(\sigma)(\ket{\psi}% \!\bra{\psi})^{\otimes n}\right).$		(129)

Since fidelity is nonnegative and its average is bounded by 1, we can change the order of integration. Following [10], we introduce notation

\sigma_{n}(k)=\mathbbm{1}^{\otimes(k-1)}\otimes\sigma\otimes\mathbbm{1}^{% \otimes(n-k)}\in\mathcal{H}^{\otimes n}.

(130)

The product of traces in (129) can be rewritten in the following manner:

\displaystyle\begin{split}F(\mathcal{T})&=\int dP_{\mathcal{T},\mathbbm{1}}(% \sigma)\int d\psi\\ &\quad\quad\operatorname{Tr}\left((K(\sigma)\otimes\mathbbm{1})(\ket{\psi}\!% \bra{\psi})^{\otimes(n+1)}\sigma_{n+1}(n+1)\right).\end{split}

(131)

We can now take the inner integral in closed form. As shown in [10, Eq. (4)],

\int d\psi(\ket{\psi}\!\bra{\psi})^{\otimes n}=\frac{\Pi^{+}_{n}}{D_{n}},

(132)

where $D_{n}=\binom{n+d-1}{d}$ . Another useful result in this paper is [10, Eq. (8)]:

\operatorname{Tr}_{n+1}\left(\Pi^{+}_{n+1}\sigma_{n+1}(n+1)\right)=\frac{1}{n+% 1}\Pi^{+}_{n}\left(\mathbbm{1}+\sum_{k=1}^{n}\sigma_{n}(k)\right),

(133)

where $\operatorname{Tr}_{n+1}:\operatorname{End}(\mathcal{H}^{\otimes(n+1)})\to% \operatorname{End}(\mathcal{H}^{\otimes n})$ is the partial trace on the $(n+1)$ -st copy of the system. These expressions allow us to rewrite (131) as follows:

$\displaystyle F(\mathcal{T})$	$\displaystyle=\frac{1}{D_{n+1}}\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)% \operatorname{Tr}\left((K(\sigma)\otimes\mathbbm{1})\Pi^{+}_{n+1}\sigma_{n+1}(% n+1)\right)$	(134)
	$\displaystyle=\frac{1}{D_{n+1}}\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)% \operatorname{Tr}\left(K(\sigma)\operatorname{Tr}_{n+1}\left(\Pi^{+}_{n+1}% \sigma_{n+1}(n+1)\right)\right)$	(135)
	$\displaystyle=\frac{1}{(n+1)D_{n+1}}\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)% \operatorname{Tr}\left(K(\sigma)\left(\mathbbm{1}+\sum_{k=1}^{n}\sigma_{n}(k)% \right)\right).$	(136)

Finally, $\sigma_{n}(k)\leq\mathbbm{1}$ , so $\operatorname{Tr}(K(\sigma)\sigma_{n}(k))\leq\operatorname{Tr}(K(\sigma))$ , and we can bound the above as

	$\displaystyle F(\mathcal{T})\leq\frac{1}{D_{n+1}}\int dP_{\mathcal{T},\mathbbm% {1}}(\sigma)\operatorname{Tr}\left(K(\sigma)\right)$
	$\displaystyle=\frac{\operatorname{Tr}\Pi^{+}_{n}}{D_{n+1}}=\frac{D_{n}}{D_{n+1% }}=\frac{n+1}{n+d}.$		(137)

∎

6.2 Bandit policy as a sequence of tomographies

Theorem 17.

Given a $d$ -dimensional pure state general multi-armed quantum bandit we have that for any policy $\pi$ the average expected regret is bounded by

\displaystyle\int d\psi\operatorname{\mathbb{E}}_{\ket{\psi}\!\bra{\psi},\pi}% \left[\textup{Regret}(T,\pi,\ket{\psi}\!\bra{\psi})\right]\geq(d-1)\log\left(% \frac{T}{d+1}\right),

(138)

where the expectation is taken w.r.t. the measure (34) over actions taken by the bandit, and the regret is defined in (37).

The above Theorem gives $\operatorname{\mathbb{E}}\left[\text{Regret}(T)\right]=\Omega(d\log\frac{T}{d})$ . In the case of qubit environments, we have $d=2$ and ${\operatorname{\mathbb{E}}\left[\text{Regret}(T)\right]=\Omega(\log T)}$ .

Proof.

Given a policy $\pi$ , we can introduce a POVM $E_{t}:(\Sigma\times\{0,1\})^{\times t}\to\operatorname{End}(\mathcal{H}^{% \otimes t})$ such that

P^{t}_{\ket{\psi}\!\bra{\psi},\pi}(A_{1},r_{1},\dotsc,A_{t},r_{t})=% \operatorname{Tr}\left((\ket{\psi}\!\bra{\psi})^{\otimes t}E_{t}(A_{1},r_{1},% \dotsc,A_{t},r_{t})\right),

(139)

where $P^{t}_{\ket{\psi}\!\bra{\psi},\pi}$ is the probability measure defined by (34), but only for actions and rewards until step $t$ . The construction of this POVM is presented in the proof of Lemma 9 in [14]. We will also define the coordinate mapping

\Psi_{t}(\Pi_{1},r_{1},\dotsc,\Pi_{t},r_{t})=\Pi_{t},

(140)

where $\Pi_{i}\in\mathcal{A}$ are actions and $r_{i}\in\{0,1\}$ are rewards of the PSMAQB. Now we can for each step $t$ define a tomography scheme $\mathcal{T}_{t}=E_{t}\circ\Psi_{t}^{-1}$ as the pushforward POVM from $E_{t}$ to the space $(\mathcal{A},\Sigma)$ . Informally, this tomography scheme takes $t$ copies of the state, runs the policy $\pi$ on them, and outputs the $t$ -th action of the policy as the predicted state. For $A\in\Sigma$ , we can rewrite the tomography’s distribution on predictions as

	$\displaystyle P_{\mathcal{T},(\ket{\psi}\!\bra{\psi})^{\otimes t}}(A)=% \operatorname{Tr}\left(\mathcal{T}_{t}(A)(\ket{\psi}\!\bra{\psi})^{\otimes t}\right)$
	$\displaystyle=\operatorname{Tr}\left(E_{t}(\Psi_{t}^{-1}(A))(\ket{\psi}\!\bra{% \psi})^{\otimes t}\right)=\left(P^{t}_{\ket{\psi}\!\bra{\psi},\pi}\circ\Psi^{-% 1}\right)(A).$		(141)

Then, the fidelity of $\mathcal{T}_{t}$ on the input $\ket{\psi}\!\bra{\psi}$ can be rewritten as

	$\displaystyle F(\mathcal{T}_{t},\ket{\psi}\!\bra{\psi})=\int\langle\psi\|\rho\|% \psi\rangle dP_{\mathcal{T}_{t},(\ket{\psi}\!\bra{\psi})^{\otimes t}}(\rho)$		(142)
	$\displaystyle=\int\langle\psi\|\Psi_{t}(\Pi_{1},r_{1},\cdots,\Pi_{t},r_{t})\|% \psi\rangle dP^{t}_{\ket{\psi}\!\bra{\psi},\pi}(\Pi_{1},r_{1},\cdots,\Pi_{t},r% _{t})$		(143)
	$\displaystyle=\operatorname{\mathbb{E}}_{\ket{\psi}\!\bra{\psi},\pi}\left[% \langle\psi\|\Pi_{t}\|\psi\rangle\right].$		(144)

Using the bound for average tomography fidelity on $\mathcal{T}_{t}$ from Theorem 16, we can now bound the average regret of $\pi$ :

	$\displaystyle\int\operatorname{\mathbb{E}}_{\ket{\psi}\!\bra{\psi}}\left[% \textup{Regret}(T,\pi,\ket{\psi}\!\bra{\psi})\right]d\psi$		(145)
	$\displaystyle=T-\sum_{t=1}^{T}\int\ \mathbb{E}_{\ket{\psi}\!\bra{\psi}}\left[% \langle\psi\|\Pi_{t}\|\psi\rangle\right]d\psi$		(146)

	$\displaystyle=T-\sum_{t=1}^{T}F(\mathcal{T}_{t})\geq\sum_{t=1}^{T}1-\frac{t+1}% {t+d}$		(147)
	$\displaystyle=\sum_{t=1}^{T}\frac{d-1}{t+d}\geq(d-1)\log\left(\frac{T}{d+1}% \right),$		(148)

where the last inequality follows from bounding the sum with the integral of the function $f(t)=1/(t+d)$ . ∎

Acknowledgements:

JL thanks Jan Seyfried and Yanglin Hu for comments and suggestions, Erkka Happasalo for discussions about disturbance and Roberto Rubboli for many technical discussions. Mikhail Terekhov is grateful to be supported by the EDIC Fellowship from the School of Computer Science at EPFL. Josep Lumbreras and Marco Tomammichel are supported by the National Research Foundation, Singapore and A*STAR under its CQT Bridging Grant and the Quantum Engineering Programme grant NRF2021-QEP2-02-P05.

References

[1] S. Aaronson and G. N. Rothblum. “Gentle measurement of quantum states and differential privacy”. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 322–333, (2019).
[2] Y. Aharonov, D. Z. Albert, and L. Vaidman. “How the result of a measurement of a component of the spin of a spin-1/2 particle can turn out to be 100”. Physical review letters 60(14): 1351 (1988).
[3] S. Brahmachari, J. Lumbreras, and M. Tomamichel. “Quantum contextual bandits and recommender systems for quantum data”. Quantum Machine Intelligence 6(2): 58 (2024).
[4] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. “Bandits With Heavy Tail”. IEEE Transactions on Information Theory 59(11): 7711–7717 (2013).
[5] P. Busch, P. Lahti, and R. F. Werner. “Proof of Heisenberg’s error-disturbance relation”. Physical Review Letters 111(16): 160405 (2013).
[6] C. Butucea, J. Johannes, and H. Stein. “Sample-optimal learning of quantum states using gentle measurements”. arXiv preprint arXiv:2505.24587 , (2025).
[7] M. S. Byrd and N. Khaneja. “Characterization of the positivity of the density matrix in terms of the coherence vector representation”. Physical Review A 68(6): 062322, (2003).
[8] M. Guţă, J. Kahn, R. Kueng, and J. A. Tropp. “Fast state tomography with optimal error bounds”. Journal of Physics A: Mathematical and Theoretical 53(20): 204001 (2020).
[9] J. Haah, A. W. Harrow, Z. Ji, X. Wu, and N. Yu. “Sample-optimal tomography of quantum states”. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 913–925, (2016).
[10] A. Hayashi, T. Hashimoto, and M. Horibe. “Reexamination of optimal quantum state estimation of pure states”. Physical review A 72(3): 032325, (2005).
[11] R. Kueng, H. Rauhut, and U. Terstiege. “Low rank matrix recovery from rank one measurements”. Applied and Computational Harmonic Analysis 42(1): 88–116 (2017).
[12] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press (2020).
[13] M. Lerasle. “Lecture notes: Selected topics on robust statistical learning theory”. arXiv:1908.10761 , (2019).
[14] J. Lumbreras, E. Haapasalo, and M. Tomamichel. “Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states”. Quantum 6: 749, (2022).
[15] J. Lumbreras, R. C. Huang, Y. Hu, M. Gu, and M. Tomamichel. “Quantum state-agnostic work extraction (almost) without dissipation”. arXiv preprint arXiv:2505.09456 , (2025).
[16] J. Lumbreras and M. Tomamichel. “Linear bandits with polylogarithmic minimax regret”. In Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 3644–3682, (2024).
[17] A. M. Medina and S. Yang. “No-regret algorithms for heavy-tailed linear bandits”. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, page 1642–1650, (2016).
[18] M. Ozawa. “Universally valid reformulation of the Heisenberg uncertainty principle on noise and disturbance in measurement”. Physical Review A 67(4): 042105 (2003).
[19] H. Shao, X. Yu, I. King, and M. R. Lyu. “Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs”. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 8430–8439, Red Hook, NY, USA(2018).
[20] A. Winter. “Coding theorem and strong converse for quantum channels”. IEEE Transactions on information theory 45(7): 2481–2485 (2002).