Learning pure quantum states (almost) without regret

Josep Lumbreras1 josep.lumbreras@u.nus.edu    Mikhail Terekhov2 mikhail.terekhov@epfl.ch    Marco Tomamichel1,3 marco.tomamichel@nus.edu.sg 1Centre for Quantum Technologies, National University of Singapore, Singapore 2School of Computer and Communication Sciences, EPFL, Switzerland 3Department of Electrical and Computer Engineering, National University of Singapore, Singapore
(June 5, 2025)
Abstract

We initiate the study of sample-optimal quantum state tomography with minimal disturbance to the samples. Can we efficiently learn a precise description of a quantum state through sequential measurements of samples while at the same time making sure that the post-measurement state of the samples is only minimally perturbed? Defining regret as the cumulative disturbance of all samples, the challenge is to find a balance between the most informative sequence of measurements on the one hand and measurements incurring minimal regret on the other. Here we answer this question for qubit states by exhibiting a protocol that for pure states achieves maximal precision while incurring a regret that grows only polylogarithmically with the number of samples, a scaling that we show to be optimal.

1 Introduction

In this work, we approach quantum state tomography from a new angle. Given sequential access to a finite number of samples of a quantum state, our goal is not only to accurately learn a classical description of the state but also to use measurements that disturb the samples as little as possible. Generally these two goals are incompatible, and we are thus interested in tomography algorithms that find an optimal balance between them. We call this setting quantum state tomography with minimal regret.

Minimizing disturbance is important in many real-world scenarios where the samples that we use for tomography are in fact resources for other tasks — and we thus want to learn the state in a way that is as non-intrusive as possible, ensuring that the post-measurement states remain useful for their intended purpose. An example of this occurs in quantum key distribution, where tomography can be used to keep reference frames aligned during a run, but any disturbance due to tomographic measurements will induce bit errors in the correlations used to extract a secret key. Disturbance is also relevant for state-agnostic resource distillation, where resourceful states might be destroyed by tomographic measurements but learning the unknown state is crucial since optimal extraction protocols generally depend on its description.

In both cases we encounter a fundamental trade-off between exploration (learning the state) and exploitation (using the samples for another purpose). These types of trade-offs are fundamental to the study of adaptive algorithms in machine learning, and our work establishes a strong link between quantum tomography and the classical multi-armed bandit model in reinforcement learning. In fact, one of the main technical ingredients of the present work is a classical bandit algorithm by some of us [16], originally inspired by shot noise in quantum mechanics.

To illustrate this connection from a physics perspective, it helps to reflect on how measurements disturb quantum systems. A defining feature of quantum theory is that measurements generally disturb the system being measured. But what does this disturbance intuitively mean? Consider a qubit prepared in the pure state

|\psi\rangle = \sqrt{1-\epsilon^{2}}\,|0\rangle + \epsilon\,|1\rangle,    (1)

with $\epsilon\in[0,1]$. A projective measurement in the computational basis $\{|0\rangle,|1\rangle\}$ collapses the state to $|0\rangle$ with probability $1-\epsilon^{2}$, and to $|1\rangle$ with probability $\epsilon^{2}$. If $\epsilon=0$ or $\epsilon=1$, the post-measurement state coincides with the initial state with certainty: there is no disturbance. More generally, when $\epsilon\approx 0$ or $\epsilon\approx 1$, the post-measurement state remains close to the original one with high probability, indicating low disturbance. In contrast, for $\epsilon=1/\sqrt{2}$, the state is an equal superposition of the measurement basis states and the outcome is maximally uncertain; the post-measurement state is always far from the initial state, signifying maximal disturbance. This simple example illustrates how disturbance is linked to randomness: measurements that induce minimal disturbance tend to yield more deterministic outcomes, while those that induce maximal disturbance produce outcomes with higher variance.
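The link between disturbance and outcome randomness in this example can be made concrete in a few lines of code. The helper below is our own illustration (not part of any protocol in this paper): it computes the shot-noise variance of the binary outcome and the expected infidelity between the input state and the post-measurement state, both as functions of $\epsilon$.

```python
def measurement_stats(eps):
    """For |psi> = sqrt(1 - eps^2)|0> + eps|1> measured in the basis
    {|0>, |1>}: return (outcome variance, expected post-measurement
    infidelity E[1 - F])."""
    p = eps ** 2                # Born's rule: Pr[outcome "1"]
    variance = p * (1 - p)      # shot noise of the binary outcome
    # The post-measurement state is |0> w.p. 1-p (fidelity 1-eps^2)
    # or |1> w.p. p (fidelity eps^2), so E[1 - F] = 2 p (1 - p).
    infidelity = (1 - p) * (1 - (1 - eps ** 2)) + p * (1 - eps ** 2)
    return variance, infidelity
```

Both quantities vanish at $\epsilon=0$ and $\epsilon=1$ and peak at $\epsilon=1/\sqrt{2}$, matching the discussion above.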

But how can one perform low-disturbance measurements without prior knowledge of the state? With access to only a single copy, it is fundamentally impossible to design a measurement that avoids disturbing the state while still extracting useful information. However, the situation changes when multiple identical copies are available. In that case, one could strategically use some of them to gain partial information about the state and adapt future measurements to be less disturbing. This naturally leads to the central question of our work:

Given access to a finite sequence of copies of an unknown qubit state, what is the best strategy for performing single-copy projective measurements that extract as much information as possible while minimizing the overall disturbance?

The notion of disturbance is a foundational concept in quantum mechanics and has been explored from various perspectives, notably in works that reformulate the uncertainty principle to quantify the trade-off between measurement-induced disturbance and information gain [18, 5]. Another framework in which disturbance is studied is that of weak measurements, which aim to minimally disturb the quantum system while still providing partial information about it. This idea dates back to the seminal work [2], and has since become a central tool in understanding the interplay between information gain and quantum disturbance. However, performing these measurements typically comes at the cost of low information gain: the less disturbing the measurement, the less informative it is about the quantum state, making weak measurements unsuitable for tasks that demand accurate estimation. In contrast, projective measurements — which are the focus of this work — provide maximal information about the system but often cause significant disturbance, collapsing the state entirely.

In our setting, we are not concerned with the disturbance of a single copy, but rather with the cumulative disturbance across a sequence of identically prepared quantum states. Our goal is to use each copy as effectively as possible to extract information and achieve sample-optimal estimation of the underlying state. This naturally motivates the design of adaptive measurement strategies that balance the tradeoff between information gain and disturbance over time.

Another related concept is that of gentle measurements, which were formalized recently in [1], but have their roots in earlier work, notably the “gentle measurement” lemma introduced in [20]. These are measurements that guarantee, for certain sets of states, that the post-measurement state remains close to the original one, while still allowing useful information to be extracted. Although this is related to weak measurements, an important distinction is that a gentle measurement is considered weak only if it remains non-disturbing across all states (not only a set). Moreover, this framework does not address how to adaptively learn an unknown state using a sequence of projective measurements, which are typically more informative than gentle measurements [6]. Our contribution does not lie in proposing a new class of measurements, but rather in developing adaptive strategies that employ projective measurements in a way that minimizes cumulative disturbance across the sequence of quantum states.

Formally, we consider a sequential decision-making scenario in which the learner has access to $T$ independent copies of an unknown qubit state $\rho$. At each round $t\in[T]$, the learner selects a measurement direction $|\psi_t\rangle$ and performs a projective measurement in the basis $\{|\psi_t\rangle, |\psi_t^{c}\rangle\}$. The outcome $r_t\in\{0,1\}$ is sampled according to Born's rule, $\Pr[r_t=1] = p_t := \langle\psi_t|\rho|\psi_t\rangle$.
The measurement outcome determines the post-measurement state $\tilde{\psi}_t\in\{|\psi_t\rangle, |\psi_t^{c}\rangle\}$, with $r_t=1$ corresponding to the state collapsing to $|\psi_t\rangle$, and $r_t=0$ to its orthogonal complement.
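A single round of this interaction is easy to simulate. The sketch below is purely illustrative (the random seed and state-vector representation are our own choices): it samples $r_t$ via Born's rule and returns the collapsed state.

```python
import numpy as np

rng = np.random.default_rng(1)

def measure_round(rho, psi_t):
    """Projectively measure one sample of `rho` in the basis
    {|psi_t>, |psi_t^c>}; return (r_t, post-measurement state vector)."""
    p_t = float(np.real(np.conj(psi_t) @ rho @ psi_t))  # Born's rule
    r_t = int(rng.random() < p_t)
    if r_t == 1:
        return r_t, psi_t
    # Orthogonal complement of (a, b) is (-conj(b), conj(a)).
    return r_t, np.array([-np.conj(psi_t[1]), np.conj(psi_t[0])])

# When the probe is aligned with a pure state, r_t = 1 with certainty
# and the sample is undisturbed.
rho = np.array([[1, 0], [0, 0]], dtype=complex)   # |0><0|
r, post = measure_round(rho, np.array([1, 0], dtype=complex))
```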

We quantify the disturbance introduced by the learner through the cumulative expected infidelity between the unknown state $\rho$ and the resulting post-measurement state $\tilde{\psi}_t\in\{|\psi_t\rangle, |\psi^{c}_t\rangle\}$, compared to the minimal possible disturbance, which occurs when the measurement direction $\psi_t$ is aligned with the eigenvector corresponding to the largest eigenvalue of $\rho$. Formally, we define the cumulative disturbance as

\textnormal{Disturbance}(T) := \sum_{t=1}^{T} \Big( \mathbb{E}\big[1 - F(\rho, \tilde{\psi}_t)\big] - \min_{\tilde{\psi}} \mathbb{E}\big[1 - F(\rho, \tilde{\psi})\big] \Big),    (2)

where $F(\rho,\sigma) := \big(\operatorname{Tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}\big)^{2}$ denotes the quantum fidelity. Note that the second term in the definition of disturbance, $\min_{\tilde{\psi}} \mathbb{E}[1 - F(\rho, \tilde{\psi})]$, is constant across rounds and acts as a normalization, ensuring that the disturbance vanishes when the learner selects the optimal, least-disturbing measurement direction. In particular, when $\rho$ is pure, this minimum is zero, as an optimal measurement does not alter the state. The expression for the disturbance simplifies to the following closed form,

\textnormal{Disturbance}(T) = \sum_{t=1}^{T} 2\big(\lambda_{\max}(\rho) - p_t\big)\big(\lambda_{\max}(\rho) + p_t - 1\big),    (3)

where $\lambda_{\max}(\rho)$ denotes the largest eigenvalue of $\rho$.
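The closed form (3) can be verified term by term: for qubits $F(\rho, |\phi\rangle\!\langle\phi|) = \langle\phi|\rho|\phi\rangle$, so the expected post-measurement infidelity in round $t$ is $2p_t(1-p_t)$, minimized at $p_t = \lambda_{\max}(\rho)$. The snippet below is our own consistency check of this algebra, not code from the paper.

```python
def per_round_disturbance(lam_max, p_t):
    """Expected post-measurement infidelity 2 p (1 - p), minus its
    minimum over probe directions (attained at p = lam_max)."""
    return 2 * p_t * (1 - p_t) - 2 * lam_max * (1 - lam_max)

def closed_form_term(lam_max, p_t):
    """The summand of Eq. (3)."""
    return 2 * (lam_max - p_t) * (lam_max + p_t - 1)
```

The two agree identically, since $2p(1-p) - 2\lambda(1-\lambda) = 2(\lambda - p)(\lambda + p - 1)$.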

While the above notion of disturbance is defined with respect to the observed outcome, we could also define it as the cumulative infidelity between the unknown state and the average post-measurement state $\rho_t = p_t\,\psi_t + (1-p_t)\,\psi_t^{c}$, i.e.,

\textnormal{Disturbance}^{*}(T) := \sum_{t=1}^{T} \big(1 - F(\rho, \rho_t)\big).    (4)

Although these two notions of disturbance differ in their interpretation, depending on whether the measurement outcomes are observed or not, their behavior is qualitatively the same: both vanish when the measurement direction $\psi_t$ is aligned with the state $\rho$. In particular, when $\rho$ is a pure state, the two disturbances coincide. Both quantities are controlled by a simpler quantity, the regret, defined as

\textnormal{Regret}(T) := \sum_{t=1}^{T} \big(\lambda_{\max}(\rho) - \langle\psi_t|\rho|\psi_t\rangle\big),    (5)

since it can be checked that both disturbances satisfy

\textnormal{Disturbance}(T) = \Theta(\textnormal{Regret}(T)) \quad\text{and}\quad \textnormal{Disturbance}^{*}(T) = \Theta(\textnormal{Regret}(T)),    (6)

which means that minimizing disturbance is essentially equivalent to minimizing regret. Intuitively, regret remains small when the chosen probe directions are closely aligned with the dominant eigenvector of $\rho$, highlighting that learning the structure of the unknown state is necessary to keep the disturbance low. However, since we are also interested in reconstructing the state, we further require that the learner outputs a final estimate $\hat{\rho}_T$ with high fidelity to the true state after $T$ rounds. That is, in addition to minimizing cumulative disturbance or regret, the algorithm must also achieve low estimation error, defined as

\textnormal{Err}(T) := 1 - F(\rho, \hat{\rho}_T).    (7)
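When $\rho$ is pure ($\lambda_{\max}=1$), the equivalence (6) can be checked directly: the per-round disturbance $2p_t(1-p_t)$ is sandwiched between the per-round regret $1-p_t$ and twice it whenever $p_t \geq 1/2$. The small numerical check below is our own illustration of this sandwich.

```python
def regret_term(p_t):
    """Per-round regret for a pure state (lambda_max = 1)."""
    return 1 - p_t

def disturbance_term(p_t):
    """Per-round disturbance from Eq. (3) with lambda_max = 1."""
    return 2 * (1 - p_t) * p_t

# For p_t >= 1/2 (probes at least as good as a random guess),
# regret <= disturbance <= 2 * regret, so the two are Theta of each other.
```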

The regret admits a direct physical interpretation in certain quantum thermodynamic scenarios. In particular, in [15] some of the present authors established a connection between regret and the cumulative energy dissipation in quantum state-agnostic work extraction protocols.

Consider a setting in which an unknown source emits identically prepared quantum systems in a fixed state $\rho$. One can design a battery system that sequentially interacts with each quantum copy to extract work from the system and transfer energy into the battery. If the state $\rho$ is known, the protocol can be tailored to extract work optimally at every step. However, when $\rho$ is unknown, each interaction entails a probability of failure due to the mismatch between the protocol and the true state.

This mismatch can be modeled as performing a projective measurement in a guessed direction (corresponding to the control applied to the battery), and success depends on the alignment of this direction with the actual state. In this context, the regret quantifies the cumulative free energy that is wasted due to not applying the optimal work extraction strategy. This interpretation shows the necessity of performing non-invasive tomography on the fly, using each quantum copy not only as a source of energy but also as a source of information about the unknown state.

We emphasize that the notion of regret defined in (5) is not merely a formal construction, but a meaningful quantity that captures the cumulative disturbance caused by projective measurements. Moreover, it admits a concrete physical interpretation in quantum thermodynamics, where it corresponds to the total energy dissipation in agnostic work-extraction protocols. Thus, regret serves as both an operational and physical measure of performance in settings that require minimally disturbing projective measurements.

Challenges. We note that the task of minimizing the regret (5) is captured by the multi-armed quantum bandit (MAQB) framework introduced in [14] (see also [3]). This framework was the first to formalize the exploration–exploitation trade-off in online learning of quantum state properties using classical algorithms. In particular, it was shown that when the unknown state $\rho$ is mixed, the regret suffers a fundamental lower bound of order $\textnormal{Regret}(T) = \Omega(\sqrt{T})$, which is nearly tight, as there exist protocols achieving $\textnormal{Regret}(T) = \tilde{O}(\sqrt{T})$ by reducing the problem to a linear stochastic bandit [12] and applying classical bandit algorithms in that setting.

However, this lower bound does not apply when $\rho = |\psi\rangle\!\langle\psi|$ is a pure state. The reason is that the lower bound relies on non-vanishing statistical noise in the random outcomes $r_t$, whereas in our setting the shot noise vanishes as the measurement direction $|\psi_t\rangle$ approaches the target state $|\psi\rangle$. In this case, the regret simplifies to

\textnormal{Regret}(T) = \sum_{t=1}^{T} \big(1 - F(\psi, \psi_t)\big),    (8)

so minimizing regret becomes equivalent to performing online quantum state tomography with minimal infidelity, where the goal is to align each probe direction $\psi_t$ as closely as possible to the unknown pure state $\psi$. It is important to emphasize that this notion of regret minimization is not addressed by standard quantum state tomography algorithms, which typically aim to design measurement schemes optimized to output a single classical estimator $\hat{\psi}_T$ minimizing the final estimation error (7), rather than controlling the cumulative error across all measurement rounds. This motivates the following question:

  • Question 1. Can we perform single-copy sample-optimal state tomography in infidelity and at the same time achieve sub-linear regret for unknown pure states? How much adaptiveness is needed for this task?

It is important to note that adaptiveness plays a crucial role for algorithms that aim to minimise the cumulative disturbance of the post-measurement state. One could try to use one of the existing sample-optimal algorithms in the incoherent setting, such as [9, 11, 8], which for $T$ samples achieve infidelity $\textnormal{Err}(T) = O(1/T)$. However, since these algorithms use either fixed bases or randomized measurements, this inevitably leads to a linear scaling $\textnormal{Regret}(T) = \Theta(T)$. A natural next step is to consider a simple strategy with one round of adaptiveness, where we use a fraction $\alpha\in[0,1]$ of the copies for state tomography to produce a good estimate $\hat{\psi}$ of the unknown $\psi$, and use the remaining copies to measure along the estimated direction. Using sample-optimal state tomography algorithms, this leads to a regret scaling

\textnormal{Regret}(T) = O\left(\alpha T + (T - \alpha T)\,\frac{1}{\alpha T}\right),    (9)

which, optimized over $\alpha$, gives $\textnormal{Regret}(T) = O(\sqrt{T})$, but results in a sub-optimal error $\textnormal{Err}(T) = O(1/\sqrt{T})$.
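The trade-off in (9) is elementary to check numerically. The sketch below is illustrative only: it treats the $O(\cdot)$ bound as an exact cost and scans over the exploration fraction $\alpha$.

```python
import math

def etc_regret_bound(T, alpha):
    """Explore-then-commit bound of Eq. (9): alpha*T exploration rounds
    of O(1) regret each, then T - alpha*T rounds played at infidelity
    ~ 1/(alpha*T)."""
    n_explore = alpha * T
    return n_explore + (T - n_explore) / n_explore

T = 10 ** 6
# Scan alpha over a logarithmic grid from 1e-1 down to ~1e-6.
best = min(etc_regret_bound(T, 10 ** (-k / 10)) for k in range(10, 60))
# By AM-GM the bound is minimized near alpha = 1/sqrt(T), giving ~ 2*sqrt(T).
```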

In [14] it was left open whether for pure states one can find an algorithm with a scaling better than $O(\sqrt{T})$, or find a matching lower bound. Since our problem is closely related to the MAQB framework, we name it the pure-state multi-armed quantum bandit (PSMAQB) and use it to address the following question at the intersection of quantum state tomography and linear stochastic bandits [12]:

  • Question 2. Can we break the square root barrier for pure states by showing that $\textnormal{Regret}(T) = o(\sqrt{T})$ for the PSMAQB problem?

Achieving a better scaling for the PSMAQB problem would provide a physically motivated linear bandit setting where the square root barrier can be surpassed. The linear bandit model with a noise structure studied in [16] is inspired by shot noise in quantum mechanics; however, as we will discuss later, this setting does not align with the PSMAQB problem. The main challenge lies in designing a new algorithm and techniques that exploit the specific structure of the PSMAQB setting compared to the standard linear bandit problem.

2 Main results

In this work, we answer Questions 1 and 2 simultaneously in the affirmative through the following theorem.

Theorem 1 (informal).

For any unknown pure qubit state $|\psi\rangle$, we present an algorithm that achieves

\mathbb{E}\left[\textnormal{Regret}(T)\right] = O\big(\log^{2}(T)\big).    (10)

Moreover, at each time step $t\in[T]$, our algorithm outputs an online estimate $|\hat{\psi}_t\rangle$ with infidelity scaling as

\mathbb{E}\left[1 - |\langle\psi|\hat{\psi}_t\rangle|^{2}\right] = \widetilde{O}\left(\frac{1}{t}\right).    (11)

Both statements also hold with high probability.

To prove Theorem 1 we provide an almost fully adaptive quantum state tomography algorithm that uses $O(T/\log(T))$ rounds of adaptiveness. The exact algorithm and theorem can be found in Sections 4 and 5. We say that our algorithm is “online” because it is able to output at each time step $t\in[T]$ an estimator with the almost optimal infidelity scaling $O(1/t)$ up to logarithmic factors. Now we sketch the main idea of how our algorithm updates the measurements.

  1. Estimation. At each time step $t\in[T]$ we use the past information of the measurements along the directions $|\psi_{a_1}\rangle, \ldots, |\psi_{a_{t-1}}\rangle$ and the associated outcomes $r_1, \ldots, r_{t-1}\in\{0,1\}^{t-1}$ to build a high-probability confidence region $\mathcal{C}_t$ for the unknown state $|\psi\rangle$.

  2. Exploration–exploitation. A batch of measurements is performed, given by the directions of maximum uncertainty of $\mathcal{C}_t$, chosen such that they give enough information to construct $\mathcal{C}_{t+1}$ (exploration) while also minimising the regret (5) (exploitation).

For the estimation part, we work with the Bloch sphere representation of the unknown state, $\Pi = |\psi\rangle\!\langle\psi| = \frac{I + \theta\cdot\sigma}{2}$, where $\theta\in\mathbb{S}^{2}$ and $\sigma = (\sigma_x, \sigma_y, \sigma_z)$ is the standard Pauli basis. For each measurement direction $\Pi_{a_t} = |\psi_{a_t}\rangle\!\langle\psi_{a_t}|$, our algorithm performs $k$ independent measurements using the same direction, and it builds the following $k$ online weighted least squares estimators of $\theta$,

\widetilde{\theta}_{t,i} = V_t^{-1}\sum_{s=1}^{t}\frac{1}{\hat{\sigma}^{2}_{s}(a_s)}\, r_{s,i}\, a_s \quad\text{for } i\in[k],    (12)

where $r_{s,i}\in\{0,1\}$ is the outcome of the measurement (up to some renormalization) using the projector $\Pi_{a_s}$ with Bloch vector $a_s\in\mathbb{R}^{3}$, $V_t = \mathbb{I} + \sum_{s=1}^{t}\frac{1}{\hat{\sigma}^{2}_{s}(a_s)}\, a_s a_s^{\mathsf{T}}$ is the design matrix, and $\hat{\sigma}^{2}_{s}(a_s)$ is an estimator of the true variance associated to the outcome $r_s$.
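A direct (batch) implementation of estimator (12) is a few lines of linear algebra. The sketch below is our own illustration, under the assumption that the renormalized outcomes satisfy $\mathbb{E}[r_s] = a_s\cdot\theta$; the variance estimates are taken as given inputs.

```python
import numpy as np

def wls_estimate(actions, outcomes, var_hats):
    """Weighted least-squares estimator of the Bloch vector theta, Eq. (12):
    V_t = I + sum_s a_s a_s^T / sigma_s^2,
    theta = V_t^{-1} sum_s (r_s / sigma_s^2) a_s."""
    d = len(actions[0])
    V = np.eye(d)               # regularized design matrix
    b = np.zeros(d)
    for a, r, v in zip(actions, outcomes, var_hats):
        V += np.outer(a, a) / v
        b += (r / v) * a
    return np.linalg.solve(V, b)
```

Smaller variance estimates $\hat{\sigma}^2_s$ give their rounds more weight in both $V_t$ and the right-hand side, which is exactly the "boosting" mechanism exploited by the algorithm.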
The key point where we take advantage of the structure of the quantum problem is that the variance of the outcome $r_a$ associated with the action $\Pi_a$ can be bounded as $\mathbb{V}[r_a]\leq 1-|\langle\psi|\psi_a\rangle|^2$. The idea is that through a careful choice of actions we can make the weights $1/\hat{\sigma}^2_s(a_s)$ arbitrarily large and thereby “boost” the confidence of the estimators (12) in the directions $a_s$ that are close to $\theta$. However, this comes at a price: in order to obtain good concentration bounds for our estimator we need to deal with random variables that are unbounded but have finite variance. We address this issue using the recent median-of-means (MoM) ideas for online least-squares estimators introduced in [4, 17, 19]. The construction takes inspiration from the classical method of median of means [13, Chapter 3] for real random variables with unbounded support and bounded variance, but requires a non-trivial adaptation to online linear least-squares estimators.
Similarly to the real-valued case, we use the $k$ independent estimators (12) to construct the MoM estimator $\widetilde{\theta}^{\text{wMoM}}_t$, from which we can build a confidence region with concentration bounds scaling as $1-\exp(-k)$. In particular, the reference we cite for the median of means, [19], is a recent theoretical contribution that explicitly posed as an open question the range of settings where this approach can be applied; our work provides a concrete answer in the quantum domain. We give the exact construction in Section 4.1.
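The classical median-of-means trick that inspires the construction is easy to state for a scalar mean: split the samples into $k$ groups, average each group, and return the median of the group averages, which concentrates exponentially in $k$ even for heavy-tailed data. A minimal sketch on synthetic heavy-tailed data (the weighted online variant of Section 4.1 is considerably more involved):

```python
import numpy as np

def median_of_means(samples, k):
    """Split samples into k groups, average each group, return the median of the averages."""
    groups = np.array_split(samples, k)
    return float(np.median([g.mean() for g in groups]))

rng = np.random.default_rng(1)
# Student-t with 3 degrees of freedom: mean 0, finite variance, but heavy tails
x = rng.standard_t(df=3, size=9000)
mom = median_of_means(x, k=15)
```

With $k$ groups, the probability that the MoM estimate is far from the true mean decays as $\exp(-\Omega(k))$, which is the scaling quoted above for the confidence region.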

\begin{overpic}[percent,width=216.81pt]{algorithm_psmaqb.png} \put(50.0,70.0){\rotatebox{80.0}{$|\psi^{+}_{a_{t}}\rangle$}} \put(70.0,42.0){\rotatebox{0.0}{$|\psi^{-}_{a_{t}}\rangle$}} \put(75.0,85.0){\rotatebox{0.0}{$\mathcal{C}_{t}$}} \put(82.0,70.0){{$|\widehat{\psi}_{t}\rangle$}} \put(90.0,60.0){{$|\psi\rangle$}} \par\end{overpic}
Figure 1: At each time step the algorithm outputs an estimator $|\widehat{\psi}_t\rangle$ and builds a high-probability confidence region $\mathcal{C}_t$ (shaded region) around the unknown state $|\psi\rangle$ in the Bloch-sphere representation. It then uses the optimistic principle to output measurement directions $|\psi^{\pm}_{a_t}\rangle$ close to the unknown state $|\psi\rangle$, obtained by projecting onto the Bloch sphere the extreme points of the largest principal axis of $\mathcal{C}_t$. This particular choice allows optimal learning of $|\psi\rangle$ (exploration) and simultaneously minimizes the regret (exploitation).

For the exploration-exploitation part, we build on the ideas developed in [16] (see Figure 1). We give the precise action choice in Section 4.2; here we sketch the main points. We take inspiration from the optimistic principle for bandit algorithms, which in short tells us to choose the most rewarding action consistent with the available information. To use this idea, we take the confidence region built in the estimation part and select measurements that align with the (unknown) direction of $|\psi\rangle$; see Figure 1. Our algorithm achieves the relation $1-|\langle\psi|\psi_{a_t}\rangle|^2 = O\left(1/\lambda_{\min}(V_t)\right)$, where the minimum eigenvalue $\lambda_{\min}(V_t)$ quantifies the direction of maximum uncertainty (exploration) of our estimator, while the maximum eigenvalue $\lambda_{\max}(V_t)$ quantifies the amount of exploitation.
We can relate these two concepts through the theorem we formally state and prove in [16, Theorem 3], which shows that for our particular measurement choice we have $\lambda_{\min}(V_t)=\Omega(\sqrt{\lambda_{\max}(V_t)})$. Using this relation and a careful analysis, we can show that $\lambda_{\max}(V_t)=\Omega(t^2)$, which gives $\lambda_{\min}(V_t)=\Omega(t)$ and the scaling $1-|\langle\psi|\psi_{a_t}\rangle|^2=O(1/t)$.
We emphasize that the key fact that allows us to achieve the rate $\lambda_{\min}(V_t)=\Omega(t)$ is that the variance estimators $\hat{\sigma}^2_s$ can get arbitrarily close to zero, since the variance of the rewards vanishes as we select measurements close to $|\psi\rangle$.
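The geometric step described above, taking the extreme points of the largest principal axis of the confidence ellipsoid and projecting them onto the sphere, can be sketched as follows. This is a schematic of the measurement choice only; the ellipsoid $\{x:\|x-\widehat{\theta}_t\|^2_{V_t}\le\beta\}$ and its radius `beta` are assumptions of the sketch, not the exact confidence region of Section 4.1.

```python
import numpy as np

def optimistic_directions(theta_hat, V, beta):
    """Extreme points of the largest principal axis of the ellipsoid
    {x : (x - theta_hat)^T V (x - theta_hat) <= beta}, projected onto the unit sphere.
    The widest axis of the ellipsoid is the eigenvector of the smallest eigenvalue of V."""
    evals, evecs = np.linalg.eigh(V)          # eigenvalues in ascending order
    u = evecs[:, 0]                           # direction of maximum uncertainty
    half_axis = np.sqrt(beta / evals[0])      # semi-axis length along u
    tips = (theta_hat + half_axis * u, theta_hat - half_axis * u)
    return [t / np.linalg.norm(t) for t in tips]

# toy example: estimator near the north pole, anisotropic design matrix
dirs = optimistic_directions(np.array([0.0, 0.0, 0.9]), np.diag([1.0, 4.0, 9.0]), beta=1.0)
```

As the design matrix grows, the widest axis shrinks like $1/\sqrt{\lambda_{\min}(V_t)}$, so both output directions collapse onto the true state, which is exactly the mechanism behind the $O(1/\lambda_{\min}(V_t))$ relation above.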

To check the optimality of the regret, we derive a minimax expected-regret lower bound based on the optimal pure-state quantum tomography results in [10]. The proof does not follow directly from [10]; we have to adapt it to the bandit setting.

Theorem 2 (informal).

The cumulative expected regret of any strategy is lower bounded as

\begin{equation}
\mathbb{E}\left[\textup{Regret}(T)\right]=\Omega(\log T), \tag{13}
\end{equation}

where the expectation is taken over the probability distribution of rewards and actions induced by the learner's strategy, and also uniformly over the set of pure-state environments.

This result is formally derived in Section 6, where it is also generalized to the $d$-dimensional case, for which the bound becomes $\mathbb{E}\left[\textup{Regret}(T)\right]=\Omega(d\log(T/d))$. The proof relies on the fact that the individual actions of a strategy at time $t\in[T]$ can be viewed as quantum state tomographies using $t$ copies of the state. A relation between the fidelity of these tomographies and the regret of the strategy allows us to convert the fidelity upper bound from [10] into a regret lower bound. We use measure-theoretic tools to adapt the proof from [10] to a more general setting where the tomography can output an arbitrary distribution over states. We remark that this is a noteworthy result, since [16] argues that regret lower-bound techniques for classical linear bandits fail for noise models with vanishing variance.

2.1 Outlook and open problems

From a quantum state tomography perspective, our work introduces completely new techniques for the adaptive setting, such as the median-of-means online least-squares estimator and the optimistic principle. We expect these techniques to be useful in other quantum learning settings that require adaptiveness, particularly when quantum states serve as resources and must be minimally disturbed during the learning process, such as the state-agnostic work extraction protocols in [15]. Our algorithm achieves polylogarithmic regret, an exponential improvement over all previously known algorithms for quantum tomography, which can only achieve such a fidelity while accumulating linear regret. At a fundamental level, our algorithm goes beyond traditional tomography ideas and shows that it is enough to project near the state in order to optimally learn it with minimal disturbance to the samples. From a classical bandit perspective, it is surprising that the setting of learning pure quantum states gives the first non-trivial example of a linear bandit with a continuous action set that achieves polylogarithmic regret. This model motivated our classical work [16] and, jointly with the current work, we establish a bridge between the fields of quantum state tomography and linear stochastic bandits or, more generally, reinforcement learning.

We leave as an open problem the generalization of the algorithm beyond qubits. In particular, our approach relies on the one-to-one correspondence between pure qubit states and the Bloch sphere. While the chosen measurements are specifically designed to work with high-dimensional spheres, for $d>2$ this correspondence is no longer an isomorphism, and it is not straightforward to generalize the measurement directions.

We also leave open the question of whether other state tomography algorithms, especially those designed to minimize disturbance, such as weak or gentle measurements, can achieve sublinear regret, in particular polylogarithmic regret. We believe that adaptiveness plays a crucial role in any algorithm aiming to minimize the regret.

3 The model

In this section we first connect the notions of disturbance and regret, then formally state the PSMAQB problem and make a connection with a linear stochastic bandit problem. Finally, we define a slightly more general model whose key feature is that the variance of the rewards vanishes with the same behaviour as in the PSMAQB problem.

3.1 Notation

First, we introduce some basic notation and conventions. Let $[t]=\{1,2,\dots,t\}$ for $t\in\mathbb{N}$. For real vectors $x,y\in\mathbb{R}^d$ we denote their inner product by $\langle x,y\rangle=x_1y_1+\dots+x_dy_d$. Given a real vector $x\in\mathbb{R}^d$ we denote the 2-norm by $\|x\|_2$, and for a real positive semi-definite matrix $A\in\mathbb{R}^{d\times d}$, $A\geq 0$, the weighted norm by $\|x\|^2_A=\langle x,Ax\rangle$. The set corresponding to the surface of the unit sphere is $\mathbb{S}^{d-1}=\{x\in\mathbb{R}^d:\|x\|_2=1\}$.
For a real symmetric matrix $A\in\mathbb{R}^{d\times d}$ we denote by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ its maximum and minimum eigenvalues, respectively, and use the ordering $\lambda_{\min}(A)\leq\lambda_2(A)\leq\dots\leq\lambda_{d-1}(A)\leq\lambda_{\max}(A)$ for the $i$-th eigenvalue $\lambda_i(A)$ in increasing order. For a random variable $X$ (discrete or continuous) we denote by $\mathbb{E}[X]$ and $\mathbb{V}[X]$ its expectation value and variance, respectively. A random variable $X$ is $\sigma$-subgaussian if $\mathbb{E}\left[\exp(\lambda X)\right]\leq\exp\left(\lambda^2\sigma^2/2\right)$ for all $\lambda\in\mathbb{R}$.

Let $\mathcal{S}_d=\{\rho\in\mathbb{C}^{d\times d}:\operatorname{Tr}(\rho)=1,\,\rho\geq 0\}$ be the set of quantum states on a $d$-dimensional Hilbert space $\mathcal{H}=\mathbb{C}^d$, and $\mathcal{S}^*_d=\{\rho\in\mathcal{S}_d:\rho^2=\rho\}$ the set of pure states, i.e. rank-1 projectors. We will use the parametrization given in [7] of a $d$-dimensional quantum state $\rho_\theta\in\mathcal{S}_d$,

\begin{equation}
\rho_\theta=\frac{\mathbb{I}}{d}+\sqrt{\frac{d(d-1)}{2d^2}}\;\theta\cdot\sigma \tag{14}
\end{equation}

where $\theta\in\mathbb{R}^{d^2-1}$ and $\sigma=(\sigma_1,\dots,\sigma_{d^2-1})$ is a vector of orthogonal, traceless, Hermitian matrices with the normalization condition $\operatorname{Tr}(\sigma_i\sigma_j)=2\delta_{i,j}$. We will use the subscript $\theta$ on the quantum state $\rho_\theta$ to denote the vector of the parametrization (14). In particular, the normalization is chosen such that $\|\theta\|_2^2\leq 1$, with equality if $\rho_\theta$ is pure. Note that the parametrization enforces $\rho_\theta^\dagger=\rho_\theta$ and $\operatorname{Tr}(\rho_\theta)=1$. There are also some extra conditions on the vector $\theta$ regarding the positivity of the density matrix $\rho_\theta$, but we will not use them.
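For qubits ($d=2$) the coefficient in (14) equals $1/2$ and $\sigma$ can be taken to be the vector of Pauli matrices, recovering the usual Bloch-sphere parametrization $\rho_\theta=(\mathbb{I}+\theta\cdot\sigma)/2$. A quick numerical sanity check of the stated properties (unit trace, Hermiticity, purity when $\|\theta\|_2=1$):

```python
import numpy as np

# Pauli matrices: orthogonal, traceless, Hermitian, Tr(sigma_i sigma_j) = 2 delta_ij
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def rho_from_bloch(theta):
    """rho_theta = I/d + sqrt(d(d-1)/(2 d^2)) theta . sigma for d = 2, cf. (14)."""
    d = 2
    coeff = np.sqrt(d * (d - 1) / (2 * d**2))   # equals 1/2 for qubits
    return np.eye(d) / d + coeff * (theta[0] * sx + theta[1] * sy + theta[2] * sz)

theta = np.array([0.6, 0.0, 0.8])   # unit vector, so rho_theta is a pure state
rho = rho_from_bloch(theta)
```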
For two quantum states $\rho,\sigma\in\mathcal{S}_d$ the fidelity is $F(\rho,\sigma)=\left(\operatorname{Tr}\sqrt{\sqrt{\sigma}\rho\sqrt{\sigma}}\right)^2$ and the infidelity is $1-F(\rho,\sigma)$. For a Hilbert space $\mathcal{H}$, the set of linear operators on it will be denoted by $\operatorname{End}(\mathcal{H})$. The joint state of a system consisting of $n$ copies of a pure state $\Pi_\theta\in\mathcal{S}_d^*$ is given by the $n$-th tensor power $\Pi_\theta^{\otimes n}\in\operatorname{End}(\mathcal{H}^{\otimes n})$. Using Dirac notation, we can express $\Pi_\theta=|\psi_\theta\rangle\!\langle\psi_\theta|$ for some normalized $|\psi_\theta\rangle\in\mathcal{H}$.
Then, the span of all $n$-copy states of the form $|\psi_\theta\rangle^{\otimes n}$ is called the symmetric subspace of $\mathcal{H}^{\otimes n}$, denoted by $\mathcal{H}^{\otimes n}_{+}$. Its dimension is $D_n=\binom{n+d-1}{d-1}$. The symmetrization operator $\Pi^{+}_{n}\in\operatorname{End}(\mathcal{H}^{\otimes n})$ is the projector onto $\mathcal{H}^{\otimes n}_{+}$.
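The dimension $\binom{n+d-1}{d-1}$ of the symmetric subspace can be verified directly: the symmetrization operator is the average of all permutation operators on $(\mathbb{C}^d)^{\otimes n}$, and the trace of a projector equals the dimension of its range. A small sketch for qubits:

```python
import numpy as np
from itertools import permutations
from math import comb, factorial

def symmetrizer(n, d):
    """Projector onto the symmetric subspace of (C^d)^{otimes n}:
    the average of all n! permutation operators."""
    dim = d ** n
    P = np.zeros((dim, dim))
    for perm in permutations(range(n)):
        M = np.zeros((dim, dim))
        for idx in range(dim):
            digits = [(idx // d**j) % d for j in range(n)]          # site labels of idx
            new_idx = sum(digits[perm[j]] * d**j for j in range(n)) # permuted sites
            M[new_idx, idx] = 1.0
        P += M
    return P / factorial(n)

P = symmetrizer(3, 2)              # three qubits
rank = int(round(np.trace(P)))     # rank of a projector equals its trace
```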

3.2 Cumulative disturbance and regret

Here we formally show that the notions of disturbance (2) and (4) for qubits are indeed controlled by the regret defined in (6).

Lemma 3.

Consider the notion of disturbance defined in (2); then we have

\begin{equation}
\textnormal{Disturbance}(T)=\Theta(\textnormal{Regret}(T)), \tag{15}
\end{equation}

where regret is defined as in (6).

Proof.

First, we define $p_t=\langle\psi_t|\rho|\psi_t\rangle$ and assume without loss of generality that $p_t\geq\frac{1}{2}$. Using $\psi^c=\mathbb{I}-\psi$ we can directly compute

\begin{equation}
\mathbb{E}[1-F(\rho,\tilde{\psi}_t)]=2p_t\left(1-p_t\right), \tag{16}
\end{equation}

and using $p_t\leq\lambda_{\max}(\rho)$ we have

\begin{equation}
\min_{\psi}\mathbb{E}[1-F(\rho,\tilde{\psi})]=2\lambda_{\max}(\rho)(1-\lambda_{\max}(\rho)). \tag{17}
\end{equation}

Then, using the identity $1=x^2+(1-x)^2+2x(1-x)$, we have

\begin{equation}
2p_t\left(1-p_t\right)-2\lambda_{\max}(\rho)(1-\lambda_{\max}(\rho))=2(\lambda_{\max}(\rho)-p_t)(\lambda_{\max}(\rho)+p_t-1). \tag{18}
\end{equation}

By using $p_t\leq\lambda_{\max}(\rho)\leq 1$ we have

\begin{equation}
2(\lambda_{\max}(\rho)-p_t)(\lambda_{\max}(\rho)+p_t-1)\leq 2(\lambda_{\max}(\rho)-p_t), \tag{19}
\end{equation}

which leads to $\textnormal{Disturbance}(T)\leq 2\,\textnormal{Regret}(T)$. The converse bound then follows simply by using $p_t\geq\frac{1}{2}$ in the second factor of (18). ∎
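The elementary algebra in this proof is easy to check numerically. The following sketch samples admissible pairs $(p_t,\lambda_{\max}(\rho))$, with $\frac12\le p_t\le\lambda_{\max}(\rho)\le 1$ as in the proof, and verifies the identity (18) and the bound (19):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = rng.uniform(0.5, 1.0, size=1000)      # lambda_max(rho) lies in [1/2, 1] for a qubit
p = rng.uniform(0.5, lam)                   # 1/2 <= p_t <= lambda_max(rho)

lhs = 2 * p * (1 - p) - 2 * lam * (1 - lam)      # per-step disturbance minus its minimum, (16)-(17)
rhs = 2 * (lam - p) * (lam + p - 1)              # factored form, identity (18)

identity_gap = float(np.max(np.abs(lhs - rhs)))  # should vanish up to rounding
bound_slack = float(np.max(rhs - 2 * (lam - p))) # should be <= 0, the bound (19)
```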

Lemma 4.

Consider the notion of disturbance defined in (4); then we have

\begin{equation}
\textnormal{Disturbance}^{*}(T)=\Theta(\textnormal{Regret}(T)), \tag{20}
\end{equation}

where regret is defined as in (6).

Proof.

First, we define $p_t=\langle\psi_t|\rho|\psi_t\rangle$ and assume without loss of generality that $p_t\geq\frac{1}{2}$. Then, using the closed formula for the fidelity of qubit states and $\rho_t=p_t\psi+(1-p_t)\psi^c_t$, we have

\begin{align}
F(\rho,\rho_t) &= \operatorname{Tr}(\rho\rho_t)+2\sqrt{\det\rho\,\det\rho_t} \tag{21}\\
&= p_t^2+(1-p_t)^2+2\sqrt{\lambda_{\max}(\rho)(1-\lambda_{\max}(\rho))\,p_t(1-p_t)}. \tag{22}
\end{align}

Using that $p_t\leq\lambda_{\max}(\rho)$ we have

\begin{align}
F(\rho,\rho_t) &\geq p_t^2+(1-p_t)^2+2p_t(1-\lambda_{\max}(\rho)) \tag{23}\\
&= 1+2p_t(p_t-\lambda_{\max}(\rho)). \tag{24}
\end{align}

Then we can upper bound the infidelity as

\begin{equation}
1-F(\rho,\rho_t)\leq 2p_t(\lambda_{\max}(\rho)-p_t)\leq 2(\lambda_{\max}(\rho)-p_t), \tag{25}
\end{equation}

which leads to $\textnormal{Disturbance}^*(T)\leq 2\,\textnormal{Regret}(T)$. For the other bound we can use the arithmetic-geometric mean inequality $2\sqrt{xy}\leq x+y$ in (21), and we have

\begin{align}
F(\rho,\rho_t) &\leq p_t^2+(1-p_t)^2+p_t(1-p_t)+\lambda_{\max}(\rho)(1-\lambda_{\max}(\rho)) \tag{26}\\
&= 1+(p_t-\lambda_{\max}(\rho))(p_t+\lambda_{\max}(\rho)-1)\leq 1+(p_t-\lambda_{\max}(\rho)), \tag{27}
\end{align}

where we used $p_t, \lambda_{\max}(\rho) \leq 1$. This gives

1 - F(\rho,\rho_t) \geq \lambda_{\max}(\rho) - p_t,    (28)

which leads to $\textnormal{Disturbance}^*(T) \geq \textnormal{Regret}(T)$. ∎

3.3 Multi-armed quantum bandit for pure states

The model we are interested in is the general multi-armed quantum bandit model described in [14, Section 2.3], with the action set consisting of all rank-1 projectors and with pure-state environments. For completeness, we state the basic definitions for this particular case in arbitrary dimension.

Definition 5.

Let $d \in \mathbb{N}$. A $d$-dimensional pure state multi-armed quantum bandit (PSMAQB) is given by a measurable space $(\mathcal{A}, \Sigma)$, where $\mathcal{A} = \mathcal{S}_d^*$ is the action set and $\Sigma$ is a $\sigma$-algebra of subsets of $\mathcal{A}$. The bandit is in an environment, an unknown quantum state $\Pi_\theta \in \mathcal{S}_d^*$.

The interaction with the PSMAQB is carried out by a learner that interacts sequentially over $t \in [T]$ rounds with the unknown environment $\Pi_\theta \in \mathcal{S}^*_d$. At each time step $t \in [T]$:

  1.

    The learner selects an action $\Pi_{a_t} \in \mathcal{A}$.

  2.

    The learner performs a measurement on the unknown environment $\Pi_\theta$ using the two-outcome POVM $\{\Pi_{a_t}, I_{d\times d} - \Pi_{a_t}\}$ and receives a reward $r_t \in \{0,1\}$ sampled according to Born's rule, i.e.,

    \Pr_{\Pi_\theta}(r_t \mid \Pi_{a_t}) =
    \begin{cases}
    \operatorname{Tr}(\Pi_\theta \Pi_{a_t}) & \text{if } r_t = 1,\\
    1 - \operatorname{Tr}(\Pi_\theta \Pi_{a_t}) & \text{if } r_t = 0,\\
    0 & \text{else}.
    \end{cases}    (29)

We note that the reward at time step $t$ after selecting $\Pi_{a_t} \in \mathcal{A}$ can be written as

r_t = \operatorname{Tr}(\Pi_\theta \Pi_{a_t}) + \epsilon_t,    (30)

where $\epsilon_t$ is a centered Bernoulli random variable taking values $\epsilon_t \in \{1 - \operatorname{Tr}(\Pi_\theta \Pi_{a_t}),\, -\operatorname{Tr}(\Pi_\theta \Pi_{a_t})\}$ such that

\mathbb{E}[\epsilon_t \mid \mathcal{F}_{t-1}] = 0,    (31)
\mathbb{V}[\epsilon_t \mid \mathcal{F}_{t-1}] = \operatorname{Tr}(\Pi_\theta \Pi_{a_t})\big(1 - \operatorname{Tr}(\Pi_\theta \Pi_{a_t})\big),    (32)

where $\mathcal{F}_{t-1} := \sigma(r_1, \Pi_{a_1}, \ldots, r_{t-1}, \Pi_{a_{t-1}}, \Pi_{a_t})$ is the $\sigma$-algebra generated by the history up to and including the choice of the action at time $t$.
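As a concrete illustration (ours, not part of the formal model), the reward model (29)–(32) can be simulated directly: for pure states the Born probability is just the squared overlap of the state vectors. The helper name below is our own choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_random_pure_state(d, rng):
    """Sample a Haar-random pure state vector in C^d."""
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

d = 2
psi_theta = haar_random_pure_state(d, rng)   # unknown environment |psi_theta>
psi_a = haar_random_pure_state(d, rng)       # measured action |psi_a>

# Born's rule (29): Pr[r_t = 1] = Tr(Pi_theta Pi_a) = |<psi_theta|psi_a>|^2
p = abs(np.vdot(psi_theta, psi_a)) ** 2
rewards = (rng.random(200_000) < p).astype(float)

# Noise eps_t = r_t - p, cf. (30); its conditional moments are (31)-(32)
eps = rewards - p
print(abs(eps.mean()) < 0.01)                # mean is 0, cf. (31)
print(abs(eps.var() - p * (1 - p)) < 0.01)   # variance is p(1-p), cf. (32)
```

Both checks print `True`: the empirical noise moments match (31) and (32) up to sampling error.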

Formally, the learner is described by a policy.

Definition 6.

A policy $\pi$ is a set of conditional probability measures $\{\pi_t\}_{t \in \mathbb{N}}$ on the action set $\mathcal{A}$ of the form

\pi_t(\cdot \mid r_1, \Pi_{a_1}, \ldots, r_{t-1}, \Pi_{a_{t-1}}) : \Sigma \to [0,1].    (33)

Then the policy interacting with the environment $\Pi_\theta$ defines the probability measure over the set of actions and rewards $P_{\Pi_\theta, \pi} : (\Sigma \times \{0,1\})^{\times T} \to [0,1]$ as

\int \cdots \int \Pr_{\Pi_\theta}(r_T \mid \Pi_{a_T})\, \pi_T(d\Pi_{a_T} \mid r_1, \Pi_{a_1}, \ldots, r_{T-1}, \Pi_{a_{T-1}}) \cdots \Pr_{\Pi_\theta}(r_1 \mid \Pi_{a_1})\, \pi_1(d\Pi_{a_1}),    (34)

where the integrals are taken with respect to the corresponding subsets of actions.

The goal of the learner is to efficiently learn a classical description of the environment $\Pi_\theta = |\psi_\theta\rangle\!\langle\psi_\theta| \in \mathcal{S}_d^*$ while minimizing the disturbance of the post-measurement state $\tilde{\psi}_t \in \mathcal{S}^*_d$, which is distributed according to

\Pr(R_t \mid \Pi_{a_t}) =
\begin{cases}
\operatorname{Tr}(\Pi_\theta \Pi_{a_t}) & \text{if } R_t = \Pi_{a_t},\\
1 - \operatorname{Tr}(\Pi_\theta \Pi_{a_t}) & \text{if } R_t = \Pi^c_{a_t},\\
0 & \text{else},
\end{cases}    (35)

where $\Pi_{a_t} = |\psi_{a_t}\rangle\!\langle\psi_{a_t}|$ and

\Pi^c_{a_t} = |\psi^c_{a_t}\rangle\!\langle\psi^c_{a_t}|, \quad |\psi^c_{a_t}\rangle = \frac{|\psi_\theta\rangle - \langle\psi_{a_t}|\psi_\theta\rangle\, |\psi_{a_t}\rangle}{\sqrt{1 - |\langle\psi_\theta|\psi_{a_t}\rangle|^2}}.    (36)
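The post-measurement state (36) is just the normalized projection of the environment onto the complement of the measured action. A quick numerical sanity check (our own illustration) confirms that $|\psi^c_{a_t}\rangle$ is a normalized state orthogonal to $|\psi_{a_t}\rangle$:

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_state(d, rng):
    """Random pure state vector in C^d."""
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

d = 3
psi_theta = rand_state(d, rng)
psi_a = rand_state(d, rng)

# Construct |psi^c> as in (36): project out |psi_a> and renormalize
overlap = np.vdot(psi_a, psi_theta)          # <psi_a|psi_theta>
psi_c = (psi_theta - overlap * psi_a) / np.sqrt(1 - abs(overlap) ** 2)

print(np.isclose(np.linalg.norm(psi_c), 1.0))        # normalized state
print(np.isclose(abs(np.vdot(psi_a, psi_c)), 0.0))   # orthogonal to |psi_a>
```

Both checks print `True`, as follows directly from the construction.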

The task of the learner is captured by the cumulative regret, our figure of merit, which is defined as follows.

Definition 7.

Given a $d$-dimensional pure state multi-armed quantum bandit, a policy $\pi$, an unknown environment $\Pi_\theta \in \mathcal{S}^*_d$, and $T \in \mathbb{N}$, the cumulative regret is defined as

\textup{Regret}(T, \pi, \Pi_\theta) := \sum_{t=1}^{T} \big(1 - \operatorname{Tr}(\Pi_\theta \Pi_{a_t})\big).    (37)

We note that the regret quantifies the cumulative infidelity between the unknown environment and the post-measurement states. This notion of regret is consistent with the one introduced in the introduction (6), since $\lambda_{\max}(\Pi_\theta) = 1$.

Note that minimizing the regret (37) indeed requires selecting actions $\Pi_{a_t}$ with high fidelity with respect to the environment (learning the environment) while at the same time minimizing the cumulative infidelity of the post-measurement states. In general, the goal of the learner is to minimize the expected cumulative regret $\mathbb{E}_{\Pi_\theta}[\text{Regret}(T, \pi, \Pi_\theta)]$, where the expectation $\mathbb{E}_{\Pi_\theta}$ is taken over the probability measure (34). When the context is clear, we will use the notation $\text{Regret}(T)$. Moreover, the expression (37) coincides with the notion of regret introduced for general multi-armed quantum bandits [14, Section 2.3]. For that reason we refer to the PSMAQB problem as the task of finding a policy that minimizes the expected regret $\mathbb{E}_{\Pi_\theta}[\text{Regret}(T, \pi, \Pi_\theta)]$. The meaningful goal is to achieve regret sublinear in $T$, since $\text{Regret}(T) \leq T$ holds trivially for any policy.

3.4 Classical model

In order to study the PSMAQB it is helpful to cast it in the linear stochastic bandit framework. The idea is to express the actions and the unknown quantum state as real vectors using the parametrization (14).

In the linear stochastic bandit model, the action set is a subset of real vectors, i.e., $\mathcal{A} \subseteq \mathbb{R}^d$, and the reward at time step $t \in [T]$ after selecting action $a_t \in \mathcal{A}$ is given by

r_t = \langle a_t, \theta \rangle + \epsilon_t,    (38)

where $\theta \in \mathbb{R}^d$ is the unknown parameter and $\epsilon_t$ is some bounded $\sigma$-subgaussian noise that in general can depend on $\theta$ and $a_t$. The regret for this model is given by

\text{Regret}_{cl}(T, \pi, \theta) := \sum_{t=1}^{T} \max_{a \in \mathcal{A}} \langle \theta, a \rangle - \langle \theta, a_t \rangle,    (39)

where the policy $\pi$ is defined analogously to Definition 6. We use the subscript $cl$ to distinguish the classical model from the quantum one.

In order to express the PSMAQB model as a linear stochastic bandit we can use the parametrization (14) and express the expected reward for action $\Pi_{a_t} \in \mathcal{S}^*_d$ as

\operatorname{Tr}(\Pi_{a_t} \Pi_\theta) = \frac{1}{d}\big(1 + (d-1)\langle a_t, \theta \rangle\big).    (40)

Inverting the above expression we have

\langle a_t, \theta \rangle = \frac{d \operatorname{Tr}(\Pi_\theta \Pi_{a_t}) - 1}{d-1}.    (41)
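For qubits ($d = 2$) the parametrization (14) reduces to the usual Bloch-vector form $\Pi_a = (I + a \cdot \vec{\sigma})/2$, and (40) can be verified directly. The following sketch assumes this standard qubit parametrization:

```python
import numpy as np

# Pauli matrices
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I = np.eye(2, dtype=complex)

def proj_from_bloch(a):
    """Rank-1 projector Pi_a = (I + a_x X + a_y Y + a_z Z)/2 for unit a."""
    return (I + a[0] * X + a[1] * Y + a[2] * Z) / 2

rng = np.random.default_rng(2)
a = rng.normal(size=3); a /= np.linalg.norm(a)
th = rng.normal(size=3); th /= np.linalg.norm(th)

lhs = np.trace(proj_from_bloch(a) @ proj_from_bloch(th)).real
rhs = (1 + np.dot(a, th)) / 2    # eq. (40) with d = 2
print(np.isclose(lhs, rhs))
```

The check prints `True` for any pair of unit Bloch vectors, confirming (40) in the qubit case.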

Let us revisit the regret expression and use the above identities to connect the quantum and classical versions of the regret. We denote by $\Pi_{a^*} = \arg\max_{\Pi \in \mathcal{A}} \operatorname{Tr}(\Pi \Pi_\theta)$ the optimal action and recall that $\operatorname{Tr}(\Pi_{a^*} \Pi_\theta) = 1$. Then we have

\text{Regret}(T, \pi, \Pi_\theta) = \sum_{t=1}^{T} \operatorname{Tr}(\Pi_{a^*} \Pi_\theta) - \operatorname{Tr}(\Pi_{a_t} \Pi_\theta)
                                  = \frac{d-1}{d} \sum_{t=1}^{T} \langle \theta, a^* - a_t \rangle.

Note that by the normalization (14), the real vectors corresponding to $\rho_\theta$ and $\Pi_{a_t}$ are unit vectors, $\|\theta\|_2 = \|a_t\|_2 = 1$. Thus, since $a^* = \theta$, the regret can be written as

\text{Regret}(T, \pi, \Pi_\theta) = \frac{d-1}{d} \sum_{t=1}^{T} \big(1 - \langle \theta, a_t \rangle\big)    (42)
                                  = \frac{d-1}{2d} \sum_{t=1}^{T} \|\theta - a_t\|_2^2.    (43)
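The step from (42) to (43) uses only that $\theta$ and $a_t$ are unit vectors, so that $1 - \langle \theta, a_t \rangle = \tfrac{1}{2}\|\theta - a_t\|_2^2$; a one-line numerical check of this identity:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two unit vectors playing the roles of theta and a_t (dimension d^2 - 1 = 8 for d = 3)
theta = rng.normal(size=8); theta /= np.linalg.norm(theta)
a = rng.normal(size=8); a /= np.linalg.norm(a)

# For unit vectors, ||theta - a||^2 = 2 - 2<theta, a>, which turns (42) into (43)
print(np.isclose(1 - np.dot(theta, a), np.linalg.norm(theta - a) ** 2 / 2))
```

The check prints `True` for any pair of unit vectors.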

Now we want to formulate a classical bandit whose environment and actions are given by the real vectors that parameterize the quantum states via (14). In order for the expected reward to be linear in $\theta$ and $a_t$, it suffices to define a renormalized reward as

\tilde{r}_t = \frac{d r_t - 1}{d-1} \in \left\{1, \frac{-1}{d-1}\right\},    (44)

where $r_t \in \{0,1\}$ is the reward of the quantum model given by (29). Using $\mathbb{E}[r_t \mid \mathcal{F}_{t-1}] = \operatorname{Tr}(\Pi_{a_t} \Pi_\theta)$ and (40), it is easy to see that

\mathbb{E}[\tilde{r}_t \mid \mathcal{F}_{t-1}] = \langle \theta, a_t \rangle,    (45)

where naturally we use $\mathcal{F}_{t-1} = \{\tilde{r}_1, a_1, \ldots, \tilde{r}_{t-1}, a_{t-1}, a_t\}$. Thus, we can write the reward in the form (38),

\tilde{r}_t = \langle \theta, a_t \rangle + \epsilon_t, \quad \mathbb{E}[\epsilon_t \mid \mathcal{F}_{t-1}] = 0,
\mathbb{V}[\epsilon_t \mid \mathcal{F}_{t-1}] = \frac{\big(1 - \langle \theta, a_t \rangle\big)\big(1 + (d-1)\langle \theta, a_t \rangle\big)}{d-1},

where the expectation and variance follow from a direct calculation using (32) and (44). Then we can study a $d$-dimensional PSMAQB as a linear stochastic bandit by choosing the action set

\mathcal{A}^{\text{quantum}}_d := \{a \in \mathbb{R}^{d^2-1} : \Pi_a \in \mathcal{S}^*_d\}    (46)

with unknown parameter $\theta \in \mathbb{R}^{d^2-1}$ such that $\Pi_\theta \in \mathcal{S}^*_d$. The regret of this linear model is given by $\text{Regret}_{cl} = \frac{1}{2}\sum_{t=1}^{T}\|\theta - a_t\|_2^2$, and it relates to the quantum model as follows:

\[
\text{Regret}(T, \pi, \Pi_\theta) = \frac{d-1}{d}\,\text{Regret}_{cl}(T, \pi, \theta), \tag{47}
\]

where the same strategy $\pi$ appears on both sides, since the actions of the two bandits can be identified through the parametrization (14) and the rewards are related by (44). When the context is clear we simply write $\text{Regret}(T)$ for both the quantum and the classical model.

3.5 Linear bandit with linearly vanishing variance noise

In [16] some of the present authors introduced the framework of stochastic linear bandits with linearly vanishing noise: a linear bandit with action set $\mathcal{A} = \mathbb{S}^d$, unknown parameter $\theta \in \mathbb{S}^d$ and reward $r_t = \langle\theta, a_t\rangle + \epsilon_t$, where $\epsilon_t$ is $\sigma_t$-subgaussian with $\operatorname{\mathbb{E}}[\epsilon_t \mid \mathcal{F}_{t-1}] = 0$ and satisfies the vanishing-noise property $\sigma_t^2 \leq 1 - \langle\theta, a_t\rangle^2$. In order to study a PSMAQB we relax the subgaussian assumption and replace it by the following condition on the noise:

\[
\operatorname{\mathbb{E}}\left[\epsilon_t \mid \mathcal{F}_{t-1}\right] = 0, \qquad \operatorname{\mathbb{V}}\left[\epsilon_t \mid \mathcal{F}_{t-1}\right] \leq 1 - \langle\theta, a_t\rangle^2. \tag{48}
\]

As in the classical model of the previous section, using that $\max_{a \in \mathcal{A}} \langle\theta, a\rangle = 1$, the regret is given by

\[
\text{Regret}(T) = \sum_{t=1}^{T} 1 - \langle\theta, a_t\rangle = \frac{1}{2}\sum_{t=1}^{T}\|\theta - a_t\|_2^2. \tag{49}
\]

We note that a strategy minimizing the regret of the above model also works for a $d=2$ PSMAQB with unknown $\Pi_\theta \in \mathcal{S}^*_2$, via the relations of the previous sections, since

\[
\mathcal{A}_2^{\text{quantum}} = \left\{a \in \mathbb{R}^3 : \|a\|_2 = 1\right\} = \mathbb{S}^2, \tag{50}
\]

and the variance of the PSMAQB from Section 3.4 fulfills the relation (48).
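For unit vectors the two expressions in (49) coincide, since $\|\theta - a\|_2^2 = 2 - 2\langle\theta, a\rangle$. As a quick sanity check, the identity can be verified numerically for random points on $\mathbb{S}^2$; the following NumPy sketch (variable names are ours) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    # Project a vector onto the unit sphere.
    return v / np.linalg.norm(v)

# Unknown Bloch vector (qubit case, d = 2) and a few unit-vector actions.
theta = unit(rng.normal(size=3))
actions = [unit(rng.normal(size=3)) for _ in range(5)]

# Instantaneous regret: 1 - <theta, a> = ||theta - a||^2 / 2 on the sphere.
for a in actions:
    lhs = 1.0 - theta @ a
    rhs = 0.5 * np.linalg.norm(theta - a) ** 2
    assert abs(lhs - rhs) < 1e-12
```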

4 Algorithm for bandits with linearly vanishing variance noise

In this section we present an algorithm for the linear bandit model of Section 3.5, based on the LinUCB-VN algorithm studied in [16] for linear bandits with linearly vanishing noise. Later we show how to use this algorithm for the qubit PSMAQB problem.

4.1 Median of means for an online least squares estimator

First we discuss the median of means method for the online linear least squares estimator introduced in [19]. We will use this estimator later to design a strategy that minimizes the regret of the model introduced in Section 3.5. We need this estimator because the analysis of our algorithm requires concentration bounds for linear least squares estimators where the random variables have bounded variance but a possibly unbounded subgaussian parameter. Bounded variance is a weaker condition than the usual assumption of bounded subgaussian noise; however, we can recover similar concentration bounds for the estimator by taking a median of means.

To build the median of means online least squares estimator for linear bandits we need to sample $k$ independent rewards for each action. Specifically, given an action set $\mathcal{A} \subset \mathbb{R}^d$ and an unknown parameter $\theta \in \mathbb{R}^d$, at each time step $t$ we select an action $a_t \in \mathcal{A}$ and sample $k$ independent rewards using $a_t$, distributed as

\[
r_{t,i} = \langle\theta, a_t\rangle + \epsilon_{t,i} \quad \text{for } i \in [k], \tag{51}
\]

for some noise such that $\operatorname{\mathbb{E}}[\epsilon_{t,i} \mid \mathcal{F}_{t-1}] = 0$. We refer to $k$ as the number of subsamples per time step. Then at time step $t$ we define $k$ least squares estimators as

\[
\widetilde{\theta}_{t,i} = V_t^{-1}\sum_{s=1}^{t} r_{s,i}\, a_s \quad \text{for } i \in [k], \tag{52}
\]

where $V_t$ is the design matrix defined as

\[
V_t = \lambda\mathbb{I} + \sum_{s=1}^{t} a_s a_s^{\mathsf{T}}, \tag{53}
\]

with $\lambda > 0$ a parameter that ensures invertibility of $V_t$. We note that the design matrix is independent of $i$. The median of means least squares estimator (MOMLSE) is then defined as

\[
\widetilde{\theta}_t^{\text{MoM}} := \widetilde{\theta}_{t,k^*} \quad \text{where } k^* = \operatorname*{arg\,min}_{j \in [k]} y_j, \tag{54}
\]

where

\[
y_j = \operatorname{median}\left\{\|\widetilde{\theta}_{t,j} - \widetilde{\theta}_{t,i}\|_{V_t} : i \in [k] \setminus \{j\}\right\} \quad \text{for } j \in [k]. \tag{55}
\]
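As an illustration, the estimators (52)–(53) and the selection rule (54)–(55) can be sketched in a few lines of NumPy. This is our own minimal implementation for intuition, not code from the paper:

```python
import numpy as np

def momlse(actions, rewards, lam=1.0):
    """Median of means least squares estimator, following (52)-(55).

    actions: (t, d) array of played actions a_1, ..., a_t
    rewards: (t, k) array with k independent rewards per time step
    """
    t, d = actions.shape
    k = rewards.shape[1]
    # Design matrix V_t = lam * I + sum_s a_s a_s^T (independent of i).
    V = lam * np.eye(d) + actions.T @ actions
    # k least squares estimators, one per subsample index i; shape (k, d).
    thetas = np.linalg.solve(V, actions.T @ rewards).T

    def vt_norm(x):
        # Norm induced by the design matrix, ||x||_{V_t} = sqrt(x^T V_t x).
        return np.sqrt(x @ V @ x)

    # y_j: median V_t-distance from estimator j to the other estimators.
    y = [np.median([vt_norm(thetas[j] - thetas[i])
                    for i in range(k) if i != j])
         for j in range(k)]
    return thetas[int(np.argmin(y))]  # the MOMLSE
```

The point of the median step is robustness: even if a few of the $k$ subsample estimators are corrupted by heavy-tailed outliers, the returned estimator still concentrates around $\theta$.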

By the results in [19], the above estimator has the following concentration property around the true parameter.

Lemma 8 (Lemmas 2 and 3 in [19]).

Let $\widetilde{\theta}_t^{\textup{MoM}}$ be the MOMLSE defined in (54) with $k$ subsamples, rewards $\{r_{s,i}\}_{(s,i) \in [t] \times [k]}$ and corresponding actions $\{a_s\}_{s \in [t]}$. Assume that the noise of all rewards has bounded variance, i.e. $\operatorname{\mathbb{E}}\left[\epsilon^2_{s,i} \mid \mathcal{F}_{t-1}\right] \leq 1$ for all $s \in [t]$ and $i \in [k]$. Then we have

\[
\mathrm{Pr}\left(\|\theta - \widetilde{\theta}_t^{\text{MoM}}\|^2_{V_t} \leq 9\left(\sqrt{9d} + \lambda\|\theta\|_2\right)^2\right) \geq 1 - \exp\left(-\frac{k}{24}\right). \tag{56}
\]

We will use a slight modification of the above result with a weighted least squares estimator, like the one used in [16]. The weights are related to a variance estimator of the noise for action $a \in \mathcal{A}$, which at each time step $t$ can be defined in general as

\[
\hat{\sigma}^2_t : \mathcal{H}_{t-1} \times \mathcal{A} \rightarrow \mathbb{R}_{>0}, \tag{57}
\]

where $\mathcal{H}_{t-1} = \{r_{s,i}\}_{(s,i) \in [t-1] \times [k]} \cup \{a_s\}_{s \in [t-1]}$ contains the past rewards and actions played. For our purposes we will only use the information of the past actions, and to simplify notation we write $\hat{\sigma}^2_t(a)$ for an estimator of the variance of the reward associated with action $a \in \mathcal{A}$, given the information collected up to time step $t-1$. The corresponding weighted versions with $k$ subsamples are defined as

\[
\widetilde{\theta}_{t,i} = V_t^{-1}\sum_{s=1}^{t} \frac{1}{\hat{\sigma}^2_s(a_s)}\, r_{s,i}\, a_s \quad \text{for } i \in [k], \tag{58}
\]

with the weighted design matrix

\[
V_t = \lambda\mathbb{I} + \sum_{s=1}^{t} \frac{1}{\hat{\sigma}^2_s(a_s)}\, a_s a_s^{\mathsf{T}}. \tag{59}
\]

The weighted version of the median of means estimator is defined analogously to (54), using the weighted versions (58) and (59); we denote it by $\widetilde{\theta}_t^{\text{wMoM}}$. In the analysis of our algorithm we will use the following analogous concentration bound, under the condition that the estimators $\hat{\sigma}^2_t$ overestimate the true variance.
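The weighted estimator only rescales each data row by $1/\hat{\sigma}^2_s(a_s)$. A minimal NumPy sketch of (58)–(59) combined with the same median-of-means selection rule (our own naming, for illustration only):

```python
import numpy as np

def weighted_momlse(actions, rewards, sigma2, lam=1.0):
    """Weighted MOMLSE: (58)-(59) plus the selection rule of (54)-(55).

    actions: (t, d) played actions; rewards: (t, k) subsampled rewards;
    sigma2:  (t,) variance estimates hat{sigma}^2_s(a_s), one per step.
    """
    t, d = actions.shape
    k = rewards.shape[1]
    w = 1.0 / np.asarray(sigma2)                                       # per-step weights
    V = lam * np.eye(d) + (w[:, None] * actions).T @ actions           # (59)
    thetas = np.linalg.solve(V, (w[:, None] * actions).T @ rewards).T  # (58)

    def vt_norm(x):
        return np.sqrt(x @ V @ x)

    y = [np.median([vt_norm(thetas[j] - thetas[i])
                    for i in range(k) if i != j])
         for j in range(k)]
    return thetas[int(np.argmin(y))]
```

Setting every entry of `sigma2` to one recovers the unweighted estimator of (52)–(53).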

Corollary 9.

Let $\widetilde{\theta}_t^{\textup{wMoM}}$ be the weighted version of the MOMLSE with $k$ subsamples, rewards $\{r_{s,i}\}_{(s,i) \in [t] \times [k]}$ with corresponding actions $\{a_s\}_{s \in [t]}$, and variance estimator $\hat{\sigma}^2_t$. Define the following event:

\[
G_t := \left\{\big(\mathcal{H}_{t-1}, a_t\big) : \operatorname{\mathbb{V}}[\epsilon_{s,i}] \leq \hat{\sigma}^2_s(a_s)\ \ \forall (s,i) \in [t] \times [k]\right\}. \tag{60}
\]

Then we have

\[
\mathrm{Pr}\left(\|\theta - \widetilde{\theta}_t^{\textup{wMoM}}\|^2_{V_t} \leq \beta \mid G_t\right) \geq 1 - \exp\left(-\frac{k}{24}\right), \tag{61}
\]

where

\[
\beta := 9\left(\sqrt{9d} + \lambda\|\theta\|_2\right)^2. \tag{62}
\]
Proof.

The result follows from applying Lemma 8 to the sequence of re-normalized rewards $\{r_{s,i}/\hat{\sigma}_s(a_s)\}_{(s,i) \in [t] \times [k]}$ and actions $\{a_s/\hat{\sigma}_s(a_s)\}_{s \in [t]}$. We only need to check that the sequence $\{\epsilon_{s,i}/\hat{\sigma}_s(a_s)\}_{(s,i) \in [t] \times [k]}$ has bounded variance.
Conditioning on the event $G_t$, and using that by definition $\hat{\sigma}^2_s(a_s)$ only depends on the past $s-1$ actions and rewards, the re-normalized noise has bounded variance, since

\[
\operatorname{\mathbb{E}}\left[\left(\frac{\epsilon_{s,i}}{\hat{\sigma}_s(a_s)}\right)^2 \Bigg|\, \mathcal{F}_{t-1}\right] = \frac{1}{\hat{\sigma}^2_s(a_s)}\operatorname{\mathbb{E}}\left[\epsilon^2_{s,i} \mid \mathcal{F}_{t-1}\right] = \frac{\operatorname{\mathbb{V}}[\epsilon_{s,i}]}{\hat{\sigma}^2_s(a_s)} \leq 1. \tag{63}
\]

4.2 Algorithm

The algorithm we design for linear bandits with linearly vanishing variance noise is LinUCB-VVN (LinUCB with vanishing variance noise), stated in Algorithm 1. The algorithm updates the actions in batches of length $2k(d-1)$: in every batch it outputs $2(d-1)$ actions and samples $k$ independent rewards with each action. With a slight abuse of notation we label each batch by $t$. For each batch $t \geq 1$ the actions are updated as

\[
a^{\pm}_{t,i} := \frac{\widetilde{a}^{\pm}_{t,i}}{\|\widetilde{a}^{\pm}_{t,i}\|_2}, \qquad \widetilde{a}^{\pm}_{t,i} = \frac{\widetilde{\theta}^{\text{wMoM}}_{t-1}}{\|\widetilde{\theta}^{\text{wMoM}}_{t-1}\|_2} \pm \frac{1}{\sqrt{\lambda_{\min}(V_{t-1})}}\, v_{t-1,i}, \tag{64}
\]

for $i \in [d-1]$, where $v_{t-1,i}$ is the normalized eigenvector of $V_{t-1}$ with eigenvalue $\lambda_i(V_{t-1})$ and $\widetilde{\theta}^{\text{wMoM}}_{t-1}$ is the weighted MOMLSE of Section 4.1, built from the $k$ sampled rewards of each action. The design matrix $V_t$ is updated as

\[
V_t = V_{t-1} + \omega(V_{t-1}) \sum_{i=1}^{d-1}\left(a^+_{t,i}(a^+_{t,i})^{\mathsf{T}} + a^-_{t,i}(a^-_{t,i})^{\mathsf{T}}\right), \tag{65}
\]

where the weights ω𝜔\omegaitalic_ω and variance estimator are chosen as

\[
\omega(V_{t-1}) := \frac{\sqrt{\lambda_{\max}(V_{t-1})}}{12\sqrt{d-1}\,\beta}, \qquad \hat{\sigma}^2_t(a^{\pm}_{t,i}) := \frac{1}{\omega(V_{t-1})}. \tag{66}
\]

We note that $\hat{\sigma}^2_t(a^{\pm}_{t,i})$ fulfills the definition of a variance estimator (57) from the previous section, since it only depends on the past history $\mathcal{H}_{t-1}$.
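To make the batch update concrete, here is a NumPy sketch of one round of (64)–(66). The choice of which $d-1$ eigenvectors of $V_{t-1}$ to use (we take those with the smallest eigenvalues) and all function names are our own assumptions for illustration, not a definitive implementation of Algorithm 1:

```python
import numpy as np

def batch_update(theta_hat, V, beta):
    """One LinUCB-VVN-style batch: actions via (64), V-update via (65)-(66).

    theta_hat: current weighted MOMLSE estimate, shape (d,)
    V:         current design matrix V_{t-1}, shape (d, d)
    beta:      confidence radius from (62)
    """
    d = V.shape[0]
    evals, evecs = np.linalg.eigh(V)       # eigenvalues in ascending order
    lam_min, lam_max = evals[0], evals[-1]
    center = theta_hat / np.linalg.norm(theta_hat)
    actions = []
    for i in range(d - 1):                 # assumed: d-1 smallest eigendirections
        v = evecs[:, i]
        for sign in (1.0, -1.0):
            a = center + sign * v / np.sqrt(lam_min)   # (64), before normalizing
            actions.append(a / np.linalg.norm(a))
    omega = np.sqrt(lam_max) / (12.0 * np.sqrt(d - 1) * beta)   # (66)
    V_new = V + omega * sum(np.outer(a, a) for a in actions)    # (65)
    sigma2 = 1.0 / omega     # variance estimate assigned to all 2(d-1) actions
    return actions, V_new, sigma2
```

The returned `sigma2` is the weight $1/\omega(V_{t-1})$ that feeds back into the weighted MOMLSE of Section 4.1 at the next batch.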

Require: $\lambda_0\in\mathbb{R}_{>0}$, $k\in\mathbb{N}$, $\omega:\mathrm{P}^{d}_{+}\rightarrow\mathbb{R}_{\geq 0}$
Set initial design matrix $V_0\leftarrow\lambda_0\mathbb{I}_{d\times d}$
Choose initial estimator $\theta_0\in\mathbb{S}^{d-1}$ for $\theta$ at random
for $t=1,2,\cdots$ do
       Optimistic action selection
      for $i=1,2,\cdots,d-1$ do
             Select actions $a^{+}_{t,i}$ and $a^{-}_{t,i}$ according to Eq. (64)
             Sample $k$ independent rewards for each $a^{\pm}_{t,i}$
            for $j=1,\dots,k$ do
                   Receive associated rewards $r^{+}_{t,i,j}$ and $r^{-}_{t,i,j}$
             end for
       end for
       Update variance estimator
      $\hat{\sigma}^{2}_{t}\leftarrow\frac{1}{\omega(V_{t-1}(\lambda_0))}$ for $t\geq 2$, or $\hat{\sigma}^{2}_{t}\leftarrow 1$ for $t=1$
       Update design matrix
      $V_{t}\leftarrow V_{t-1}+\frac{1}{\hat{\sigma}_{t}^{2}}\sum_{i=1}^{d-1}\left(a^{+}_{t,i}(a^{+}_{t,i})^{\mathsf{T}}+a^{-}_{t,i}(a^{-}_{t,i})^{\mathsf{T}}\right)$
       Update LSE for each subsample
      for $j=1,2,\dots,k$ do
             $\widetilde{\theta}_{t,j}^{\text{w}}\leftarrow V_{t}^{-1}\sum_{s=1}^{t}\frac{1}{\hat{\sigma}_{s}^{2}}\sum_{i=1}^{d-1}(a^{+}_{s,i}r^{+}_{s,i,j}+a^{-}_{s,i}r^{-}_{s,i,j})$
       end for
      Compute $\widetilde{\theta}_{t}^{\text{wMOM}}$ using $\{\widetilde{\theta}_{t,j}^{\text{w}}\}_{j=1}^{k}$
end for
Algorithm 1 LinUCB-VVN
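The estimator updates in the last two steps of Algorithm 1 can be illustrated as follows (a minimal NumPy sketch; the coordinate-wise median is used here only as a simple stand-in for the wMOM combination step, whose exact rule is specified elsewhere in the paper, and the toy values are our own):

```python
import numpy as np

def weighted_lse(V_t, actions, rewards, inv_vars):
    """Weighted LSE for one subsample j:
    theta_j = V_t^{-1} sum_s (1/sigma_s^2) sum_i a_{s,i} r_{s,i,j}.
    actions/rewards are lists over rounds; each A stacks that round's actions as rows."""
    acc = np.zeros(V_t.shape[0])
    for A, r, w in zip(actions, rewards, inv_vars):
        acc += w * (A.T @ r)
    return np.linalg.solve(V_t, acc)

def median_of_means(estimates):
    """Coordinate-wise median of the k subsample estimates (stand-in for wMOM)."""
    return np.median(np.asarray(estimates), axis=0)

# Noiseless toy in d = 2: two actions spanning R^2 recover theta exactly.
theta = np.array([0.6, 0.8])
A = np.eye(2)              # actions as rows
r = A @ theta              # rewards <a, theta> without noise
V_t = A.T @ A              # design matrix (lambda_0 = 0 and unit weights in this toy)
est_j = weighted_lse(V_t, [A], [r], [1.0])
est = median_of_means([est_j, est_j, est_j])  # k = 3 identical subsamples
```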

4.3 Regret analysis

In this section we present the regret analysis of Algorithm 1. The analysis is similar to that of LinUCB-VN in [16, Appendix C.1]. We therefore focus on the changes with respect to LinUCB-VN; although we present a complete proof, we refer to [16] for more detailed computations. The main result we use from [16] is a theorem that quantifies the growth of the maximum and minimum eigenvalues of the design matrix $V_t$ (65).

Theorem 10 (Theorem 3 in [16]).

Let $\{c_t\}_{t=0}^{\infty}\subset\mathbb{S}^{d-1}$ be a sequence of normalized vectors and $\omega:\mathrm{P}^{d}_{+}\rightarrow\mathbb{R}_{\geq 0}$ a function such that

\[
\omega(X)\leq C\sqrt{\|X\|_{\infty}}, \tag{67}
\]

for a constant $C>0$ and any $X\in\mathrm{P}^{d}_{+}$. Let $\lambda_0\geq\max\bigl\{2,\sqrt{\tfrac{2}{3(d-1)}}\,2dC+\tfrac{2}{3(d-1)}\bigr\}$, and define a sequence of matrices $\{V_t\}_{t=0}^{\infty}\subset\mathbb{R}^{d\times d}$ as

\[
V_0:=\lambda_0\mathbb{I}_{d\times d},\qquad V_{t+1}:=V_t+\omega(V_t)\sum_{i=1}^{d-1}P_{t,i}, \tag{68}
\]

where

\begin{align}
P_{t,i}&:=a^{+}_{t+1,i}(a^{+}_{t+1,i})^{\mathsf{T}}+a^{-}_{t+1,i}(a^{-}_{t+1,i})^{\mathsf{T}}, \tag{69}\\
a^{\pm}_{t+1,i}&:=\frac{\tilde{a}^{\pm}_{t+1,i}}{\|\tilde{a}^{\pm}_{t+1,i}\|_{2}},\qquad \tilde{a}^{\pm}_{t+1,i}:=c_t\pm\frac{1}{\sqrt{\lambda_{t,1}}}v_{t,i}, \tag{70}
\end{align}

with $\lambda_{t,i}=\lambda_i(V_t)$ the eigenvalues of $V_t$ with corresponding normalized eigenvectors $v_{t,1},\dots,v_{t,d}\in\mathbb{S}^{d-1}$. Then we have

\[
\lambda_{\min}(V_t)\geq\sqrt{\frac{2}{3(d-1)}\lambda_{\max}(V_t)}\quad\text{for all}\quad t\geq 0. \tag{71}
\]
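The update (68)–(70) can be simulated directly. Below is a minimal NumPy sketch; note that taking $\lambda_{t,1}$ to be the largest eigenvalue and using the remaining $d-1$ eigenvectors as the exploration directions $v_{t,i}$ is an indexing assumption on our part, so we verify only invariants that hold regardless: the actions are normalized, and both extreme eigenvalues are non-decreasing (a nonnegative multiple of a PSD matrix is added).

```python
import numpy as np

def theorem10_step(V, c, omega_fn):
    """One update V -> V + omega(V) * sum_i P_{t,i} following Eqs. (68)-(70).
    Indexing assumption: lambda_{t,1} is the top eigenvalue of V and the
    directions v_{t,i} are the other d-1 normalized eigenvectors."""
    lam, vecs = np.linalg.eigh(V)            # ascending eigenvalues
    lam_top, v_rest = lam[-1], vecs[:, :-1]  # top eigenvalue; remaining eigenvectors
    P_sum = np.zeros_like(V)
    for i in range(V.shape[0] - 1):
        for sign in (+1.0, -1.0):
            a = c + sign * v_rest[:, i] / np.sqrt(lam_top)   # Eq. (70)
            a = a / np.linalg.norm(a)                        # normalize
            P_sum += np.outer(a, a)                          # Eq. (69)
    return V + omega_fn(V) * P_sum

# Toy run: d = 3, V_0 = 2*I (lambda_0 = 2), c_t a fixed unit vector, C = 1.
omega_fn = lambda V: np.sqrt(np.linalg.eigvalsh(V)[-1])
V0 = 2.0 * np.eye(3)
V1 = theorem10_step(V0, np.array([1.0, 0.0, 0.0]), omega_fn)
```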

For the proof of the above theorem we refer to the original reference. Using this theorem and the concentration bound for the MOMLSE given in Corollary 9, we can provide the following regret analysis for a stochastic linear bandit with vanishing-variance noise.

Theorem 11.

Let $d\geq 2$, $k\in\mathbb{N}$ and $T=2(d-1)k\widetilde{T}$ for some $\widetilde{T}\in\mathbb{N}$, $\widetilde{T}\geq 2$. Let $\omega(X)$ be defined as in (66) with $\lambda_0$ satisfying the constraints of Theorem 10. Then, if we apply LinUCB-VVN (Algorithm 1) with parameters $(\lambda_0,k,\omega)$ to a $d$-dimensional stochastic linear bandit with variance as in (48), with probability at least $(1-\exp(-k/24))^{\widetilde{T}}$ the regret satisfies

\begin{align}
\textup{Regret}(T)\leq\ &4k(d-1)+144d(d-1)k\beta^{2}\log\left(\frac{T}{2(d-1)k}\right) \tag{72}\\
&+24(d-1)^{\frac{3}{2}}k\beta\log\left(\frac{T}{2(d-1)k}\right), \tag{73}
\end{align}

and at each time step $t\in[T]$, with the same probability, it can output an estimator $\hat{\theta}_t\in\mathbb{S}^{d-1}$ such that

\[
\|\theta-\hat{\theta}_t\|_2^2\leq\frac{576d^{2}\beta^{2}k+96d\sqrt{d-1}\,\beta k}{t}, \tag{74}
\]

with $\beta$ defined as in (62).

From the above theorem we have that if we set $k=\bigl\lceil 24\log\bigl(\frac{\widetilde{T}}{\delta}\bigr)\bigr\rceil$ for some $\delta\in(0,1)$, then with probability at least $1-\delta$ LinUCB-VVN achieves

\[
\textup{Regret}(T)=O\left(d^{4}\log^{2}(T)\right),\qquad \|\theta-\hat{\theta}_t\|_2^2=O\left(\frac{\log(T)}{t}\right). \tag{75}
\]
Proof.

From the expression of the regret (49), to give an upper bound it suffices to bound the distance between the unknown parameter $\theta$ and the actions $a^{\pm}_{t,i}$ selected by the algorithm (64). We use the index $\tilde{t}\in[\widetilde{T}]$ to run over the batches at which the algorithm updates the MoM estimator $\widetilde{\theta}_{t}^{\text{wMOM}}$. First we carry out the computation assuming that the event

\[
E_{\tilde{t}}:=\{\mathcal{H}_{\tilde{t}}:\forall s\in[\tilde{t}],\ \theta\in\mathcal{C}_{s}\}, \tag{76}
\]

holds, where $\mathcal{C}_{s}=\{\theta'\in\mathbb{R}^{d}:\|\theta'-\widetilde{\theta}^{\text{wMOM}}_{\tilde{t}}\|^{2}_{V_{s}}\leq\beta\}$. Here the history $\mathcal{H}_{\tilde{t}}$ consists of the previous outcomes and actions of our algorithm, i.e.

\[
\mathcal{H}_{\tilde{t}}:=\left(r^{+}_{s,i,j},a^{+}_{s,i},r^{-}_{s,i,j},a^{-}_{s,i}\right)_{(s,i,j)\in[\tilde{t}]\times[d-1]\times[k]}. \tag{77}
\]

Later we will quantify the probability that this event always holds. Using the definition of the actions (64), the fact that $\theta,\widetilde{\theta}^{\text{wMOM}}_{\tilde{t}}\in\mathbb{S}^{d-1}$, and the arguments from [16, Appendix C.1, Eq. (165)], we have that

\[
\|\theta-a^{\pm}_{\tilde{t},i}\|_2^2\leq\frac{9\beta}{\lambda_{\min}(V_{\tilde{t}-1})}. \tag{78}
\]

Then, using that the design matrix $V_{\tilde{t}}$ (65) is updated as in Theorem 10 together with the choice of weights (66), we fix

\[
\lambda_0\geq\max\left\{2,\ 2\sqrt{\frac{2}{3(d-1)}}\frac{d}{12\sqrt{d-1}\,\beta}+\frac{2}{3(d-1)}\right\} \tag{79}
\]

and, applying Theorem 10, we have that $\lambda_{\min}(V_{\tilde{t}})\geq\sqrt{\frac{2}{3(d-1)}\lambda_{\max}(V_{\tilde{t}})}$. Inserting this into the above we obtain

\[
\|\theta-a^{\pm}_{\tilde{t},i}\|_2^2\leq\frac{12\sqrt{d-1}\,\beta}{\sqrt{\lambda_{\max}(V_{\tilde{t}})}}. \tag{80}
\]
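For completeness, the constant $12$ in (80) can be traced explicitly (a short sketch on our part, applying the eigenvalue bound (71) to the relevant design matrix $V$):

```latex
\frac{9\beta}{\lambda_{\min}(V)}
\;\le\; \frac{9\beta}{\sqrt{\tfrac{2}{3(d-1)}\lambda_{\max}(V)}}
\;=\; 9\sqrt{\tfrac{3}{2}}\,\frac{\sqrt{d-1}\,\beta}{\sqrt{\lambda_{\max}(V)}}
\;\le\; \frac{12\sqrt{d-1}\,\beta}{\sqrt{\lambda_{\max}(V)}},
\qquad\text{since } 9\sqrt{3/2}\approx 11.02\le 12.
```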

Thus, it remains to provide a lower bound on $\lambda_{\max}(V_{\tilde{t}})$. We note that in [16, Appendix C.1] an upper bound was also required, but only because the constant $\beta$ used there depends on $t$. From the definition of $V_t$ (65) we can bound the trace as

\begin{align}
\mathrm{Tr}(V_{\tilde{t}})&\geq\sum_{s=2}^{\tilde{t}}2(d-1)\,\omega(V_{s-1}) \tag{81}\\
&=\frac{\sqrt{d-1}}{6\beta}\sum_{s=1}^{\tilde{t}-1}\sqrt{\lambda_{\max}(V_s)}. \tag{82}
\end{align}

Then, using the bound $\lambda_{\max}(V_{\tilde{t}})\geq\mathrm{Tr}(V_{\tilde{t}})/d$ and some algebra, we arrive at

\[
\lambda_{\max}(V_{\tilde{t}})\geq\frac{1}{1+6\frac{d}{\sqrt{d-1}}\beta}\sum_{s=1}^{\tilde{t}}\sqrt{\lambda_{\max}(V_s)}. \tag{83}
\]
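The omitted algebra is short (a sketch on our part): the trace bound together with $\lambda_{\max}(V_{\tilde{t}})\geq\mathrm{Tr}(V_{\tilde{t}})/d$ gives

```latex
\lambda_{\max}(V_{\tilde t}) \;\ge\; \frac{\mathrm{Tr}(V_{\tilde t})}{d}
\;\ge\; \frac{\sqrt{d-1}}{6d\beta}\sum_{s=1}^{\tilde t-1}\sqrt{\lambda_{\max}(V_s)},
\qquad
\lambda_{\max}(V_{\tilde t}) \;\ge\; \sqrt{\lambda_{\max}(V_{\tilde t})}
\quad\bigl(\text{since } \lambda_{\max}(V_{\tilde t})\ge\lambda_0\ge 2\bigr),
```

and adding the second inequality to the first one (rescaled by $\tfrac{6d\beta}{\sqrt{d-1}}$) yields $\bigl(1+\tfrac{6d\beta}{\sqrt{d-1}}\bigr)\lambda_{\max}(V_{\tilde t})\ge\sum_{s=1}^{\tilde t}\sqrt{\lambda_{\max}(V_s)}$.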

Now we have an inequality with the function $\lambda_{\max}(V_s)$ on both sides. To solve it we use the technique from [16, Appendix C.1, Eqs. (197)–(208)], which consists of extending $\lambda_{\max}(V_{\tilde{t}})$ to continuous time by linear interpolation and transforming the sum into an integral, which leads to a differential inequality. Solving this yields

\[
\lambda_{\max}(V_{\tilde{t}})\geq\frac{\tilde{t}^{2}}{4\bigl(1+6\frac{d}{\sqrt{d-1}}\beta\bigr)^{2}}. \tag{84}
\]
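The continuous-relaxation step can be made explicit (an informal sketch of the argument in [16], writing $K:=1+6\frac{d}{\sqrt{d-1}}\beta$ and $f$ for a linear interpolation of $\tilde t\mapsto\lambda_{\max}(V_{\tilde t})$):

```latex
g(t) := \int_0^{t}\sqrt{f(s)}\,\mathrm{d}s
\quad\Longrightarrow\quad
g'(t) = \sqrt{f(t)} \;\ge\; \sqrt{\frac{g(t)}{K}}
\quad\Longrightarrow\quad
\frac{g'(t)}{\sqrt{g(t)}} \;\ge\; \frac{1}{\sqrt{K}}.
```

Integrating gives $2\sqrt{g(t)}\ge t/\sqrt{K}$, i.e. $g(t)\ge t^{2}/(4K)$, and hence $f(t)\ge g(t)/K\ge t^{2}/(4K^{2})$, which matches (84).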

Inserting the above into (80), we obtain

\begin{align}
\|\theta-a^{\pm}_{\tilde{t},i}\|_2^2&\leq\frac{24\sqrt{d-1}\,\beta\bigl(1+6\frac{d}{\sqrt{d-1}}\beta\bigr)}{\tilde{t}-1} \tag{85}\\
&=\frac{144d\beta^{2}+24\sqrt{d-1}\,\beta}{\tilde{t}-1}. \tag{86}
\end{align}

Thus, inserting the above bound into the regret expression (49), we have

\begin{align}
\textup{Regret}(T)&=\frac{1}{2}\sum_{t=1}^{T}\|\theta-a_t\|_2^2 \tag{87}\\
&=\frac{1}{2}\sum_{\tilde{t}=1}^{\widetilde{T}}\sum_{i=1}^{d-1}\sum_{j=1}^{k}\left(\|\theta-a^{+}_{\tilde{t},i}\|_2^2+\|\theta-a^{-}_{\tilde{t},i}\|_2^2\right) \tag{88}\\
&\leq 4k(d-1)+\frac{1}{2}\sum_{\tilde{t}=2}^{\widetilde{T}}\sum_{i=1}^{d-1}\sum_{j=1}^{k}\left(\|\theta-a^{+}_{\tilde{t},i}\|_2^2+\|\theta-a^{-}_{\tilde{t},i}\|_2^2\right) \tag{89}\\
&\leq 4k(d-1)+\left(144d(d-1)k\beta^{2}+24(d-1)^{\frac{3}{2}}k\beta\right)\sum_{\tilde{t}=2}^{\widetilde{T}}\frac{1}{\tilde{t}-1} \tag{90}\\
&\leq 4k(d-1)+144d(d-1)k\beta^{2}\log\widetilde{T}+24(d-1)^{\frac{3}{2}}k\beta\log\widetilde{T} \tag{91}\\
&=4k(d-1)+144d(d-1)k\beta^{2}\log\left(\frac{T}{2(d-1)k}\right)\notag\\
&\quad+24(d-1)^{\frac{3}{2}}k\beta\log\left(\frac{T}{2(d-1)k}\right). \tag{92}
\end{align}

It remains to quantify the probability that the event $E_{\tilde{t}}$ holds. For that we use the concentration bounds for the median-of-means least squares estimator stated in Corollary 9. From the variance condition of our model (48) we have

\[
\operatorname{\mathbb{V}}[\epsilon^{\pm}_{\tilde{t},i,j}|\mathcal{F}_{\tilde{t}-1}]\leq 1-\langle\theta,a^{\pm}_{\tilde{t},i}\rangle^{2}\leq 2\left(1-\langle\theta,a^{\pm}_{\tilde{t},i}\rangle\right)=\|\theta-a^{\pm}_{\tilde{t},i}\|_{2}^{2}, \tag{93}
\]

where we used $1+\langle\theta,a^{\pm}_{\tilde{t},i}\rangle\leq 2$. Thus, from our choice of weights (66) and (85), we have that

\[
\theta\in\mathcal{C}_{s-1}\;\Rightarrow\;\operatorname{\mathbb{V}}[\epsilon^{\pm}_{\tilde{t},i,j}|\mathcal{F}_{\tilde{t}-1}]\leq\hat{\sigma}_{s}^{2}(a^{\pm}_{s,i}). \tag{94}
\]
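As an illustrative numerical check (not part of the paper), the chain of inequalities in (93) can be verified directly: for unit vectors $\theta,a$ with $x=\langle\theta,a\rangle\in[-1,1]$ we have $1-x^{2}=(1-x)(1+x)\leq 2(1-x)=\|\theta-a\|_{2}^{2}$.

```python
import numpy as np

# Sanity check of (93): for unit vectors theta, a with x = <theta, a>,
#   1 - x^2 = (1 - x)(1 + x) <= 2(1 - x) = ||theta - a||_2^2.
rng = np.random.default_rng(3)

def random_unit(d: int) -> np.ndarray:
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

for _ in range(1000):
    theta, a = random_unit(5), random_unit(5)
    x = theta @ a
    assert 1 - x**2 <= 2 * (1 - x) + 1e-12       # since 1 + x <= 2
    assert np.isclose(2 * (1 - x), np.sum((theta - a) ** 2))
```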

Then, in order to apply Corollary 9, we note that from the choice $\hat{\sigma}_{s}^{2}(a^{\pm}_{1,i})=1$ the event $G_{\tilde{t}}$ at $\tilde{t}=1$ is always satisfied, i.e., $\Pr(G_{1})=1$. Applying Bayes' theorem, a union bound over the events $G_{1},E_{1},\ldots,G_{\tilde{t}-1},E_{\tilde{t}}$, and Corollary 9, we obtain

\[
\Pr(E_{\widetilde{T}}\cap G_{\widetilde{T}})\geq\left(1-\exp(-k/24)\right)^{\widetilde{T}}. \tag{95}
\]

This also quantifies the probability that (85) holds, since the only assumption we used is $\theta\in\mathcal{C}_{\tilde{t}-1}$. We can then simply take one of the actions $a^{\pm}_{\tilde{t},i}$ as the estimator $\hat{\theta}_{t}$, and the result follows using the relabeling $t=2(d-1)k\tilde{t}$ and the inequality $1/(\tilde{t}-1)\leq 2/\tilde{t}$ for $\tilde{t}\geq 2$. A more detailed analogous computation of the above probability can be found in [16, Appendix C.1]. ∎

In the previous theorem we did not fix a specific value for the parameter $k$, the number of subsamples per action. The regret scales linearly with $k$, but since the failure probability decays exponentially with $k$, it suffices to set $k\sim\log(T)$ to obtain the $\log^{2}(T)$ behavior in expectation. We formalize this in the following corollary.

Corollary 12.

Under the same assumptions as Theorem 11, we can fix $k=\lceil 24\log(\widetilde{T}^{2})\rceil$, and we have that for all $t\in[T]$,

\begin{align}
\operatorname{\mathbb{E}}\left[\textup{Regret}(T)\right]&\leq 344(d-1)\log(T)+\left(3546d(d-1)\beta^{2}+1152(d-1)^{\frac{3}{2}}\beta\right)\log^{2}(T) \tag{96}\\
\operatorname{\mathbb{E}}\left[\|\theta-\hat{\theta}_{t}\|_{2}^{2}\right]&\leq\frac{27648d^{2}\beta^{2}\log(T)+4608d\sqrt{d-1}\beta\log(T)}{t}+\frac{4(d-1)\log(T)}{T}. \tag{97}
\end{align}

Using that $\beta=O(d)$ gives

\[
\operatorname{\mathbb{E}}\left[\textup{Regret}(T)\right]=O(d^{4}\log^{2}(T)),\qquad\operatorname{\mathbb{E}}\left[\|\theta-\hat{\theta}_{t}\|_{2}^{2}\right]=\tilde{O}\left(\frac{d^{4}}{t}\right). \tag{98}
\]
Proof.

The result of Theorem 11 holds with probability at least $(1-\exp(-k/24))^{\widetilde{T}}$. Setting $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ gives

\[
(1-\exp(-k/24))^{\widetilde{T}}\geq\left(1-\frac{1}{\widetilde{T}^{2}}\right)^{\widetilde{T}}\geq 1-\frac{1}{\widetilde{T}}. \tag{99}
\]

Let $R_{T}$ denote the event that Algorithm 1 achieves the bounds given by Theorem 11. Then the failure probability is bounded by

\[
\Pr(R^{C}_{T})\leq\frac{1}{\widetilde{T}}, \tag{100}
\]

where we used $1=\Pr(R_{T})+\Pr(R^{C}_{T})$. The contribution of the bad event to the expectations can then be bounded as

\begin{align}
\operatorname{\mathbb{E}}\left[\textup{Regret}(T)\,\mathbb{I}\{R^{C}_{T}\}\right]&\leq 4(d-1)k\widetilde{T}\,\Pr(R^{C}_{T})\leq 4(d-1)k \tag{101}\\
\operatorname{\mathbb{E}}\left[\|\theta-\hat{\theta}_{t}\|_{2}^{2}\,\mathbb{I}\{R^{C}_{T}\}\right]&\leq 4\Pr(R^{C}_{T})\leq\frac{4}{\widetilde{T}}, \tag{102}
\end{align}

where we used $\textup{Regret}(T)\leq 2T=4(d-1)k\widetilde{T}$ and $\|\theta-\hat{\theta}_{t}\|_{2}^{2}\leq 4$. Finally, the result follows by inserting the value $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ into the bounds of Theorem 11 and using $\widetilde{T}\leq T$. ∎
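As a quick numerical sanity check (illustrative, not from the paper) of this choice of $k$: with $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ we get $\exp(-k/24)\leq 1/\widetilde{T}^{2}$, and the success probability indeed dominates $1-1/\widetilde{T}$ as claimed in (99).

```python
import math

def success_prob_lower_bound(T_tilde: int) -> tuple[float, float]:
    """Success probability (1 - exp(-k/24))^T_tilde for k = ceil(24*log(T_tilde^2)),
    together with the target lower bound 1 - 1/T_tilde."""
    k = math.ceil(24 * math.log(T_tilde**2))
    prob = (1 - math.exp(-k / 24)) ** T_tilde
    return prob, 1 - 1 / T_tilde

for T_tilde in [10, 100, 10_000]:
    prob, target = success_prob_lower_bound(T_tilde)
    assert prob >= target  # Bernoulli: (1 - 1/x^2)^x >= 1 - 1/x
```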

5 Algorithm for qubit PSMAQB and numerical experiments

In this section we prove our main result, a regret bound for the LinUCB-VNN algorithm applied to the qubit PSMAQB problem.

[Figure 2: main plot of Regret($T$) versus $T$; inset plot of $\log(1-F(\Pi,\Pi_{t}))$ versus $\log(t/\log(t))$.]
Figure 2: Expected regret versus the number of rounds $T$ for the LinUCB-VNN algorithm. We run $T=4\cdot 10^{4}$ rounds with $k=10$ subsamples for the median-of-means construction. We average over $100$ independent experiments. We obtain results for each round but plot only a few (red crosses) for clarity. We fit the regression $\textup{Regret}(T)=m_{1}\log^{2}T+b_{1}$ with $m_{1}=3.2164\pm 0.0009$ and $b_{1}=0.84\pm 0.016$. In the inset we plot the expected infidelity of the output estimator at each round $t\in[T]$ versus the number of rounds $t$. We take $\Pi_{t}=\Pi_{\theta^{\text{wMoM}}_{t}}$ as the estimator given by the median-of-means linear least squares estimator. We fit the regression $1-F(\Pi,\Pi_{t})=b_{2}\left(\frac{t}{\log t}\right)^{m_{2}}$ and obtain $m_{2}=-0.996\pm 0.002$ and $b_{2}=0.112\pm 0.007$. We note that the number of subsamples required by the theoretical results is very conservative compared with the value used in the simulations.
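The regression in Figure 2 is an ordinary linear least squares fit in the feature $\log^{2}T$. The sketch below (illustrative only, using synthetic data generated from the reported fit values rather than the paper's simulation output) shows how such a fit can be computed:

```python
import numpy as np

# Illustrative sketch: fit Regret(T) = m1*log^2(T) + b1 by linear least
# squares. Synthetic data generated around the reported values m1 = 3.2164,
# b1 = 0.84 with small Gaussian noise, standing in for the simulated regret.
rng = np.random.default_rng(0)
T_vals = np.arange(100, 40_001, 100)
regret = 3.2164 * np.log(T_vals) ** 2 + 0.84 + rng.normal(0, 0.05, T_vals.size)

# Design matrix [log^2 T, 1]; solve for slope and intercept.
X = np.column_stack([np.log(T_vals) ** 2, np.ones_like(T_vals, dtype=float)])
(m1, b1), *_ = np.linalg.lstsq(X, regret, rcond=None)
print(m1, b1)  # recovers values close to 3.2164 and 0.84
```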
Theorem 13.

Let $\widetilde{T}\in\mathbb{N}$ and fix $T=\lceil 96\widetilde{T}\log(\widetilde{T}^{2})\rceil$. Then, given a PSMAQB with action set $\mathcal{A}=\mathcal{S}^{*}_{2}$ and environment $\Pi_{\theta}\in\mathcal{S}^{*}_{2}$ (qubits), we can apply Algorithm 1 with $d=3$, and it achieves

\[
\operatorname{\mathbb{E}}\left[\textup{Regret}(T)\right]\leq C_{1}\log(T)+C_{2}\log^{2}(T) \tag{103}
\]

for some universal constants $C_{1},C_{2}\geq 0$. Moreover, at each time step $t\in[T]$ it outputs an estimator $\hat{\Pi}_{t}\in\mathcal{S}^{*}_{2}$ of $\Pi_{\theta}$ with infidelity scaling as

\[
\operatorname{\mathbb{E}}\left[1-F\left(\Pi_{\theta},\hat{\Pi}_{t}\right)\right]\leq\frac{C_{3}\log(T)}{t}, \tag{104}
\]

for some universal constant $C_{3}\geq 0$.

Proof.

In order to apply Algorithm 1 to a PSMAQB we set $d=3$ (the dimension of the corresponding classical linear stochastic bandit), and the selected actions are given by $\Pi_{a^{\pm}_{t,i}}$, where the $a^{\pm}_{t,i}$ are updated as in (64). Note that these are valid actions, since $a^{\pm}_{t,i}\in\mathbb{S}^{2}$ implies $\Pi_{a^{\pm}_{t,i}}\in\mathcal{S}^{*}_{2}$. The rewards received by the algorithm follow (3.4) with the normalization given in (44). This model fits the linear bandit model with linearly vanishing variance noise explained in Section 3.5, so we can apply the guarantees established in Theorem 11 and Corollary 12.

The algorithm is run with $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ batches for the median-of-means construction. We set $\lambda_{0}=2$, and using $\|\theta\|_{2}=1$ the constant $\beta$ given in (62) takes the value

\[
\beta=9\left(3\sqrt{3}+2\right)^{2}=279+108\sqrt{3}. \tag{105}
\]

Then we can check that for $d=3$ the condition (79) on the input parameter $\lambda_{0}$ required for Theorem 11 is satisfied, since

\[
\lambda_{0}=2\geq\max\left\{2,\frac{1}{3}+\frac{1}{2\sqrt{6}(279+108\sqrt{3})}\right\}=2. \tag{106}
\]
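The two numerical claims above are easy to verify directly; the following sketch (illustrative, not part of the paper) checks the value of $\beta$ in (105) and the condition (106):

```python
import math

# Check (105): beta = 9*(3*sqrt(3) + 2)^2 = 279 + 108*sqrt(3) ≈ 466.06.
beta = 9 * (3 * math.sqrt(3) + 2) ** 2
assert math.isclose(beta, 279 + 108 * math.sqrt(3))

# Check (106): the second term of the max, 1/3 + 1/(2*sqrt(6)*beta) ≈ 0.334,
# is dominated by 2, so lambda_0 = 2 satisfies the condition.
second_term = 1 / 3 + 1 / (2 * math.sqrt(6) * beta)
lambda_0 = 2.0
assert lambda_0 >= max(2.0, second_term)
```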

In the above we simply substituted all numerical values. We are therefore under the assumptions of Theorem 11 and Corollary 12, and the result follows by applying both results together with the relation (47) between the regrets of the classical and quantum models, the relation

\[
\|\theta-\hat{\theta}_{t}\|_{2}^{2}=4\left(1-F\left(\Pi_{\theta},\Pi_{\hat{\theta}_{t}}\right)\right), \tag{107}
\]

and substituting all numerical values. We take the estimator $\hat{\theta}_{t}$ given in Theorem 11 for $d=3$. We also use the bound $\widetilde{T}\leq T$ and absorb all constants into $C_{1},C_{2},C_{3}$. ∎
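The identity (107) follows from the Bloch-sphere representation: for pure qubit states with unit Bloch vectors $u,v$, the fidelity is $F=(1+\langle u,v\rangle)/2$, so $\|u-v\|_{2}^{2}=2-2\langle u,v\rangle=4(1-F)$. A numerical check (illustrative only):

```python
import numpy as np

# Check of the Bloch-sphere identity (107): for pure qubit states with unit
# Bloch vectors u, v, fidelity F = (1 + <u, v>)/2, hence
# ||u - v||^2 = 2 - 2<u, v> = 4(1 - F).
rng = np.random.default_rng(1)

def random_bloch_vector() -> np.ndarray:
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

for _ in range(1000):
    u, v = random_bloch_vector(), random_bloch_vector()
    fidelity = (1 + u @ v) / 2
    assert np.isclose(np.sum((u - v) ** 2), 4 * (1 - fidelity))
```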

Remark 1. The constant dependence can be slightly improved by taking the estimator for $\Pi_{\theta}$ to be $\Pi_{\theta^{\text{wMoM}}_{t}}$, with $\theta^{\text{wMoM}}_{t}$ defined in (64).

Remark 2. The result of Theorem 13 also holds with high probability: for the choice of $k=\lceil 24\log(\widetilde{T}^{2})\rceil$ batches, it holds with probability at least $1-\frac{1}{\widetilde{T}}$.

6 Regret lower bound for PSMAQB

While the algorithm for PSMAQB presented above is inspired by classical bandit theory, the lower bound on the regret that we derive is essentially based on quantum information theory. The key insight here is that a policy for PSMAQB can be viewed as a sequence of state tomographies. The expected fidelity of these tomographies is linked to the regret. Hence, existing upper bounds on tomography fidelity also provide a lower bound for the expected regret of the policy.

6.1 Average fidelity bound for pure state tomography

In its most general form, a tomography procedure takes $n$ copies of an unknown state $\Pi\in\mathcal{S}_{d}^{*}$ and performs a joint measurement on the state $\Pi^{\otimes n}$. This is captured in the following definition. Let $(\mathcal{S}_{d}^{*},\Sigma)$ be a measurable space. A tomography scheme is a positive operator-valued measure (POVM) $\mathcal{T}:\Sigma\to\operatorname{End}(\mathcal{H}^{\otimes n})$ such that $\mathcal{T}(\mathcal{S}_{d}^{*})=\Pi^{+}_{n}$, where $\Pi^{+}_{n}$ is the symmetrization operator on $\mathcal{H}^{\otimes n}$. For any $\rho\in\operatorname{End}(\mathcal{H}^{\otimes n})$, this POVM gives rise to a complex-valued measure

\[
P_{\mathcal{T},\rho}(A)=\operatorname{Tr}(\mathcal{T}(A)\rho) \tag{108}
\]

for $A\in\Sigma$. $P_{\mathcal{T},\rho}$ becomes a probability measure if $\rho$ satisfies $\rho\geq 0$, $\Pi^{+}_{n}\rho=\rho\Pi^{+}_{n}=\rho$, and $\operatorname{Tr}\rho=1$. Given $n$ copies of $\Pi$, the tomography scheme produces the distribution $P_{\mathcal{T},\Pi^{\otimes n}}$ of the predicted states. Note that $\Pi^{\otimes n}$ satisfies the properties above, so $P_{\mathcal{T},\Pi^{\otimes n}}$ is indeed a probability distribution. The expected fidelity of the predicted state is given by

\[
F(\mathcal{T},\Pi)=\int\operatorname{Tr}(\Pi\sigma)\,dP_{\mathcal{T},\Pi^{\otimes n}}(\sigma). \tag{109}
\]

Finally, the average fidelity of the tomography scheme is defined as

\[
F(\mathcal{T})=\int F(\mathcal{T},|\psi\rangle\langle\psi|)\,d\psi, \tag{110}
\]

where the integration is taken with respect to the normalized uniform measure over all pure states. In the following, $\int d\psi$ will always denote this measure. We will provide an upper bound on $F(\mathcal{T})$ in terms of $d$ and $n$, following the proof technique from [10]. In [10], the proof is only presented for tomography schemes producing a finite number of predictions; for our definition we require more general measure-theoretic tools. Before we introduce the upper bound on the fidelity, we prove some auxiliary lemmas about the nature of the measure $P_{\mathcal{T},\rho}$.
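To build intuition for the average (110), consider the trivial "scheme" that always predicts a fixed pure state: averaged over Haar-random pure states, the overlap $|\langle 0|\psi\rangle|^{2}$ equals $1/d$. The following Monte Carlo sketch (illustrative, not part of the paper) estimates this average by sampling uniform pure states as normalized complex Gaussian vectors:

```python
import numpy as np

# Monte Carlo estimate of the average fidelity (110) for the trivial scheme
# that always outputs |0><0|; over Haar-random pure states the average
# overlap |<0|psi>|^2 is 1/d.
rng = np.random.default_rng(2)

def haar_random_state(d: int) -> np.ndarray:
    """Sample a pure state uniformly: normalize a complex Gaussian vector."""
    psi = rng.normal(size=d) + 1j * rng.normal(size=d)
    return psi / np.linalg.norm(psi)

d, n_samples = 4, 200_000
avg_fidelity = np.mean([abs(haar_random_state(d)[0]) ** 2 for _ in range(n_samples)])
print(avg_fidelity)  # close to 1/d = 0.25
```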

Lemma 14.

Let $(\Omega,\Sigma)$ be a measurable space, and let $O:\Sigma\to\operatorname{End}(\widetilde{\mathcal{H}})$ be a POVM with values acting on a finite-dimensional Hilbert space $\widetilde{\mathcal{H}}$ with $\operatorname{dim}\widetilde{\mathcal{H}}=\tilde{d}$, such that $O(\Omega)\leq\mathbbm{1}$, where $\mathbbm{1}$ is the identity operator. Further, let $P_{O,\sigma}:\Sigma\to\mathbb{C}$ be the complex-valued measure defined for any $\sigma\in\operatorname{End}(\widetilde{\mathcal{H}})$ as

\[
P_{O,\sigma}(A)=\operatorname{Tr}[O(A)\sigma]. \tag{111}
\]

Then, there exists a set of functions $\{f_{\sigma}\}$ indexed by $\sigma\in\operatorname{End}(\widetilde{\mathcal{H}})$ that are linear in $\sigma$ for every $\omega$ and that satisfy

\[
f_{\sigma}:\Omega\to\mathbb{C} \quad\text{s.t.}\quad \forall A\in\Sigma\quad P_{O,\sigma}(A)=\int_{A}f_{\sigma}(\omega)\,dP_{O,\mathbbm{1}}(\omega). \tag{112}
\]

We purposefully formulated this lemma with slightly more general objects than those used in the definition of tomography: $\Omega$ need not be $\mathcal{S}_{d}^{*}$, and $\widetilde{\mathcal{H}}$ need not be the $n$-fold tensor power $\mathcal{H}^{\otimes n}$, although we will focus on this case.

Proof.

Let $\{|i\rangle\}_{i=1}^{\tilde{d}}$ be a basis of $\widetilde{\mathcal{H}}$. We first show that $P_{O,\sigma}$ is dominated by $P_{O,\mathbbm{1}}$ for all $\sigma$. Indeed, let $A\in\Sigma$ and assume that $P_{O,\mathbbm{1}}(A)=0$. This gives us

\[
\operatorname{Tr}[O(A)\mathbbm{1}]=\operatorname{Tr}[O(A)]=0, \tag{113}
\]

and, because $O(A)\geq 0$, we also have $O(A)=0$. Therefore,

\[
P_{O,\sigma}(A)=\operatorname{Tr}[O(A)\sigma]=0. \tag{114}
\]

Hence, for any $|i\rangle,|j\rangle$ from the basis we can introduce the Radon–Nikodym derivative $f_{|i\rangle\langle j|}$, which satisfies (112). Then, for any $\sigma\in\operatorname{End}(\widetilde{\mathcal{H}})$ we can define

\[
f_{\sigma}(\omega)=\sum_{i,j=1}^{\tilde{d}}\langle i|\sigma|j\rangle\, f_{|i\rangle\langle j|}(\omega). \tag{115}
\]

These $f_{\sigma}$ are linear in $\sigma$ by definition. A direct calculation shows that they also satisfy (112). ∎

Note that for $\sigma\geq 0$, the measure $P_{O,\sigma}$ is finite and nonnegative, but nonnegativity (and even real-valuedness) need not hold for a general $\sigma\in\operatorname{End}(\widetilde{\mathcal{H}})$.

By construction, $f_{\sigma}(\omega)$ can be written as

\[
f_{\sigma}(\omega)=\operatorname{Tr}\left[K(\omega)\sigma\right], \quad\text{where}\quad K(\omega)=\sum_{i,j=1}^{\tilde{d}} f_{|i\rangle\langle j|}(\omega)\,|j\rangle\!\langle i|. \tag{116}
\]
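For a POVM with finitely many outcomes, the density in (116) has an explicit form: $K(\omega)=O(\{\omega\})/\operatorname{Tr}O(\{\omega\})$ whenever $\operatorname{Tr}O(\{\omega\})>0$. A minimal numerical sketch for a two-outcome qubit POVM (the specific operators below are an illustrative choice, not taken from the paper):

```python
import numpy as np

# A two-outcome qubit POVM: E_0 >= 0, E_1 >= 0, E_0 + E_1 = identity.
E = [np.array([[0.8, 0.1], [0.1, 0.3]]), None]
E[1] = np.eye(2) - E[0]

# Reference measure P_{O,1}({w}) = Tr E_w and density K(w) = E_w / Tr E_w.
K = [E_w / np.trace(E_w) for E_w in E]

# Check (112): for any sigma, P_{O,sigma}({w}) = Tr[K(w) sigma] * P_{O,1}({w}).
rng = np.random.default_rng(0)
sigma = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
for E_w, K_w in zip(E, K):
    lhs = np.trace(E_w @ sigma)                   # P_{O,sigma}({w})
    rhs = np.trace(K_w @ sigma) * np.trace(E_w)   # f_sigma(w) dP_{O,1}({w})
    assert np.isclose(lhs, rhs)
```

Linearity of $f_\sigma$ in $\sigma$ is manifest, and each $K(\omega)$ here is positive semidefinite, consistent with the almost-everywhere positivity established next.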

As the following lemma demonstrates, $K(\omega)\geq 0$ for $P_{O,\mathbbm{1}}$-almost every $\omega$:

Lemma 15.

Let $(\Omega,\Sigma,\mu)$ be a measure space and let $V:\Omega\to\operatorname{End}(\widetilde{\mathcal{H}})$ be a measurable operator-valued function with values acting on a finite-dimensional Hilbert space $\widetilde{\mathcal{H}}$ such that

\[
\forall A\in\Sigma \quad \int_{A}V(\omega)\,d\mu(\omega)\geq 0. \tag{117}
\]

Then, $V(\omega)\geq 0$ $\mu$-almost everywhere.

Proof.

Let $|\psi\rangle\in\widetilde{\mathcal{H}}$ and define

\[
g_{\psi}(\omega)=\langle\psi|V(\omega)|\psi\rangle. \tag{118}
\]

By the given condition, for any $A\in\Sigma$,

\[
\int_{A}g_{\psi}(\omega)\,d\mu(\omega)=\langle\psi|\int_{A}V(\omega)\,d\mu(\omega)|\psi\rangle\geq 0. \tag{119}
\]

It follows that $g_{\psi}(\omega)\geq 0$ $\mu$-almost everywhere. Let

\[
Z_{\psi}=\{\omega\in\Omega \text{ s.t. } g_{\psi}(\omega)<0\}. \tag{120}
\]

We have shown that $\mu(Z_{\psi})=0$. Next, since $\widetilde{\mathcal{H}}$ is finite-dimensional, it is separable. Therefore, there exists a countable set $\{|\psi_{k}\rangle\}_{k}$ dense in $\widetilde{\mathcal{H}}$. Let

\[
Z=\bigcup_{k}Z_{\psi_{k}}. \tag{121}
\]

We have that $\mu(Z)=0$. Finally, let $\omega\in\Omega\setminus Z$ and $|\psi\rangle\in\widetilde{\mathcal{H}}$. Because $\{|\psi_{k}\rangle\}$ is dense in $\widetilde{\mathcal{H}}$, there exists a sequence $\{|\psi_{k_{i}}\rangle\}$ converging to $|\psi\rangle$. Then,

\[
0\leq\langle\psi_{k_{i}}|V(\omega)|\psi_{k_{i}}\rangle \xrightarrow{\;i\to\infty\;} \langle\psi|V(\omega)|\psi\rangle. \tag{122}
\]

Overall, we get that

\[
\forall\,\omega\in\Omega\setminus Z,\ |\psi\rangle\in\widetilde{\mathcal{H}} \quad \langle\psi|V(\omega)|\psi\rangle\geq 0. \tag{123}
\]

Together with $\mu(Z)=0$, this gives the desired result. ∎
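For a finite $\Omega$ the lemma is immediate: taking $A=\{\omega\}$ in (117) gives $\mu(\{\omega\})\,V(\omega)\geq 0$. The almost-everywhere qualifier is nevertheless essential, as the following toy check illustrates ($\Omega$, $\mu$, and the matrices are hypothetical choices, purely for illustration):

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(1)

def rand_psd(dim=2):
    # Random positive semidefinite matrix A A^dagger.
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    return a @ a.conj().T

def is_psd(m, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh((m + m.conj().T) / 2) >= -tol))

# Omega = {0, 1, 2} with mu({2}) = 0; V is PSD except on the null point 2.
mu = np.array([0.6, 0.4, 0.0])
V = [rand_psd(), rand_psd(), -np.eye(2)]

# Every set-integral as in (117) is PSD ...
subsets = chain.from_iterable(combinations(range(3), r) for r in range(4))
for A in subsets:
    integral = sum((mu[w] * V[w] for w in A), np.zeros((2, 2), dtype=complex))
    assert is_psd(integral)

# ... yet V(2) is not PSD: V >= 0 holds only mu-almost everywhere.
assert not is_psd(V[2])
```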

Now we can apply this analysis to the POVM corresponding to our tomography scheme and obtain the desired upper bound on the fidelity.

Theorem 16.

For any tomography scheme 𝒯𝒯\mathcal{T}caligraphic_T utilizing n𝑛nitalic_n copies of the input state, the average fidelity is bounded by

\[
F(\mathcal{T})\leq\frac{n+1}{n+d}. \tag{124}
\]
Proof.

We introduce the density $K(\omega)$ from (116) for our tomography scheme $\mathcal{T}$ and the corresponding measure $P_{\mathcal{T},\sigma}$. Lemma 14 allows us to introduce, for any $\sigma\in\operatorname{End}(\mathcal{H}^{\otimes n})$, the density $f_{\sigma}:\Omega\to\mathbb{C}$ such that

\[
\forall A\in\Sigma\quad P_{\mathcal{T},\sigma}(A)=\int_{A}f_{\sigma}(\omega)\,dP_{\mathcal{T},\mathbbm{1}}(\omega). \tag{125}
\]

This density can be written as $f_{\sigma}(\omega)=\operatorname{Tr}\left(K(\omega)\sigma\right)$ for some $K(\omega)\in\operatorname{End}(\mathcal{H}^{\otimes n})$. The operator $K(\omega)$ can be regarded as the operator-valued density of $\mathcal{T}$ with respect to $P_{\mathcal{T},\mathbbm{1}}$:

\[
\forall A\in\Sigma\quad \mathcal{T}(A)=\int_{A}K(\omega)\,dP_{\mathcal{T},\mathbbm{1}}(\omega). \tag{126}
\]

Since $\mathcal{T}(A)\geq 0$, Lemma 15 implies that $K(\omega)\geq 0$ for $P_{\mathcal{T},\mathbbm{1}}$-almost all $\omega$. Furthermore, since $\mathcal{T}(\mathcal{S}_{d}^{*})=\Pi^{+}_{n}$, we have $\mathcal{T}(A)\leq\Pi^{+}_{n}$ for all $A\in\Sigma$. Therefore, $\mathcal{T}(A)\Pi^{+}_{n}=\Pi^{+}_{n}\mathcal{T}(A)=\mathcal{T}(A)$, which means that $\tilde{K}(\omega)=\Pi^{+}_{n}K(\omega)\Pi^{+}_{n}$ also satisfies (126). In the following, we assume without loss of generality that

\[
K(\omega)=\Pi^{+}_{n}K(\omega)=K(\omega)\Pi^{+}_{n}. \tag{127}
\]

With these tools at hand, we are ready to adapt the proof from [10] to the general case of POVM tomography schemes. We begin by rewriting the expression (109) for the average fidelity:

\begin{align}
F(\mathcal{T}) &= \int d\psi \int dP_{\mathcal{T},(|\psi\rangle\!\langle\psi|)^{\otimes n}}(\sigma)\,\operatorname{Tr}(\sigma|\psi\rangle\!\langle\psi|) \tag{128}\\
&= \int d\psi \int dP_{\mathcal{T},\mathbbm{1}}(\sigma)\,\operatorname{Tr}(|\psi\rangle\!\langle\psi|\sigma)\operatorname{Tr}\left(K(\sigma)(|\psi\rangle\!\langle\psi|)^{\otimes n}\right). \tag{129}
\end{align}

Since the fidelity is nonnegative and its average is bounded by 1, we can change the order of integration. Following [10], we introduce the notation

\[
\sigma_{n}(k)=\mathbbm{1}^{\otimes(k-1)}\otimes\sigma\otimes\mathbbm{1}^{\otimes(n-k)}\in\operatorname{End}(\mathcal{H}^{\otimes n}). \tag{130}
\]

The product of traces in (129) can be rewritten in the following manner:

\[
F(\mathcal{T})=\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)\int d\psi\ \operatorname{Tr}\left((K(\sigma)\otimes\mathbbm{1})(|\psi\rangle\!\langle\psi|)^{\otimes(n+1)}\sigma_{n+1}(n+1)\right). \tag{131}
\]

We can now evaluate the inner integral in closed form. As shown in [10, Eq. (4)],

\[
\int d\psi\,(|\psi\rangle\!\langle\psi|)^{\otimes n}=\frac{\Pi^{+}_{n}}{D_{n}}, \tag{132}
\]

where $D_{n}=\binom{n+d-1}{d-1}$ is the dimension of the symmetric subspace. Another useful result, shown in [10, Eq. (8)], is

\[
\operatorname{Tr}_{n+1}\left(\Pi^{+}_{n+1}\sigma_{n+1}(n+1)\right)=\frac{1}{n+1}\,\Pi^{+}_{n}\left(\mathbbm{1}+\sum_{k=1}^{n}\sigma_{n}(k)\right), \tag{133}
\]

where $\operatorname{Tr}_{n+1}:\operatorname{End}(\mathcal{H}^{\otimes(n+1)})\to\operatorname{End}(\mathcal{H}^{\otimes n})$ is the partial trace over the $(n+1)$-st copy of the system. These expressions allow us to rewrite (131) as follows:

\begin{align}
F(\mathcal{T}) &= \frac{1}{D_{n+1}}\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)\operatorname{Tr}\left((K(\sigma)\otimes\mathbbm{1})\Pi^{+}_{n+1}\sigma_{n+1}(n+1)\right) \tag{134}\\
&= \frac{1}{D_{n+1}}\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)\operatorname{Tr}\left(K(\sigma)\operatorname{Tr}_{n+1}\left(\Pi^{+}_{n+1}\sigma_{n+1}(n+1)\right)\right) \tag{135}\\
&= \frac{1}{(n+1)D_{n+1}}\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)\operatorname{Tr}\left(K(\sigma)\left(\mathbbm{1}+\sum_{k=1}^{n}\sigma_{n}(k)\right)\right). \tag{136}
\end{align}

Finally, $\sigma_{n}(k)\leq\mathbbm{1}$, so $\operatorname{Tr}(K(\sigma)\sigma_{n}(k))\leq\operatorname{Tr}(K(\sigma))$, and we can bound the above as

\[
F(\mathcal{T})\leq\frac{1}{D_{n+1}}\int dP_{\mathcal{T},\mathbbm{1}}(\sigma)\operatorname{Tr}\left(K(\sigma)\right)
=\frac{\operatorname{Tr}\Pi^{+}_{n}}{D_{n+1}}=\frac{D_{n}}{D_{n+1}}=\frac{n+1}{n+d}. \tag{137}
\]
∎
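Both ingredients of the proof, (132) and (133), as well as the tightness of the bound (124) at $d=2$, $n=1$, can be checked numerically for qubits. The sketch below (numpy Monte Carlo; the random-basis measure-and-prepare scheme is a standard illustration, not a protocol from this paper) verifies that the Haar average of $(|\psi\rangle\!\langle\psi|)^{\otimes 2}$ approaches $\Pi^{+}_{2}/3$, that (133) holds for $n=1$ with a trace-one $\sigma$, and that measure-and-prepare attains $F=2/3=(n+1)/(n+d)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_state(rng, dim=2):
    # Haar-random pure state via a normalized complex Gaussian vector.
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

# SWAP on two qubits and the symmetric projector Pi_2^+ = (I + SWAP)/2.
SWAP = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        SWAP[i * 2 + j, j * 2 + i] = 1.0
Pi2 = (np.eye(4) + SWAP) / 2
D2 = round(np.trace(Pi2).real)        # binom(2+2-1, 2-1) = 3

# Check (132): the Haar average of psi^{(x)2} approaches Pi_2^+ / D_2.
acc = np.zeros((4, 4), dtype=complex)
for _ in range(20000):
    psi = haar_state(rng)
    rho = np.outer(psi, psi.conj())
    acc += np.kron(rho, rho)
acc /= 20000
assert np.allclose(acc, Pi2 / D2, atol=0.02)

# Check (133) for n = 1: Tr_2(Pi_2^+ (I (x) sigma)) = (1/2)(I + sigma)
# for a trace-one sigma (Pi_1^+ is the identity on a single copy).
a = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
sigma = a @ a.conj().T
sigma /= np.trace(sigma)
M = (Pi2 @ np.kron(np.eye(2), sigma)).reshape(2, 2, 2, 2)
lhs = np.trace(M, axis1=1, axis2=3)   # partial trace over the 2nd copy
assert np.allclose(lhs, (np.eye(2) + sigma) / 2)

# Tightness at d = 2, n = 1: measure the single copy in a Haar-random
# basis and output the observed basis state; average fidelity -> 2/3.
total, trials = 0.0, 100000
for _ in range(trials):
    psi = haar_state(rng)
    b0 = haar_state(rng)
    b1 = np.array([-b0[1].conj(), b0[0].conj()])  # orthogonal complement
    p0 = abs(np.vdot(b0, psi)) ** 2
    guess = b0 if rng.random() < p0 else b1
    total += abs(np.vdot(guess, psi)) ** 2
avg_fidelity = total / trials
assert abs(avg_fidelity - 2 / 3) < 0.01
```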

6.2 Bandit policy as a sequence of tomographies

Theorem 17.

Given a $d$-dimensional general pure-state multi-armed quantum bandit, for any policy $\pi$ the average expected regret is bounded as

\[
\int d\psi\ \operatorname{\mathbb{E}}_{|\psi\rangle\!\langle\psi|,\pi}\left[\textup{Regret}(T,\pi,|\psi\rangle\!\langle\psi|)\right]\geq(d-1)\log\left(\frac{T}{d+1}\right), \tag{138}
\]

where the expectation is taken with respect to the measure (34) over actions taken by the bandit, and the regret is defined in (37).

The above theorem gives $\operatorname{\mathbb{E}}\left[\text{Regret}(T)\right]=\Omega\!\left(d\log\frac{T}{d}\right)$. In the case of qubit environments, we have $d=2$ and $\operatorname{\mathbb{E}}\left[\text{Regret}(T)\right]=\Omega(\log T)$.

Proof.

Given a policy $\pi$, we can introduce a POVM $E_{t}:(\Sigma\times\{0,1\})^{\times t}\to\operatorname{End}(\mathcal{H}^{\otimes t})$ such that

\[
P^{t}_{|\psi\rangle\!\langle\psi|,\pi}(A_{1},r_{1},\dotsc,A_{t},r_{t})=\operatorname{Tr}\left((|\psi\rangle\!\langle\psi|)^{\otimes t}E_{t}(A_{1},r_{1},\dotsc,A_{t},r_{t})\right), \tag{139}
\]

where $P^{t}_{|\psi\rangle\!\langle\psi|,\pi}$ is the probability measure defined by (34), restricted to the actions and rewards up to step $t$. The construction of this POVM is presented in the proof of Lemma 9 in [14]. We also define the coordinate mapping

\[
\Psi_{t}(\Pi_{1},r_{1},\dotsc,\Pi_{t},r_{t})=\Pi_{t}, \tag{140}
\]

where $\Pi_{i}\in\mathcal{A}$ are actions and $r_{i}\in\{0,1\}$ are rewards of the PSMAQB. Now, for each step $t$, we can define a tomography scheme $\mathcal{T}_{t}=E_{t}\circ\Psi_{t}^{-1}$ as the pushforward POVM from $E_{t}$ to the space $(\mathcal{A},\Sigma)$. Informally, this tomography scheme takes $t$ copies of the state, runs the policy $\pi$ on them, and outputs the $t$-th action of the policy as the predicted state. For $A\in\Sigma$, we can rewrite the tomography scheme's distribution over predictions as

\[
P_{\mathcal{T}_{t},(|\psi\rangle\!\langle\psi|)^{\otimes t}}(A)=\operatorname{Tr}\left(\mathcal{T}_{t}(A)(|\psi\rangle\!\langle\psi|)^{\otimes t}\right)=\operatorname{Tr}\left(E_{t}(\Psi_{t}^{-1}(A))(|\psi\rangle\!\langle\psi|)^{\otimes t}\right)=\left(P^{t}_{|\psi\rangle\!\langle\psi|,\pi}\circ\Psi_{t}^{-1}\right)(A). \tag{141}
\]
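The pushforward construction $\mathcal{T}_{t}=E_{t}\circ\Psi_{t}^{-1}$ has a simple classical analogue: marginalizing a trajectory distribution onto its final action. A toy sketch (the trajectory distribution below is random, purely for illustration):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
t = 3  # steps; toy actions and rewards both in {0, 1}

# A joint distribution over trajectories (a_1, r_1, ..., a_t, r_t).
trajectories = list(product([0, 1], repeat=2 * t))
p = rng.random(len(trajectories))
p /= p.sum()

# Pushforward along Psi_t(a_1, r_1, ..., a_t, r_t) = a_t: sum the
# probability of every trajectory whose final action lands in A.
pushforward = np.zeros(2)
for traj, prob in zip(trajectories, p):
    pushforward[traj[-2]] += prob  # traj[-2] is the final action a_t

# The pushforward is again a probability distribution on actions.
assert np.isclose(pushforward.sum(), 1.0)
```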

Then, the fidelity of $\mathcal{T}_t$ on the input $\ket{\psi}\!\bra{\psi}$ can be rewritten as

\begin{align}
F(\mathcal{T}_t,\ket{\psi}\!\bra{\psi}) &= \int \langle\psi|\rho|\psi\rangle \, dP_{\mathcal{T}_t,(\ket{\psi}\!\bra{\psi})^{\otimes t}}(\rho) \tag{142}\\
&= \int \langle\psi|\Psi_t(\Pi_1,r_1,\cdots,\Pi_t,r_t)|\psi\rangle \, dP^t_{\ket{\psi}\!\bra{\psi},\pi}(\Pi_1,r_1,\cdots,\Pi_t,r_t) \tag{143}\\
&= \operatorname{\mathbb{E}}_{\ket{\psi}\!\bra{\psi},\pi}\left[\langle\psi|\Pi_t|\psi\rangle\right]. \tag{144}
\end{align}
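As a quick numerical sanity check on the boundary case of the fidelity bound from Theorem 16, note that at $t=0$ the optimal average fidelity $(t+1)/(t+d)$ reduces to $1/d$: with no measurement data, a fixed pure-state guess achieves average fidelity exactly $1/d$ over Haar-random inputs. The following illustrative Monte Carlo sketch (not part of the proof; the helper `haar_state` is our own) confirms this standard Haar average:

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_state(d):
    # Haar-random pure state: normalized complex Gaussian vector
    v = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    return v / np.linalg.norm(v)

d = 4
phi = haar_state(d)  # a fixed guess, made with no measurement data (t = 0)
# Average fidelity |<psi|phi>|^2 over Haar-random psi should be 1/d,
# matching the optimal tomography fidelity (t+1)/(t+d) at t = 0.
mean_fid = np.mean([abs(np.vdot(phi, haar_state(d))) ** 2 for _ in range(200_000)])
print(mean_fid)  # close to 1/d = 0.25
```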

Using the bound on the average tomography fidelity of $\mathcal{T}_t$ from Theorem 16, we can now bound the average regret of $\pi$:

\begin{align}
& \int \operatorname{\mathbb{E}}_{\ket{\psi}\!\bra{\psi}}\left[\textup{Regret}(T,\pi,\ket{\psi}\!\bra{\psi})\right] d\psi \tag{145}\\
&= T - \sum_{t=1}^{T} \int \operatorname{\mathbb{E}}_{\ket{\psi}\!\bra{\psi}}\left[\langle\psi|\Pi_t|\psi\rangle\right] d\psi \tag{146}\\
&= T - \sum_{t=1}^{T} F(\mathcal{T}_t) \geq \sum_{t=1}^{T} \left(1 - \frac{t+1}{t+d}\right) \tag{147}\\
&= \sum_{t=1}^{T} \frac{d-1}{t+d} \geq (d-1)\log\left(\frac{T}{d+1}\right), \tag{148}
\end{align}

where the last inequality follows from bounding the sum from below by the integral of the function $f(t) = 1/(t+d)$. ∎
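The final chain of inequalities is elementary and can also be verified numerically. The short sketch below (helper names are our own) checks both the algebraic identity $1-(t+1)/(t+d) = (d-1)/(t+d)$ and the integral lower bound over a range of horizons $T$ and dimensions $d$:

```python
import math

def summed_regret_bound(T, d):
    # sum_{t=1}^T (1 - (t+1)/(t+d)) = sum_{t=1}^T (d-1)/(t+d)
    return sum(1.0 - (t + 1) / (t + d) for t in range(1, T + 1))

def integral_regret_bound(T, d):
    # (d-1) * log(T/(d+1)), from lower-bounding the sum by an integral
    return (d - 1) * math.log(T / (d + 1))

for d in (2, 3, 10):
    for T in (10, 1_000, 100_000):
        assert summed_regret_bound(T, d) >= integral_regret_bound(T, d)
```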

Acknowledgements:

JL thanks Jan Seyfried and Yanglin Hu for comments and suggestions, Erkka Haapasalo for discussions about disturbance, and Roberto Rubboli for many technical discussions. Mikhail Terekhov is grateful to be supported by the EDIC Fellowship from the School of Computer and Communication Sciences at EPFL. Josep Lumbreras and Marco Tomamichel are supported by the National Research Foundation, Singapore and A*STAR under its CQT Bridging Grant and the Quantum Engineering Programme grant NRF2021-QEP2-02-P05.

References

  • [1] S. Aaronson and G. N. Rothblum. “Gentle measurement of quantum states and differential privacy”. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 322–333, (2019).
  • [2] Y. Aharonov, D. Z. Albert, and L. Vaidman. “How the result of a measurement of a component of the spin of a spin-1/2 particle can turn out to be 100”. Physical Review Letters 60(14): 1351 (1988).
  • [3] S. Brahmachari, J. Lumbreras, and M. Tomamichel. “Quantum contextual bandits and recommender systems for quantum data”. Quantum Machine Intelligence 6(2): 58 (2024).
  • [4] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. “Bandits With Heavy Tail”. IEEE Transactions on Information Theory 59(11): 7711–7717 (2013).
  • [5] P. Busch, P. Lahti, and R. F. Werner. “Proof of Heisenberg’s error-disturbance relation”. Physical Review Letters 111(16): 160405 (2013).
  • [6] C. Butucea, J. Johannes, and H. Stein. “Sample-optimal learning of quantum states using gentle measurements”. arXiv preprint arXiv:2505.24587, (2025).
  • [7] M. S. Byrd and N. Khaneja. “Characterization of the positivity of the density matrix in terms of the coherence vector representation”. Physical Review A 68(6): 062322, (2003).
  • [8] M. Guţă, J. Kahn, R. Kueng, and J. A. Tropp. “Fast state tomography with optimal error bounds”. Journal of Physics A: Mathematical and Theoretical 53(20): 204001 (2020).
  • [9] J. Haah, A. W. Harrow, Z. Ji, X. Wu, and N. Yu. “Sample-optimal tomography of quantum states”. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 913–925, (2016).
  • [10] A. Hayashi, T. Hashimoto, and M. Horibe. “Reexamination of optimal quantum state estimation of pure states”. Physical Review A 72(3): 032325, (2005).
  • [11] R. Kueng, H. Rauhut, and U. Terstiege. “Low rank matrix recovery from rank one measurements”. Applied and Computational Harmonic Analysis 42(1): 88–116 (2017).
  • [12] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press (2020).
  • [13] M. Lerasle. “Lecture notes: Selected topics on robust statistical learning theory”. arXiv preprint arXiv:1908.10761, (2019).
  • [14] J. Lumbreras, E. Haapasalo, and M. Tomamichel. “Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states”. Quantum 6: 749, (2022).
  • [15] J. Lumbreras, R. C. Huang, Y. Hu, M. Gu, and M. Tomamichel. “Quantum state-agnostic work extraction (almost) without dissipation”. arXiv preprint arXiv:2505.09456, (2025).
  • [16] J. Lumbreras and M. Tomamichel. “Linear bandits with polylogarithmic minimax regret”. In Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 3644–3682, (2024).
  • [17] A. M. Medina and S. Yang. “No-regret algorithms for heavy-tailed linear bandits”. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1642–1650, (2016).
  • [18] M. Ozawa. “Universally valid reformulation of the Heisenberg uncertainty principle on noise and disturbance in measurement”. Physical Review A 67(4): 042105 (2003).
  • [19] H. Shao, X. Yu, I. King, and M. R. Lyu. “Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs”. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 8430–8439, Red Hook, NY, USA, (2018).
  • [20] A. Winter. “Coding theorem and strong converse for quantum channels”. IEEE Transactions on Information Theory 45(7): 2481–2485 (1999).