
[Reinforcement Learning] Policy Gradients for Stochastic Policies


Contents: Objective function of the policy, Theorem 1, Theorem 2, Theorem 3, Theorem 4, Theorem 5

Objective function of the policy

$$J(\pi)=\mathbb{E}_{\tau|\pi}[G_0]=\mathbb{E}_{\tau|\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big]$$

Theorem 1

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t)\Big]$$

where

$$G_{t}=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$$
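The return $G_t$ satisfies the backward recursion $G_t = r_t + \gamma G_{t+1}$, which is how it is usually computed for a finite (truncated or terminated) trajectory. The sketch below is only an illustration: the function name `returns_to_go` and the sample rewards are hypothetical, and the infinite sum is assumed to be cut off at the end of the reward list.

```python
def returns_to_go(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for a finite reward sequence.

    Uses the backward recursion G_t = r_t + gamma * G_{t+1}, with G_T = 0.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step reward sequence.
rewards = [1.0, 0.0, 2.0]
print(returns_to_go(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```

The first entry is $G_0$, the quantity whose expectation defines $J$.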

Proof:

$$J(\theta)=\sum_{\tau|\pi}G_{0}\,\pi(\tau;\theta)$$

$$\nabla_{\theta}J(\theta)=\sum_{\tau|\pi}G_{0}\nabla_{\theta}\pi(\tau;\theta)$$

$$\nabla_{\theta}\pi(\tau;\theta)=\pi(\tau;\theta)\nabla_{\theta}\ln\pi(\tau;\theta)$$
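The log-derivative identity above can be checked numerically. The sketch below does this for a single action probability rather than a whole trajectory probability, assuming a softmax policy over three actions with logits `theta` (the softmax parameterization and all numbers are assumptions made for illustration, not part of the original derivation). It compares a finite-difference gradient of $\pi(a)$ with $\pi(a)\nabla_\theta\ln\pi(a)$.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(theta, a):
    """Analytic gradient of ln pi(a) w.r.t. the logits: e_a - pi."""
    pi = softmax(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]


def finite_diff_prob(theta, a, eps=1e-6):
    """Central finite-difference gradient of pi(a) w.r.t. the logits."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((softmax(up)[a] - softmax(dn)[a]) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]  # hypothetical logits
a = 2
pi = softmax(theta)
lhs = finite_diff_prob(theta, a)                       # grad of pi(a)
rhs = [pi[a] * g for g in grad_log_softmax(theta, a)]  # pi(a) * grad ln pi(a)
max_gap = max(abs(l - r) for l, r in zip(lhs, rhs))
print(max_gap)  # small, limited by finite-difference error
```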

$$\pi(\tau;\theta)=p_{1}(s_{0})\prod_{i=0}^{\infty}\pi(a_{i}|s_{i};\theta)T(s_i,a_i,s_{i+1})$$

$$\ln\pi(\tau;\theta)=\sum_{i=0}^{\infty}\ln\pi(a_i|s_i;\theta)+\sum_{i=0}^{\infty}\ln T(s_i,a_i,s_{i+1})+\ln p_{1}(s_0)$$

The transition probabilities and the initial distribution do not depend on $\theta$, so

$$\nabla_{\theta}\ln\pi(\tau;\theta)=\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)$$

$$\nabla_{\theta}J(\theta)=\sum_{\tau}G_0\,\pi(\tau;\theta)\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)=\mathbb{E}_{\tau}\Big[G_0\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

Finally, rewards collected before step $i$ do not depend on $a_i$ given $s_i$, and $\mathbb{E}_{a_i|s_i}[\nabla_{\theta}\ln\pi(a_i|s_i;\theta)]=0$, so inside the expectation $G_0$ can be replaced by $\gamma^{i}G_i$ in the $i$-th term, which gives the stated form.
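The score-function form $\mathbb{E}_{\tau}[G_0\sum_i\nabla_\theta\ln\pi(a_i|s_i;\theta)]$ can be verified exactly on a toy problem by enumerating every trajectory. The sketch below makes several assumptions that are not in the original text: a two-state, two-action MDP with made-up numbers, a horizon truncated to two steps, rewards that depend only on $(s,a)$, and a tabular softmax policy with logits `theta[s][a]`. It compares the enumerated expectation with a finite-difference gradient of $J$.

```python
import itertools
import math

GAMMA = 0.9
P0 = [0.6, 0.4]                    # initial state distribution (hypothetical)
T = [[[0.7, 0.3], [0.2, 0.8]],     # T[s][a][s'] (hypothetical)
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0], [-1.0, 2.0]]      # r[s][a] (hypothetical)


def policy(theta, s):
    exps = [math.exp(x) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]


def trajectories(theta):
    """Yield (probability, G_0, [(s0, a0), (s1, a1)]) for every 2-step path."""
    for s0, a0, s1, a1 in itertools.product(range(2), repeat=4):
        prob = (P0[s0] * policy(theta, s0)[a0]
                * T[s0][a0][s1] * policy(theta, s1)[a1])
        g0 = R[s0][a0] + GAMMA * R[s1][a1]
        yield prob, g0, [(s0, a0), (s1, a1)]


def objective(theta):
    return sum(prob * g0 for prob, g0, _ in trajectories(theta))


def score_function_gradient(theta):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for prob, g0, steps in trajectories(theta):
        for s, a in steps:
            pi = policy(theta, s)
            for k in range(2):
                # d ln pi(a|s) / d theta[s][k] = 1[k == a] - pi(k|s)
                grad[s][k] += prob * g0 * ((1.0 if k == a else 0.0) - pi[k])
    return grad


def finite_difference_gradient(theta, eps=1e-6):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for k in range(2):
            up = [row[:] for row in theta]
            dn = [row[:] for row in theta]
            up[s][k] += eps
            dn[s][k] -= eps
            grad[s][k] = (objective(up) - objective(dn)) / (2 * eps)
    return grad


theta = [[0.3, -0.2], [0.1, 0.4]]
analytic = score_function_gradient(theta)
numeric = finite_difference_gradient(theta)
max_gap = max(abs(analytic[s][k] - numeric[s][k])
              for s in range(2) for k in range(2))
print(max_gap)  # small, limited by finite-difference error
```

Exact enumeration is only feasible for tiny problems; in practice the expectation is estimated from sampled trajectories.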

Theorem 2

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_{i}\nabla_{\theta}\ln\pi(a_i|s_i)\Big]$$

Proof:

$$J(\theta)=\mathbb{E}_{\tau|\pi}[G_{0}]=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}[G_{0}]=\mathbb{E}_{s_0}V^{\theta}(s_0)$$

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\nabla_{\theta}V^{\theta}(s_0)$$

$$V^{\theta}(s_0)=\sum_{a}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)$$
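The relation $V^{\theta}(s)=\sum_a\pi(a|s;\theta)Q^{\theta}(s,a)$, together with the one-step expansion $Q^{\theta}(s,a)=\sum_{s'}T(s,a,s')[r(s,a,s')+\gamma V^{\theta}(s')]$ that the proof uses next, can be iterated to evaluate a fixed policy. The following is a minimal sketch under stated assumptions: the two-state MDP, the policy table `PI`, and the choice of rewards that depend only on $(s,a)$ are all hypothetical.

```python
GAMMA = 0.9
T = [[[0.7, 0.3], [0.2, 0.8]],     # T[s][a][s'], hypothetical numbers
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0], [-1.0, 2.0]]      # r[s][a]; independent of s' for brevity
PI = [[0.5, 0.5], [0.25, 0.75]]    # pi(a|s), a fixed stochastic policy


def evaluate(n_sweeps=500):
    """Iterate the Bellman expectation equations until (near) convergence."""
    v = [0.0, 0.0]
    for _ in range(n_sweeps):
        q = [[R[s][a] + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in range(2))
              for a in range(2)] for s in range(2)]
        v = [sum(PI[s][a] * q[s][a] for a in range(2)) for s in range(2)]
    return v, q


v, q = evaluate()

# Residual of Q(s,a) = sum_s' T(s,a,s') [r + gamma V(s')] at the final iterate.
bellman_gap = max(
    abs(q[s][a] - (R[s][a] + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in range(2))))
    for s in range(2) for a in range(2))
print(v, bellman_gap)
```

Because $\gamma<1$, the iteration is a contraction and the residual shrinks geometrically with the number of sweeps.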

$$\nabla_{\theta}V^{\theta}(s_0)=\sum_{a}[\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)+\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)]$$

For the first term, using $Q^{\theta}(s_0,a)=\mathbb{E}_{\tau|s_0,a_0=a,\pi}[G_0]$:

$$\begin{aligned}
\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)
&=\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)\,\mathbb{E}_{\tau|s_0,a_0=a,\pi}[G_0]\\
&=\mathbb{E}_{a_0|s_0,\pi}\{\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\,\mathbb{E}_{\tau|s_0,a_0,\pi}[G_0]\}\\
&=\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{\tau|s_0,a_0,\pi}[G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)]\\
&=\mathbb{E}_{\tau|s_0,\pi}\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}
\end{aligned}$$

For the second term,

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$

where

$$Q^{\theta}(s_0,a)=\sum_{s'}T(s_0,a,s')[r(s_0,a,s')+\gamma V^{\theta}(s')]$$

Since $T$ and $r$ do not depend on $\theta$,

$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\sum_{s'}T(s_0,a_0,s')\nabla_{\theta}V^{\theta}(s')=\gamma\,\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)$$

Therefore

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\gamma\,\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}+\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

Similarly:

$$\nabla_{\theta}V^{\theta}(s_i)=\mathbb{E}_{\tau_i|s_i,\pi}\{G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\}+\gamma\,\mathbb{E}_{s_{i+1}|s_{i},\pi}\nabla_{\theta}V^{\theta}(s_{i+1}),\quad i=1,2,\dots$$

Substituting

$$\nabla_{\theta}V^{\theta}(s_1)=\mathbb{E}_{\tau_1|s_1,\pi}\{G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})$$

into $\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$ gives

$$\begin{aligned}
\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)
&=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\{\mathbb{E}_{\tau_1|s_1,\pi}\{G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})\}\\
&=\gamma\,\mathbb{E}_{\tau_1|s_0,\pi}[G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})\\
&=\mathbb{E}_{\tau|s_0,\pi}[\gamma G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})
\end{aligned}$$

Hence

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)+\gamma G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$$

Next, substitute

$$\nabla_{\theta}V^{\theta}(s_2)=\mathbb{E}_{\tau_2|s_2,\pi}\{G_2\nabla_{\theta}\ln\pi(a_2|s_2;\theta)\}+\gamma\,\mathbb{E}_{s_{3}|s_{2},\pi}\nabla_{\theta}V^{\theta}(s_{3})$$

into $\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$, and so on.

Repeating this process gives

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_{i}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
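The repeated substitution can be made explicit as an $n$-step identity. The following is a sketch, assuming that $\gamma<1$ and that $\|\nabla_{\theta}V^{\theta}(s)\|\le C$ for all states $s$ (a boundedness assumption that the original text does not state):

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{n-1}\gamma^{i}G_{i}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]+\gamma^{n}\,\mathbb{E}_{s_n|s_0,\pi}\nabla_{\theta}V^{\theta}(s_n)$$

$$\Big\|\gamma^{n}\,\mathbb{E}_{s_n|s_0,\pi}\nabla_{\theta}V^{\theta}(s_n)\Big\|\le\gamma^{n}C\xrightarrow[n\to\infty]{}0$$

Under these assumptions the remainder term vanishes as $n\to\infty$, which leaves the infinite sum.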

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
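For one sampled trajectory, the quantity inside the expectation, $\sum_i\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)$, can be accumulated in a single backward pass. The sketch below is illustrative only: the function name `reward_to_go_estimate` is hypothetical, the trajectory is assumed to be finite, and the score vectors $\nabla_{\theta}\ln\pi(a_i|s_i;\theta)$ are assumed to be supplied by the caller (the numbers used here are made up).

```python
def reward_to_go_estimate(rewards, scores, gamma):
    """Single-trajectory sample of sum_t gamma^t * G_t * grad ln pi(a_t|s_t).

    rewards[t] is r_t; scores[t] is the gradient vector of ln pi(a_t|s_t).
    The trajectory is assumed finite (truncated or terminated).
    """
    dim = len(scores[0])
    estimate = [0.0] * dim
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g            # G_t = r_t + gamma * G_{t+1}
        weight = (gamma ** t) * g
        for k in range(dim):
            estimate[k] += weight * scores[t][k]
    return estimate


# Hypothetical rewards and score vectors for a three-step trajectory.
rewards = [1.0, 0.0, 2.0]
scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(reward_to_go_estimate(rewards, scores, gamma=0.5))  # [2.0, 1.0]
```

Averaging this quantity over many sampled trajectories gives a Monte Carlo estimate of the gradient.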

Theorem 3

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i)\Big]$$

The first steps are the same as in the previous proof and are not repeated here.

$$\begin{aligned}
\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)
&=\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)Q^{\theta}(s_0,a)\\
&=\mathbb{E}_{a_0|s_0,\pi}\{\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)\}\\
&=\mathbb{E}_{\tau|s_0,\pi}\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}
\end{aligned}$$

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$

where

$$Q^{\theta}(s_0,a)=\sum_{s'}T(s_0,a,s')[r(s_0,a,s')+\gamma V^{\theta}(s')]$$

$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\sum_{s'}T(s_0,a_0,s')\nabla_{\theta}V^{\theta}(s')=\gamma\,\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)$$

Therefore

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\gamma\,\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}+\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

Similarly:

$$\nabla_{\theta}V^{\theta}(s_i)=\mathbb{E}_{\tau_i|s_i,\pi}\{Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\}+\gamma\,\mathbb{E}_{s_{i+1}|s_{i},\pi}\nabla_{\theta}V^{\theta}(s_{i+1}),\quad i=1,2,\dots$$

Substituting

$$\nabla_{\theta}V^{\theta}(s_1)=\mathbb{E}_{\tau_1|s_1,\pi}\{Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})$$

into $\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$ gives

$$\begin{aligned}
\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)
&=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\{\mathbb{E}_{\tau_1|s_1,\pi}\{Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})\}\\
&=\gamma\,\mathbb{E}_{\tau_1|s_0,\pi}[Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})\\
&=\mathbb{E}_{\tau|s_0,\pi}[\gamma Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})
\end{aligned}$$

Hence

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)+\gamma Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$$

Next, substitute

$$\nabla_{\theta}V^{\theta}(s_2)=\mathbb{E}_{\tau_2|s_2,\pi}\{Q^{\theta}(s_2,a_2)\nabla_{\theta}\ln\pi(a_2|s_2;\theta)\}+\gamma\,\mathbb{E}_{s_{3}|s_{2},\pi}\nabla_{\theta}V^{\theta}(s_{3})$$

into $\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$, and so on.

Repeating this process gives

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

Corollary.

$$\nabla_{\theta}J(\theta)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s)$$

Proof: By Theorem 3 and linearity of expectation,

$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\gamma^{i}\,\mathbb{E}_{\tau|\pi}[Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i)]$$

$$\mathbb{E}_{\tau|\pi}[Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i)]=\sum_s\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,[Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i)]\big|_{s_{i}=s,a_{i}=a}$$

$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\sum_s\sum_{a}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s)$$
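The corollary expresses the gradient through the discounted occupancy $d(s,a)=\sum_i\gamma^{i}\Pr[s_i=s,a_i=a|\pi]$, which can be computed directly for a small tabular problem. The sketch below is one way to check the formula numerically; it assumes a hypothetical two-state, two-action MDP, rewards that depend only on $(s,a)$, a tabular softmax policy with logits `theta[s][a]`, and truncation of the infinite sums after a fixed number of steps.

```python
import math

GAMMA = 0.9
P0 = [0.6, 0.4]                    # initial state distribution (hypothetical)
T = [[[0.7, 0.3], [0.2, 0.8]],     # T[s][a][s'] (hypothetical)
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0], [-1.0, 2.0]]      # r[s][a] (hypothetical)
S, A = 2, 2


def policy(theta):
    table = []
    for s in range(S):
        exps = [math.exp(x) for x in theta[s]]
        z = sum(exps)
        table.append([e / z for e in exps])
    return table


def q_values(pi, sweeps=600):
    """Iterative policy evaluation; returns (Q, V) for the policy table pi."""
    v = [0.0] * S
    for _ in range(sweeps):
        q = [[R[s][a] + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        v = [sum(pi[s][a] * q[s][a] for a in range(A)) for s in range(S)]
    return q, v


def discounted_occupancy(pi, steps=600):
    """d[s][a] = sum_i gamma^i * Pr[s_i = s, a_i = a | pi], truncated."""
    d = [[0.0] * A for _ in range(S)]
    state_dist = P0[:]
    discount = 1.0
    for _ in range(steps):
        for s in range(S):
            for a in range(A):
                d[s][a] += discount * state_dist[s] * pi[s][a]
        state_dist = [sum(state_dist[s] * pi[s][a] * T[s][a][s2]
                          for s in range(S) for a in range(A))
                      for s2 in range(S)]
        discount *= GAMMA
    return d


def objective(theta):
    _, v = q_values(policy(theta))
    return sum(P0[s] * v[s] for s in range(S))


def corollary_gradient(theta):
    pi = policy(theta)
    q, _ = q_values(pi)
    d = discounted_occupancy(pi)
    grad = [[0.0] * A for _ in range(S)]
    for s in range(S):
        for a in range(A):
            for k in range(A):
                # d ln pi(a|s) / d theta[s][k] = 1[k == a] - pi(k|s)
                score = (1.0 if k == a else 0.0) - pi[s][k]
                grad[s][k] += d[s][a] * q[s][a] * score
    return grad


theta = [[0.3, -0.2], [0.1, 0.4]]
analytic = corollary_gradient(theta)
eps = 1e-5
max_gap = 0.0
for s in range(S):
    for k in range(A):
        up = [row[:] for row in theta]
        dn = [row[:] for row in theta]
        up[s][k] += eps
        dn[s][k] -= eps
        numeric = (objective(up) - objective(dn)) / (2 * eps)
        max_gap = max(max_gap, abs(analytic[s][k] - numeric))
print(max_gap)  # small, limited by finite-difference error
```

Note that $d(s,a)$ is not normalized: its entries sum to $1/(1-\gamma)$ in the untruncated case.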

Theorem 4

$$\mathbb{E}_{(s_i,a_i)|\pi}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)Q^{\theta}(s_i,a_i)=\sum_{s}\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,\nabla_{\theta}\ln\pi(a|s;\theta)Q^{\theta}(s,a)$$

Proof:

Lemma.

$$Q^{\theta}(s,a)=\sum_{s'}T(s,a,s')[r(s,a,s')+\gamma V^{\theta}(s')]$$

$$\nabla_{\theta}Q^{\theta}(s,a)=\gamma\sum_{s'}T(s,a,s')\nabla_{\theta}V^{\theta}(s')$$

$$\nabla_{\theta}V^{\theta}(s')=\nabla_{\theta}\sum_{a'}\pi(a'|s';\theta)Q^{\theta}(s',a')=\sum_{a'}[\nabla_{\theta}\pi(a'|s';\theta)Q^{\theta}(s',a')+\pi(a'|s';\theta)\nabla_{\theta}Q^{\theta}(s',a')]$$

Therefore

$$\begin{aligned}
\nabla_{\theta}Q^{\theta}(s,a)
&=\gamma\sum_{s'}\sum_{a'}T(s,a,s')[\nabla_{\theta}\pi(a'|s';\theta)Q^{\theta}(s',a')+\pi(a'|s';\theta)\nabla_{\theta}Q^{\theta}(s',a')]\\
&=\gamma\sum_{s'}\sum_{a'}T(s,a,s')\pi(a'|s';\theta)[\nabla_{\theta}\ln\pi(a'|s';\theta)Q^{\theta}(s',a')+\nabla_{\theta}Q^{\theta}(s',a')]
\end{aligned}$$

$$\begin{aligned}
\nabla_{\theta}Q^{\theta}(s_i,a_i)
&=\gamma\sum_{s_{i+1}}\sum_{a_{i+1}}T(s_i,a_i,s_{i+1})\pi(a_{i+1}|s_{i+1};\theta)[\nabla_{\theta}\ln\pi(a_{i+1}|s_{i+1};\theta)Q^{\theta}(s_{i+1},a_{i+1})+\nabla_{\theta}Q^{\theta}(s_{i+1},a_{i+1})]\\
&=\gamma\,\mathbb{E}_{(s_{i+1},a_{i+1})|(s_i,a_i),\pi}[\nabla_{\theta}\ln\pi(a_{i+1}|s_{i+1};\theta)Q^{\theta}(s_{i+1},a_{i+1})+\nabla_{\theta}Q^{\theta}(s_{i+1},a_{i+1})]
\end{aligned}$$

$$J(\theta)=\mathbb{E}_{s_0}V^{\theta}(s_0)=\mathbb{E}_{s_0}\sum_{a}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)$$

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)+\mathbb{E}_{s_0}\sum_{a}\pi(\,\dots$$
