TUHH Open Research
Convergence properties of natural gradient descent for minimizing KL divergence

Citation Link: https://doi.org/10.15480/882.15749
Publication Type
Journal Article
Date Issued
2025-07-30
Language
English
Author(s)
Datar, Adwait  
Data Science Foundations E-21  
Ay, Nihat  
Data Science Foundations E-21  
TORE-DOI
10.15480/882.15749
TORE-URI
https://hdl.handle.net/11420/56918
Journal
Transactions on Machine Learning Research
Issue
7
Start Page
1
End Page
28
Citation
Transactions on Machine Learning Research (7): 1-28 (2025)
Publisher Link
https://openreview.net/forum?id=h6hjjAF5Bj
Scopus ID
2-s2.0-105012908170
Publisher
OpenReview.net
The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry: the exponential family (θ coordinates) and the mixture family (η coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the underlying statistical model. In continuous time, we prove that the convergence rates of GD in the θ and η coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in the η and θ coordinates can be scaled to 2c and 2/c, respectively, for any c > 0, while NGD maintains a fixed convergence rate of 2, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
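The abstract contrasts Euclidean gradient descent in the two dual charts with the coordinate-invariant natural gradient descent. As a rough numerical illustration of that setup (a minimal sketch, not code from the paper), the Python snippet below minimizes KL(q ‖ p) over a categorical simplex using GD in θ coordinates, GD in η coordinates, and NGD with the Fisher information matrix as the metric. The specific parameterizations, step size, and the simple projection that keeps the η iterate inside the simplex are assumptions made for this sketch.

```python
# Minimal sketch (not the authors' code): minimize KL(q || p) over a categorical
# simplex with GD in theta (exponential) coordinates, GD in eta (mixture)
# coordinates, and natural gradient descent using the Fisher information.
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # number of outcomes
q = rng.dirichlet(np.ones(n))           # target distribution (the optimum)

def p_from_theta(theta):
    """Exponential-family (theta) chart: p_i ∝ exp(theta_i), with theta_n = 0."""
    z = np.concatenate([theta, [0.0]])
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def p_from_eta(eta):
    """Mixture-family (eta) chart: eta holds the first n-1 probabilities."""
    return np.concatenate([eta, [1.0 - eta.sum()]])

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def grad_theta(theta):
    # d/dtheta_j KL(q || p_theta) = p_j - q_j,  j = 1..n-1
    p = p_from_theta(theta)
    return (p - q)[:-1]

def grad_eta(eta):
    # d/deta_j KL(q || p_eta) = -q_j/p_j + q_n/p_n
    p = p_from_eta(eta)
    return -q[:-1] / p[:-1] + q[-1] / p[-1]

def fisher_theta(theta):
    # Fisher information of the categorical model in theta coordinates
    p = p_from_theta(theta)[:-1]
    return np.diag(p) - np.outer(p, p)

def run(steps=200, lr=0.1):
    theta = np.zeros(n - 1)             # GD in theta coordinates
    eta = np.full(n - 1, 1.0 / n)       # GD in eta coordinates
    theta_ng = np.zeros(n - 1)          # NGD (computed here in the theta chart)
    for _ in range(steps):
        theta -= lr * grad_theta(theta)
        eta -= lr * grad_eta(eta)
        eta = np.clip(eta, 1e-12, None)              # keep iterate inside simplex
        eta *= min(1.0, (1 - 1e-12) / eta.sum())
        g = np.linalg.solve(fisher_theta(theta_ng), grad_theta(theta_ng))
        theta_ng -= lr * g
    return (kl(q, p_from_theta(theta)),
            kl(q, p_from_eta(eta)),
            kl(q, p_from_theta(theta_ng)))

print("final KL(q||p):  GD-theta=%.2e  GD-eta=%.2e  NGD=%.2e" % run())
```

Because the θ chart is unconstrained while the η chart must stay inside the simplex, the sketch clips and rescales the η iterate after each step; the NGD update solves a linear system with the Fisher matrix rather than inverting it explicitly.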
DDC Class
600: Technology
Publication version
publishedVersion
License
https://creativecommons.org/licenses/by/4.0/
Name
4787_Convergence_Properties_of.pdf
Type
Main Article
Size
1016.62 KB
Format
Adobe PDF