Saltar al contenido principal

Escribe una PREreview

A Comparative Survey of CNN-LSTM Architectures for Image Captioning

Publicada
Servidor
Preprints.org
DOI
10.20944/preprints202512.1301.v1

Image captioning, the task of automatically generating textual descriptions for images, lies at the intersection of computer vision and natural language processing. Architectures combining Convolutional Neural Networks (CNNs) for visual feature extraction and Long Short-Term Memory (LSTM) networks for language generation have become a dominant paradigm. This survey provides a comprehensive overview of fifteen influential papers employing these CNN-LSTM frameworks, summarizing their core contributions, architectural variations (including attention mechanisms and encoder-decoder designs), training strategies, and performance on benchmark datasets. A detailed comparative analysis, presented in tabular format, evaluates these works by detailing their technical approaches, key contributions or advantages, and identified limitations. Based on this analysis, we identify key evolutionary trends in CNN-LSTM models, discuss prevailing challenges such as generating human-like and contextually rich captions, and highlight promising future research directions, including deeper reasoning, improved evaluation, and the integration of newer architectures.

Puedes escribir una PREreview de A Comparative Survey of CNN-LSTM Architectures for Image Captioning. Una PREreview es una revisión de un preprint y puede variar desde unas pocas oraciones hasta un extenso informe, similar a un informe de revisión por pares organizado por una revista.

Antes de comenzar

Te pediremos que inicies sesión con tu ORCID iD. Si no tienes un iD, puedes crear uno.

¿Qué es un ORCID iD?

Un ORCID iD es un identificador único que te distingue de otros/as con tu mismo nombre o uno similar.

Comenzar ahora