Image captioning typically relies on a
combination of Convolutional Neural Networks
(CNNs) for image feature extraction and
Recurrent Neural Networks (RNNs), such as
LSTMs or GRUs, to generate descriptive text
based on those features. More recently,
transformer-based models have largely replaced
RNN decoders, since self-attention captures
long-range context across the whole caption.
CNN Encoder – Extracts features from the
input image.
RNN / Transformer Decoder – Translates the
CNN output (image features) into a
human-readable sentence.
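The encoder–decoder pairing above can be sketched in PyTorch. Everything below is illustrative: the tiny convolutional stack stands in for a real pretrained backbone (e.g. a ResNet), the vocabulary size and dimensions are arbitrary, and the greedy decoding loop is one simple strategy among several (beam search is common in practice).

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Toy conv stack standing in for a pretrained backbone (hypothetical)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> one feature vector
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, images):                 # (B, 3, H, W)
        feats = self.conv(images).flatten(1)   # (B, 32)
        return self.fc(feats)                  # (B, embed_dim)

class RNNDecoder(nn.Module):
    """LSTM decoder that emits one token id per step (greedy decoding)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def generate(self, img_feat, end_id, max_len=20):
        # Feed the image feature as the first decoder input, then feed
        # each predicted word's embedding back in until <end> or max_len.
        tokens, state = [], None
        inp = img_feat.unsqueeze(1)             # (B, 1, embed_dim)
        for _ in range(max_len):
            out, state = self.lstm(inp, state)
            word = self.out(out[:, -1]).argmax(-1)   # (B,)
            if word.item() == end_id:
                break
            tokens.append(word.item())
            inp = self.embed(word).unsqueeze(1)
        return tokens

encoder = CNNEncoder()
decoder = RNNDecoder(vocab_size=1000)
image = torch.randn(1, 3, 224, 224)            # dummy image tensor
caption_ids = decoder.generate(encoder(image), end_id=2)
```

With untrained weights the token ids are meaningless; in a real system the ids would be mapped back to words through the training vocabulary.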
Flask API – Exposes endpoints to receive
images and return captions.
MQTT Broker – Could be used to publish or
subscribe to events/images in a real-time
pipeline.
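One way such a pipeline could consume MQTT events is sketched below. The topic names (`images/in`, `captions/out`), the JSON message shape, and the broker address are assumptions for illustration; the parsing logic is kept separate from the paho-mqtt wiring so it can be tested without a running broker.

```python
import json

def handle_image_event(payload, caption_fn):
    """Parse a hypothetical JSON event {"image_id": ..., "image_b64": ...}
    and return a JSON string pairing the id with its generated caption."""
    event = json.loads(payload)
    caption = caption_fn(event["image_b64"])
    return json.dumps({"image_id": event["image_id"], "caption": caption})

if __name__ == "__main__":
    # Wiring sketch: requires the paho-mqtt package and a reachable broker.
    import paho.mqtt.client as mqtt

    def on_message(client, userdata, msg):
        result = handle_image_event(msg.payload, lambda b64: "a caption")
        client.publish("captions/out", result)

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("localhost", 1883)
    client.subscribe("images/in")
    client.loop_forever()
```

Keeping `handle_image_event` pure makes the event format easy to evolve independently of the broker connection code.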
InfluxDB & Grafana – Possibly storing
metadata or performance metrics related to
captioning for monitoring and analytics.
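Captioning metrics could be written to InfluxDB as line protocol points and then charted in Grafana. The measurement name `caption_latency` and the tag/field names below are hypothetical; only the line-protocol format itself (`measurement,tags fields timestamp`, with an `i` suffix on integer fields) is InfluxDB's documented syntax.

```python
import time

def caption_metric_line(model, latency_ms, caption_len, ts_ns=None):
    """Format one captioning metric as an InfluxDB line protocol point:
    measurement,tag_set field_set timestamp(ns)."""
    ts_ns = ts_ns if ts_ns is not None else time.time_ns()
    return (f"caption_latency,model={model} "
            f"latency_ms={latency_ms},caption_len={caption_len}i {ts_ns}")
```

A point like this would be POSTed to InfluxDB's write endpoint (or sent via an official client library), and a Grafana dashboard could then plot latency per model over time.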