TextToTagGenerator

Introduction
The TextToTagGenerator model is a sequence-to-sequence transformer trained to generate tags for a given input text. The model is based on the T5 architecture, which casts a wide range of natural language processing tasks as text-to-text (sequence-to-sequence) problems.
The model is available at: TextToTagGenerator.
Model Architecture
The TextToTagGenerator model uses the T5 transformer architecture, which consists of a stack of transformer encoder and decoder layers. The encoder processes the input text, and the decoder generates the output tags. The model is pre-trained on a large corpus of text data and fine-tuned on a dataset of text and tag pairs using a maximum likelihood estimation loss.
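To make this encoder-decoder flow concrete, here is a minimal inference sketch using the Hugging Face Transformers library. The checkpoint name is a placeholder (t5-small stands in for the actual TextToTagGenerator weights), and the generation settings are illustrative assumptions rather than the model’s exact configuration.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint; substitute the published TextToTagGenerator model ID.
checkpoint = "t5-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

article = "A short walkthrough of fine-tuning T5 to generate hashtags for blog posts."

# The encoder reads the article; the decoder generates a tag sequence token by token.
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```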
T5 (Text-To-Text Transfer Transformer)
T5 is a pre-trained transformer-based language model developed by Google’s AI research team, Google Brain. The name is short for “Text-To-Text Transfer Transformer”: the “5” refers to the five T-words in that name, and the model frames every task, from translation to summarization, as mapping an input text to an output text.
Figure: model structure of T5.
Figure: framework of the TextToTagGenerator model.
Training Dataset
The dataset consists of Medium articles and their corresponding hashtags, with a total of 192,368 samples and two columns: ‘text’ and ‘tag’. The ‘text’ column contains the actual content of the articles, while the ‘tag’ column holds the hashtags used for each article.
The dataset was split into training, validation, and test sets using an 80–20 split, so that a portion of the data was reserved for testing and evaluating the performance of the model; a sketch of such a split with the datasets library is given below.
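This is an illustration only: the file name below is hypothetical, and splitting the 20% held-out portion evenly into validation and test sets is an assumption about how the three sets were produced.

```python
from datasets import load_dataset

# Hypothetical file; the article only states that the data has 'text' and 'tag' columns.
raw = load_dataset("csv", data_files="medium_articles_tags.csv")["train"]

# 80-20 split into training and held-out data, then divide the held-out
# portion into validation and test sets.
split = raw.train_test_split(test_size=0.2, seed=42)
held_out = split["test"].train_test_split(test_size=0.5, seed=42)

dataset = {
    "train": split["train"],
    "validation": held_out["train"],
    "test": held_out["test"],
}
print({name: len(ds) for name, ds in dataset.items()})
```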
One important aspect of the dataset is the diversity of topics covered in the medium articles, which allows the model to learn and generalize to a wide range of topics. However, there may also be some imbalance in the distribution of hashtags, which could affect the performance of the model on certain tags with fewer samples.
Tokenization
The input text and output tags are tokenized using the AutoTokenizer provided by the Hugging Face Transformers library. The tokenizer converts the text and tags into a sequence of tokens, which are then processed by the transformer model.
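As a rough sketch of this step, the snippet below tokenizes the ‘text’ column as encoder input and the ‘tag’ column as decoder labels. The base checkpoint and the maximum lengths are assumptions, not values stated above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # assumed base checkpoint

def preprocess(batch):
    # Tokenize the article text as the encoder input.
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    # Tokenize the hashtags as the decoder target (labels).
    labels = tokenizer(text_target=batch["tag"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Applied to the splits from the previous sketch:
# tokenized = {name: ds.map(preprocess, batched=True) for name, ds in dataset.items()}
```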
Model Training
The model is trained using the Seq2SeqTrainer provided by the Hugging Face Transformers library.
The arguments used to train TextToTagGenerator are listed below; the sketch after the list shows how they plug into the trainer.
optimizer: AdamW
learning_rate: 4e-5
batch_size: 16
evaluation_strategy: ‘steps’
eval_steps: 100
weight_decay: 0.01
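A minimal sketch of how these values plug into Seq2SeqTrainingArguments and Seq2SeqTrainer is given here. It reuses the tokenizer and tokenized splits from the preprocessing sketch above; the base checkpoint, output directory, and predict_with_generate flag are assumptions, and AdamW is the trainer’s default optimizer, so it needs no explicit setting.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # assumed base checkpoint
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="text-to-tag-generator",  # assumed output directory
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    eval_steps=100,
    weight_decay=0.01,
    predict_with_generate=True,  # generate tags during evaluation so ROUGE can be computed
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```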
Evaluation
The TextToTagGenerator model achieved a precision score of 0.625 and a recall score of 0.142 on the test set. The F1 score was 0.231 for ROUGE-1, 0.176 for ROUGE-2, 0.226 for ROUGE-L, and 0.228 for ROUGE-Lsum. These results indicate that the model can generate relevant hashtags for Medium articles, although there is still room for improvement.
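For reference, ROUGE scores of this kind can be computed with the Hugging Face evaluate library. The snippet below is a minimal illustration with made-up predictions and references, not the article’s actual evaluation pipeline.

```python
import evaluate

rouge = evaluate.load("rouge")

# Hypothetical example: generated hashtags vs. reference hashtags.
predictions = ["machine-learning nlp transformers"]
references = ["machine-learning deep-learning nlp t5 transformers"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum
```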
Graphical reports of the model training are given here. The full test report for the model is given below:
Figure: ROUGE scores of the TextToTagGenerator model.
Analysis
The TextToTagGenerator model aims to predict relevant hashtags for a given text input. It was trained on a dataset of Medium articles and achieved a precision score of 0.62 for ROUGE-1, 0.49 for ROUGE-2, 0.61 for ROUGE-L, and 0.61 for ROUGE-Lsum on the test dataset.
The precision score of 0.62 for ROUGE-1 indicates that the model can predict relevant single-word hashtags with a reasonable level of accuracy. However, the score of 0.49 for ROUGE-2 suggests that the model struggles to predict relevant multi-word hashtags, possibly because it does not capture the context of the input text effectively.
The ROUGE-L and ROUGE-Lsum scores of 0.61 suggest that the model performs reasonably well at predicting hashtags that are similar to those in the ground truth. However, these scores are lower than the ROUGE-1 score, indicating that the model has more difficulty predicting longer hashtags that are not exact matches.
The TextToTagGenerator model’s performance is satisfactory but not perfect. One of the main reasons for this is the limited amount of data used for training. Scaling up the dataset and training the model on a more extensive dataset could help improve its performance. Additionally, the model could be fine-tuned on specific domains, such as technology, sports, or entertainment, to improve its performance on specific topics. Furthermore, the model could be improved by incorporating external data sources such as WordNet or Wikipedia to enhance its understanding of the domain and improve the quality of the generated hashtags.
Conclusion
In conclusion, TextToTagGenerator is a deep learning model that can automatically generate relevant hashtags for Medium articles. The model achieved a precision score of 0.625 and a recall score of 0.142 on the test set, indicating that it can generate useful hashtags for social media posts.
However, there is still room for improvement, and future work should focus on scaling up the dataset, fine-tuning the model on specific domains, and incorporating external data sources to enhance the model’s performance.
References
T5: A Detailed Explanation
T5-small model
Exploring Transfer Learning with T5
T5 paper
The Illustrated Transformer