# HTML Bad Coding Practices

## Overview

This document briefly explains the parameters, training process, and testing phase of a code framework that uses a T5 model for HTML text correction and generation.

## Parameters

- `model`: T5 model instance to train.
- `tokenizer`: Tokenizer associated with the T5 model.
- `train_dataset`: Training dataset (an instance of `HtmlDataset`).
- `epochs`: Number of training epochs (default is 3).
- `learning_rate`: Learning rate for the optimizer (0.001).
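
For orientation, a call into such a training function might look like the sketch below. The checkpoint name (`t5-small`), the raw `inputs`/`labels` lists, and the `HtmlDataset` constructor signature are assumptions, not values taken from the repository.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Base checkpoint to fine-tune; "t5-small" is an assumed choice.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# inputs/labels: parallel lists of bad HTML and its corrected form (assumed).
train_dataset = HtmlDataset(inputs, labels)

train(model, tokenizer, train_dataset, epochs=3, learning_rate=0.001)
```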

## Training

Training is performed using a custom `train` function that:

- Loads batches of data.
- Tokenizes inputs and labels.
- Computes gradients and updates model parameters.

The function prints the average loss for each epoch and saves the fine-tuned model at the end of training.
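
The following is a minimal sketch of such a loop, assuming the dataset yields (input, label) string pairs; the batch size, padding/truncation settings, and save path are assumptions rather than repository values.

```python
import torch
from torch.optim import Adam
from torch.utils.data import DataLoader

def train(model, tokenizer, train_dataset, epochs=3, learning_rate=0.001):
    # Run on GPU when available, otherwise on CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    optimizer = Adam(model.parameters(), lr=learning_rate)

    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for inputs, labels in loader:
            # Tokenize the batch of input and label strings.
            enc = tokenizer(list(inputs), return_tensors="pt",
                            padding=True, truncation=True).to(device)
            lab = tokenizer(list(labels), return_tensors="pt",
                            padding=True, truncation=True).input_ids.to(device)
            lab[lab == tokenizer.pad_token_id] = -100  # ignore padding in the loss

            # T5 computes the loss itself when labels are passed in.
            loss = model(**enc, labels=lab).loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch + 1}: average loss {total_loss / len(loader):.4f}")

    model.save_pretrained("fine_tuned_t5")  # assumed save path
```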

## Components Used

1. **Custom components** (a minimal `HtmlDataset` sketch appears after this list):
   - `HTMLAwareTokenizer`: Custom HTML-aware tokenizer.
   - `HtmlDataset`: Custom dataset class for handling input and label data.
   - `preprocess_html`: Custom function for preprocessing HTML text.
   - `tokenize_html_aware`: Custom function for HTML-aware tokenization.
2. **Predefined components**:
   - `T5ForConditionalGeneration` and `T5Tokenizer`: Classes from the Hugging Face Transformers library for the T5 model and tokenizer.
   - `torch.device`: PyTorch class for specifying the device (CPU or GPU).
   - `torch.optim.Adam`: PyTorch class for the Adam optimizer.
   - `torch.utils.data.DataLoader`: PyTorch class for efficient data loading.
3. **Training and optimization**:
   - Uses PyTorch's standard training loop for gradient computation, backpropagation, and model parameter updates.
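
As flagged in item 1, here is a minimal sketch of what an `HtmlDataset` might look like. The real class in the repository may preprocess HTML or store tokenized tensors instead, so treat the constructor and return types as assumptions.

```python
from torch.utils.data import Dataset

class HtmlDataset(Dataset):
    """Pairs raw HTML containing bad practices with its corrected form."""

    def __init__(self, inputs, labels):
        self.inputs = inputs    # e.g. HTML with bad coding practices
        self.labels = labels    # corresponding corrected HTML

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Return plain strings; tokenization happens in the training loop.
        return self.inputs[idx], self.labels[idx]
```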

## Testing

For testing, a custom `generate_correction` function is used. Given the fine-tuned T5 model and its tokenizer, it generates a corrected version of the input text.
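
A hedged sketch of what such a function could look like, assuming it wraps Hugging Face's `generate` API; the actual signature and decoding settings in the repository may differ:

```python
import torch

def generate_correction(model, tokenizer, text, device, max_length=512):
    """Generate a corrected version of `text` with the fine-tuned model."""
    model.eval()
    # Encode the input and move it to the same device as the model.
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=max_length).to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `generate_correction(model, tokenizer, "<b><i>text</b></i>", device)` would return the model's corrected markup as a plain string.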


**Note:** Ensure all required dependencies and datasets are available before running the training or testing scripts.