
Text generation using t5-small transformer

Check out the original notebook here.

The goal of this blog is to demonstrate how data augmentation can enhance the development of a model pipeline capable of predicting the sentiment behind customer reviews using both original and engineered features.

The resulting model and dataset will be made available on Hugging Face. If you’re looking for the assets discussed in this blog, you can find them there: the e-commerce model and the augmented dataset.

The original dataset, Womens Clothing E-Commerce Reviews, is publicly available on Kaggle; you can check it out here.

dataset_name = "./Womens Clothing E-Commerce Reviews.csv"

This dataset contains 23,486 rows and 10 feature variables. Each row represents a customer review and includes the following variables:

  • Clothing ID: Integer categorical variable representing the specific item being reviewed.
  • Age: Positive integer variable indicating the reviewer’s age.
  • Title: String variable containing the title of the review.
  • Review Text: String variable for the main content of the review.
  • Rating: Positive ordinal integer variable indicating the product rating given by the customer, ranging from 1 (Worst) to 5 (Best).
  • Recommended IND: Binary variable indicating whether the customer recommends the product (1 for recommended, 0 for not recommended).
  • Positive Feedback Count: Positive integer capturing the number of other customers who found the review helpful.
  • Division Name: Categorical variable representing the high-level division of the product.
  • Department Name: Categorical variable for the department the product belongs to.
  • Class Name: Categorical variable representing the product’s class name.

Import the common libraries that are usually available in any Python environment.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

When running the notebook in Google Colab, we need to install some libraries that are not available there…

try:
    import google.colab
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

def upload_env_file():
    files.upload()
    if not os.path.exists('.env'):
        raise Exception('.env file not uploaded on working area')

def load_env_variables():
    from dotenv import load_dotenv

    if not os.path.exists('.env'):
        raise Exception('.env file not found on working area')
    load_dotenv()

def login_huggingface_hub():
    from huggingface_hub import login

    hf_token = os.environ['HF_TOKEN']
    login(hf_token)

def install_packages():
    print('Installing packages...')
    !curl -O https://raw.githubusercontent.com/wilberquito/women-e-commerce-opinion/main/requirements.txt
    !pip install -r requirements.txt

def download_data():
    print('Downloading dataset...')
    !curl -O https://raw.githubusercontent.com/wilberquito/women-e-commerce-opinion/main/scripts/download-data.sh
    !chmod +x download-data.sh
    !./download-data.sh

if IN_COLAB:
    install = input('Install Python packages [Y/n]: ')
    if install == 'Y':
        install_packages()
    if not os.path.exists('.env'):
        upload_env_file()

load_env_variables()
login_huggingface_hub()

if not os.path.exists(dataset_name):
    download_data()
    Install Python packages [Y/n]: Y
    Installing packages...
    The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
    Token is valid (permission: write).
    Your token has been saved to /root/.cache/huggingface/token
    Login successful

Import other libraries that are not available in the environment by default; they become available after installing them from requirements.txt.

from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from transformers import pipeline
import torch

Data analysis

cols = pd.read_csv(dataset_name, nrows=1).columns
reviews_df = pd.read_csv(dataset_name, usecols=cols[1:])
cols = [col.lower() for col in reviews_df.columns]
reviews_df.columns = cols
reviews_df.head()
       clothing id  age  title                    review text                                       rating  recommended ind  positive feedback count  division name   department name  class name
    0          767   33  NaN                      Absolutely wonderful - silky and sexy and comf…        4                1                        0  Initmates       Intimate         Intimates
    1         1080   34  NaN                      Love this dress! it’s sooo pretty. i happene…          5                1                        4  General         Dresses          Dresses
    2         1077   60  Some major design flaws  I had such high hopes for this dress and reall…        3                0                        0  General         Dresses          Dresses
    3         1049   50  My favorite buy!         I love, love, love this jumpsuit. it’s fun, fl…        5                1                        0  General Petite  Bottoms          Pants
    4          847   47  Flattering shirt         This shirt is very flattering to all due to th…        5                1                        6  General         Tops             Blouses
reviews_df.shape
    (23486, 10)
reviews_df.describe()
            clothing id           age        rating  recommended ind  positive feedback count
    count  23486.000000  23486.000000  23486.000000     23486.000000             23486.000000
    mean     918.118709     43.198544      4.196032         0.822362                 2.535936
    std      203.298980     12.279544      1.110031         0.382216                 5.702202
    min        0.000000     18.000000      1.000000         0.000000                 0.000000
    25%      861.000000     34.000000      4.000000         1.000000                 0.000000
    50%      936.000000     41.000000      5.000000         1.000000                 1.000000
    75%     1078.000000     52.000000      5.000000         1.000000                 3.000000
    max     1205.000000     99.000000      5.000000         1.000000               122.000000
reviews_df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 23486 entries, 0 to 23485
    Data columns (total 10 columns):
     #   Column                   Non-Null Count  Dtype
    ---  ------                   --------------  -----
     0   clothing id              23486 non-null  int64
     1   age                      23486 non-null  int64
     2   title                    19676 non-null  object
     3   review text              22641 non-null  object
     4   rating                   23486 non-null  int64
     5   recommended ind          23486 non-null  int64
     6   positive feedback count  23486 non-null  int64
     7   division name            23472 non-null  object
     8   department name          23472 non-null  object
     9   class name               23472 non-null  object
    dtypes: int64(5), object(5)
    memory usage: 1.8+ MB

Binning rating

reviews_df["rating"].value_counts()
    rating  count
    5       13131
    4        5077
    3        2871
    2        1565
    1         842
ax = reviews_df["rating"].plot(kind="hist", bins=np.arange(0, 6) + 0.5, ec='k')
plt.show()

Figure: distribution of ratings.

Creating a model that predicts ratings based on reviews is challenging with the distribution above. To address this, rather than predicting a rating between 1 and 5, let’s simplify it by predicting a value between 1 and 3, where 1 indicates the user is dissatisfied, 2 indicates satisfaction, and 3 indicates the user loves the product.

def soft_rating(rating):
    # Collapse the 1-5 rating into three levels: 1-3 -> 1, 4 -> 2, 5 -> 3.
    new_rating = 3
    if rating <= 3:
        new_rating = 1
    elif rating <= 4:
        new_rating = 2
    return new_rating

reviews_df["soft rating"] = reviews_df["rating"].map(soft_rating)
reviews_df.head()
       clothing id  age  title                    review text                                       rating  recommended ind  positive feedback count  division name   department name  class name  soft rating
    0          767   33  NaN                      Absolutely wonderful - silky and sexy and comf…        4                1                        0  Initmates       Intimate         Intimates             2
    1         1080   34  NaN                      Love this dress! it’s sooo pretty. i happene…          5                1                        4  General         Dresses          Dresses               3
    2         1077   60  Some major design flaws  I had such high hopes for this dress and reall…        3                0                        0  General         Dresses          Dresses               1
    3         1049   50  My favorite buy!         I love, love, love this jumpsuit. it’s fun, fl…        5                1                        0  General Petite  Bottoms          Pants                 3
    4          847   47  Flattering shirt         This shirt is very flattering to all due to th…        5                1                        6  General         Tops             Blouses               3
reviews_df["soft rating"].value_counts()
    soft rating  count
    3            13131
    1             5278
    2             5077
ax = reviews_df["soft rating"].plot(kind="hist", bins=np.arange(0, 4) + 0.5, ec="k")
plt.show()

Figure: distribution of the softened rating.

Removing useless data

It would be helpful to determine if any features in the dataset are relevant. We can do this by first examining the linear correlation between soft rating and the other features.

numeric_reviews_df = reviews_df.select_dtypes(include="number").drop(
    labels=["clothing id"], axis=1
)
numeric_reviews_df.corr()
                                  age    rating  recommended ind  positive feedback count  soft rating
    age                      1.000000  0.026831         0.030622                 0.043079     0.036042
    rating                   0.026831  1.000000         0.792336                -0.064961     0.941389
    recommended ind          0.030622  0.792336         1.000000                -0.069045     0.726893
    positive feedback count  0.043079 -0.064961        -0.069045                 1.000000    -0.062530
    soft rating              0.036042  0.941389         0.726893                -0.062530     1.000000

From this information, it’s evident that positive feedback count is more of a hindrance than a help: its correlation with soft rating is essentially zero and slightly negative. Additionally, it doesn’t directly reflect the user’s opinion but rather how others have interpreted it.

reviews_df = reviews_df.drop(labels=["positive feedback count"], axis=1, errors='ignore')
reviews_df.head(3)
       clothing id  age  title                    review text                                       rating  recommended ind  division name  department name  class name  soft rating
    0          767   33  NaN                      Absolutely wonderful - silky and sexy and comf…        4                1  Initmates      Intimate         Intimates             2
    1         1080   34  NaN                      Love this dress! it’s sooo pretty. i happene…          5                1  General        Dresses          Dresses               3
    2         1077   60  Some major design flaws  I had such high hopes for this dress and reall…        3                0  General        Dresses          Dresses               1
reviews_df['division name'].value_counts()
    division name   count
    General         13850
    General Petite   8120
    Initmates        1502
reviews_df['department name'].value_counts()
    department name  count
    Tops             10468
    Dresses           6319
    Bottoms           3799
    Intimate          1735
    Jackets           1032
    Trend              119
reviews_df['class name'].value_counts()
    class name      count
    Dresses          6319
    Knits            4843
    Blouses          3097
    Sweaters         1428
    Pants            1388
    Jeans            1147
    Fine gauge       1100
    Skirts            945
    Jackets           704
    Lounge            691
    Swim              350
    Outerwear         328
    Shorts            317
    Sleep             228
    Legwear           165
    Intimates         154
    Layering          146
    Trend             119
    Casual bottoms      2
    Chemises            1

The features division name, department name, and class name could be useful, but their distribution is highly skewed. To simplify the analysis, I’ve decided to take a different approach and drop them.

reviews_df = reviews_df.drop(labels=['division name', 'department name', 'class name'], axis=1, errors='ignore')
reviews_df.head(3)
       clothing id  age  title                    review text                                       rating  recommended ind  soft rating
    0          767   33  NaN                      Absolutely wonderful - silky and sexy and comf…        4                1            2
    1         1080   34  NaN                      Love this dress! it’s sooo pretty. i happene…          5                1            3
    2         1077   60  Some major design flaws  I had such high hopes for this dress and reall…        3                0            1

Binning age

span = 10
span_window = span / 2
limit = int(100 / span)
bins = np.array([x * span for x in range(limit)]) + span_window
bins
    array([ 5., 15., 25., 35., 45., 55., 65., 75., 85., 95.])
np.histogram(reviews_df['age'], bins = bins)
    (array([   0,  892, 5175, 7771, 5110, 3152, 1193,  161,   30]),
     array([ 5., 15., 25., 35., 45., 55., 65., 75., 85., 95.]))
reviews_df['age'].plot(kind='hist', bins=bins, ec='k')
plt.show()

Figure: distribution of reviewer ages.

def soft_age(age):
    # Bucket ages into three generations: <35 -> 1, 35-54 -> 2, 55+ -> 3.
    generation = 3
    if age < 35:
        generation = 1
    elif age < 55:
        generation = 2
    return generation

reviews_df['generation'] = reviews_df['age'].map(soft_age)
reviews_df.head(3)
       clothing id  age  title                    review text                                       rating  recommended ind  soft rating  generation
    0          767   33  NaN                      Absolutely wonderful - silky and sexy and comf…        4                1            2           1
    1         1080   34  NaN                      Love this dress! it’s sooo pretty. i happene…          5                1            3           1
    2         1077   60  Some major design flaws  I had such high hopes for this dress and reall…        3                0            1           3
reviews_df['generation'].value_counts()
    generation  count
    2           12881
    1            6067
    3            4538

Checking for missing values

reviews_df.isna().any()
    clothing id        False
    age                False
    title               True
    review text         True
    rating             False
    recommended ind    False
    soft rating        False
    generation         False

There are some missing values! Let’s figure out how many there are for each feature.

reviews_df.isna().sum()
    clothing id           0
    age                   0
    title              3810
    review text         845
    rating                0
    recommended ind       0
    soft rating           0
    generation            0

Well, what about the $\%$ of missing values with respect to the whole dataset?

(reviews_df.isna().sum() * 100) / reviews_df.shape[0]
    clothing id         0.000000
    age                 0.000000
    title              16.222430
    review text         3.597888
    rating              0.000000
    recommended ind     0.000000
    soft rating         0.000000
    generation          0.000000

At first glance, the $16\%$ of missing titles is the most concerning figure. However, the more significant issue is the approximately $3.6\%$ of missing descriptions: titles can be generated as summaries of the descriptions, but not the other way around.

reviews_df = reviews_df[~reviews_df['review text'].isna()]
reviews_df = reviews_df.reset_index()
reviews_df.head(3)
       index  clothing id  age  title                    review text                                       rating  recommended ind  soft rating  generation
    0      0          767   33  NaN                      Absolutely wonderful - silky and sexy and comf…        4                1            2           1
    1      1         1080   34  NaN                      Love this dress! it’s sooo pretty. i happene…          5                1            3           1
    2      2         1077   60  Some major design flaws  I had such high hopes for this dress and reall…        3                0            1           3

After removing the samples without descriptions, the number of missing titles decreased, but there are still many reviews without a title.

(reviews_df.isna().sum() * 100) / reviews_df.shape[0]
    index               0.000000
    clothing id         0.000000
    age                 0.000000
    title              13.100128
    review text         0.000000
    rating              0.000000
    recommended ind     0.000000
    soft rating         0.000000
    generation          0.000000
reviews_df['title'].head(3)
       title
    0  NaN
    1  NaN
    2  Some major design flaws

Select features

features = [
    'clothing id',
    'title',
    'review text',
    'soft rating',
    'recommended ind',
    'generation'
]

reviews_df['soft rating'] = reviews_df['soft rating'].astype('category')
reviews_df['generation'] = reviews_df['generation'].astype('category')

reviews_df = reviews_df[features]
reviews_df.head(3)
       clothing id  title                    review text                                       soft rating  recommended ind  generation
    0          767  NaN                      Absolutely wonderful - silky and sexy and comf…            2                1           1
    1         1080  NaN                      Love this dress! it’s sooo pretty. i happene…              3                1           1
    2         1077  Some major design flaws  I had such high hopes for this dress and reall…            1                0           3
reviews_df.dtypes
    clothing id          int64
    title               object
    review text         object
    soft rating       category
    recommended ind      int64
    generation        category

Fine-tuning t5-small transformer

T5 is a text-to-text transformer framework capable of machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). If you want to know more, consult the post introducing it as a shared text-to-text framework here.

In this case we will fine-tune the transformer so that it produces good summaries.
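T5 selects the task through a plain-text prefix prepended to the input, so summarization, translation, and classification all share one model interface. Below is a minimal sketch of the idea using the pre-trained checkpoint; the prefixes are the standard ones from the T5 paper, the variable names are illustrative, and the exact outputs depend on the checkpoint.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

# The same model performs different tasks depending on the prefix.
prompts = [
    "summarize: The dress runs small and the fabric is thin, but the color is lovely.",
    "translate English to German: The dress is lovely.",
]
for prompt in prompts:
    input_ids = t5_tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = t5_model.generate(input_ids, max_new_tokens=20)
    print(t5_tokenizer.decode(output_ids[0], skip_special_tokens=True))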

Generating datasets

def generate_datasets(reviews, test_size=0.2, stratify='soft rating'):
    # Reviews that already have a title are used for fine-tuning;
    # reviews without a title are kept aside for inference (title generation).
    titled_reviews = reviews[~reviews['title'].isna()]
    infer_reviews = reviews[reviews['title'].isna()]

    train, test = train_test_split(
        titled_reviews,
        test_size=test_size,
        stratify=titled_reviews[stratify],
        random_state=42
    )

    datasets = [Dataset.from_pandas(df) for df in [train, test, infer_reviews]]
    dataset_names = ['train', 'test', 'infer']
    dict_datasets = dict(zip(dataset_names, datasets))

    train_datasets = DatasetDict(
        {
            'train': dict_datasets['train'],
            'test': dict_datasets['test']
        }
    )
    infer_dataset = dict_datasets['infer']

    return train_datasets, infer_dataset
def preprocess_function(examples):
    # T5 expects a task prefix; "summarize: " selects the summarization task.
    prefix = "summarize: "
    inputs = [prefix + doc for doc in examples["review text"]]
    model_inputs = tokenizer(
        inputs, max_length=1024,
        truncation=True,
        padding='max_length',
        return_tensors='pt'
    )
    labels = tokenizer(
        text_target=examples["title"],
        max_length=128,
        truncation=True,
        padding='max_length',
        return_tensors="pt"
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
def compute_metrics(eval_pred):
    import evaluate

    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 (positions ignored by the loss) with the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    rouge = evaluate.load("rouge")
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
train_datasets, infer_dataset = generate_datasets(reviews_df)
train_datasets, infer_dataset
    (DatasetDict({
         train: Dataset({
             features: ['clothing id', 'title', 'review text', 'soft rating', 'recommended ind', 'generation', '__index_level_0__'],
             num_rows: 15740
         })
         test: Dataset({
             features: ['clothing id', 'title', 'review text', 'soft rating', 'recommended ind', 'generation', '__index_level_0__'],
             num_rows: 3935
         })
     }),
     Dataset({
         features: ['clothing id', 'title', 'review text', 'soft rating', 'recommended ind', 'generation', '__index_level_0__'],
         num_rows: 2966
     }))
checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
batch_size = 256
batched = True
train_datasets = train_datasets.map(preprocess_function, batch_size=batch_size, batched=batched)

Fine-tuning t5-small

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
model

    T5ForConditionalGeneration(
      (shared): Embedding(32128, 512)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 512)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(in_features=512, out_features=512, bias=False)
                  (k): Linear(in_features=512, out_features=512, bias=False)
                  (v): Linear(in_features=512, out_features=512, bias=False)
                  (o): Linear(in_features=512, out_features=512, bias=False)
                  (relative_attention_bias): Embedding(32, 8)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (1): T5LayerFF(
                (DenseReluDense): T5DenseActDense(
                  (wi): Linear(in_features=512, out_features=2048, bias=False)
                  (wo): Linear(in_features=2048, out_features=512, bias=False)
                  (dropout): Dropout(p=0.1, inplace=False)
                  (act): ReLU()
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (1-5): 5 x T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(in_features=512, out_features=512, bias=False)
                  (k): Linear(in_features=512, out_features=512, bias=False)
                  (v): Linear(in_features=512, out_features=512, bias=False)
                  (o): Linear(in_features=512, out_features=512, bias=False)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (1): T5LayerFF(
                (DenseReluDense): T5DenseActDense(
                  (wi): Linear(in_features=512, out_features=2048, bias=False)
                  (wo): Linear(in_features=2048, out_features=512, bias=False)
                  (dropout): Dropout(p=0.1, inplace=False)
                  (act): ReLU()
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
        )
        (final_layer_norm): T5LayerNorm()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (decoder): T5Stack(
        (embed_tokens): Embedding(32128, 512)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(in_features=512, out_features=512, bias=False)
                  (k): Linear(in_features=512, out_features=512, bias=False)
                  (v): Linear(in_features=512, out_features=512, bias=False)
                  (o): Linear(in_features=512, out_features=512, bias=False)
                  (relative_attention_bias): Embedding(32, 8)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (1): T5LayerCrossAttention(
                (EncDecAttention): T5Attention(
                  (q): Linear(in_features=512, out_features=512, bias=False)
                  (k): Linear(in_features=512, out_features=512, bias=False)
                  (v): Linear(in_features=512, out_features=512, bias=False)
                  (o): Linear(in_features=512, out_features=512, bias=False)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (2): T5LayerFF(
                (DenseReluDense): T5DenseActDense(
                  (wi): Linear(in_features=512, out_features=2048, bias=False)
                  (wo): Linear(in_features=2048, out_features=512, bias=False)
                  (dropout): Dropout(p=0.1, inplace=False)
                  (act): ReLU()
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
          (1-5): 5 x T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(in_features=512, out_features=512, bias=False)
                  (k): Linear(in_features=512, out_features=512, bias=False)
                  (v): Linear(in_features=512, out_features=512, bias=False)
                  (o): Linear(in_features=512, out_features=512, bias=False)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (1): T5LayerCrossAttention(
                (EncDecAttention): T5Attention(
                  (q): Linear(in_features=512, out_features=512, bias=False)
                  (k): Linear(in_features=512, out_features=512, bias=False)
                  (v): Linear(in_features=512, out_features=512, bias=False)
                  (o): Linear(in_features=512, out_features=512, bias=False)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (2): T5LayerFF(
                (DenseReluDense): T5DenseActDense(
                  (wi): Linear(in_features=512, out_features=2048, bias=False)
                  (wo): Linear(in_features=2048, out_features=512, bias=False)
                  (dropout): Dropout(p=0.1, inplace=False)
                  (act): ReLU()
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
        )
        (final_layer_norm): T5LayerNorm()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (lm_head): Linear(in_features=512, out_features=32128, bias=False)
    )
torch.cuda.is_available()
    True
if torch.cuda.is_available():
    torch.cuda.empty_cache()

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
epochs = 1

training_args = Seq2SeqTrainingArguments(
    output_dir="e-comerce",
    eval_strategy="steps",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=weight_decay,
    save_total_limit=3,
    num_train_epochs=epochs,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_datasets["train"],
    eval_dataset=train_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
    Step  Training Loss  Validation Loss  Rouge1    Rouge2    Rougel    Rougelsum  Gen Len
    500   0.433200       0.195895         0.000000  0.000000  0.000000  0.000000   0.000000
    TrainOutput(global_step=984, training_loss=0.32267741846844433, metrics={'train_runtime': 1143.418, 'train_samples_per_second': 13.766, 'train_steps_per_second': 0.861, 'total_flos': 4260559910338560.0, 'train_loss': 0.32267741846844433, 'epoch': 1.0})
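With push_to_hub=True in the training arguments, checkpoints are uploaded during training to a Hub repository named after output_dir. A minimal sketch of the usual final step, which also uploads the tokenizer and an auto-generated model card and makes the wilberquito/e-comerce checkpoint used below available:

# Push the final fine-tuned model, tokenizer, and model card to the Hub.
trainer.push_to_hub()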

Data augmentation using the fine-tuned t5-small

In this section we will see how to re-use the model fine-tuned in the previous section to generate summaries that serve as titles, given the descriptions in the dataset.

text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
text
    "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In the following lines you can check the token ids the tokenizer generates from the text.

tokenizer = AutoTokenizer.from_pretrained("wilberquito/e-comerce")
inputs = tokenizer(text, return_tensors="pt").input_ids
inputs
    tensor([[21603,    10,    37,    86,    89,  6105,   419,  8291,  1983,  1364,
                 7,  7744,  2672,  1358,     6,   533,   124,  1358,     6,    11,
               827,  1358,     5,    94,    31,     7,     8,   167,  8299,  1041,
                30,     3, 26074,     8,  3298,  5362,    16,   797,   892,     6,
                84,    56,  5656,    95,   797,  2765,    11,   482,   207,    18,
              8832,    53,     6,  7021,  2476,   640,     8,   684,     5,    94,
                31,   195,  1364,     8, 11724,    11,   987,     8,  6173,    18,
              1123,   138,   189,    63,    11, 11711,    12,   726,    70,  2725,
               698,     5,   275,   150,    80,   492,   365,  1514, 31471,   399,
               215,    56,   726,     3,     9, 23925,    72,    16,  5161,     5,
                 1]])
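To confirm that these are token ids rather than embeddings, they can be decoded back into text; a quick sketch:

# Decoding the ids reproduces the original prompt, plus the </s> end-of-sequence token.
print(tokenizer.decode(inputs[0]))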
summarizer = pipeline("summarization", model="wilberquito/e-comerce", min_length=5, max_length=30)
summarizer(text)
    Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

    [{'summary_text': 'Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs .'}]
def title_data_augmentation(summarizer, batch):
    # Generate a title for each review in the batch by summarizing its text.
    reviews = batch['review text']
    summaries = summarizer(reviews)
    summaries = [summary['summary_text'] for summary in summaries]
    return {
        'title': summaries
    }
summarization_kwargs = {
    'min_length': 4,
    'max_length': 16,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu'
}

summarizer = pipeline(
    "summarization",
    model="wilberquito/e-comerce",
    **summarization_kwargs
)

infer_dataset = infer_dataset.map(
    lambda batch: title_data_augmentation(summarizer, batch),
    batched=True, batch_size=256
)

for data in infer_dataset:
    print(data)
    break
    {'clothing id': 767, 'title': 'Beautiful and sexy', 'review text': 'Absolutely wonderful - silky and sexy and comfortable', 'soft rating': 2, 'recommended ind': 1, 'generation': 1, '__index_level_0__': 0}

Publish the augmented dataset to Hugging Face

infer_df = pd.DataFrame(infer_dataset)
infer_df.head(3)
       clothing id  title               review text                                       soft rating  recommended ind  generation  __index_level_0__
    0          767  Beautiful and sexy  Absolutely wonderful - silky and sexy and comf…            2                1           1                  0
    1         1080  Love this dress!    Love this dress! it’s sooo pretty. i happene…              3                1           1                  1
    2         1095  Beautiful dress!    This dress is perfection! so pretty and flatte…            3                1           2                 11
train_df = pd.DataFrame(train_datasets['train'])
train_df.head(3)
       clothing id  title            review text                                       soft rating  recommended ind  generation  __index_level_0__
    0          835  Gorgeous top!    I’m obsessed with this top! i got it for chris…            3                1           3              18003
    1           42  Lovely           Makes me feel like a sexy ice skater. true to …            3                1           1              18273
    2          860  Classic stripes  I do not know if this item is worth the full p…            2                1           3               7611
test_df = pd.DataFrame(train_datasets['test'])
test_df.head(3)
       clothing id  title        review text                                       soft rating  recommended ind  generation  __index_level_0__
    0         1095  Elegant!     I purchased this in london, early april, and i…            3                1           3              17402
    1          830  Lampshade    If you like to wear lampshades this is the shi…            1                0           2              19915
    2          829  Maeve tunic  I bought this tunic today because it paired we…            3                1           3               4001
dataset_df = pd.concat([infer_df, train_df, test_df])
dataset_df = dataset_df.drop(['__index_level_0__'], axis=1)
dataset_df.head(3)
       clothing id  title               review text                                       soft rating  recommended ind  generation
    0          767  Beautiful and sexy  Absolutely wonderful - silky and sexy and comf…            2                1           1
    1         1080  Love this dress!    Love this dress! it’s sooo pretty. i happene…              3                1           1
    2         1095  Beautiful dress!    This dress is perfection! so pretty and flatte…            3                1           2
def generate_e_commerce_datasets(reviews, test_size=0.2, stratify='soft rating'):
    train_df, test_df = train_test_split(
        reviews,
        test_size=test_size,
        stratify=reviews[stratify],
        random_state=42
    )
    datasets = DatasetDict(
        {
            'train': Dataset.from_pandas(train_df),
            'test': Dataset.from_pandas(test_df)
        }
    )

    return datasets
e_commerce_datasets = generate_e_commerce_datasets(dataset_df)
e_commerce_datasets
    DatasetDict({
        train: Dataset({
            features: ['clothing id', 'title', 'review text', 'soft rating', 'recommended ind', 'generation', '__index_level_0__'],
            num_rows: 18112
        })
        test: Dataset({
            features: ['clothing id', 'title', 'review text', 'soft rating', 'recommended ind', 'generation', '__index_level_0__'],
            num_rows: 4529
        })
    })

e_commerce_datasets.push_to_hub('wilberquito/processed_women_clothing_e_commerce_opinions')
    CommitInfo(commit_url='https://huggingface.co/datasets/wilberquito/processed_women_clothing_e_commerce_opinions/commit/44fc5121c2a7926f18e558f8f13cc11de7d1458b', commit_message='Upload dataset', commit_description='', oid='44fc5121c2a7926f18e558f8f13cc11de7d1458b', pr_url=None, pr_revision=None, pr_num=None)
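Once pushed, the augmented dataset and the fine-tuned model can be pulled straight from the Hub. A quick usage sketch (the variable names here are illustrative):

from datasets import load_dataset
from transformers import pipeline

# Download the augmented dataset published above.
e_commerce = load_dataset("wilberquito/processed_women_clothing_e_commerce_opinions")
print(e_commerce["train"][0])

# Reuse the fine-tuned summarizer for new review texts.
summarizer = pipeline("summarization", model="wilberquito/e-comerce")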
This post is licensed under CC BY 4.0 by the author.