Recently, I worked on the problem of classifying products based on their one-sentence descriptions. My goal was to detect collection names or product series (e.g., for Adidas shoes: copa, superstar, etc.). A quick search turned up several existing solutions. Unfortunately, while they handled product names quite well, they performed much worse on collection names. In today’s post, I want to share my idea for solving this problem without using machine learning.


Source code is available on my GitHub.

# Libraries:
  
import pandas as pd
import regex
import string
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords

Let \(n\) be the number of sentences \(S = \{s_1, s_2, s_3, ..., s_n\}\) (e.g., auction titles from the Allegro service). My goal is to find a set of words \(X = \{x_1, x_2, x_3, ..., x_m\}\) such that together they appear in all sentences, but no two of them appear in the same sentence. More formally, \(\forall s\in S\ \exists x\in X: \text{the word } x \text{ is a fragment of the sentence } s\), and \(\forall s\in S\ \forall i,j: x_i \text{ is a fragment of } s \wedge x_j \text{ is a fragment of } s \implies i=j\). Of course, more than one set meets these criteria (everything depends on the product category, the length of the descriptions, the number of sentences, and many other parameters). Developing an algorithm that finds such a set \(X\) is one problem. Finding the right set (one containing only producer names or collection names) is another.

In this post, I focus on the first part of the problem. In practice, the data is not ideal (in particular, it may contain errors), so the conditions can be somewhat relaxed. Words from the set \(X\) do not have to appear in all sentences (90%-95% coverage is enough). Additionally, I allow words to co-occur with each other occasionally (e.g., in at most five sentences). To better illustrate the solution, I downloaded 6000 auction titles related to sports shoes from the Allegro service (allegro-sports-shoes.csv), and I will work on this data in the rest of the post.
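
To make the relaxed conditions concrete, here is a minimal sketch (with made-up sentences and a hypothetical check_candidate_set helper that is not part of the final script) verifying a candidate set of words against the coverage and co-occurrence thresholds:

# A toy check of the relaxed conditions; the sentences and the helper are
# illustrative only and simply test whole-word membership.
def check_candidate_set(sentences, candidate, min_coverage=0.90, max_cooccurrence=5):
    covered = 0
    cooccurrences = 0
    for s in sentences:
        words = s.lower().split()
        hits = [x for x in candidate if x in words]
        if hits:
            covered += 1
        if len(hits) > 1:
            cooccurrences += 1
    return covered / len(sentences) >= min_coverage and cooccurrences <= max_cooccurrence

toy_sentences = ['buty nike air max', 'buty adidas galaxy', 'buty puma smash']
check_candidate_set(toy_sentences, ['nike', 'adidas', 'puma'])
## True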

df = pd.read_csv("allegro-sports-shoes.csv", sep=';', encoding='utf-8')
data = list(df.iloc[:, 0])
data[:3]
## ['Buty Nike air max 270 czerwone 43', 'Buty męskie Umbro Prime II UMMX218001', 'Buty męskie sportowe Adidas Galaxy 4 F36171']

A large part of the characters and words can be removed from the sentences at the very beginning. To keep the script efficient, I moved some of the filtering (e.g., stopword removal) to a later step.

def clean_data(data, filter=None):
    # lowercase all sentences
    data = [x.lower() for x in data]
    # replace punctuation characters with spaces
    data = [regex.sub('|\\'.join(string.punctuation), ' ', x) for x in data]
    # remove tokens that start with a digit (sizes, prices, etc.)
    # in the middle, at the beginning, and at the end of the sentence
    data = [regex.sub(' ([0-9]+[^ ]{0,} )+', ' ', x) for x in data]
    data = [regex.sub('^[0-9]+[^ ]{0,} ', ' ', x) for x in data]
    data = [regex.sub(' [0-9]+[^ ]{0,}$', ' ', x) for x in data]
    # optionally keep only sentences containing the given word
    if filter:
        data = [x for x in data if x.find(filter) > -1]

    return data
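
As a quick sanity check, running the cleaning on the three example titles shown earlier should give roughly the following (note that the digit patterns only remove tokens that start with a digit, so catalogue codes such as ummx218001 survive this pass):

clean_data(data[:3])
## ['buty nike air max czerwone ', 'buty męskie umbro prime ii ummx218001', 'buty męskie sportowe adidas galaxy f36171']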

I convert the cleaned sentences into a word frequency table. Each column of the table represents a word appearing in the sentences, and each row represents a sentence. The cells contain the number of occurrences of the word in the sentence.

def prepare_freq_table(data):
    # tokenize each sentence and count word occurrences
    freq_table = [word_tokenize(x) for x in data]
    freq_table = [dict(FreqDist(x)) for x in freq_table]
    # one row per sentence, one column per word
    freq_table = pd.DataFrame.from_dict(freq_table)
    freq_table = freq_table.fillna(0)

    return freq_table

The resulting table will be used to search for words that do not occur together. The algorithm looks for pairs of vectors (columns) whose dot product is zero (you can think of it as checking the perpendicularity of vectors).

Below is an example of words that do not co-occur with each other (the element-wise product of the puma and adidas vectors is a zero vector, i.e. their dot product is zero).

##    puma adidas
## 1     0      0
## 2     0      1
## 3     0      0
## 4     1      0
## 5     0      0
## 6     0      0
## 7     0      0
## 8     0      1
## 9     0      1
## 10    0      1

Another example concerns words that do co-occur with each other (the element-wise product of the adidas and racer vectors is non-zero).

##    adidas racer
## 1       0     0
## 2       0     0
## 3       0     0
## 4       1     0
## 5       1     1
## 6       0     0
## 7       1     0
## 8       1     1
## 9       0     0
## 10      1     1
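
The perpendicularity check itself is just a dot product of 0/1 columns. Below is a minimal sketch with the values copied from the two example tables above (note that the adidas column differs between the examples, so it is defined twice):

# toy columns copied from the example tables above
puma = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
adidas1 = pd.Series([0, 1, 0, 0, 0, 0, 0, 1, 1, 1])
adidas2 = pd.Series([0, 0, 0, 1, 1, 0, 1, 1, 0, 1])
racer = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])

sum(puma * adidas1)   # 0 -> the pair never co-occurs ("perpendicular" columns)
sum(adidas2 * racer)  # 3 -> the pair co-occurs in three sentences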

The following function compares two vectors with each other. I assume the input data is not perfect: there may be situations where the words I am looking for occasionally occur together, for example an auction where someone sells several pairs of shoes from various collections.

def prepare_perp_summary(v1, v2, term1, term2, perp_threshold=5):
    # dot product of the 0/1 vectors = number of sentences where both words occur
    iloc = sum(v1*v2)
    # total number of sentences covered by either word
    size = sum(v1) + sum(v2)

    return {'term1': term1,
            'term2': term2,
            'distance': iloc,
            'is_perp': iloc < perp_threshold,
            'size': size,
            'size_term1': sum(v1),
            'size_term2': sum(v2),
            'size_prop12': sum(v1)/sum(v2),
            'size_prop21': sum(v2)/sum(v1)}
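
For illustration, calling it on the toy puma and adidas1 columns from the sketch above (again, not real data) gives a summary for a pair that never co-occurs:

summary = prepare_perp_summary(puma, adidas1, 'puma', 'adidas')
print(summary['is_perp'], summary['distance'], summary['size'])
## True 0 5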

The prepare_perp_table function generates comparisons for all word pairs. It is currently the bottleneck of the algorithm. Before generating the comparisons, I remove short words (one and two letters), stopwords, and words that occur either almost always or very rarely.

def prepare_perp_table(freq_table, sw='polish', rm_low_freq=5, rm_high_freq=5, perp_threshold=5):
    # remove short words
    data = freq_table.iloc[:, freq_table.columns.map(lambda x: len(x) > 2)]
    # remove polish stopwords
    data = data.iloc[:, ~data.columns.isin(stopwords.words(sw))]
    # remove words with low and high frequency
    data = data.iloc[:, list(data.sum() > rm_low_freq)]
    data = data.iloc[:, list(data.sum() < data.shape[0] - rm_high_freq)]
    # convert positive numbers to 1
    for c in data.columns:
        data[c] = data[c].map(lambda x: 1 if x > 0 else 0)

    perp_table = []
    for i in range(data.shape[1]-1):
        for j in range(i+1, data.shape[1]):
            perp_table += [prepare_perp_summary(data[data.columns[i]],
                                                data[data.columns[j]],
                                                data.columns[i],
                                                data.columns[j],
                                                perp_threshold=perp_threshold)]
    perp_table = pd.DataFrame.from_dict(perp_table)
    perp_table = perp_table[perp_table.is_perp]
    
    return perp_table
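
Since the nested loop over column pairs is the main cost, one possible optimization (a sketch only, not part of the original script) is to compute all pairwise co-occurrence counts at once as a matrix product of the binarized frequency table with its transpose; the per-pair summaries would still have to be built on top of it:

# sketch of a vectorized alternative to the pairwise loop: for a 0/1 table,
# binary.T.dot(binary) holds the co-occurrence count for every word pair
# (the diagonal holds the word frequencies)
def cooccurrence_matrix(freq_table):
    binary = (freq_table > 0).astype(int)
    return binary.T.dot(binary)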

All that remains is the algorithm that will generate a set of words that meet the assumptions.

The find_perp_set function works as follows:
1. Select the two words that do not appear together and occur in the largest number of sentences. The ratio of their frequencies must not exceed the max_proportion parameter.
2. Find all words that do not appear together with any of the previously selected words.
3. From among these candidates, choose the one that appears in the largest number of sentences (still respecting max_proportion).
4. Repeat steps 2 and 3 until the selected words cover 90% of the sentences (the default value of the bp parameter).

def find_perp_set(perp_table, input_data_len, max_proportion=3, bp=0.90):
    # find two perp words with the highest size
    curr_set = perp_table[(perp_table.size_prop12 < max_proportion) & \
                          (perp_table.size_prop21 < max_proportion)].sort_values(by=['size']).iloc[-1, :]
    curr_set = [curr_set.term1, curr_set.term2]

    p = perp_table[perp_table.term1.isin(curr_set) & perp_table.term2.isin(curr_set)]
    print('+{}: {}% size of set'.format(curr_set[0], round(p['size_term1'].sum()/(input_data_len) * 100, 2)))
    p = p['size'].sum() / (input_data_len)
    print('+{}: {}% size of set'.format(curr_set[1], round(p * 100, 2)))

    while p < bp:
        tmp_perp_table = perp_table[~(perp_table.term1.isin(curr_set) & perp_table.term2.isin(curr_set))]

        # find perp elements to curr_set
        tt = []
        for c in curr_set:
            t = tmp_perp_table[(tmp_perp_table.term1 == c) | (tmp_perp_table.term2 == c)]
            t = list(set(list(t.term1) + list(t.term2)))
            t.remove(c)
            tt += [t]

        # find perp elements to all words in curr_set
        perp_terms = tt[0]
        for i in range(1, len(tt)):
            perp_terms = list(set(perp_terms) & set(tt[i]))

        tmp_set = tmp_perp_table[((tmp_perp_table.term1.isin(perp_terms) & tmp_perp_table.term2.isin(curr_set)) |
                                  (tmp_perp_table.term1.isin(curr_set) & tmp_perp_table.term2.isin(perp_terms))) &
                                 ((tmp_perp_table.size_prop12 < max_proportion) &
                                  (tmp_perp_table.size_prop21 < max_proportion))].sort_values(by=['size'])
        if tmp_set.shape[0] > 0:
            tmp_set = tmp_set.iloc[-1, :]
            # intersect the selected pair with the perpendicular candidates
            tmp_set = list(set(perp_terms) & set([tmp_set.term1, tmp_set.term2]))
            new_term = [x for x in tmp_set if x not in curr_set]

            # add this element to curr_set
            curr_set = list(set(curr_set + new_term))

            p = perp_table[perp_table.term1.isin(curr_set) & perp_table.term2.isin(curr_set)]['size'].sum()/((len(curr_set)-1)*input_data_len)
            print('+{}: {}% size of set'.format(new_term[0], round(p*100, 2)))
        else:
            print('early stopping - no more words')
            break
            
    return curr_set

Time for tests.

df = pd.read_csv("allegro-sports-shoes.csv", sep=';')
data = list(df.iloc[:, 0])
data = clean_data(data)
freq_table = prepare_freq_table(data)
perp_table = prepare_perp_table(freq_table)
perp_set_1 = find_perp_set(perp_table, input_data_len=len(data))
## +adidas: 29.15% size of set
## +nike: 49.17% size of set
## +puma: 58.37% size of set
## +balance: 67.17% size of set
## +reebok: 73.6% size of set
## +adidasy: 77.78% size of set
## +lacoste: 81.05% size of set
## +asics: 83.62% size of set
## +kappa: 85.2% size of set
## +under: 86.15% size of set
## +salomon: 86.9% size of set
## +timberland: 87.6% size of set
## +caterpillar: 88.25% size of set
## +timberland: 88.85% size of set
## +trapery: 89.37% size of set
## +vans: 89.83% size of set
## +big: 90.28% size of set

The first iteration was quite good. There are several things that could be improved. The first is extending the approach to expressions consisting of several words (here New Balance and Big Star). Another problem is inflected word forms (here adidasy).
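
One possible direction for multi-word expressions (a sketch under the assumption that collection names such as 'new balance' or 'big star' show up as frequent bigrams) is to add joined bigrams as extra columns of the frequency table:

from nltk import bigrams

# sketch: treat joined bigrams (e.g. 'new balance') as additional "words"
def prepare_freq_table_with_bigrams(data):
    tokens = [word_tokenize(x) for x in data]
    tokens = [t + [' '.join(b) for b in bigrams(t)] for t in tokens]
    freq_table = [dict(FreqDist(t)) for t in tokens]
    freq_table = pd.DataFrame.from_dict(freq_table)

    return freq_table.fillna(0)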

I am also considering a recursive application of the algorithm. Let's check what happens with the data filtered to titles containing the word puma.

df = pd.read_csv("allegro-sports-shoes.csv", sep=';')
data = list(df.iloc[:, 0])
data = clean_data(data, filter='puma')
freq_table = prepare_freq_table(data)
perp_table = prepare_perp_table(freq_table)
perp_set_2 = find_perp_set(perp_table, input_data_len=len(data))
## +classic: 8.7% size of set
## +drift: 34.24% size of set
## +ignite: 41.85% size of set
## +smash: 48.37% size of set
## +runner: 52.36% size of set
## +rebound: 55.98% size of set
## +nrgy: 59.42% size of set
## +mid: 62.14% size of set
## +evospeed: 64.67% size of set
## +enzo: 66.49% size of set
## +turin: 68.3% size of set
## +court: 69.93% size of set
## +mesh: 71.56% size of set
## +knit: 73.19% size of set
## +jago: 74.64% size of set
## +miejskie: 76.09% size of set
## +whirlwind: 77.54% size of set
## +trainer: 78.99% size of set
## +tekstylne: 80.43% size of set
## +tsugi: 81.7% size of set
## +lite: 82.79% size of set
## +halowe: 83.88% size of set
## +desierto: 84.96% size of set
## +wyprzedaż: 86.05% size of set
## +evo: 87.14% size of set
## early stopping - no more words

The results are also very good. The only thing you can find fault with are words such as ‘wyprzedaż’ (Polish for ‘sale’), but that comes down to better cleaning of the input data.
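
One simple option (a sketch; the word list below is just a made-up example) is to drop such domain-specific marketing words from the frequency table before building the perpendicularity table:

# sketch: remove domain-specific "stopwords" (example list) before the comparison step
domain_stopwords = ['wyprzedaż', 'okazja', 'nowość']
freq_table = freq_table.loc[:, ~freq_table.columns.isin(domain_stopwords)]
perp_table = prepare_perp_table(freq_table)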

In summary, it seems to me that the algorithm has real potential, but it also requires a lot of work. Perhaps even in this version it can serve as a neat tool supporting an analyst’s work.