python - How to count phrases in a text and extract the most common ones?

I have a dataset df with a text column:

text
the main goal is to develop a smart calendar
the main goal is to develop a smart calendar
the main goal is to develop a chat bot
it is clear that the main goal is to develop a product
ai products for department A
launching ai products for department B

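As you can see, the texts share many common phrases. How can I detect them and extract the most frequent ones (say, those that occur 2 or more times)? So the desired output is: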

text                                cnt
the main goal is to develop          4
ai products for department           2

'the main goal is to develop' is captured, but 'the main goal is to' and the like are not, because it is the longest of them.

How can I do that?

Answer 1

You can use n-grams (https://en.wikipedia.org/wiki/N-gram) to do this. The main idea is:

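  1. For each sentence, get its n-grams, e.g. the 2-grams (bigrams) of 'the main goal is to develop a smart calendar': ['the main', 'main goal', 'goal is', 'is to', 'to develop', 'develop a', 'a smart', 'smart calendar']
  2. Collect these phrases for every n, with n ranging from 1 to len(sentence), i.e. the number of words in the sentence
  3. Count their occurrences, storing each phrase's count and length (in words) in a dictionary
  4. Sort the results by count and length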

With Python, you can do it like this:

from collections import Counter

text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']


def get_ngram(word_list, n):
    # All contiguous runs of n words, joined back into strings
    return [' '.join(word_list[i:i+n]) for i in range(len(word_list) - n + 1)]


def get_ngram_pieces(text):
    # Collect every n-gram of every sentence, for n from 1 to the sentence length
    text_pieces = []
    for sentence in text:
        word_list = sentence.split()
        for n in range(1, len(word_list) + 1):
            text_pieces.extend(get_ngram(word_list, n))
    return text_pieces


def get_count(text_pieces):
    # Map each phrase to (number of occurrences, number of words);
    # Counter does this in one pass instead of calling list.count per phrase
    counts = Counter(text_pieces)
    return {phrase: (count, len(phrase.split())) for phrase, count in counts.items()}


all_pieces = get_ngram_pieces(text)
phrase_dict = get_count(all_pieces)
# Sort by count first, then by phrase length, both descending
phrase_dict_sorted = dict(sorted(phrase_dict.items(), key=lambda item: item[1], reverse=True))

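The top 10 entries of phrase_dict_sorted are (phrase, count, length in words; phrases with equal count and length may come out in a different order):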

is,5,1
the main goal is to develop a,4,7
the main goal is to develop,4,6
main goal is to develop a,4,6
goal is to develop a,4,5
the main goal is to,4,5
main goal is to develop,4,5
the main goal is,4,4
goal is to develop,4,4
is to develop a,4,4
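
The listing above still mixes each long phrase with all of its sub-phrases, whereas the question asks only for the longest ones. The following is a minimal sketch of one possible filter, not part of the original answer: it drops a phrase whenever a one-word-longer phrase containing it occurs exactly as often. The names count_ngrams, maximal_phrases, min_count and min_words are assumptions introduced here for illustration.

```python
from collections import Counter

text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']


def count_ngrams(sentences):
    """Count every contiguous word n-gram (stored as a tuple of words)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def maximal_phrases(sentences, min_count=2, min_words=2):
    """Keep frequent phrases that no longer phrase repeats equally often.

    min_count and min_words are hypothetical knobs: the minimum number of
    occurrences and the minimum phrase length in words.
    """
    counts = count_ngrams(sentences)
    covered = set()
    for gram, count in counts.items():
        if len(gram) < 2:
            continue
        # A phrase is redundant if extending it by one word on either side
        # does not lose any occurrence.
        for shorter in (gram[:-1], gram[1:]):
            if counts[shorter] == count:
                covered.add(shorter)
    result = [(' '.join(gram), count)
              for gram, count in counts.items()
              if count >= min_count and len(gram) >= min_words
              and gram not in covered]
    # Most frequent first, longer phrases first among equal counts
    return sorted(result, key=lambda item: (-item[1], -len(item[0].split())))


for phrase, cnt in maximal_phrases(text):
    print(phrase, cnt)
```

On the sample data this keeps 'the main goal is to develop a' (4), the duplicated full sentence 'the main goal is to develop a smart calendar' (2) and 'ai products for department' (2). That differs slightly from the desired output in the question: the trailing 'a' is shared by all four sentences, so the 7-word phrase is the longest one with count 4. Setting min_words=1 also lets the single word 'is' (5) through.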
