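This post walks through a simplified example of building a basic unsupervised topic model. We will use Latent Dirichlet Allocation (LDA). In short, LDA is a probabilistic model in which each topic is a mixture of words and each document is a mixture of topics. Using LDA, we will try to identify latent topics in a corpus.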
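To make the "mixture" idea concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The two topics, their word probabilities and the alpha values are hypothetical numbers chosen for illustration; a real model learns them from data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate a toy document: pick a topic per word, then a word from it."""
    # Each document gets its own mixture over topics
    topic_mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(topics)), weights=topic_mixture)[0]
        topic = topics[k]
        word = rng.choices(list(topic), weights=list(topic.values()))[0]
        words.append(word)
    return topic_mixture, words

# Hypothetical topics: each maps words to probabilities that sum to 1
topics = [
    {"mango": 0.6, "season": 0.2, "delicious": 0.2},
    {"board": 0.4, "game": 0.4, "play": 0.2},
]
rng = random.Random(0)
mixture, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 3) for p in mixture])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that could plausibly have produced them.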

Python setup

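This post assumes that you have access to Python and are familiar with it, including installing packages, defining functions and other basic tasks. If you are new to Python, this is a good place to get started.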

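Make sure numpy, pandas, nltk, sklearn (installed as scikit-learn), matplotlib, seaborn, wordcloud and pyLDAvis are installed;

Make sure you have downloaded the 'stopwords' and 'wordnet' corpora from nltk, plus the part-of-speech tagger model that pos_tag relies on later.

The script below downloads them.

import nltk
nltk.download('stopwords') 
nltk.download('wordnet')
# pos_tag needs a tagger model; recent NLTK releases may name this resource
# 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger')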

Let's prepare the environment by importing the required packages. We will also define a set of stop words to ignore:

# Dataset
from sklearn.datasets import fetch_20newsgroups

# Data manipulation
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 100

# Text preprocessing and modelling
from nltk.tokenize import RegexpTokenizer
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import Pipeline

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='talk')
from wordcloud import WordCloud
import pyLDAvis
# In pyLDAvis 3.4 and later this module may be named pyLDAvis.lda_model
import pyLDAvis.sklearn

# Warnings
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Stop words
stop_words = set(ENGLISH_STOP_WORDS).union(stopwords.words('english'))
stop_words = stop_words.union(['let', 'mayn', 'ought', 'oughtn', 
                               'shall'])
print(f"Number of stop words: {len(stop_words)}")

Warm-up example

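To build intuition, let's create a small example with 4 documents. Reading them, we can see two topics: mangoes and board games. Let's see whether LDA can recover these topic groups: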

data = ["We played board games yesterday.",
        "Delicious mango!",
        "She plays board games every weekend.",
        "Mangoes are now in season. Buy them while they are cheap."]
        
example = pd.DataFrame({'document': data})
example

As a first step, we will convert the text into numeric data using a bag-of-words approach. The simplest way to do this is with CountVectorizer() from sklearn:

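# Convert to a document-term matrix
vectoriser = CountVectorizer(stop_words='english')
example_matrix = vectoriser.fit_transform(example['document'])

# Extract feature/term names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectoriser.get_feature_names_out()

# Inspect the document-term matrix
pd.DataFrame.sparse.from_spmatrix(example_matrix, 
                                  columns=feature_names)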
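For intuition about what the vectoriser produces, the same kind of matrix can be built by hand. This is a rough sketch only: the regular expression mimics CountVectorizer's default token pattern (words of two or more characters), and toy_stop_words is a small hypothetical list standing in for the built-in English stop-word list.

```python
import re
from collections import Counter

docs = ["We played board games yesterday.",
        "Delicious mango!",
        "She plays board games every weekend.",
        "Mangoes are now in season. Buy them while they are cheap."]

# Hypothetical mini stop-word list, just enough for these four documents
toy_stop_words = {'we', 'she', 'every', 'are', 'now', 'in',
                  'them', 'while', 'they'}

def tokenise(doc):
    """Lowercase, keep words of 2+ characters, drop toy stop words."""
    tokens = re.findall(r'\b\w\w+\b', doc.lower())
    return [t for t in tokens if t not in toy_stop_words]

# One Counter per document, then a shared, sorted vocabulary
counts = [Counter(tokenise(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are terms, cells are term counts
matrix = [[c[term] for term in vocabulary] for c in counts]
print(vocabulary)
for row in matrix:
    print(row)
```

Each inflected form gets its own column here, so 'mango' and 'mangoes' are counted separately, as are 'played' and 'plays'.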

It would be nicer if 'mango' and 'mangoes' were treated as the same word, and likewise 'played' and 'plays'. To handle this, we will write a custom function to preprocess the text:

def preprocess_text(document):
    """Preprocess a document into normalised tokens."""
    # Tokenise into alphabetic tokens of at least 3 letters
    tokeniser = RegexpTokenizer(r'[A-Za-z]{3,}')
    tokens = tokeniser.tokenize(document)
    
    # Tag each token with its part of speech
    pos_map = {'J': 'a', 'N': 'n', 'R': 'r', 'V': 'v'}
    pos_tags = pos_tag(tokens)
    
    # Lowercase and lemmatise
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(t.lower(), pos=pos_map.get(p[0], 'v')) for t, p in pos_tags]
    
    # Remove stop words
    keywords = [lemma for lemma in lemmas if lemma not in stop_words]
    return keywords
    
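# Convert to a document-term matrix
vectoriser = CountVectorizer(analyzer=preprocess_text)
example_matrix = vectoriser.fit_transform(example['document'])

# Extract feature/term names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectoriser.get_feature_names_out()

# Inspect the document-term matrix
pd.DataFrame.sparse.from_spmatrix(example_matrix, 
                                  columns=feature_names)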

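This looks better! If the preprocessing we just did is unclear, this post explains the basic preprocessing steps and briefly covers part-of-speech tagging and lemmatisation: https://towardsdatascience.com/introduction-to-nlp-part-1-preprocessing-text-in-python-8f007d44ca96 .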
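The tokenising and stop-word steps of preprocess_text can be approximated with the standard library, which makes it easy to see what the regular expression keeps. This sketch leaves out lemmatisation, and toy_stop_words is a hypothetical short list rather than the full stop_words set defined earlier.

```python
import re

# Same pattern as in preprocess_text: runs of at least three ASCII letters
pattern = re.compile(r'[A-Za-z]{3,}')

sentence = "Mangoes are now in season. Buy them while they are cheap."
tokens = pattern.findall(sentence)
print(tokens)
# ['Mangoes', 'are', 'now', 'season', 'Buy', 'them', 'while', 'they', 'are', 'cheap']

# Lowercase, then drop a small hypothetical stop-word list
toy_stop_words = {'are', 'now', 'them', 'while', 'they'}
kept = [t.lower() for t in tokens if t.lower() not in toy_stop_words]
print(kept)
# ['mangoes', 'season', 'buy', 'cheap']
```

Note that 'in' disappears at the tokenising step because it has fewer than three letters, before any stop-word filtering happens. In the full function, lemmatisation would additionally turn 'mangoes' into 'mango'.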

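Okay, our data is now in a form the model can work with, so let's build a simple model.

When building the model, we need to set the number of topics through the n_components parameter. In this small example, we know from reading the four short documents that there are two topics. Usually, though, you won't know the number of topics, and choosing a suitable number comes down to your judgement.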

# Build the LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(example_matrix)

# Inspect the topics
def describe_topics(lda, feature_names, top_n_words=5, show_weight=False):
    """Print the top n words for each topic in the LDA model."""
    normalised_weights = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
    for i, weights in enumerate(normalised_weights):  
        print(f"********** Topic {i+1} **********")
        if show_weight:
            feature_weights = [*zip(np.round(weights, 4), feature_names)]
            feature_weights.sort(reverse=True)
            print(feature_weights[:top_n_words], '\n')
        else:
            top_words = [feature_names[i] for i in weights.argsort()[:-top_n_words-1:-1]]
            print(top_words, '\n')
describe_topics(lda, feature_names, show_weight=True)

The most important word in topic 1 is 'mango' (weight 0.2266), while the most important words in topic 2 are 'play', 'game' and 'board' (weight 0.1919 each).
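These weights are each topic's row of lda.components_ divided by the row total, so they sum to 1 within a topic. The sketch below repeats that arithmetic in plain Python; the components values and the four terms are hypothetical, not the fitted ones.

```python
# Hypothetical pseudo-counts standing in for lda.components_ (2 topics x 4 terms)
components = [
    [0.5, 0.5, 2.5, 0.5],
    [2.5, 2.5, 0.5, 0.5],
]
terms = ['board', 'game', 'mango', 'season']

def normalise_rows(rows):
    """Divide every value by its row total so each row sums to 1."""
    return [[value / sum(row) for value in row] for row in rows]

weights = normalise_rows(components)
for i, row in enumerate(weights, start=1):
    # Sort (weight, term) pairs from the largest weight down
    ranked = sorted(zip(row, terms), reverse=True)
    print(f"Topic {i}:", [(round(w, 4), t) for w, t in ranked[:2]])
```

Because the pairs are sorted as tuples, terms with identical weights are ordered by name, which is also how describe_topics behaves when show_weight=True.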

Now let's assign the probability of each topic to each document:

example[['topic1', 'topic2']] = lda.transform(example_matrix)
example

Great, now we will identify the dominant topic by finding the topic with the highest probability for each record:

example['top'] = example.iloc[:, 1:3].idxmax(axis=1)
example['prop'] = example.iloc[:, 1:3].max(axis=1)
example

Let's make the topic names more descriptive:

topic_mapping = {'topic1': 'mango', 'topic2': 'game'}
example['topic'] = example['top'].map(topic_mapping)
example

Awesome, now let's see how it assigns a topic to a new document:

def assign_topic(document):
    """Assign a topic to a document using the LDA model's prediction."""
    tokens = vectoriser.transform(pd.Series(document))
    probabilities = lda.transform(tokens)
    topic = probabilities.argmax()
    topic_name = topic_mapping['topic'+str(topic+1)]
    return topic_name
    
assign_topic("Board games are so fun!")

Yay! It predicted the topic correctly. Now that we have warmed up, let's look at a more realistic example.

A more realistic example

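In this section, we will use three topics from the 20 newsgroups dataset. This keeps the content of this introductory post manageable and easier to follow.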

# Load data
categories = ['comp.sys.mac.hardware', 'rec.sport.baseball', 
              'alt.atheism']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 
                                        'quotes'), 
                                categories=categories)
df = pd.DataFrame(newsgroups['data'], columns=['document'])
print("Shape:", df.shape)
df.head()

Let's quickly check whether there are any uninformative examples by looking at the character length of the documents.

df['n_characters'] = df['document'].str.len()
df.describe().T

The minimum is zero, which means there are empty documents. Let's see what documents under 20 characters look like:

df.query("n_characters<20").sort_values('n_characters', ascending=False)

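These documents don't tell us much, so we will exclude them. If you think they should be included or the threshold should be different, you can adjust it in the following script: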

df.query("n_characters>=20", inplace=True)
df.nsmallest(5, 'n_characters')

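Now the shortest examples in the dataset look more reasonable.

In unsupervised learning, the dataset is often not partitioned because there is no target to check predictions against. However, I think it is good to set aside a few unseen test documents so that we can try the model on them later. With the script below, we will hold out 5 records for this.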

X_train, X_test = train_test_split(df['document'], test_size=5, 
                                   random_state=1)
print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

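Now we will vectorise the training documents (i.e. convert them to a numeric representation). To be more practical, let's pretend we don't know the number of topics and use a grid search to find a suitable number.

If you have an intuitive guess at the rough range of topics, running a grid search is one way to narrow down the options. However, treat the grid search output only as guidance and check that the resulting topics make sense.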

# Preprocess the text
vectoriser = CountVectorizer(analyzer=preprocess_text, min_df=5)
document_term_matrix = vectoriser.fit_transform(X_train)

# Run a grid search over the chosen parameters
lda = LatentDirichletAllocation(learning_method='online', random_state=0)
param_grid = {'n_components': [2, 3, 4, 5]}
lda_search = GridSearchCV(lda, param_grid=param_grid, cv=3)
lda_search.fit(document_term_matrix)

# Inspect the grid search output
results = pd.DataFrame(lda_search.cv_results_)\
            .sort_values("rank_test_score")
results[['param_n_components', "rank_test_score", 'mean_test_score', 
         'std_test_score']]

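Here, the score is the approximate log-likelihood, so higher is better. The mean test score is highest for 3 components. That is reassuring, since we know the data was drawn from 3 newsgroups! Let's run an LDA model with 3 topics and put it together with the preprocessing step in a pipeline.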
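As a side note on the score: a log-likelihood grows in magnitude with the number of tokens being scored, so it is often reported per word as perplexity, where lower is better. scikit-learn's LatentDirichletAllocation has a perplexity() method for this; the sketch below only illustrates the arithmetic, and the scores and token count in it are made-up numbers.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Per-word perplexity: exp(-log-likelihood / token count). Lower is better."""
    return math.exp(-log_likelihood / n_tokens)

# Hypothetical held-out log-likelihoods for two models on the same 10,000 tokens
scores = {2: -75000.0, 3: -72000.0}
for n_topics, log_likelihood in scores.items():
    print(n_topics, round(perplexity(log_likelihood, 10_000), 1))
```

The model with the higher log-likelihood always has the lower perplexity on the same data, so both measures rank candidates identically within a fold.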

If you are not familiar with pipelines, this post (scroll to 1. Pipeline) explains what they do with a simple example: https://towardsdatascience.com/pipeline-columntransformer-and-featureunion-explained-f5491f815f .

n_components = 3
pipe = Pipeline([('vectoriser', CountVectorizer(analyzer=preprocess_text, min_df=5)),
                 ('lda', LatentDirichletAllocation(n_components=n_components, learning_method='online', random_state=0))])
pipe.fit(X_train)

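# Inspect the topics
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = pipe['vectoriser'].get_feature_names_out()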
describe_topics(pipe['lda'], feature_names, top_n_words=10)

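It looks like the first topic is about religion, the second about sports and the third about Mac/Apple, but we need more information. We will try to understand the topics better in the 'Contextualising topics' section.

Let's add the topic probabilities to each training document.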

pd.options.display.max_colwidth = 50
train = pd.DataFrame(X_train)
columns = ['topic'+str(i+1) for i in range(n_components)]
train[columns] = pipe.transform(X_train)
train.head()

Now let's organise the probabilities. We will create several new columns:

train = train.assign(top1=np.nan, prob1=np.nan, top2=np.nan, 
                     prob2=np.nan, top3=np.nan, prob3=np.nan)
for record in train.index:
    top = train.loc[record, 'topic1':'topic3'].astype(float).nlargest(3)
    train.loc[record, ['top1', 'top2', 'top3']] = top.index
    train.loc[record, ['prob1', 'prob2', 'prob3']] = top.values
train.drop(columns="document").head()

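The top1 column stores the most dominant topic for each record and prob1 stores its probability. top2 and prob2 do the same for the second most dominant topic, and so on. I find that organising the output this way makes further analysis of the topics easier. Let's check the cases where two topics have the same probability.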

train.loc[train['prob1']==train['prob2'], 'document':'topic3']

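It is hard to say these documents have a dominant topic, because all three topics have equal probability. Since these documents are quite short, it seems reasonable for the model to give equal probabilities.

Luckily, there are only 3 such cases. If we repeat the same check for train['prob2']==train['prob3'], we find the same 3 records. If you are curious, you can see this by adjusting the code above.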
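An exact equality check only catches perfect ties. If you also want to flag documents whose top two topics are merely close, one option is to compare them with a margin. This is a small sketch with hypothetical probabilities; the 0.1 margin is an arbitrary choice, not a recommended value.

```python
def is_ambiguous(probabilities, margin=0.1):
    """Return True when the top two topic probabilities are within `margin`."""
    top, runner_up = sorted(probabilities, reverse=True)[:2]
    return (top - runner_up) < margin

# Hypothetical topic probabilities for three documents
rows = {
    'confident': [0.94, 0.03, 0.03],
    'tied':      [1/3, 1/3, 1/3],
    'close':     [0.48, 0.45, 0.07],
}
for name, probs in rows.items():
    print(name, is_ambiguous(probs))
```

On the train frame, the same idea could be written as (train['prob1'] - train['prob2']) < 0.1.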

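Now let's visualise the distribution of the dominant topic's probability. Ideally, we would like most of the probabilities to be concentrated at higher values.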

plt.figure(figsize=(12,5))
sns.kdeplot(data=train, x='prob1', hue='top1', fill=True,  # fill replaces the deprecated shade
            common_norm=False)
plt.title("Probability of dominant topic colour coded by topics");

It's nice to see most values are close to 1. Let's check the summary statistics of the probabilities:

train[["prob1", "prob2", "prob3"]].describe()

The mean probability of the dominant topic is 0.9374. Awesome!

Contextualising topics

Now let's find the most frequent words for each topic to understand the topics we found:

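def inspect_term_frequency(df, vectoriser, n=30):
    """Show the n most frequent terms in the corpus."""
    document_term_matrix = vectoriser.transform(df)
    # Take the term names from the vectoriser passed in rather than a global variable
    document_term_matrix_df = pd.DataFrame(document_term_matrix.toarray(), 
                                           columns=vectoriser.get_feature_names_out())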
    term_frequency = pd.DataFrame(document_term_matrix_df.sum(axis=0), 
                                  columns=['frequency'])
    return term_frequency.nlargest(n, 'frequency')
    
fig, ax = plt.subplots(1, 3, figsize=(16,12))
for i in range(n_components):
    topic = 'topic' + str(i+1)
    topic_df = train.loc[train['top1']==topic, 'document']
    freqs = inspect_term_frequency(topic_df, pipe['vectoriser'])
    sns.barplot(data=freqs, x='frequency', y=freqs.index, ax=ax[i])
    ax[i].set_title(f"Top words for {topic}")
plt.tight_layout()

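From what we have seen so far, we can summarise the topics as follows:

Topic 1: related to religion, especially atheism;

Topic 2: related to sports, especially baseball;

Topic 3: related to computers, especially Mac/Apple.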

Let's look at a word cloud for each topic:

fig, ax = plt.subplots(1, 3, figsize=(20, 8))
for i in range(n_components):
    topic = 'topic' + str(i+1)
    text = ' '.join(train.loc[train['top1']==topic, 'document'].values)    
    wordcloud = WordCloud(width=1000, height=1000, random_state=1, background_color='Black', 
                          colormap='Set2', collocations=False, stopwords=stop_words).generate(text)
    ax[i].imshow(wordcloud) 
    ax[i].set_title(topic)
    ax[i].axis("off");

We can also visualise the topics interactively with the pyLDAvis package:

# In pyLDAvis 3.4 and later, pyLDAvis.sklearn may be named pyLDAvis.lda_model
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(pipe['lda'], 
                         pipe['vectoriser'].transform(X_train), 
                         pipe['vectoriser'])

Topic assignment

Based on the exploration so far, we can name the topics as follows:

topic_mapping = {'topic1': 'atheism', 
                 'topic2': 'baseball', 
                 'topic3': 'mac'}
train['topic'] = train['top1'].map(topic_mapping)
train[['document', 'topic']].head()

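For this particular dataset, we can also compare the model's assignments with the labels that come with the dataset. Usually, though, target labels won't be readily available like this.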

# Add target labels to the dataframe
target_mapping = dict(enumerate(newsgroups['target_names']))
df = pd.merge(df, pd.Series(newsgroups['target'], name='target'), 
              how='left', left_index=True, right_index=True)
df['target'] = df['target'].map(target_mapping)

# Add target labels to the training data
train = pd.merge(train, df['target'], how='left', left_index=True, 
                 right_index=True)
train[['document', 'topic', 'target']].head()

It's nice to see the topics match the labels for these examples. You can extend this check further (e.g. by looking at a confusion matrix).
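A confusion matrix is just a count of each (target, assigned topic) pair. Here is a minimal standard-library sketch of that counting; the pairs are hypothetical and only stand in for the values of train['target'] and train['topic'].

```python
from collections import Counter

# Hypothetical (target, assigned topic) pairs, not real model output
pairs = [
    ('alt.atheism', 'atheism'),
    ('alt.atheism', 'atheism'),
    ('alt.atheism', 'mac'),
    ('rec.sport.baseball', 'baseball'),
    ('rec.sport.baseball', 'baseball'),
    ('comp.sys.mac.hardware', 'mac'),
    ('comp.sys.mac.hardware', 'baseball'),
]
counts = Counter(pairs)
targets = sorted({t for t, _ in pairs})
topics = sorted({p for _, p in pairs})

# One row per target label, one column per assigned topic
print(f"{'':24}" + ''.join(f"{p:>10}" for p in topics))
for t in targets:
    print(f"{t:24}" + ''.join(f"{counts[(t, p)]:>10}" for p in topics))
```

With the dataframe from this post, pd.crosstab(train['target'], train['topic']) should give the same kind of table in one line.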

Let's look closely at a few training examples to see how sensible the assigned topics are:

def assign_topic(document):
    """Assign a topic to a document using the LDA model's prediction."""
    probabilities = pipe.transform(document)
    topic = probabilities.argmax()
    topic_name = topic_mapping['topic'+str(topic+1)]
    return topic_name
    
for i, document in enumerate(X_train.sample(3, random_state=2).values):
    print(f"********** Training example {i+1} **********")
    print(document, '\n')
    print(f"Assigned topic: {assign_topic(np.atleast_1d(document))}", '\n')

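These three examples look fine so far, but if you look at more examples, some assignments may not make sense. Now let's assign topics to the unseen documents and inspect them: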

for i, document in enumerate(X_test.values):
    print(f"********** Test example {i+1} **********")
    print(document, '\n')
    print(f"Assigned topic: {assign_topic(np.atleast_1d(document))}", '\n')

Do you agree with these assignments? Let's compare them with the target labels:

test = pd.DataFrame(X_test)
test[columns] = pipe.transform(X_test)
test['topic'] = test.loc[:, 'topic1':'topic3'].idxmax(axis=1).map(topic_mapping)
test = pd.merge(test, df['target'], how='left', left_index=True, 
                 right_index=True)
test

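Voila, now you know how to run a basic unsupervised topic model!

Although we ran only one iteration with 3 topics in this post, in practice you will likely run several iterations with different numbers of topics and other parameters to find a suitable model. So keep experimenting!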
