Authors: Lianne & Justin
Compiled by: ronghuaiyang

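Introduction

A step-by-step machine learning analysis in Python of a popular YouTube weight-loss and fitness channel, looking for ways to raise view counts. The same approach probably carries over to WeChat official accounts.

In this article we apply machine learning algorithms to YouTube data and offer some suggestions for getting more views.

We cover the whole end-to-end process:

Scraping the YouTube data
Applying NLP to the video titles
Feature engineering and building a predictive decision tree
And more

Everything is implemented in Python.

If you want to see how data science can help a YouTube channel earn more views and revenue, read on.

Let's get started.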

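The YouTube channel we analyze is Sydney Cummings, our favorite trainer. Her channel is growing fast and recently passed 200,000 subscribers. Sydney also posts a workout video every day, which gives us a reasonable amount of data to analyze.

Haven't worked out yet today? Take a look at one of her recent videos.

30-minute strong arms and glutes workout

You may have noticed that her video titles follow a standard format. They usually include the length, the body parts, the calories burned, and other descriptive words about the workout. Before I click on this video, I already know:

30 minutes: I will finish the whole workout in 30 minutes.
Strong arms and glutes: I will work on my arms and glutes, with a focus on strength.
Burn 310 calories: I will burn a decent number of calories.

These are the key pieces of information we will analyze.

Let's see whether we can recommend some new content strategies to help Sydney improve even further!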
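As a small illustration of how a title in this format can be turned into structured fields, here is a minimal sketch. The example title string, the regular expressions, and the helper name `parse_title` are assumptions made for illustration; they are not taken from the authors' cleaning script.

```python
import re

# Hypothetical title in the format described above (not copied from the dataset).
EXAMPLE_TITLE = "30 Minute Strong Arms and Glutes Workout - Burn 310 Calories"

def parse_title(title):
    """Return (length_in_minutes, calories); each is None when not found."""
    length_match = re.search(r"(\d+)\s*min", title, flags=re.IGNORECASE)
    calories_match = re.search(r"(\d+)\s*calories", title, flags=re.IGNORECASE)
    length = int(length_match.group(1)) if length_match else None
    calories = int(calories_match.group(1)) if calories_match else None
    return length, calories

length, calories = parse_title(EXAMPLE_TITLE)
print(length, calories)
```

Real titles are likely to vary more than this pattern allows, so a production version would need extra cases and some manual checking.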

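For easier reading, the article is organized as follows:

Preparation: scraping the data
Step 1: Look at the data
Step 2: Classify the videos with NLP
Step 3: Feature engineering
Step 4: Build the target variable
Step 5: Build the decision tree
Step 6: Read the decision tree
Actionable suggestions for getting more views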

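Preparation: scraping the data

There are different ways to scrape YouTube data. Because this is a one-off project, we do it the "simplest" way, which takes some manual work but needs no extra tools.

Here is the step-by-step procedure:

1. Scroll down the channel's Videos page until all of the videos have loaded.

2. Right-click the most recent video and choose "Inspect".

3. Hover over each line of the HTML and find the lowest-level element that highlights all of the videos.

For example, in Chrome it looks like this:

4. Right-click that element, choose "Copy", then "Copy element".

5. Paste the copied element into a text file and save it. We used a JupyterLab text file and saved it as sydney.txt.

6. Use Python to extract the information and clean the data.

We won't explain the details because every case is different. For your convenience, the code is here: https://gist.github.com/liannewriting/cd35e68deee092eeab9c288f414d6bc1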
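For readers who want a feel for step 6, the sketch below pulls titles and links out of a saved HTML fragment using only the standard library. The markup in `SAMPLE_HTML` (an `a` tag with `id="video-title"`, `title`, and `href` attributes), the video IDs, and the class name `VideoTitleParser` are assumptions for illustration; the real page structure may differ, and the authors' own script is the gist linked above.

```python
from html.parser import HTMLParser

# Hypothetical fragment standing in for the contents of sydney.txt.
SAMPLE_HTML = """
<div id="items">
  <a id="video-title" title="30 Minute Strong Arms and Glutes Workout" href="/watch?v=abc123">x</a>
  <a id="video-title" title="40 Minute Full Body Cardio" href="/watch?v=def456">y</a>
  <a id="thumbnail" href="/watch?v=abc123"></a>
</div>
"""

class VideoTitleParser(HTMLParser):
    """Collect (title, link) pairs from anchors whose id is 'video-title'."""

    def __init__(self):
        super().__init__()
        self.videos = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("id") == "video-title":
            self.videos.append((attr_map.get("title"), attr_map.get("href")))

parser = VideoTitleParser()
parser.feed(SAMPLE_HTML)
for title, link in parser.videos:
    print(title, link)
```

To run it on a real file, the fragment would be read from disk, for example with `open('sydney.txt', encoding='utf-8').read()`.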

Now we can start the fun part!

We will extract features from this dataset and study which ones affect the number of views.

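Step 1: Look at the data

The data was loaded into Python in the previous step, so let's look at our dataset, df_videos.

df_videos has 8 features describing each video:

Title
Time since posting (hours)
Video length
Number of views
Link
Calories
Date posted
Days since posted

There are 837 videos in total.

We also found duplicates in the data, because Sydney uploaded some videos more than once. We ignore them because there are only a few.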

# there are duplicates
df_videos['title'].value_counts(dropna=False).iloc[:30]

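Step 2: Classify the videos with NLP

In this step we classify the videos mainly by the keywords in their titles.

We want to group the videos according to:

Which body parts does the video focus on?
Does the video help us build strength or lose fat?
Any other keywords?

We use NLP techniques and the Natural Language Toolkit (NLTK) package to process the titles. We do not explain the techniques in detail in this article.

Build the keyword lists

First, we tokenize the video titles.

Tokenization splits a title string into separate tokens (words), roughly at whitespace and punctuation. This makes the text easier for a program to process.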

# import all the packages
import pandas as pd
import numpy as np
from datetime import timedelta, datetime

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
plt.style.use('ggplot')
from matplotlib.pyplot import figure

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import plotly

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8)

from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

import seaborn as sns

pd.options.mode.chained_assignment = None

from nltk import word_tokenize
from collections import Counter

words = df_videos['title'].str.lower().str.cat(sep=' ')
word_tokens = word_tokenize(words)

word_counter = Counter(word_tokens)
print('{} different words'.format(len(word_counter)))
word_counter.most_common(538)

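There are 538 distinct words across the titles. The top of the list looks like this.

Many words are used very frequently, which confirms again that Sydney uses a standard format for her titles. Looking through the list, we created 3 keyword lists that are used to classify the videos in later steps.

body_keywords: identifies the body parts a video focuses on, such as "full body", "abs", "legs".
workout_type_keywords: identifies the workout type, such as "cardio", "stretch", "strength".
other_keywords: frequently used keywords that are hard to categorize, such as "bootcamp", "burnout", "toning".

Stem the keyword lists

After forming these keyword lists, we stem them. Stemming ensures that the program can match words with the same meaning.

For example, the words "abs" and "ab" share the stem "ab".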

from nltk.stem import PorterStemmer
ps = PorterStemmer()

body_keywords = ['full', 'total', 'abs', 'legs', 'butt', 'upper', 'arms', 'back', 'shoulders', 'chest', 'leg', 'biceps', 'thighs', 'glutes', 'core', 'triceps', 'hamstrings', 'lower', 'hips', 'booty']

body_keywords_set = set([ps.stem(tok) for tok in body_keywords]) # stem the keywords (since the title is also stemmed.)
body_keywords_dict = {ps.stem(tok):tok for tok in body_keywords} # use this dictionary to revert the stemmed words back to the original.

workout_type_keywords = ['cardio', 'stretch', 'strength', 'hiit', 'tabata', 'pilates', 'yoga']
workout_type_keywords_set = set([ps.stem(tok) for tok in workout_type_keywords]) # stem the keywords (since the title is also stemmed.)
workout_type_keywords_dict = {ps.stem(tok):tok for tok in workout_type_keywords} # use this dictionary to revert the stemmed words back to the original.

other_keywords = ['boot camp', 'bootcamp', 'burnout', 'conditioning', 'bodyweight', 'circuit', 'sculpt', 'agility', 'resistance', 'athlete', 'toned', 'toning', 'tone', 'boxing', 'kickboxing', 'plyo', 'sport', 'superset', 'workout', 'speed']
other_keywords_set = set(other_keywords)

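Tokenize and stem the YouTube titles

Besides the keywords, we also need to tokenize and stem the titles. Together these steps prepare the keyword lists and the titles for matching.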

%%time

from nltk import pos_tag
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# clean the titles.
def prepare_title(desc):
    # tokenize titles.
    tokens = word_tokenize(desc)
    
    # stem words.
    stemmed_tokens = [ps.stem(tok).lower() for tok in tokens]
    return set(stemmed_tokens)

df_videos['title_word_set'] = df_videos['title'].map(prepare_title)

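Now we are ready to build features!

Step 3: Feature engineering

After brainstorming, we came up with two main types of features related to the views on Sydney's channel: keyword-based and time-based. Let's look at them one at a time.

Keyword-based features

Indicator features

Thanks to the earlier work, we now have 3 keyword lists and the stemmed titles, and we can match them to classify the videos.

For body_keywords and workout_type_keywords, a single video can match several keywords. So before matching, we also create two features, area and workout_type, which concatenate all of a video's body parts and workout types into one string.

For example, a workout video can target both "ab" and "leg", or be both "cardio" and "strength". Its area is then "ab+leg" and its workout_type is "cardio+strength".

At the same time, we identify similar keywords such as "total" and "full", or "core" and "ab", and group them together.

Finally, we create three kinds of dummy features:

is_{}_area identifies whether a video covers a specific body part.
is_{}_workout identifies the workout type.
title_contains_{} checks whether the title contains one of the other keywords.

To be clear, a video titled "legs strength burnout workout" should have is_leg_area = True, is_strength_workout = True, title_contains_burnout = True, and False for the other indicators (apart from title_contains_workout, since "workout" is itself in other_keywords).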
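The matching idea can be tried on a single title with the dependency-free sketch below. The short keyword sets, the `toy_stem` helper (a crude stand-in for NLTK's PorterStemmer), and the function name `indicator_features` are assumptions for illustration only; the article's real lists are longer and are stemmed with NLTK.

```python
# Small keyword subsets chosen for illustration; the article's lists are longer.
BODY_KEYWORDS = {"leg", "ab", "arm", "butt"}
WORKOUT_KEYWORDS = {"cardio", "strength", "hiit"}
OTHER_KEYWORDS = {"burnout", "bootcamp", "circuit"}

def toy_stem(word):
    """Crude stand-in for a real stemmer: lowercase and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

def indicator_features(title):
    """Return a dict of boolean indicator features for one title."""
    stems = {toy_stem(tok) for tok in title.split()}
    features = {}
    for area in sorted(BODY_KEYWORDS):
        features["is_{}_area".format(area)] = area in stems
    for workout in sorted(WORKOUT_KEYWORDS):
        features["is_{}_workout".format(workout)] = workout in stems
    for other in sorted(OTHER_KEYWORDS):
        # Other keywords are matched as substrings of the raw lowercase title.
        features["title_contains_{}".format(other)] = other in title.lower()
    return features

features = indicator_features("legs strength burnout workout")
print([name for name, value in features.items() if value])
```

Body parts and workout types are matched on stemmed tokens, while the other keywords are matched as plain substrings, mirroring the split used in the article's code.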

See the Python code below for the details.

# matching the title with the list of single-word keyword using intersection
# return a string of workout type from the list of keyword

def body_type(s):
    body_parts = s.intersection(body_keywords_set)
    area_text = ''
    for body_part in body_parts:
        if body_part == 'total':
            area_text += '+full'
        elif body_part == 'core':
            area_text += '+ab'
        elif body_part in ['glute', 'booti']:
            area_text += '+butt'
        else:
            area_text += '+' + body_part
    if area_text == '':
        return 'full'
    return area_text[1:]

df_videos['area'] = df_videos['title_word_set'].map(body_type)

for area in body_keywords_set:
    df_videos['is_{}_area'.format(area)] = df_videos['area'].str.contains(area)
    
def workout_type(s):
    workout_types = s.intersection(workout_type_keywords_set)
    num_types = len(workout_types)
    workout_text = ''
    for workout_type in workout_types:
        if num_types > 1 and workout_type == 'stretch':
            continue            
        workout_text += '+' + workout_type
    if workout_text == '':
        #
        cardio_match = s.intersection(set(['sweat', 'fat']))
        if len(cardio_match) > 0:
            workout_text = '+cardio'
        else:
            workout_text = '+strength'
    return workout_text[1:]

df_videos['workout_type'] = df_videos['title_word_set'].map(workout_type)

for workout_type in workout_type_keywords_set:
    df_videos['is_{}_workout'.format(workout_type)] = df_videos['workout_type'].str.contains(workout_type)
for other in other_keywords_set:
    df_videos['title_contains_{}'.format(other)] = df_videos['title'].str.lower().str.contains(other)

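Frequency features

Besides these indicators, we create three more features: num_body_areas, num_workout_types, and num_other_keywords. They count how many keywords of each kind a video title mentions.

For example, the title "abs and legs cardio strength workout" has num_body_areas = 2 and num_workout_types = 2.

These features help us determine the best number of body parts or workout types to include in a video.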
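The counting can be checked by hand on the example title with a few lines of plain Python. The stem sets and the helpers `toy_stem` and `keyword_counts` are hypothetical simplifications, not the article's pandas implementation.

```python
# Illustrative stem sets; the article derives these with NLTK's PorterStemmer.
BODY_STEMS = {"ab", "leg", "arm", "butt"}
WORKOUT_STEMS = {"cardio", "strength", "hiit"}

def toy_stem(word):
    """Crude stand-in for a real stemmer: lowercase and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

def keyword_counts(title):
    """Return (num_body_areas, num_workout_types) for one title."""
    stems = {toy_stem(tok) for tok in title.split()}
    return len(stems & BODY_STEMS), len(stems & WORKOUT_STEMS)

num_body_areas, num_workout_types = keyword_counts("abs and legs cardio strength workout")
print(num_body_areas, num_workout_types)
```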

area_cols = [col for col in df_videos.columns if col.find('_area') > 0]
df_videos['num_body_areas'] = df_videos[area_cols].sum(axis=1)

workout_cols = [col for col in df_videos.columns if col.find('_workout') > 0]
df_videos['num_workout_types'] = df_videos[workout_cols].sum(axis=1)

title_contains_cols = [c for c in df_videos.columns if c.startswith('title_contains')]
df_videos['num_other_keywords'] = np.sum(df_videos[title_contains_cols],axis=1)

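Burn-rate feature

Last but not least, we create calories_per_min, a feature that captures the rate at which calories are burned.

After all, we all want clear (quantifiable) goals for a workout.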

df_videos['calories_per_min'] = df_videos['calories']/df_videos['length']

Before moving on to the time-based features, we also fixed a few misclassified videos. That was done by hand, so it is not included here.

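Time-series-based features

With the keyword-based features above, we can already find the specific types of videos that are popular. But does that mean Sydney should always post the same types of videos?

To answer this question, we also create some time-series-based features:

num_same_area: the number of videos posted in the past 30 days that target the same area, including the current one. For example, if the current video targets the upper body and there were 5 other upper-body workouts in the past 30 days, this feature = 6.
num_same_workout: like num_same_area, but counting the workout type. For example, if the current video is a HIIT workout and there were 2 other HIIT workouts in the past 30 days, this feature = 3.
last_same_area: the number of days since the previous video that targeted the same body area. For example, if the current video targets abs and the previous abs video was 10 days ago, this feature = 10.
last_same_workout: like last_same_area, but for the workout type.
num_unique_areas: the number of distinct body areas covered in the past 30 days.
num_unique_workouts: the number of distinct workout types posted in the past 30 days.

These features help us understand whether viewers prefer similar videos or variety. The detailed feature engineering follows; it involves a few transformations to fit pandas functions.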
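Before reading the pandas version, it may help to see the same definitions computed by hand on a toy posting history. The dates, the areas, and the function name `window_features` are invented for illustration; the 9999 value for "no earlier match" follows the convention in the article's code.

```python
from datetime import date, timedelta

# Hypothetical posting history: (date posted, area), oldest first.
HISTORY = [
    (date(2019, 6, 1), "ab"),
    (date(2019, 6, 5), "leg"),
    (date(2019, 6, 11), "ab"),
    (date(2019, 6, 20), "full"),
    (date(2019, 6, 21), "ab"),
]

def window_features(history, index, days=30):
    """Return (num_same_area, last_same_area, num_unique_areas) for one video."""
    current_date, current_area = history[index]
    window_start = current_date - timedelta(days=days)
    # Videos posted in (window_start, current_date], including the current one.
    window = [(d, a) for d, a in history[: index + 1] if d > window_start]
    num_same_area = sum(1 for _, a in window if a == current_area)
    earlier_same = [d for d, a in window[:-1] if a == current_area]
    # 9999 marks "no earlier video of this area inside the window".
    last_same_area = (current_date - max(earlier_same)).days if earlier_same else 9999
    num_unique_areas = len({a for _, a in window})
    return num_same_area, last_same_area, num_unique_areas

print(window_features(HISTORY, 4))
```

The window excludes its left edge and includes the current video, which is how a pandas time-based rolling window behaves by default.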

# prep for time series
df_videos = df_videos.sort_values(by='date').reset_index(drop=True)
# filter out the videos with no dates.
msk = ~df_videos['date'].isnull()
df_videos = df_videos[msk]

# start building time-based features
areas = df_videos['area'].unique()
d_areas = {}
for i in range(len(areas)):
    d_areas[areas[i]] = i

# convert the area, workout_type, workout_int to numeric feature first due to Pandas limitation, must do this for the rolling function.
df_videos['area_int'] = df_videos['area'].replace(d_areas)

workouts = df_videos['workout_type'].unique()
d_workouts = {}
for i in range(len(workouts)):
    d_workouts[workouts[i]] = i

# must do this for the rolling function.
df_videos['workout_int'] = df_videos['workout_type'].replace(d_workouts)
# feature engineering.
# ignore these rand features for now, they are for future use
df_videos['rand0'] = np.random.rand(len(df_videos.index))
df_videos['rand1'] = np.random.rand(len(df_videos.index))
df_videos['rand2'] = np.random.rand(len(df_videos.index))


df_videos['yyyymmdd'] = df_videos['date'].dt.year*10000 + df_videos['date'].dt.month*100 + df_videos['date'].dt.day
df_videos['yyyymm'] = (df_videos['yyyymmdd']/100).astype(int)
df_videos['month'] = df_videos['date'].dt.month_name()
df_videos['day_name'] = df_videos['date'].dt.day_name()
df_videos['day_of_month'] = df_videos['date'].dt.day

# set the index to the date so we can use the rolling function.
df_videos = df_videos.set_index('date')

def num_same_past(areas):
    current_area = areas[-1]
    return np.sum(areas == current_area)

# within the past 30 days, number of workout of the same area/type as the current video
df_videos['num_same_area'] = df_videos['area_int'].rolling(timedelta(days=30)).apply(num_same_past, raw=True)
df_videos['num_same_workout'] = df_videos['workout_int'].rolling(timedelta(days=30)).apply(num_same_past, raw=True)

def last_same_past(t):
    # t is a Series (raw=False) indexed by date, so use positional access.
    current_t = t.iloc[-1]
    current_dte = t.index[-1]
    past = t.iloc[:-1]
    past2 = past[past == current_t]
    # index.max() is NaT when there is no earlier match, which makes days_ago NaN.
    days_ago = (current_dte - past2.index.max())/timedelta(days=1)
    if pd.isnull(days_ago):
        return 9999
    return days_ago

# within the past 30 days, days since the same area/type was last posted (9999 if none)
df_videos['last_same_area'] = df_videos['area_int'].rolling(timedelta(days=30)).apply(last_same_past, raw=False)
df_videos['last_same_workout'] = df_videos['workout_int'].rolling(timedelta(days=30)).apply(last_same_past, raw=False)

def num_unique_past(t):
    return t.nunique(dropna=False)

# within the past 30 days, number of unique areas/workout types
df_videos['num_unique_areas'] = df_videos['area_int'].rolling(timedelta(days=30)).apply(num_unique_past, raw=False)
df_videos['num_unique_workouts'] = df_videos['workout_int'].rolling(timedelta(days=30)).apply(num_unique_past, raw=False)

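We found that Sydney occasionally posts videos unrelated to workouts. They get far fewer views than the workout videos and behave differently, so we remove them from the analysis.

We also filter out the first 30 days of videos, because they lack enough history.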

# filtering out videos don't have effective 30 days windows
# only include the videos after February (the ones before were different).
df_videos = df_videos['2018-03-01':'2019-12-15']

# filter out stretches, yoga and non-workouts.
msk = (~df_videos['workout_type'].isin(['none', 'stretch', 'yoga'])) & (~df_videos['calories'].isnull())
df_videos = df_videos[msk]

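Testing for multicollinearity

Multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors.

As the Wikipedia passage above explains, multicollinearity does affect how much influence we can attribute to an individual feature. Why does this matter? Suppose Sydney only posts strength workouts on Mondays, and her Monday videos always get more views. Do those videos get more views because they were posted on a Monday, or because they are strength workouts?

We want to answer questions like this when making recommendations, so we need to make sure there is no strong collinearity among our features.

Now that the reason for testing multicollinearity is clear, let's look at which method to use.

Pairwise correlation is often used to test for collinearity, but sometimes it is not enough: several features (more than a pair) can be collinear at the same time.

So we use a more sophisticated method. At a high level, we use [K-fold cross-validation](https://machinelearningmastery.com/k-fold-cross-validation/) to do this.

The procedure is as follows:

Select a group of critical features to test for collinearity, based on our judgment.

We chose the features below because they are critical for predicting the views of YouTube videos.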

# group of critical features selected
cols = ['length', 'calories', 'days_since_posted', 'area', 'workout_type', 'num_other_keywords', 'day_of_month', 'day_name', 'month', 'num_workout_types', 'num_body_areas', 'num_same_area', 'num_same_workout', 'num_unique_areas', 'num_unique_workouts', 'last_same_area', 'last_same_workout', 'rand0', 'rand1', 'rand2']

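As you can see, we also added three features made of random numbers: rand0, rand1, rand2. They serve as anchors when comparing relationships between features. If a predictor is no more important than these random features, or looks similar to them, it is not an important predictor of the target feature.

Prepare these features for K-fold cross-validation.

In this process we transform the categorical features area and workout_type. The transformation ensures that each category has at least K values.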

workout_cnts = df_videos['workout_type'].value_counts()
workout_small_cnts = workout_cnts[workout_cnts < 5].index.values
d_workout_replace = {}
for w in workout_small_cnts:
    d_workout_replace[w] = 'other'

area_cnts = df_videos['area'].value_counts()
area_small_cnts = area_cnts[area_cnts < 5].index.values
d_area_replace = {}
for a in area_small_cnts:
    d_area_replace[a] = 'other'

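Train a predictive model using one of the features as the target and the rest as predictors.

Next, we loop over the features, fitting a model that predicts each one from the others. We use a simple model, a Gradient Boosting Model (GBM), with K-fold validation.

Depending on whether the target feature is numeric or categorical, we apply different models and scores (metrics of predictive power).

When the target feature is numeric, we use a Gradient Boosting Regressor and the Root Mean Squared Error (RMSE); when it is categorical, we use a Gradient Boosting Classifier and accuracy.

For each target, we print the K-fold validation score (the mean of the scores) and the 5 most important predictors.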

# select numeric columns
df_numeric = df_videos[cols].select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(numeric_cols)

# select non numeric columns
df_non_numeric = df_videos[cols].select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(non_numeric_cols)

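# testing for multi collinearity.
# numeric: the default loss is least squares, so it is left unset
# (the older name 'ls' is not accepted by recent scikit-learn releases)
reg = GradientBoostingRegressor(n_estimators=100, max_depth=5,
                                learning_rate=0.1,
                                random_state=1)
# categorical: the default loss is the log-loss (formerly named 'deviance')
clf = GradientBoostingClassifier(n_estimators=100, max_depth=5,
                                learning_rate=0.1,
                                random_state=1)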

# try to predict one feature using the rest of others to test collinearity, so it's easier to interpret the results
for c in cols:
    # c is the thing to predict.
    
    if c not in ['rand0', 'rand1', 'rand2']:
        df_X = df_videos.replace(d_workout_replace).replace(d_area_replace)
        df_X['calories'] = df_X['calories'].fillna(0) # only calories should have missing values.

        X = df_X[cols].drop([c], axis=1) # drop the thing to predict.
        X = pd.get_dummies(X)
        y = df_X[c]

        print(c)

        if c in non_numeric_cols:
            scoring = 'accuracy'
            model = clf
            scores = cross_val_score(clf, X, y, cv=5, scoring=scoring)
            print(scoring + ": %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
        elif c in numeric_cols:
            scoring = 'neg_root_mean_squared_error'
            model = reg
            scores = cross_val_score(reg, X, y, cv=5, scoring=scoring)
            print(scoring.replace('neg_', '') + ": %0.2f (+/- %0.2f)" % (-scores.mean(), scores.std() * 2))
        else:
            print('what is this?')

        model.fit(X, y)
        df_importances = pd.DataFrame(data={'feature_name': X.columns, 'importance': model.feature_importances_}).sort_values(by='importance', ascending=False)
        top5_features = df_importances.iloc[:5]
        print('top 5 features:')
        print(top5_features)

        print()

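Study the score and the important predictors for each target feature.

We study each target feature and its relationship with the predictors. We won't cover the whole process; we only explain two examples below.

We found that the length and calories features are correlated. This is intuitive: the longer the workout, the more calories it burns.

We can visualize this relationship.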

trace = go.Scatter(
    x = df_videos['length'],
    y = df_videos['calories'],
    mode = 'markers'
)

layout = dict(
    xaxis=dict(
        title='length'
    ),
    yaxis=dict(
        title='calories'
    )
)

fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

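As you can see, length and calories are positively correlated, but not strongly enough for us to drop one of them. Videos of 40-45 minutes overlap in calories burned with videos of 30-35 minutes, 50-55 minutes, and even 60+ minutes. So we keep both.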
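One way to put a number on a relationship like this is the Pearson correlation coefficient. The sketch below computes it from scratch; the lengths and calories are toy values invented for illustration, not Sydney's data, and the helper name `pearson` is our own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data: longer videos tend to burn more calories, with overlap between lengths.
lengths = [30, 30, 40, 40, 50, 60]
calories = [300, 420, 350, 500, 450, 600]

r = pearson(lengths, calories)
print(round(r, 2))
```

A coefficient clearly below 1 is consistent with the overlap described above: length explains much of the calorie count, but not all of it.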

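We also found that the num_same_area and area_full features are correlated. This is somewhat surprising, so let's dig into it.

The plot below shows the relationship between num_same_area and area.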

trace = go.Scatter(
    x = df_videos['area'],
    y = df_videos['num_same_area'],
    mode = 'markers'
)

layout = dict(
    xaxis=dict(
        title='area'
    ),
    yaxis=dict(
        title='num_same_area'
    )
)

fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

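The num_same_area feature counts the videos posted in the past 30 days that target the same area, including the current one. The area_full feature marks full-body workouts, which are the most common type among Sydney's videos. So when num_same_area is large, the video is almost certainly a full-body workout.

Suppose we found that a higher num_same_area (>= 10) does lead to more YouTube views. We would not be able to tell whether that is due to area_full or to num_same_area. So we drop the num_same_area feature to avoid this situation.

Using similar logic, we also drop num_same_workout.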

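Step 4: Build the target variable

As you may recall, the goal of this project is to increase YouTube views. Should we simply use the number of views as our target?

The distribution of views is highly skewed. The median video has 27,641 views, while the most-viewed video has about 1.3 million. This imbalance can cause problems when interpreting the model.

So instead of views, we create the feature views_quartile.

We split the videos into two classes: high-view videos ("high") and low-view videos ("low"). "High" means views at or above the 75th percentile (35,578); "low" is everything else. In the code these classes are labeled "Q4" and "Q1-Q3".

This way, we use the predictive model to look for combinations of features that produce videos in the top 25% by views. The new target gives more stable results and better insights.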
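The labeling rule can be tried on a handful of numbers without pandas. In this sketch the view counts are invented and `top_quartile_labels` is a hypothetical helper; it assumes there are no tied values, which keeps the rank computation simple.

```python
# Hypothetical view counts, invented for illustration (one extreme outlier).
views = [12000, 18000, 21000, 27000, 30000, 34000, 41000, 1300000]

def top_quartile_labels(values):
    """Label each value 'high' if its percentile rank is above 0.75, else 'low'."""
    order = sorted(values)
    n = len(values)
    labels = []
    for v in values:
        pct_rank = (order.index(v) + 1) / n  # assumes no ties, for simplicity
        labels.append("high" if pct_rank > 0.75 else "low")
    return labels

labels = top_quartile_labels(views)
print(labels)
```

Because the label depends only on rank, the 1.3 million outlier counts the same as any other top-quartile video, which is the point of replacing raw views with a quartile target.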

df_videos['views_pct_rank'] = df_videos['views'].rank(pct=True)
df_videos['views_quartile'] = pd.cut(df_videos['views_pct_rank'], bins=[0, 0.75, 1.0], labels=['Q1-Q3', 'Q4'])

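Step 5: Build the decision tree

Finally, we have everything we need to build the model!

We train a decision tree model on the target views_quartile.

To avoid overfitting, we set the minimum number of samples per leaf to 10. To keep the tree easy to interpret, we set its maximum depth to 8.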

from sklearn.tree import DecisionTreeClassifier
from io import StringIO  # sklearn.externals.six is no longer shipped with recent scikit-learn
from IPython.display import Image
import sklearn.tree as tree
import pydotplus

features = df_videos.drop(['title', 'posted_ago', 'views', 'link', 'title_word_set', 'num_same_workout', 'area_int', 'workout_int', 'views_pct_rank', 'views_quartile', 'yyyymmdd', 'yyyymm', 'rand0', 'rand1', 'rand2'], axis=1).columns.values
target = 'views_quartile'

df_X = df_videos[features].fillna(0)
X = df_X[features]
X = pd.get_dummies(X)
y = df_videos[target]

dt = DecisionTreeClassifier(max_depth=8, min_samples_leaf=10) #max_depth is maximum number of levels in the tree
dt.fit(X, y)

dot_data = StringIO()
tree.export_graphviz(dt,
                     out_file=dot_data,
                     class_names=['...low', '.high'],
                     feature_names=X.columns, # the feature names.
                     filled=True, # Whether to fill in the boxes with colours.
                     rounded=True, # Whether to round the corners of the boxes.
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

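Step 6: Read the decision tree

In this last step, we study and summarize the "branches" that lead to high or low views.

What are the main insights we found?

Insight #1: calories burned per minute is the most important feature

Yes, calories is the most important feature. People seem to care less about the workout type or the body part.

Among workouts that burn more calories per minute (≥ 12.025), 51/(34+51) = 60% of the videos have high views.

Videos that burn fewer calories per minute (≤ 9.846) are far less popular: only 12/(154+12) = 7.2% of them have high views.

For videos with a medium burn rate (between 9.846 and 12.025), other factors come into play.

Of course, everyone wants to burn calories efficiently!

Insight #2: covering many different body parts does not raise your views

This insight is a little different from what we expected. Isn't variety in workouts better?

When the number of distinct body areas over the past month (num_unique_areas) is high (≥ 10), videos tend to get fewer views. This holds even when the calories burned per minute are high.

Combining the two insights above, 42/(12+42) = 78% of videos get high views when:

Calories burned per minute are high (≥ 12.025)
The number of distinct areas worked in the past month is low (< 10)

Mentioning too many body parts within a month may confuse viewers.

Insight #3: butt workouts are popular

When a video burns fewer calories (calories_per_min ≤ 9.846), 5/(10+5) = 33% of the butt workouts still get high views, whereas only 7/(144+7) = 4.6% of the non-butt workouts do.

We don't see any other specific body part mattering as much in the decision tree. Sydney's viewers want to work on the butt area!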
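The percentages in these insights are simply the share of high-view videos among the leaves on each branch. A small helper makes the arithmetic easy to re-check; the (low, high) counts are the ones quoted in the text, while the function name `high_view_share` and the group labels are our own.

```python
def high_view_share(low_count, high_count):
    """Fraction of videos in a group of tree leaves that are high-view."""
    return high_count / (low_count + high_count)

# (low, high) counts quoted in the insights.
groups = {
    "calories_per_min >= 12.025": (34, 51),
    "calories_per_min <= 9.846": (154, 12),
    "high burn rate and fewer than 10 unique areas": (12, 42),
    "low burn rate, butt workout": (10, 5),
    "low burn rate, not a butt workout": (144, 7),
}

for name, (low, high) in groups.items():
    print("{}: {:.1%}".format(name, high_view_share(low, high)))
```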

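Actionable suggestions for getting more views

So, what strategies can we recommend to Sydney?

Strategy 1: burn more calories

As we saw, calories burned per minute is the most important feature. Burning 12.025 calories per minute seems to be the "magic" number.

The table below is a good starting point for how many calories videos of different lengths should burn:

30-minute workout: 361 calories
40-minute workout: 481 calories
50-minute workout: 601 calories
60-minute workout: 722 calories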
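These targets are the 12.025 calories-per-minute threshold multiplied by the video length and rounded to a whole number. The sketch below reproduces them; it uses `Decimal` so that the rounding is not affected by floating-point representation, and the names `THRESHOLD_PER_MIN` and `calorie_target` are our own.

```python
from decimal import Decimal, ROUND_HALF_UP

THRESHOLD_PER_MIN = Decimal("12.025")  # split value reported by the decision tree

def calorie_target(minutes):
    """Calories needed to reach the threshold rate, rounded half up."""
    total = THRESHOLD_PER_MIN * minutes
    return int(total.quantize(Decimal("1"), rounding=ROUND_HALF_UP))

targets = {minutes: calorie_target(minutes) for minutes in (30, 40, 50, 60)}
print(targets)
```

The same function gives a target for any other length, for example a 45-minute workout.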

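We suspect that the numbers shown (video length and calories) have a psychological effect. People may like to see the first two digits of the calories be much larger than the video length.

Strategy 2: use fewer different body parts

Sometimes less is more.

People don't like too many different body parts described in workout titles. According to our model, it is better to focus on fewer than 10 body-part combinations within a month.

We noticed that Sydney has been using fewer body-part keywords in her recent videos. The most obvious change is that she has been using "arms" or "upper body" rather than words like "biceps" or "back".

Strategy 3: more butt workouts

Sydney's subscribers are probably mostly women, who care more about working the butt than about muscular arms. People are willing to trade burning fewer calories for a more toned butt. Perhaps Sydney should add some butt exercises to the videos that burn fewer calories.

Additional strategies

Beyond the strategies above, there are other ideas worth further study.

For example, Sydney could try:

Launching new campaigns at the beginning of the month. Videos posted early in the month are more likely to get high views; perhaps people like to set new goals to start a new month.
Avoiding posting the same type of workout within 5 days.

This is an application we are still exploring to increase YouTube views, and it has some limitations:

These suggestions are based on past performance. YouTubers often try innovative ideas outside their past routines. With that in mind, we could also apply machine learning to their competitors to gain further insights.
We only analyzed the titles. Other data, such as the videos' captions, could be scraped and may also contain valuable insights.
We have less data than the channel's owner. There is other critical information, such as viewer demographics, that could yield more features, more insights, and better explanations of those insights.

Original English article: https://www.justintodata.com/get-more-youtube-views-with-machine-learning/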
