Data Analysis of Trending Videos in Bilibili's Life/Funny Category, August 2020

Author: 南岛鹋
Published: September 16, 2020
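Categories: Projects, Data Mining and Analysis

This post analyzes the trending videos in Bilibili's Life/Funny category: the month's hot words, and how the triple-action counts (likes, coins, favorites) and other metrics relate to a video's view count.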

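Data Crawling

Choosing a Target

I wanted a large dataset, so I ruled out the site-wide popularity ranking: all categories combined come to only about a thousand videos. Crawling every video on the site is beyond my skills, so I settled on the per-category ranking.

[Figure: category ranking]

This ranking lets you choose a time range, so each month's popularity ranking can be used to analyze the monthly hot topics, which kinds of videos take off more easily, and how various factors affect view counts. Although it is only the monthly ranking of one small category and does not cover every video, the volume is still large: the figure below shows close to 230,000 records.

[Figure: record count]

Site Analysis

There is one difficulty here: the page source shown in the browser contains the video information, but the source returned by a requests call does not. In my first two versions I therefore used selenium, which has two drawbacks: it is slow (it has to load the whole page like a browser) and it yields little information (only the title, author, video description, and the URLs of the video page and the uploader's home page). This time I switched to calling the API directly.

[Figure: page analysis]

Pick a specific number shown on the page and search for it in the page's network requests; this turns up an endpoint named search.

[Figure: page analysis]

Opening it shows that its result field holds 20 records, which matches the 20 videos on each page.

[Figure: response data]

The records contain the author, title, tags, view count, and other fields.

The endpoint is https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831 . Here view_type is the ranking type, page is the page number, and pagesize is the maximum number of videos per page (the upper limit appears to be 100). The last two parameters, time_from and time_to, set the date range.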
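
As a minimal sketch, the query string can be assembled from a parameter dictionary instead of being hard-coded. The helper name build_search_url is hypothetical; the parameter names and values are the ones in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.search.bilibili.com/cate/search"


def build_search_url(page, pagesize=20, time_from="20200801", time_to="20200831"):
    """Build the category-search URL for one page (hypothetical helper)."""
    params = {
        "main_ver": "v3",
        "search_type": "video",
        "view_type": "hot_rank",   # ranking type
        "order": "click",
        "copy_right": -1,
        "cate_id": 138,            # category id used throughout this post
        "page": page,
        "pagesize": pagesize,
        "jsonp": "jsonp",
        "time_from": time_from,
        "time_to": time_to,
    }
    return BASE_URL + "?" + urlencode(params)


print(build_search_url(1))
```

Keeping the parameters in a dictionary makes it easy to change the month or the page size without editing a long string.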

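I also needed the triple-action data (likes, coins, favorites) and each uploader's follower count. The same kind of analysis turns up the endpoints below.

Triple-action statistics: https://api.bilibili.com/x/web-interface/archive/stat?aid=371876135

The aid is converted from the BV id.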
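
The conversion can be sketched as a standalone function. It uses the same alphabet, character positions, and constants as the dec method in the crawler code of this post; the function name bv_to_av is my own, and whether the scheme still holds for BV ids issued after 2020 is not something this post covers.

```python
ALPHABET = "fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF"
POSITIONS = [11, 10, 3, 8, 4, 6]  # characters of the BV id that carry the base-58 digits
ADD = 0x2_0840_07C0
XOR = 0x0A93_B324


def bv_to_av(bvid):
    """Decode a 12-character BV id into its numeric aid."""
    r = 0
    for i, pos in enumerate(POSITIONS):
        # Each selected character is one base-58 digit, least significant first
        r += ALPHABET.index(bvid[pos]) * 58 ** i
    return (r - ADD) ^ XOR


print(bv_to_av("BV17x411w7KC"))  # 170001
```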

Follower count: https://api.bilibili.com/x/relation/stat?vmid=32172331

The mid is available from the first endpoint.

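IP Pool

At this point crawling can start, but once the data volume grows a little and requests become a little frequent, the IP gets blocked.

That calls for proxy IPs. Free proxies exist, and there are GitHub projects dedicated to building proxy pools, but free proxies are a hassle in the end, so I rented a dedicated IP by the day: http://www.xdaili.cn/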
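
For the pool-based alternative mentioned above, a rotation helper might look like the sketch below. This is not what the crawler in this post does (it fetches one rented IP from the provider's API); the class name ProxyRotator and the addresses are hypothetical.

```python
import itertools


class ProxyRotator:
    """Cycle through a fixed list of 'host:port' proxy addresses (hypothetical helper)."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one proxy address is required")
        self._cycle = itertools.cycle(addresses)

    def next_proxies(self):
        """Return the next address as a requests-style proxies mapping."""
        address = next(self._cycle)
        return {"http": "http://" + address, "https": "http://" + address}


rotator = ProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080"])  # placeholder addresses
print(rotator.next_proxies())
```

The returned dictionary has the shape that requests.get accepts through its proxies argument.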

Code

# coding: utf-8
# Author: 南岛鹋
# Blog: www.ndmiao.cn
# Date: 2020/8/25 10:29
# Tool: PyCharm

import ast
import csv
import json
import random
import time

import requests


class video_data:
    def __init__(self):
        # Category search endpoint; {} is filled with the page number
        self.url = 'https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page={}&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831'
        self.page = 11507
        self.alphabet = 'fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF'

    def dec(self, x):  # convert a BV id to an AV id (aid)
        r = 0
        for i, v in enumerate([11, 10, 3, 8, 4, 6]):
            r += self.alphabet.find(x[v]) * 58 ** i
        return (r - 0x2_0840_07c0) ^ 0x0a93_b324

    def random_headers(self, path):  # pick a random set of request headers
        # Each line of the file holds one dict literal
        with open(path, 'r') as f:
            data = f.readlines()

        # literal_eval parses the dict literal without executing arbitrary code
        reg = [ast.literal_eval(i) for i in data if i.strip()]
        return random.choice(reg)

    def get_ip(self):  # fetch a proxy IP
        print('切换IP中.......')
        url = '代理IP的地址'  # placeholder: the proxy provider's extraction URL
        ip = requests.get(url).text.strip()
        # "Extracting too often" or "not enough machines": sleep 14 seconds, then try once more
        if ip in ['{"ERRORCODE":"10055","RESULT":"提取太频繁,请按规定频率提取!"}', '{"ERRORCODE":"10098","RESULT":"可用机器数量不足"}']:
            time.sleep(14)
            ip = requests.get(url).text.strip()
        print(ip)
        proxies = {
            'https': 'http://' + ip,
            'http': 'http://' + ip
        }  # set https and http as needed
        return proxies

    def get_requests(self, url, proxy):  # send a request
        headers = self.random_headers('headers.txt')
        # Send the headers and proxy; try/except limits the damage from unexpected failures
        try:
            response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
        except requests.exceptions.RequestException as e:
            print(e)
            proxy = self.get_ip()  # first failure: switch to a new proxy
            try:
                response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
            except requests.exceptions.RequestException as e:
                print(e)
                print('原始IP')  # second failure: fall back to the machine's own IP
                response = requests.get(url, timeout=3, headers=headers)
        return response, proxy

    def get_follower(self, mid, proxy, retries=5):  # fetch the follower count
        url = 'https://api.bilibili.com/x/relation/stat?vmid=' + str(mid)
        r, proxy = self.get_requests(url, proxy)
        result = json.loads(r.text)  # parse the response text as JSON
        # The follower count always exists, so retry on failure.
        # The retry count is bounded so a persistent failure cannot recurse forever.
        try:
            follower = result['data']['follower']
        except (KeyError, TypeError):
            if retries <= 0:
                return 'None', proxy
            follower, proxy = self.get_follower(mid, proxy, retries - 1)
        return follower, proxy

    def get_view(self, BV, proxy):  # fetch the triple-action counts and views
        aid = self.dec(BV)
        url = 'https://api.bilibili.com/x/web-interface/archive/stat?aid=' + str(aid)
        r, proxy = self.get_requests(url, proxy)
        result = json.loads(r.text)
        # Output key -> field name in the API response
        fields = {'view': 'view', 'danmu': 'danmaku', 'reply': 'reply', 'like': 'like',
                  'coin': 'coin', 'favorite': 'favorite', 'share': 'share', 'rank': 'his_rank'}
        # A video on the ranking may already have been deleted; missing data is stored as 'None'
        try:
            stat = result['data']
            view = {key: stat[field] for key, field in fields.items()}
        except (KeyError, TypeError):
            view = {key: 'None' for key in fields}
        return view, proxy

    def get_parse(self, result, proxy):  # assemble the rows
        content = []
        items = result['result']
        for item in items:
            bvid = item['bvid']
            mid = item['mid']
            follower, proxy = self.get_follower(mid, proxy)
            v, proxy = self.get_view(bvid, proxy)
            con = [item['pubdate'], item['title'], item['author'], bvid, mid, follower,
                   v['view'], v['danmu'], v['reply'], v['like'], v['coin'],
                   v['favorite'], v['share'], v['rank'], item['tag']]
            content.append(con)
            print(con)
        print(content)
        self.save(content)
        return proxy

    def write_header(self):
        header = ['日期', '标题', '作者', 'BV', 'mid', '粉丝', '播放', '弹幕', '评论', '点赞','硬币','收藏','转发','排名','标签']
        with open('fun_video.csv', 'a', encoding='gb18030', newline='') as f:
            write = csv.writer(f)
            write.writerow(header)

    def save(self, content):  # append the rows to the CSV file
        with open('fun_video.csv', 'a', encoding='gb18030', newline='') as file:
            write = csv.writer(file)
            write.writerows(content)

    def run(self):
        #self.write_header()  # uncomment on the first run to write the CSV header
        proxy = self.get_ip()
        for i in range(168, self.page):  # start page; 168 presumably resumes an interrupted run
            url = self.url.format(i)
            response, proxy = self.get_requests(url, proxy)
            result = json.loads(response.text)
            proxy = self.get_parse(result, proxy)
            print('第{}页爬取完毕'.format(i))


if __name__ == '__main__':
    video = video_data()
    video.run()
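
The crawler retries failed requests by hand. As a hedged sketch of the same idea in a reusable form, the helper below retries any callable a bounded number of times with exponential backoff. The name call_with_retries and its parameters are hypothetical and not part of the crawler.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out (hypothetical helper).

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in function that fails twice before succeeding
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


print(call_with_retries(flaky, base_delay=0.01))  # ok
```

Passing the sleep function as a parameter keeps the helper easy to test without real delays.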

Data Analysis

The code below was run in a notebook.
First, import the libraries we need.

import numpy as np  # multidimensional arrays
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # word cloud generation
from PIL import Image  # PIL (Python Imaging Library), image processing
import jieba  # Chinese word segmentation
Data Preprocessing

Read the data

video_data = pd.read_csv(r'D:\video_data.csv', encoding='utf-8')

Look at the first five rows

video_data.head()  # show the first five rows

[Figure: first five rows]

Get an overview of the data

video_data.info()  # summary of the video data

[Figure: data summary]

Preprocess the data: replace the 'None' markers with 0 and convert the numeric columns to integers

count_cols = ['播放', '弹幕', '评论', '点赞', '硬币', '收藏', '转发', '排名']
# Replace the 'None' strings written by the crawler with 0 so the columns can be treated as numbers
video_data[count_cols] = video_data[count_cols].replace('None', 0)
# Convert the columns to integers for numeric processing
video_data[count_cols] = video_data[count_cols].astype('int64')

Check the column types after preprocessing

video_data.info()

[Figure: after conversion]

Preprocess the titles, keeping only Chinese characters

video_data['标题'] = video_data['标题'].str.replace(r'[^\u4e00-\u9fa5]', '', regex=True)  # keep only Chinese characters
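
To see what this character class does to a single title, here is a small sketch with the standard re module. The function name keep_chinese and the sample string are made up for illustration.

```python
import re

NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")


def keep_chinese(text):
    """Drop every character outside the U+4E00..U+9FA5 range."""
    return NON_CHINESE.sub("", text)


print(keep_chinese("【搞笑】Hello 老师123!"))  # 搞笑老师
```

Punctuation such as 【】, Latin letters, digits, and spaces all fall outside the range, so only the Chinese characters remain.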

Split the titles into individual words

video_data['标题'] = video_data['标题'].fillna(' ')  # replace missing values with a space
video_data['标题'] = video_data['标题'].apply(lambda x:' '.join(jieba.cut(x)))  # segment each title into words
video_data['标题'].head()

[Figure: segmented titles]

Process the tags the same way

# Process the tags the same way
video_data['标签'] = video_data['标签'].str.replace(',','')
video_data['标签'] = video_data['标签'].fillna(' ')
video_data['标签'] = video_data['标签'].apply(lambda x:' '.join(jieba.cut(x)))
video_data['标签'].head()

[Figure: segmented tags]

Convert the timestamps to a standard datetime format

video_data.日期 = pd.to_datetime(video_data.日期.str.findall(r'\d{4}.+').str.get(0))  # parse the timestamps into datetime values
video_data['weekday'] = video_data.日期.dt.weekday  # day of the week (Monday = 0)
video_data['hour'] = video_data.日期.dt.hour  # hour of the day
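
The same two derived fields can be illustrated for a single timestamp with the standard library. This sketch assumes the publish time is a string of the form 'YYYY-MM-DD HH:MM:SS'; that format and the function name weekday_and_hour are assumptions, since the post does not show the raw values.

```python
from datetime import datetime


def weekday_and_hour(pubdate):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string (assumed format) into (weekday, hour)."""
    moment = datetime.strptime(pubdate, "%Y-%m-%d %H:%M:%S")
    return moment.weekday(), moment.hour  # Monday is 0, Sunday is 6


print(weekday_and_hour("2020-08-01 21:30:00"))  # (5, 21): a Saturday evening
```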

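Define a rounding helper

# Used to compute the triple-action, danmaku, and comment rates.
# For a single float it rounds a trailing 5 upward; for any other input
# (including a pandas Series, as used below) it falls through to round().
def new_round(_float, _len):
    if isinstance(_float, float):
        # Already has no more than _len decimal places: return unchanged
        if str(_float)[::-1].find('.') <= _len:
            return(_float)
        # Last digit is 5: bump it to 6 so that round() rounds upward
        if str(_float)[-1] == '5':
            return(round(float(str(_float)[:-1]+'6'), _len))
        else:
            return(round(_float, _len))
    else:
        return(round(_float, _len))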
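
Python's built-in round rounds ties to the nearest even digit, which is why a helper like this is needed in the first place. As an alternative sketch (not the author's method), the standard decimal module offers an explicit half-up mode; the function name round_half_up is my own.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, ndigits=0):
    """Round so that a trailing 5 always rounds away from zero."""
    quantum = Decimal(1).scaleb(-ndigits)  # 1, 0.1, 0.01, ...
    # str(value) avoids carrying binary floating-point noise into the Decimal
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


print(round(2.5), round_half_up(2.5))  # 2 3.0
print(round_half_up(0.125, 2))         # 0.13
```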

Compute the triple-action and other rates (as percentages of views)

video_data['点赞率']=new_round(video_data.点赞/video_data.播放*100,0)
video_data['硬币率']=new_round(video_data.硬币/video_data.播放*100,0)
video_data['收藏率']=new_round(video_data.收藏/video_data.播放*100,0)
video_data['转发率']=new_round(video_data.转发/video_data.播放*100,0)
video_data['弹幕率']=new_round(video_data.弹幕/video_data.播放*100,0)
video_data['评论率']=new_round(video_data.评论/video_data.播放*100,1)

Look at the processed data

video_data.head()

[Figure: processed data]

Data Analysis

Count the uploaders

print('共有{}位UP，分别是'.format(len(video_data['UP'].unique())))  # unique() drops the duplicates
video_data['UP'].unique()

[Figure: number of uploaders]

Count the videos in each view-count range

# Count the videos in each view-count range
data_length = len(video_data)
data_length_rate0 = len(video_data[video_data['播放']<10000])
data_length_rate1 = len(video_data[(video_data['播放']>=10000) & (video_data['播放']<100000)])
data_length_rate2 = len(video_data[(video_data['播放']>=100000) & (video_data['播放']<500000)])
data_length_rate3 = len(video_data[(video_data['播放']>=500000) & (video_data['播放']<1000000)])
data_length_rate4 = len(video_data[video_data['播放']>=1000000])  # >= so that exactly 1,000,000 views is counted

Show the result

video_rate = [data_length_rate0,data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[0,9999]','[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
video_rate

[Figure: counts per range]
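
The five range counts can also be produced in one pass. The sketch below uses the standard bisect module on plain numbers; it illustrates the bucketing logic and is not the pandas code used in this post. The names EDGES, LABELS, and bucket_counts are my own.

```python
from bisect import bisect_right
from collections import Counter

EDGES = [10_000, 100_000, 500_000, 1_000_000]  # lower bounds of ranges 1..4
LABELS = ["[0,9999]", "[10000,99999]", "[100000,499999]",
          "[500000,999999]", "[1000000,....]"]


def bucket_counts(views):
    """Count how many view totals fall into each labelled range."""
    # bisect_right returns how many edges are <= v, which is the range index
    counts = Counter(LABELS[bisect_right(EDGES, v)] for v in views)
    return [counts[label] for label in LABELS]


print(bucket_counts([5, 9_999, 10_000, 250_000, 1_000_000, 3_000_000]))
# [2, 1, 1, 0, 2]
```

Because every value lands in exactly one range, the counts always add up to the number of videos.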

Plot the result

# Draw the pie chart
plt.rcParams['font.sans-serif']=['SimHei'] # render Chinese text correctly
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts; percentages shown with two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="区间",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("播放量占比")
plt.show()

[Figure: pie chart of view-count ranges]

Show only the videos with at least 10,000 views

video_rate = [data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts; percentages shown with two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="区间",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("播放量占比")
plt.show()

[Figure: pie chart of ranges above 10,000 views]

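Show the uploaders behind the 20 most-viewed videos

# Uploaders of the 20 most-viewed videos
top_20=video_data.sort_values(by=['播放'],ascending=False)[:20]
top_20['UP'].value_counts()

[Figure: videos per uploader in the top 20]

Details of the top 20

# Details of the top 20
top_20[['UP','播放','粉丝']]

[Figure: basic data for the top 20 videos]

Group by uploader and rank by total views in August

# Group by uploader and rank by total views in August
print(video_data.groupby('UP')['播放'].sum().sort_values(ascending=False)[:20])
# Number of videos posted by each of three top uploaders
top_1 = video_data[video_data['UP']=='大霓奈']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='陈师姬']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='17岁反派里的持枪Boy']
print(top_1['UP'].value_counts())

[Figure: total views per uploader]

Rank uploaders by total danmaku count

# Rank uploaders by total danmaku count
video_data.groupby('UP')['弹幕'].sum().sort_values(ascending=False)[:20]

[Figure: danmaku counts]

Rank uploaders by total comment count

# Rank uploaders by total comment count
video_data.groupby('UP')['评论'].sum().sort_values(ascending=False)[:20]

[Figure: comment counts]

Count the videos each uploader posted

# Count the videos each uploader posted in August
video_data['UP'].value_counts()[:20]

[Figure: video counts]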

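Count the videos published in each hour, by day of the week

# Count the videos published in each hour, by day of the week
fig1, ax1=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour', 'weekday']).count()['mid'].unstack()
df.plot(ax=ax1, style='-.')
plt.show()

[Figure: number of videos by publish time]

# Total views by publish hour and day of the week
fig2,ax2=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour','weekday'])['播放'].sum().unstack()  # select the column first so only views are summed
df.plot(ax=ax2,style='-.')
plt.show()

[Figure: total views by publish time]

# Count the videos with more than 10,000 views by publish hour and day of the week
view_1 = video_data[video_data['播放']>10000]
fig3,ax3=plt.subplots(figsize=(14,4))
df=view_1.groupby(['hour','weekday'])['mid'].count().unstack()  # count rows; summing mid would add up user ids
df.plot(ax=ax3,style='-.')
plt.show()

[Figure: number of videos over 10,000 views by publish time]

view_2 = video_data[video_data['播放']>100000]
view_3 = video_data[video_data['播放']>1000000]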

Show the hot words with word clouds

matplotlib.rcParams['font.sans-serif'] = ['KaiTi']  # font for Chinese text in plots
matplotlib.rcParams['font.serif'] = ['KaiTi']  # font for Chinese text in plots
with open("D:/stopwords.txt",encoding='utf-8') as infile:
    stopwords = {x.strip() for x in infile}  # stop-word set, surrounding whitespace stripped

def ciyun(texts,mid='all'):  # optionally restrict to one uploader by mid
    if mid == 'all':
        text = ' '.join(texts)
    else:
        text = ' '.join(texts[video_data['mid']==mid])

    wc = WordCloud(font_path="msyh.ttc", background_color='white', max_words=100, stopwords=stopwords, max_font_size=80, random_state=42, margin=3)  # configure the word cloud
    wc.generate(text)  # build the word cloud
    plt.imshow(wc,interpolation="bilinear")  # draw it
    plt.axis("off")  # hide the axes

ciyun(video_data['标题'])

Hot words in videos with more than 10,000 views

ciyun(view_1['标题'])

Hot words in videos with more than 100,000 views

ciyun(view_2['标题'])

Hot words in videos with more than 1,000,000 views

ciyun(view_3['标题'])

Do the same for the tags

ciyun(video_data['标签'])

ciyun(view_1['标签'])

ciyun(view_2['标签'])

ciyun(view_3['标签'])
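
A word cloud shows relative sizes but not exact counts. As a complementary sketch, the most frequent words can be listed with collections.Counter. The function name top_words and the sample titles are made up; the input is assumed to be already segmented into space-separated words, as the titles and tags are after the jieba step.

```python
from collections import Counter


def top_words(texts, stopwords=frozenset(), n=10):
    """Return the n most frequent space-separated words, skipping stop words."""
    counter = Counter()
    for text in texts:
        counter.update(w for w in text.split() if w not in stopwords)
    return counter.most_common(n)


sample = ["歪嘴 龙王 的 日常", "龙王 回归", "老师 的 日常"]  # made-up, already segmented
print(top_words(sample, stopwords={"的"}, n=2))
# [('龙王', 2), ('日常', 2)]
```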

Count the videos whose titles contain a given hot word, and their total views

# Videos whose titles contain 老师 (teacher): [count, total views]
video_teacher = video_data[video_data['标题'].str.contains("老师")]
teacher = [len(video_teacher),video_teacher['播放'].sum()]
teacher

[3022, 16610756]

# 兄弟 (brother)
video_bro = video_data[video_data['标题'].str.contains("兄弟")]
brother = [len(video_bro),video_bro['播放'].sum()]
brother

[1897, 25270292]

# 女朋友 (girlfriend)
video_girlfriend = video_data[video_data['标题'].str.contains("女朋友")]
girlfriend = [len(video_girlfriend),video_girlfriend['播放'].sum()]
girlfriend

[830, 28265224]

# Videos whose titles contain both 女朋友 and 兄弟
fun = video_girlfriend[video_girlfriend['标题'].str.contains("兄弟")].drop_duplicates()
print(len(fun))
print(fun)

[Figure: videos whose titles contain both 女朋友 and 兄弟]

# 一旦 (once)
video_yidan = video_data[video_data['标题'].str.contains("一旦")]
yidan = [len(video_yidan),video_yidan['播放'].sum()]
yidan

[89, 28302099]

# 吾辈 (we)
video_wubei = video_data[video_data['标题'].str.contains("吾辈")]
wubei = [len(video_wubei),video_wubei['播放'].sum()]
wubei

[318, 35563900]

# 歪嘴 (crooked mouth)
video_waizui = video_data[video_data['标题'].str.contains("歪嘴")]
waizui = [len(video_waizui),video_waizui['播放'].sum()]
waizui

[1810, 70787655]
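
The six blocks above repeat one pattern: filter by keyword, then count and sum. A minimal sketch of that pattern on plain Python data follows. The function name keyword_stats and the sample rows are made up, and the list of (title, views) pairs is a stand-in for the 标题 and 播放 columns.

```python
def keyword_stats(videos, keyword):
    """Return [number of matching videos, their total views].

    `videos` is a list of (title, views) pairs (hypothetical stand-in
    for the 标题 and 播放 columns).
    """
    matched = [views for title, views in videos if keyword in title]
    return [len(matched), sum(matched)]


videos = [("歪嘴 龙王", 1200), ("我 的 老师", 300), ("老师 好", 50)]  # made-up rows
print(keyword_stats(videos, "老师"))  # [2, 350]
```

Dividing the second number by the first gives the average views per matching video, which is a fairer comparison between hot words than the totals alone.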

Pie chart of the number of videos whose titles contain these hot words

video_rate = [teacher[0],brother[0],girlfriend[0],yidan[0],wubei[0],waizui[0]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts; percentages shown with two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="热词",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("数量占比")
plt.show()

Pie chart of the total views of videos whose titles contain these hot words

video_rate = [teacher[1],brother[1],girlfriend[1],yidan[1],wubei[1],waizui[1]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the view totals; percentages shown with two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="热词",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("播放量占比")
plt.show()

Do the same for the tags

video_bijian = video_data[video_data['标签'].str.contains("必剪")]
bijian = [len(video_bijian),video_bijian['播放'].sum()]
video_fun = video_data[video_data['标签'].str.contains("恶作剧")]
fun = [len(video_fun),video_fun['播放'].sum()]
video_tc = video_data[video_data['标签'].str.contains("吐槽")]
tc = [len(video_tc),video_tc['播放'].sum()]
video_beauty = video_data[video_data['标签'].str.contains("美女")]
beauty = [len(video_beauty),video_beauty['播放'].sum()]
video_wezy = video_data[video_data['标签'].str.contains("万恶之源")]
wezy = [len(video_wezy),video_wezy['播放'].sum()]
video_show = video_data[video_data['标签'].str.contains("表演")]
show = [len(video_show),video_show['播放'].sum()]
video_tuwei = video_data[video_data['标签'].str.contains("土味")]
tuwei = [len(video_tuwei),video_tuwei['播放'].sum()]


video_rate = [bijian[0],fun[0],tc[0],beauty[0],wezy[0],show[0],tuwei[0]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts; percentages shown with two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="标签",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("数量占比")
plt.show()

video_rate = [bijian[1],fun[1],tc[1],beauty[1],wezy[1],show[1],tuwei[1]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the view totals; percentages shown with two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="标签",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("播放量占比")
plt.show()

Rank the videos with more than 10,000 views by triple-action and other rates

# Rank the videos with more than 10,000 views by each rate and keep the top 20
video_rate = video_data[video_data['播放']>10000]
like_20=video_rate.sort_values(by=['点赞率'],ascending=False)[:20]
coin_20=video_rate.sort_values(by=['硬币率'],ascending=False)[:20]
sc_20=video_rate.sort_values(by=['收藏率'],ascending=False)[:20]
share_20=video_rate.sort_values(by=['转发率'],ascending=False)[:20]
danmu_20=video_rate.sort_values(by=['弹幕率'],ascending=False)[:20]
command_20=video_rate.sort_values(by=['评论率'],ascending=False)[:20]


like_20[['标题','播放','UP','点赞','点赞率']]

[Figure: top 20 by like rate]

coin_20[['标题','播放','UP','硬币','硬币率']]

[Figure: top 20 by coin rate]

sc_20[['标题','播放','UP','收藏','收藏率']]

[Figure: top 20 by favorite rate]

share_20[['标题','播放','UP','转发','转发率']]

[Figure: top 20 by share rate]

danmu_20[['标题','播放','UP','弹幕','弹幕率']]

[Figure: top 20 by danmaku rate]

command_20[['标题','播放','UP','评论','评论率']]

[Figure: top 20 by comment rate]

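Conclusions

In the Life/Funny category, most videos have fewer than 10,000 views; they make up 93.86% of the total.
There are three routes to high view counts: follower count, video quality, and video quantity.
Uploading a large number of videos each month is a perfectly viable way to reach high total views. Of the two uploaders with the highest total views, one posted 154 videos and the other 528.
Danmaku and comments favor uploaders with many followers, whose fans are more engaged.
The uploader who posted the most videos in August was 老年人诱捕大队队长, with 6,932 videos.
Videos are mostly published between 10:00 and 24:00, and that window also has the highest total views.
August's hot words mainly relate to the 龙王 (Dragon King) meme, holidays, Bilibili events, and the uploaders associated with them.
Videos with hot words tied to Bilibili events generally have low view counts; hot words tied to uploaders and to the month's memes bring the best returns in views.
The triple-action, danmaku, share, and comment rates have little effect on a video's view count.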
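
The post does not show how the last conclusion was assessed. One way to check a claim like this would be to compute the Pearson correlation between a rate column and the view counts. The sketch below implements the formula on plain lists; the function name pearson and the sample numbers are made up and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


views = [100, 200, 300, 400]      # made-up numbers, not from the dataset
like_rate = [5.0, 5.0, 6.0, 6.0]  # made-up numbers
print(round(pearson(views, like_rate), 3))
```

A coefficient near 0 would support the conclusion of little linear relationship, while values near 1 or -1 would contradict it.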


Last modified: September 16, 2020


哔哩哔哩八月生活搞笑区热度视频数据分析

南岛鹋 • 2020 年 09 月 16 日

<blockquote>分析哔哩哔哩生活搞笑区的热度视频信息，分析月度视频的热词，三联等数据对视频播放量的影响。</blockquote><h2>数据爬取</h2><h3>确定目标</h3>因为想要一个量大的数据集，因此没有考虑热榜排名，因为所有区加起来也才一千左右。全部视频信息的话技术不行，然后就盯上了分区榜。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/784618101.png" alt="分区榜" title="分区榜" style="">从这个榜单可以选择时间段，可以根据每个月的视频热度排名等信息，来分析月度热点，哪些视频更加容易火，以及各种因素对视频播放量的影响。虽然只是一个小分区月度热度排名，并不包含全部视频，但是数据量也是极大的。下图可以看到接近有23万条数据。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/2927177774.png" alt="数据量" title="数据量" style=""><h3>网站分析</h3>这里存在一个难点，就是虽然浏览器上是可以查看网页源码，并且包含了视频的相关信息，但是用requests请求之后的网页源码却并没有相关的信息。因此前两个版本，我采用了selenium库的方法来获取信息，但是这个方法有一个缺点，速度慢（因为要跟浏览器一样加载整个页面信息）、信息少（只有标题、作者、视频简介、以及视频页和个人主页网址），很麻烦。于是这次我换成了API调用的方法。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/122847043.png" alt="页面分析" title="页面分析" style="">我们选择一个具体的数字来查找，可以发现搜索出来一个search的接口。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/2667775604.png" alt="页面分析" title="页面分析" style="">点进去之后，可以发现里面的result共有20条数据，刚好对应着每页20个视频。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/788882558.png" alt="数据分析" title="数据分析" style="">可以看到里面包含了作者、标题、标签、播放等一系列数据。接口为<code><a class="no-external-link" href="https://s.search.bilibili.com/cate/search?callback=main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831" target="_blank">https://s.search.bilibili.com/cate/search?callback=main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831</a> </code>,view_type为排行类型，page为页面数，pagesize为页面最大的视频数，上限好像是100。最后面就是时间了。但是我还需要三连数据以及UP主的粉丝量。同理分析得到三连的API接口：<code><a class="no-external-link" href="https://api.bilibili.com/x/web-interface/archive/stat?aid=371876135" target="_blank">https://api.bilibili.com/x/web-interface/archive/stat?aid=371876135</a> </code>其中aid由BV转换。粉丝数为<code><a class="no-external-link" href="https://api.bilibili.com/x/relation/stat?vmid=32172331" 
target="_blank">https://api.bilibili.com/x/relation/stat?vmid=32172331</a> </code>mid可以在第一个接口处获取。<h3>IP池</h3>这时候虽然已经可以开始爬取了，但是如果数据量稍微有一点大，访问稍微有点频繁，就会导致IP被屏蔽。这时候我们就需要用到代理IP，免费的代理IP虽然也有，而且GITHUB上也有专门的项目来建立代理IP池。但是免费的终究很麻烦，于是我选择了日租独享的IP。<code><a class="no-external-link" href="http://www.xdaili.cn/" target="_blank">http://www.xdaili.cn/</a> </code><h3>代码</h3><pre><code># coding: utf-8
# Author：南岛鹋 
# Blog: www.ndmiao.cn
# Date ：2020/8/25 10:29
# Tool ：PyCharm

import requests
import csv
import json
import random
import time

class video_data:
    def __init__(self):
        self.url = 'https://s.search.bilibili.com/cate/search?main_ver=v3&amp;search_type=video&amp;view_type=hot_rank&amp;order=click&amp;copy_right=-1&amp;cate_id=138&amp;page={}&amp;pagesize=20&amp;jsonp=jsonp&amp;time_from=20200801&amp;time_to=20200831'
        self.page = 11507
        self.alphabet = 'fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF'

def dec(self, x):  # BV号转换成AV号
        r = 0
        for i, v in enumerate([11, 10, 3, 8, 4, 6]):
            r += self.alphabet.find(x[v]) * 58 ** i
        return (r - 0x2_0840_07c0) ^ 0x0a93_b324

def random_headers(self, path): # 随机读取一个头信息
        with open(path, 'r') as f:
            data = f.readlines()
            f.close()

reg = []
        for i in data:
            k = eval(i)  # 将字符串转化为字典形式
            reg.append(k)
        header = random.choice(reg)
        return header

def get_ip(self): # 代理IP获取
        print('切换IP中.......')
        url = '代理IP的地址'
        ip = requests.get(url).text
        if ip in ['{&quot;ERRORCODE&quot;:&quot;10055&quot;,&quot;RESULT&quot;:&quot;提取太频繁,请按规定频率提取!&quot;}', '{&quot;ERRORCODE&quot;:&quot;10098&quot;,&quot;RESULT&quot;:&quot;可用机器数量不足&quot;}']: # 出现频繁或者机器不足，睡眠14秒
            time.sleep(14)
            ip = requests.get(url).text
            print(ip)
        else:
            print(ip)
        proxies = {
            'https': 'http://' + ip,
            'http': 'http://' + ip
        } # 设置https和http可以按需选择
        return proxies

def get_requests(self, url, proxy): # 请求的函数
        headers = self.random_headers('headers.txt')
        # 将头信息和IP写入，用try来减少意外对程序的影响
        try: 
            response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
        except requests.exceptions.RequestException as e:
            print(e)
            proxy = self.get_ip()
            try:
                response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
            except requests.exceptions.RequestException as e:
                print(e)
                print('原始IP')
                response = requests.get(url, timeout=3, headers=headers)
        return response, proxy

def get_follower(self, mid, proxy): # 获取粉丝数
        url = 'https://api.bilibili.com/x/relation/stat?vmid=' + str(mid)
        r, proxy = self.get_requests(url, proxy)
        result = json.loads(r.text) # 用json来解析文本
        # 按照需求获取需要的数据，因为粉丝数是必定存在的，所以失败了需要多次尝试获取。
        try:
            follower = result['data']['follower']
        except:
            follower,proxy = self.get_follower(mid, proxy)
        return follower, proxy

def get_view(self, BV, proxy): # 获取三连和播放
        aid = self.dec(BV)
        url = 'https://api.bilibili.com/x/web-interface/archive/stat?aid=' + str(aid)
        r, proxy = self.get_requests(url, proxy)
        result = json.loads(r.text)
        view = {}# 因为视频虽然在排行榜，但是很可能已经删除，所以没有数据为None
        try:
            view['view'] = result['data']['view']
            view['danmu'] = result['data']['danmaku']
            view['reply'] = result['data']['reply']
            view['like'] = result['data']['like']
            view['coin'] = result['data']['coin']
            view['favorite'] = result['data']['favorite']
            view['share'] = result['data']['share']
            view['rank'] = result['data']['his_rank']
        except:
            view['view'] = 'None'
            view['danmu'] = 'None'
            view['reply'] = 'None'
            view['like'] = 'None'
            view['coin'] = 'None'
            view['favorite'] = 'None'
            view['share'] = 'None'
            view['rank'] = 'None'
        return view, proxy

def get_parse(self, result, proxy): # 整合数据
        content = []
        items = result['result']
        for item in items:
            pubdate = item['pubdate']
            title = item['title']
            author = item['author']
            bvid = item['bvid']
            mid = item['mid']
            follower, proxy = self.get_follower(mid, proxy)
            video_view, proxy = self.get_view(bvid, proxy)
            view = video_view['view']
            danmu = video_view['danmu']
            reply = video_view['reply']
            like = video_view['like']
            coin = video_view['coin']
            favorite = video_view['favorite']
            share = video_view['share']
            rank = video_view['rank']
            tag = item['tag']
            con = [pubdate, title, author, bvid, mid, follower,view,danmu,reply,like,coin,favorite,share,rank, tag]
            content.append(con)
            print(con)
        print(content)
        self.save(content)
        return proxy

def write_header(self):
        header = ['日期', '标题', '作者', 'BV', 'mid', '粉丝', '播放', '弹幕', '评论', '点赞','硬币','收藏','转发','排名','标签']
        with open('fun_video.csv', 'a', encoding='gb18030', newline='')as f:
            write = csv.writer(f)
            write.writerow(header)

def save(self,content):# 存入csv
        with open('fun_video.csv', 'a', encoding='gb18030', newline='')as file:
            write = csv.writer(file)
            write.writerows(content)

def run(self):
        #self.write_header()
        proxy = self.get_ip()
        for i in range(168, self.page):
            url = self.url.format(i)
            response, proxy = self.get_requests(url, proxy)
            result = json.loads(response.text)
            proxy = self.get_parse(result, proxy)
            print('第{}页爬取完毕'.format(i))

if __name__ == '__main__':
 video = video_data()
 video.run()
</code></pre><h2>数据分析</h2>以下代码在Notebook上运行 首先我们需要导入自己需要用到的库<pre><code>import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS #导入模块worldcloud
from PIL import Image #导入模块PIL(Python Imaging Library)图像处理库
import numpy as np #导入模块numpy，多维数组
import matplotlib
import jieba
</code></pre><h3>数据预处理</h3>读取数据<pre><code>data=open(r'D:\video_data.csv',encoding='utf-8')
video_data = pd.read_csv(data)</code></pre>查看数据前五行<pre><code>video_data.head()#查看前五行</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1794480476.png" alt="前五行数据" title="前五行数据" style="">浏览数据的大概信息<pre><code>video_data.info()#视频数据的信息
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/316159067.png" alt="数据信息" title="数据信息" style="">对数据进行预处理，将None值换成0，将数字数据的格式换成int<pre><code>video_data['播放'].replace('None', 0,inplace = True)#将数据中标记为None的数据替换成0，方便数据处理
video_data['弹幕'].replace('None', 0,inplace = True)
video_data['评论'].replace('None', 0,inplace = True)
video_data['点赞'].replace('None', 0,inplace = True)
video_data['硬币'].replace('None', 0,inplace = True)
video_data['收藏'].replace('None', 0,inplace = True)
video_data['转发'].replace('None', 0,inplace = True)
video_data['排名'].replace('None', 0,inplace = True)
video_data['播放'] = video_data['播放'].astype(&quot;int64&quot;)#将数字的格式转换成int格式，用于数据处理
video_data['弹幕'] = video_data['弹幕'].astype(&quot;int&quot;)
video_data['评论'] = video_data['评论'].astype(&quot;int&quot;)
video_data['点赞'] = video_data['点赞'].astype(&quot;int&quot;)
video_data['硬币'] = video_data['硬币'].astype(&quot;int&quot;)
video_data['收藏'] = video_data['收藏'].astype(&quot;int&quot;)
video_data['转发'] = video_data['转发'].astype(&quot;int&quot;)
video_data['排名'] = video_data['排名'].astype(&quot;int&quot;)</code></pre>查看预处理之后的数据格式<pre><code>video_data.info()</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2331605893.png" alt="转换后" title="转换后" style="">对标题进行预处理，只保留中文字符<pre><code>video_data['标题'] = video_data['标题'].str.replace(r'[^\u4e00-\u9fa5]','')#只保留中文
</code></pre>将标题分割成一个个短词<pre><code>video_data['标题'].fillna(' ',inplace=True) #将空值替换成空格
video_data['标题'] = video_data['标题'].apply(lambda x:' '.join(jieba.cut(x)))#将句子分割成一个个词语
video_data['标题'].head()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/114151688.png" alt="处理后的结果" title="处理后的结果" style="">同理处理标签<pre><code>#同理处理标签
video_data['标签'] = video_data['标签'].str.replace(',','')
video_data['标签'].fillna(' ',inplace=True)
video_data['标签'] = video_data['标签'].apply(lambda x:' '.join(jieba.cut(x)))
video_data['标签'].head()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/492638963.png" alt="标签取词结果" title="标签取词结果" style="">将时间信息转化成标准格式<pre><code>video_data.日期 = pd.to_datetime(video_data.日期.str.findall(r'\d{4}.+').str.get(0)) #将时间进行解析，转化为标准格式
video_data['weekday'] = video_data.日期.dt.weekday #获取星期几
video_data['hour'] = video_data.日期.dt.hour #获取小时
</code></pre>设置一个四舍五入代码<pre><code>#用于计算三连、弹幕、评论率
def new_round(_float, _len):
 if isinstance(_float, float):
 if str(_float)[::-1].find('.') &lt;= _len:
 return(_float)
 if str(_float)[-1] == '5':
 return(round(float(str(_float)[:-1]+'6'), _len))
 else:
 return(round(_float, _len))
 else:
 return(round(_float, _len))
</code></pre>计算三连等比率<pre><code>video_data['点赞率']=new_round(video_data.点赞/video_data.播放*100,0)
video_data['硬币率']=new_round(video_data.硬币/video_data.播放*100,0)
video_data['收藏率']=new_round(video_data.收藏/video_data.播放*100,0)
video_data['转发率']=new_round(video_data.转发/video_data.播放*100,0)
video_data['弹幕率']=new_round(video_data.弹幕/video_data.播放*100,0)
video_data['评论率']=new_round(video_data.评论/video_data.播放*100,1)
</code></pre>查看处理后的数据<pre><code>video_data.head()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/595134780.png" alt="处理后的数据" title="处理后的数据" style=""><h3>数据分析</h3>查看一共有几位UP主<pre><code>print('共有{}位UP，分别是'.format(len(video_data['UP'].unique())))#unique是将重复的去除
video_data['UP'].unique()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4264157629.png" alt="UP主的数量" title="UP主的数量" style="">统计每个播放量区间的视频数量<pre><code># 计算每个播放量区间的视频数量
data_length = len(video_data)
data_length_rate0 = len(video_data[video_data['播放']&lt;10000])
data_length_rate1 = len(video_data[(video_data['播放']&gt;=10000) &amp; (video_data['播放']&lt;100000)])
data_length_rate2 = len(video_data[(video_data['播放']&gt;=100000) &amp; (video_data['播放']&lt;500000)])
data_length_rate3 = len(video_data[(video_data['播放']&gt;=500000) &amp; (video_data['播放']&lt;1000000)])
data_length_rate4 = len(video_data[video_data['播放']&gt;1000000])</code></pre>结果展示<pre><code>video_rate = [data_length_rate0,data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[0,9999]','[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
video_rate
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/252521241.png" alt="结果查看" title="结果查看" style="">画图展示<pre><code># 画出饼图
plt.rcParams['font.sans-serif']=['SimHei'] # 中文不乱码
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') #画饼图（数据，数据对应的标签，百分数保留两位小数点）
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;区间&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1750410179.png" alt="扇形图展示" title="扇形图展示" style="">Show only the videos with at least 10,000 plays<pre><code>video_rate = [data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart; percentages are shown with two decimal places
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;区间&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3312130753.png" alt="一万以上的展示图" title="一万以上的展示图" style="">Count how often each UP appears among the 20 most-played videos<pre><code># UPs behind the 20 most-played videos
top_20=video_data.sort_values(by=['播放'],ascending=False)[:20]
top_20['UP'].value_counts()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2744268262.png" alt="数量展示" title="数量展示" style="">Details of the top 20 videos<pre><code># UP, plays, and follower count for the top 20 videos
top_20[['UP','播放','粉丝']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4235206928.png" alt="前20的视频基本数据" title="前20的视频基本数据" style="">Group by UP and rank the UPs by their total plays in August<pre><code># Group by UP and rank by total plays in August
print(video_data.groupby('UP')['播放'].sum().sort_values(ascending=False)[:20])
top_1 = video_data[video_data['UP']=='大霓奈']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='陈师姬']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='17岁反派里的持枪Boy']
print(top_1['UP'].value_counts())
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/405809024.png" alt="播放量和展示" title="播放量和展示" style="">Rank the UPs by total danmaku count<pre><code># Rank UPs by total danmaku count
video_data.groupby('UP')['弹幕'].sum().sort_values(ascending=False)[:20]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4235494730.png" alt="弹幕数展示" title="弹幕数展示" style="">Rank the UPs by total comment count<pre><code># Rank UPs by total comment count
video_data.groupby('UP')['评论'].sum().sort_values(ascending=False)[:20]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2808414200.png" alt="评论数展示" title="评论数展示" style="">Number of videos per UP<pre><code># Number of videos each UP posted in August
video_data['UP'].value_counts()[:20]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3380581576.png" alt="视频数量展示" title="视频数量展示" style="">Number of videos posted in each hour of the day, split by weekday<pre><code># Number of videos by posting hour, one line per weekday
fig1, ax1=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour', 'weekday']).count()['mid'].unstack()
df.plot(ax=ax1, style='-.')
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4179627992.png" alt="视频数量随时间分布图" title="视频数量随时间分布图" style=""><pre><code># Total plays by posting hour, one line per weekday
fig2,ax2=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour','weekday'])['播放'].sum().unstack()
df.plot(ax=ax2,style='-.')
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3903313407.png" alt="视频播放量总和随时间分布" title="视频播放量总和随时间分布" style=""><pre><code># Number of videos with more than 10,000 plays by posting hour, one line per weekday
view_1 = video_data[video_data['播放']&gt;10000]
fig2,ax2=plt.subplots(figsize=(14,4))
df=view_1.groupby(['hour','weekday'])['mid'].count().unstack()
df.plot(ax=ax2,style='-.')
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1458496303.png" alt="大于一万播放量的视频数量随时间分布图" title="大于一万播放量的视频数量随时间分布图" style=""><pre><code>view_2 = video_data[video_data['播放']&gt;100000]
view_3 = video_data[video_data['播放']&gt;1000000]
</code></pre>Show the hot words as a word cloud<pre><code>matplotlib.rcParams['font.sans-serif'] = ['KaiTi'] # font for Chinese text in plots
matplotlib.rcParams['font.serif'] = ['KaiTi'] # font for Chinese text in plots
with open(&quot;D:/stopwords.txt&quot;,encoding='utf-8') as infile:
    stopwords_lst = infile.readlines()
STOPWORDS = [x.strip() for x in stopwords_lst] # strip leading and trailing whitespace
stopwords = set(STOPWORDS) # stop words to exclude from the cloud

def ciyun(texts,mid='all'): # pass a mid to restrict the cloud to one UP
    if mid == 'all':
        text = ' '.join(texts)
    else:
        text = ' '.join(texts[video_data['mid']==mid])

    wc = WordCloud(font_path=&quot;msyh.ttc&quot;, background_color='white', max_words=100, stopwords=stopwords, max_font_size=80, random_state=42, margin=3) # word cloud settings
    wc.generate(text) # build the word cloud
    plt.imshow(wc,interpolation=&quot;bilinear&quot;) # draw it
    plt.axis(&quot;off&quot;) # hide the axes

ciyun(video_data['标题'])</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1822230669.png" alt="热词" title="热词" style="">Hot words in the titles of videos with more than 10,000 plays<pre><code>ciyun(view_1['标题'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1267505390.png" alt="热词" title="热词" style="">Hot words in the titles of videos with more than 100,000 plays<pre><code>ciyun(view_2['标题'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1035080606.png" alt="热词" title="热词" style="">Hot words in the titles of videos with more than 1,000,000 plays<pre><code>ciyun(view_3['标题'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1430339521.png" alt="热词" title="热词" style="">The same for the tags, first over all videos and then over the three play-count thresholds<pre><code>ciyun(video_data['标签'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3222928523.png" alt="热词" title="热词" style=""><pre><code>ciyun(view_1['标签'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/31297922.png" alt="热词" title="热词" style=""><pre><code>ciyun(view_2['标签'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4169206333.png" alt="热词" title="热词" style=""><pre><code>ciyun(view_3['标签'])
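</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1576138282.png" alt="热词" title="热词" style="">Count the videos whose title contains 老师 (teacher) and sum their plays; the same is then done for the other hot words<pre><code># Number of videos whose title contains 老师, and their total plays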
video_teacher = video_data[video_data['标题'].str.contains(&quot;老师&quot;)]
teacher = [len(video_teacher),video_teacher['播放'].sum()]
teacher
</code></pre>[3022, 16610756]<pre><code>video_bro = video_data[video_data['标题'].str.contains(&quot;兄弟&quot;)]
brother = [len(video_bro),video_bro['播放'].sum()]
brother
</code></pre>[1897, 25270292]<pre><code>video_girlfriend = video_data[video_data['标题'].str.contains(&quot;女朋友&quot;)]
girlfriend = [len(video_girlfriend),video_girlfriend['播放'].sum()]
girlfriend
</code></pre>[830, 28265224]<pre><code># Videos whose title contains both 女朋友 and 兄弟
fun = video_girlfriend[video_girlfriend['标题'].str.contains(&quot;兄弟&quot;)].drop_duplicates()
print(len(fun))
print(fun)
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/812799961.png" alt="女朋友的标题中包含兄弟的视频信息" title="女朋友的标题中包含兄弟的视频信息" style=""><pre><code>video_yidan = video_data[video_data['标题'].str.contains(&quot;一旦&quot;)]
yidan = [len(video_yidan),video_yidan['播放'].sum()]
yidan
</code></pre>[89, 28302099]<pre><code>video_wubei = video_data[video_data['标题'].str.contains(&quot;吾辈&quot;)]
wubei = [len(video_wubei),video_wubei['播放'].sum()]
wubei
</code></pre>[318, 35563900]<pre><code>video_waizui = video_data[video_data['标题'].str.contains(&quot;歪嘴&quot;)]
waizui = [len(video_waizui),video_waizui['播放'].sum()]
waizui
</code></pre>[1810, 70787655]Pie chart of the number of videos whose title contains each of these hot words<pre><code>video_rate = [teacher[0],brother[0],girlfriend[0],yidan[0],wubei[0],waizui[0]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart; percentages are shown with two decimal places
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;热词&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;数量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/103150957.png" alt="饼图" title="饼图" style="">Pie chart of the total plays of videos whose title contains each hot word<pre><code>video_rate = [teacher[1],brother[1],girlfriend[1],yidan[1],wubei[1],waizui[1]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart; percentages are shown with two decimal places
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;热词&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2312876709.png" alt="饼图" title="饼图" style="">The same analysis for the tags: video counts first, then total plays<pre><code>video_bijian = video_data[video_data['标签'].str.contains(&quot;必剪&quot;)]
bijian = [len(video_bijian),video_bijian['播放'].sum()]
video_fun = video_data[video_data['标签'].str.contains(&quot;恶作剧&quot;)]
fun = [len(video_fun),video_fun['播放'].sum()]
video_tc = video_data[video_data['标签'].str.contains(&quot;吐槽&quot;)]
tc = [len(video_tc),video_tc['播放'].sum()]
video_beauty = video_data[video_data['标签'].str.contains(&quot;美女&quot;)]
beauty = [len(video_beauty),video_beauty['播放'].sum()]
video_wezy = video_data[video_data['标签'].str.contains(&quot;万恶之源&quot;)]
wezy = [len(video_wezy),video_wezy['播放'].sum()]
video_show = video_data[video_data['标签'].str.contains(&quot;表演&quot;)]
show = [len(video_show),video_show['播放'].sum()]
video_tuwei = video_data[video_data['标签'].str.contains(&quot;土味&quot;)]
tuwei = [len(video_tuwei),video_tuwei['播放'].sum()]

video_rate = [bijian[0],fun[0],tc[0],beauty[0],wezy[0],show[0],tuwei[0]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart; percentages are shown with two decimal places
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;标签&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;数量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3914545782.png" alt="饼图" title="饼图" style=""><pre><code>video_rate = [bijian[1],fun[1],tc[1],beauty[1],wezy[1],show[1],tuwei[1]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart; percentages are shown with two decimal places
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;标签&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show() 
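</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1246096207.png" alt="饼图" title="饼图" style="">Rank the videos with more than 10,000 plays by like rate, coin rate, favorite rate, share rate, danmaku rate, and comment rate<pre><code># Top 20 by each engagement rate, among videos with more than 10,000 plays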
video_rate = video_data[video_data['播放']&gt;10000]
like_20=video_rate.sort_values(by=['点赞率'],ascending=False)[:20]
coin_20=video_rate.sort_values(by=['硬币率'],ascending=False)[:20]
sc_20=video_rate.sort_values(by=['收藏率'],ascending=False)[:20]
share_20=video_rate.sort_values(by=['转发率'],ascending=False)[:20]
danmu_20=video_rate.sort_values(by=['弹幕率'],ascending=False)[:20]
command_20=video_rate.sort_values(by=['评论率'],ascending=False)[:20]

like_20[['标题','播放','UP','点赞','点赞率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3390342330.png" alt="点赞率前20" title="点赞率前20" style=""><pre><code>coin_20[['标题','播放','UP','硬币','硬币率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2783574875.png" alt="硬币率前20" title="硬币率前20" style=""><pre><code>sc_20[['标题','播放','UP','收藏','收藏率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4152588626.png" alt="收藏率前20" title="收藏率前20" style=""><pre><code>share_20[['标题','播放','UP','转发','转发率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2861268934.png" alt="分享率前20" title="分享率前20" style=""><pre><code>danmu_20[['标题','播放','UP','弹幕','弹幕率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1514254705.png" alt="弹幕率前20" title="弹幕率前20" style=""><pre><code>command_20[['标题','播放','UP','评论','评论率']]
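</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2018548542.png" alt="评论率前20" title="评论率前20" style=""><h2>Conclusions</h2><ul><li>Most videos in the 生活搞笑 (life and comedy) section stay below 10,000 plays: 93.86% of them.</li><li>There are three routes to high play counts: follower count, video quality, and video volume.</li><li>Uploading in bulk every month is a viable way to accumulate plays. Of the two UPs with the highest total plays, one posted 154 videos and the other 528.</li><li>For danmaku and comments, UPs with many followers have the edge, because their followers are more engaged.</li><li>The most prolific UP in August was 老年人诱捕大队队长, with 6,932 videos.</li><li>Most videos are posted between 10:00 and 24:00, and this window also has the highest total plays.</li><li>August's hot words were mainly 龙王 (Dragon King), holiday-related terms, Bilibili campaigns, and the UPs associated with them.</li><li>Videos carrying Bilibili campaign hot words generally have low play counts; hot words tied to specific UPs or to the month's memes pay off best.</li><li>The like, coin, and favorite rates, the danmaku rate, the share rate, and the comment rate have little effect on a video's play count.</li></ul><h2>Resources</h2><button class=" btn m-b-xs btn-success btn-addon" onclick="window.open('https://github.com/ndmiao/bilibili-data/','_blank')">Code and data</button>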
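The last conclusion above, that the engagement rates have little effect on play count, comes from reading the top-20 tables. One way to check it numerically is a rank correlation between each rate column and the play count. The following is only a minimal sketch under stated assumptions: the helper name rate_play_correlation is hypothetical, the toy DataFrame is invented for illustration and is not taken from the crawled data, and Spearman's coefficient is computed as the Pearson correlation of the ranks. The column names 播放, 点赞率, 硬币率, and 收藏率 follow the ones used earlier in this post.

```python
import pandas as pd


def rate_play_correlation(df, rate_cols, play_col="播放"):
    """Spearman correlation of each rate column with the play count.

    Spearman's coefficient equals the Pearson correlation of the ranks,
    so the columns are ranked first and then passed to corr().
    """
    ranked = df[[play_col] + list(rate_cols)].rank()
    return ranked.corr(method="pearson")[play_col].drop(play_col)


# Toy data invented for illustration only; not taken from the crawl.
demo = pd.DataFrame({
    "播放": [100, 2000, 30000, 400000, 5000000],
    "点赞率": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with plays
    "硬币率": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls with plays
    "收藏率": [2.0, 5.0, 1.0, 4.0, 3.0],  # no clear pattern
})

result = rate_play_correlation(demo, ["点赞率", "硬币率", "收藏率"])
print(result)
```

On the real data, the same call would be rate_play_correlation(video_data, ['点赞率','硬币率','收藏率','转发率','弹幕率','评论率']). Coefficients close to 0 would be consistent with the conclusion, while values near 1 or -1 would argue against it. A rank correlation is used here because play counts are heavily skewed, which tends to distort a plain Pearson coefficient.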
