
A small Python crawler for Kaixin001 and Douban diaries

Date: 2022-06-14 16:53:22

Project URL:

https://github.com/aturret/python-crawler-exercise

The scripts use BeautifulSoup4, so install it first.

pip install beautifulsoup4

Crawling Kaixin001 diaries

kaixin001.py

Usage

Log in to Kaixin001, open the browser developer tools (F12), look at the headers of an HTTP request, and copy your own cookie.

Fill in the cookie, the URL of the diary to start from, and the total number of diaries to fetch. Then run the script.
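The fetch count has to be chosen by hand. According to the comments in the script, the "next" link of the last diary points back to the first one, so a run could also stop on its own once a URL repeats. The following is only a minimal sketch of that idea: `crawl_until_repeat`, `fetch_and_get_next`, and the `fake_site` dictionary are hypothetical names standing in for the real download step, not part of the project.

```python
import time


def crawl_until_repeat(start_url, fetch_and_get_next, delay=0.0):
    """Follow "next" links until a URL repeats; return the URLs in visit order.

    fetch_and_get_next is a hypothetical callable: it is expected to save one
    diary and return the URL of the next one.
    """
    visited = []
    seen = set()
    url = start_url
    while url not in seen:
        seen.add(url)
        visited.append(url)
        url = fetch_and_get_next(url)
        time.sleep(delay)  # throttle between requests
    return visited


# Stand-in for the real site: three diaries whose "next" links form a ring.
fake_site = {"/diary/1": "/diary/2", "/diary/2": "/diary/3", "/diary/3": "/diary/1"}
order = crawl_until_repeat("/diary/2", fake_site.get)
print(order)
```

With the real site, `delay` would probably be set to about one second, matching the pause used in the script.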

The script then generates one HTML file per diary, named <title>-<YYYYMMDDHHMMSS>.
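The file name is built from the diary title and the publication time with its punctuation stripped. A minimal sketch of that step follows, under stated assumptions: `make_file_name`, the `FORBIDDEN` set (characters Windows rejects in file names), and the sample date string are illustrative and not taken from the project. The scripts themselves swap ":" for a full-width colon instead of dropping it.

```python
import string

# Characters that Windows does not allow in file names (assumed target OS).
FORBIDDEN = '\\/:*?"<>|'


def make_file_name(title, date_text):
    """Build '<title>-<date without punctuation>.html'."""
    # Drop forbidden characters and surrounding whitespace from the title.
    safe_title = title.translate(str.maketrans("", "", FORBIDDEN)).strip()
    # Remove whitespace and ASCII punctuation from the date string.
    date_table = str.maketrans("", "", string.whitespace + string.punctuation)
    stamp = date_text.translate(date_table)
    return f"{safe_title}-{stamp}.html"


name = make_file_name('My "first" diary: part 1', "2010-05-01 12:30:45")
print(name)
```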

Code

# -*- coding: utf-8 -*-
import urllib.request  # to get the HTTP response
from bs4 import BeautifulSoup  # BS4
import string  # to strip whitespace and punctuation
import time  # to keep the cookie from being invalidated
import unicodedata  # character normalization

# Put the first link here
urlx = 'link'  # URL of the diary you want to start from


def request(url):
    global urlx  # the module-level link; it is updated to the next diary for the loop

    # Send the cookie with urllib and get the HTTP response
    headers = {
        'Host': 'www.kaixin001.com',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cookie': '',  # replace with your own cookie, taken from the request headers in the browser dev tools (F12)
    }
    req = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(req)
    contents = response.read()

    # Parse the HTML with BS4
    bsObj = BeautifulSoup(contents, 'html.parser')

    # Use BS4's find() to get what we want: title, publication time and body
    title = bsObj.find('b', attrs={'class': 'f14'})  # the diary title is a <b> tag with class f14
    titleT = title.get_text()
    date = bsObj.find('span', attrs={'class': 'c6'})  # the publication time is a <span> tag with class c6
    dateT = date.get_text()
    text = bsObj.find('div', attrs={'class': 'textCont'})  # the body is a <div> tag with class textCont

    # Test output
    print(title)
    print(dateT)
    # print(text)

    # Generate the HTML file. This just uses open() and write(); a template engine such as jinja2 would also work.
    remove = string.whitespace + string.punctuation
    table = str.maketrans(':', '：', remove)
    fileTitle = str(titleT).replace(':', '：').replace('"', '“') + '-' + str(dateT).translate(table).replace('发表', '') + '.html'
    print(fileTitle)  # test output

    # Write the message. Use utf-8 explicitly, otherwise old posts encoded in gbk cause problems.
    message = '''
<html>
<head></head>
<body>
<h1>%s</h1>
<b>%s</b>
<br></br>
%s
</body>
</html>''' % (title.get_text(), date.get_text(), unicodedata.normalize('NFD', text.prettify()))
    with open(fileTitle, 'w', encoding='utf-8') as f:
        f.write(message)
    # webbrowser.open(fileTitle, new=1)

    # Locate the URL of the next diary. "Next" is an <a> tag; read its href through the tag's attrs.
    # On Kaixin001 the "next" link of the last diary points to the first diary,
    # so it does not matter which diary the crawl starts from.
    nextUrl = bsObj.find('a', string='下一篇 >').attrs['href']
    # print(nextUrl)
    urlx = 'http://www.kaixin001.com' + nextUrl
    print(urlx)


# Main loop: start crawling
num = 328  # how many diaries to fetch. A list that detects repeats and stops would also work, but I did not bother.
for a in range(num):
    request(urlx)
    print('We get ' + str(a + 1) + ' in ' + str(num))
    time.sleep(1)  # Slow down. Without a limit, the cookie expired halfway through a test run, probably because the requests were too fast.

Crawling Douban diaries

douban.py

Usage

Log in to Douban, open the browser developer tools (F12), look at the headers of an HTTP request, and copy your own cookie.
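The copied cookie is sent as an ordinary `Cookie` request header. As a minimal sketch of that step, the snippet below builds a `urllib.request.Request` that carries the cookie without sending anything; `build_request`, the placeholder cookie value, and the shortened user agent are assumptions made for illustration.

```python
import urllib.request

COOKIE = "name=value"  # placeholder; paste the Cookie header value copied from the browser


def build_request(url, cookie):
    """Return a urllib Request that carries the login cookie (nothing is sent yet)."""
    headers = {
        "User-Agent": "Mozilla/5.0",  # shortened placeholder user agent
        "Cookie": cookie,
    }
    return urllib.request.Request(url=url, headers=headers)


req = build_request("https://www.douban.com/people/xxx/notes", COOKIE)
print(req.get_header("Cookie"))
# The request would then be sent with urllib.request.urlopen(req).
```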

Fill in the COOKIE variable and the URL of the diary list page to crawl. Then run the script.

The script then generates one HTML file per diary, named <title>-<YYYYMMDDHHMMSS>.

Code

# -*- coding: utf-8 -*-
import urllib.request  # to get the HTTP response
from bs4 import BeautifulSoup  # BS4
import string  # to strip whitespace and punctuation
import unicodedata  # character normalization

# Put the link here
url = ''  # the person whose diaries you want, in the form https://www.douban.com/people/xxx/notes
COOKIE = ''  # replace with your own cookie, taken from the request headers in the browser dev tools (F12)


def buildHeaders():
    # Headers shared by the list-page and single-page requests
    return {
        'Host': 'www.douban.com',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cookie': COOKIE,
    }


def request(urlx):
    global url  # the module-level link; it is updated to the next list page for the loop
    global boolean

    # Send the cookie with urllib and get the HTTP response
    req = urllib.request.Request(url=urlx, headers=buildHeaders())
    response = urllib.request.urlopen(req)
    contents = response.read()

    # Parse the HTML with BS4
    bsObj = BeautifulSoup(contents, 'html.parser')

    # Use BS4 to collect every diary link on the current list page
    article = bsObj.find('div', attrs={'class': 'article'})
    titleSet = article.find_all('h3')
    # print(titleSet)
    for title in titleSet:
        titleText = title.find_all('a', attrs={'class': 'j a_unfolder_n'})
        for link in titleText:
            noteUrl = str(link.attrs['href'])
            print(noteUrl)
            requestSinglePage(noteUrl)

    # Follow the "next page" link, or stop when there is none
    nextLink = bsObj.find('a', string='后页>')
    if nextLink is None:
        print('Finished')
        boolean = 1
    else:
        url = str(nextLink.attrs['href']).replace('&type=note', '')
        print(url)


def requestSinglePage(urly):
    req = urllib.request.Request(url=urly, headers=buildHeaders())
    response = urllib.request.urlopen(req)
    contents = response.read()

    # Parse the HTML with BS4
    bsObj = BeautifulSoup(contents, 'html.parser')

    # Use BS4's find() to get what we want: title, publication time and body
    title = bsObj.find('h1').get_text()
    dateT = bsObj.find('span', attrs={'class': 'pub-date'}).get_text()
    text = bsObj.find('div', attrs={'id': 'link-report'})

    # Test output
    print(title)
    print(dateT)

    # Generate the HTML file. This just uses open() and write(); a template engine such as jinja2 would also work.
    remove = string.whitespace + string.punctuation  # strip the punctuation from the date
    table = str.maketrans(':', '：', remove)
    fileTitle = str(title) + '-' + str(dateT).translate(table) + '.html'
    print(fileTitle)  # test output

    # Write the message. Use utf-8 explicitly, otherwise old posts encoded in gbk cause problems.
    message = '''
<html>
<head></head>
<body>
<h1>%s</h1>
<b>%s</b>
<br></br>
%s
</body>
</html>''' % (title, dateT, unicodedata.normalize('NFD', text.prettify()))
    with open(fileTitle, 'w', encoding='utf-8') as f:
        f.write(message)


# Main loop: start crawling
boolean = 0
a = 1  # page counter; initialised outside the loop so that it actually increases
while boolean == 0:
    request(url)
    print('We finished page ' + str(a) + ' .')
    a += 1

Roadmap

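Back in April Douban still had a bug: the mobile site showed every diary, so the "hide after half a year" setting had no effect. It has been fixed recently.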

However, the hiding is still not applied to each individual diary, so it may be possible to crawl hidden diaries by other means.
