peterrosetu/magnetW #1

Open: wants to merge 92 commits into base: master

92 commits
384e52d
init code
xiyouMc Apr 8, 2017
c57137f
update md.
xiyouMc Apr 8, 2017
e9c738f
update img.
xiyouMc Apr 9, 2017
cbe27d6
update md
xiyouMc Apr 10, 2017
0e15309
rm log
xiyouMc Apr 10, 2017
e678e0f
Delete nohup.out
xiyouMc Apr 10, 2017
b95ce8b
Delete cataline.log
xiyouMc Apr 10, 2017
6e37bb6
Merge branch 'master' of github.com:xiyouMc/PornHubBot
xiyouMc Apr 10, 2017
683aa6c
Update README.md
xiyouMc Apr 10, 2017
32b32d6
Update README.md
xiyouMc Apr 10, 2017
b8b2141
Update README.md
xiyouMc Apr 10, 2017
ff0f478
update
xiyouMc Apr 10, 2017
ca7f47a
Update README.md
xiyouMc Apr 10, 2017
de52ce0
add weChat
xiyouMc Apr 10, 2017
b6948cf
merge.
xiyouMc Apr 10, 2017
73a9a67
Update README.md
xiyouMc Apr 10, 2017
2b035e6
update md
xiyouMc Apr 10, 2017
09fdf11
update MD
xiyouMc Apr 10, 2017
a396a73
add WeChat QRCODE
xiyouMc Apr 10, 2017
372ed59
add png.
xiyouMc Apr 10, 2017
254cb49
Update docs
xiyouMc Apr 10, 2017
0abdfeb
add pic.
xiyouMc Apr 10, 2017
237afd3
update README
xiyouMc Apr 10, 2017
f1cf173
update readme.
xiyouMc Apr 10, 2017
7a6e067
format
xiyouMc Apr 10, 2017
d63742a
update
xiyouMc Apr 10, 2017
342bbb6
add pic
xiyouMc Apr 10, 2017
ed3fbe0
update
xiyouMc Apr 10, 2017
9855e4a
update readme.
xiyouMc Apr 10, 2017
4731604
update REDEME.
xiyouMc Apr 11, 2017
e8c1712
update
xiyouMc Apr 11, 2017
759712f
rm unuse.
xiyouMc Apr 11, 2017
794532b
update Code
xiyouMc Apr 11, 2017
8e6b833
update code.
xiyouMc Apr 11, 2017
06e2339
update code.
xiyouMc Apr 11, 2017
d9bc5a7
add qq group
xiyouMc Apr 11, 2017
3b1b377
update.
xiyouMc Apr 11, 2017
7ecb0c8
update os
xiyouMc Apr 11, 2017
b5f3238
add png
xiyouMc Apr 11, 2017
0344d9e
update
xiyouMc Apr 11, 2017
34043e0
update
xiyouMc Apr 11, 2017
46d3357
add png
xiyouMc Apr 11, 2017
ae6f9ba
add png
xiyouMc Apr 11, 2017
a586b02
update code
xiyouMc Apr 11, 2017
2b1b1a2
update code
xiyouMc Apr 11, 2017
5b1f88a
update readme.
xiyouMc Apr 11, 2017
ea2ad30
update md.
xiyouMc Apr 11, 2017
2889a18
update.
xiyouMc Apr 11, 2017
e7afca2
Rename README_zn.md to README_zh.md
Apr 13, 2017
8f7d889
Merge pull request #8 from discountry/patch-1
xiyouMc Apr 13, 2017
db32b9c
update readme.
xiyouMc Apr 13, 2017
2ea5dff
Use proper comma instead of 、
frdmn Apr 13, 2017
f083de8
Update README.md
shiroming Apr 13, 2017
4c1a5ee
Merge pull request #10 from shiroming/patch-1
xiyouMc Apr 13, 2017
33c9a07
Merge branch 'master' into patch-2
frdmn Apr 13, 2017
38b1e23
Merge pull request #9 from frdmn/patch-2
xiyouMc Apr 13, 2017
dd4d17a
Fix grammar in the readme
cfrank Apr 14, 2017
6498033
Merge pull request #12 from cfrank/patch-1
xiyouMc Apr 16, 2017
f43e56e
update png.
xiyouMc Apr 18, 2017
06c54be
fix bug: PornHub changed the next-page tag by adding a space.
xiyouMc May 3, 2017
80708ee
Beautify Code.
xiyouMc May 8, 2017
82f4468
pip freeze
xiyouMc May 11, 2017
012c16c
update md
xiyouMc Jun 5, 2017
0e77b45
update readme
xiyouMc Jun 22, 2017
7dfc368
update readme
xiyouMc Jun 22, 2017
27e0594
Update requirements.txt
shiroming Jun 23, 2017
fee4b05
Update README.md
shiroming Jun 23, 2017
1637318
Update README_zh.md
shiroming Jun 23, 2017
00b597e
Merge pull request #22 from shiroming/master
xiyouMc Jun 23, 2017
eb2f6d0
Add QQ group
xiyouMc Jul 24, 2017
1f64390
merge code
xiyouMc Jul 24, 2017
544e3ce
add csv
xiyouMc Aug 23, 2017
db0859e
Prevent inserting duplicate records into DB
fckwall Oct 9, 2017
bff62bb
Merge pull request #29 from fckwall/patch-1
xiyouMc Oct 9, 2017
608fcc6
Update qrcode
JakeWharton Oct 10, 2017
7eab80d
Merge branch 'master' of github.com:xiyouMc/WebHubBot
JakeWharton Oct 10, 2017
611e48c
update readme.
xiyouMc Oct 10, 2017
6b4de70
update readme.
xiyouMc Oct 10, 2017
c2f049e
remove code
xiyouMc Oct 10, 2017
d4e7314
Fix a memory leak issue
fckwall Oct 12, 2017
4ca388a
Merge pull request #30 from fckwall/patch-2
xiyouMc Oct 17, 2017
93e5e8b
parse page.
Feb 4, 2018
7cec0ac
Delete log info.
Feb 9, 2018
8d5d776
Merge pull request #37 from Blavtes/dev
xiyouMc Feb 9, 2018
2c6ed07
Add instruction of specifying categories
fckwall Feb 24, 2018
ed324f0
Merge pull request #38 from fckwall/feature/custom-fetch-types
xiyouMc Feb 25, 2018
dff24a1
update
xiyouMc Mar 12, 2018
3607cb1
merge code
xiyouMc Mar 12, 2018
d28090b
Update README.md
xiyouMc Mar 12, 2018
ca5052d
fix bug
xiyouMc Apr 4, 2018
f1f866b
merge
xiyouMc Apr 4, 2018
dc1295b
Update requirements.txt
xiyouMc Aug 15, 2018
Binary file removed .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1 +1,4 @@
*.pyc
*.DS_Store
*.log
*.out
4 changes: 4 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,4 @@
{
"python.linting.flake8Enabled": false,
"python.linting.pylintEnabled": false
}
Binary file removed PornHub/.DS_Store
Binary file not shown.
9 changes: 0 additions & 9 deletions PornHub/PornHub/pornhub_type.py

This file was deleted.

85 changes: 53 additions & 32 deletions README.md
@@ -1,52 +1,73 @@
关注我的公众号:DeveloperPython 收到实时的项目动态。

![][py2x] [![GitHub forks][forks]][network] [![GitHub stars][stars]][stargazers] [![GitHub license][license]][lic_file]
> Disclaimer: This project is intended for studying the Scrapy spider framework and the MongoDB database. It must not be used for commercial or other personal purposes. Any consequences of improper use are borne by the individual.

> 声明:本项目旨在学习Scrapy爬虫框架和MongoDB数据库,不可使用于商业和个人其他意图。若使用不当,均由个人承担。
* The project crawls the world's largest adult website, retrieving each video's title, duration, mp4 link, cover URL, and direct page URL.
* The crawl targets PornHub.com; the structure is simple and the speed is high.
* It can crawl upwards of 5 million videos per day, depending on your network. My home connection is slow, so my own results are comparatively modest.
* That speed is reached with 10 concurrent request threads. On a better network you can start more threads and crawl more videos per day; see [pre-boot configuration] for the specific settings.

* 项目主要是爬取全球最大成人网站PornHub的视频标题、时长、mp4链接、封面URL和具体的PornHub链接
* 项目爬的是PornHub.com,结构简单,速度飞快
* 爬取PornHub视频的速度可以达到500万/天以上。具体视个人网络情况,因为我是家庭网络,所以相对慢一点。
* 10个线程同时请求,可达到如上速度。若个人网络环境更好,可启动更多线程来请求,具体配置方法见 [启动前配置]

## Environment, Architecture

## 环境、架构
Language: Python 2.7

开发语言: Python2.7
Environment: macOS, 4 GB RAM

开发环境: MacOS系统、4G内存
Database: MongoDB

数据库: MongoDB
* Mainly uses the Scrapy crawler framework.
* A cookie and user agent are drawn at random from the cookie pool and UA pool and attached to the Spider.
* start_requests launches five Requests based on the site's categories, crawling the five categories concurrently.
* Paginated results are also supported; follow-up pages are added to the pending crawl queue.

* 主要使用 scrapy 爬虫框架
* 从Cookie池和UA池中随机抽取一个加入到Spider
* start_requests 根据 PorbHub 的分类,启动了5个Request,同时对五个分类进行爬取。
* 并支持分页爬取数据,并加入到待爬队列。
## Instructions for use

## 使用说明
### Pre-boot configuration

### 启动前配置
* Install MongoDB and start it; no configuration is required
* Install the Python dependencies: Scrapy, pymongo, requests, or run `pip install -r requirements.txt`
* Adjust the configuration as needed, e.g. the request interval and the number of threads

* 安装MongoDB,并启动,不需要配置
* 安装Scrapy
* 安装Python的依赖模块:pymongo、json、requests
* 根据自己需要修改 Scrapy 中关于 间隔时间、启动Requests线程数等得配置
### Start up

### 启动
* cd WebHub
* python quickstart.py

* python PornHub/quickstart.py

## 运行截图
## Run screenshots
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/running.png?raw=true)
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/mongodb.png?raw=true)

## 数据库说明
## Database description

数据库中保存数据的表是 PhRes。以下是字段说明:
The table in the database that holds the data is PhRes. The following is a field description:

#### PhRes 表:

video_title:视频的标题,并作为唯一标识.
link_url:视频调转到PornHub的链接
image_url:视频的封面链接
video_duration:视频的时长,以 s 为单位
quality_480p: 视频480p的 mp4 下载地址
#### PhRes table:

video_title: The title of the video, used as the unique identifier.
link_url: Link to the video's page on the site
image_url: URL of the video's cover image
video_duration: Length of the video, in seconds
quality_480p: Download URL of the 480p mp4

## For Chinese

* 关注微信公众号,学习Python开发

<img src="https://github.com/xiyouMc/WebHubBot/blob/master/img/gongzhonghao.png?raw=true" width = "800" height = "400" alt="图片名称" align=center />



[py2x]: https://img.shields.io/badge/python-2.x-brightgreen.svg
[issues_img]: https://img.shields.io/github/issues/xiyouMc/WebHubBot.svg
[issues]: https://github.com/xiyouMc/WebHubBot/issues

[forks]: https://img.shields.io/github/forks/xiyouMc/WebHubBot.svg
[network]: https://github.com/xiyouMc/WebHubBot/network

[stars]: https://img.shields.io/github/stars/xiyouMc/WebHubBot.svg
[stargazers]: https://github.com/xiyouMc/WebHubBot/stargazers

[license]: https://img.shields.io/badge/license-MIT-blue.svg
[lic_file]: https://raw.githubusercontent.com/xiyouMc/WebHubBot/master/LICENSE
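
The 5-million-per-day claim in the README above can be sanity-checked with back-of-the-envelope arithmetic. The listing-page size and per-thread page rate below are illustrative assumptions, not measured values:

```python
SECONDS_PER_DAY = 86400
THREADS = 10                    # concurrent request threads (from the README)
VIDEOS_PER_PAGE = 32            # assumption: typical listing-page size
PAGES_PER_THREAD_PER_SEC = 0.2  # assumption: one listing page every 5 s

videos_per_day = THREADS * PAGES_PER_THREAD_PER_SEC * VIDEOS_PER_PAGE * SECONDS_PER_DAY
print(int(videos_per_day))  # 5529600, in the ballpark of the claimed 5 million
```

With these assumptions throughput scales linearly in the thread count, which is why the README suggests raising it on faster networks.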
79 changes: 79 additions & 0 deletions README_zh.md
@@ -0,0 +1,79 @@
![][py2x] [![GitHub forks][forks]][network] [![GitHub stars][stars]][stargazers] [![GitHub license][license]][lic_file]
> 免责声明:本项目旨在学习Scrapy爬虫框架和MongoDB数据库,不可使用于商业和个人其他意图。若使用不当,均由个人承担。

<img src="https://github.com/xiyouMc/PornHubBot/blob/master/img/WebHubCode2.png?raw=true" width = "700" height = "400" alt="图片名称" align=center />


## 简介

* 项目主要是爬取全球最大成人网站PornHub的视频标题、时长、mp4链接、封面URL和具体的PornHub链接
* 项目爬的是PornHub.com,结构简单,速度飞快
* 爬取PornHub视频的速度可以达到500万/天以上。具体视个人网络情况,因为我是家庭网络,所以相对慢一点。
* 10个线程同时请求,可达到如上速度。若个人网络环境更好,可启动更多线程来请求,具体配置方法见 [启动前配置]


## 环境、架构

开发语言: Python2.7

开发环境: MacOS系统、4G内存

数据库: MongoDB

* 主要使用 scrapy 爬虫框架
* 从Cookie池和UA池中随机抽取一个加入到Spider
* start_requests 根据 PorbHub 的分类,启动了5个Request,同时对五个分类进行爬取。
* 并支持分页爬取数据,并加入到待爬队列。

## 使用说明

### 启动前配置

* 安装MongoDB,并启动,不需要配置
* 安装Python的依赖模块:Scrapy, pymongo, requests 或 `pip install -r requirements.txt`
* 根据自己需要修改 Scrapy 中关于 间隔时间、启动Requests线程数等得配置

### 启动

* python PornHub/quickstart.py

## 运行截图
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/running.png?raw=true)
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/mongodb.png?raw=true)

## 数据库说明

数据库中保存数据的表是 PhRes。以下是字段说明:

#### PhRes 表:

video_title:视频的标题,并作为唯一标识.
link_url:视频调转到PornHub的链接
image_url:视频的封面链接
video_duration:视频的时长,以 s 为单位
quality_480p: 视频480p的 mp4 下载地址

## 自定义

#### 抓取特定类别的视频

如需抓取特定类别的视频,编辑文件 ./PornHub/PornHub/pornhub_type.py

注释掉/删掉不需要的类别,只保留需要的类别。
该文件中已有“类别=日本”和“类别=亚洲”的示例。
若需要其它的类别,请打开pornhub网站的相应类别,
类别ID就在网址里。


[py2x]: https://img.shields.io/badge/python-2.x-brightgreen.svg
[issues_img]: https://img.shields.io/github/issues/xiyouMc/WebHubBot.svg
[issues]: https://github.com/xiyouMc/WebHubBot/issues

[forks]: https://img.shields.io/github/forks/xiyouMc/WebHubBot.svg
[network]: https://github.com/xiyouMc/WebHubBot/network

[stars]: https://img.shields.io/github/stars/xiyouMc/WebHubBot.svg
[stargazers]: https://github.com/xiyouMc/WebHubBot/stargazers

[license]: https://img.shields.io/badge/license-MIT-blue.svg
[lic_file]: https://raw.githubusercontent.com/xiyouMc/WebHubBot/master/LICENSE
File renamed without changes.
4 changes: 3 additions & 1 deletion PornHub/PornHub/items.py → WebHub/WebHub/items.py
@@ -1,11 +1,13 @@
# -*- coding: utf-8 -*-

from scrapy import Item, Field


class PornVideoItem(Item):
video_title = Field()
image_url = Field()
video_duration = Field()
quality_480p = Field()
video_views = Field()
video_rating = Field()
link_url = Field()
link_url = Field()
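
The PornVideoItem in this diff only declares fields; Scrapy's `Item` then behaves like a dict restricted to those fields. A minimal pure-Python stand-in (no Scrapy dependency, names chosen for illustration) sketches that behavior:

```python
class RestrictedItem(dict):
    """Minimal stand-in for scrapy.Item: only declared fields may be set."""
    FIELDS = ('video_title', 'image_url', 'video_duration',
              'quality_480p', 'video_views', 'video_rating', 'link_url')

    def __setitem__(self, key, value):
        if key not in self.FIELDS:
            raise KeyError('field not declared: %s' % key)
        dict.__setitem__(self, key, value)

item = RestrictedItem()
item['video_title'] = 'demo clip'
item['video_duration'] = 120
try:
    item['unknown_field'] = 1   # not declared, so it is rejected
except KeyError as e:
    print('rejected:', e)
```

This is why adding a column to the pipeline always starts with declaring a new `Field()` on the item.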
19 changes: 11 additions & 8 deletions PornHub/PornHub/middlewares.py → WebHub/WebHub/middlewares.py
@@ -3,25 +3,28 @@
from user_agents import agents
import json


class UserAgentMiddleware(object):
""" 换User-Agent """

def process_request(self, request, spider):
agent = random.choice(agents)
request.headers["User-Agent"] = agent


class CookiesMiddleware(object):
""" 换Cookie """
cookie = {
'platform':'pc',
'ss':'367701188698225489',
'bs':'%s',
'RNLBSERVERID':'ded6699',
'FastPopSessionRequestNumber':'1',
'FPSRN':'1',
'performance_timing':'home',
'RNKEY':'40859743*68067497:1190152786:3363277230:1'
'platform': 'pc',
'ss': '367701188698225489',
'bs': '%s',
'RNLBSERVERID': 'ded6699',
'FastPopSessionRequestNumber': '1',
'FPSRN': '1',
'performance_timing': 'home',
'RNKEY': '40859743*68067497:1190152786:3363277230:1'
}

def process_request(self, request, spider):
bs = ''
for i in range(32):
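
The UserAgentMiddleware in this diff swaps the User-Agent per request by sampling the UA pool. A self-contained sketch of the same idea, with an illustrative two-entry pool (the real project imports a much larger list via `from user_agents import agents`) and a hypothetical FakeRequest standing in for scrapy.Request:

```python
import random

# Illustrative UA pool; the real project uses a much larger list.
AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)",
]

class UserAgentMiddleware(object):
    """Attach a randomly chosen User-Agent header to each request."""
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(AGENTS)

class FakeRequest(object):
    """Minimal stand-in for scrapy.Request, just enough for the demo."""
    def __init__(self):
        self.headers = {}

req = FakeRequest()
UserAgentMiddleware().process_request(req, spider=None)
print(req.headers["User-Agent"] in AGENTS)  # True
```

Rotating the User-Agent (and cookies, via the companion CookiesMiddleware) makes the 10 concurrent threads look less like a single client.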
10 changes: 8 additions & 2 deletions PornHub/PornHub/pipelines.py → WebHub/WebHub/pipelines.py
@@ -6,21 +6,27 @@
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from pymongo import IndexModel, ASCENDING
from items import PornVideoItem


class PornhubMongoDBPipeline(object):
def __init__(self):
client = pymongo.MongoClient("localhost", 27017)
db = client["PornHub"]
self.PhRes = db["PhRes"]
idx = IndexModel([('link_url', ASCENDING)], unique=True)
self.PhRes.create_indexes([idx])
# if your existing DB has duplicate records, refer to:
# https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

def process_item(self, item, spider):
print 'MongoDBItem',item
print 'MongoDBItem', item
""" 判断类型 存入MongoDB """
if isinstance(item, PornVideoItem):
print 'PornVideoItem True'
try:
self.PhRes.insert(dict(item))
self.PhRes.update_one({'link_url': item['link_url']}, {'$set': dict(item)}, upsert=True)
except Exception:
pass
return item
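
The pipeline change above replaces a bare insert with `update_one(..., upsert=True)` keyed on `link_url`, backed by a unique index, so a re-crawled video overwrites its existing record instead of creating a duplicate. The dedup effect can be sketched without a running MongoDB, using a hypothetical in-memory stand-in:

```python
class FakeCollection(object):
    """Tiny in-memory stand-in mimicking update_one(..., upsert=True)."""
    def __init__(self):
        self.docs = {}  # link_url -> document

    def update_one(self, query, update, upsert=False):
        key = query['link_url']
        if key in self.docs:
            self.docs[key].update(update['$set'])   # matched: update in place
        elif upsert:
            self.docs[key] = dict(update['$set'])   # no match: insert

col = FakeCollection()
item = {'link_url': 'https://example.com/v1', 'video_title': 'a'}
col.update_one({'link_url': item['link_url']}, {'$set': item}, upsert=True)

# Re-crawling the same link updates the record instead of duplicating it.
col.update_one({'link_url': item['link_url']},
               {'$set': {'link_url': item['link_url'], 'video_title': 'b'}},
               upsert=True)
print(len(col.docs))  # 1
```

The unique index created via `IndexModel([('link_url', ASCENDING)], unique=True)` enforces the same invariant at the database level.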
13 changes: 13 additions & 0 deletions WebHub/WebHub/pornhub_type.py
@@ -0,0 +1,13 @@
#coding:utf-8
"""归纳PornHub资源链接"""
PH_TYPES = [
'',
'recommended',
'video?o=ht', # hot
'video?o=mv', # Most Viewed
'video?o=tr', # Top Rate

# Examples of certain categories
# 'video?c=1', # Category = Asian
# 'video?c=111', # Category = Japanese
]
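
Each PH_TYPES entry is a URL suffix selecting a listing (an ordering or a category). A sketch of how such suffixes might be expanded into start URLs; the base URL and helper name here are illustrative, not the project's actual code:

```python
PH_TYPES = [
    '',
    'recommended',
    'video?o=ht',  # hot
    'video?o=mv',  # Most Viewed
    'video?o=tr',  # Top Rated
]

BASE_URL = 'https://www.example.com/'  # placeholder, not the real site

def build_start_urls(types, base=BASE_URL):
    """Append each category suffix to the base URL to form a start URL."""
    return [base + t for t in types]

urls = build_start_urls(PH_TYPES)
print(urls[0])  # the bare base URL
print(urls[2])  # https://www.example.com/video?o=ht
```

Uncommenting a `video?c=<id>` entry therefore adds that category's listing as a sixth start URL, which is how the custom-category feature in this PR works.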
23 changes: 15 additions & 8 deletions PornHub/PornHub/settings.py → WebHub/WebHub/settings.py
@@ -9,10 +9,10 @@
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'PornHub'
BOT_NAME = 'WebHub'

SPIDER_MODULES = ['PornHub.spiders']
NEWSPIDER_MODULE = 'PornHub.spiders'
SPIDER_MODULES = ['WebHub.spiders']
NEWSPIDER_MODULE = 'WebHub.spiders'

DOWNLOAD_DELAY = 1 # 间隔时间
# LOG_LEVEL = 'INFO' # 日志级别
@@ -27,9 +27,16 @@
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
"PornHub.middlewares.UserAgentMiddleware": 401,
"PornHub.middlewares.CookiesMiddleware": 402,
}
ITEM_PIPELINES = {
"PornHub.pipelines.PornhubMongoDBPipeline": 403,
"WebHub.middlewares.UserAgentMiddleware": 401,
"WebHub.middlewares.CookiesMiddleware": 402,
}
# ITEM_PIPELINES = {
# "PornHub.pipelines.PornhubMongoDBPipeline": 403,
# }

FEED_URI=u'/Users/xiyouMc/Documents/pornhub.csv'
FEED_FORMAT='CSV'

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
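
The DEPTH_PRIORITY and FIFO queue settings added here switch Scrapy from its default LIFO (depth-first) crawl order to FIFO (breadth-first), which suits walking listing pages page by page. The difference in pop order, sketched with plain Python containers:

```python
from collections import deque

discovered = ['page1', 'page2', 'page3']  # order the spider found them

# Scrapy's default LIFO queue: the newest request is crawled first.
lifo = list(discovered)
first_lifo = lifo.pop()

# FifoMemoryQueue / PickleFifoDiskQueue (as configured): oldest first.
fifo = deque(discovered)
first_fifo = fifo.popleft()

print(first_lifo, first_fifo)  # page3 page1
```

Breadth-first order keeps the crawl moving through consecutive listing pages rather than diving deep into the most recently discovered links.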
File renamed without changes.