peterrosetu/magnetW #1

Open: wants to merge 92 commits into base: master

92 commits
384e52d
init code
xiyouMc Apr 8, 2017
c57137f
update md.
xiyouMc Apr 8, 2017
e9c738f
update img.
xiyouMc Apr 9, 2017
cbe27d6
update md
xiyouMc Apr 10, 2017
0e15309
rm log
xiyouMc Apr 10, 2017
e678e0f
Delete nohup.out
xiyouMc Apr 10, 2017
b95ce8b
Delete cataline.log
xiyouMc Apr 10, 2017
6e37bb6
Merge branch 'master' of github.com:xiyouMc/PornHubBot
xiyouMc Apr 10, 2017
683aa6c
Update README.md
xiyouMc Apr 10, 2017
32b32d6
Update README.md
xiyouMc Apr 10, 2017
b8b2141
Update README.md
xiyouMc Apr 10, 2017
ff0f478
update
xiyouMc Apr 10, 2017
ca7f47a
Update README.md
xiyouMc Apr 10, 2017
de52ce0
add weChat
xiyouMc Apr 10, 2017
b6948cf
merge.
xiyouMc Apr 10, 2017
73a9a67
Update README.md
xiyouMc Apr 10, 2017
2b035e6
update md
xiyouMc Apr 10, 2017
09fdf11
update MD
xiyouMc Apr 10, 2017
a396a73
add WeChat QRCODE
xiyouMc Apr 10, 2017
372ed59
add png.
xiyouMc Apr 10, 2017
254cb49
Update docs
xiyouMc Apr 10, 2017
0abdfeb
add pic.
xiyouMc Apr 10, 2017
237afd3
update README
xiyouMc Apr 10, 2017
f1cf173
update readme.
xiyouMc Apr 10, 2017
7a6e067
format
xiyouMc Apr 10, 2017
d63742a
update
xiyouMc Apr 10, 2017
342bbb6
add pic
xiyouMc Apr 10, 2017
ed3fbe0
update
xiyouMc Apr 10, 2017
9855e4a
update readme.
xiyouMc Apr 10, 2017
4731604
update REDEME.
xiyouMc Apr 11, 2017
e8c1712
update
xiyouMc Apr 11, 2017
759712f
rm unuse.
xiyouMc Apr 11, 2017
794532b
update Code
xiyouMc Apr 11, 2017
8e6b833
update code.
xiyouMc Apr 11, 2017
06e2339
update code.
xiyouMc Apr 11, 2017
d9bc5a7
add qq group
xiyouMc Apr 11, 2017
3b1b377
update.
xiyouMc Apr 11, 2017
7ecb0c8
update os
xiyouMc Apr 11, 2017
b5f3238
add png
xiyouMc Apr 11, 2017
0344d9e
update
xiyouMc Apr 11, 2017
34043e0
update
xiyouMc Apr 11, 2017
46d3357
add png
xiyouMc Apr 11, 2017
ae6f9ba
add png
xiyouMc Apr 11, 2017
a586b02
update code
xiyouMc Apr 11, 2017
2b1b1a2
update code
xiyouMc Apr 11, 2017
5b1f88a
update readme.
xiyouMc Apr 11, 2017
ea2ad30
update md.
xiyouMc Apr 11, 2017
2889a18
update.
xiyouMc Apr 11, 2017
e7afca2
Rename README_zn.md to README_zh.md
Apr 13, 2017
8f7d889
Merge pull request #8 from discountry/patch-1
xiyouMc Apr 13, 2017
db32b9c
update readme.
xiyouMc Apr 13, 2017
2ea5dff
Use proper comma instead of 、
frdmn Apr 13, 2017
f083de8
Update README.md
shiroming Apr 13, 2017
4c1a5ee
Merge pull request #10 from shiroming/patch-1
xiyouMc Apr 13, 2017
33c9a07
Merge branch 'master' into patch-2
frdmn Apr 13, 2017
38b1e23
Merge pull request #9 from frdmn/patch-2
xiyouMc Apr 13, 2017
dd4d17a
Fix grammar in the readme
cfrank Apr 14, 2017
6498033
Merge pull request #12 from cfrank/patch-1
xiyouMc Apr 16, 2017
f43e56e
update png.
xiyouMc Apr 18, 2017
06c54be
fix bug: PornHub changed the next-page tag by adding a space.
xiyouMc May 3, 2017
80708ee
Beautify Code.
xiyouMc May 8, 2017
82f4468
pip freeze
xiyouMc May 11, 2017
012c16c
update md
xiyouMc Jun 5, 2017
0e77b45
update readme
xiyouMc Jun 22, 2017
7dfc368
update readme
xiyouMc Jun 22, 2017
27e0594
Update requirements.txt
shiroming Jun 23, 2017
fee4b05
Update README.md
shiroming Jun 23, 2017
1637318
Update README_zh.md
shiroming Jun 23, 2017
00b597e
Merge pull request #22 from shiroming/master
xiyouMc Jun 23, 2017
eb2f6d0
Add QQ group
xiyouMc Jul 24, 2017
1f64390
merge code
xiyouMc Jul 24, 2017
544e3ce
add csv
xiyouMc Aug 23, 2017
db0859e
Prevent inserting duplicate records into DB
fckwall Oct 9, 2017
bff62bb
Merge pull request #29 from fckwall/patch-1
xiyouMc Oct 9, 2017
608fcc6
Update qrcode
JakeWharton Oct 10, 2017
7eab80d
Merge branch 'master' of github.com:xiyouMc/WebHubBot
JakeWharton Oct 10, 2017
611e48c
update readme.
xiyouMc Oct 10, 2017
6b4de70
update readme.
xiyouMc Oct 10, 2017
c2f049e
remove code
xiyouMc Oct 10, 2017
d4e7314
Fix a memory leak issue
fckwall Oct 12, 2017
4ca388a
Merge pull request #30 from fckwall/patch-2
xiyouMc Oct 17, 2017
93e5e8b
parse page.
Feb 4, 2018
7cec0ac
Delete log info.
Feb 9, 2018
8d5d776
Merge pull request #37 from Blavtes/dev
xiyouMc Feb 9, 2018
2c6ed07
Add instruction of specifying categories
fckwall Feb 24, 2018
ed324f0
Merge pull request #38 from fckwall/feature/custom-fetch-types
xiyouMc Feb 25, 2018
dff24a1
update
xiyouMc Mar 12, 2018
3607cb1
merge code
xiyouMc Mar 12, 2018
d28090b
Update README.md
xiyouMc Mar 12, 2018
ca5052d
fix bug
xiyouMc Apr 4, 2018
f1f866b
merge
xiyouMc Apr 4, 2018
dc1295b
Update requirements.txt
xiyouMc Aug 15, 2018
Binary file removed .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1 +1,4 @@
*.pyc
*.DS_Store
*.log
*.out
4 changes: 4 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,4 @@
{
"python.linting.flake8Enabled": false,
"python.linting.pylintEnabled": false
}
Binary file removed PornHub/.DS_Store
Binary file not shown.
9 changes: 0 additions & 9 deletions PornHub/PornHub/pornhub_type.py

This file was deleted.

85 changes: 53 additions & 32 deletions README.md
@@ -1,52 +1,73 @@
关注我的公众号:DeveloperPython 收到实时的项目动态。

![][py2x] [![GitHub forks][forks]][network] [![GitHub stars][stars]][stargazers] [![GitHub license][license]][lic_file]
> Disclaimer: This project is intended for studying the Scrapy spider framework and the MongoDB database. It must not be used for commercial or other personal purposes. Any consequences of improper use are borne by the individual.

> 声明:本项目旨在学习Scrapy爬虫框架和MongoDB数据库,不可使用于商业和个人其他意图。若使用不当,均由个人承担。
* The project crawls the world's largest adult website, retrieving each video's title, duration, mp4 link, cover URL, and direct page URL.
* The crawl targets PornHub.com; the structure is simple and the speed is high.
* It can crawl upwards of 5 million videos per day, depending on your network. My home connection is slow, so my own results are comparatively modest.
* That speed is reached with 10 concurrent request threads. On a better network you can start more threads and crawl more videos per day; see [pre-boot configuration] for the specific settings.

* 项目主要是爬取全球最大成人网站PornHub的视频标题、时长、mp4链接、封面URL和具体的PornHub链接
* 项目爬的是PornHub.com,结构简单,速度飞快
* 爬取PornHub视频的速度可以达到500万/天以上。具体视个人网络情况,因为我是家庭网络,所以相对慢一点。
* 10个线程同时请求,可达到如上速度。若个人网络环境更好,可启动更多线程来请求,具体配置方法见 [启动前配置]

## Environment, Architecture

## 环境、架构
Language: Python 2.7

开发语言: Python2.7
Environment: macOS, 4 GB RAM

开发环境: MacOS系统、4G内存
Database: MongoDB

数据库: MongoDB
* Mainly uses the Scrapy crawler framework.
* A cookie and user agent are drawn at random from the cookie pool and UA pool and attached to the Spider.
* start_requests launches five Requests based on the site's categories, crawling the five categories concurrently.
* Paginated results are also supported; follow-up pages are added to the pending crawl queue.

* 主要使用 scrapy 爬虫框架
* 从Cookie池和UA池中随机抽取一个加入到Spider
* start_requests 根据 PorbHub 的分类,启动了5个Request,同时对五个分类进行爬取。
* 并支持分页爬取数据,并加入到待爬队列。
## Instructions for use

## 使用说明
### Pre-boot configuration

### 启动前配置
* Install MongoDB and start it; no configuration is required
* Install the Python dependencies: Scrapy, pymongo, requests, or run `pip install -r requirements.txt`
* Adjust the configuration as needed, e.g. the request interval and the number of threads

* 安装MongoDB,并启动,不需要配置
* 安装Scrapy
* 安装Python的依赖模块:pymongo、json、requests
* 根据自己需要修改 Scrapy 中关于 间隔时间、启动Requests线程数等得配置
### Start up

### 启动
* cd WebHub
* python quickstart.py

* python PornHub/quickstart.py

## 运行截图
## Run screenshots
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/running.png?raw=true)
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/mongodb.png?raw=true)

## 数据库说明
## Database description

数据库中保存数据的表是 PhRes。以下是字段说明:
The table in the database that holds the data is PhRes. The following is a field description:

#### PhRes 表:

video_title:视频的标题,并作为唯一标识.
link_url:视频调转到PornHub的链接
image_url:视频的封面链接
video_duration:视频的时长,以 s 为单位
quality_480p: 视频480p的 mp4 下载地址
#### PhRes table:

video_title: The title of the video, used as the unique identifier.
link_url: Link to the video's page on the site
image_url: URL of the video's cover image
video_duration: Length of the video, in seconds
quality_480p: Download URL of the 480p mp4

## For Chinese

* 关注微信公众号,学习Python开发

<img src="https://github.com/xiyouMc/WebHubBot/blob/master/img/gongzhonghao.png?raw=true" width = "800" height = "400" alt="图片名称" align=center />



[py2x]: https://img.shields.io/badge/python-2.x-brightgreen.svg
[issues_img]: https://img.shields.io/github/issues/xiyouMc/WebHubBot.svg
[issues]: https://github.com/xiyouMc/WebHubBot/issues

[forks]: https://img.shields.io/github/forks/xiyouMc/WebHubBot.svg
[network]: https://github.com/xiyouMc/WebHubBot/network

[stars]: https://img.shields.io/github/stars/xiyouMc/WebHubBot.svg
[stargazers]: https://github.com/xiyouMc/WebHubBot/stargazers

[license]: https://img.shields.io/badge/license-MIT-blue.svg
[lic_file]: https://raw.githubusercontent.com/xiyouMc/WebHubBot/master/LICENSE
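
The 5-million-per-day claim in the README above can be sanity-checked with back-of-the-envelope arithmetic. The listing-page size and per-thread page rate below are illustrative assumptions, not measured values:

```python
SECONDS_PER_DAY = 86400
THREADS = 10                    # concurrent request threads (from the README)
VIDEOS_PER_PAGE = 32            # assumption: typical listing-page size
PAGES_PER_THREAD_PER_SEC = 0.2  # assumption: one listing page every 5 s

videos_per_day = THREADS * PAGES_PER_THREAD_PER_SEC * VIDEOS_PER_PAGE * SECONDS_PER_DAY
print(int(videos_per_day))  # 5529600, in the ballpark of the claimed 5 million
```

With these assumptions throughput scales linearly in the thread count, which is why the README suggests raising it on faster networks.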
79 changes: 79 additions & 0 deletions README_zh.md
@@ -0,0 +1,79 @@
![][py2x] [![GitHub forks][forks]][network] [![GitHub stars][stars]][stargazers] [![GitHub license][license]][lic_file]
> 免责声明:本项目旨在学习Scrapy爬虫框架和MongoDB数据库,不可使用于商业和个人其他意图。若使用不当,均由个人承担。

<img src="https://github.com/xiyouMc/PornHubBot/blob/master/img/WebHubCode2.png?raw=true" width = "700" height = "400" alt="图片名称" align=center />


## 简介

* 项目主要是爬取全球最大成人网站PornHub的视频标题、时长、mp4链接、封面URL和具体的PornHub链接
* 项目爬的是PornHub.com,结构简单,速度飞快
* 爬取PornHub视频的速度可以达到500万/天以上。具体视个人网络情况,因为我是家庭网络,所以相对慢一点。
* 10个线程同时请求,可达到如上速度。若个人网络环境更好,可启动更多线程来请求,具体配置方法见 [启动前配置]


## 环境、架构

开发语言: Python2.7

开发环境: MacOS系统、4G内存

数据库: MongoDB

* 主要使用 scrapy 爬虫框架
* 从Cookie池和UA池中随机抽取一个加入到Spider
* start_requests 根据 PorbHub 的分类,启动了5个Request,同时对五个分类进行爬取。
* 并支持分页爬取数据,并加入到待爬队列。

## 使用说明

### 启动前配置

* 安装MongoDB,并启动,不需要配置
* 安装Python的依赖模块:Scrapy, pymongo, requests 或 `pip install -r requirements.txt`
* 根据自己需要修改 Scrapy 中关于 间隔时间、启动Requests线程数等得配置

### 启动

* python PornHub/quickstart.py

## 运行截图
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/running.png?raw=true)
![](https://github.com/xiyouMc/PornHubBot/blob/master/img/mongodb.png?raw=true)

## 数据库说明

数据库中保存数据的表是 PhRes。以下是字段说明:

#### PhRes 表:

video_title:视频的标题,并作为唯一标识.
link_url:视频调转到PornHub的链接
image_url:视频的封面链接
video_duration:视频的时长,以 s 为单位
quality_480p: 视频480p的 mp4 下载地址

## 自定义

#### 抓取特定类别的视频

如需抓取特定类别的视频,编辑文件 ./PornHub/PornHub/pornhub_type.py

注释掉/删掉不需要的类别,只保留需要的类别。
该文件中已有“类别=日本”和“类别=亚洲”的示例。
若需要其它的类别,请打开pornhub网站的相应类别,
类别ID就在网址里。


[py2x]: https://img.shields.io/badge/python-2.x-brightgreen.svg
[issues_img]: https://img.shields.io/github/issues/xiyouMc/WebHubBot.svg
[issues]: https://github.com/xiyouMc/WebHubBot/issues

[forks]: https://img.shields.io/github/forks/xiyouMc/WebHubBot.svg
[network]: https://github.com/xiyouMc/WebHubBot/network

[stars]: https://img.shields.io/github/stars/xiyouMc/WebHubBot.svg
[stargazers]: https://github.com/xiyouMc/WebHubBot/stargazers

[license]: https://img.shields.io/badge/license-MIT-blue.svg
[lic_file]: https://raw.githubusercontent.com/xiyouMc/WebHubBot/master/LICENSE
File renamed without changes.
4 changes: 3 additions & 1 deletion PornHub/PornHub/items.py → WebHub/WebHub/items.py
@@ -1,11 +1,13 @@
# -*- coding: utf-8 -*-

from scrapy import Item, Field


class PornVideoItem(Item):
video_title = Field()
image_url = Field()
video_duration = Field()
quality_480p = Field()
video_views = Field()
video_rating = Field()
link_url = Field()
link_url = Field()
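
The PornVideoItem in this diff only declares fields; Scrapy's `Item` then behaves like a dict restricted to those fields. A minimal pure-Python stand-in (no Scrapy dependency, names chosen for illustration) sketches that behavior:

```python
class RestrictedItem(dict):
    """Minimal stand-in for scrapy.Item: only declared fields may be set."""
    FIELDS = ('video_title', 'image_url', 'video_duration',
              'quality_480p', 'video_views', 'video_rating', 'link_url')

    def __setitem__(self, key, value):
        if key not in self.FIELDS:
            raise KeyError('field not declared: %s' % key)
        dict.__setitem__(self, key, value)

item = RestrictedItem()
item['video_title'] = 'demo clip'
item['video_duration'] = 120
try:
    item['unknown_field'] = 1   # not declared, so it is rejected
except KeyError as e:
    print('rejected:', e)
```

This is why adding a column to the pipeline always starts with declaring a new `Field()` on the item.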
19 changes: 11 additions & 8 deletions PornHub/PornHub/middlewares.py → WebHub/WebHub/middlewares.py
@@ -3,25 +3,28 @@
from user_agents import agents
import json


class UserAgentMiddleware(object):
""" 换User-Agent """

def process_request(self, request, spider):
agent = random.choice(agents)
request.headers["User-Agent"] = agent


class CookiesMiddleware(object):
""" 换Cookie """
cookie = {
'platform':'pc',
'ss':'367701188698225489',
'bs':'%s',
'RNLBSERVERID':'ded6699',
'FastPopSessionRequestNumber':'1',
'FPSRN':'1',
'performance_timing':'home',
'RNKEY':'40859743*68067497:1190152786:3363277230:1'
'platform': 'pc',
'ss': '367701188698225489',
'bs': '%s',
'RNLBSERVERID': 'ded6699',
'FastPopSessionRequestNumber': '1',
'FPSRN': '1',
'performance_timing': 'home',
'RNKEY': '40859743*68067497:1190152786:3363277230:1'
}

def process_request(self, request, spider):
bs = ''
for i in range(32):
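
The UserAgentMiddleware in this diff swaps the User-Agent per request by sampling the UA pool. A self-contained sketch of the same idea, with an illustrative two-entry pool (the real project imports a much larger list via `from user_agents import agents`) and a hypothetical FakeRequest standing in for scrapy.Request:

```python
import random

# Illustrative UA pool; the real project uses a much larger list.
AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)",
]

class UserAgentMiddleware(object):
    """Attach a randomly chosen User-Agent header to each request."""
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(AGENTS)

class FakeRequest(object):
    """Minimal stand-in for scrapy.Request, just enough for the demo."""
    def __init__(self):
        self.headers = {}

req = FakeRequest()
UserAgentMiddleware().process_request(req, spider=None)
print(req.headers["User-Agent"] in AGENTS)  # True
```

Rotating the User-Agent (and cookies, via the companion CookiesMiddleware) makes the 10 concurrent threads look less like a single client.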
10 changes: 8 additions & 2 deletions PornHub/PornHub/pipelines.py → WebHub/WebHub/pipelines.py
@@ -6,21 +6,27 @@
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from pymongo import IndexModel, ASCENDING
from items import PornVideoItem


class PornhubMongoDBPipeline(object):
def __init__(self):
client = pymongo.MongoClient("localhost", 27017)
db = client["PornHub"]
self.PhRes = db["PhRes"]
idx = IndexModel([('link_url', ASCENDING)], unique=True)
self.PhRes.create_indexes([idx])
# if your existing DB has duplicate records, refer to:
# https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

def process_item(self, item, spider):
print 'MongoDBItem',item
print 'MongoDBItem', item
""" 判断类型 存入MongoDB """
if isinstance(item, PornVideoItem):
print 'PornVideoItem True'
try:
self.PhRes.insert(dict(item))
self.PhRes.update_one({'link_url': item['link_url']}, {'$set': dict(item)}, upsert=True)
except Exception:
pass
return item
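
The pipeline change above replaces a bare insert with `update_one(..., upsert=True)` keyed on `link_url`, backed by a unique index, so a re-crawled video overwrites its existing record instead of creating a duplicate. The dedup effect can be sketched without a running MongoDB, using a hypothetical in-memory stand-in:

```python
class FakeCollection(object):
    """Tiny in-memory stand-in mimicking update_one(..., upsert=True)."""
    def __init__(self):
        self.docs = {}  # link_url -> document

    def update_one(self, query, update, upsert=False):
        key = query['link_url']
        if key in self.docs:
            self.docs[key].update(update['$set'])   # matched: update in place
        elif upsert:
            self.docs[key] = dict(update['$set'])   # no match: insert

col = FakeCollection()
item = {'link_url': 'https://example.com/v1', 'video_title': 'a'}
col.update_one({'link_url': item['link_url']}, {'$set': item}, upsert=True)

# Re-crawling the same link updates the record instead of duplicating it.
col.update_one({'link_url': item['link_url']},
               {'$set': {'link_url': item['link_url'], 'video_title': 'b'}},
               upsert=True)
print(len(col.docs))  # 1
```

The unique index created via `IndexModel([('link_url', ASCENDING)], unique=True)` enforces the same invariant at the database level.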
13 changes: 13 additions & 0 deletions WebHub/WebHub/pornhub_type.py
@@ -0,0 +1,13 @@
#coding:utf-8
"""归纳PornHub资源链接"""
PH_TYPES = [
'',
'recommended',
'video?o=ht', # hot
'video?o=mv', # Most Viewed
'video?o=tr', # Top Rate

# Examples of certain categories
# 'video?c=1', # Category = Asian
# 'video?c=111', # Category = Japanese
]
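
Each PH_TYPES entry is a URL suffix selecting a listing (an ordering or a category). A sketch of how such suffixes might be expanded into start URLs; the base URL and helper name here are illustrative, not the project's actual code:

```python
PH_TYPES = [
    '',
    'recommended',
    'video?o=ht',  # hot
    'video?o=mv',  # Most Viewed
    'video?o=tr',  # Top Rated
]

BASE_URL = 'https://www.example.com/'  # placeholder, not the real site

def build_start_urls(types, base=BASE_URL):
    """Append each category suffix to the base URL to form a start URL."""
    return [base + t for t in types]

urls = build_start_urls(PH_TYPES)
print(urls[0])  # the bare base URL
print(urls[2])  # https://www.example.com/video?o=ht
```

Uncommenting a `video?c=<id>` entry therefore adds that category's listing as a sixth start URL, which is how the custom-category feature in this PR works.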
23 changes: 15 additions & 8 deletions PornHub/PornHub/settings.py → WebHub/WebHub/settings.py
@@ -9,10 +9,10 @@
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'PornHub'
BOT_NAME = 'WebHub'

SPIDER_MODULES = ['PornHub.spiders']
NEWSPIDER_MODULE = 'PornHub.spiders'
SPIDER_MODULES = ['WebHub.spiders']
NEWSPIDER_MODULE = 'WebHub.spiders'

DOWNLOAD_DELAY = 1 # 间隔时间
# LOG_LEVEL = 'INFO' # 日志级别
@@ -27,9 +27,16 @@
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
"PornHub.middlewares.UserAgentMiddleware": 401,
"PornHub.middlewares.CookiesMiddleware": 402,
}
ITEM_PIPELINES = {
"PornHub.pipelines.PornhubMongoDBPipeline": 403,
"WebHub.middlewares.UserAgentMiddleware": 401,
"WebHub.middlewares.CookiesMiddleware": 402,
}
# ITEM_PIPELINES = {
# "PornHub.pipelines.PornhubMongoDBPipeline": 403,
# }

FEED_URI=u'/Users/xiyouMc/Documents/pornhub.csv'
FEED_FORMAT='CSV'

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
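
The DEPTH_PRIORITY and FIFO queue settings added here switch Scrapy from its default LIFO (depth-first) crawl order to FIFO (breadth-first), which suits walking listing pages page by page. The difference in pop order, sketched with plain Python containers:

```python
from collections import deque

discovered = ['page1', 'page2', 'page3']  # order the spider found them

# Scrapy's default LIFO queue: the newest request is crawled first.
lifo = list(discovered)
first_lifo = lifo.pop()

# FifoMemoryQueue / PickleFifoDiskQueue (as configured): oldest first.
fifo = deque(discovered)
first_fifo = fifo.popleft()

print(first_lifo, first_fifo)  # page3 page1
```

Breadth-first order keeps the crawl moving through consecutive listing pages rather than diving deep into the most recently discovered links.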
File renamed without changes.