Version 0.3.5
If the download is slow, try the mirror hosted in mainland China: mainland China download link.
For an explanation of why the Windows x64 version sometimes fails to collect link addresses, see #128.
The Windows x64 build supports 64-bit Windows 10 and later. The Windows x86 build supports Windows 7 and later, both 32-bit and 64-bit, so users of 64-bit Windows 7 should also download the x86 build. Note that the Chrome browser bundled with the x86 build of EasySpider is fixed at version 109 and will not follow Chrome updates (to stay compatible with Windows 7). To collect data with the latest version of Chrome, run the x64 build on 64-bit Windows 10 or later.
The MacOS version supports all chipsets, including Intel, M1, M2, and other processors, but requires operating system version 11.1 or later. On older versions, download the v0.2.0 Mac release, or download the code and compile it yourself; an example build procedure can be found in this issue. Unzip the .tar.gz file with the system's built-in Archive Utility.
Similarly, the Linux version only supports Ubuntu 20.04 and later, Deepin, Debian, and their derivatives. To collect data on other Linux distributions, download the code and compile it yourself.
Update Notes
- Speed: Collection is now much faster in most scenarios.
- Variables: Anywhere you write JavaScript or system command statements, and in the link pool of an open-webpage operation, `Field["parameter_name"]` stands for the most recently extracted value of that page parameter or the return value of a custom operation. This gives tasks full variable support.
- Loop control: The `exit loop` option of a `custom operation` can be used anywhere inside a loop to leave it immediately, which adds a `Break` function.
- Data extraction: Data inside `<iframe>` tags can be extracted.
- Task control: Tasks can now be paused. Long-press the `p` key on the keyboard to pause and resume execution.
- Scrolling (Windows x64 only for now; other systems will get this in the next version): Added a "keep scrolling down until the page content does not change" option. The exit condition for loops that click the next-page button is now "the next-page button cannot be found" and "no change in page content is detected".
- XPath debugging: The `XPath Helper` extension can also be used during the execution stage to debug XPath, together with the pause feature above.
- Export and storage: Data can be exported to `Excel/TXT` files or written to a `MySQL` database, with data types such as `integer/decimal/date`; click here to view the MySQL writing tutorial.
- Parameter handling: Input parameter values for a task call can be replaced with values read from an Excel file.
- Interface: The browser operation console can be resized by dragging its top-left corner.
- Fields: An extracted field can be set to not be saved (useful when the field is needed only as a variable input).
- Text input: `<enter>` or `<ENTER>` after the text of an input operation represents a hard return, that is, Enter is pressed in the current text box once the text has been typed.
- Device simulation: Tasks can run in a simulated mobile browser.
- Cloudflare handling (Windows x64 only, not stable): Can handle and collect data from websites protected by Cloudflare's captcha; click here to view the video tutorial.
- XPath indexing: Added an XPath hint in which the default index position uses `last()` to count from the end.
- Wait time: The wait after an operation can be randomized to between 50% and 150% of the configured duration.
- Source code included: The package ships with the Python source code so that advanced users can modify and debug task flows.
- Cookies: The advanced options of `open webpage` can read the current page's cookies and modify them.
- Click simulation: Elements are now clicked in a way that simulates a real-world mouse click.
- General settings: New settings include how many records to collect before each local write (default 10) and the preview data length in the control bar (default 15), among others.
- Task files: Task files are now smaller.
- Save name and location: The default save path is `Data/Task_ID`. To save elsewhere, use a relative path such as `../../`. For example, `../../JS` saves a file named `JS` at the same level as the `Data` folder, that is, in the `EasySpider` main folder.
- Flowchart: The flowchart and option settings refresh automatically, with no need to click the `Confirm` button, but the task must still be saved manually.
- Source code: The code has been optimized to make secondary development easier.
- Bug fixes: An error message is now printed when a system command fails; fixed system commands failing to execute on MacOS and Linux; fixed URL format checking, incorrect index values in cumulatively incremented field names, and other bugs.
- Logging: Irrelevant log messages are suppressed for a cleaner execution interface.
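
To make the `Field["parameter_name"]` variable syntax above more concrete, here is a minimal Python sketch of how such placeholders could be expanded in a link or command string. It is not EasySpider's actual implementation: the function name `substitute_fields`, the `latest` dictionary, and the choice to leave unknown fields untouched are all assumptions made for illustration.

```python
import re

# Matches the double-quoted form shown in the release notes: Field["name"].
FIELD_PATTERN = re.compile(r'Field\["([^"]+)"\]')


def substitute_fields(template: str, latest: dict) -> str:
    """Replace each Field["name"] with the most recent value stored for name.

    `latest` is a hypothetical mapping from field name to the last value
    extracted (or returned by a custom operation). Unknown names are left
    as-is; that choice is an assumption, not documented behavior.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in latest:
            return match.group(0)
        return str(latest[name])

    return FIELD_PATTERN.sub(replace, template)


latest_values = {"keyword": "laptop", "page": 3}
url = substitute_fields(
    'https://example.com/search?q=Field["keyword"]&p=Field["page"]',
    latest_values,
)
print(url)  # https://example.com/search?q=laptop&p=3
```

The same idea applies to JavaScript or system command strings: the placeholder is swapped for the latest value before the statement runs.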
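
The relative save path described above can be checked with a short Python sketch. It assumes that the save name is resolved against `Data/Task_ID` inside the `EasySpider` main folder; the function name `resolve_save_path`, the example task ID 12, and the default name `out` are hypothetical.

```python
import posixpath


def resolve_save_path(save_name: str, task_id: int, root: str = "EasySpider") -> str:
    """Resolve a save name relative to the default Data/Task_<id> folder.

    The folder layout follows the release notes; the exact resolution
    rules used by the software are an assumption.
    """
    default_dir = posixpath.join(root, "Data", f"Task_{task_id}")
    # normpath collapses the ".." segments so the final location is visible.
    return posixpath.normpath(posixpath.join(default_dir, save_name))


print(resolve_save_path("out", 12))       # EasySpider/Data/Task_12/out
print(resolve_save_path("../../JS", 12))  # EasySpider/JS
```

Each `..` climbs one level, so two of them leave `Data/Task_12` and land in the main folder, next to `Data`.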
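
The randomized wait of 50%-150% of the configured duration can be sketched as follows. The release notes give only the range, so the uniform distribution and the name `random_wait_seconds` are assumptions.

```python
import random


def random_wait_seconds(base_seconds: float, rng: random.Random = None) -> float:
    """Return a wait between 50% and 150% of base_seconds.

    A uniform draw is assumed; the software may distribute waits differently.
    """
    rng = rng or random.Random()
    return rng.uniform(0.5 * base_seconds, 1.5 * base_seconds)


# With a configured wait of 2 seconds, every draw falls between 1 and 3 seconds.
wait = random_wait_seconds(2.0)
print(1.0 <= wait <= 3.0)  # True
```

Varying the delay this way makes the timing of consecutive operations less regular than a fixed wait.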