I installed and configured AIL, the AIL Splash Manager, Tor, I2P, and Squid, but every crawled domain is reported as "down", even though the domains are actually reachable.
I use the Tor proxy provided by the Tor Project.
As far as I can tell, Tor is working as expected:
curl --socks5-hostname localhost:9050 -s https://check.torproject.org/ | grep -m 1 Congratulations | xargs
# Congratulations. This browser is configured to use Tor.
ail-splash-manager/config/proxies_profiles.cfg
# Tor: torrc default proxy
# use The torproject proxy https://2019.www.torproject.org/docs/debian
# (up to date, solves issues with v3 onion addresses)
# proxy name
[default_tor]
# proxy host
host=172.17.0.1
# proxy port
port=9050
# proxy type
type=SOCKS5
# proxy description
description=tor default proxy
# crawler type (tor or i2p or web)
crawler_type=tor
# SQUID proxy
[web_proxy]
host=172.17.0.1
port=3128
type=HTTP
description=web proxy
crawler_type=web
# I2P proxy
[default_i2p]
host=127.0.0.1
port=4444
type=HTTP
description=i2p default proxy
crawler_type=i2p
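For reference, these profiles are plain INI files, so a quick sanity check of a profile can be done with Python's configparser. This is only an illustrative sketch (the proxy-URL construction below is my own, not AIL's code):

```python
# Parse a proxy profile like the [default_tor] section above and build a
# proxy URL from its fields (illustrative helper, not part of AIL).
import configparser

PROFILE = """
[default_tor]
host=172.17.0.1
port=9050
type=SOCKS5
description=tor default proxy
crawler_type=tor
"""

cfg = configparser.ConfigParser()
cfg.read_string(PROFILE)

tor = cfg["default_tor"]
proxy_url = f"{tor['type'].lower()}://{tor['host']}:{tor['port']}"
print(proxy_url)  # socks5://172.17.0.1:9050
```

If a section or key is mistyped (for example a value glued to a comment, as can happen when editing by hand), configparser raises an error here instead of failing silently inside the crawler.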
ail-splash-manager/config/containers.cfg
# proxy_name: proxy name (defined in proxies_profiles.cfg)
# port: single port or port range (ex: 8050 or 8050-8052)
# cpu: max number of cpu allocated
# memory: RAM (G) allocated
# maxrss: max unbound in-memory cache (Mb, Restart Splash when full)
# description: docker description
[default_splash_tor]
proxy_name=default_tor
port=8050-8052
cpu=1
memory=1
maxrss=2000
description=default splash tor
net=bridge
# Splash with SQUID proxy
[web_splash]
proxy_name=web_proxy
port=8060
cpu=1
memory=1
maxrss=2000
description=web splash (squid)
net=bridge
# Splash with I2P proxy
# section name: splash name
[default_splash_i2p]
proxy_name=default_i2p
port=8053-8055
cpu=1
memory=1
maxrss=2000
description=default splash i2p
net=host
The squid proxy also works.
/etc/squid/conf.d/ail.conf
#
# Squid configuration settings for Ail
#
acl localnet src 172.17.0.0/16 # Docker IP range
http_access allow localnet
Errors
With onion sites (using Tor)
On the Crawler_AIL screen. This happens when I run a web crawler through the web GUI:
------------------START CRAWLER------------------
crawler type: onion
-------------------------------------------------
url: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
domain: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
domain_url: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
Launching Crawler: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
2022-04-19 15:17:43 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-19 15:17:43 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Linux-5.13.0-1021-aws-x86_64-with-glibc2.29
2022-04-19 15:17:43 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_PAGECOUNT': 1,
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 '
'Firefox/78.0'}
2022-04-19 15:17:43 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxxxxxxxxxxxx
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-19 15:17:43 [scrapy.core.engine] INFO: Spider opened
2022-04-19 15:17:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-19 15:17:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-19 15:17:43 [py.warnings] WARNING: /opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/request.py:42: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)
2022-04-19 15:18:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion via http://127.0.0.1:8051/execute> (referer: None)
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-04-19 15:18:10 [scrapy.core.scraper] ERROR: Spider error processing <GET http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion via http://127.0.0.1:8051/execute> (referer: None)
Traceback (most recent call last):
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
yield next(it)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
return next(self.data)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
return next(self.data)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/middleware.py", line 162, in process_spider_output
for el in result:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/bin/torcrawler/TorSplashCrawler.py", line 251, in parse
crawlers.add_domain_root_item(item_id, self.domain_type, self.domains[0], self.date_epoch, self.port)
File "/opt/ail-framework/bin/lib/crawlers.py", line 808, in add_domain_root_item
r_serv_onion.zadd('crawler_history_{}:{}:{}'.format(domain_type, domain, port), epoch_date, root_item)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/redis/commands/core.py", line 3920, in zadd
for pair in mapping.items():
AttributeError: 'int' object has no attribute 'items'
2022-04-19 15:18:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2194,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 335359,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 26.840496,
'finish_reason': 'closespider_pagecount',
'finish_time': datetime.datetime(2022, 4, 19, 15, 18, 10, 312840),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 84561920,
'memusage/startup': 84561920,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/AttributeError': 1,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2022, 4, 19, 15, 17, 43, 472344)}
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Error while scheduling new request
Traceback (most recent call last):
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
result = next(self._iterator)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/engine.py", line 187, in <lambda>
d.addBoth(lambda _: self.slot.nextcall.schedule())
AttributeError: 'NoneType' object has no attribute 'nextcall'
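As a debugging aid: the AttributeError in the traceback is raised inside redis-py's zadd(). Since redis-py 3.0, zadd() expects the member/score pairs as a mapping ({member: score}) in the second argument, while the call in crawlers.py uses the legacy positional style zadd(key, score, member), so the integer epoch lands where a dict is expected. A minimal stand-in (illustrative only, not redis-py's real code) reproduces the failure:

```python
# Stand-in mimicking redis-py >= 3.0 zadd argument handling (illustrative).
def zadd_like(name, mapping, *extra):
    pieces = []
    for member, score in mapping.items():  # an int here raises the AttributeError seen above
        pieces += [score, member]
    return pieces

# Legacy redis-py 2.x call style, as in crawlers.py line 808:
try:
    zadd_like("crawler_history_onion:example.onion:80", 1650381463, "root_item")
except AttributeError as e:
    print(e)  # 'int' object has no attribute 'items'

# redis-py >= 3.0 call style:
print(zadd_like("crawler_history_onion:example.onion:80", {"root_item": 1650381463}))
```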
Crawling regular sites (using squid)
In the Crawler_AIL screen
SPLASH Name: web_splash
Proxy Name: web_proxy
Crawler Type: web
deque(['web_splash', 'regular'])
splash url: 127.0.0.1:8060
------------------START CRAWLER------------------
crawler type: regular
-------------------------------------------------
url: http://google.co.jp
domain: google.co.jp
domain_url: http://google.co.jp
Launching Crawler: http://google.co.jp
2022-04-19 15:09:24 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-19 15:09:24 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Linux-5.13.0-1021-aws-x86_64-with-glibc2.29
2022-04-19 15:09:24 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_PAGECOUNT': 1,
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 '
'Firefox/78.0'}
2022-04-19 15:09:24 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxxxxxxxxxxxx
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-19 15:09:24 [scrapy.core.engine] INFO: Spider opened
2022-04-19 15:09:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-19 15:09:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-19 15:09:24 [py.warnings] WARNING: /opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/request.py:42: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)
2022-04-19 15:09:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://google.co.jp via http://127.0.0.1:8060/execute> (referer: None)
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-04-19 15:09:36 [scrapy.core.scraper] ERROR: Spider error processing <GET http://google.co.jp via http://127.0.0.1:8060/execute> (referer: None)
Traceback (most recent call last):
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
yield next(it)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
return next(self.data)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
return next(self.data)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/middleware.py", line 162, in process_spider_output
for el in result:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "/opt/ail-framework/bin/torcrawler/TorSplashCrawler.py", line 251, in parse
crawlers.add_domain_root_item(item_id, self.domain_type, self.domains[0], self.date_epoch, self.port)
File "/opt/ail-framework/bin/lib/crawlers.py", line 808, in add_domain_root_item
r_serv_onion.zadd('crawler_history_{}:{}:{}'.format(domain_type, domain, port), epoch_date, root_item)
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/redis/commands/core.py", line 3920, in zadd
for pair in mapping.items():
AttributeError: 'int' object has no attribute 'items'
2022-04-19 15:09:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2144,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 3220010,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 11.858315,
'finish_reason': 'closespider_pagecount',
'finish_time': datetime.datetime(2022, 4, 19, 15, 9, 36, 401577),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 84520960,
'memusage/startup': 84520960,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/AttributeError': 1,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2022, 4, 19, 15, 9, 24, 543262)}
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Error while scheduling new request
Traceback (most recent call last):
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
result = next(self._iterator)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/engine.py", line 187, in <lambda>
d.addBoth(lambda _: self.slot.nextcall.schedule())
AttributeError: 'NoneType' object has no attribute 'nextcall'
How can I solve the errors or at least continue the debugging?
Thanks in advance
The problem was a bug in Scrapy introduced after version 2.5.1, which should be solved in version 2.6.2, as mentioned here.
At the time of writing (22 days after that comment), version 2.6.2 is not yet available via pip install scrapy==2.6.2, so until it is released the workaround is to either install directly from the branch or downgrade to version 2.5.1. The downgrade procedure is shown below.
# Don't forget to activate the AIL virtual environment
cd ail-framework/AILENV/bin/ && source activate
pip install scrapy==2.5.1
# Just in case you had the previous one
pip install scrapy-splash==0.8.0
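After downgrading, you can confirm which versions actually ended up in the virtualenv. A small helper sketch (importlib.metadata is stdlib in Python 3.8+):

```python
# Print the installed versions of the packages pinned above.
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

for pkg in ("scrapy", "scrapy-splash"):
    print(pkg, installed_version(pkg))
```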
(Not sure whether this is related to issue #490)
Context
I installed and configured AIL, Ail splash manager, Tor, i2P and squid, but all crawled domains seems to be "down", when they really aren't.
I use the tor proxy provided by The torproject.
As far as I know, Tor is working as expected:
Also the squid proxy works
Configurations can be seen here:
ail-splash-manager/config/proxies_profiles.cfg
ail-splash-manager/config/containers.cfg
/etc/squid/conf.d/ail.conf
Errors
With onion sites (using Tor)
on the
Crawler_AIL
screenThis happens when I run a web crawler trough the web GUI
Crawling regular sites (using squid)
In the
Crawler_AIL
screenHow can I solve the errors or at least continue the debugging?
Thanks in advance
The text was updated successfully, but these errors were encountered: