Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recreate aiohttp session on connectivity check #5332

Merged
merged 1 commit into from
Oct 9, 2024
Merged

Conversation

agners
Copy link
Member

@agners agners commented Oct 8, 2024

Proposed change

It seems to actually get a proper connectivity check run, we need a new vanilla ClientSession() object.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:
  • Link to cli pull request:
  • Link to client library pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

Summary by CodeRabbit

  • New Features
    • Introduced a method for managing web sessions, improving session lifecycle management.
  • Bug Fixes
    • Enhanced the connectivity check mechanism to recreate web sessions upon failure, ensuring more reliable connectivity checks.

@agners agners added the bugfix A bug fix label Oct 8, 2024
@agners agners changed the title Recreate aiohttp session on connectivity check HACK: Recreate aiohttp session on connectivity check Oct 8, 2024
@agners
Copy link
Member Author

agners commented Oct 8, 2024

This really feels wrong, and I don't understand why it is necessary. Without this, connectivity_check always returns with TimeoutError (this is with some additional logging in check_connectivity):

2024-10-08 15:17:05.182 INFO (MainThread) [supervisor.supervisor] check_connectivity...
2024-10-08 15:17:06.515 ERROR (MainThread) [supervisor.supervisor] check_connectivity ClientConnectorError(ConnectionKey(host='checkonline.home-assistant.io', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=8809522527221906347), OSError('Timeout while contacting DNS servers'))
...
2024-10-08 15:17:06.515 WARNING (MainThread) [supervisor.jobs] 'Updater.fetch_data' blocked from execution, no supervisor internet connection
2024-10-08 15:17:06.515 WARNING (MainThread) [supervisor.homeassistant.core] Error on Home Assistant installation. Retrying in 30sec
...
2024-10-08 15:17:36.520 INFO (MainThread) [supervisor.supervisor] check_connectivity...
...
2024-10-08 15:17:47.177 ERROR (MainThread) [supervisor.supervisor] check_connectivity TimeoutError()
...
2024-10-08 15:17:47.178 WARNING (MainThread) [supervisor.jobs] 'Updater.fetch_data' blocked from execution, no supervisor internet connection
2024-10-08 15:17:47.178 WARNING (MainThread) [supervisor.homeassistant.core] Error on Home Assistant installation. Retrying in 30sec
...
I am guessing our event loop is somehow blocked :thinking: 

@bdraco
Copy link
Member

bdraco commented Oct 8, 2024

I'm guessing its a http keep alive reusing the open connection, and the second request ends up timing out because there is nothing on the other end to send an RST

@bdraco
Copy link
Member

bdraco commented Oct 8, 2024

Maybe make a new session thats only used for connectivity check with a Connector that does force_close=True?

@agners
Copy link
Member Author

agners commented Oct 9, 2024

I'm guessing its a http keep alive reusing the open connection, and the second request ends up timing out because there is nothing on the other end to send an RST

There is no http connection setup here, DNS is non-responsive. So this all hangs already on resolve time.

Maybe make a new session thats only used for connectivity check with a Connector that does force_close=True?

Due to the above I don't think this will help. However, I've noticed that a separate client session indeed speeds this up.

2024-10-09 13:44:28.207 INFO (MainThread) [supervisor.supervisor] check_connectivity...
...
2024-10-09 13:44:44.214 ERROR (MainThread) [supervisor.supervisor] check_connectivity ClientConnectorError(ConnectionKey(host='checkonline.home-assistant.io', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), OSError('Could not contact DNS servers'))

@agners
Copy link
Member Author

agners commented Oct 9, 2024

There is definitely some weirdness going on in aiohttp or more likely in aiodns: When I drop packets to my DNS server, I get the following stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/aiohttp/resolver.py", line 105, in resolve
    resp = await self._resolver.getaddrinfo(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
aiodns.error.DNSError: (12, 'Timeout while contacting DNS servers')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1294, in _create_direct_connection
    hosts = await self._resolve_host(host, port, traces=traces)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 945, in _resolve_host
    return await asyncio.shield(resolved_host_task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 982, in _resolve_host_with_throttle
    addrs = await self._resolver.resolve(host, port, family=self._family)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/resolver.py", line 114, in resolve
    raise OSError(msg) from exc
OSError: Timeout while contacting DNS servers

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/test3.py", line 16, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/test3.py", line 12, in main
    html = await fetch(client)
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/test3.py", line 6, in fetch
    async with client.head("https://checkonline.home-assistant.io/online.txt") as resp:
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 1357, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 661, in _request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 556, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1009, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1300, in _create_direct_connection
    raise ClientConnectorError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host checkonline.home-assistant.io:443 ssl:default [None]

So far so good, this is expected.

When I use this small reproducer:

import aiohttp
import asyncio
import datetime
import logging

async def fetch(client):
    timeout = aiohttp.ClientTimeout(total=10)
    async with client.head("https://checkonline.home-assistant.io/online.txt", timeout=timeout) as resp:
        assert resp.status == 200
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as client:
        while True:
            print(datetime.datetime.now())
            try:
                html = await fetch(client)
                print(html)
            except Exception as e:
                logging.exception(e)

logging.basicConfig(
    format='%(asctime)s %(levelname)s %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S'
)
asyncio.run(main())

Returns the following:

docker exec -it hassio_supervisor python /tmp/test4.py
2024-10-09 16:23:42 INFO Start fetch
2024-10-09 16:23:44 ERROR Error is TimeoutError()
2024-10-09 16:23:44 INFO Start fetch
2024-10-09 16:23:46 ERROR Error is TimeoutError()
2024-10-09 16:23:46 INFO Start fetch
2024-10-09 16:23:48 ERROR Error is TimeoutError()
2024-10-09 16:23:48 INFO Start fetch
2024-10-09 16:23:50 ERROR Error is TimeoutError()
2024-10-09 16:23:50 INFO Start fetch
2024-10-09 16:23:51 ERROR Error is ClientConnectorError(ConnectionKey(host='checkonline.home-assistant.io', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), OSError('Timeout while contacting DNS servers'))
2024-10-09 16:23:51 INFO Start fetch
2024-10-09 16:23:53 ERROR Error is TimeoutError()
2024-10-09 16:23:53 INFO Start fetch
2024-10-09 16:23:55 ERROR Error is TimeoutError()
2024-10-09 16:23:55 INFO Start fetch
2024-10-09 16:23:57 ERROR Error is TimeoutError()
2024-10-09 16:23:57 INFO Start fetch
2024-10-09 16:23:59 ERROR Error is TimeoutError()
2024-10-09 16:23:59 INFO Start fetch
2024-10-09 16:24:00 ERROR Error is ClientConnectorError(ConnectionKey(host='checkonline.home-assistant.io', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), OSError('Timeout while contacting DNS servers'))

It is a bit weird that sometimes we get a different exception. This does indicate that the previous request is somehow kept around and reused.

What is weird though that even when I add another DNS server in my resolver (which fixes resolving) this script continuous to return errors 😢

It really seems as if the aiohttp/aiodns instance used keeps reusing cached or queued resolve attempts.

@agners
Copy link
Member Author

agners commented Oct 9, 2024

I've tried other approaches here, e.g. removing the timeout (hoping the DNS resolutions wouldn't queue) and lazy checking internet connectivity (hoping we would less overwhelm aiohttp/aiodns). But I did run into the same problem ultimately: Once the aiohttp session returns TimeoutError(), it will continue doing so even after resolving the issue 😢

It feels rather safer to me to recreate the aiohttp session, at least for now. In a way, it does make sense to recreate the session when Internet connectivity changes, it will make sure we start from a clean slate. But maybe it would be better to create separate session for the connectivity check on every check, and only recreate the global one when going from no connectivity -> connectivity. 🤔

@bdraco
Copy link
Member

bdraco commented Oct 9, 2024 via email

@bdraco
Copy link
Member

bdraco commented Oct 9, 2024 via email

@agners
Copy link
Member Author

agners commented Oct 9, 2024

It might also have to do with aiohttp’s internal dns cache. Might get better results with disabling that on the check session

Hm, at least this change doesn't seem to alter the behavior:

diff --git a/supervisor/supervisor.py b/supervisor/supervisor.py
index 126cf739..f3adb795 100644
--- a/supervisor/supervisor.py
+++ b/supervisor/supervisor.py
@@ -8,12 +8,13 @@ from ipaddress import IPv4Address
 import logging
 from pathlib import Path
 from tempfile import TemporaryDirectory
+from types import MappingProxyType
 
 import aiohttp
 from aiohttp.client_exceptions import ClientError
 from awesomeversion import AwesomeVersion, AwesomeVersionException
 
-from .const import ATTR_SUPERVISOR_INTERNET, SUPERVISOR_VERSION, URL_HASSIO_APPARMOR
+from .const import ATTR_SUPERVISOR_INTERNET, SERVER_SOFTWARE, SUPERVISOR_VERSION, URL_HASSIO_APPARMOR
 from .coresys import CoreSys, CoreSysAttributes
 from .docker.stats import DockerStats
 from .docker.supervisor import DockerSupervisor
@@ -52,6 +53,11 @@ class Supervisor(CoreSysAttributes):
         self.coresys: CoreSys = coresys
         self.instance: DockerSupervisor = DockerSupervisor(coresys)
         self._connectivity: bool = True
+        conn = aiohttp.TCPConnector(use_dns_cache=False)
+        self._session_con_check = aiohttp.ClientSession(connector=conn)
+        self._session_con_check._default_headers = MappingProxyType(
+            {aiohttp.hdrs.USER_AGENT: SERVER_SOFTWARE}
+        )
 
     async def load(self) -> None:
         """Prepare Supervisor object."""
@@ -270,7 +276,7 @@ class Supervisor(CoreSysAttributes):
         """Check the connection."""
         timeout = aiohttp.ClientTimeout(total=10)
         try:
-            await self.sys_websession.head(
+            await self._session_con_check.head(
                 "https://checkonline.home-assistant.io/online.txt", timeout=timeout
             )
         except (ClientError, TimeoutError):

Connection check continues to fail 😢

It seems to actually get a proper connectivity check run, we need a new
vanilla ClientSession() object.
@agners agners changed the title HACK: Recreate aiohttp session on connectivity check Recreate aiohttp session on connectivity check Oct 9, 2024
@agners agners requested a review from mdegat01 October 9, 2024 19:36
@agners agners marked this pull request as ready for review October 9, 2024 19:36
@agners agners merged commit 1504278 into main Oct 9, 2024
19 of 20 checks passed
@agners agners deleted the recreate-client-session branch October 9, 2024 19:41
Copy link

coderabbitai bot commented Oct 9, 2024

📝 Walkthrough

Walkthrough

The changes involve modifications to the CoreSys class in supervisor/coresys.py and the Supervisor class in supervisor/supervisor.py. The CoreSys class now includes a method for managing the aiohttp.ClientSession, improving session management by initializing the _websession attribute to None and ensuring existing sessions are closed before creating new ones. In the Supervisor class, the check_connectivity method has been updated to recreate the web session upon encountering connectivity issues, enhancing the reliability of connection checks.

Changes

File Change Summary
supervisor/coresys.py - Added method create_websession(self) -> None to manage aiohttp.ClientSession.
- Initialized _websession to None.
- Updated run_in_executor to wrap funct in partial only if kwargs are provided.
supervisor/supervisor.py - Modified check_connectivity to recreate the web session on ClientError or TimeoutError.
- Corrected indentation of await self.sys_websession.head(...) to align with the try block.

Sequence Diagram(s)

sequenceDiagram
    participant CoreSys
    participant Supervisor
    participant WebSession

    CoreSys->>WebSession: create_websession()
    WebSession-->>CoreSys: New session created
    Supervisor->>CoreSys: check_connectivity()
    alt Connectivity check fails
        Supervisor->>CoreSys: recreate web session
        CoreSys->>WebSession: create_websession()
        WebSession-->>CoreSys: New session created
    end
    Supervisor-->>CoreSys: Connectivity check result
Loading

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
supervisor/supervisor.py (1)

277-278: LGTM! Good solution for addressing connectivity issues.

The addition of self.coresys.create_websession() before setting self.connectivity = False is a good approach to ensure a fresh session for subsequent connectivity checks. This change aligns with the PR objectives and should help mitigate issues related to stale connections.

Consider adding a log message to indicate when a new websession is created, which could be helpful for debugging:

_LOGGER.debug("Recreating websession due to connectivity check failure")
self.coresys.create_websession()
supervisor/coresys.py (1)

555-557: Simplify headers by removing MappingProxyType

The use of MappingProxyType to wrap the headers dictionary may be unnecessary here. Since aiohttp.ClientSession accepts a standard dictionary for headers, you can directly pass the dictionary without wrapping it.

Apply this diff to simplify the code:

-self._websession: aiohttp.ClientSession = aiohttp.ClientSession(
-    headers=MappingProxyType({aiohttp.hdrs.USER_AGENT: SERVER_SOFTWARE})
-)
+self._websession: aiohttp.ClientSession = aiohttp.ClientSession(
+    headers={aiohttp.hdrs.USER_AGENT: SERVER_SOFTWARE}
+)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 9f3767b and fc3fa63.

📒 Files selected for processing (2)
  • supervisor/coresys.py (3 hunks)
  • supervisor/supervisor.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
supervisor/coresys.py (6)

Pattern */**(html|markdown|md): - For instructional content in documentation, use a direct and authoritative tone. Avoid expressions of politeness such as 'may' or 'please', and ensure the goal of the instruction is fronted.

  • Apply the Microsoft Style Guide to ensure documentation maintains clarity and conciseness.
  • In step-by-step instructions, front the location phrase in the instructional sentence.
  • In step-by-step instructions, front the 'goal' in the instructional sentence.
  • In step-by-step instructions, if in doubt what to front, front the 'goal' before the location phrase in the instructional sentence.
  • do not hyphenate terms like 'top-right' or 'bottom-left' with 'corner'

Pattern */**(html|markdown|md): - Use bold to mark UI strings.

  • If "" are used to mark UI strings, replace them by bold.

Pattern */**(html|markdown|md): - Be brief in your replies and don't add fluff like "thank you for..." and "Please let me know if"


Pattern */**(html|markdown|md): - Use sentence-style capitalization also in headings.


Pattern */**(html|markdown|md): do not comment on HTML used for icons


Pattern */**(html|markdown|md): Avoid flagging inline HTML for embedding videos in future reviews for this repository.

supervisor/supervisor.py (6)

Pattern */**(html|markdown|md): - For instructional content in documentation, use a direct and authoritative tone. Avoid expressions of politeness such as 'may' or 'please', and ensure the goal of the instruction is fronted.

  • Apply the Microsoft Style Guide to ensure documentation maintains clarity and conciseness.
  • In step-by-step instructions, front the location phrase in the instructional sentence.
  • In step-by-step instructions, front the 'goal' in the instructional sentence.
  • In step-by-step instructions, if in doubt what to front, front the 'goal' before the location phrase in the instructional sentence.
  • do not hyphenate terms like 'top-right' or 'bottom-left' with 'corner'

Pattern */**(html|markdown|md): - Use bold to mark UI strings.

  • If "" are used to mark UI strings, replace them by bold.

Pattern */**(html|markdown|md): - Be brief in your replies and don't add fluff like "thank you for..." and "Please let me know if"


Pattern */**(html|markdown|md): - Use sentence-style capitalization also in headings.


Pattern */**(html|markdown|md): do not comment on HTML used for icons


Pattern */**(html|markdown|md): Avoid flagging inline HTML for embedding videos in future reviews for this repository.

🔇 Additional comments (1)
supervisor/supervisor.py (1)

277-278: Verify impact on overall system performance.

While the changes look good and address the immediate issue, it's important to consider the potential impact on the overall system:

  1. Ensure that recreating the websession doesn't introduce any noticeable delay in the connectivity check process.
  2. Verify that other parts of the system that might be using the websession are not adversely affected by this change.
  3. Consider adding integration tests that simulate connectivity issues to ensure the new behavior works as expected in various scenarios.

To help with verification, you can run the following script to check for other uses of sys_websession in the codebase:

This will help identify any other parts of the codebase that might be affected by the websession recreation.

@bdraco
Copy link
Member

bdraco commented Oct 10, 2024

I filed aio-libs/aiohttp#9447

@github-actions github-actions bot locked and limited conversation to collaborators Oct 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants