fix: Handling of site connection issues during outage #3482
rippyboii wants to merge 105 commits into python-discord:main
Conversation
Alerts the moderators through a discord error message if the loading of the Reminders Cog has failed.
…tries from constants.py #13
Adds retry logic with time buff to `Filtering.cog_load()`
Changed cog_load() to retry connecting to the API if it fails initially, with exponential backoff and a limited number of attempts.
Add test cases for retrying cog loads and skeleton for new functions
Implement skeleton functions with code for retrying fetch, alerting mods, and checking if retryable
Add unit test for on_member_update and unit test to check _alert_mods_if_loading_failed is being called
Hello @jb3, I have opened another PR to keep things clean. The earlier PR referenced some commits including reports, which were part of our assignment but not of this project. This new PR doesn't include any commits referencing the reports or anything else unnecessary to this project. All the suggestions from the previous PR are incorporated in this PR too. Apologies for the delayed update from our side.
```python
for attempt in range(1, URLs.connect_max_retries + 1):
    try:
        raw_filter_lists = await self.bot.api_client.get("bot/filter/filter_lists")
        break
    except Exception as error:
        is_retryable = is_retryable_api_error(error)
        is_last_attempt = attempt == URLs.connect_max_retries

        if not is_retryable:
            raise

        if is_last_attempt:
            log.exception("Failed to load filtering data after %d attempts.", URLs.connect_max_retries)
            raise

        backoff_seconds = URLs.connect_initial_backoff * (2 ** (attempt - 1))
        log.warning(
            "Failed to load filtering data (attempt %d/%d). Retrying in %d second(s): %s",
            attempt,
            URLs.connect_max_retries,
            backoff_seconds,
            error,
        )
        await asyncio.sleep(backoff_seconds)
```
I think you repeated this backoff logic many times. Consider moving it to self.bot.api_client.get()
Hi, thank you for your review.
I'm a bit hesitant to put retries directly into api_client.get() because that would affect all GET calls globally, including cases where retries may not be necessary or need different behavior/logging.
If you think this isn't a problem, I can definitely move it into self.bot.api_client.get().
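A middle ground might be a small shared helper that each cog opts into per call, rather than changing api_client.get() globally. A minimal sketch, where retry_with_backoff and the injected is_retryable predicate are illustrative names, not the PR's actual code:

```python
import asyncio
import logging

log = logging.getLogger(__name__)


async def retry_with_backoff(coro_factory, *, is_retryable, max_retries=3, initial_backoff=1):
    """Run coro_factory() and retry transient failures with exponential backoff.

    coro_factory is a zero-argument callable returning a fresh coroutine,
    so each attempt gets its own awaitable.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return await coro_factory()
        except Exception as error:
            # Non-retryable errors and the final attempt propagate unchanged.
            if not is_retryable(error) or attempt == max_retries:
                raise
            backoff = initial_backoff * (2 ** (attempt - 1))
            log.warning("Attempt %d/%d failed, retrying in %ds: %s", attempt, max_retries, backoff, error)
            await asyncio.sleep(backoff)
```

A caller could then write `data = await retry_with_backoff(lambda: self.bot.api_client.get("bot/filter/filter_lists"), is_retryable=is_retryable_api_error)`, keeping plain `api_client.get()` untouched for callers that don't want retries.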
```python
self.seen_items = {}
for mailing_list in response:
    self.seen_items[mailing_list["name"]] = set(mailing_list["seen_items"])

with sentry_sdk.start_span(description="Update site with new mailing lists"):
    for mailing_list in ("pep", *constants.PythonNews.mail_lists):
        if mailing_list not in self.seen_items:
            await self.bot.api_client.post("bot/mailing-lists", json={"name": mailing_list})
            self.seen_items[mailing_list] = set()

self.fetch_new_media.start()
return
```
Hi,
The intent was to make the whole startup sync step atomic and to retry on transient API failures from both the get("bot/mailing-lists") and post("bot/mailing-lists") calls.
The except clause only retries when is_retryable_api_error(error) is true, so local logic errors still raise immediately.
The try also covers the whole startup sequence, so a transient failure in either the get or the post should trigger a clean retry. Right?
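A minimal sketch of the atomicity argument above, with hypothetical api_get/api_post callables standing in for the real API client: because both calls sit inside one try, a retryable failure in either one restarts the whole sync step from scratch.

```python
import asyncio


async def sync_mailing_lists(api_get, api_post, list_names, *, is_retryable, max_retries=3, initial_backoff=1):
    """Retry the whole startup sync atomically (illustrative names, not the PR's code)."""
    for attempt in range(1, max_retries + 1):
        try:
            # Both the GET and every POST are in one try: a transient failure
            # in any of them re-runs the entire step.
            seen = {ml["name"]: set(ml["seen_items"]) for ml in await api_get()}
            for name in list_names:
                if name not in seen:
                    await api_post(name)
                    seen[name] = set()
            return seen
        except Exception as error:
            # Logic errors (non-retryable) and the final attempt raise immediately.
            if not is_retryable(error) or attempt == max_retries:
                raise
            await asyncio.sleep(initial_backoff * (2 ** (attempt - 1)))
```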
bot/utils/startup_reporting.py
Outdated
```python
if bot.get_channel(channel_id) is None:
    # Can't send a message if the channel doesn't exist, so log instead
    log.warning("Failed to send startup failure report: mod_log channel not found.")
```
Should these failures be logged at the error level?
Ah right! Thank you for pointing it out.
I think a missing mod_log channel means the failure report was silently dropped, which is an error, not just a warning. I will update it soon.
I have updated it with proper error logging. Could you check it again?
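A sketch of what the updated helper might look like, with the missing-channel case escalated to an error-level log since the report would otherwise be silently dropped. The function and parameter names here are illustrative, not necessarily the PR's code:

```python
import logging

log = logging.getLogger(__name__)


async def report_startup_failure(bot, channel_id, message):
    """Send a startup failure report to the mod_log channel, or log an error."""
    channel = bot.get_channel(channel_id)
    if channel is None:
        # A missing channel means the failure report is dropped, so this is
        # an error, not just a warning.
        log.error("Failed to send startup failure report: mod_log channel %d not found.", channel_id)
        return False
    await channel.send(message)
    return True
```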
```python
async def _load_one(extension: str) -> None:
    try:
        await self.load_extension(extension)
        log.info(f"Extension successfully loaded: {extension}")

    except BaseException as e:
        self.extension_load_failures[extension] = e
        log.exception(f"Failed to load extension: {extension}")
        raise
```
is there a reason this is defined within _load_extensions instead of it being a method of Bot?
Hi,
I defined _load_one inside _load_extensions because it's only used there. I didn't realize we could move it to a private Bot method.
If you want, I'm happy to move it to a private Bot method (e.g. _load_single_extension). Shall I?
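For illustration, moving the closure onto the class might look like the following sketch (a minimal stand-in Bot, not the real discord.py subclass; _load_single_extension is the hypothetical name from the comment above):

```python
import logging

log = logging.getLogger(__name__)


class Bot:
    """Minimal stand-in showing the loader as a private method instead of a closure."""

    def __init__(self):
        self.extension_load_failures = {}

    async def load_extension(self, extension):
        # Provided by discord.py's commands.Bot in the real class.
        raise NotImplementedError

    async def _load_single_extension(self, extension: str) -> None:
        try:
            await self.load_extension(extension)
            log.info(f"Extension successfully loaded: {extension}")
        except BaseException as e:
            # Record the failure for the aggregated startup report, then re-raise.
            self.extension_load_failures[extension] = e
            log.exception(f"Failed to load extension: {extension}")
            raise
```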
Summary
This PR improves bot startup reliability and moderator visibility when extensions/cogs fail to load, reporting failures to #mod-log.

During setup_hook, extensions are loaded concurrently. When an extension/cog fails due to a transient outage (rate limits, 5xx, timeouts), failures can either abort startup or go unreported to moderators. This change standardizes both resilience (retry when appropriate) and visibility (one clean startup report + targeted alerts).
Changes
1) Startup failure aggregation (single #mod-log alert)

- Added utils/startup_reporting.py to format a standardized startup failure message.
- Updated bot.py to collect extension load failures and send a single aggregated report.

2) Retry + backoff for external/API-dependent cogs
Implemented retry logic with exponential backoff and explicit “retriable vs non-retriable” classification, plus moderator notifications when retries are exhausted.
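The "retriable vs non-retriable" classification could be sketched as follows, assuming a hypothetical ApiStatusError carrying the HTTP status code (the real bot presumably uses its API client's own error type); the status codes match those listed in this description.

```python
# Statuses treated as transient besides any 5xx response.
RETRYABLE_STATUSES = {408, 429}


class ApiStatusError(Exception):
    """Stand-in for the API client's status-code error (hypothetical name)."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable_api_error(error: Exception) -> bool:
    """Return True for transient failures worth retrying with backoff."""
    # Network-level trouble: timeouts and OS errors (connection resets etc.).
    if isinstance(error, (TimeoutError, OSError)):
        return True
    # HTTP-level trouble: 408/429 and any server error.
    status = getattr(error, "status", None)
    return status is not None and (status in RETRYABLE_STATUSES or status >= 500)
```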
Covered cogs include:

- Retries on HTTP 429, HTTP 5xx, TimeoutError, OSError; final failure logs + alerts #mod-alerts.
- Retries up to URLs.connect_max_retries; warnings are logged to Sentry; final failure posts to #mod-log.
- Retries on 408, 429, 5xx, TimeoutError, OSError; on max retries, logs + alerts mod_alerts and re-raises to stop startup.

Tests / Verification
Suggested checks:

- uv run task test
- Verify the #mod-log startup alert.

Moderator Alert in Discord:
Closes #2918