Conversation
Pull request overview
This PR updates the LeetCode spider to reliably fetch the full question list by paginating the GraphQL questionList endpoint, and fixes a bad Referer header construction in the English question detail fetch.
Changes:
- Reworked `get_all_questions_v2` to fetch questions page-by-page until `totalNum` is reached (instead of a single large request).
- Changed the default `limit` value for `get_all_questions_v2` and added progress logging during pagination.
- Fixed the `get_question_detail_en` `Referer` header to use `question_title_slug` (previously referenced an undefined variable).
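The page-by-page fetch works by advancing a `skip` offset until the server-reported total is reached. A minimal sketch of that loop, where `fetch_page` is a stand-in for the GraphQL request (not code from the PR):

```python
from typing import Callable, List, Tuple

def fetch_all_pages(fetch_page: Callable[[int, int], Tuple[int, List]],
                    page_size: int = 100) -> List:
    """Accumulate questions page-by-page until the reported total is reached.

    fetch_page(skip, limit) returns (totalNum, questions) parsed from one
    GraphQL questionList response.
    """
    questions: List = []
    skip, total = 0, None
    while total is None or skip < total:
        total, page = fetch_page(skip, page_size)
        if not page:          # empty page: stop instead of looping forever
            break
        questions.extend(page)
        skip += len(page)
        print(f"fetched {len(questions)}/{total} questions")  # progress log
    return questions
```

The empty-page guard matters: if the server ever reports a `totalNum` larger than what it actually returns, the loop terminates instead of spinning.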
```python
def get_all_questions_v2(self, retry: int = 3, limit: int = 100) -> List:
    headers = {
        "Cookie": self.cookie_en,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0",
        "Content-Type": "application/json",
    }
    form = {
        "query": "\n query problemsetQuestionList($categorySlug: String, $limit: Int, $skip: Int, $filters: QuestionListFilterInput) {\n problemsetQuestionList: questionList(\n categorySlug: $categorySlug\n limit: $limit\n skip: $skip\n filters: $filters\n ) {\n total: totalNum\n questions: data {\n acRate\n difficulty\n freqBar\n frontendQuestionId: questionFrontendId\n isFavor\n paidOnly: isPaidOnly\n status\n title\n titleSlug\n topicTags {\n name\n id\n slug\n }\n hasSolution\n hasVideoSolution\n }\n }\n}\n ",
        "variables": {
            "categorySlug": "all-code-essentials",
            "skip": 0,
            "limit": limit,
            "filters": {"orderBy": "FRONTEND_ID", "sortOrder": "DESCENDING"},
        },
        "operationName": "problemsetQuestionList",
    }
    try:
        resp = requests.post(
            "https://leetcode.com/graphql",
            headers=headers,
            data=json.dumps(form),
            timeout=20,
        )
        return resp.json()["data"]["problemsetQuestionList"]["questions"]
    except Exception as e:
        print("get_all_questions_v2", e)
        time.sleep(2)
        return self.get_all_questions_v2(retry - 1, limit) if retry > 0 else []
    page_size = max(1, min(limit, 100))
    skip = 0
    total = None
```
`limit` is effectively used as the per-request page size (and is clamped to 100 via `page_size = max(1, min(limit, 100))`), so it no longer represents a maximum number of questions to fetch. This can be confusing for callers and changes the method's external behavior; consider renaming the parameter to `page_size` (or adding a separate `max_total`) and documenting the semantics accordingly.
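One way to apply this suggestion, with the parameter renamed and an optional cap. The `max_total` name and the docstring wording are illustrative, not from the PR; the clamp matches the PR's `page_size = max(1, min(limit, 100))`:

```python
from typing import List, Optional

class Spider:
    def get_all_questions_v2(self, retry: int = 3, page_size: int = 100,
                             max_total: Optional[int] = None) -> List:
        """Fetch every question, `page_size` items per GraphQL request.

        page_size is clamped to the 1..100 range, as in the PR;
        max_total, if given, caps how many questions are returned.
        """
        page_size = max(1, min(page_size, 100))
        ...  # pagination loop as in the PR, honouring max_total
```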
| "https://leetcode.com/graphql", | ||
| headers=headers, | ||
| data=json.dumps(form), | ||
| timeout=20, |
This request doesn't pass `verify=False`, while other requests in this file consistently disable TLS verification and suppress urllib3 warnings. If the runtime environment relies on `verify=False` (as suggested by the rest of the file), `get_all_questions_v2` may fail with SSL/cert errors; consider making the TLS verification behavior consistent with the other requests in `Spider` (or centralizing it).
Suggested change:
```diff
-            timeout=20,
+            timeout=20,
+            verify=False,
```
```python
time.sleep(2)

if not ok:
    break
```
On repeated failures for a page, the function breaks out of the loop and returns whatever has been accumulated so far. Since the function is intended to "get all questions", returning a partial list can silently corrupt downstream processing; consider failing the whole call (return `[]` like `get_all_questions` does, or raise an exception) or returning an explicit completion indicator alongside the data.
Suggested change:
```diff
-        break
+        return []
```
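If raising is preferred over returning `[]`, the completeness check can live in one helper called at the end of the loop. The exception class and helper name below are illustrative, not from the PR:

```python
from typing import List, Optional

class QuestionListIncomplete(RuntimeError):
    """Pagination stopped before every page was fetched."""

def check_complete(questions: List, total: Optional[int], ok: bool) -> List:
    """Return the accumulated questions only if pagination finished."""
    if not ok or (total is not None and len(questions) < total):
        raise QuestionListIncomplete(
            f"fetched {len(questions)} of {total} questions")
    return questions
```

This keeps the "all or nothing" contract explicit: callers either get the full list or an exception naming how far the fetch got.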