Skip to content

feat: support jsonpath#126

Merged
liuq19 merged 6 commits intobytedance:masterfrom
yangzhg:oss
Mar 25, 2026
Merged

feat: support jsonpath#126
liuq19 merged 6 commits intobytedance:masterfrom
yangzhg:oss

Conversation

@yangzhg
Copy link
Copy Markdown
Collaborator

@yangzhg yangzhg commented Feb 9, 2026

  • Refactored the GetByJsonPath function by extracting the core logic into a new GetByJsonPathInternal method to avoid redundant DOM parsing.
  • Added GetByJsonPaths, a new function that supports querying multiple JSON paths in a single operation, which improves performance by parsing the JSON document only once.
  • Updated tests:
  • Added ValidBatchOK utility to validate the results of batch processing.
  • Introduced new test cases (WildCardBatch and BadBatch) to ensure correct handling of valid and invalid JSON with multiple paths.
  • Minor formatting improvements and inclusion of the <algorithm> header.
  • support escape emoji and customed float

@yangzhg yangzhg force-pushed the oss branch 3 times, most recently from 3d74eba to 73484a3 Compare February 10, 2026 03:17
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 57.50552% with 385 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.31%. Comparing base (4250a05) to head (61f46c4).
⚠️ Report is 23 commits behind head on master.

Files with missing lines Patch % Lines
include/sonic/internal/arch/simd_skip.h 55.23% 31 Missing and 110 partials ⚠️
include/sonic/dom/parser.h 35.10% 42 Missing and 19 partials ⚠️
include/sonic/jsonpath/jsonpath.h 65.87% 5 Missing and 38 partials ⚠️
include/sonic/jsonpath/dom.h 24.39% 3 Missing and 28 partials ⚠️
include/sonic/jsonpath/ondemand.h 66.66% 5 Missing and 26 partials ⚠️
include/sonic/jsonpath/dump.h 32.00% 0 Missing and 17 partials ⚠️
include/sonic/dom/dynamicnode.h 81.01% 7 Missing and 8 partials ⚠️
include/sonic/dom/serialize.h 50.00% 3 Missing and 11 partials ⚠️
include/sonic/dom/genericnode.h 66.66% 1 Missing and 10 partials ⚠️
include/sonic/dom/schema_handler.h 50.00% 5 Missing and 1 partial ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #126      +/-   ##
==========================================
- Coverage   74.79%   70.31%   -4.48%     
==========================================
  Files          21       27       +6     
  Lines        2436     3534    +1098     
  Branches      667     1115     +448     
==========================================
+ Hits         1822     2485     +663     
+ Misses        297      188     -109     
- Partials      317      861     +544     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yangzhg yangzhg force-pushed the oss branch 13 times, most recently from 6cbbaf1 to 4b598dd Compare March 2, 2026 12:18
@yangzhg yangzhg closed this Mar 3, 2026
@yangzhg yangzhg reopened this Mar 3, 2026
@yangzhg yangzhg force-pushed the oss branch 6 times, most recently from adaeb0e to 8eb0c3c Compare March 3, 2026 08:37
@yangzhg yangzhg requested a review from liuq19 March 12, 2026 07:57
@yangzhg yangzhg requested a review from Copilot March 13, 2026 08:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds JSONPath querying capabilities (DOM + on-demand) and extends parsing/serialization flags to support batch JSONPath queries, emoji escaping, NaN/Infinity serialization, and overflow-number-as-string behavior.

Changes:

  • Introduces JSONPath APIs: DOM-based (GetByJsonPath*) and on-demand (GetByJsonPathOnDemand, JsonTupleWithCodeGen), plus extensive new tests.
  • Refactors parsing/serialization to use strongly-typed ParseFlags / SerializeFlags, adds new behaviors (integer-as-raw, overflow-num-as-numstr, allow unescaped control chars, Java-style float formatting).
  • Enhances quoting/escaping and number formatting (emoji escaping, uppercase/lowercase unicode escapes, Inf/NaN serialization, number-string node type).

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/quote_test.cpp Updates Quote test to call the new templated Quote<...>().
tests/parsenumber_test.cpp Adds coverage for integer-as-raw, unescaped control chars, and overflow-number-as-string (NumStr).
tests/jsonpath_test.cpp Adds a comprehensive JSONPath test suite, including batch queries and edge cases.
tests/json_tuple_test.cpp Adds tests for JsonTupleWithCodeGen behavior and corner cases.
tests/ftoa_test.cpp Updates ftoa test to call templated F64toa<...>().
tests/exp_update_test.cpp Updates UpdateLazy calls to use templated parse flags; adds invalid JSON behaviors.
tests/document_test.cpp Adds tests for serializing Infinity/NaN with kSerializeInfNan and fixes naming in an existing test block.
tests/allocator_test.cpp Adds allocator edge-case tests and adjusts an assertion after free.
include/sonic/sonic.h Exposes JSONPath headers from the main public include.
include/sonic/jsonpath/ondemand.h Adds on-demand JSONPath extraction and json-tuple codegen APIs.
include/sonic/jsonpath/jsonpath.h Adds JSONPath parser/data structures and path padding/unescape support.
include/sonic/jsonpath/dump.h Adds internal serialization helper for JsonPathResult.
include/sonic/jsonpath/dom.h Adds DOM-based JSONPath APIs and batch querying (GetByJsonPaths).
include/sonic/internal/ftoa.h Adds SerializeFlags-driven exponent formatting and Java float formatting behavior.
include/sonic/internal/arch/x86_ifuncs/quote.h Reworks runtime dispatch for Quote / parseStringInplace via function-pointer resolver.
include/sonic/internal/arch/sve2-128/unicode.h Adjusts SIMD unicode scanning API and parse-flag-dependent quote detection.
include/sonic/internal/arch/sve2-128/quote.h Adds parse-flag support for allowing unescaped control chars; updates SVE load/store usage.
include/sonic/internal/arch/sse/unicode.h Adds parse-flag-dependent quote detection helper.
include/sonic/internal/arch/simd_skip.h Extends skip-scanner with JSONPath traversal + json-tuple codegen support.
include/sonic/internal/arch/neon/unicode.h Adds parse-flag-dependent quote detection helper.
include/sonic/internal/arch/neon/quote.h Adds parse-flag support for allowing unescaped control chars.
include/sonic/internal/arch/common/x86_common/skip.inc.h Adds documentation note for SkipContainer implementation inspiration/reference.
include/sonic/internal/arch/common/x86_common/quote.inc.h Templates quoting/escaping on SerializeFlags, adds emoji-escape path, and parse-flag support for control chars.
include/sonic/internal/arch/common/unicode_common.h Adds unescape_with_padding() helper for safe lookahead during path unescaping.
include/sonic/internal/arch/common/skip_common.h Tightens literal skipping by validating separators after true/false/null.
include/sonic/internal/arch/common/quote_tables.h Splits unicode escape tables into lower/upper-case variants.
include/sonic/internal/arch/common/quote_common.h Adds flag-driven escaping behavior including emoji surrogate output and unicode escape case selection.
include/sonic/internal/arch/common/arm_common/quote.h Templates ARM quote path on SerializeFlags and adds emoji escape masking.
include/sonic/internal/arch/avx2/unicode.h Adds parse-flag-dependent quote detection helper.
include/sonic/experiment/lazy_update.h Templates lazy update parsing on ParseFlags.
include/sonic/error.h Adds JSONPath-related error codes and hardens ErrorMsg() indexing.
include/sonic/dom/type.h Adds kNumStr node subtype for “number stored as string”.
include/sonic/dom/serialize.h Templates serialization on SerializeFlags, adds Inf/NaN support, and serializes kNumStr.
include/sonic/dom/schema_handler.h Adds Raw/NumStr hooks for new parse modes; fixes comment typo.
include/sonic/dom/parser.h Makes Parser a ParseFlags-templated type; adds integer-as-raw and overflow-num-as-numstr parsing modes.
include/sonic/dom/handler.h Adds Raw and NumStr hooks to SAX handler.
include/sonic/dom/genericnode.h Adds JSONPath traversal APIs + JsonPathResult, and adds string-number accessors.
include/sonic/dom/generic_document.h Updates parsing APIs to use ParseFlags and Parser<parseFlags>.
include/sonic/dom/flags.h Converts flags to enum class and adds new parse/serialize flags + operators.
include/sonic/dom/dynamicnode.h Adds JSONPath implementation and raw/numstr copy/set support in dynamic nodes.
Comments suppressed due to low confidence (1)

include/sonic/dom/parser.h:241

  • Parser::parseNumber() defines several helper macros (e.g., CHECK_DIGIT, SET_UINT_AND_RETURN, etc.) but they are never #undef'd. Because macros are not scoped, this leaks into the rest of the translation unit and can cause collisions / surprising replacements in other headers (especially since parser.h is widely included). Add #undef for the macros after their last use (e.g., after parseNumberAsString()), or refactor to constexpr/helper functions instead of macros.
#define FLOATING_LONGEST_DIGITS 17

#define RETURN_SET_ERROR_CODE(error_code) \
  do {                                    \
    pos_ = i;                             \
    err_ = error_code;                    \
    return true;                          \
  } while (0)

#define CHECK_DIGIT()                                \
  do {                                               \
    if (sonic_unlikely(s[i] < '0' || s[i] > '9')) {  \
      RETURN_SET_ERROR_CODE(kParseErrorInvalidChar); \
    }                                                \
  } while (0)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds JSONPath support to the Sonic DOM and OnDemand scanners, extends parse/serialize flag capabilities (escape emoji, java-style float formatting, NaN/Infinity support, integer-as-raw, overflow-number-as-string), and updates/expands tests and benchmarks accordingly.

Changes:

  • Introduces JSONPath parsing + DOM (AtJsonPath, GetByJsonPath*) and OnDemand (GetByJsonPathOnDemand, JsonTupleWithCodeGen) query APIs, including a batch DOM query helper (GetByJsonPaths).
  • Refactors parse/serialize flags into enum class bitmasks and updates quoting, float formatting, and parsing behaviors (emoji escaping, Inf/NaN serialization, integer-as-raw, NumStr).
  • Adds extensive unit tests for JSONPath/tuple and new numeric/string-number/allocator edge cases; improves benchmark robustness.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/quote_test.cpp Updates quoting test to call templated Quote<SerializeFlags...>().
tests/parsenumber_test.cpp Adds tests for integer-as-raw, unescaped control chars, and overflow numbers as NumStr; fixes a comment typo and adds -01 invalid case.
tests/node_test.cpp Adds an allocator-counting test to verify owned-buffer frees for Raw/NumStr.
tests/jsonpath_test.cpp Adds comprehensive JSONPath OnDemand + DOM tests, including wildcard, escaping, error-code, and batch cases.
tests/json_tuple_test.cpp Adds tests for JsonTupleWithCodeGen including escaped keys, unicode content, and malformed JSON behavior.
tests/ftoa_test.cpp Updates float-to-ascii test to call templated F64toa<SerializeFlags...>().
tests/exp_update_test.cpp Updates UpdateLazy usage to templated parse flags and adds invalid-json behavior tests.
tests/document_test.cpp Adds serialization tests for Infinity/NaN with kSerializeInfNan; fixes variable spelling (swapped).
tests/allocator_test.cpp Adds edge-case tests for allocators and chunk policy growth; removes an ineffective pointer non-null expectation.
include/sonic/sonic.h Exposes JSONPath headers via the main public include.
include/sonic/jsonpath/ondemand.h Implements JSONPath OnDemand querying and JsonTuple codegen extraction using SkipScanner2.
include/sonic/jsonpath/jsonpath.h Adds JSONPath parser/token model (RFC9535 subset) with padded parsing for safe unescaping.
include/sonic/jsonpath/dump.h Adds internal serialization helper for JsonPath results (filters nulls, serializes single/multi results).
include/sonic/jsonpath/dom.h Adds DOM JSONPath query helpers including internal reuse and batch querying.
include/sonic/internal/ftoa.h Adds serialize-flag-aware formatting (Java float style) and templated F64toa; refines exponent formatting.
include/sonic/internal/arch/x86_ifuncs/quote.h Replaces target-attribute stubs with runtime CPU dispatch (AVX2/SSE/fallback) for quoting and string parsing.
include/sonic/internal/arch/sve2-128/unicode.h Adjusts SVE2 string scanning APIs and adds parse-flag-aware quote-first detection.
include/sonic/internal/arch/sve2-128/quote.h Makes SVE2 parseStringInplace parse-flag-aware (allow unescaped control chars) and updates SVE load/store usage.
include/sonic/internal/arch/sse/unicode.h Makes SSE quote-first detection parse-flag-aware; removes unused UnescapedIndex.
include/sonic/internal/arch/simd_skip.h Adds JSONPath/tuple scanning (SkipScanner2), key matching helpers, and jsonpath traversal/serialization plumbing.
include/sonic/internal/arch/neon/unicode.h Makes NEON quote-first detection parse-flag-aware and refactors Find implementation.
include/sonic/internal/arch/neon/quote.h Makes NEON parseStringInplace parse-flag-aware (allow unescaped control chars) and fixes comment spelling.
include/sonic/internal/arch/common/x86_common/skip.inc.h Adds references/comments for the container-skip implementation inspiration.
include/sonic/internal/arch/common/x86_common/quote.inc.h Adds escape-emoji support into SIMD quoting and templates quote/escape paths on serialize flags.
include/sonic/internal/arch/common/unicode_common.h Adds unescape_with_padding() and includes quote tables for shared escape handling.
include/sonic/internal/arch/common/skip_common.h Tightens literal skipping by validating token separators (prevents accepting truex etc.).
include/sonic/internal/arch/common/quote_tables.h Splits quote tables into lower/upper-case unicode escape variants.
include/sonic/internal/arch/common/quote_common.h Adds scalar fallback quote + parseStringInplace and serialize-flag-aware escaping (emoji + hex case).
include/sonic/internal/arch/common/arm_common/quote.h Templates ARM quoting on serialize flags and adds emoji escaping to the SIMD mask/escape path.
include/sonic/internal/arch/avx2/unicode.h Makes AVX2 quote-first detection parse-flag-aware; removes unused UnescapedIndex.
include/sonic/experiment/lazy_update.h Templates lazy update parsing on ParseFlags and updates call sites accordingly.
include/sonic/error.h Adds JSONPath-related error codes/messages and guards ErrorMsg() against invalid indices.
include/sonic/dom/type.h Adds kNumStr number subtype and an ownership bit (kOwnedStringMask) in type info.
include/sonic/dom/serialize.h Adds NumStr serialization, templated quoting/ftoa usage, append-buffer support, and Inf/NaN serialization option.
include/sonic/dom/schema_handler.h Adds SAX handlers for Raw and NumStr and fixes a comment typo.
include/sonic/dom/parser.h Converts Parser to a ParseFlags-templated class and adds support for Raw integers and overflow numbers as NumStr.
include/sonic/dom/handler.h Adds SAX handler support for NumStr and Raw.
include/sonic/dom/genericnode.h Adds JsonPathResult, AtJsonPath, NumStr APIs, and expands string view access semantics.
include/sonic/dom/generic_document.h Converts document parsing APIs to ParseFlags template usage and fixes minor comment spelling.
include/sonic/dom/flags.h Refactors parse/serialize flags into enum class bitmasks with compatibility aliases and operators.
include/sonic/dom/dynamicnode.h Adds owned-buffer handling for Raw/NumStr, JSONPath traversal implementation, and fixes typos/comments.
benchmark/ondemand.hpp Fixes benchmark validation to avoid calling GetUint64() when value doesn’t exist/is wrong type.
benchmark/main.cpp Makes benchmark data loading robust across runfiles/test environments and hardens filesystem iteration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds JSONPath querying support (DOM + on-demand) and extends parsing/serialization options to better match Spark/Java behaviors (emoji escaping, Inf/NaN, and numeric string/raw handling).

Changes:

  • Introduces JSONPath support: DOM querying (AtJsonPath, GetByJsonPath(s)) and on-demand scanning (GetByJsonPathOnDemand), plus new JSONPath tests.
  • Adds/extends parse & serialize flags (unescaped control chars, integers-as-raw, overflow numbers as NumStr, Java float format, emoji escaping, Inf/NaN).
  • Refactors/strengthens SIMD dispatch and skipping/quoting paths; expands unit tests and improves CI workflow concurrency.

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/quote_test.cpp Updates quoting tests to use new templated Quote<SerializeFlags> API.
tests/parsenumber_test.cpp Adds coverage for integer-as-raw, unescaped control chars, and NumStr parsing behavior.
tests/node_test.cpp Adds allocator-based regression test for freeing owned buffers (Raw/NumStr).
tests/jsonpath_test.cpp New comprehensive JSONPath test suite including batch querying and edge cases.
tests/json_tuple_test.cpp New tests for JsonTupleWithCodeGen behavior, including escaped keys and malformed JSON.
tests/ftoa_test.cpp Updates float-to-string tests to use new templated F64toa<SerializeFlags> API.
tests/exp_update_test.cpp Updates UpdateLazy call style and adds invalid-json behavior tests.
tests/document_test.cpp Adds serialization tests for Inf/NaN with kSerializeInfNan; fixes variable spelling.
tests/allocator_test.cpp Adds allocator edge-case tests and move/deallocate exercise.
include/sonic/sonic.h Exposes new JSONPath headers from the umbrella include.
include/sonic/jsonpath/ondemand.h Adds on-demand JSONPath querying and JsonTuple codegen support.
include/sonic/jsonpath/jsonpath.h Implements JSONPath parser with padded-buffer unescaping.
include/sonic/jsonpath/dump.h Adds internal JSONPath result serialization helper.
include/sonic/jsonpath/dom.h Adds DOM-based GetByJsonPathInternal, GetByJsonPath, and GetByJsonPaths.
include/sonic/internal/ftoa.h Adds parse/serialize flag support and Java float-format behavior.
include/sonic/internal/arch/x86_ifuncs/quote.h Reworks x86 runtime dispatch for quote/string parsing and adds scalar fallback.
include/sonic/internal/arch/sve2-128/unicode.h Adjusts SVE2 unicode scanning interfaces and adds parse-flag-aware quote detection.
include/sonic/internal/arch/sve2-128/quote.h Makes SVE2 string parsing parse-flag aware; typo fix.
include/sonic/internal/arch/sse/unicode.h Makes SSE quote detection parse-flag aware.
include/sonic/internal/arch/simd_skip.h Adds SkipScanner2 + JSONPath scanning utilities and supporting includes/helpers.
include/sonic/internal/arch/neon/unicode.h Makes NEON quote detection parse-flag aware; minor refactor for Find().
include/sonic/internal/arch/neon/quote.h Makes NEON string parsing parse-flag aware; typo fix.
include/sonic/internal/arch/common/x86_common/skip.inc.h Adds references/comments for SkipContainer implementation.
include/sonic/internal/arch/common/x86_common/quote.inc.h Adds emoji-escape support and templated serialization dispatching.
include/sonic/internal/arch/common/unicode_common.h Adds padded unescape helper for safe lookahead.
include/sonic/internal/arch/common/skip_common.h Tightens literal termination checks to ensure separators after true/false/null.
include/sonic/internal/arch/common/quote_tables.h Splits quote tables into lower/upper case variants for unicode escape casing.
include/sonic/internal/arch/common/quote_common.h Adds scalar fallback quote/string parsing and emoji/unicode escaping logic.
include/sonic/internal/arch/common/arm_common/quote.h Adds emoji-escape support and SerializeFlags-aware quoting.
include/sonic/internal/arch/avx2/unicode.h Makes AVX2 quote detection parse-flag aware.
include/sonic/experiment/lazy_update.h Templates lazy update parsing on ParseFlags and wires parser accordingly.
include/sonic/error.h Adds JSONPath-related errors, improves messages, and adds bounds checking.
include/sonic/dom/type.h Adds NumStr type and an owned-string bitmask in type info.
include/sonic/dom/serialize.h Adds Inf/NaN support, NumStr serialization, append-buffer semantics, and templated flags.
include/sonic/dom/schema_handler.h Adds SAX Raw/NumStr support and fixes node construction/comment typo.
include/sonic/dom/parser.h Templates Parser on ParseFlags; adds integer-as-raw and overflow-as-NumStr parsing paths.
include/sonic/dom/handler.h Adds SAX events for Raw/NumStr and ensures nodes are constructed before use.
include/sonic/dom/genericnode.h Adds JSONPath APIs, NumStr support, and raw/string view accessor adjustments.
include/sonic/dom/generic_document.h Updates to Parser<parseFlags> usage and ParseFlags typing.
include/sonic/dom/flags.h Converts flags to enum class, adds operators, and provides deprecated compatibility aliases.
include/sonic/dom/dynamicnode.h Adds Raw/NumStr ownership-aware copying/freeing and JSONPath traversal implementation.
benchmark/ondemand.hpp Fixes on-demand benchmark verification for non-existent values/types.
benchmark/main.cpp Improves testdata discovery (runfiles env) and file reading robustness.
.github/workflows/test_x86.yml Adds workflow concurrency cancellation to reduce redundant runs.
.github/workflows/test_coverage.yml Adds workflow concurrency cancellation to reduce redundant runs.
.github/workflows/test_arm.yml Adds workflow concurrency and simplifies ASAN/SVE build flags.
.github/workflows/clang-format-check.yml Adds workflow concurrency cancellation to reduce redundant runs.
Comments suppressed due to low confidence (5)

include/sonic/jsonpath/dom.h:1

  • GetByJsonPathInternal() uses std::remove_if but this header does not include <algorithm> directly (it currently relies on transitive includes). Please include <algorithm> in this header to avoid brittle build failures when include order changes.
    include/sonic/jsonpath/dom.h:1
  • GetByJsonPathInternal() duplicates the serialization/filtering logic that already exists in include/sonic/jsonpath/dump.h (sonic_json::internal::Serialize). Consider reusing that helper (or refactoring to a single shared implementation) to prevent behavior drift (e.g., empty results returning "" vs "null", string handling, emoji escaping flags) as future changes land.
    include/sonic/jsonpath/ondemand.h:1
  • This path is likely performance-sensitive, but the current design uses std::function + std::shared_ptr + make_shared for generator creation. That introduces heap allocation and type-erasure overhead during JSONPath scans (notably in wildcard handling where new generators are created). Consider switching to a non-owning factory (e.g., templated functor returning a stack-allocated generator, or passing a lightweight callback that constructs into caller-provided storage) to keep the hot loop allocation-free.
    tests/jsonpath_test.cpp:1
  • Typo in comment: 'interges' should be 'integers'.
    tests/jsonpath_test.cpp:1
  • GetByJsonPaths() is a new public API and current batch tests cover (a) all-valid JSON and (b) invalid JSON input, but they don’t cover per-path error cases within a valid JSON document (e.g., one unsupported path like $..a, one not-found path, one valid path in the same batch). Adding a test that asserts the returned vector preserves per-path (value, error) for a mix of (kErrorNone, kNotFoundByJsonPath, kUnsupportedJsonPath) would better lock down batch semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yangzhg yangzhg force-pushed the oss branch 2 times, most recently from 07f9f19 to a932606 Compare March 17, 2026 02:51
  * Add new JSONPath headers (dom/dump/ondemand) and GetByJsonPaths to query multiple paths with a single parse
  * Refactor GetByJsonPath by extracting shared logic into GetByJsonPathInternal to avoid redundant DOM parsing
  * Expand SIMD scanning/skip/quote/unicode support across x86 and Arm (including SVE2-128) for JSONPath workloads
  * Add comprehensive jsonpath and json_tuple tests (including batch and invalid-path coverage)
  * Update build and tooling: Bazel/bzlmod files, CMake scripts, CI workflows, and pre-commit configuration
@liuq19
Copy link
Copy Markdown
Collaborator

liuq19 commented Mar 24, 2026

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06223d3d69

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +910 to +912
} else if (dirty == 1) {
jsonGenerator->writeRawValue(wb.ToStringView());
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve unquoted single-string semantics for wildcard hit

When wildcard evaluation finds exactly one match, this branch writes the temporary wildcard buffer verbatim; however that buffer was populated in QUOTE mode, so a single string match is emitted with quotes. In practice GetByJsonPathOnDemand("[{\"a\":\"x\"},{}]", "$[*].a") returns "x" instead of x, which is inconsistent with non-wildcard single-string behavior and the DOM-based path API.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@liuq19
Copy link
Copy Markdown
Collaborator

liuq19 commented Mar 24, 2026

Review Notes

Went through the full diff and cross-referenced with the Copilot findings. Grouping by priority below.


Bugs / Must fix

1. jsonpath.h parseNumber — overflow check is wrong

uint64_t last = sum;
sum = sum * 10 + (path[index] - '0');
if (last < sum) { return false; }

If sum is near UINT64_MAX / 10 + 1, the multiplication itself wraps before the addition, so the last < sum check is unreliable. Standard fix:

uint64_t digit = path[index] - '0';
if (sum > (UINT64_MAX - digit) / 10) { return false; }
sum = sum * 10 + digit;

2. parser.h parseNumberAsStringUINT_MAX % 10 should be kUint64Max % 10

Both values happen to be 5 on common platforms, so this is correct by coincidence. Still wrong semantically and will break on any platform where unsigned int is not 32-bit. Same issue exists in the original parseNumber.

3. dynamicnode.h copy constructor — null deref on OOM

this->sv.p = (char*)(alloc.Malloc(len + 1));
sonic_assert(this->sv.p != nullptr);  // no-op in release
std::memcpy(const_cast<char*>(this->sv.p), ...);  // UB when nullptr

setRawLikeImpl already has a fallback for this — copy constructor should match.

4. dynamicnode.h setRawLikeImpl — OOM leaves inconsistent state

When Malloc fails, raw.p = "" but setLength(len, typ) keeps the original length. Size() then returns nonzero while the buffer is 1 byte. Serialization will read out of bounds. Should setLength(0, typ) or abort.

5. ondemand.h — temporary shared_ptr lifetime

scan.getJsonPath<...>(path, 1, jsonGeneratorFactory(wb).get(), jsonGeneratorFactory);

The temporary shared_ptr dies at the semicolon. Technically the full-expression rule keeps it alive during the call, but this is fragile and confusing. Use a named local.

6. simd_skip.h peek()/consume() — OOB read on truncated input

No bounds check before SkipSpaceSafe + pos_ -= 1. When pos_ >= len_, this reads past the buffer or underflows. (Copilot flagged this too.)

7. simd_skip.h getJsonPath() — ignores writeRaw() return value

writeRaw returns false on DOM parse failure, but the caller doesn't check it. Failed extractions silently produce kErrorNone with partial output.

8. quote_common.h DoEscape — drops bytes when EscapeEmoji=false

When nc == 0 (no table entry) and EscapeEmoji is false, the byte that triggered DoEscape is consumed but never written to dst. The SIMD mask should not flag >=0xF0 bytes unless EscapeEmoji is true.


Design / correctness concerns

9. schema_handler.h Raw() always returns false — parser treats false as an error, so kParseIntegerAsRaw + ParseSchema() will always fail. (Copilot caught this.)

10. dom.h vs dump.h — inconsistent empty-result behavior — one returns "", the other "null" for the same scenario (all nodes filtered as null). Pick one.

11. dom.h duplicates dump.h::Serialize almost line-for-line — two slightly different copies will drift. Factor out the shared logic. (Copilot also flagged.)

12. dump.h Serialize mutates result in place via erase(remove_if(...)). Callers can't reuse the result afterward. Should take by value or copy internally.

13. dom.h / dump.hwb.ToString() truncates on embedded \0 — use ToStringView() or std::string(data, size). (Copilot flagged.)

14. parser.hParser becoming a class template is a breaking change for anyone who forward-declared it or stored it by type. Worth noting in release notes.

15. flags.hkParseOverflowNumAsNumStr doc says "overflow only" but implementation stores normal floats as kNumStr too — doc or implementation should be fixed. (Copilot flagged.)

16. x86_ifuncs/quote.h__builtin_cpu_supports at static init time without __builtin_cpu_init(), and no MSVC guard. Defer to function-local static. (Copilot flagged.)

17. x86_ifuncs/quote.h — non-inline static member in header = ODR violation across multiple TUs. Use function-local static or inline static. (Copilot flagged.)

18. ondemand.h writeRaw — calls GetStringView() on a potentially non-string document. Parsing "123" produces a number node. Should guard with IsString().

19. quote_common.h writeHex() ignores the uppercase flag — always emits uppercase hex digits even when kSerializeUnicodeEscapeUppercase is not set. (Copilot flagged.)


Cleanup

  • ondemand.h missing #pragma once and <memory> include
  • jsonpath.h typo: parseBracktedIndexparseBracketedIndex
  • jsonpath.h dead code: parseBrackedUnquotedKey, parseWildcard defined but never called
  • type.h wrong comment for kNumStr bit pattern (xx100_011 should be xx011_011)
  • document_test.cpp test name typos: SerialzeInfinity / SerialzeNaN
  • document_test.cpp / jsonpath_test.cpp unused variables that will fail -Werror
  • parser.h commented-out dead code after goto double_string_fract
  • parser.h macros (CHECK_DIGIT etc.) leak from parseNumber into parseNumberAsString — never #undef'd
  • ondemand.h JsonPathRawResult declared but unused; GetByJsonPathOnDemand has unused result variable
  • jsonpath.h to_string() trailing comma: [ROOT($), KEY("a"), ]
  • error.h no static_assert on kErrorMsg array size vs kErrorNums
  • dom.h / dump.h missing self-contained standard headers (<tuple>, <string>, <algorithm>, <vector>)

Test coverage

Codecov reports 59% patch coverage. Key gaps:

  • simd_skip.h 55.76% (138 lines uncovered)
  • parser.h 44.04% (47 lines)
  • dom.h 24.39% (31 lines)

Would be good to add tests for:

  • kSerializeAppendBuffer append vs clear behavior
  • kNumStr values longer than 32 bytes in serialization
  • Single-quoted JSONPath keys with escapes ($['a\n'])
  • UINT64_MAX / UINT64_MAX+1 boundary in parseNumberAsString
  • Error propagation when writeRaw fails inside getJsonPath

@liuq19
Copy link
Copy Markdown
Collaborator

liuq19 commented Mar 25, 2026

Code review

Found 4 issues:

  1. Missing consume('}') / consume(']') in fallthrough skip handler of template getJsonPath (source: Claude fix: remove unused bench #2 — bug detection | confidence: 90)

    include/sonic/internal/arch/simd_skip.h — The fallthrough handler at the end of the template getJsonPath method is reached when a JSON value is a container ({…} or […]) but the current path segment doesn't match (e.g., path expects a key but value is an array). The handler consumes the opening bracket and all inner content, but never consumes the closing bracket.

    This causes parser desynchronization in nested contexts. For example, $[*].x on [[1], {"x": 2}]:

    • Element [1]: path segment x (key) encounters [ (array). Handler consumes [ and 1, but leaves ] unconsumed.
    • The wildcard handler's peek() != ']' sees this wrong ] and exits early.
    • consume(']') eats the inner ] instead of the outer one → kParseErrorInvalidChar.

    Fix: Add RETURN_FALSE_IF_PARSE_ERROR(consume('}')); / consume(']') after the inner while loops in the fallthrough block.

  2. Incorrect doxygen for AtJsonPath — copy-paste from AtPointer (source: Claude feat: support parse and parselazy with sax handler #5 — code comments | confidence: 85)

    include/sonic/dom/genericnode.h — The doxygen comment says @param path json pointer (should be @param jsonpath json path) and @retval nullptr / @retval others which describe a pointer return type, but the actual return is JsonPathResult<NodeType>. This is clearly copy-pasted from AtPointer() above.

  3. Single-quoted name selectors do not handle escape sequences (RFC 9535 non-compliance) (source: Claude feat: support parse and parselazy with sax handler #5 — code comments | confidence: 80)

    include/sonic/jsonpath/jsonpath.hparseQuotedName handles escapes for double-quoted names via unescape_with_padding, but for single-quoted names it simply scans to the closing ' without processing any escapes. RFC 9535 Section 2.3.1 specifies that \' and \\ must be supported in single-quoted name selectors. A path like $['it\'s'] would fail to parse.

  4. RETURN_FALSE_IF_PARSE_ERROR macro defined in header, never #undef'd (source: Claude doc: readme add benchmark commands #1 — API patterns | confidence: 80)

    include/sonic/internal/arch/simd_skip.h — The macro is #define'd at namespace scope and never #undef'd. Since this is a header-only library, the macro leaks to all downstream includers. It should be #undef'd at the end of the header or replaced with an inline helper.


Additional findings below threshold (not blocking)
  • const_cast write-through-StringView in parseQuotedName (confidence: 70) — Current callers always pass a mutable buffer, but the public ParsePadded(StringView, size_t) API could accept a const buffer, making this latent UB.
  • UB: pointer arithmetic on potentially-null children() in atJsonPathImplCommon (confidence: 75) — Empty containers with null children() trigger UB in getChildrenFirstUnsafe() even though the subsequent loop doesn't execute.
  • Inconsistent negative-index behavior (confidence: 70) — DOM supports negative indices per RFC 9535; on-demand rejects them. Reasonable limitation but deserves a comment.
  • Missing Apache 2.0 license headers on jsonpath/dom.h, jsonpath/dump.h, jsonpath/jsonpath.h (confidence: 70) — All other headers have them; jsonpath/ondemand.h has it.
  • Duplicate serialization logic in dom.h and dump.h (confidence: 65) — Nearly identical serialize logic; GetByJsonPathInternal should call internal::Serialize.
  • Overly broad CI permissions in test_coverage.yml (id-token: write, actions: write) (confidence: 70).
  • Bazel cache key references MODULE.bazel.lock which doesn't exist (confidence: 60).

Reviewed by /opus-review (Claude Code — Codex unavailable due to API auth failure)

@liuq19
Copy link
Copy Markdown
Collaborator

liuq19 commented Mar 25, 2026

Code review — Codex addendum

Codex (GPT-5.4) completed its independent review and found 2 additional issues, both verified with reproduction:

  1. DNode::operator== never handles kNumStr — overflow numbers always compare equal (source: Codex | confidence: 95)

    include/sonic/dom/dynamicnode.h:192 — The operator== switch statement has no case kNumStr: branch. Two nodes parsed with kParseOverflowNumAsNumStr fall through to the default: branch, which only checks GetType() == rhs.GetType() — always true for two kNumStr nodes regardless of their string content.

    Reproduced:

    Document a, b;
    a.Parse<ParseFlags::kParseOverflowNumAsNumStr>("18446744073709551616", 20);
    b.Parse<ParseFlags::kParseOverflowNumAsNumStr>("18446744073709551617", 20);
    std::cout << (a == b);  // prints 1 (should print 0)
  2. SchemaHandler::NumStr ignores cur_node_ParseSchema silently drops overflow numbers (source: Codex | confidence: 85)

    include/sonic/dom/schema_handler.h:245 — Unlike Null/Bool/Int/Uint/Double/String/Raw which all update cur_node_, NumStr always pushes a fresh stack node. This means ParseSchema<kParseOverflowNumAsNumStr> silently leaves existing fields unchanged for overflow numbers.

Reviewed by /opus-review (Claude Code + Codex)

@liuq19 liuq19 merged commit c87f5a4 into bytedance:master Mar 25, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants