[SPARK-55979][SQL] Required input attributes are missing from PartialMerge / Final BaseAggregateExec.references #54778

Open
zhztheplayer wants to merge 2 commits into apache:master from zhztheplayer:wip-55979
@zhztheplayer (Member) commented Mar 12, 2026

What changes were proposed in this pull request?

The patch updates BaseAggregateExec.producedAttributes, from

  override def producedAttributes: AttributeSet =
    AttributeSet(aggregateAttributes) ++
    AttributeSet(resultExpressions.diff(groupingExpressions).map(_.toAttribute)) ++
    AttributeSet(aggregateBufferAttributes) ++
    AttributeSet(inputAggBufferAttributes.filterNot(child.output.contains))

to

  override def producedAttributes: AttributeSet =
    AttributeSet(aggregateAttributes) ++
    AttributeSet(resultExpressions.diff(groupingExpressions).map(_.toAttribute)) ++
    AttributeSet(aggregateBufferAttributes) ++
    AttributeSet(inputAggBufferAttributes) -- child.outputSet

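The patch fixes the bug described below by appending -- child.outputSet. Because ++ and -- have the same precedence in Scala and associate to the left, the subtraction applies to the whole union rather than only to inputAggBufferAttributes, so producedAttributes can no longer include any attribute that comes from the aggregate's input.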
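
To make the difference between the two versions concrete, here is a minimal sketch that models AttributeSet with plain Set[String]. It does not use Spark classes. The attribute names, the expression IDs, and the assumption that a child attribute (s_state#1) enters through the result-expression term are all made up for illustration.

```scala
object ProducedAttributesSketch {
  // Plain string sets stand in for Spark's AttributeSet (illustration only).
  type Attrs = Set[String]

  // Hypothetical inputs for an aggregate in Final mode.
  val childOutput: Attrs = Set("s_state#1", "sum#10")
  val aggregateAttrs: Attrs = Set("sum(ss_net_profit)#20")
  // Assumption: a child attribute leaks in through the result-expression term.
  val nonGroupingResultAttrs: Attrs = Set("s_state#1", "_w0#21")
  val bufferAttrs: Attrs = Set("sum#30")
  val inputBufferAttrs: Attrs = Set("sum#10")

  // Old shape: only the last term is filtered against the child's output.
  val oldProduced: Attrs =
    aggregateAttrs ++ nonGroupingResultAttrs ++ bufferAttrs ++
      inputBufferAttrs.filterNot(childOutput.contains)

  // New shape: ++ and -- are left-associative with equal precedence,
  // so the child's output is removed from the whole union.
  val newProduced: Attrs =
    aggregateAttrs ++ nonGroupingResultAttrs ++ bufferAttrs ++
      inputBufferAttrs -- childOutput

  def main(args: Array[String]): Unit = {
    println(oldProduced.contains("s_state#1")) // true: child attribute counted as produced
    println(newProduced.contains("s_state#1")) // false: child attributes are excluded
  }
}
```

In this model the old expression only guards the inputAggBufferAttributes term, so a child attribute that arrives through any other term still ends up in the produced set.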

Why are the changes needed?

The current implementation has a bug that causes an AggregateExec in Final mode to report fewer attributes in references than it actually needs. Our downstream project has a rule that prunes unused inputs based on the references call, so this bug causes some required inputs to be pruned by mistake.
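
The sketch below illustrates how an oversized producedAttributes can lead to over-pruning. It assumes, as I understand QueryPlan, that references is derived as the attributes referenced by a node's expressions minus its producedAttributes. The pruning function is a hypothetical stand-in for the downstream rule, and all attribute names are invented for illustration.

```scala
object ReferencesPruningSketch {
  // Plain string sets stand in for Spark's AttributeSet (illustration only).
  type Attrs = Set[String]

  // Simplified model of how references is derived from producedAttributes.
  def references(expressionRefs: Attrs, produced: Attrs): Attrs =
    expressionRefs -- produced

  // Hypothetical downstream rule: keep only the child columns the parent references.
  def pruneChildOutput(childOutput: Seq[String], refs: Attrs): Seq[String] =
    childOutput.filter(refs.contains)

  // Hypothetical Final-mode aggregate: it reads the grouping key and the input buffer.
  val childOutput: Seq[String] = Seq("s_state#1", "sum#10")
  val expressionRefs: Attrs = Set("s_state#1", "sum#10")

  // Before the fix the grouping key is wrongly counted as produced.
  val producedBefore: Attrs = Set("sum(ss_net_profit)#20", "_w0#21", "s_state#1", "sum#30")
  val producedAfter: Attrs = Set("sum(ss_net_profit)#20", "_w0#21", "sum#30")

  val keptBefore: Seq[String] = pruneChildOutput(childOutput, references(expressionRefs, producedBefore))
  val keptAfter: Seq[String] = pruneChildOutput(childOutput, references(expressionRefs, producedAfter))

  def main(args: Array[String]): Unit = {
    println(keptBefore) // List(sum#10): the grouping key was pruned by mistake
    println(keptAfter)  // List(s_state#1, sum#10)
  }
}
```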

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested by a newly added test case.

Was this patch authored or co-authored using generative AI tooling?

The test case was co-authored by AI and reviewed by the PR author.

@HyukjinKwon (Member)

Mind keeping the PR description template?

@zhztheplayer (Member Author)

@HyukjinKwon updated. I will investigate the CI error tomorrow.

WholeStageCodegen (5)
Sort [s_state,_w0]
-HashAggregate [sum] [sum(UnscaledValue(ss_net_profit)),_w0,s_state,sum]
+HashAggregate [s_state,sum] [sum(UnscaledValue(ss_net_profit)),_w0,sum]
@zhztheplayer (Member Author) commented Mar 13, 2026

For example, in the golden file, the attributes not produced by this operator are now correctly shown in the references section.
