
Add HuggingFace tp_plan support for AutoTP#7901

Open
delock wants to merge 24 commits into deepspeedai:master from delock:gma/autotp_improvement

Conversation

@delock
Collaborator

@delock delock commented Mar 13, 2026

Summary

Adds automatic detection and use of HuggingFace's built-in base_model_tp_plan for AutoTP, addressing the HuggingFace tp_plan support item from #7861.

Models that ship with a tp_plan (e.g. Llama, Qwen, Gemma2) now work with AutoTP out of the box — no preset_model or partition_config needed, just set autotp_size.

Changes

Runtime

  • engine.py: Added tp_plan fallback in _apply_autotp_partitioning. Priority order: partition_config > HF tp_plan > AutoTP heuristics.
  • config.py: Added _get_hf_tp_plan(model) to extract tp_plan from model._tp_plan or model.config.base_model_tp_plan.
  • tp_plan_converter.py: New file. TPPlanConverter converts HF tp_plan entries (colwise/rowwise) to DeepSpeed TPLayerSpec.
    Other HF partition types (colwise_rep, local_colwise, etc.) are not yet supported (documented with TODO).
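
A rough sketch of the colwise/rowwise mapping described above (the function and field names here are illustrative, not the actual `TPPlanConverter`/`TPLayerSpec` API):

```python
# Hypothetical sketch of the colwise/rowwise conversion; the real
# TPPlanConverter in tp_plan_converter.py may differ in shape and naming.

SUPPORTED_STYLES = {"colwise", "rowwise"}

def convert_tp_plan(tp_plan):
    """Map HF tp_plan entries to (pattern, partition_dim) pairs.

    colwise -> shard the output dimension (dim 0 of an nn.Linear weight)
    rowwise -> shard the input dimension (dim 1 of an nn.Linear weight)
    """
    specs = []
    for pattern, style in tp_plan.items():
        if style not in SUPPORTED_STYLES:
            # Extended styles (colwise_rep, local_colwise, ...) unsupported
            raise ValueError(f"Unsupported partition style: {style}")
        dim = 0 if style == "colwise" else 1
        specs.append((pattern, dim))
    return specs

plan = {"layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise"}
print(convert_tp_plan(plan))
# [('layers.*.self_attn.q_proj', 0), ('layers.*.self_attn.o_proj', 1)]
```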

Tests (11 files, 17 CPU + 5 GPU tests)

  • test_tp_plan_converter.py: Unit tests for the converter (alternate prefixes, projection names, unsupported types, etc.)
  • test_tp_plan_extraction.py: Unit tests for _get_hf_tp_plan with mock models.
  • test_tp_plan_e2e.py: GPU e2e tests with ZeRO 0/1/2 (requires 2 GPUs).
  • test_tp_plan_real_models.py: GPU tests with Qwen2 and custom models (requires 2 GPUs).

Documentation

  • Tutorial: New "HuggingFace tp_plan Support" section in autotp-training.md.
  • Config reference: Added tp_plan paragraph in config-json.md.
  • API docs: Added tp_plan subsection in training.rst.
  • Blog: Updated ongoing work in blogs/huggingface-tp/README.md.

Limitations

  • Only colwise and rowwise partition types are supported. Extended types (colwise_rep, local_colwise, local_rowwise,
    local_packed_rowwise, gather, sequence_parallel) are deferred.

@delock
Collaborator Author

delock commented Mar 13, 2026

Hi @inkcherry @tohtana @PKUWZP @tjruwase, this is the PR providing HuggingFace tp_plan support for AutoTP. Hoping to see your comments, thanks!

@delock delock force-pushed the gma/autotp_improvement branch from 9f24ace to 8870c98 on March 13, 2026 08:05
delock and others added 15 commits March 13, 2026 01:05
This PR adds support for HuggingFace's native tensor parallel plan (tp_plan)
to DeepSpeed's AutoTP feature, enabling automatic tensor parallelism configuration
without manual specification.

Key changes:
- Add tp_plan_converter.py: Convert HF tp_plan format to DeepSpeed TPLayerSpec
- Extend tensor_parallel/config.py: Add resolve_tp_config() and _get_hf_tp_plan()
- Support priority: custom config > HF tp_plan > DeepSpeed preset

Test results:
- 28/28 unit tests passed (no GPU required)
- Covers format conversion, extraction, priority, and integration
- E2E tests require multi-GPU environment

Example usage:
  ds_config = {'tensor_parallel': {'autotp_size': 4}}
  # Auto-detects and uses model's tp_plan from HuggingFace config

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
- Delete resolve_tp_config from config.py (dead code, never used at runtime)
- Delete test_tp_plan_priority.py and test_tp_plan_integration.py (tested dead function)
- Move test_alternate_prefixes and test_alternate_projection_names to converter tests
- Replace duplicated _get_hf_tp_plan in extraction tests with proper import
- Remove resolve_tp_config usage from test_tp_plan_real_models.py

Reduces from 6 test files / 34 tests to 4 files / 23 tests with no loss of
runtime coverage.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Apply yapf formatting to new files and fix flake8 F401 (unused pytest import).

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Rename parameter model_or_config to model and remove the third fallback
that checked base_model_tp_plan directly on the input. This path was
unreachable since engine.py always passes a model object.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Document that colwise_rep, local_colwise, local_rowwise,
local_packed_rowwise, gather, and sequence_parallel are not yet handled.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
… and API docs

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f24acece6


delock added 4 commits March 13, 2026 01:13
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Address review feedback: the previous hasattr check returned None (or _tp_plan={}) without falling back to model.config.base_model_tp_plan. Use getattr with a truthiness check so that falsy _tp_plan values correctly fall through to the config-based plan.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
@PKUWZP PKUWZP self-requested a review March 13, 2026 13:58
delock added 2 commits March 15, 2026 19:32
…e_model_tp_plan

Fix two bugs in the HuggingFace tp_plan AutoTP path:

1. _replace_module() passed only the immediate child name to recursive
   calls instead of the accumulated full_name. This meant pattern
   matching in _replace_with_config() never matched patterns like
   'layers.*.self_attn.q_proj' because the name was only 2 levels deep
   (e.g. 'self_attn.q_proj'). Zero modules were being replaced, causing
   a 32% performance regression vs the master AutoTP path.

2. _get_hf_tp_plan() now prefers config.base_model_tp_plan over
   model._tp_plan because HuggingFace's _tp_plan contains duplicate
   entries (both 'layers.*' and 'model.layers.*' prefixed versions),
   causing spurious duplicate-match warnings during conversion.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
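
The full-path accumulation fix in bug 1 can be sketched with a plain nested mapping standing in for the module tree (`fnmatch` stands in for whatever pattern matching the real _replace_with_config uses):

```python
import fnmatch

def match_modules(tree, patterns, prefix=""):
    """Walk a nested name->child mapping, accumulating the full dotted
    path so patterns like 'layers.*.self_attn.q_proj' can match."""
    matched = []
    for name, child in tree.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if any(fnmatch.fnmatch(full_name, p) for p in patterns):
            matched.append(full_name)
        if isinstance(child, dict):
            # The fix: recurse with full_name, not just the child name.
            matched += match_modules(child, patterns, full_name)
    return matched

model = {"layers": {"0": {"self_attn": {"q_proj": None, "o_proj": None}}}}
print(match_modules(model, ["layers.*.self_attn.q_proj"]))
# ['layers.0.self_attn.q_proj'] -- passing only the immediate child
# name (e.g. 'self_attn.q_proj') would match nothing.
```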
Add test_deep_model_full_path_propagation that uses a 4-level-deep model
hierarchy (layers.N.self_attn.{q,o}_proj) with patterns requiring
intermediate path components. This catches regressions where _replace_module
passes immediate child names instead of accumulated full paths to recursive
calls, which causes pattern matching to silently fail on deep models.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
@tohtana
Collaborator

tohtana commented Mar 16, 2026

Hi @delock,
This is an amazing enhancement!

One gap to handle is the mismatch between HuggingFace TP-plan styles and what AutoTP can consume directly. The new converter currently supports only colwise and rowwise, while HuggingFace TP plans use additional styles such as colwise_gather_output and rowwise_split_input.

AutoTP does not handle those HF styles directly, but some models are already supported through existing AutoTP specs/presets. Phi-3 is a good example: its HF TP plan uses colwise_gather_output / rowwise_split_input, but DeepSpeed already supports Phi-3 through the existing AutoTP configuration with sub-parameter partitioning.

So I think the safer behavior would be:

  1. Inspect the HF TP plan styles.
  2. If all styles are supported by the converter, use the HF TP plan.
  3. Otherwise, skip the HF TP plan path and fall back to the existing AutoTP path.

This would avoid regressing models like Phi-3 while still enabling the new HF-plan path for models whose TP plans map cleanly to AutoTP.
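
The three-step behavior suggested above could be sketched as follows (the style set and function name are illustrative, not the actual implementation):

```python
# Sketch of the "all styles supported, else fall back" gate.

CONVERTER_SUPPORTED = {"colwise", "rowwise"}

def choose_tp_path(tp_plan):
    """Use the HF tp_plan only when every style is convertible;
    otherwise fall back to the existing AutoTP path."""
    if tp_plan and all(s in CONVERTER_SUPPORTED for s in tp_plan.values()):
        return "hf_tp_plan"
    return "autotp_fallback"

# A Phi-3-style plan with extended styles falls back:
print(choose_tp_path({"layers.*.mlp.gate_up_proj": "colwise_gather_output"}))
# autotp_fallback
```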

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) March 16, 2026 13:13
@sfc-gh-truwase sfc-gh-truwase disabled auto-merge March 16, 2026 14:06
@delock
Collaborator Author

delock commented Mar 17, 2026

@tohtana Thanks for the comments, this is a good suggestion. Let me add this bypass in the PR and test Phi-3 accordingly.

In the long run, I think it would help if AutoTP can consume these additional styles, which can be done in a separate PR.


delock added 2 commits March 17, 2026 19:28
HF tp_plan may use partition styles beyond colwise/rowwise (e.g.
colwise_rep, rowwise_rep). Instead of raising ValueError, detect
unsupported styles and fall back to the existing AutoTP preset path.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
@delock
Collaborator Author

delock commented Mar 18, 2026

Hi @tohtana, the fallback path for unsupported tp_plan layer specs has been added. Before the change, Phi-3 reported unsupported layer specs; after the change it no longer does.

Note that my test on Phi3-mini raises a separate failure with a shape mismatch, which also occurs on master. I haven't looked into it yet; it can be addressed in a separate investigation and probably a fix.


